Modules

We also offer torch modules for easy integration into your neural network:

natten.NeighborhoodAttention1D

1-D Neighborhood Attention torch module.

Includes QKV and output linear projections.

Parameters:

dim : int (required)
    Number of input latent dimensions (channels).
    Note: This is not head_dim, but rather head_dim * num_heads.

num_heads : int (required)
    Number of attention heads.

kernel_size : Tuple[int] | int (required)
    Neighborhood window (kernel) size.
    Note: kernel_size must be smaller than or equal to seqlen.

stride : Tuple[int] | int (default: 1)
    Sliding window step size. Defaults to 1 (standard sliding window).
    Note: stride must be smaller than or equal to kernel_size. When stride == kernel_size, there is no overlap between sliding windows, which is equivalent to blocked attention (a.k.a. window self attention).

dilation : Tuple[int] | int (default: 1)
    Dilation step size. Defaults to 1 (standard sliding window).
    Note: The product of dilation and kernel_size must be smaller than or equal to seqlen.

is_causal : Tuple[bool] | bool (default: False)
    Toggle causal masking. Defaults to False (bi-directional).

qkv_bias : bool (default: True)
    Enable bias in the QKV linear projection.

qk_scale : Optional[float] (default: None)
    Attention scale. Defaults to head_dim ** -0.5.

proj_drop : float (default: 0.0)
    Dropout rate for the projection layer. Defaults to 0.0 (no dropout).
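
Example (a minimal usage sketch; the channels-last input layout (batch, seqlen, dim) and the single-tensor forward call are assumptions, not stated in this section):

```python
import torch
from natten import NeighborhoodAttention1D

# dim=128 across num_heads=4 gives head_dim=32; each token attends to a
# 7-token neighborhood.
na1d = NeighborhoodAttention1D(dim=128, num_heads=4, kernel_size=7)

x = torch.randn(2, 64, 128)  # assumed layout: (batch, seqlen, dim); kernel_size <= seqlen
out = na1d(x)                # output shape matches input: (2, 64, 128)
```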

natten.NeighborhoodAttention2D

2-D Neighborhood Attention torch module.

Includes QKV and output linear projections.

Parameters:

dim : int (required)
    Number of input latent dimensions (channels).
    Note: This is not head_dim, but rather head_dim * num_heads.

num_heads : int (required)
    Number of attention heads.

kernel_size : Tuple[int, int] | int (required)
    Neighborhood window (kernel) size/shape. If an integer, it is repeated for both dimensions; for example, kernel_size=3 is reinterpreted as kernel_size=(3, 3).
    Note: kernel_size must be smaller than or equal to the token layout shape (X, Y) along every dimension.

stride : Tuple[int, int] | int (default: 1)
    Sliding window step size/shape. Defaults to 1 (standard sliding window). If an integer, it is repeated for both dimensions; for example, stride=2 is reinterpreted as stride=(2, 2).
    Note: stride must be smaller than or equal to kernel_size along every dimension. When stride == kernel_size, there is no overlap between sliding windows, which is equivalent to blocked attention (a.k.a. window self attention).

dilation : Tuple[int, int] | int (default: 1)
    Dilation step size/shape. Defaults to 1 (standard sliding window). If an integer, it is repeated for both dimensions; for example, dilation=4 is reinterpreted as dilation=(4, 4).
    Note: The product of dilation and kernel_size must be smaller than or equal to the token layout shape (X, Y) along every dimension.

is_causal : Tuple[bool, bool] | bool (default: False)
    Toggle causal masking. Defaults to False (bi-directional). If a boolean, it is repeated for both dimensions; for example, is_causal=True is reinterpreted as is_causal=(True, True).

qkv_bias : bool (default: True)
    Enable bias in the QKV linear projection.

qk_scale : Optional[float] (default: None)
    Attention scale. Defaults to head_dim ** -0.5.

proj_drop : float (default: 0.0)
    Dropout rate for the projection layer. Defaults to 0.0 (no dropout).
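
Example (a minimal usage sketch; the channels-last input layout (batch, X, Y, dim) and the single-tensor forward call are assumptions, not stated in this section):

```python
import torch
from natten import NeighborhoodAttention2D

# Integer arguments broadcast to both dimensions: kernel_size=7 -> (7, 7),
# dilation=2 -> (2, 2). Here dilation * kernel_size = 14 <= 32 per dimension.
na2d = NeighborhoodAttention2D(dim=128, num_heads=4, kernel_size=7, dilation=2)

x = torch.randn(2, 32, 32, 128)  # assumed layout: (batch, X, Y, dim)
out = na2d(x)                    # output shape matches input: (2, 32, 32, 128)
```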

natten.NeighborhoodAttention3D

3-D Neighborhood Attention torch module.

Includes QKV and output linear projections.

Parameters:

dim : int (required)
    Number of input latent dimensions (channels).
    Note: This is not head_dim, but rather head_dim * num_heads.

num_heads : int (required)
    Number of attention heads.

kernel_size : Tuple[int, int, int] | int (required)
    Neighborhood window (kernel) size/shape. If an integer, it is repeated for all 3 dimensions; for example, kernel_size=3 is reinterpreted as kernel_size=(3, 3, 3).
    Note: kernel_size must be smaller than or equal to the token layout shape (X, Y, Z) along every dimension.

stride : Tuple[int, int, int] | int (default: 1)
    Sliding window step size/shape. Defaults to 1 (standard sliding window). If an integer, it is repeated for all 3 dimensions; for example, stride=2 is reinterpreted as stride=(2, 2, 2).
    Note: stride must be smaller than or equal to kernel_size along every dimension. When stride == kernel_size, there is no overlap between sliding windows, which is equivalent to blocked attention (a.k.a. window self attention).

dilation : Tuple[int, int, int] | int (default: 1)
    Dilation step size/shape. Defaults to 1 (standard sliding window). If an integer, it is repeated for all 3 dimensions; for example, dilation=4 is reinterpreted as dilation=(4, 4, 4).
    Note: The product of dilation and kernel_size must be smaller than or equal to the token layout shape (X, Y, Z) along every dimension.

is_causal : Tuple[bool, bool, bool] | bool (default: False)
    Toggle causal masking. Defaults to False (bi-directional). If a boolean, it is repeated for all 3 dimensions; for example, is_causal=True is reinterpreted as is_causal=(True, True, True).

qkv_bias : bool (default: True)
    Enable bias in the QKV linear projection.

qk_scale : Optional[float] (default: None)
    Attention scale. Defaults to head_dim ** -0.5.

proj_drop : float (default: 0.0)
    Dropout rate for the projection layer. Defaults to 0.0 (no dropout).
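
Example (a minimal usage sketch; the channels-last input layout (batch, X, Y, Z, dim) and the single-tensor forward call are assumptions, not stated in this section):

```python
import torch
from natten import NeighborhoodAttention3D

# Per-axis settings: a 3x5x5 window, causal along the first axis only
# (e.g. time in a video), bi-directional along the other two.
na3d = NeighborhoodAttention3D(
    dim=192,
    num_heads=6,
    kernel_size=(3, 5, 5),
    is_causal=(True, False, False),
)

x = torch.randn(1, 8, 16, 16, 192)  # assumed layout: (batch, X, Y, Z, dim); kernel_size <= (8, 16, 16)
out = na3d(x)                       # output shape matches input
```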