Operations
This page lists our PyTorch Autograd-compatible operations. These operations come with performance knobs (configurations), some of which are specific to certain backends.
Changing those knobs is completely optional, and NATTEN will remain functionally correct in all cases. However, to squeeze out the maximum achievable performance, we highly recommend reading about backends, or simply using our profiler toolkit and its dry run feature to navigate the available backends and their valid configurations for your specific use case and GPU architecture. You can also use the profiler's optimize feature to search for the best configuration.
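As a minimal sketch of what this looks like in practice (shapes, device, and dtype are illustrative and assume a CUDA GPU with half precision; the backend name is an assumption drawn from the backends page), a call with default knobs and a call with explicit overrides compute the same thing:

import torch
from natten import na2d

# Heads-last layout for 2-D neighborhood attention: [batch, X, Y, heads, head_dim].
q = torch.randn(1, 16, 16, 4, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Default knobs: NATTEN picks the backend and configuration.
out = na2d(q, k, v, kernel_size=(7, 7))

# Same computation, with the backend pinned explicitly (illustrative name; see backends).
# Tile shapes and other knobs can be pinned the same way; use the profiler to find
# valid choices for your hardware.
out = na2d(q, k, v, kernel_size=(7, 7), backend="cutlass-fna")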
Neighborhood Attention
natten.na1d
na1d(
query,
key,
value,
kernel_size,
stride=1,
dilation=1,
is_causal=False,
scale=None,
additional_keys=None,
additional_values=None,
attention_kwargs=None,
backend=None,
q_tile_shape=None,
kv_tile_shape=None,
backward_q_tile_shape=None,
backward_kv_tile_shape=None,
backward_kv_splits=None,
backward_use_pt_reduction=False,
run_persistent_kernel=True,
kernel_schedule=None,
torch_compile=False,
)
Computes 1-D neighborhood attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Tensor` | 4-D query tensor, with the heads last layout (`[batch, seqlen, heads, head_dim]`). | required |
| `key` | `Tensor` | 4-D key tensor, with the heads last layout (`[batch, seqlen, heads, head_dim]`). | required |
| `value` | `Tensor` | 4-D value tensor, with the heads last layout (`[batch, seqlen, heads, head_dim]`). | required |
| `kernel_size` | `Tuple[int] \| int` | Neighborhood window (kernel) size. Note: `kernel_size` must be smaller than or equal to the number of tokens (`seqlen`). | required |
| `stride` | `Tuple[int] \| int` | Sliding window step size. Defaults to `1` (standard sliding window). Note: `stride` must be smaller than or equal to `kernel_size`. | `1` |
| `dilation` | `Tuple[int] \| int` | Dilation step size. Defaults to `1` (no dilation). Note: the product of `dilation` and `kernel_size` must be smaller than or equal to the number of tokens (`seqlen`). | `1` |
| `is_causal` | `Tuple[bool] \| bool` | Toggle causal masking. Defaults to `False`. | `False` |
| `scale` | `float` | Attention scale. Defaults to `head_dim ** -0.5`. | `None` |
| `additional_keys` | `Optional[Tensor]` | Optional key tensor holding additional (cross-attention) context that every query token attends to, alongside its own neighborhood (see `natten.attention`). Defaults to `None`. | `None` |
| `additional_values` | `Optional[Tensor]` | Optional value tensor corresponding to `additional_keys`. Note: `additional_keys` and `additional_values` must either both be `None`, or both be tensors. | `None` |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| `backend` | `str` | Backend implementation to run with. Choices are the neighborhood attention backends listed in backends, subject to availability on your GPU architecture. |
| `q_tile_shape` | `Tuple[int]` | 1-D tile shape for the query token layout in the forward pass kernel. You can use the profiler to find valid choices for your use case, and search for the best combination. |
| `kv_tile_shape` | `Tuple[int]` | 1-D tile shape for the key/value token layout in the forward pass kernel. You can use the profiler to find valid choices for your use case, and search for the best combination. |
| `backward_q_tile_shape` | `Tuple[int]` | 1-D tile shape for the query token layout in the backward pass kernel. Ignored by backends that do not expose this knob (see backends). |
| `backward_kv_tile_shape` | `Tuple[int]` | 1-D tile shape for the key/value token layout in the backward pass kernel. Ignored by backends that do not expose this knob (see backends). |
| `backward_kv_splits` | `Tuple[int]` | Number of key/value tiles allowed to work in parallel in the backward pass kernel. Like tile shapes, this is a tuple and not an integer for neighborhood attention operations, and the size of the tuple corresponds to the number of dimensions / rank of the token layout. Only respected by backends that support KV parallelism (see backends). |
| `backward_use_pt_reduction` | `bool` | Whether to use PyTorch eager ops instead of the native kernel for the reduction step in the backward pass. Defaults to `False`. |
| `run_persistent_kernel` | `bool` | Whether to use persistent tile scheduling in the forward pass kernel. Only applies to backends with persistent kernels (see backends). |
| `kernel_schedule` | `Optional[str]` | Kernel type (Hopper architecture only). See backends for available choices. |
| `torch_compile` | `bool` | Applies only to the Flex Attention backend; whether to allow it to use `torch.compile`. |
| `attention_kwargs` | `Optional[Dict]` | Arguments to the `attention` operator, when it is used to implement neighborhood cross-attention, or self attention as a fast path for neighborhood attention. If, for a given use case, the neighborhood attention problem is equivalent to self attention (not causal, and `kernel_size` equal to the sequence length), NATTEN dispatches directly to `attention`. You can override arguments to `attention` by passing a dictionary here. |
Returns:
| Name | Type | Description |
|---|---|---|
| `output` | `Tensor` | 4-D output tensor, with the heads last layout (`[batch, seqlen, heads, head_dim]`). |
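A minimal usage sketch (shapes, device, and dtype are illustrative assumptions, and assume a CUDA GPU with half precision support):

import torch
from natten import na1d

# Heads-last layout: [batch, seqlen, heads, head_dim].
q = torch.randn(2, 1024, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Sliding window of 129 tokens, with stride and dilation.
# Constraints: stride <= kernel_size, and dilation * kernel_size <= seqlen.
out = na1d(q, k, v, kernel_size=129, stride=2, dilation=4)

# Causal 1-D neighborhood attention.
out_causal = na1d(q, k, v, kernel_size=129, is_causal=True)
assert out.shape == q.shape == out_causal.shape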
natten.na2d
na2d(
query,
key,
value,
kernel_size,
stride=1,
dilation=1,
is_causal=False,
scale=None,
additional_keys=None,
additional_values=None,
attention_kwargs=None,
backend=None,
q_tile_shape=None,
kv_tile_shape=None,
backward_q_tile_shape=None,
backward_kv_tile_shape=None,
backward_kv_splits=None,
backward_use_pt_reduction=False,
run_persistent_kernel=True,
kernel_schedule=None,
torch_compile=False,
)
Computes 2-D neighborhood attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Tensor` | 5-D query tensor, with the heads last layout (`[batch, X, Y, heads, head_dim]`). | required |
| `key` | `Tensor` | 5-D key tensor, with the heads last layout (`[batch, X, Y, heads, head_dim]`). | required |
| `value` | `Tensor` | 5-D value tensor, with the heads last layout (`[batch, X, Y, heads, head_dim]`). | required |
| `kernel_size` | `Tuple[int, int] \| int` | Neighborhood window (kernel) size/shape. If an integer, it will be repeated for all 2 dimensions (for example, `kernel_size=3` is equivalent to `kernel_size=(3, 3)`). Note: `kernel_size` must be smaller than or equal to the number of tokens along each dimension. | required |
| `stride` | `Tuple[int, int] \| int` | Sliding window step size/shape. Defaults to `1` (standard sliding window). Note: `stride` must be smaller than or equal to `kernel_size` along each dimension. | `1` |
| `dilation` | `Tuple[int, int] \| int` | Dilation step size/shape. Defaults to `1` (no dilation). Note: the product of `dilation` and `kernel_size` must be smaller than or equal to the number of tokens along each dimension. | `1` |
| `is_causal` | `Tuple[bool, bool] \| bool` | Toggle causal masking. Defaults to `False`. | `False` |
| `scale` | `float` | Attention scale. Defaults to `head_dim ** -0.5`. | `None` |
| `additional_keys` | `Optional[Tensor]` | Optional key tensor holding additional (cross-attention) context that every query token attends to, alongside its own neighborhood (see `natten.attention`). Defaults to `None`. | `None` |
| `additional_values` | `Optional[Tensor]` | Optional value tensor corresponding to `additional_keys`. Note: `additional_keys` and `additional_values` must either both be `None`, or both be tensors. | `None` |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| `backend` | `str` | Backend implementation to run with. Choices are the neighborhood attention backends listed in backends, subject to availability on your GPU architecture. |
| `q_tile_shape` | `Tuple[int, int]` | 2-D tile shape for the query token layout in the forward pass kernel. You can use the profiler to find valid choices for your use case, and search for the best combination. |
| `kv_tile_shape` | `Tuple[int, int]` | 2-D tile shape for the key/value token layout in the forward pass kernel. You can use the profiler to find valid choices for your use case, and search for the best combination. |
| `backward_q_tile_shape` | `Tuple[int, int]` | 2-D tile shape for the query token layout in the backward pass kernel. Ignored by backends that do not expose this knob (see backends). |
| `backward_kv_tile_shape` | `Tuple[int, int]` | 2-D tile shape for the key/value token layout in the backward pass kernel. Ignored by backends that do not expose this knob (see backends). |
| `backward_kv_splits` | `Tuple[int, int]` | Number of key/value tiles allowed to work in parallel in the backward pass kernel. Like tile shapes, this is a tuple and not an integer for neighborhood attention operations, and the size of the tuple corresponds to the number of dimensions / rank of the token layout. Only respected by backends that support KV parallelism (see backends). |
| `backward_use_pt_reduction` | `bool` | Whether to use PyTorch eager ops instead of the native kernel for the reduction step in the backward pass. Defaults to `False`. |
| `run_persistent_kernel` | `bool` | Whether to use persistent tile scheduling in the forward pass kernel. Only applies to backends with persistent kernels (see backends). |
| `kernel_schedule` | `Optional[str]` | Kernel type (Hopper architecture only). See backends for available choices. |
| `torch_compile` | `bool` | Applies only to the Flex Attention backend; whether to allow it to use `torch.compile`. |
| `attention_kwargs` | `Optional[Dict]` | Arguments to the `attention` operator, when it is used to implement neighborhood cross-attention, or self attention as a fast path for neighborhood attention. If, for a given use case, the neighborhood attention problem is equivalent to self attention (not causal along any dims, and `kernel_size` covering the entire token layout), NATTEN dispatches directly to `attention`. You can override arguments to `attention` by passing a dictionary here. |
Returns:
| Name | Type | Description |
|---|---|---|
| `output` | `Tensor` | 5-D output tensor, with the heads last layout (`[batch, X, Y, heads, head_dim]`). |
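A usage sketch with per-dimension window, stride, dilation, and causal settings (shapes, device, and dtype are illustrative assumptions):

import torch
from natten import na2d

# Heads-last layout: [batch, X, Y, heads, head_dim].
q = torch.randn(1, 32, 48, 4, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# Per-dimension window, stride, dilation, and causal masking.
out = na2d(
    q, k, v,
    kernel_size=(9, 13),
    stride=(1, 2),
    dilation=(2, 1),
    is_causal=(False, True),
)
assert out.shape == q.shape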
natten.na3d
na3d(
query,
key,
value,
kernel_size,
stride=1,
dilation=1,
is_causal=False,
scale=None,
additional_keys=None,
additional_values=None,
attention_kwargs=None,
backend=None,
q_tile_shape=None,
kv_tile_shape=None,
backward_q_tile_shape=None,
backward_kv_tile_shape=None,
backward_kv_splits=None,
backward_use_pt_reduction=False,
run_persistent_kernel=True,
kernel_schedule=None,
torch_compile=False,
)
Computes 3-D neighborhood attention.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Tensor` | 6-D query tensor, with the heads last layout (`[batch, X, Y, Z, heads, head_dim]`). | required |
| `key` | `Tensor` | 6-D key tensor, with the heads last layout (`[batch, X, Y, Z, heads, head_dim]`). | required |
| `value` | `Tensor` | 6-D value tensor, with the heads last layout (`[batch, X, Y, Z, heads, head_dim]`). | required |
| `kernel_size` | `Tuple[int, int, int] \| int` | Neighborhood window (kernel) size/shape. If an integer, it will be repeated for all 3 dimensions (for example, `kernel_size=3` is equivalent to `kernel_size=(3, 3, 3)`). Note: `kernel_size` must be smaller than or equal to the number of tokens along each dimension. | required |
| `stride` | `Tuple[int, int, int] \| int` | Sliding window step size/shape. Defaults to `1` (standard sliding window). Note: `stride` must be smaller than or equal to `kernel_size` along each dimension. | `1` |
| `dilation` | `Tuple[int, int, int] \| int` | Dilation step size/shape. Defaults to `1` (no dilation). Note: the product of `dilation` and `kernel_size` must be smaller than or equal to the number of tokens along each dimension. | `1` |
| `is_causal` | `Tuple[bool, bool, bool] \| bool` | Toggle causal masking. Defaults to `False`. | `False` |
| `scale` | `float` | Attention scale. Defaults to `head_dim ** -0.5`. | `None` |
| `additional_keys` | `Optional[Tensor]` | Optional key tensor holding additional (cross-attention) context that every query token attends to, alongside its own neighborhood (see `natten.attention`). Defaults to `None`. | `None` |
| `additional_values` | `Optional[Tensor]` | Optional value tensor corresponding to `additional_keys`. Note: `additional_keys` and `additional_values` must either both be `None`, or both be tensors. | `None` |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| `backend` | `str` | Backend implementation to run with. Choices are the neighborhood attention backends listed in backends, subject to availability on your GPU architecture. |
| `q_tile_shape` | `Tuple[int, int, int]` | 3-D tile shape for the query token layout in the forward pass kernel. You can use the profiler to find valid choices for your use case, and search for the best combination. |
| `kv_tile_shape` | `Tuple[int, int, int]` | 3-D tile shape for the key/value token layout in the forward pass kernel. You can use the profiler to find valid choices for your use case, and search for the best combination. |
| `backward_q_tile_shape` | `Tuple[int, int, int]` | 3-D tile shape for the query token layout in the backward pass kernel. Ignored by backends that do not expose this knob (see backends). |
| `backward_kv_tile_shape` | `Tuple[int, int, int]` | 3-D tile shape for the key/value token layout in the backward pass kernel. Ignored by backends that do not expose this knob (see backends). |
| `backward_kv_splits` | `Tuple[int, int, int]` | Number of key/value tiles allowed to work in parallel in the backward pass kernel. Like tile shapes, this is a tuple and not an integer for neighborhood attention operations, and the size of the tuple corresponds to the number of dimensions / rank of the token layout. Only respected by backends that support KV parallelism (see backends). |
| `backward_use_pt_reduction` | `bool` | Whether to use PyTorch eager ops instead of the native kernel for the reduction step in the backward pass. Defaults to `False`. |
| `run_persistent_kernel` | `bool` | Whether to use persistent tile scheduling in the forward pass kernel. Only applies to backends with persistent kernels (see backends). |
| `kernel_schedule` | `Optional[str]` | Kernel type (Hopper architecture only). See backends for available choices. |
| `torch_compile` | `bool` | Applies only to the Flex Attention backend; whether to allow it to use `torch.compile`. |
| `attention_kwargs` | `Optional[Dict]` | Arguments to the `attention` operator, when it is used to implement neighborhood cross-attention, or self attention as a fast path for neighborhood attention. If, for a given use case, the neighborhood attention problem is equivalent to self attention (not causal along any dims, and `kernel_size` covering the entire token layout), NATTEN dispatches directly to `attention`. You can override arguments to `attention` by passing a dictionary here. |
Returns:
| Name | Type | Description |
|---|---|---|
| `output` | `Tensor` | 6-D output tensor, with the heads last layout (`[batch, X, Y, Z, heads, head_dim]`). |
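A usage sketch showing that scalar arguments broadcast across all three dimensions (shapes, device, and dtype are illustrative assumptions):

import torch
from natten import na3d

# Heads-last layout: [batch, X, Y, Z, heads, head_dim], e.g. a video volume.
q = torch.randn(1, 8, 24, 24, 4, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

# An integer kernel_size is repeated across all 3 dimensions: 5 -> (5, 5, 5).
out = na3d(q, k, v, kernel_size=5)

# Per-dimension specification: causal only along the first (e.g. temporal) axis.
out = na3d(q, k, v, kernel_size=(5, 5, 5), is_causal=(True, False, False))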
Standard Attention
natten.attention
attention(
query,
key,
value,
is_causal=False,
scale=None,
seqlens_Q=None,
seqlens_KV=None,
cumulative_seqlen_Q=None,
cumulative_seqlen_KV=None,
max_seqlen_Q=None,
max_seqlen_KV=None,
backend=None,
q_tile_size=None,
kv_tile_size=None,
backward_q_tile_size=None,
backward_kv_tile_size=None,
backward_kv_splits=None,
backward_use_pt_reduction=False,
run_persistent_kernel=True,
kernel_schedule=None,
torch_compile=False,
return_lse=False,
)
Runs standard dot product attention.
This operation is used to implement neighborhood cross attention, in which we allow every
token to interact with some additional context (additional_keys and additional_values
tensors in na1d, na2d, and na3d).
This operator is also used as a fast path for cases where neighborhood attention is equivalent
to self attention (not causal along any dims, and kernel_size is equal to the number of input
tokens).
This operation does not call into PyTorch's SDPA, and only runs one of the NATTEN backends
(cutlass-fmha, hopper-fmha, blackwell-fmha, flex-fmha). Reasons for this include control over
performance-related arguments, the ability to return logsumexp, and more.
For more information refer to backends.
Causal masking and variable-length (varlen) attention are also supported by some backends
(cutlass-fmha and blackwell-fmha).
Varlen Attention is only supported for the sequence-packed layout: QKV tensors have batch size
1, and tokens from different batches are concatenated without any padding along the sequence
dimension. Sequence lengths for different batches can be provided in two ways:
1. seqlens_Q and seqlens_KV (less efficient): only provide the sequence lengths as
integer tensors (must be on the same device as QKV), and NATTEN will compute cumulative
and maximum sequence lengths on each call.
2. cumulative_seqlen_{Q,KV} and max_seqlen_{Q,KV} (more efficient): provide precomputed
cumulative and maximum sequence lengths. cumulative_seqlen_{Q,KV} are integer
tensors on the same device as QKV containing the cumulative sum of seqlens_{Q,KV},
with an additional 0 element in the beginning, therefore sized batch+1.
max_seqlen_{Q,KV} are integers (not Tensors) that represent the maximum sequence
lengths for Q and KV among all sequence batches.
You can use natten.utils.varlen.generate_varlen_parameters to generate these
parameters:
from natten.utils.varlen import generate_varlen_parameters
(
cumulative_seqlen_Q,
cumulative_seqlen_KV,
max_seqlen_Q,
max_seqlen_KV,
) = generate_varlen_parameters(q, k, v, seqlens_Q, seqlens_KV)
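Putting it together, a varlen sketch packing two sequences into a single batch (device, dtype, and the int32 sequence-length dtype are illustrative assumptions):

import torch
from natten import attention
from natten.utils.varlen import generate_varlen_parameters

heads, head_dim = 8, 64
seqlens = [1024, 512]  # two sequences packed together

# Sequence-packed layout: batch size 1, tokens concatenated along the sequence dim.
q = torch.randn(1, sum(seqlens), heads, head_dim, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

seqlens_Q = torch.tensor(seqlens, device=q.device, dtype=torch.int32)
seqlens_KV = seqlens_Q

(
    cumulative_seqlen_Q,
    cumulative_seqlen_KV,
    max_seqlen_Q,
    max_seqlen_KV,
) = generate_varlen_parameters(q, k, v, seqlens_Q, seqlens_KV)

out = attention(
    q, k, v,
    cumulative_seqlen_Q=cumulative_seqlen_Q,
    cumulative_seqlen_KV=cumulative_seqlen_KV,
    max_seqlen_Q=max_seqlen_Q,
    max_seqlen_KV=max_seqlen_KV,
)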
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `query` | `Tensor` | 4-D query tensor, with the heads last layout (`[batch, seqlen, heads, head_dim]`). | required |
| `key` | `Tensor` | 4-D key tensor, with the heads last layout (`[batch, seqlen, heads, head_dim]`). | required |
| `value` | `Tensor` | 4-D value tensor, with the heads last layout (`[batch, seqlen, heads, head_dim]`). | required |
| `is_causal` | `bool` | Toggle causal masking. Defaults to `False`. | `False` |
| `scale` | `float` | Attention scale. Defaults to `head_dim ** -0.5`. | `None` |
| `seqlens_Q` | `Optional[Tensor]` | (varlen) Optional 1-D integer tensor with size `batch`, containing the query sequence length of each sequence in the packed batch. Must be on the same device as the QKV tensors. | `None` |
| `seqlens_KV` | `Optional[Tensor]` | (varlen) Optional 1-D integer tensor with size `batch`, containing the key/value sequence length of each sequence in the packed batch. Must be on the same device as the QKV tensors. | `None` |
| `cumulative_seqlen_Q` | `Optional[Tensor]` | (varlen) Optional 1-D integer tensor with size `batch + 1`, containing the cumulative sum of query sequence lengths with an additional leading `0` element. Must be on the same device as the QKV tensors, and passed together with `max_seqlen_Q`. | `None` |
| `cumulative_seqlen_KV` | `Optional[Tensor]` | (varlen) Optional 1-D integer tensor with size `batch + 1`, containing the cumulative sum of key/value sequence lengths with an additional leading `0` element. Must be on the same device as the QKV tensors, and passed together with `max_seqlen_KV`. | `None` |
| `max_seqlen_Q` | `Optional[int]` | (varlen) Optional integer indicating the maximum query sequence length in all batches. Must be passed together with `cumulative_seqlen_Q`. | `None` |
| `max_seqlen_KV` | `Optional[int]` | (varlen) Optional integer indicating the maximum key/value sequence length in all batches. Must be passed together with `cumulative_seqlen_KV`. | `None` |
Other Parameters:
| Name | Type | Description |
|---|---|---|
| `backend` | `str` | Backend implementation to run with. Choices are: `"cutlass-fmha"`, `"hopper-fmha"`, `"blackwell-fmha"`, and `"flex-fmha"`, subject to availability on your GPU architecture (see backends). |
| `q_tile_size` | `int` | Tile size along the query sequence length in the forward pass kernel. You can use the profiler to find valid choices for your use case. |
| `kv_tile_size` | `int` | Tile size along the key/value sequence length in the forward pass kernel. You can use the profiler to find valid choices for your use case. |
| `backward_q_tile_size` | `int` | Tile size along the query sequence length in the backward pass kernel. Ignored by backends that do not expose this knob (see backends). |
| `backward_kv_tile_size` | `int` | Tile size along the key/value sequence length in the backward pass kernel. Ignored by backends that do not expose this knob (see backends). |
| `backward_kv_splits` | `int` | Number of key/value tiles allowed to work in parallel in the backward pass kernel. Only respected by backends that support KV parallelism (see backends). |
| `backward_use_pt_reduction` | `bool` | Whether to use PyTorch eager ops instead of the native kernel for the reduction step in the backward pass. Defaults to `False`. |
| `run_persistent_kernel` | `bool` | Whether to use persistent tile scheduling in the forward pass kernel. Only applies to backends with persistent kernels (see backends). |
| `kernel_schedule` | `Optional[str]` | Kernel type (Hopper architecture only). See backends for available choices. |
| `torch_compile` | `bool` | Applies only to the Flex Attention backend; whether to allow it to use `torch.compile`. |
| `return_lse` | `bool` | Whether or not to return the `logsumexp` tensor along with the attention output (required for `merge_attentions`). Defaults to `False`. |
Returns:
| Name | Type | Description |
|---|---|---|
| `output` | `Tensor` | 4-D output tensor, with the heads last layout (`[batch, seqlen, heads, head_dim]`). |
| `logsumexp` | `Tensor` | `logsumexp` tensor corresponding to the attention output; only returned when `return_lse=True`. |
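A minimal usage sketch (shapes, device, and dtype are illustrative assumptions; with return_lse=True the operator returns both the output and its logsumexp, as indicated in the returns table above):

import torch
from natten import attention

# Heads-last layout: [batch, seqlen, heads, head_dim].
q = torch.randn(2, 4096, 8, 64, device="cuda", dtype=torch.float16)
k = torch.randn_like(q)
v = torch.randn_like(q)

out = attention(q, k, v, is_causal=True)

# Also request logsumexp, e.g. to later merge partial attention outputs.
out, lse = attention(q, k, v, is_causal=True, return_lse=True)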
natten.merge_attentions
Takes multiple attention outputs originating from the same query tensor, along with their corresponding logsumexps, and merges them as if their contexts (key/value pairs) had been concatenated.
This operation is used to implement neighborhood cross-attention, and can also be used in distributed settings, such as context parallelism.
This operation also attempts to use torch.compile to fuse the elementwise operations. This
can be disabled by passing torch_compile=False.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| `outputs` | `List[Tensor]` | List of 4-D attention output tensors, with the heads last layout (`[batch, seqlen, heads, head_dim]`). | required |
| `lse_tensors` | `List[Tensor]` | List of 3-D logsumexp tensors, with the heads last layout (`[batch, seqlen, heads]`). | required |
| `torch_compile` | `bool` | Attempt to use `torch.compile` to fuse the elementwise operations. Defaults to `True`. | `True` |
Returns:
| Name | Type | Description |
|---|---|---|
| `output` | `Tensor` | Merged attention output tensor. |
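A sketch of merging two attention calls over disjoint key/value chunks, which should match attending over the concatenated context up to numerical error (shapes, device, and dtype are illustrative assumptions):

import torch
from natten import attention, merge_attentions

# One query set attending to two disjoint key/value chunks.
q = torch.randn(1, 2048, 8, 64, device="cuda", dtype=torch.float16)
k0 = torch.randn(1, 1024, 8, 64, device="cuda", dtype=torch.float16)
v0 = torch.randn_like(k0)
k1 = torch.randn_like(k0)
v1 = torch.randn_like(k0)

out0, lse0 = attention(q, k0, v0, return_lse=True)
out1, lse1 = attention(q, k1, v1, return_lse=True)

# Merge the partial results, as if q had attended over the concatenated context.
merged = merge_attentions([out0, out1], [lse0, lse1])

# Reference: attention over the concatenated key/value pair.
reference = attention(q, torch.cat([k0, k1], dim=1), torch.cat([v0, v1], dim=1))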