
Install NATTEN

Newest release: 0.21.0 | Changelog.

Starting with version 0.20.0, NATTEN only supports PyTorch 2.7 and newer. However, you can still attempt to install NATTEN with PyTorch >= 2.5 at your own risk. For earlier NATTEN releases, please refer to Older NATTEN builds.

We strongly recommend that NVIDIA GPU users install NATTEN with our CUDA kernel library, libnatten. libnatten packs our FNA kernels built with CUTLASS, which are specifically designed to deliver all of the functionality available in NATTEN with the best possible performance.

If you're a Linux (or WSL) user with NVIDIA GPUs, refer to NATTEN + libnatten via PyPI for instructions on installing NATTEN with wheels (the fastest option).

If you're using a PyTorch version we don't support, or you need to use NATTEN with non-official PyTorch builds (nightly builds, NGC images, etc.), you will have to build NATTEN + libnatten yourself.

If you're a Windows user with NVIDIA GPUs, you can try to build NATTEN from source with MSVC. However, note that we are unable to regularly test or support Windows builds. Pull requests and patches are strongly encouraged.

All non-NVIDIA GPU users can only use our PyTorch (Flex Attention) backend, and not libnatten. Refer to NATTEN via PyPI for more information.

NATTEN + libnatten via PyPI

We offer pre-built wheels (binaries) for the most recent official PyTorch builds. To install NATTEN using wheels, please first check your PyTorch version, and select it below.

torch==2.7.0+cu128
pip install natten==0.21.0+torch270cu128 -f https://whl.natten.org
torch==2.7.0+cu126
pip install natten==0.21.0+torch270cu126 -f https://whl.natten.org

Warning

Blackwell FNA/FMHA kernels are not available in this build. Blackwell support was introduced in CUDA Toolkit 12.8.

Don't know your PyTorch build? Other questions?

If you don't know your PyTorch build, simply run this command:

python -c "import torch; print(torch.__version__)"

2.7.0+cu126

In the above example, we're seeing PyTorch 2.7.0, built with CUDA Toolkit 12.6.

If you see a different version tag pattern, you might be using a nightly build, or your environment/container may have built PyTorch from source. If you see an older PyTorch version, we sadly don't support it, and therefore don't ship wheels for it.

In either case, your only option is to build NATTEN + libnatten locally.

If you see a CUDA Toolkit build older than 12.0, we unfortunately no longer support it.

If you see something else, or have further questions, feel free to open an issue.

NATTEN via PyPI

If you can't install NATTEN with libnatten, you can still use our Flex Attention backend. However, note that you can only use this backend as long as Flex Attention is supported on your device. Also note that Flex Attention with torch.compile is not allowed by default for safety. For more information refer to Flex Attention + torch.compile.

torch>=2.7.0
pip install natten==0.21.0

Build NATTEN + libnatten

First make sure that you have your preferred PyTorch build installed, since NATTEN's setup script requires PyTorch to even run.

Dependencies & Requirements

Obviously, Python and PyTorch.

Builds with libnatten require cmake (1), CUDA runtime (2), and CUDA Toolkit (3).

  1. Ideally install system-wide, but you can also install via PyPI:

    pip install cmake==3.20.3
    
  2. Usually ships with the CUDA driver. As long as you can run nvidia-smi successfully, you should be good:

    nvidia-smi
    
  3. Mainly for the CUDA compiler. Confirm you have it by running:

    which nvcc
    

    /usr/bin/nvcc
    

    to make sure it exists, and then confirm that the CUDA Toolkit version reported is 12.0 or higher:

    nvcc --version
    

    nvcc: NVIDIA (R) Cuda compiler driver
    Copyright (c) 2005-2023 NVIDIA Corporation
    Built on Tue_Aug_15_22:02:13_PDT_2023
    Cuda compilation tools, release 12.2, V12.2.140
    Build cuda_12.2.r12.2/compiler.33191640_0
    

libnatten depends heavily on CUTLASS, a header-only open-source library that is fetched as a git submodule, so you don't need to install anything for it.
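
If you'd like to sanity-check all of these at once, here's a minimal sketch, assuming a POSIX shell with everything on your path:

cmake --version
nvidia-smi
nvcc --version
python -c "import torch; print(torch.__version__, torch.version.cuda)"

All four commands should succeed, and the last one should report your PyTorch version along with the CUDA Toolkit version it was built with.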

To build NATTEN with libnatten, you can still use PyPI, or build from source.

Build NATTEN with libnatten on CUDA-supported devices
pip install natten==0.21.0

By default, NATTEN will detect your GPU architecture and build libnatten specifically for that architecture. However, note that this will not work in container image builders (e.g., Dockerfile builds), as they do not typically expose the CUDA driver. If the setup does not detect a CUDA device, it will skip building libnatten.

If you want to specify your target GPU architecture(s) manually, which will allow building without the CUDA driver present (still requires the compiler in CUDA Toolkit), set the NATTEN_CUDA_ARCH environment variable to a semicolon-separated list of the compute capabilities corresponding to your desired architectures.

NATTEN_CUDA_ARCH="8.9" pip install natten==0.21.0 # (1)!

NATTEN_CUDA_ARCH="9.0" pip install natten==0.21.0 # (2)!

NATTEN_CUDA_ARCH="10.0" pip install natten==0.21.0 # (3)!

NATTEN_CUDA_ARCH="8.0;8.6;9.0;10.0" pip install natten==0.21.0 # (4)!
  1. Build targeting SM89 (Ada Lovelace)
  2. Build targeting SM90 (Hopper)
  3. Build targeting SM100 (Blackwell)
  4. Multi-arch build targeting SM80, SM86, SM90 and SM100
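
If you're not sure which compute capability your GPU corresponds to, PyTorch can report it:

python -c "import torch; print(torch.cuda.get_device_capability())"

(8, 9)

In the above example, the device is SM89 (Ada Lovelace), so NATTEN_CUDA_ARCH="8.9" is the right choice.

Setting the architecture explicitly is also what makes container image builds possible. Below is a minimal Dockerfile sketch; the base image tag is an assumption, so pick one matching your target PyTorch and CUDA Toolkit versions:

FROM pytorch/pytorch:2.7.0-cuda12.8-cudnn9-devel

# cmake via PyPI, as noted under Dependencies & Requirements
RUN pip install cmake==3.20.3

# No CUDA driver is available at image build time, so specify the target architecture
ENV NATTEN_CUDA_ARCH="9.0"
RUN pip install natten==0.21.0

Because NATTEN_CUDA_ARCH is set, the setup never needs to query the CUDA driver, which is unavailable during the image build.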

By default, the setup will attempt to use a quarter of the processor threads available on your system. If possible, we recommend using more to maximize build parallelism and minimize build time. Using more than 64 workers (if available) should generally not make a difference, since we usually generate around 60 build targets.

You can customize the number of workers by setting the NATTEN_N_WORKERS environment variable:

NATTEN_N_WORKERS=16 pip install natten==0.21.0 # (1)!

NATTEN_N_WORKERS=64 pip install natten==0.21.0 # (2)!
  1. Build with 16 parallel workers
  2. Build with 64 parallel workers

You can also use the --verbose option in pip to monitor the compilation, and set the NATTEN_VERBOSE environment variable to toggle cmake's verbose mode, which prints out all compilation targets and compile flags.
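
For example (assuming any non-empty value enables NATTEN_VERBOSE):

NATTEN_VERBOSE=1 pip install natten==0.21.0 --verbose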

Build from source

Just clone, optionally check out your desired version or commit, and run make:

git clone --recursive https://github.com/SHI-Labs/NATTEN
cd NATTEN

make

The setup script is identical to the one we ship with PyPI, therefore you have the same options and the same behavior.

If you want to specify your target GPU architecture(s) manually, instead of letting them get detected via PyTorch talking to the CUDA driver:

make CUDA_ARCH="8.9" # (1)!

make CUDA_ARCH="9.0" # (2)!

make CUDA_ARCH="10.0" # (3)!

make CUDA_ARCH="8.0;8.6;9.0;10.0" # (4)!
  1. Build targeting SM89 (Ada Lovelace)
  2. Build targeting SM90 (Hopper)
  3. Build targeting SM100 (Blackwell)
  4. Multi-arch build targeting SM80, SM86, SM90 and SM100

Customizing the number of workers works similarly. Using more than 64 workers should generally not make a difference, unless your use case demands more build targets and you use our autogen scripts (under scripts/autogen*.py in our repository) to increase the number of targets.

make WORKERS=16 # (1)!

make WORKERS=64 # (2)!
  1. Build with 16 parallel workers
  2. Build with 64 parallel workers

You can enable verbose mode to view more details, and compilation progress:

make VERBOSE=1

It is recommended to run all unit tests when you build from source. It may take up to 30 minutes:

make test

If you want to do an editable (development) install, use make dev instead of make.
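
Since make dev uses the same setup script, the same variables should apply (the WORKERS value here is just an example):

make dev WORKERS=32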

Build from source on Windows

Build with MSVC

NOTE: Windows builds are experimental and not regularly tested.

First clone NATTEN, and make sure to fetch all submodules. If you're cloning with Visual Studio, it might clone submodules by default. If you're using a command line tool like Git Bash (MinGW) or WSL, the process is the same as on Linux, as shown below.
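
If your clone ended up without submodules (for instance, if you forgot --recursive), you can fetch them after the fact:

git submodule update --init --recursive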

To build with MSVC, please open the "Native Tools Command Prompt for Visual Studio". The exact name may depend on your version of Windows, Visual Studio, and CPU architecture (in our case, it was "x64 Native Tools Command Prompt for VS").

Once in the command prompt, make sure the correct Python environment is on your system path. If you're using Anaconda, you should be able to run conda activate $YOUR_ENV_NAME.

Then simply confirm you have PyTorch installed, and use our Windows batch script to build:

WindowsBuilder.bat install

WindowsBuilder.bat install WORKERS=8 # (1)!

WindowsBuilder.bat install CUDA_ARCH=8.9 # (2)!
  1. Build with 8 parallel workers
  2. Build targeting SM89 (Ada)

Note that build time may vary depending on how many workers you end up using, and the MSVC compiler tends to throw plenty of warnings at you. As long as the build does not fail and hand you back the command prompt, just let it keep going.

Once it's done building, it is highly recommended to run the unit tests to make sure everything went as expected:

WindowsBuilder.bat test

PyTorch issue: nvToolsExt not found

Windows users may come across this issue when building NATTEN from source with CUDA 12.0 and newer. The build process fails with an error indicating "nvtoolsext" cannot be found on your system. This is because nvToolsExt binaries are no longer part of the CUDA Toolkit for Windows starting with CUDA 12.0, but PyTorch's cmake config still looks for them (as of torch==2.2.1).

The only workaround is to modify the following files:

$PATH_TO_YOUR_PYTHON_ENV\Lib\site-packages\torch\share\cmake\Torch\TorchConfig.cmake
$PATH_TO_YOUR_PYTHON_ENV\Lib\site-packages\torch\share\cmake\Caffe2\Caffe2Targets.cmake
$PATH_TO_YOUR_PYTHON_ENV\Lib\site-packages\torch\share\cmake\Caffe2\public\cuda.cmake

Find all mentions of nvtoolsext (or nvToolsExt) and comment them out to get past it.
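
A quick way to locate every mention from the command prompt (replace $PATH_TO_YOUR_PYTHON_ENV as above; findstr ships with Windows):

findstr /s /i /n "nvtoolsext" "$PATH_TO_YOUR_PYTHON_ENV\Lib\site-packages\torch\share\cmake\*.cmake"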

More information on the issue: pytorch/pytorch#116926

Checking whether you have libnatten

If you installed NATTEN, and want to check whether you have libnatten, or your device supports it, run the following:

python -c "import natten; print(natten.HAS_LIBNATTEN)"

If True, you're good to go. If False, it could mean either that you didn't install NATTEN with libnatten, or that your current device doesn't support it. Feel free to open an issue if you have questions.

Older NATTEN builds

We highly recommend using the latest NATTEN builds, but if you need an older NATTEN version, you can still install it via PyPI. We only offer wheels for releases >= 0.20.0 and for 0.17.5. Earlier releases will have to be compiled locally.

0.20.1

Released on 2025-06-14. Changelog.

torch==2.7.0+cu128
pip install natten==0.20.1+torch270cu128 -f https://whl.natten.org
torch==2.7.0+cu126
pip install natten==0.20.1+torch270cu126 -f https://whl.natten.org

Warning

Blackwell FNA/FMHA kernels are not available in this build. Blackwell support was introduced in CUDA Toolkit 12.8.

Compile locally (custom torch build)
pip install natten==0.20.1

Refer to NATTEN via PyPI for more details.

0.20.0

Released on 2025-06-07. Changelog.

torch==2.7.0+cu128
pip install natten==0.20.0+torch270cu128 -f https://whl.natten.org
torch==2.7.0+cu126
pip install natten==0.20.0+torch270cu126 -f https://whl.natten.org

Warning

Blackwell FNA/FMHA kernels are not available in this build. Blackwell support was introduced in CUDA Toolkit 12.8.

Compile locally (custom torch build)
pip install natten==0.20.0

Refer to NATTEN via PyPI for more details.

0.17.5

Released on 2025-03-20. Changelog.

Warning

Wheels will be phased out and removed in the coming months. We strongly recommend upgrading to NATTEN 0.21.0.

torch==2.6.0+cu126
pip install natten==0.17.5+torch260cu126 -f https://whl.natten.org
torch==2.6.0+cu124
pip install natten==0.17.5+torch260cu124 -f https://whl.natten.org
torch==2.6.0+cpu
pip install natten==0.17.5+torch260cpu -f https://whl.natten.org
torch==2.5.0+cu124
pip install natten==0.17.5+torch250cu124 -f https://whl.natten.org
torch==2.5.0+cu121
pip install natten==0.17.5+torch250cu121 -f https://whl.natten.org
torch==2.5.0+cpu
pip install natten==0.17.5+torch250cpu -f https://whl.natten.org
Compile locally (custom torch build)
pip install natten==0.17.5

Refer to NATTEN via PyPI for more details.

< 0.17.5

Refer to our release index on PyPI.