An FFT for every GPU: ferrum-gpu and gpufft
Reach for a GPU FFT in Python and the path of least resistance is cuFFT, which means CUDA, which means NVIDIA. The math of a Fourier transform is vendor-neutral; the libraries are where the silicon gets pinned down. These two packages, both on PyPI today, take the pin out from opposite ends.
gpufft is the pragmatic end: one Python FFT API over two mature engines, cuFFT on NVIDIA and VkFFT on any Vulkan device, so the same call runs on NVIDIA, AMD, Intel, and Apple Silicon. ferrum-gpu is the long end: the FFT kernels written in regular Rust, compiled straight to the GPU with no CUDA C anywhere, betting on a single kernel language that lowers to every accelerator. They share a Python surface, so code written against one runs on the other.
| | gpufft | ferrum-gpu |
|---|---|---|
| kernels | cuFFT + VkFFT (vendor libraries) | pure Rust → PTX (cuda-oxide) |
| performance | the vendor libraries themselves | within 1.3-3.7× of cuFFT |
| GPUs today | NVIDIA, AMD, Intel, Apple | NVIDIA |
| the bet | cross-vendor reach you can use now | one kernel language for every accelerator |
gpufft: cross-vendor FFTs today
VkFFT is a single-header GPU FFT that runs on Vulkan, which every modern GPU speaks: AMD, Intel, Apple (through MoltenVK), and NVIDIA. cuFFT is NVIDIA's decade-tuned library. gpufft puts both behind one API and lets you pick the engine with a keyword, with the same numpy array going in and coming out.
pip install gpufft
import numpy as np
import gpufft
x = (np.random.randn(4096) + 1j*np.random.randn(4096)).astype(np.complex64)
X = gpufft.fft.fft_1d_c2c_pow2(x, log_n=12, backend="cuda") # NVIDIA, via cuFFT
X = gpufft.fft.fft_1d_c2c_pow2(x, log_n=12, backend="vulkan") # AMD / Intel / Apple / NVIDIA, via VkFFT
Y = gpufft.fft.fft_1d_c2c(x, n=4096) # arbitrary (non-power-of-two) N
gpufft call, two engines. cuFFT covers NVIDIA; VkFFT covers every Vulkan device, which is to say nearly all of them.
The wheel on PyPI is manylinux_2_28, abi3 (Python 3.10+). It bundles the VkFFT shader pipeline (VkFFT plus SPIR-V tooling) but leaves the heavy vendor libraries external: cuFFT comes from your CUDA 13.x install, the Vulkan loader from your driver. This first release ships 1D complex-to-complex transforms on both backends, power-of-two and arbitrary sizes, batched, with persistent gpufft.cuda.Device and gpufft.vulkan.Device handles. 2D/3D, real transforms, and f64 follow.
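The two entry points above differ only in how the length is passed: log_n for power-of-two sizes, n for everything else. A minimal sketch of choosing between them from the array length follows. The helper name plan_fft_call is hypothetical and not part of gpufft; only the function names and keyword names are taken from the calls shown above.

```python
def plan_fft_call(n: int):
    """Pick the entry point name and keyword arguments for a length-n transform.

    Hypothetical helper: it only mirrors the two call shapes shown in the
    text, fft_1d_c2c_pow2(x, log_n=...) and fft_1d_c2c(x, n=...).
    """
    if n < 1:
        raise ValueError("transform length must be positive")
    if (n & (n - 1)) == 0:
        # Exactly one bit set: n is a power of two, and log_n is that bit's index.
        return "fft_1d_c2c_pow2", {"log_n": n.bit_length() - 1}
    return "fft_1d_c2c", {"n": n}


print(plan_fft_call(4096))  # power-of-two path
print(plan_fft_call(1000))  # arbitrary-size path
```

With gpufft installed, the result could presumably be dispatched as getattr(gpufft.fft, name)(x, **kwargs); that line is left out of the block so the sketch runs without a GPU.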
ferrum-gpu: the kernels in pure Rust
ferrum-gpu takes the other road. The FFT kernels are written in regular Rust, compiled to NVIDIA PTX by cuda-oxide, and dispatched on the GPU through a thin safe-Rust runtime. No CUDA C, no kernel-DSL, no foreign-language bindings on the device side. You install it with pip and call it over numpy arrays in three lines.
pip install ferrum-gpu
import numpy as np
import ferrum_gpu as fgpu
arr = np.array([1+0j, 2+0j, 3+0j, 4+0j], dtype=np.complex64)
print(fgpu.fft.fft_1d_c2c_pow2(arr, log_n=2))
# [10+0j, -2+2j, -2+0j, -2-2j] # matches numpy.fft.fft(arr)
A #[kernel] function in Rust is compiled by cuda-oxide's rustc backend, which lowers MIR through a Pliron MLIR-style IR to PTX, the virtual instruction set the NVIDIA driver JIT-compiles into native SASS at load time. The host-side runtime owns the CUDA context, allocates device buffers, packs kernel arguments, and dispatches launches. PyO3 + maturin wrap the whole thing into a Python wheel.
cuda-oxide, dispatched through ferrum-gpu's runtime to NVIDIA GPUs.
Performance
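The four-point result quoted in the snippet can be reproduced on the CPU straight from the DFT definition, which gives a reference that is independent of any FFT library. This is a sketch for checking outputs, not part of ferrum-gpu; dft_reference and max_rel_error are illustrative names.

```python
import numpy as np


def dft_reference(x):
    """O(N^2) DFT from the definition X[k] = sum_n x[n] * exp(-2*pi*i*k*n/N)."""
    x = np.asarray(x, dtype=np.complex128)
    n = x.shape[0]
    k = np.arange(n)
    # twiddles[k, m] = exp(-2*pi*i*k*m/N); the transform is a matrix-vector product.
    twiddles = np.exp(-2j * np.pi * np.outer(k, k) / n)
    return twiddles @ x


def max_rel_error(got, want):
    """Largest absolute deviation, scaled by the largest reference magnitude."""
    got, want = np.asarray(got), np.asarray(want)
    return float(np.max(np.abs(got - want)) / np.max(np.abs(want)))


arr = np.array([1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j], dtype=np.complex64)
expected = np.array([10 + 0j, -2 + 2j, -2 + 0j, -2 - 2j])
print(max_rel_error(dft_reference(arr), expected))
```

The same max_rel_error call can compare a GPU result against dft_reference or numpy.fft.fft; for complex64 data a loose tolerance is more realistic than exact equality.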
v0.2.0 replaces the generic radix-2 fallback at the three power-of-2 sizes that matter with specialised Stockham kernels: radix-4 for N = 256 and N = 1024, and radix-8 for N = 4096. cuFFT remains the target to beat. Per-FFT microseconds on an RTX 5060 Laptop GPU (sm_120, Blackwell) with the graphics clock locked to 1500 MHz, medians over 100 trials after a 10-trial warmup, alternating ferrum-gpu and cuFFT launches per trial so DVFS affects both backends symmetrically:
| N | kernel | ferrum-gpu (µs) | cuFFT (µs) | ratio |
|---|---|---|---|---|
| 256 | radix-4 | 0.035 | 0.026 | 1.32× |
| 1024 | radix-4 | 0.102 | 0.047 | 2.13× |
| 4096 | radix-8 | 0.509 | 0.137 | 3.69× |
The specialised kernels run 2-5× faster than the radix-2 fallback they replace, and land within 1.32× of cuFFT at N = 256 and 3.69× at N = 4096, closing most of the 5-7× gap v0.1.0 carried. The remaining distance is algorithmic. A profiler trace (Nsight Compute) shows cuFFT on Blackwell dispatching a single vector_fft kernel across all three sizes, batching many FFTs per thread block, which is the structure the next kernel redesign has to match. make perf-gate regenerates this table on any host; its --gate flag asserts a 0.9× target (10% faster than cuFFT) and exits non-zero on a miss. That gate is not yet met at any size and stands as the next performance milestone.
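The measurement protocol (warmup, interleaved trials, medians, then a ratio gate) is easy to sketch on the host. The version below is an assumption-laden stand-in: it times arbitrary Python callables with time.perf_counter, whereas the real numbers use CUDA events around each kernel launch. The names interleaved_medians and gate are illustrative and are not the perf-gate binary's interface.

```python
import statistics
import time


def interleaved_medians(run_a, run_b, trials=100, warmup=10):
    """Median wall-clock time of two callables, alternated within every trial.

    Alternating A and B per trial means clock drift hits both about equally.
    Host timers are a stand-in here; GPU work would need event timing or an
    explicit synchronisation before each timestamp.
    """
    for _ in range(warmup):
        run_a()
        run_b()
    times_a, times_b = [], []
    for _ in range(trials):
        t0 = time.perf_counter()
        run_a()
        t1 = time.perf_counter()
        run_b()
        t2 = time.perf_counter()
        times_a.append(t1 - t0)
        times_b.append(t2 - t1)
    return statistics.median(times_a), statistics.median(times_b)


def gate(median_candidate, median_baseline, target=0.9):
    """Return (ratio, passed); passes when the candidate takes <= target x baseline."""
    ratio = median_candidate / median_baseline
    return ratio, ratio <= target


# Feeding in the rounded N = 256 row from the table: the gate is not met.
ratio, passed = gate(0.035, 0.026)
print(f"ratio={ratio:.2f} passed={passed}")
```

Ratios recomputed from the rounded table entries can differ slightly from the published column, which was presumably derived from unrounded medians.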
Measurement note: Hardware: RTX 5060 Laptop (sm_120), driver 580.159.03, CUDA Toolkit 13.1, measured 2026-05-29. Event-time brackets each kernel launch with CUDA events and excludes the host launch path; cuFFT's plan initialisation is amortised across the loop and not counted.
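For readers who have not met the Stockham formulation: it is an autosort FFT that ping-pongs between two buffers and writes each stage's output already in order, so no bit-reversal pass is needed. A host-side NumPy sketch follows. It is radix-2 for brevity, whereas the kernels described above are radix-4 and radix-8, and it is not ferrum-gpu's Rust code.

```python
import numpy as np


def stockham_fft(x):
    """Radix-2 Stockham autosort FFT for 1D power-of-two input (illustrative)."""
    x = np.asarray(x, dtype=np.complex128)
    n = x.shape[0]
    if n == 0 or (n & (n - 1)) != 0:
        raise ValueError("length must be a power of two")
    # View the data as (length, stride): element p*stride + q sits at [p, q].
    cur = x.reshape(n, 1)
    length, stride = n, 1
    while length > 1:
        half = length // 2
        # One twiddle per butterfly index p: exp(-2*pi*i*p/length).
        w = np.exp(-2j * np.pi * np.arange(half) / length)[:, None]
        a, b = cur[:half], cur[half:]
        # Outputs land at q + stride*(2p) and q + stride*(2p+1): already sorted.
        nxt = np.empty((half, 2, stride), dtype=np.complex128)
        nxt[:, 0, :] = a + b
        nxt[:, 1, :] = (a - b) * w
        cur = nxt.reshape(half, 2 * stride)
        length, stride = half, 2 * stride
    return cur.reshape(n)


rng = np.random.default_rng(0)
sig = rng.standard_normal(4096) + 1j * rng.standard_normal(4096)
print(np.allclose(stockham_fft(sig), np.fft.fft(sig)))
```

Each loop iteration corresponds to one pass over the data. A radix-4 or radix-8 stage folds two or three of these passes into one, which is one reason higher-radix kernels need fewer trips through memory.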
Why write the kernels in Rust
PyTorch, JAX, and TensorFlow all carry CUDA kernel layers written in CUDA C and coupled to NVIDIA's toolchain. As accelerator vendors diversify (AMD Instinct, Intel Gaudi, custom silicon), the polyglot kernel ecosystem becomes a coordination tax. One Rust source compiled through several lowering pipelines pays it once.
That is the motivation behind the spirv-oxide companion project in development: a Pliron-based rustc backend that emits SPIR-V instead of PTX, sibling to cuda-oxide and sharing the same MIR-to-Pliron front half. When it lands, the same #[kernel] fn will compile to both PTX (NVIDIA, via the driver) and SPIR-V (any Vulkan-compute device). The cross-vendor reach gpufft gets today from VkFFT, ferrum-gpu will get from its own kernels.
cuda-oxide ships today; spirv-oxide unlocks AMD, Intel, and Apple Silicon as it matures.
One API, two engines
The two packages expose the same fft_1d_c2c_pow2 signature, so the choice between them is one import line. Start on gpufft for breadth of hardware and library-grade speed today; reach for ferrum-gpu when you want a transparent kernel you can read, edit, and recompile the same afternoon, with no .cu files and no nvcc.
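Because the signature is shared, the engine can also be chosen at runtime rather than by editing an import. A minimal sketch, assuming only the fft.fft_1d_c2c_pow2(x, log_n=...) call from this post: load_fft_module and fft_pow2 are hypothetical helpers, and the NumPy branch is a CPU fallback added here for illustration, not something either package provides.

```python
import importlib

import numpy as np


def load_fft_module(preferred=("gpufft", "ferrum_gpu")):
    """Return the first importable package from `preferred`, or None."""
    for name in preferred:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    return None


def fft_pow2(x, log_n, module=None):
    """Run a power-of-two C2C FFT on `module`, or on NumPy when module is None."""
    if module is None:
        # CPU fallback (an assumption of this sketch): same length, same dtype.
        return np.fft.fft(x, n=1 << log_n).astype(np.complex64)
    # Shared signature from the post: fft_1d_c2c_pow2(x, log_n=...).
    return module.fft.fft_1d_c2c_pow2(x, log_n=log_n)


x = (np.arange(16) + 1j * np.arange(16)).astype(np.complex64)
X = fft_pow2(x, log_n=4)  # module=None keeps this demo on the CPU
print(X.dtype, X.shape)
```

In real use, fft_pow2(x, log_n=4, module=load_fft_module()) would pick gpufft first and ferrum-gpu second, falling back to NumPy when neither is installed.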
# same call, either engine
import gpufft as fgpu # cuFFT / VkFFT, cross-vendor
# import ferrum_gpu as fgpu # pure-Rust kernels, NVIDIA
X = fgpu.fft.fft_1d_c2c_pow2(x, log_n=12)
Where this fits
A GPU FFT in Python is well served already, as long as the GPU is NVIDIA. CuPy's cupy.fft is a drop-in for numpy.fft over cuFFT, and nvmath-python is NVIDIA's own library and the recommended cuFFT path. Neither leaves NVIDIA in any serious way.
Off NVIDIA, the closest option is pyvkfft, a Python wrapper over VkFFT. It binds VkFFT's CUDA and OpenCL backends, so it needs an array framework underneath (PyCUDA, PyOpenCL, or CuPy) and inherits OpenCL's fading support on macOS. gpufft takes VkFFT's Vulkan backend instead, hands you plain numpy arrays, and ships as one wheel that reaches NVIDIA, AMD, Intel, and Apple.
| library | FFT engine | GPUs | arrays |
|---|---|---|---|
| cupy.fft | cuFFT | NVIDIA (ROCm partial) | CuPy |
| nvmath-python | cuFFT | NVIDIA | numpy / CuPy |
| pyvkfft | VkFFT (CUDA, OpenCL) | NVIDIA + OpenCL devices | PyCUDA / PyOpenCL / CuPy |
| gpufft | cuFFT + VkFFT (Vulkan) | NVIDIA · AMD · Intel · Apple | numpy |
| ferrum-gpu | pure-Rust kernels (cuda-oxide) | NVIDIA (SPIR-V soon) | numpy |
ferrum-gpu sits on a different axis: the language the kernel is written in. The ways to write a GPU kernel without CUDA C today are Numba's @cuda.jit (a Python subset, NVIDIA only), Triton (a tile DSL aimed at ML), or CUDA C wrapped as source strings through CuPy's RawKernel and PyCUDA. The Rust route (cuda-oxide, rust-cuda, and KAIO) gives you the compiler; ferrum-gpu assembles it into an FFT you pip install, with typed buffers and a safe-Rust runtime, and the kernel stays readable Rust with no DSL ceiling.
Roadmap
- gpufft: 2D/3D transforms, real-to-complex and complex-to-real, f64, and Vulkan device selection.
- ferrum-gpu performance: a multi-FFT-per-block, profiler-guided kernel redesign mirroring cuFFT's structure, targeting the cuFFT gate the perf-gate binary already enforces.
- ferrum-gpu reach: spirv-oxide and the Vulkan backend, so the Rust kernels run on AMD, Intel, and Apple Silicon.
- shared: R2C / C2R, non-power-of-2 sizes, and DLPack zero-copy with CuPy, PyTorch, and JAX.
Both are open source under Apache-2.0. Install them, break them, send issues and PRs:
pip install gpufft # cross-vendor, today
pip install ferrum-gpu # pure-Rust kernels
github.com/alejandro-soto-franco/gpufft · github.com/alejandro-soto-franco/ferrum-gpu