An FFT for every GPU: ferrum-gpu and gpufft

June 1, 2026

Reach for a GPU FFT in Python and the path of least resistance is cuFFT, which means CUDA, which means NVIDIA. The math of a Fourier transform is vendor-neutral; the libraries are where the silicon gets pinned down. These two packages, both on PyPI today, take the pin out from opposite ends.

gpufft is the pragmatic end: one Python FFT API over two mature engines, cuFFT on NVIDIA and VkFFT on any Vulkan device, so the same call runs on NVIDIA, AMD, Intel, and Apple Silicon. ferrum-gpu is the long end: the FFT kernels written in regular Rust, compiled straight to the GPU with no CUDA C anywhere, betting on a single kernel language that lowers to every accelerator. They share a Python surface, so code written against one runs on the other.

	`gpufft`	`ferrum-gpu`
kernels	cuFFT + VkFFT (vendor libraries)	pure Rust → PTX (cuda-oxide)
performance	the vendor libraries themselves	within 1.3-3.7× of cuFFT
GPUs today	NVIDIA, AMD, Intel, Apple	NVIDIA
the bet	cross-vendor reach you can use now	one kernel language for every accelerator

gpufft: cross-vendor FFTs today

VkFFT is a single-header GPU FFT that runs on Vulkan, which every modern GPU speaks: AMD, Intel, Apple (through MoltenVK), and NVIDIA. cuFFT is NVIDIA's decade-tuned library. gpufft puts both behind one API and lets you pick the engine with a keyword, with the same numpy array going in and coming out.

pip install gpufft

import numpy as np
import gpufft

x = (np.random.randn(4096) + 1j*np.random.randn(4096)).astype(np.complex64)

X = gpufft.fft.fft_1d_c2c_pow2(x, log_n=12, backend="cuda")    # NVIDIA, via cuFFT
X = gpufft.fft.fft_1d_c2c_pow2(x, log_n=12, backend="vulkan")  # AMD / Intel / Apple / NVIDIA, via VkFFT
Y = gpufft.fft.fft_1d_c2c(x, n=4096)                           # general entry point; N need not be a power of two

One gpufft call, two engines. cuFFT covers NVIDIA; VkFFT covers every Vulkan device, which is to say nearly all of them.
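The power-of-two entry point takes log_n rather than N, so a caller has to decide which of the two functions a given length belongs to. A minimal sketch of that decision follows. The helper name log2_exact is hypothetical and not part of gpufft; the loop only prints which call would apply and does not touch a GPU.

```python
from typing import Optional

def log2_exact(n: int) -> Optional[int]:
    """Return log2(n) when n is a positive power of two, else None."""
    if n > 0 and (n & (n - 1)) == 0:
        return n.bit_length() - 1
    return None

# Hypothetical dispatch: choose the entry point from the transform length.
for n in (4096, 1000):
    log_n = log2_exact(n)
    if log_n is not None:
        print(f"N={n}: power of two, use fft_1d_c2c_pow2(x, log_n={log_n})")
    else:
        print(f"N={n}: not a power of two, use fft_1d_c2c(x, n={n})")
```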

The wheel on PyPI is manylinux_2_28, abi3 (Python 3.10+). It bundles the VkFFT shader pipeline (VkFFT plus SPIR-V tooling) but leaves the heavy vendor libraries external: cuFFT comes from your CUDA 13.x install, the Vulkan loader from your driver. This first release ships 1D complex-to-complex transforms on both backends, power-of-two and arbitrary sizes, batched, with persistent gpufft.cuda.Device and gpufft.vulkan.Device handles. 2D/3D, real transforms, and f64 follow.

ferrum-gpu: the kernels in pure Rust

ferrum-gpu takes the other road. The FFT kernels are written in regular Rust, compiled to NVIDIA PTX by cuda-oxide, and dispatched on the GPU through a thin safe-Rust runtime. No CUDA C, no kernel-DSL, no foreign-language bindings on the device side. You install it with pip and call it over numpy arrays in three lines.

pip install ferrum-gpu

import numpy as np
import ferrum_gpu as fgpu

arr = np.array([1+0j, 2+0j, 3+0j, 4+0j], dtype=np.complex64)
print(fgpu.fft.fft_1d_c2c_pow2(arr, log_n=2))
# [10+0j, -2+2j, -2+0j, -2-2j]   # matches numpy.fft.fft(arr)
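The four-point result is easy to confirm by hand. The sketch below is a direct O(N²) DFT in plain Python, using the forward sign convention exp(-2πi·mk/N) that numpy.fft.fft also uses. It is a reference check only and says nothing about how the ferrum-gpu kernels are written.

```python
import cmath

def dft(x):
    """Direct O(N^2) discrete Fourier transform, forward sign convention."""
    n = len(x)
    return [
        sum(x[m] * cmath.exp(-2j * cmath.pi * m * k / n) for m in range(n))
        for k in range(n)
    ]

reference = dft([1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j])
expected = [10 + 0j, -2 + 2j, -2 + 0j, -2 - 2j]

# Compare with a tolerance: the twiddle factors are not exactly representable.
assert all(abs(a - b) < 1e-9 for a, b in zip(reference, expected))
```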

A #[kernel] function in Rust is compiled by cuda-oxide's rustc backend, which lowers MIR through a Pliron MLIR-style IR to PTX, the virtual instruction set the NVIDIA driver JIT-compiles into native SASS at load time. The host-side runtime owns the CUDA context, allocates device buffers, packs kernel arguments, and dispatches launches. PyO3 + maturin wrap the whole thing into a Python wheel.

Rust source compiled to PTX by cuda-oxide, dispatched through ferrum-gpu's runtime to NVIDIA GPUs.

Performance

v0.2.0 replaces the generic radix-2 fallback at the three power-of-2 sizes that matter with specialised Stockham kernels: radix-4 for N = 256 and N = 1024, and radix-8 for N = 4096. cuFFT remains the target to beat. Per-FFT microseconds on an RTX 5060 Laptop GPU (sm_120, Blackwell) with the graphics clock locked to 1500 MHz, medians over 100 trials after a 10-trial warmup, alternating ferrum-gpu and cuFFT launches per trial so DVFS affects both backends symmetrically:

N	kernel	`ferrum-gpu` (µs)	cuFFT (µs)	ratio
256	radix-4	0.035	0.026	1.32×
1024	radix-4	0.102	0.047	2.13×
4096	radix-8	0.509	0.137	3.69×

The specialised kernels run 2-5× faster than the radix-2 fallback they replace, and land within 1.32× of cuFFT at N = 256 and 3.69× at N = 4096, closing most of the 5-7× gap v0.1.0 carried. The remaining distance is algorithmic. A profiler trace (Nsight Compute) shows cuFFT on Blackwell dispatching a single vector_fft kernel across all three sizes, batching many FFTs per thread block, which is the structure the next kernel redesign has to match. make perf-gate regenerates this table on any host; its --gate flag asserts a 0.9× target (10% faster than cuFFT) and exits non-zero on a miss. That gate is not yet met at any size and stands as the next performance milestone.
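For readers who have not met the Stockham formulation: it is an autosort FFT that ping-pongs between two buffers and writes each pass in an order that leaves the final output in natural order, so no bit-reversal pass is needed. The sketch below is a radix-2 version in plain Python, written for illustration. It is not the radix-4 or radix-8 Rust kernels ferrum-gpu ships, and the function name is an assumption.

```python
import cmath

def stockham_fft(x):
    """Radix-2 Stockham autosort FFT (forward), for power-of-two lengths."""
    n = len(x)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    src = list(x)        # buffer read during the current pass
    dst = [0j] * n       # buffer written during the current pass
    length = n           # size of the sub-transforms handled in this pass
    stride = 1           # number of interleaved sub-transforms
    while length > 1:
        half = length // 2
        for p in range(half):
            w = cmath.exp(-2j * cmath.pi * p / length)   # twiddle factor
            for q in range(stride):
                a = src[q + stride * p]
                b = src[q + stride * (p + half)]
                dst[q + stride * (2 * p)] = a + b
                dst[q + stride * (2 * p + 1)] = (a - b) * w
        src, dst = dst, src          # ping-pong: the output becomes the next input
        length = half
        stride *= 2
    return src

result = stockham_fft([1 + 0j, 2 + 0j, 3 + 0j, 4 + 0j])
expected = [10 + 0j, -2 + 2j, -2 + 0j, -2 - 2j]
assert all(abs(a - b) < 1e-9 for a, b in zip(result, expected))
```

A radix-4 or radix-8 kernel follows the same ping-pong structure but combines four or eight inputs per butterfly, which cuts the number of passes over memory.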

Measurement note. Hardware: RTX 5060 Laptop (sm_120), driver 580.159.03, CUDA Toolkit 13.1, measured 2026-05-29. Timing is event-based: CUDA events bracket each kernel launch, so the host launch path is excluded; cuFFT's plan initialisation is amortised across the loop and not counted.
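The shape of that protocol (warmup, alternating launches, medians, then a ratio gate) can be sketched on the host side. This is an illustration under loud assumptions: it times Python callables with time.perf_counter rather than CUDA events, the names median_us and passes_gate are hypothetical, and the gate direction (ferrum-gpu time divided by cuFFT time must not exceed the target) is a reading of the description above, not the perf-gate binary's code.

```python
import statistics
import time

def median_us(run_a, run_b, trials=100, warmup=10):
    """Time two callables in alternation; return (median_a, median_b) in microseconds."""
    for _ in range(warmup):            # warmup runs are discarded
        run_a()
        run_b()
    times_a, times_b = [], []
    for _ in range(trials):            # alternate so clock drift affects both sides
        t0 = time.perf_counter()
        run_a()
        t1 = time.perf_counter()
        run_b()
        t2 = time.perf_counter()
        times_a.append((t1 - t0) * 1e6)
        times_b.append((t2 - t1) * 1e6)
    return statistics.median(times_a), statistics.median(times_b)

def passes_gate(ratio, target=0.9):
    """Assumed gate: ratio = ferrum-gpu time / cuFFT time must be <= target."""
    return ratio <= target

# Ratios copied from the published table, not recomputed from the rounded timings.
published = {256: 1.32, 1024: 2.13, 4096: 3.69}
gate_results = {n: passes_gate(r) for n, r in published.items()}
print(gate_results)
```

With the published ratios every size fails the 0.9× gate, which matches the statement that the gate is not yet met.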

Why write the kernels in Rust

PyTorch, JAX, and TensorFlow all carry CUDA kernel layers written in CUDA C and coupled to NVIDIA's toolchain. As accelerator vendors diversify (AMD Instinct, Intel Gaudi, custom silicon), the polyglot kernel ecosystem becomes a coordination tax. One Rust source compiled through several lowering pipelines pays it once.

That is the motivation behind the spirv-oxide companion project in development: a Pliron-based rustc backend that emits SPIR-V instead of PTX, sibling to cuda-oxide and sharing the same MIR-to-Pliron front half. When it lands, the same #[kernel] fn will compile to both PTX (NVIDIA, via the driver) and SPIR-V (any Vulkan-compute device). The cross-vendor reach gpufft gets today from VkFFT, ferrum-gpu will get from its own kernels.

One Rust kernel, two lowerings. cuda-oxide ships today; spirv-oxide unlocks AMD, Intel, and Apple Silicon as it matures.

One API, two engines

The two packages expose the same fft_1d_c2c_pow2 signature, so the choice between them is one import line. Start on gpufft for breadth of hardware and library-grade speed today; reach for ferrum-gpu when you want a transparent kernel you can read, edit, and recompile the same afternoon, with no .cu files and no nvcc.

# same call, either engine
import gpufft as fgpu          # cuFFT / VkFFT, cross-vendor
# import ferrum_gpu as fgpu    # pure-Rust kernels, NVIDIA

X = fgpu.fft.fft_1d_c2c_pow2(x, log_n=12)
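Because the two modules are interchangeable at the call site, the import itself can be made a runtime choice. A minimal sketch, assuming only that the module names are gpufft and ferrum_gpu as shown above; the helper first_importable is hypothetical and belongs to neither package.

```python
import importlib

def first_importable(names):
    """Return the first module in `names` that imports cleanly."""
    for name in names:
        try:
            return importlib.import_module(name)
        except ImportError:
            continue
    raise ImportError(f"none of {list(names)} could be imported")

# Prefer the cross-vendor package, fall back to the pure-Rust one.
try:
    fgpu = first_importable(["gpufft", "ferrum_gpu"])
    print("using", fgpu.__name__)
except ImportError as exc:
    print("no GPU FFT engine installed:", exc)
```

The preference order is a design choice: listing gpufft first favours hardware reach, while listing ferrum_gpu first favours the readable kernels.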

Where this fits

A GPU FFT in Python is well served already, as long as the GPU is NVIDIA. CuPy's cupy.fft is a drop-in for numpy.fft over cuFFT, and nvmath-python is NVIDIA's own library and the recommended path to cuFFT. Neither leaves NVIDIA in any serious way.

Off NVIDIA, the closest option is pyvkfft, a Python wrapper over VkFFT. It binds VkFFT's CUDA and OpenCL backends, so it needs an array framework underneath (PyCUDA, PyOpenCL, or CuPy) and inherits OpenCL's fading support on macOS. gpufft takes VkFFT's Vulkan backend instead, hands you plain numpy arrays, and ships as one wheel that reaches NVIDIA, AMD, Intel, and Apple.

library	FFT engine	GPUs	arrays
`cupy.fft`	cuFFT	NVIDIA (ROCm partial)	CuPy
`nvmath-python`	cuFFT	NVIDIA	numpy / CuPy
`pyvkfft`	VkFFT (CUDA, OpenCL)	NVIDIA + OpenCL devices	PyCUDA / PyOpenCL / CuPy
`gpufft`	cuFFT + VkFFT (Vulkan)	NVIDIA · AMD · Intel · Apple	numpy
`ferrum-gpu`	pure-Rust kernels (cuda-oxide)	NVIDIA (SPIR-V soon)	numpy

ferrum-gpu sits on a different axis: the language the kernel is written in. The ways to write a GPU kernel without CUDA C today are Numba's @cuda.jit (a Python subset, NVIDIA only), Triton (a tile DSL aimed at ML), or CUDA C wrapped as source strings through CuPy's RawKernel and PyCUDA. The Rust route (cuda-oxide, rust-cuda, and KAIO) gives you the compiler; ferrum-gpu assembles it into an FFT you pip install, with typed buffers and a safe-Rust runtime, and the kernel stays readable Rust with no DSL ceiling.

Roadmap

gpufft: 2D/3D transforms, real-to-complex and complex-to-real, f64, and Vulkan device selection.
ferrum-gpu performance: a multi-FFT-per-block, profiler-guided kernel redesign mirroring cuFFT's structure, targeting the cuFFT gate the perf-gate binary already enforces.
ferrum-gpu reach: spirv-oxide and the Vulkan backend, so the Rust kernels run on AMD, Intel, and Apple Silicon.
shared: R2C / C2R, non-power-of-2 sizes, and DLPack zero-copy with CuPy, PyTorch, and JAX.

Both are open source under Apache-2.0. Install them, break them, send issues and PRs:

pip install gpufft       # cross-vendor, today
pip install ferrum-gpu   # pure-Rust kernels

github.com/alejandro-soto-franco/gpufft · github.com/alejandro-soto-franco/ferrum-gpu