Authors:
We present novel algorithms for computing Fourier transforms with high
performance on GPUs. We present hierarchical, mixed radix FFT
algorithms for both power-of-two and non-power-of-two sizes. Our
hierarchical FFT algorithms efficiently exploit shared memory on GPUs
using a Stockham formulation. We reduce the memory transpose overheads
in hierarchical algorithms by combining the transposes into a
block-based multi-FFT algorithm. For non-power-of-two sizes, we use a
combination of mixed radix FFTs of small primes and Bluestein's
algorithm. We use modular arithmetic in Bluestein's algorithm to
improve the accuracy. We implemented our algorithms using the NVIDIA
CUDA API and compared their performance with NVIDIA's CUFFT library and
an optimized CPU-implementation (Intel's MKL) on a high-end quad-core
CPU. On an NVIDIA GPU, we obtained up to 250 GFlops, a 3x performance
improvement over CUFFT, and 5-25x improvement over MKL on large sizes.
The full paper can be found in the
IEEE Computer Society
archive
and
ACM Digital Library
|