Most GPU performance hype has focused on tightly coupled applications
with small memory bandwidth requirements, e.g., N-body. GPUs, however,
are also commodity vector machines with substantial memory bandwidth,
yet effective programming methodologies for exploiting that bandwidth
have been poorly studied. Our new 3-D FFT kernel, written in NVIDIA CUDA,
achieves nearly 80 GFLOPS on a top-end GPU, more than three times
faster than any existing FFT implementation on GPUs, including CUFFT.
Careful programming techniques are employed to fully exploit modern GPU
hardware characteristics while overcoming their limitations. These
include using on-chip shared memory, optimizing the number of threads
and registers through appropriate localization, and avoiding low-speed
strided memory accesses. Our kernel, applied to real applications,
achieves orders-of-magnitude improvements in power and cost versus
performance metrics. Off-card bandwidth is still a limitation. It
could be alleviated somewhat by confining application kernels within
the card, while the ideal solution would be the provision of faster GPU
interfaces.
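
As an illustration of two of the techniques named above (staging data in
on-chip shared memory and avoiding strided accesses to global memory),
the following is a minimal sketch of a tiled matrix transpose. It is not
the FFT kernel described in the paper. The names transposeTiled,
runTransposeCheck, and TILE, the tile width of 16, and the test matrix
size are all assumptions chosen for the example.

```cuda
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

constexpr int TILE = 16;  // hypothetical tile width (256 threads per block)

// Transposes a rows x cols row-major matrix into a cols x rows one.
// Global reads and writes are both contiguous along threadIdx.x; the
// strided step is done in on-chip shared memory instead.
__global__ void transposeTiled(const float* in, float* out, int rows, int cols)
{
    // One extra column of padding avoids shared-memory bank conflicts
    // on the column-wise read below.
    __shared__ float tile[TILE][TILE + 1];

    int x = blockIdx.x * TILE + threadIdx.x;  // column in `in`
    int y = blockIdx.y * TILE + threadIdx.y;  // row in `in`
    if (x < cols && y < rows)
        tile[threadIdx.y][threadIdx.x] = in[y * cols + x];

    __syncthreads();  // the whole tile must be loaded before it is read

    x = blockIdx.y * TILE + threadIdx.x;  // column in `out` (row in `in`)
    y = blockIdx.x * TILE + threadIdx.y;  // row in `out` (column in `in`)
    if (x < rows && y < cols)
        out[y * rows + x] = tile[threadIdx.x][threadIdx.y];
}

// Host-side check: returns the number of mismatching elements,
// or -1 if a CUDA call failed.
int runTransposeCheck(int rows, int cols)
{
    std::vector<float> hIn(rows * cols), hOut(rows * cols, 0.0f);
    for (int i = 0; i < rows * cols; ++i) hIn[i] = static_cast<float>(i);

    const size_t bytes = hIn.size() * sizeof(float);
    float *dIn = nullptr, *dOut = nullptr;
    if (cudaMalloc(&dIn, bytes) != cudaSuccess) return -1;
    if (cudaMalloc(&dOut, bytes) != cudaSuccess) { cudaFree(dIn); return -1; }
    cudaMemcpy(dIn, hIn.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(TILE, TILE);
    dim3 grid((cols + TILE - 1) / TILE, (rows + TILE - 1) / TILE);
    transposeTiled<<<grid, block>>>(dIn, dOut, rows, cols);

    // This copy also waits for the kernel to finish.
    cudaError_t err = cudaMemcpy(hOut.data(), dOut, bytes, cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    if (err != cudaSuccess) return -1;

    int mismatches = 0;
    for (int r = 0; r < rows; ++r)
        for (int c = 0; c < cols; ++c)
            if (hOut[c * rows + r] != hIn[r * cols + c]) ++mismatches;
    return mismatches;
}

int main()
{
    // Sizes deliberately not multiples of TILE to exercise the bounds checks.
    int mismatches = runTransposeCheck(100, 70);
    printf("mismatches: %d\n", mismatches);
    return mismatches != 0;
}
```

A naive transpose would have to either read or write global memory with
a stride of one full row per thread. Here consecutive threads always
touch consecutive global addresses, and the reordering is absorbed by
shared memory, which is the general idea behind the access-pattern
optimizations the abstract mentions.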
The full paper can be found in the IEEE Computer Society archive and
the ACM Digital Library.