As high-performance SIMD designs and an attractive platform for parallel applications, modern Nvidia CUDA graphics processing units (GPUs) are now used to accelerate the non-equispaced fast Fourier transform (NFFT) algorithm in MRI medical imaging. By investigating the specific convolution conditions of the NFFT and its inverse, NFFTH, the GPU multithreaded implementation can be improved efficiently through optimization of thread scheduling, data structures and memory access patterns. If shared memory is used for repetitive memory access, the image can be divided into square blocks along the horizontal and vertical directions for the NFFT convolution, and into sector blocks along the counter-clockwise direction for the NFFTH convolution; the shared memory size determines the maximum block size. Experimental results show that a 68X speedup of the NFFTH convolution and a 4X speedup of the NFFT convolution can be achieved for a 128×128 brain phantom image compared with the same kernel running on an Intel CPU.
Figure 8. The schematic block assignment for the NFFT convolution algorithm
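The square-block assignment for the NFFT convolution, and the way the shared memory size bounds the block size, can be illustrated with a small host-side calculation. This is a minimal sketch and not the GPU kernel itself: the function names, the 16 KB shared memory budget, the 8 bytes per complex sample and the halo width of 2 samples are all assumptions made for illustration and are not taken from the experiments.

```python
import math


def max_square_block_side(shared_bytes, bytes_per_sample=8, halo=0):
    """Largest block side B such that a (B + 2*halo)^2 tile fits in shared memory.

    All parameters are hypothetical: bytes_per_sample=8 assumes one
    single-precision complex value per pixel, and halo is an assumed margin
    of neighbouring samples needed by the convolution window.
    """
    # Number of samples that fit, then the side of the largest square tile.
    max_tile_side = math.isqrt(shared_bytes // bytes_per_sample)
    return max(max_tile_side - 2 * halo, 0)


def square_blocks(image_side, block_side):
    """Split a square image into square blocks, scanning rows then columns.

    Returns a list of (row0, col0, rows, cols); edge blocks are clipped
    when block_side does not divide image_side.
    """
    blocks = []
    for r in range(0, image_side, block_side):
        for c in range(0, image_side, block_side):
            rows = min(block_side, image_side - r)
            cols = min(block_side, image_side - c)
            blocks.append((r, c, rows, cols))
    return blocks


# Assumed budget: 16 KB of shared memory per thread block.
limit = max_square_block_side(16 * 1024, bytes_per_sample=8, halo=2)
# Pick the largest power of two not exceeding the limit as the block side.
side = 1 << (limit.bit_length() - 1)
blocks = square_blocks(128, side)
```

Under these assumed values the limit is 41 samples per side, so a block side of 32 is chosen and the 128×128 image is covered by a 4×4 grid of blocks. A larger shared memory budget would permit larger blocks and therefore fewer of them.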