CUDA Based Tissue Doppler Imaging

Zhengjuan Fan, Li Wang, Chaowei Tan, Dong C. Liu

Introduction

Tissue Doppler imaging (TDI) is a routinely used diagnostic tool for assessing myocardial function, including left ventricular (LV) systolic and diastolic function, LV filling pressures, and right ventricular function. The signal processing required for TDI is computationally intensive; it includes modified autocorrelation, scan conversion, and image mapping.

The main purpose of this paper is to increase the frame rate of tissue Doppler imaging by designing and implementing parallel versions of its computationally demanding signal-processing algorithms on a general-purpose graphics processing unit (GPU) computing platform.

Overview of Methods

Data acquisition

Signal data acquisition in TDI is essentially the same as in conventional color flow imaging (CFI), except for the clinical target: the CFI signal is acquired from blood vessels, whereas the TDI signal is acquired from the heart. The signal data can be viewed as an Nl × Ns × Ne 3-D matrix, where Nl is the number of scan lines in the image, Ns is the number of sample points along each axial line, and Ne is the ensemble size.

TDI Algorithms based on CPU

Blood has high velocity but low density, resulting in low-intensity reflected signals. Tissue has high density, resulting in high-intensity signals, but low velocity. Thus, a low-pass filter is used in TDI. Following the method of Asbjørn Støylen, clutter filtering in TDI is optional; we omit it in this paper. The flow chart of tissue Doppler imaging is shown in Fig.1.

  • Modified Auto-Correlation

According to Liu, D.C., the modified autocorrelation method performs best, especially for short and highly noisy signals. Since the signal in TDI is short and highly noisy, the modified autocorrelation method is used as the myocardial motion velocity estimator. The autocorrelation function of the modified autocorrelation method is written as,

R(n) is estimated by averaging in both the spatial direction (K is the index of the spatial direction) and the temporal direction (M is the index of the temporal direction). S denotes the complex signal: S(n, m) = I(n, m) + jQ(n, m). D(n) and N(n) denote the autocorrelation with averaging in the temporal direction. D(k) and N(k) denote the autocorrelation with averaging in the temporal direction.
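As a rough illustration of autocorrelation-based estimation with temporal and spatial averaging, the sketch below uses the generic lag-one autocorrelation estimator rather than the modified formula referred to above. The function names, the synthetic test signal, and the choice of three neighbouring samples are assumptions made only for this example.

```python
import cmath


def lag_one_autocorrelation(ensemble):
    """Sum S(m+1) * conj(S(m)) over one ensemble (temporal direction)."""
    return sum(ensemble[m + 1] * ensemble[m].conjugate()
               for m in range(len(ensemble) - 1))


def mean_phase_shift(ensembles):
    """Add the lag-one autocorrelation of neighbouring samples (spatial
    direction) and return the phase of the total, in radians."""
    total = sum(lag_one_autocorrelation(e) for e in ensembles)
    return cmath.phase(total)


def make_ensemble(phase_step, length=6, amplitude=1.0):
    """Hypothetical test signal: constant phase advance from pulse to pulse."""
    return [cmath.rect(amplitude, phase_step * m) for m in range(length)]


# Three neighbouring depth samples, each observed with an ensemble of 6 pulses.
neighbours = [make_ensemble(0.3, amplitude=a) for a in (1.0, 0.8, 1.2)]
estimated_shift = mean_phase_shift(neighbours)
```

The estimated phase shift per pulse is proportional to the axial velocity. Converting it to a velocity requires the pulse repetition frequency, the centre frequency, and the speed of sound, none of which are stated in this paper.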

  • Envelope Detection and Dynamic Compression

The envelope detection method is essentially the same as in conventional B mode. The dynamic compression function maps high-bit data to a low-bit representation to bring weak signals to a visible grey level for display. In this experiment, we set a 60 dB dynamic range and compress 15-bit data to an 11-bit representation so that both weak and strong echoes can be visualized.
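The following sketch shows one common way to realise these two steps: the envelope as the magnitude of I + jQ, followed by a logarithmic compression curve. The bit widths and the 60 dB range come from the text; the logarithmic curve itself and the function names are assumptions, since the exact compression function is not given here.

```python
import math

IN_BITS = 15               # input width stated in the text
OUT_BITS = 11              # output width stated in the text
DYNAMIC_RANGE_DB = 60.0    # dynamic range stated in the text


def envelope(i, q):
    """Envelope of one complex sample: magnitude of I + jQ."""
    return math.hypot(i, q)


def compress(env):
    """Map a 15-bit envelope value to 11 bits with a log curve (assumed)."""
    in_max = (1 << IN_BITS) - 1
    out_max = (1 << OUT_BITS) - 1
    if env <= 0:
        return 0
    db = 20.0 * math.log10(env / in_max)               # 0 dB at full scale
    scaled = (db + DYNAMIC_RANGE_DB) / DYNAMIC_RANGE_DB
    clipped = min(max(scaled, 0.0), 1.0)               # clamp to [0, 1]
    return int(round(clipped * out_max))


full_scale = compress(32767)             # strongest representable echo
floor_level = compress(32767 / 1000.0)   # 60 dB below full scale
mid_level = compress(32767 / 31.6227766) # about 30 dB below full scale
```

With this curve, echoes more than 60 dB below full scale are clipped to zero and the remaining range is spread evenly, in decibels, over the 11-bit output.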

  • Scan Conversion

Scan conversion is needed to transform ultrasound data from polar coordinates into Cartesian coordinates, as shown in Fig.2.

  • Tissue/Flow Detection

In a tissue Doppler image, the tissue motion velocity is color coded for visualization. Since tissue and blood flow have different echo intensity ranges, a threshold method can be used to distinguish tissue from blood. The threshold is determined from a large number of actual tests.
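A minimal sketch of the threshold test, assuming that a pixel counts as tissue when its echo intensity is at or above the threshold. The threshold value and the sample frame below are hypothetical; the paper determines its threshold empirically and does not report it.

```python
def is_tissue(echo_intensity, threshold):
    """Tissue echoes are strong; blood echoes are weak."""
    return echo_intensity >= threshold


def tissue_mask(intensities, threshold):
    """Per-pixel mask: True where the velocity should be color coded."""
    return [[is_tissue(v, threshold) for v in row] for row in intensities]


# Hypothetical 11-bit intensities and a hypothetical threshold of 600.
frame = [[100, 900, 1500],
         [650, 300, 2047]]
mask = tissue_mask(frame, threshold=600)
```

Each pixel is decided independently of its neighbours, which is why this step maps naturally onto one GPU thread per pixel.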

TDI Algorithms based on GPU

The flowchart of the tissue Doppler imaging algorithms implemented on the GPU-based CUDA platform is shown in Fig.3. The first and last blocks are memory copy procedures that transfer data between CPU memory and GPU memory. These are additional time-consuming steps that a pure CPU-based implementation does not need, but they are indispensable for a GPU-based implementation. Although they take extra time, the GPU platform still outperforms the CPU platform, as shown in Table III.

  • Memory copy between CPU and GPU

In the memory copy to GPU block, we transfer the I/Q signal data and the color map look-up table data into GPU memory. To improve later processing performance on the GPU, we pre-process the I/Q signals by storing the I and Q data separately, each in an (Ns × Ne) × Nl 2-D array. We also place the color map parameters in CUDA-specific memory to increase computational efficiency.
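The pre-processing step can be pictured as splitting the acquired samples into two planar arrays. The sketch below assumes that the input arrives as interleaved I, Q pairs ordered by scan line, and that each output row holds the Ns × Ne values of one scan line; neither the input format nor the exact row and column order is specified in the paper, so both are illustrative assumptions.

```python
def split_iq(interleaved, n_lines, n_samples, n_ensemble):
    """Split interleaved [I, Q, I, Q, ...] data into separate I and Q planes.

    Each plane has n_lines rows of n_samples * n_ensemble values (assumed layout).
    """
    row_len = n_samples * n_ensemble
    i_plane = [[0] * row_len for _ in range(n_lines)]
    q_plane = [[0] * row_len for _ in range(n_lines)]
    for line in range(n_lines):
        for k in range(row_len):
            src = 2 * (line * row_len + k)      # position of the I value
            i_plane[line][k] = interleaved[src]
            q_plane[line][k] = interleaved[src + 1]
    return i_plane, q_plane


# Tiny hypothetical acquisition: 2 lines, 2 samples, ensemble size 3.
raw = list(range(2 * 2 * 3 * 2))                # 0, 1, 2, ... as stand-in data
i_data, q_data = split_iq(raw, n_lines=2, n_samples=2, n_ensemble=3)
```

Keeping I and Q in separate contiguous arrays means that neighbouring threads read neighbouring addresses, which is the usual motivation for this kind of layout on a GPU.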

  • Kernel 1

The autocorrelation in the temporal direction and envelope detection are implemented in kernel 1. These two blocks are merged to reduce data transfers to and from global memory, and we use shared memory to increase bandwidth and throughput. The shared memory space is only 16KB per multiprocessor. Moreover, there are hardware limits on the number of threads per block, and neither dimension of a grid of blocks may exceed 65,535. Given these constraints, our solution is to launch an appropriate number of threads and blocks once and use a while loop inside the kernel.
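The "launch once, loop inside the kernel" idea can be emulated on the CPU to show how a fixed number of threads covers an arbitrarily large array. This is a sequential simulation, not GPU code; the launch size of 16 blocks of 64 threads and the squaring operation are placeholders chosen for the example, while the array size matches the 52 × 512 × 6 data set used in the experiments.

```python
def simulated_kernel(data, out, block_idx, block_dim, thread_idx, grid_dim):
    """Work of one simulated thread: a while loop with a grid-sized stride."""
    i = block_idx * block_dim + thread_idx      # global thread index
    stride = grid_dim * block_dim               # total number of threads
    while i < len(data):
        out[i] = data[i] * data[i]              # placeholder per-element work
        i += stride


def launch(data, grid_dim, block_dim):
    """Run every simulated thread once, one after another."""
    out = [None] * len(data)
    for block_idx in range(grid_dim):
        for thread_idx in range(block_dim):
            simulated_kernel(data, out, block_idx, block_dim,
                             thread_idx, grid_dim)
    return out


# 52 * 512 * 6 elements handled by 16 blocks of 64 threads (hypothetical).
samples = list(range(52 * 512 * 6))
squares = launch(samples, grid_dim=16, block_dim=64)
```

Because each thread advances by the total thread count, every element is visited exactly once regardless of the data size, so the grid and block dimensions can be chosen to suit the hardware limits rather than the input.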

  • Kernel 2

Because signal acquisition and the preceding data processing operate line by line, the output of Kernel 1 must be transposed before scan conversion. We use the Matrix Transpose sample provided in the NVIDIA GPU Computing SDK.

In the parallel scan conversion implementation, the coordinate transformation requires an inverse tangent to determine the scanning angle and a square root to determine the pixel position along the radius. We propose a thread structure with a 2-D grid and 2-D thread blocks, which avoids low-throughput arithmetic instructions for addressing, such as integer division and modulo operations, and we use fast-path math functions for high instruction throughput.
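The coordinate transformation described above can be sketched for a sector geometry as follows. Nearest-neighbour sampling, the apex position at the top centre of the image, and all parameter names are assumptions made for illustration; the paper does not state its interpolation scheme or geometry. The two nested loops stand in for the 2-D thread structure, with one (row, col) pair per thread.

```python
import math


def scan_convert(polar, r_min, r_max, half_angle, width, height):
    """Nearest-neighbour conversion of polar[line][sample] to a Cartesian image.

    The sector apex is at the top centre and depth grows downwards.
    Pixels outside the sector are left as None.
    """
    n_lines, n_samples = len(polar), len(polar[0])
    x_extent = r_max * math.sin(half_angle)
    image = [[None] * width for _ in range(height)]
    for row in range(height):
        z = r_max * (row + 0.5) / height                        # depth
        for col in range(width):
            x = x_extent * (2.0 * (col + 0.5) / width - 1.0)    # lateral
            r = math.sqrt(x * x + z * z)     # position along the radius
            theta = math.atan2(x, z)         # angle from the centre line
            if not (r_min <= r < r_max and -half_angle <= theta < half_angle):
                continue
            line = int((theta + half_angle) / (2 * half_angle) * n_lines)
            sample = int((r - r_min) / (r_max - r_min) * n_samples)
            image[row][col] = polar[line][sample]
    return image


# Hypothetical sector: 4 lines x 8 samples; each cell stores its line index.
polar = [[line] * 8 for line in range(4)]
img = scan_convert(polar, r_min=0.0, r_max=1.0, half_angle=math.pi / 4,
                   width=16, height=16)
```

Indexing each pixel directly by its row and column is what lets a 2-D grid of 2-D blocks avoid the integer division and modulo operations that a flat 1-D index would need.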

  • Kernel 3

Because the color map has 256 levels, the tissue motion velocity must be normalized to the range 0 to 255 before the image mapping block. Kernel 3 implements the autocorrelation in the spatial direction and the normalization of the tissue motion velocity. We again use shared memory to speed up the computation. Because the signal size is reduced to Ns × Nl, a larger thread structure can be used, which increases GPU occupancy.
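One simple way to perform this normalization is a linear mapping, sketched below. It assumes that the velocity estimate is available as a phase shift in the range -π to π; the paper does not give its normalization formula, so both the input range and the linear form are assumptions.

```python
import math


def normalize_velocity(phase):
    """Map a phase shift in [-pi, pi] linearly to a color index in [0, 255]."""
    scaled = (phase + math.pi) / (2.0 * math.pi) * 255.0
    return min(255, max(0, int(round(scaled))))     # clamp to valid indices


indices = [normalize_velocity(p) for p in (-math.pi, 0.0, math.pi)]
```

Under this mapping, zero velocity falls in the middle of the color map, with motion in one direction indexing the lower half and motion in the other direction indexing the upper half.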

  • Kernel 4

Because it is difficult to combine with the other processing blocks, the dynamic compression algorithm for the B-mode image is implemented separately as the fourth kernel.

  • Kernel 5

Tissue/flow detection and image mapping are implemented in kernel 5. We use a large amount of shared memory and a large thread structure to determine in parallel whether each pixel is a valid tissue pixel.

Constant memory provides only one-dimensional access, whereas texture memory provides two-dimensional access. The color map look-up table has 12KB of data (256 × 3 × 8: 256 levels, 3 color channels for R, G, and B, and 8 bits per value). We therefore allocate the color map look-up table in constant memory and in texture memory in turn and compare the performance of the two kinds of memory.
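With one-dimensional access, a 256-level RGB table has to be addressed through a computed offset. The sketch below shows that addressing on a flat list; the blue-to-red ramp is a hypothetical color map used only to have concrete values, and the function names are illustrative.

```python
LEVELS = 256
CHANNELS = 3    # R, G, B


def build_flat_lut():
    """Hypothetical blue-to-red ramp stored as one flat 1-D list."""
    lut = [0] * (LEVELS * CHANNELS)
    for level in range(LEVELS):
        lut[level * CHANNELS + 0] = level           # R rises with the level
        lut[level * CHANNELS + 1] = 0               # G unused in this ramp
        lut[level * CHANNELS + 2] = 255 - level     # B falls with the level
    return lut


def lookup(lut, level):
    """Return the (R, G, B) triple for one normalized velocity level."""
    base = level * CHANNELS
    return tuple(lut[base:base + CHANNELS])


flat_lut = build_flat_lut()
color_low = lookup(flat_lut, 0)
color_high = lookup(flat_lut, 255)
```

A two-dimensional layout would instead be addressed by (level, channel) directly, which is the difference in access pattern between the two memory types compared here.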

Experiment and Results

The software environment was Windows XP and NVIDIA CUDA v2.3. The hardware environment was an Intel(R) Core(TM) i5 CPU and an NVIDIA GeForce 250 GT with 16 multiprocessors. The CPU-based results were obtained with an implementation in standard C.

The experimental signal data were acquired from a healthy heart with the Saset iMago digital ultrasound scanner: 52 scan lines (Nl), 512 samples (Ns) along each scan line, and an ensemble size (Ne) of 6. The main data parameters are listed in Table I.

Table II compares the performance of storing the color map look-up table in texture memory and in constant memory. The execution time of kernel 5 (image mapping is a look-up in the color map table based on tissue motion velocity) is almost the same in both cases, but the bind time for transferring the color map data from the CPU to the specific GPU memory differs considerably: constant memory takes less time. The subsequent tests therefore use constant memory to store the color map data.

The execution times of each kernel and of the whole CPU-based and GPU-based procedures are compared in Table III. Overall, the GPU implementation is about 106 times faster than the CPU implementation, and the frame rate increases to 467 fps.

The final outputs based on CPU and GPU are shown in Fig.4 (a) and (b).

Conclusion

    1. On the GPU platform, the computational efficiency of TDI is markedly increased.

    2. The tissue Doppler image produced by the GPU implementation is the same as that produced by the CPU implementation.

    3. Other myocardial imaging algorithms, such as velocity gradient imaging and strain rate imaging, can also be implemented on the GPU platform.