GPU - Final Exam - Matrix Transpose

Introduction

In this project we focus on different ways of optimizing matrix transpose. We study the code discussed in "http://developer.download.nvidia.com/compute/cuda/3_0/sdk/website/CUDA/website/C/src/transposeNew/doc/MatrixTranspose.pdf" and analyse the various areas to consider when optimizing CUDA code.

That paper restricts its discussion to square matrices for brevity and simplicity. In this exam we have been asked to relax that requirement, and also to experiment with the number of threads and their orientation within a block.

The code is run on two machines with different compute capabilities. We later compare their performance with the help of graphs.

Problem

In linear algebra, the transpose of a matrix A is another matrix A^T (also written A′, A^tr or ^tA) created by any one of the following equivalent actions:

- write the rows of A as the columns of A^T

- write the columns of A as the rows of A^T

- reflect A about its main diagonal (which starts from the top left) to obtain A^T

Formally, the (i,j) element of A^T is the (j,i) element of A:

[A^{\mathrm{T}}]_{ij} = [A]_{ji}

For example,

\begin{bmatrix} 1 & 2 \\ 3 & 4 \end{bmatrix}^{\mathrm{T}} = \begin{bmatrix} 1 & 3 \\ 2 & 4 \end{bmatrix}.

Matrix transposition and matrix multiplication are the classic examples for performance analysis. Fortunately, in this exam we are looking at the easier of the two, the matrix transposition problem. We analyse various solutions that eliminate performance bottlenecks. Throughout, the input and output matrices occupy separate memory locations, i.e. the transpose is performed out of place.
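To make the setting concrete, a minimal sketch of such an out-of-place transpose kernel might look as follows (the kernel name and the one-element-per-thread layout are illustrative assumptions, not code from this project):

__global__ void transposeSimple(float *odata, const float *idata,
                                int width, int height)
{
    // Column and row of the element this thread handles in idata.
    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    if (x < width && y < height)
        odata[x * height + y] = idata[y * width + x];  // write to the transposed position
}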

Solution

In this implementation there are four parameters whose values change the results:

a. TILE_DIM

b. BLOCK_ROWS

c. size_x

d. size_y

TILE_DIM - the dimension of the square tile processed by each thread block. Execution depends entirely on this parameter, since the matrix dimensions must be integral multiples of it.

BLOCK_ROWS - it is beneficial to set this parameter to a smaller factor of TILE_DIM, so that each thread processes several elements of its tile. The maximum value it can take is the chosen TILE_DIM.

size_x - the x dimension of the matrix, which must be an integral multiple of TILE_DIM

size_y - the y dimension of the matrix, which must be an integral multiple of TILE_DIM

All the results discussed here were obtained by changing two or more of these parameters. A sketch of how they enter the launch configuration is given below.
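As a rough sketch of how these parameters fit together (following the layout of the NVIDIA sample; the values and the kernel name "transpose" shown here are only examples):

#define TILE_DIM   32   // each block transposes a TILE_DIM x TILE_DIM tile
#define BLOCK_ROWS 16   // threads per block: TILE_DIM x BLOCK_ROWS

// size_x and size_y must be integral multiples of TILE_DIM
dim3 grid(size_x / TILE_DIM, size_y / TILE_DIM);
dim3 threads(TILE_DIM, BLOCK_ROWS);

// Each thread therefore handles TILE_DIM / BLOCK_ROWS elements of its tile.
transpose<<<grid, threads>>>(d_odata, d_idata, size_x, size_y);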

We start by studying how looping over the kernel differs from looping within the kernel.

In the loop-over-kernel version, the loop is executed NUM_REPS times outside the kernel. Because the kernel is launched from inside the for loop, all blocks of one launch must complete execution before any block of the following launch can begin, so the launch boundary acts as a synchronization mechanism. The set of active blocks, and hence the memory access pattern, is reset on every loop iteration.

In the loop-within-kernel version, by contrast, the set of active thread blocks spreads across the matrix as execution progresses. Comparing the two gives a way to compare the data movement behaviour of the different kernels.
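A minimal sketch of the two timing strategies, assuming the launch configuration above (the kernel names and the nreps argument are assumptions based on the NVIDIA sample, not code from this project):

// Loop over kernel: NUM_REPS separate launches. All blocks of one launch
// must finish before any block of the next launch starts.
for (int r = 0; r < NUM_REPS; r++)
    transpose<<<grid, threads>>>(d_odata, d_idata, size_x, size_y);
cudaThreadSynchronize();

// Loop within kernel: a single launch; the repetition sits inside the kernel.
__global__ void transposeNaiveLoop(float *odata, float *idata,
                                   int width, int height, int nreps)
{
    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in  = xIndex + width  * yIndex;
    int index_out = yIndex + height * xIndex;

    for (int r = 0; r < nreps; r++)
        for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
            odata[index_out + i] = idata[index_in + i * width];
}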

  

Kernels of Interest

Coalesced Transpose

The interesting thing in this optimization is that data is read from idata into shared memory, and each half warp then accesses noncontiguous locations in shared memory in order to write contiguous data to odata. Because shared memory accesses do not need to be coalesced, this costs no performance, and the non-contiguous writes to odata are avoided.
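A sketch of a coalesced transpose kernel in this spirit, closely following the kernel in the NVIDIA document (TILE_DIM and BLOCK_ROWS as above):

__global__ void transposeCoalesced(float *odata, float *idata,
                                   int width, int height)
{
    __shared__ float tile[TILE_DIM][TILE_DIM];

    int xIndex = blockIdx.x * TILE_DIM + threadIdx.x;
    int yIndex = blockIdx.y * TILE_DIM + threadIdx.y;
    int index_in = xIndex + yIndex * width;

    // For the write, swap the roles of blockIdx.x and blockIdx.y so that
    // consecutive threads write to consecutive addresses in odata.
    xIndex = blockIdx.y * TILE_DIM + threadIdx.x;
    yIndex = blockIdx.x * TILE_DIM + threadIdx.y;
    int index_out = xIndex + yIndex * height;

    // Coalesced reads from idata into the shared-memory tile.
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        tile[threadIdx.y + i][threadIdx.x] = idata[index_in + i * width];

    __syncthreads();

    // Column-wise reads from the tile (noncontiguous in shared memory),
    // coalesced writes to odata.
    for (int i = 0; i < TILE_DIM; i += BLOCK_ROWS)
        odata[index_out + i * height] = tile[threadIdx.x][threadIdx.y + i];
}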

Avoiding Bank Conflicts

Bank conflicts occur when writing partial columns from the shared-memory tile to rows of odata: all 16 threads of a half warp hit the same shared-memory bank, giving a 16-way bank conflict.

This can be avoided by padding the shared-memory array with one extra column.
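In code, the only change to the kernel above is the tile declaration (one extra, unused column):

__shared__ float tile[TILE_DIM][TILE_DIM + 1];  // padding shifts each row into a different bank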

Diagonal Transpose to avoid Partition Camping

The term partition camping is used to describe the case when global memory accesses are directed through a subset of partitions, causing requests to queue up at some partitions while other partitions go unused. One way to avoid partition camping in both reading from idata and writing to odata is to use a diagonal interpretation of the components of blockIdx. The y component represents different diagonal slices of tiles through the matrix and the x component indicates the distance along each diagonal.

The kernel is otherwise the same as the previous ones; the only difference is that the block indices are first reordered diagonally.
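A sketch of the diagonal reordering at the top of the kernel, in the spirit of the NVIDIA document (blockIdx_x and blockIdx_y then replace blockIdx.x and blockIdx.y in the index calculations shown earlier):

int blockIdx_x, blockIdx_y;

if (width == height) {
    // Square matrices: simple diagonal reordering of the block coordinates.
    blockIdx_y = blockIdx.x;
    blockIdx_x = (blockIdx.x + blockIdx.y) % gridDim.x;
} else {
    // Non-square matrices: flatten the 2D block index first, then reorder.
    int bid = blockIdx.x + gridDim.x * blockIdx.y;
    blockIdx_y = bid % gridDim.y;
    blockIdx_x = ((bid / gridDim.y) + blockIdx_y) % gridDim.x;
}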

Devices Used

Machine 1

Machine 2

There are 2 devices supporting CUDA

A brief introduction to Tesla machines

NVIDIA Tesla is a family of GPU computing solutions aimed at high performance computing (HPC) applications. The family is built around the NVIDIA Tesla GPU Computing Processor, a dedicated computing board. The family includes the NVIDIA Tesla Deskside Supercomputer, a scalable computing system that includes two NVIDIA Tesla GPUs and attaches to a PC or workstation through an industry-standard PCI-Express connection, and the NVIDIA Tesla GPU Computing Server, a 1U server housing up to eight NVIDIA Tesla GPUs.

Results

The text files for all of the results discussed below, and more, are attached at the bottom of this page. If you are interested in the appealing results on the Tesla machine, you should take a look at them.

Bandwidth

This result was obtained by keeping the matrix size constant and varying TILE_DIM and BLOCK_ROWS on Machine 1.

This result was obtained by keeping the matrix size constant and varying TILE_DIM and BLOCK_ROWS on Machine 2.

The following result was obtained by comparing the loop-over-kernel output with TILE_DIM = 32, BLOCK_ROWS = 16, size_x = 1024 and size_y = 256 on Machines 1 and 2.

The following result is interesting in that TILE_DIM and BLOCK_ROWS are both set to 15. Two matrix sizes are used, 600x600 and 1500x1500, again compared on the two machines.

Timing

The following result compares the timing of the loop-over-kernel and loop-within-kernel versions, again on both machines.

TILE_DIM = 32, BLOCK_ROWS = 16, size_x = 1024, size_y = 256

Learnings from the course and this assignment

a. Avoid usage of loops in the kernel

b. Use global memory appropriately, i.e. with coalesced access patterns.

c. Synchronization is important, but __syncthreads() should not be used recklessly, as it may give rise to deadlocks when used inside conditionals (a minimal sketch of this hazard follows the list).

d. Avoid large scale bank conflicts in shared memory. 

e. Make use of as much shared memory as possible.

f. Get used to vector types like float3 and float4 and use them more often for faster loads.

g. Avoid divergent branches within a warp wherever possible.

h. Carefully choose the number of threads per thread block, the amount of shared memory per block, and the number of registers used by the kernel.
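A minimal, hypothetical sketch of the hazard mentioned in point (c): when __syncthreads() sits inside a branch that only some threads of a block take, the remaining threads never reach the barrier and the block can hang.

__global__ void badSync(float *data)
{
    if (threadIdx.x < 16) {
        data[threadIdx.x] *= 2.0f;
        __syncthreads();   // only threads 0..15 reach this barrier: possible deadlock
    }
}

// Safer pattern: keep the barrier outside the conditional so every thread executes it.
__global__ void goodSync(float *data)
{
    if (threadIdx.x < 16)
        data[threadIdx.x] *= 2.0f;
    __syncthreads();       // all threads in the block reach the barrier
}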

Conclusion

Overall, of all the optimizations considered, the diagonal transpose turns out to be the winner, although it is not necessarily the best solution achievable. With an understanding of these optimizations we can apply them to other scenarios and refine them further for better results. This assignment proved helpful in learning memory management in CUDA and data transfers between the device and on-chip memory, and it leaves us with an overall understanding of the GPU architecture and of the ways to manage data on the device and between devices.

Source Download

http://www2.cs.uic.edu/~anaik/TransposeMatrix.zip

Tips for using the downloaded file

After downloading the file, compile it. In the "x64\Debug" folder there is a .bat file that runs the program for multiple inputs. Double-click the file, then navigate into the numbered folders to find the results in .txt format.