Final Exam

Post date: May 3, 2012 11:51:20 PM

1. Introduction

The final project focuses on the implementation and optimization of convolution on large images using OpenCL.

Both the implementation and the optimizations are based on the book Heterogeneous Computing with OpenCL.

2. Implementation

=== in host ===

2.1 set the filter

For the implementation, I first read an input image file and set up the filter manually. For this project, I simply used a 7*7 filter that performs a 45-degree motion blur effect.

// 45 degree motion blur (7*7)
float filter[49] =
    {0,      0,      0,      0,      0,      0.0145, 0,
     0,      0,      0,      0,      0.0376, 0.1283, 0.0145,
     0,      0,      0,      0.0376, 0.1283, 0.0376, 0,
     0,      0,      0.0376, 0.1283, 0.0376, 0,      0,
     0,      0.0376, 0.1283, 0.0376, 0,      0,      0,
     0.0145, 0.1283, 0.0376, 0,      0,      0,      0,
     0,      0.0145, 0,      0,      0,      0,      0};

2.2 set workgroup sizes

To maximize the performance of this kernel, I chose to create exactly as many work-items as there are pixels on which the convolution will be performed (i.e., padding pixels are not included) and set the workgroup size to 16*16.

2.3 round up columns and rows

For an image with dimensions imageWidth and imageHeight, only (imageWidth - 2*filterRadius) * (imageHeight - 2*filterRadius) work-items are needed, so the effective width and height are unlikely to be exact multiples of the workgroup size. Therefore, I created additional work-groups in both the X- and Y-dimensions to round the global size up to a multiple of the workgroup size in each dimension, as sketched below.
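The following is a minimal sketch of this setup; roundUp is the helper referenced again in Section 3.1, and the implementation shown here is my assumption of its behavior:

// Smallest multiple of 'multiple' that is >= 'value' (assumed helper)
int roundUp(int value, int multiple) {
    int remainder = value % multiple;
    return (remainder == 0) ? value : value + multiple - remainder;
}

// 16*16 workgroups; the global size covers only the interior pixels,
// rounded up to a multiple of the workgroup size in each dimension
size_t localSize[2]  = {16, 16};
size_t globalSize[2] = {roundUp(imageWidth  - 2*filterRadius, 16),
                        roundUp(imageHeight - 2*filterRadius, 16)};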

2.4 set up the OpenCL environment and kernel

I executed the kernel 100 times to obtain an average running time, using OpenCL event profiling (this requires the command queue to be created with CL_QUEUE_PROFILING_ENABLE).

        // Execute the kernel
        for (int ii = 0; ii < 100; ii++) {
            ciErrNum = clEnqueueNDRangeKernel(queue, kernel, 2, NULL,
                globalSize, localSize, 0, NULL, &eventKernel);

            // Wait for kernel to complete
            ciErrNum = clWaitForEvents(1, &eventKernel);

            // Read the start/end timestamps (in nanoseconds) from the event
            cl_ulong end, start;
            ciErrNum  = clGetEventProfilingInfo(eventKernel, CL_PROFILING_COMMAND_END,
                sizeof(cl_ulong), &end, 0);
            ciErrNum |= clGetEventProfilingInfo(eventKernel, CL_PROFILING_COMMAND_START,
                sizeof(cl_ulong), &start, 0);

            fprintf(stderr, "time in kernel: %0.2f ms\n", (end-start)*1.0e-6f);
            totalTime += end - start;
        }

        fprintf(stderr, SEPARATOR);
        fprintf(stderr, "Total Time in kernel: %0.2f ms\n", totalTime*1.0e-6f);
        // 1.0e-8f = (ns -> ms conversion) divided by the 100 iterations
        fprintf(stderr, "Average Time in kernel: %0.2f ms\n", totalTime*1.0e-8f);

=== in kernel ===

2.5 cache data to local memory

First, I allocated space in local memory sized to the workgroup plus the padding required by the filter.
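On the host side this is done by passing NULL as the argument value to clSetKernelArg. A sketch, assuming paddingPixels = 2*filterRadius and that localImage is kernel argument 6 (matching the signature shown in Section 3.2):

// Reserve (workgroup + padding) floats of local memory for the image tile
int localWidth  = localSize[0] + paddingPixels;
int localHeight = localSize[1] + paddingPixels;
clSetKernelArg(kernel, 6, localWidth*localHeight*sizeof(float), NULL);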

Then I copied the corresponding region of global memory to local memory, with bounds checks to make sure we stay inside the working area and do not read out of bounds of the image.

        // Determine where this workgroup's output region begins
        int groupStartCol = get_group_id(0)*get_local_size(0);
        int groupStartRow = get_group_id(1)*get_local_size(1);

        // Determine the local ID of each work-item
        int localCol = get_local_id(0);
        int localRow = get_local_id(1);

        // Determine the global ID of each work-item
        int globalCol = groupStartCol + localCol;
        int globalRow = groupStartRow + localRow;

        // Cache the data to local memory
        // Step down rows
        for (int i = localRow; i < localHeight; i += get_local_size(1)) {
            int curRow = groupStartRow + i;
            // Step across columns
            for (int j = localCol; j < localWidth; j += get_local_size(0)) {
                int curCol = groupStartCol + j;
                // Perform the read if it is in bounds
                if (curRow < rows && curCol < cols)
                    localImage[i*localWidth + j] = imageIn[curRow*cols + curCol];
            }
        }

2.6 place a barrier

I placed a barrier at the end of the copy so that the convolution does not start until all of the copying has finished.
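In OpenCL C this is a single call:

barrier(CLK_LOCAL_MEM_FENCE);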

2.7 perform the convolution

This step performs the convolution with the 7*7 filter: each work-item accumulates the weighted sum of its neighborhood from local memory.

        float sum = 0.0f;   // accumulator for this work-item's output pixel
        int filterIdx = 0;
        for (int i = localRow; i < localRow+filterWidth; i++) {
            int offset = i*localWidth;
            for (int j = localCol; j < localCol+filterWidth; j++) {
                sum += localImage[offset+j] * filter[filterIdx++];
            }
        }

2.8 write the data out

This step is straightforward and concludes the kernel:

imageOut[(globalRow + filterRadius) * cols + (globalCol + filterRadius)] = sum;

=== in host ===

2.9 write the output data into an image file
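The readback mirrors the write. A minimal sketch, where d_outputImage, deviceDataSize, and outputImage are my assumed names for the output buffer, its size in bytes, and the host array:

// Blocking read of the result buffer back into host memory
ciErrNum = clEnqueueReadBuffer(queue, d_outputImage, CL_TRUE, 0,
    deviceDataSize, outputImage, 0, NULL, NULL);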

The end.

3. Optimizations

3.1 Aligning for Memory Accesses

Performance on both NVIDIA and AMD GPUs benefits from aligned data accesses in global memory.

I padded the number of columns to make it a multiple of the X-dimension of the workgroup.

int deviceWidth = roundUp(imageWidth, WGX);

Then I used clEnqueueWriteBufferRect() to copy the host data into the padded device buffer.
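The origins and region for the rectangular copy are set up first. A sketch, under the assumption that each of the imageHeight rows copies imageWidth floats into a buffer whose rows are deviceWidth floats wide:

size_t buffer_origin[3] = {0, 0, 0};
size_t host_origin[3]   = {0, 0, 0};
// region = {bytes per row, number of rows, number of slices}
size_t region[3] = {imageWidth*sizeof(float), imageHeight, 1};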

clEnqueueWriteBufferRect(queue, d_inputImage, CL_TRUE,
    buffer_origin, host_origin, region,
    deviceWidth*sizeof(float), 0,   // buffer row pitch (padded) and slice pitch
    imageWidth*sizeof(float), 0,    // host row pitch and slice pitch
    inputImage, 0, NULL, NULL);

Unfortunately, this optimization did not produce any measurable speedup on my hardware (see Section 5.2 for further investigation).

3.2 Vector Reads

AMD GPUs are optimized for 128-bit read operations per SIMD lane, so I tried replacing float reads with float4 (128-bit) reads.

The first step is to allocate extra local memory by rounding localWidth up to a multiple of 4 in the host code.

int localWidth = roundUp(localSize[0]+ paddingPixels, 4);

Then I rewrote the kernel as convolution_read4:

__kernel void convolution_read4(__global float4* imageIn,
                                __global float* imageOut,
                                __constant float* filter,
                                int rows,
                                int cols,
                                int filterWidth,
                                __local float* localImage,
                                int localHeight,
                                int localWidth)
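Inside the kernel, the caching loop then reads four pixels at a time. The following is a minimal sketch of the idea rather than my exact code; it assumes cols is the padded device width (a multiple of 4, per Section 3.1) and that the workgroup start column is a multiple of 4, which holds for 16-wide workgroups:

// Cache four pixels per read using float4
for (int i = localRow; i < localHeight; i += get_local_size(1)) {
    int curRow = groupStartRow + i;
    for (int j = localCol; j < localWidth/4; j += get_local_size(0)) {
        int curCol = groupStartCol/4 + j;
        if (curRow < rows && curCol < cols/4) {
            // One 128-bit read, scattered into the padded local tile
            float4 pix = imageIn[curRow*(cols/4) + curCol];
            vstore4(pix, 0, &localImage[i*localWidth + 4*j]);
        }
    }
}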

As the results in Section 4 show, vector reads gained 6%, 28%-32%, and 45%-48% in efficiency for the 'not unrolled', 'inner unrolled', and 'completely unrolled' versions of the kernel loop, respectively (see Section 3.3).

3.3 Loop Unrolling - when performing the convolution (Section 2.7)

        // Inner loop unrolled (filterWidth == 7)
        for (int i = localRow; i < localRow+filterWidth; i++) {
            int offset = i*localWidth + localCol;
            sum += localImage[offset++] * filter[filterIdx++];
            sum += localImage[offset++] * filter[filterIdx++];
            sum += localImage[offset++] * filter[filterIdx++];
            sum += localImage[offset++] * filter[filterIdx++];
            sum += localImage[offset++] * filter[filterIdx++];
            sum += localImage[offset++] * filter[filterIdx++];
            sum += localImage[offset++] * filter[filterIdx++];
        }

By unrolling the inner loop of the convolution in the kernel, I obtained a 3.38x speedup for both the non-optimized and aligned memory access versions, and a 4.18x speedup for the vector reads version.

        // Completely unrolled (filterWidth == 7): row 1
        int offset = localRow*localWidth + localCol;
        sum += localImage[offset+0] * filter[filterIdx++];
        sum += localImage[offset+1] * filter[filterIdx++];
        sum += localImage[offset+2] * filter[filterIdx++];
        sum += localImage[offset+3] * filter[filterIdx++];
        sum += localImage[offset+4] * filter[filterIdx++];
        sum += localImage[offset+5] * filter[filterIdx++];
        sum += localImage[offset+6] * filter[filterIdx++];

        // Row 2
        offset += localWidth;
        sum += localImage[offset+0] * filter[filterIdx++];
        sum += localImage[offset+1] * filter[filterIdx++];
        sum += localImage[offset+2] * filter[filterIdx++];
        sum += localImage[offset+3] * filter[filterIdx++];
        sum += localImage[offset+4] * filter[filterIdx++];
        sum += localImage[offset+5] * filter[filterIdx++];
        sum += localImage[offset+6] * filter[filterIdx++];

        // ... the same block is repeated for rows 3 through 7

By completely unrolling both the inner and outer loops, I obtained a 4.5x speedup over the non-unrolled vector reads version.

However, for the non-optimized and aligned access versions, the running times were slightly slower than with only the inner loop unrolled.

Complete unrolling is also not very useful when the filter is large, because the code becomes far too long.

4. Results

Platform: Mac OS X 10.7.3, AMD Radeon HD 6770M (1GB)

  DEVICE_NAME = ATI Radeon HD 6770M

  DEVICE_VENDOR = AMD

  DEVICE_VERSION = OpenCL 1.1 

  DRIVER_VERSION = 1.0

  DEVICE_MAX_COMPUTE_UNITS = 6

  DEVICE_MAX_CLOCK_FREQUENCY = 675

  DEVICE_GLOBAL_MEM_SIZE = 536870912

I performed convolution on three grayscale RAW images with resolutions 512*512, 1024*1024 and 2048*2048.

512*512: [output image not preserved]

1024*1024: [output image not preserved]

2048*2048: [output image not preserved]

kernel running time (in ms): [table not preserved]

5. Other comparisons

5.1 different work-group shapes

Other than 16*16, I also tried 32*8 and 8*32 work-groups. The results showed 16*16 < 32*8 < 8*32 in terms of running time, but the differences were very slight.

5.2 sizes other than 2^N

I wondered whether the reason memory-access alignment didn't help was the resolution of the image, so I tried resolutions other than 2^N: 2000*2000 (a multiple of WGX) as well as 1900*1900 and 1919*1919 (not multiples of WGX). But I saw no difference between the not_optimized and read_aligned versions, with or without unrolling in the kernel.

So I guess it is related to the architecture of my graphics card; AMD reported gaining a 10% speedup from aligned reads on a Radeon HD 6970.

6. Conclusions

1) GPU optimizations are heavily dependent on the hardware architecture of the device;

2) Performance is supposed to differ depending on whether the image size is a multiple of the workgroup size, but this needs more evidence;

3) In general, though, unrolling away the convolution loops provides a significant performance improvement.