OpenCL Image Convolution Analysis




Introduction
In this project, I explore several optimizations to image convolution on the GPU using OpenCL.  The goal is to assess the potential gains in performance of optimizations to code that requires intensive, parallel manipulation of data.  

To complete this project I referenced the following resources:

Source Code
Source code for the program, including a makefile for Mac OS X, can be downloaded here.
This program requires the OpenCL libraries, OpenGL, GLUT, and libpng.

Design Platform
The program was written for my 2007 MacBook Pro and runs on its NVIDIA GeForce 8600M GT. It is designed to be compatible with multiple platforms.

How to Control the Program
The program runs from the command line. The folder includes several test images: test256.png, test512.png, test1024.png, and test2048.png. These can be processed with the following commands:
> make
> ./project4 imageName.png imageWidth imageHeight convolutionKernelSize whichKernel

The images included in the program are:
test256.png, dimensions 256x256 pixels
test512.png, dimensions 512x512 pixels
test1024.png, dimensions 1024x1024 pixels
test2048.png, dimensions 2048x2048 pixels
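As an illustration of what the five command-line arguments carry, here is a minimal Python sketch of how they might be validated. The function name `parse_args`, the dictionary keys, and the requirement that the filter size be odd are assumptions for illustration; the actual program may check its arguments differently, and the valid values of whichKernel are not specified here, so it is only converted to an integer.

```python
def parse_args(argv):
    """Parse [imageName, imageWidth, imageHeight, convolutionKernelSize, whichKernel]."""
    if len(argv) != 5:
        raise ValueError("expected 5 arguments, got %d" % len(argv))
    name = argv[0]
    width, height, filter_size, which = (int(a) for a in argv[1:])
    if width <= 0 or height <= 0:
        raise ValueError("image dimensions must be positive")
    # Assumption: the filter is centered on a pixel, so its width must be odd.
    if filter_size <= 0 or filter_size % 2 == 0:
        raise ValueError("convolutionKernelSize must be a positive odd number")
    return {"image": name, "width": width, "height": height,
            "filter_size": filter_size, "kernel": which}


# Mirrors: ./project4 test256.png 256 256 3 1
print(parse_args(["test256.png", "256", "256", "3", "1"]))
```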

Kernels

First Kernel:  Global Memory
The first kernel stores the image data in global memory. Note that an initialization step is performed within the kernel, because I had difficulty copying data from a PNG to an OpenGL texture to a float4 array. In its first pass the kernel initializes the data, and then it performs the convolution.
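For reference, the per-pixel work of a global memory convolution can be modeled on the CPU. The sketch below is not the project's kernel: it assumes a single-channel image stored as a flat row-major list (the real data is float4), treats taps that fall outside the image as zero, and applies the filter without flipping it, which gives the same result for symmetric filters such as the Laplacian. All names are hypothetical.

```python
def convolve_pixel(image, width, height, filt, fsize, x, y):
    """Compute one output pixel, as a single work-item would."""
    r = fsize // 2
    acc = 0.0
    for j in range(fsize):
        for i in range(fsize):
            sx = x + i - r
            sy = y + j - r
            # A flat buffer has no border handling, so bounds are checked by hand.
            # Assumption: out-of-range taps contribute zero.
            if 0 <= sx < width and 0 <= sy < height:
                acc += image[sy * width + sx] * filt[j * fsize + i]
    return acc


def convolve(image, width, height, filt, fsize):
    """Run the per-pixel function over the whole image (one 'work-item' per pixel)."""
    return [convolve_pixel(image, width, height, filt, fsize, x, y)
            for y in range(height) for x in range(width)]
```

A CPU reference like this is also handy for checking GPU output on a small image before timing the larger ones.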



Second Kernel:  Texture Memory
The second kernel stores the image data in texture memory. Texture reads go through a cache optimized for 2D locality and are designed to be faster than global memory accesses, so an increase in performance is expected.
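Besides caching, reading through an image object moves border handling into the sampler. The sampler settings used in this project are not stated, so the following Python sketch only models two of the standard OpenCL addressing modes with unnormalized integer coordinates: clamp-to-edge, which repeats the nearest edge pixel, and clamp, which returns the border color (modeled here as zero). The function name and the single-channel flat list are assumptions for illustration.

```python
def read_pixel(image, width, height, x, y, mode="clamp_to_edge"):
    """Model a sampler-style image read from a flat row-major list."""
    if mode == "clamp_to_edge":
        # Out-of-range coordinates snap to the nearest valid pixel.
        x = min(max(x, 0), width - 1)
        y = min(max(y, 0), height - 1)
    elif mode == "clamp":
        # Out-of-range coordinates return the border color (assumed 0.0 here).
        if not (0 <= x < width and 0 <= y < height):
            return 0.0
    else:
        raise ValueError("unknown addressing mode: %r" % mode)
    return image[y * width + x]


img = [1.0, 2.0,
       3.0, 4.0]
print(read_pixel(img, 2, 2, -1, 0))           # edge pixel repeated
print(read_pixel(img, 2, 2, -1, 0, "clamp"))  # border color
```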


Third Kernel:  Texture Memory plus constant memory
The third kernel stores the image data in texture memory, and stores the convolution filter data, along with the filter width and the image width and height, in constant memory. The constant memory available on the device is small but cached, so reads from it are fast and a performance increase is expected.
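A quick calculation shows how little constant memory this data needs. The sketch assumes 4-byte floats for the filter weights and 4-byte integers for the three dimensions, and compares against 64 KB, the minimum constant buffer size the OpenCL specification requires of a full-profile device; the limit on a given device can be queried with clGetDeviceInfo and CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE. The function name is hypothetical.

```python
def constant_bytes(filter_size, float_bytes=4, int_bytes=4):
    """Bytes for the filter weights plus filter width, image width, image height."""
    return filter_size * filter_size * float_bytes + 3 * int_bytes


MIN_CONSTANT_BUFFER = 64 * 1024  # assumption: the spec minimum, not a measured value

for size in (3, 5, 7):
    used = constant_bytes(size)
    print("%dx%d filter: %d bytes (%.2f%% of 64 KB)"
          % (size, size, used, 100.0 * used / MIN_CONSTANT_BUFFER))
```

Even the 7x7 filter uses only a few hundred bytes, so capacity is not the constraint for the filter sizes tested here.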



Fourth Kernel:  Texture Memory plus constant memory and storage of image data in local memory 
The fourth kernel stores the image data in texture memory, keeps the constant data in constant memory, and adds an intermediate stage in which image data is staged in local memory (called shared memory in CUDA). The goal is to assess whether local memory gives the kernel an additional performance boost.
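To give a feel for the trade-off, the sketch below estimates the local memory a work-group would need to stage its tile plus the surrounding halo, and how many texture fetches per output pixel that staging costs. The 16x16 work-group size is an assumption (the project's actual size is not given), as are the function names; 16 bytes per pixel corresponds to one float4.

```python
def tile_bytes(group_w, group_h, filter_size, bytes_per_pixel=16):
    """Local memory needed for a work-group's tile plus a halo of filter_size // 2."""
    r = filter_size // 2
    return (group_w + 2 * r) * (group_h + 2 * r) * bytes_per_pixel


def fetches_per_output(group_w, group_h, filter_size):
    """Texture fetches per output pixel when each tile pixel is loaded once."""
    r = filter_size // 2
    return (group_w + 2 * r) * (group_h + 2 * r) / (group_w * group_h)


for size in (3, 5, 7):
    print("%dx%d filter: %d bytes of local memory, %.2f fetches per pixel (vs %d unstaged)"
          % (size, size, tile_bytes(16, 16, size),
             fetches_per_output(16, 16, size), size * size))
```

Staging cuts the number of texture fetches, but it adds a barrier and the extra halo loads, and the texture cache already serves much of the reuse.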


I created additional kernels to work with global memory, but could not get these to work.  See 'Problems and Future Work' section below for more details. 

Experiments and Results

I tested these kernels on four images of size 256x256, 512x512, 1024x1024, and 2048x2048. Tests were also performed with an image of size 4096x4096, but processing was too slow to include in this project.
These images were processed with Laplacian edge detection filters of size 3x3, 5x5, and 7x7. The results are as follows:
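The exact filter weights are not listed here. As one common choice, a Laplacian-style edge filter of odd size n can be built with -1 everywhere and n*n - 1 at the center, so the weights sum to zero and flat regions map to zero. The sketch below assumes that form; the project may have used different weights.

```python
def laplacian(size):
    """Flat row-major list: -1 everywhere, size*size - 1 at the center (assumed form)."""
    if size <= 0 or size % 2 == 0:
        raise ValueError("size must be a positive odd number")
    weights = [-1.0] * (size * size)
    weights[(size * size) // 2] = float(size * size - 1)
    return weights


for n in (3, 5, 7):
    w = laplacian(n)
    print("%dx%d: center weight %.0f, sum %.0f" % (n, n, w[len(w) // 2], sum(w)))
```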



Kernel performance (global kernel time) for the 12 experimental conditions
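The 12 conditions are the four image sizes crossed with the three filter sizes. A rough measure of the work in each is the number of filter taps, width x height x filter width squared. The sketch below just enumerates that count; it is arithmetic on the stated sizes, not measured data.

```python
IMAGE_SIZES = [256, 512, 1024, 2048]
FILTER_SIZES = [3, 5, 7]


def taps(image_size, filter_size):
    """Multiply-adds needed to convolve a square image with a square filter."""
    return image_size * image_size * filter_size * filter_size


conditions = [(n, k, taps(n, k)) for n in IMAGE_SIZES for k in FILTER_SIZES]
for n, k, t in conditions:
    print("%4dx%-4d  %dx%d filter  %11d taps" % (n, n, k, k, t))
```

The largest condition (2048x2048 with a 7x7 filter) needs roughly 350 times as many taps as the smallest (256x256 with a 3x3 filter), which is useful context when comparing kernel times across conditions.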


Analysis

As expected, the global memory kernel performed worst of the kernels evaluated. Its performance was so poor that, for large images, the kernel did not finish within the time allowed by the profiler, and results could not be gathered. This kernel can thus be considered the baseline that the other kernels need to improve on.

The texture memory kernel significantly improved performance on the smaller images, but its runtime on the largest image was still fairly slow. It was consistently the second slowest of the four kernels.

The texture memory kernel with constant memory storage of the filter and dimension data was the fastest of the kernels. Further optimization with local memory (the fourth kernel) only reduced performance, suggesting that texture memory accesses are sufficiently fast and that staging texture data in local memory is unnecessary.

Problems and Future Work

Two additional kernels were attempted, but these kernels did not run, so results of these conditions could not be shown.  

The first was an attempt to use local memory for image data stored in global memory. I kept getting an 'invalid event' error on clEnqueueNDRangeKernel, and I could not track down the problem.


I also attempted to create a kernel that used global memory and constant memory, but even after intensive debugging I could not find the problem with this kernel or its initialization.



Finally, it appears that some of the convolution kernels produce variable results, even when run with the same input conditions.  I did not discover this problem until the testing phase, and as a result I did not have time to track down the source of the variability.  
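One way to start tracking down this variability would be to read back the output of two runs with identical inputs and compare them pixel by pixel. The sketch below assumes each run's output is available as a flat row-major list of floats; the function names and the tolerance are hypothetical. Where the differing pixels fall (for example, along image edges or in regular blocks) can hint at where to look, though nothing here establishes the cause.

```python
def max_abs_diff(run_a, run_b):
    """Largest absolute difference between two outputs of equal length."""
    if len(run_a) != len(run_b):
        raise ValueError("outputs have different lengths")
    return max((abs(a - b) for a, b in zip(run_a, run_b)), default=0.0)


def differing_pixels(run_a, run_b, width, tol=1e-6):
    """(x, y) coordinates of pixels that differ by more than tol."""
    if len(run_a) != len(run_b):
        raise ValueError("outputs have different lengths")
    return [(i % width, i // width)
            for i, (a, b) in enumerate(zip(run_a, run_b))
            if abs(a - b) > tol]


first = [0.0, 1.0, 2.0, 3.0]
second = [0.0, 1.0, 2.5, 3.0]
print(max_abs_diff(first, second))
print(differing_pixels(first, second, width=2))
```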

Results for the 256x256 image with filters of size 3, 5, and 7, run with the texture kernel (top) and with the texture kernel using constant and local memory (bottom)






