OpenCL SVM

http://developer.amd.com/documentation/articles/pages/OpenCLOptimizationCaseStudySupportVectorMachineTraining.aspx

Initial Analysis

Let’s look at a simple C version of the Radial Basis Function evaluations. We’re going to evaluate all data points against two of the data points, one called high and the other called low.

We’ll store the results in two arrays, one called high_kernel and the other called low_kernel. We have nPoints training points, each of which is a sample from an nDim-dimensional feature space. We’ll store the data points in an array called data, where each row represents one training point, so the resulting matrix has nPoints rows and nDim columns. For now, we’ll store the matrix using the standard C “row-major” data layout, which we’ll discuss in more detail later in the article. gamma is the scalar parameter from our Radial Basis Function. Listing 1 shows some simple C code to accomplish this.

void svm_kernel(float* data,
                int nPoints,
                int nDim,
                int high,
                int low,
                float gamma,
                float* high_kernel,
                float* low_kernel) {
  for (int index = 0; index < nPoints; index++) {
    float high_accumulator = 0;
    float low_accumulator = 0;
    for (int d = 0; d < nDim; d++) {
      float x_i_d = data[index * nDim + d];
      float x_high_d = data[high * nDim + d];
      float x_low_d = data[low * nDim + d];
      float high_diff = x_i_d - x_high_d;
      high_accumulator += high_diff * high_diff;
      float low_diff = x_i_d - x_low_d;
      low_accumulator += low_diff * low_diff;
    }
    float high_result = exp(-gamma * high_accumulator);
    high_kernel[index] = high_result;
    float low_result = exp(-gamma * low_accumulator);
    low_kernel[index] = low_result;
  }
}

Listing 1: Simple C code for RBF Kernel Evaluations

In this code, we’ve fused the high and low kernel evaluations together into one inner loop, in order to avoid loading each training point’s vector twice per iteration of the outer loop. We can see the row-major data layout in the indexing arithmetic of the inner loop: to load data[row, column], or equivalently data[index, d] in Listing 1, we index into data[index * nDim + d], where nDim is the number of columns in each row. We’ll discuss this in detail later.
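To make the indexing concrete, here is a tiny standalone sketch. The matrix values and variable names are invented purely for illustration; they are not part of the article’s code.

#include <stdio.h>

int main(void) {
  /* Hypothetical 3 x 4 matrix stored row-major, as described above:
     element (row, col) lives at data[row * nDim + col]. */
  int nDim = 4;
  float data[12] = { 0, 1, 2,  3,     /* row 0 */
                     4, 5, 6,  7,     /* row 1 */
                     8, 9, 10, 11 };  /* row 2 */
  int index = 1, d = 2;
  /* data[index, d] in the article's notation: */
  printf("data[index, d] = %f\n", data[index * nDim + d]);  /* prints 6.000000 */
  return 0;
}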

We can also see that each iteration of the inner loop performs 6 floating-point operations and 3 memory loads.

However, since the high and low vectors are the same for every iteration of the outer loop, if we write our code correctly, we’ll only have to load them into on-chip memories once, and then reuse them without executing any off-chip memory loads for them in the inner loop. The code in Listing 1 does not do this; it is a simple C implementation without any optimizations. Later, we’ll see how to implement this optimization and avoid wasting memory bandwidth.
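One way to express that reuse in plain C, before we move to OpenCL, is to copy the high and low rows into small buffers outside the outer loop, so the inner loop reads them from those buffers rather than re-indexing into data. This is only a sketch of the idea (the function name and buffers here are ours, not the article’s); the OpenCL versions later in the article achieve the same effect with on-chip local memory.

#include <math.h>
#include <stdlib.h>

/* Sketch: hoist the high and low rows out of the outer loop so they are
   read from data exactly once.  On a CPU these small buffers will normally
   stay resident in cache. */
void svm_kernel_hoisted(float* data, int nPoints, int nDim,
                        int high, int low, float gamma,
                        float* high_kernel, float* low_kernel) {
  float* x_high = malloc(nDim * sizeof(float));
  float* x_low = malloc(nDim * sizeof(float));
  for (int d = 0; d < nDim; d++) {
    x_high[d] = data[high * nDim + d];
    x_low[d] = data[low * nDim + d];
  }
  for (int index = 0; index < nPoints; index++) {
    float high_accumulator = 0;
    float low_accumulator = 0;
    for (int d = 0; d < nDim; d++) {
      float x_i_d = data[index * nDim + d];
      float high_diff = x_i_d - x_high[d];
      high_accumulator += high_diff * high_diff;
      float low_diff = x_i_d - x_low[d];
      low_accumulator += low_diff * low_diff;
    }
    high_kernel[index] = exp(-gamma * high_accumulator);
    low_kernel[index] = exp(-gamma * low_accumulator);
  }
  free(x_high);
  free(x_low);
}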

We can use this information to compute a bound on the expected performance of this code running on the AMD Radeon HD 5870. The Radeon HD 5870 has 153.6 GB/s of off-chip memory bandwidth. Assuming perfect caching of the high and low vectors, we should see 153.6 GB/second * 6 Single Precision Floating Point Operations/Memory load * 1 Memory load/4 Bytes = 230.4 GFLOP/s. We’ll use this bound to evaluate the quality of our implementations as we optimize.
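Written out as a quick calculation (the variable names below are just for illustration):

#include <stdio.h>

int main(void) {
  /* Bandwidth-bound estimate, assuming the high and low vectors are cached
     on chip so the only off-chip traffic is one float load per inner-loop
     iteration. */
  double bytes_per_second = 153.6e9;  /* Radeon HD 5870 off-chip bandwidth  */
  double bytes_per_load = 4.0;        /* one single-precision float         */
  double flops_per_load = 6.0;        /* 6 FLOPs per off-chip float loaded  */
  double bound = bytes_per_second / bytes_per_load * flops_per_load;
  printf("%.1f GFLOP/s\n", bound / 1e9);  /* prints 230.4 GFLOP/s */
  return 0;
}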

Basic OpenCL Implementation

Translating the C code directly to OpenCL, we will instantiate a work-item for each iteration of the outermost loop. Listing 2 shows a direct OpenCL translation of our simple C implementation from Listing 1.

__kernel void svm_kernel(__global float* data,
                         const int nPoints,
                         const int nDim,
                         const int high,
                         const int low,
                         const float gamma,
                         __global float* high_kernel,
                         __global float* low_kernel) {
  int global_index = get_global_id(0);
  float high_accumulator = 0;
  float low_accumulator = 0;
  for (int d = 0; d < nDim; d++) {
    float x_i_d = data[global_index * nDim + d];
    float x_high_d = data[high * nDim + d];
    float high_diff = x_i_d - x_high_d;
    high_accumulator += high_diff * high_diff;
    float x_low_d = data[low * nDim + d];
    float low_diff = x_i_d - x_low_d;
    low_accumulator += low_diff * low_diff;
  }
  float high_result = exp(-gamma * high_accumulator);
  high_kernel[global_index] = high_result;
  float low_result = exp(-gamma * low_accumulator);
  low_kernel[global_index] = low_result;
}

Listing 2: Direct OpenCL translation of simple C code for RBF kernel evaluation

Comparing Listing 2 to Listing 1, we can see that we have taken the outermost loop and turned it into an OpenCL kernel, where the index space of the OpenCL kernel corresponds to the iterations of the outermost loop.
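For completeness, a rough host-side sketch of launching such a kernel with one work-item per training point might look like the following. The variable names (queue, kernel, data_buf, and so on) are ours, not from the article, and buffer/kernel creation and error checking are omitted.

#include <CL/cl.h>

/* Assumes cl_command_queue queue, cl_kernel kernel, and cl_mem buffers
   data_buf, high_kernel_buf, low_kernel_buf were created earlier, and that
   nPoints, nDim, high, low are cl_int and gamma is cl_float. */
clSetKernelArg(kernel, 0, sizeof(cl_mem), &data_buf);
clSetKernelArg(kernel, 1, sizeof(cl_int), &nPoints);
clSetKernelArg(kernel, 2, sizeof(cl_int), &nDim);
clSetKernelArg(kernel, 3, sizeof(cl_int), &high);
clSetKernelArg(kernel, 4, sizeof(cl_int), &low);
clSetKernelArg(kernel, 5, sizeof(cl_float), &gamma);
clSetKernelArg(kernel, 6, sizeof(cl_mem), &high_kernel_buf);
clSetKernelArg(kernel, 7, sizeof(cl_mem), &low_kernel_buf);

/* One work-item per training point: the 1D index space has nPoints entries,
   matching the outermost loop of the C version. */
size_t global_work_size = nPoints;
clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &global_work_size, NULL,
                       0, NULL, NULL);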