Nvidia CUDA Experience

CUDA (Compute Unified Device Architecture) is a parallel computing architecture developed by Nvidia; it is the compute engine in Nvidia graphics processing units. See the Wikipedia article on CUDA for more information.

The Problem:

Being a Java programmer, I wanted to explore how to make a Java program talk to Nvidia hardware using CUDA C code. CUDA hardware is good at handling matrices, and I had a problem that could be modeled with matrices, so I thought I'd use CUDA for the matrix processing. For bookkeeping and code logic I feel more comfortable with Java's various packages, such as Swing for GUIs. So I designed my solution so that the calculation-intensive parts run on the parallel CUDA hardware while the non-calculation-intensive parts run in the JVM.

Development Environment:

  • Operating System: Ubuntu 8.04
  • CUDA SDK Version: 2.3
  • Development IDE: gedit as an editor and a Makefile as the build script are sufficient for this simple example.

The Learning Process

The CUDA documentation provided by Nvidia was very helpful in setting up the development environment on my Ubuntu machine. The toolkit and SDK are essential to run tests in emulation mode. Oh, I should mention that I do not have CUDA-enabled hardware, so the whole experience listed here is in emulation mode. According to the CUDA documentation, what works in emulation mode should work on the hardware.

An interesting tutorial on how to get started with your first CUDA program is posted by llpanorama. The first step for anyone getting started with CUDA is to get a simple example, like the one presented by llpanorama, running on the machine. The tutorial explains how to produce an executable that runs routines on CUDA-enabled hardware, which was not what I wanted to do. I wanted to call CUDA code, that is, code that executes on the GPU cores, from a Java program through JNI. This meant that I needed to create a shared library, or a DLL, that contained CUDA code and link it to my Java code. Being inexperienced in C programming, figuring out how this could be done took some time.

The Whole Picture

The following diagram gives an overview of the steps I used to achieve my goal.

[Diagram omitted: overview of Steps 1 through 6, from the Java source through the shared library to execution]

Explanation of Steps

The boxes colored in pink above indicate where code was written.

Before Step 1: Write the Java code with a native method declaration.

To start with, a Main.java file was created containing simple code that declares a native function

public native int CUDAProxy_matrixAdd(float[] a, float[] b, float[] c);

declares three arrays,

float[] a = new float[SIZE];
float[] b = new float[SIZE];
float[] c = new float[SIZE];

initializes two of them

for (int i = 0; i < a.length; i++)
   a[i] = b[i] = i;

and calls the JNI function

int retVal = m.CUDAProxy_matrixAdd(a, b, c);

Step 1: Compiling the Java code into .class file

The command used in this step is the conventional javac command

%>javac Main.java

which creates the .class file.

Step 2: Generating the C header file .h from the .class file

The following command line is used

%>javah -jni -classpath ./ Main
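
This produces a Main.h file containing the JNI declaration. As a rough sketch, given the class and method names above, the generated header should look something like this (javah escapes the underscore in CUDAProxy_matrixAdd as _1):

/* Main.h (generated by javah): sketch of the relevant declaration */
#include <jni.h>

JNIEXPORT jint JNICALL Java_Main_CUDAProxy_1matrixAdd
  (JNIEnv *, jobject, jfloatArray, jfloatArray, jfloatArray);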

Before Step 3: Write the native implementation (proxy.c)

Here one needs to put their C skills into action, an area in which I am a bit rusty. The code simply moves the data from the Java environment into the native C environment

jfloat *a = (*env)->GetFloatArrayElements(env, aArray, 0);
jfloat *b = (*env)->GetFloatArrayElements(env, bArray, 0);
jfloat *c = (*env)->GetFloatArrayElements(env, cArray, 0);

then calls the native function that executes CUDA code on the GPU

cuda_matrixAdd(a, b, c, N);

then releases the arrays, which copies the data back from the native environment to the Java environment

(*env)->ReleaseFloatArrayElements(env, aArray, a, 0);
(*env)->ReleaseFloatArrayElements(env, bArray, b, 0);
(*env)->ReleaseFloatArrayElements(env, cArray, c, 0);
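
Putting these pieces together, a minimal sketch of proxy.c might look as follows. This is an illustrative reconstruction: N is assumed to match SIZE in Main.java, and cuda_matrixAdd is the host wrapper defined in kernel_code.cu.

#include <jni.h>
#include "Main.h"

#define N 1024 /* assumption: must match SIZE in Main.java */

/* host wrapper defined in kernel_code.cu */
void cuda_matrixAdd(float *a, float *b, float *c, int n);

JNIEXPORT jint JNICALL Java_Main_CUDAProxy_1matrixAdd
  (JNIEnv *env, jobject obj, jfloatArray aArray, jfloatArray bArray, jfloatArray cArray)
{
    /* get native views of the Java arrays */
    jfloat *a = (*env)->GetFloatArrayElements(env, aArray, 0);
    jfloat *b = (*env)->GetFloatArrayElements(env, bArray, 0);
    jfloat *c = (*env)->GetFloatArrayElements(env, cArray, 0);

    /* run the addition on the (emulated) GPU */
    cuda_matrixAdd(a, b, c, N);

    /* release the arrays; mode 0 copies the results back into the Java arrays */
    (*env)->ReleaseFloatArrayElements(env, aArray, a, 0);
    (*env)->ReleaseFloatArrayElements(env, bArray, b, 0);
    (*env)->ReleaseFloatArrayElements(env, cArray, c, 0);

    return 0;
}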

The native implementation also involves writing CUDA code in the kernel_code.cu file, which has two sections:

Kernel Code Section

This is the function that will be executed on the GPU. It is prefixed with the __global__ keyword. Each thread computes one element of the result; the idx < N guard protects against the extra threads in the last block when N is not a multiple of the block size.

__global__ void add_matrix(float *a, float *b, float *c, int N)
{
   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   if (idx < N) c[idx] = a[idx] + b[idx];
}

Host Code Section

This section contains the C code that runs on the host. Usually this code does all the bookkeeping and handles the data flow between the host and the GPU device. The major steps consist of allocating memory on the GPU device

cudaMalloc((void **) &a_d, size);
cudaMalloc((void **) &b_d, size);
cudaMalloc((void **) &c_d, size);

and then moving the data from the host to the GPU device

cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

then executing the kernel code, rounding the number of blocks up so that all N elements are covered

int block_size = 4;
int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
add_matrix<<<n_blocks, block_size>>>(a_d, b_d, c_d, N);

then moving the data, usually the result, from the GPU device back to the host. 

cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost); 

IMPORTANT:
To expose your routines through a shared library, you need to enclose the externally visible CUDA code between these directives:
#ifdef __cplusplus
extern "C" {
#endif

and

#ifdef __cplusplus
}
#endif
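
Putting the host section together, here is a minimal sketch of kernel_code.cu, under the assumption that proxy.c calls the wrapper as cuda_matrixAdd; I have also added cudaFree calls to release the device memory at the end:

#include <cuda_runtime.h>

/* kernel code: executed on the GPU, one thread per element */
__global__ void add_matrix(float *a, float *b, float *c, int N)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < N) c[idx] = a[idx] + b[idx];
}

#ifdef __cplusplus
extern "C" {
#endif

/* host code: exposed to proxy.c through the shared library */
void cuda_matrixAdd(float *a_h, float *b_h, float *c_h, int N)
{
    float *a_d, *b_d, *c_d;
    size_t size = N * sizeof(float);

    /* allocate memory on the GPU device */
    cudaMalloc((void **) &a_d, size);
    cudaMalloc((void **) &b_d, size);
    cudaMalloc((void **) &c_d, size);

    /* move the input data from host to device */
    cudaMemcpy(a_d, a_h, size, cudaMemcpyHostToDevice);
    cudaMemcpy(b_d, b_h, size, cudaMemcpyHostToDevice);

    /* launch with enough blocks to cover all N elements */
    int block_size = 4;
    int n_blocks = N / block_size + (N % block_size == 0 ? 0 : 1);
    add_matrix<<<n_blocks, block_size>>>(a_d, b_d, c_d, N);

    /* move the result back from device to host */
    cudaMemcpy(c_h, c_d, size, cudaMemcpyDeviceToHost);

    /* free the device memory */
    cudaFree(a_d);
    cudaFree(b_d);
    cudaFree(c_d);
}

#ifdef __cplusplus
}
#endif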

Step 3: Compile the C code into an object file

Our code is scattered over two files, and in real life it may be scattered over several. The best strategy for compiling all the code into one library is to compile each source file into an object file and then combine all the object files into one library. This step uses the gcc compiler to compile the native C code.

%>gcc -c *.c -O2 -m32 -MMD -MP -I. -I/usr/local/cuda/include -I/<absolute_path_to_CUDA_SDK>/NVIDIA_GPU_Computing_SDK/C/common/inc -I/usr/lib/jvm/java-6-sun-1.6.0.10/include -I/usr/lib/jvm/java-6-sun-1.6.0.10/include/linux

IMPORTANT:
  • For CUDA SDK 2.1 the include path is /<absolute_path_to_CUDA_SDK>/NVIDIA_GPU_Computing_SDK/common/inc
  • The default path for the CUDA toolkit was /usr/local/cuda/include on my Ubuntu machine. Change this to suit your installation.
  • Since our C code implements the native function for our Java code, we need to tell gcc where to find the JNI headers. For my setup these were the /usr/lib/jvm/java-6-sun-1.6.0.10/include and /usr/lib/jvm/java-6-sun-1.6.0.10/include/linux folders. You may change them to suit your JDK installation.

Step 4: Compile the CU code into an object file

Since .cu files contain CUDA code, not C code, they cannot be compiled using gcc. Instead, we need to use nvcc, which comes with the CUDA toolkit.

%>nvcc -c -deviceemu *.cu -I. -I/usr/local/cuda/include -I/<absolute_path_to_CUDA_SDK>/NVIDIA_GPU_Computing_SDK/C/common/inc

IMPORTANT:
  • The -deviceemu switch tells nvcc to use "device emulation" mode. This is useful if you do not have CUDA-enabled hardware installed on your machine, which is the case for me. The entire code executes on the host in this mode, but at least I know that my code should work on CUDA-enabled hardware.
  • Remove the -deviceemu switch if you want the code to execute on the GPU device.

UPDATE: On some Linux installations you may need to do Step 4 before Step 3 because proxy.c contains a call to a function in kernel_code.cu.

Step 5: Combine all object files into one dynamic linkable library

Now all of our object files are created, hopefully without any compilation errors, and we need to combine them into a library. On Linux machines these libraries usually have the .so extension (.dll on Windows). We use gcc again to build the library using the following command:

%>gcc -m32 -shared -fPIC -o program.so *.o -lrt -lm -lcudart -lcufftemu -lcublasemu -I. -I/usr/local/cuda/include -I/<absolute_path_to_CUDA_SDK>/NVIDIA_GPU_Computing_SDK/C/common/inc -I/usr/lib/jvm/java-6-sun-1.6.0.10/include -I/usr/lib/jvm/java-6-sun-1.6.0.10/include/linux -L/usr/local/cuda/lib -L.

Note:
The -lcufftemu and -lcublasemu switches are there just in case your code calls functions in those libraries. Though my sample code here does not, I've included these switches in the gcc command for future use. The suffix "emu" indicates emulation mode. If you have CUDA hardware, these need to be changed to -lcufft and -lcublas respectively.

If everything goes well, a library named program.so will exist in your current folder. Congratulations!! The hard part is over. Now all we need to do is execute the Java program.

Step 6: Execute the Java program after linking it with the library

The static block in the Java code that loads the dynamically linked library is shown below; note that it needs import java.io.File; and import java.util.Scanner; at the top of Main.java:

static
{
    Scanner input = new Scanner(System.in);
    System.out.println("Enter library name: ");
    String libName = input.nextLine();
    try
    {
        File path = new File("");
        System.out.println("Current Path = " + path.getAbsolutePath());
        String libPath = path.getAbsolutePath() + File.separator + libName;
        System.out.println("Trying to load library [" + libPath + "] ...");
        System.load(libPath);
        System.out.println("Library loaded");
    }
    catch (UnsatisfiedLinkError e) // System.load throws an Error, which catch (Exception) would miss
    {
        System.out.println("Error: " + e);
    }
}

To execute the code, all we need to do is run the Main program:

%>java Main

and enter the name of the library, which in our case is program.so.

Appendix

I've created a simple Makefile to make my life easier. The usage is as follows:

make java

to execute Steps 1 and 2. That is, compiling the .java file into a .class file and then generating the .h file that contains the native headers.

make

to compile the .cu file and the .c files and then combine them into one program.so library file.

You may need to modify the Makefile according to your system configuration.

Attachments: Main.java, Makefile, kernel_code.cu, proxy.c