Lab 4: GPU Acceleration with Jetson

Due date: September 3, 2023

Today, we'll be digging into the capabilities of GPU acceleration on the Jetson.

You can find the link to the assignment here: https://classroom.github.com/a/ECJ1Sid2

In each of the sections below, please include sample outputs from your application. Additional requirements for your report are included in each section.

Setup: Visual display forwarding

Interfacing through Jupyter can get a bit slow. If you prefer, you can use SSH display tunneling (X forwarding) or VNC to work on the Jetson.

Background and Resources

Here are some slides to follow along with for today's lab.

Part 1: CUDA Hello World

To start this project, clone your assignment to your Jetson. Convention note: .cc is another common extension for C++ source files; .cu indicates CUDA code.

Start by compiling both versions of the provided main program:

$ g++ main.cc -o hello_cpu

$ nvcc main.cu -o hello_gpu


Make sure each of these programs executes as expected.

Example 1: Basic Memory Allocation

Now take a look at ex1.cu. Follow the description in the lab slides and read the code. Do you understand how the allocation and freeing work?

Deliverable: Explain in plain English what this example does.

Example 2: Scope?

Now take a look at ex2.cu. Follow the description in the lab slides and read the code. What differs between examples 1 and 2?

Deliverable: Succinctly explain the difference between examples 1 and 2, and what is significant about it.

Implementing Matrix Multiply Two Ways

Now you will have to write some code yourself. We are going to implement matrix multiply under two separate memory management schemes.

First, check out lw.cu. We have put together most of the CUDA scaffolding for you. Your task is to implement the kernel that will execute on the GPU (i.e. fill out the myKernel() function).

You have two things to consider here: how each thread maps to one output element, and how to guard against indices that fall outside the matrix.

Once you have your kernel working correctly, copy it into lw_managed.cu. This is a different scaffolding setup, but can use the same kernel to multiply. Take a look at how the setup code differs between lw.cu and lw_managed.cu.
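If you get stuck, here is a minimal sketch of a naive kernel; the name myKernel, the parameter order, and the use of float are assumptions, so match whatever the scaffolding in lw.cu actually declares:

// Hypothetical signature -- match the declarations in lw.cu.
// Computes C = A * B for NxN row-major matrices, one thread per output element.
__global__ void myKernel(const float *A, const float *B, float *C, int N) {
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    if (row < N && col < N) {            // guard: the grid may overshoot N
        float sum = 0.0f;
        for (int k = 0; k < N; ++k)
            sum += A[row * N + k] * B[k * N + col];
        C[row * N + col] = sum;
    }
}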

Deliverables

A Little Extra If You're Interested

There is a suite of example programs that ships with CUDA. By default these are installed to /usr/local/cuda/samples, though the location has been known to move around. Try running /usr/local/cuda/samples/1_Utilities/deviceQuery/deviceQuery to see what the capabilities of the Jetson are. Check out some of the other samples here as well.

Part 2: GPU-Accelerated Image Manipulation

In this part of the lab, we will implement some common image manipulation techniques: greyscale conversion, inversion, and blurring.

A CPU Baseline

We will first implement greyscale conversion on the CPU to make sure your implementation works as expected:

img_rgb2gray_cpu(gray.ptr<uchar>(), rgb.ptr<uchar>(), WIDTH, HEIGHT, CHANNELS);
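If you need a starting point, a sketch of one common approach is below; the Rec. 601 luma weights are standard, but note that OpenCV stores pixels in BGR order, so double-check the channel order your data actually uses:

// Sketch of a CPU greyscale conversion. Assumes 8-bit pixels with 'rgb'
// laid out row-major as HEIGHT x WIDTH x CHANNELS interleaved components.
void img_rgb2gray_cpu(unsigned char *gray, const unsigned char *rgb,
                      int width, int height, int channels) {
    for (int row = 0; row < height; ++row) {
        for (int col = 0; col < width; ++col) {
            int idx = (row * width + col) * channels;
            // Rec. 601 weights; swap the indices if your buffer is BGR.
            gray[row * width + col] = (unsigned char)(0.299f * rgb[idx] +
                0.587f * rgb[idx + 1] + 0.114f * rgb[idx + 2]);
        }
    }
}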

GPU, with separate memory

Now, let's move to the GPU.

You may find the image indexing at the end of the lab slides a useful reference.

Implement and call your algorithm

We'll have to make some changes to img_proc.cu.

The changes/additions here are the actual algorithm implementation on the GPU and a wrapper function to be called on the host:
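As a sketch of what those two pieces might look like (the kernel and wrapper names, block size, and parameter order are assumptions; match the declarations already in img_proc.cu):

// Device kernel: one thread per output pixel, same layout as the CPU baseline.
__global__ void rgb2gray_kernel(unsigned char *gray, const unsigned char *rgb,
                                int width, int height, int channels) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {   // guard against grid overshoot
        int idx = (row * width + col) * channels;
        gray[row * width + col] = (unsigned char)(0.299f * rgb[idx] +
            0.587f * rgb[idx + 1] + 0.114f * rgb[idx + 2]);
    }
}

// Host wrapper: launches enough blocks to cover the whole image.
// Expects pointers that are valid on the device (or managed memory).
void img_rgb2gray_gpu(unsigned char *gray, const unsigned char *rgb,
                      int width, int height, int channels) {
    dim3 block(16, 16);
    dim3 grid((width + block.x - 1) / block.x, (height + block.y - 1) / block.y);
    rgb2gray_kernel<<<grid, block>>>(gray, rgb, width, height, channels);
    cudaDeviceSynchronize();  // wait for the kernel before the host touches 'gray'
}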

Hint: Having trouble with indices? Check out the NVIDIA CUDA basics slides, in particular the "Combining Threads and Blocks" section (starting on PDF page 37).

Setup memory and get things going

We'll have to make some changes to main.cpp.

The changes here manage the GPU device memory and call the wrapper function we wrote above. We are NOT using unified memory:

unsigned char* gray_device;
cudaMalloc((void **)&gray_device, <SIZE_TO_ALLOCATE>);

unsigned char* rgb_device;
cudaMalloc((void **)&rgb_device, <SIZE_TO_ALLOCATE>);

cudaMemcpy(<PTR_TO_DEVICE_MEM>, <PTR_TO_HOST_MEM>, <SIZE_TO_COPY>, cudaMemcpyHostToDevice);

cudaMemcpy(<PTR_TO_HOST_MEM>, <PTR_TO_DEVICE_MEM>, <SIZE_TO_COPY>, cudaMemcpyDeviceToHost);
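Putting it together for the greyscale case, assuming 8-bit pixels, the sizes work out as below; treat this as one consistent way to fill the placeholders, not the only one:

unsigned char *gray_device, *rgb_device;
cudaMalloc((void **)&gray_device, WIDTH * HEIGHT);             // 1 byte per grey pixel
cudaMalloc((void **)&rgb_device,  WIDTH * HEIGHT * CHANNELS);  // interleaved colour input

// Host -> device: copy the input image before launching the kernel.
cudaMemcpy(rgb_device, rgb.ptr<uchar>(), WIDTH * HEIGHT * CHANNELS, cudaMemcpyHostToDevice);

img_rgb2gray_gpu(gray_device, rgb_device, WIDTH, HEIGHT, CHANNELS);

// Device -> host: copy the result back so OpenCV can display it.
cudaMemcpy(gray.ptr<uchar>(), gray_device, WIDTH * HEIGHT, cudaMemcpyDeviceToHost);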

GPU, with unified memory

There is also memory that can be shared by host and device. The benefit is less code, but it is often less efficient than explicitly allocated device memory.

cudaMallocManaged(&gray_device, <SIZE_TO_ALLOCATE>);

Mat gray = Mat(HEIGHT, WIDTH, CV_8U, gray_device);

img_rgb2gray_gpu(gray.ptr<uchar>(), rgb.ptr<uchar>(), ...);

Notice the benefit of not having to write code to copy the data every time.
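One caveat: kernel launches are asynchronous, and with unified memory there are no cudaMemcpy calls to force the host to wait, so synchronize before the CPU reads the result (unnecessary if your wrapper already synchronizes):

img_rgb2gray_gpu(gray.ptr<uchar>(), rgb.ptr<uchar>(), WIDTH, HEIGHT, CHANNELS);
cudaDeviceSynchronize();  // make sure the kernel has finished before reading 'gray'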

What are the computational benefits and drawbacks of each method?

Image Inversion

Repeat the above steps, but the kernel function will invert the image (pixval = 255 - pixval).
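The kernel body is essentially a one-liner. A sketch, treating the image as a flat byte buffer so it works on colour or grey images alike (the kernel name is illustrative):

// Invert every byte of an image buffer; len = width * height * channels.
__global__ void invert_kernel(unsigned char *img, int len) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < len)
        img[i] = 255 - img[i];
}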

Image Blur

Repeat the above steps, but the kernel function will average a BLUR_SIZE square of pixels.
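A sketch for a single-channel image is below. It treats BLUR_SIZE as the window half-width, i.e. a (2*BLUR_SIZE+1)-pixel-wide box, which is the convention in the classic CUDA teaching examples; adjust if the lab defines BLUR_SIZE as the full side length:

// Box blur on a greyscale image: average the valid neighbours within
// BLUR_SIZE of each pixel, clipping the window at the image borders.
// BLUR_SIZE is assumed to be defined by the lab scaffolding.
__global__ void blur_kernel(unsigned char *out, const unsigned char *in,
                            int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col < width && row < height) {
        int sum = 0, count = 0;
        for (int dr = -BLUR_SIZE; dr <= BLUR_SIZE; ++dr) {
            for (int dc = -BLUR_SIZE; dc <= BLUR_SIZE; ++dc) {
                int r = row + dr, c = col + dc;
                if (r >= 0 && r < height && c >= 0 && c < width) {
                    sum += in[r * width + c];
                    ++count;
                }
            }
        }
        out[row * width + col] = (unsigned char)(sum / count);
    }
}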

Part 2 Deliverables

Assignment: CUDA for Familiar Tasks

In this assignment, you will learn more about GPU programming. You will need to develop and run your code on the Jetson TX2 board.

Part 1: Sobel filter

For this part, you can use the input.raw video that is provided. Complete the filter.cu file to implement a Sobel filter on the GPU. You can reuse any code from previous assignments. Note the following items you will need to complete:

In filter.cu:

In main.cpp:

Once everything is implemented:
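For reference, the Sobel operator convolves the image with two 3x3 kernels (horizontal and vertical gradients) and combines the responses. A sketch of the device code, assuming a greyscale input and using |Gx| + |Gy| as a cheap stand-in for the true magnitude sqrt(Gx^2 + Gy^2); the kernel name is illustrative:

// Sobel filter on a greyscale image. Border pixels are zeroed for simplicity.
__global__ void sobel_kernel(unsigned char *out, const unsigned char *in,
                             int width, int height) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (col >= 1 && col < width - 1 && row >= 1 && row < height - 1) {
        int gx = -in[(row-1)*width + col-1] + in[(row-1)*width + col+1]
                 - 2*in[row*width + col-1]  + 2*in[row*width + col+1]
                 - in[(row+1)*width + col-1] + in[(row+1)*width + col+1];
        int gy = -in[(row-1)*width + col-1] - 2*in[(row-1)*width + col] - in[(row-1)*width + col+1]
                 + in[(row+1)*width + col-1] + 2*in[(row+1)*width + col] + in[(row+1)*width + col+1];
        int mag = abs(gx) + abs(gy);    // cheap alternative to the true magnitude
        out[row*width + col] = (unsigned char)(mag > 255 ? 255 : mag);
    } else if (col < width && row < height) {
        out[row*width + col] = 0;       // zero the one-pixel border
    }
}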

Deliverables

Report the approximate execution times for OpenCV Sobel, CPU Sobel, and GPU Sobel, for different sizes.

You can use square sizes from 512 to 4096 (note: your code should still work for non-square sizes).

Note that for smaller sizes, the FPS will be limited by the camera FPS, and beyond 1024, the images will not display. If you wish, you can completely disable the display (comment out "imshow" in main.cpp) for all sizes to get a more stable result for the GPU.

Part 2: Blocked matrix multiplication

In this part, you will multiply two matrices using shared memory (CUDA's fast on-chip per-block memory, not to be confused with the unified memory used earlier). Make sure you carefully read the description of the problem.

Description

Tasks

Hints

You will launch NxN threads (one per output element), divided into blocks of size SxS. The example below illustrates what happens inside each thread:
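A minimal sketch of such a tiled kernel, assuming square row-major float matrices with N a multiple of S and a block size equal to the tile size; all names are illustrative:

#define S 16  // tile width, assumed equal to the block dimensions

// Each thread computes one element of C = A * B. The block cooperatively
// stages an SxS tile of A and B in shared memory before using it.
__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    __shared__ float tileA[S][S];
    __shared__ float tileB[S][S];

    int row = blockIdx.y * S + threadIdx.y;
    int col = blockIdx.x * S + threadIdx.x;
    float sum = 0.0f;

    // Walk across the tiles of A and B that contribute to C[row][col].
    for (int t = 0; t < N / S; ++t) {
        // Each thread loads one element of each tile.
        tileA[threadIdx.y][threadIdx.x] = A[row * N + t * S + threadIdx.x];
        tileB[threadIdx.y][threadIdx.x] = B[(t * S + threadIdx.y) * N + col];
        __syncthreads();  // wait until the whole tile is in shared memory

        for (int k = 0; k < S; ++k)
            sum += tileA[threadIdx.y][k] * tileB[k][threadIdx.x];
        __syncthreads();  // wait before the next iteration overwrites the tiles
    }
    C[row * N + col] = sum;
}

Note how each element of A and B is read from global memory only N/S times instead of N times; that reuse is the whole point of the blocking.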

Deliverables

Your final source code and the performance measures requested above.


What to Submit

Prepare a report document with answers for each of the Report Deliverables above.

Lab

Part 1

Part 2

Assignment

Part 1

The approximate execution times for OpenCV Sobel, CPU Sobel, and GPU Sobel at different sizes, as described in the Part 1 deliverables above.

Part 2

Your final source code and the performance measures requested above.

Additional Resources