This assignment is an introduction to parallel programming using GPUs. Most of this page will be similar to the HW 2-1 and HW 2-2 pages.
In this assignment, we will be parallelizing a toy particle simulation (similar simulations are used in mechanics, biology, and astronomy). In our simulation, particles interact by repelling one another. A run of our simulation is shown here:
The particles repel one another, but only when they are closer than a cutoff distance, highlighted in grey around one particle.
If we were to naively compute the forces on the particles by iterating through every pair of particles, then we would expect the asymptotic complexity of our simulation to be O(n^2).
However, in our simulation, we have chosen a density of particles sufficiently low so that with n particles, we expect only O(n) interactions. An efficient implementation can reach this time complexity.
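The usual way to reach O(n) is spatial binning: overlay the square domain with a grid of bins whose side length is at least the cutoff, so each particle only needs to consider particles in its own bin and the eight neighboring bins. The starter code does not prescribe any particular data structure; the following is only a rough sketch of the bin-index computation, and the names (get_bin_index, bin_size, bin_count) are illustrative rather than taken from the assignment.

// Map a particle's position to a bin in a bin_count x bin_count grid.
// bin_size is chosen >= cutoff, so every interaction partner of a particle
// lies in the 3x3 block of bins centered on the particle's own bin.
__host__ __device__ inline int get_bin_index(double x, double y,
                                             double bin_size, int bin_count) {
    int bx = (int)(x / bin_size);
    int by = (int)(y / bin_size);
    // Clamp in case a particle sits exactly on the domain boundary.
    if (bx >= bin_count) bx = bin_count - 1;
    if (by >= bin_count) by = bin_count - 1;
    return by * bin_count + bx;
}

Because the particle density is fixed, each bin holds O(1) particles on average, so each particle does a constant amount of work per step and the whole simulation step costs O(n).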
You must use the same groups as with HW2-1 and HW2-2. If this is a problem, please privately contact the GSIs.
The starter code is available on GitHub at https://github.com/Berkeley-CS267/hw2-3 and should work out of the box. To get started, we recommend you log in to Perlmutter and download the assignment. This will look something like the following:
student@local:~> ssh student@perlmutter.nersc.gov
student@perlmutter:login39:~> git clone https://github.com/Berkeley-CS267/hw2-3
student@perlmutter:login39:~> cd hw2-3
student@perlmutter:login39:~/hw2-3> ls
CMakeLists.txt common.h job-gpu main.cu gpu.cu
There are five files in the base repository. Their purposes are as follows:
CMakeLists.txt
The build system that manages compiling your code.
main.cu
A driver program that runs your code.
common.h
A header file with shared declarations.
job-gpu
A sample job script to run the gpu executable.
gpu.cu (you may modify this file)
A skeleton file where you will implement your GPU simulation algorithm. It is your job to write an algorithm within the simulate_one_step function; a minimal sketch of the expected structure appears below.
Please do not modify any of the files besides gpu.cu.
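To give a sense of the structure expected in gpu.cu, here is a minimal, unoptimized sketch of simulate_one_step. It assumes the interface used in HW 2-1 and HW 2-2 (a particle_t struct with x, y, vx, vy, ax, ay fields and cutoff, min_r, and mass constants declared in common.h) and that the driver hands simulate_one_step a device pointer; check common.h and main.cu for the exact declarations. This naive one-thread-per-particle kernel is O(n^2) per step and is shown only to illustrate the kernel-launch pattern, not as the expected solution.

#include "common.h"

#define NUM_THREADS 256  // threads per block; an illustrative choice

// One thread per particle; each thread loops over every other particle,
// which is the naive O(n^2) approach described above.
__global__ void compute_forces_gpu(particle_t* parts, int num_parts) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= num_parts) return;
    parts[i].ax = parts[i].ay = 0;
    for (int j = 0; j < num_parts; ++j) {
        double dx = parts[j].x - parts[i].x;
        double dy = parts[j].y - parts[i].y;
        double r2 = dx * dx + dy * dy;
        if (r2 > cutoff * cutoff || r2 == 0)  // only nearby particles interact
            continue;
        r2 = (r2 > min_r * min_r) ? r2 : min_r * min_r;
        double r = sqrt(r2);
        double coef = (1 - cutoff / r) / r2 / mass;  // repulsive force, mirroring the serial code
        parts[i].ax += coef * dx;
        parts[i].ay += coef * dy;
    }
}

void simulate_one_step(particle_t* parts, int num_parts, double size) {
    int blks = (num_parts + NUM_THREADS - 1) / NUM_THREADS;
    compute_forces_gpu<<<blks, NUM_THREADS>>>(parts, num_parts);
    // A second kernel would then move each particle using the new accelerations.
}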
First, we need to make sure that the CMake module is loaded.
student@perlmutter:login39:~/hw2-3> module load cmake
student@perlmutter:login39:~/hw2-3> cmake --version
cmake version 3.30.2
CMake suite maintained and supported by Kitware (kitware.com/cmake).
You should put these commands in your ~/.bash_profile file to avoid typing them every time you log in.
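For example, appending this line to ~/.bash_profile loads the module automatically on every login:

module load cmake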
Next, let's build the code. CMake prefers out-of-tree builds, so we start by creating a build directory.
student@perlmutter:login39:~/hw2-3> mkdir build
student@perlmutter:login39:~/hw2-3> cd build
student@perlmutter:login39:~/hw2-3/build>
Next, we have to configure our build. We can either build our code in Debug mode or Release mode. In Debug mode, optimizations are disabled and debug symbols are embedded in the binary for easier debugging with GDB. In Release mode, optimizations are enabled and debug symbols are omitted. For example:
student@perlmutter:login39:~/hw2-3/build> cmake -DCMAKE_BUILD_TYPE=Release ..
-- The C compiler identification is GNU 13.2.1
...
-- Configuring done
-- Generating done
-- Build files have been written to: /global/homes/s/student/hw2-3/build
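A Debug build is requested the same way, with only the build type changed; if you switch between build types, it is simplest to use a separate (or cleaned) build directory. For example:

student@perlmutter:login39:~/hw2-3/build> cmake -DCMAKE_BUILD_TYPE=Debug ..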
Once our build is configured, we may actually execute the build:
student@perlmutter:login39:~/hw2-3/build> make
[ 33%] Building CUDA object CMakeFiles/gpu.dir/main.cu.o
[ 66%] Building CUDA object CMakeFiles/gpu.dir/gpu.cu.o
[100%] Linking CUDA executable gpu
[100%] Built target gpu
student@perlmutter:login39:~/hw2-3/build> ls
CMakeCache.txt CMakeFiles cmake_install.cmake Makefile gpu job-gpu
We now have a binary (gpu) and a job script (job-gpu). You should not run the binary on the login node. To run your implementation, you can modify job-gpu by appending arguments after ./gpu, e.g., change line 8 to ./gpu -n 10000000, and then run sbatch job-gpu (a sketch of a typical job script appears after the interactive example below). You can also run your implementation on an interactive node by following the steps below:
student@perlmutter:login39:~/hw2-3/build> salloc -A mp309 -N 1 -C gpu -q interactive -t 00:30:00
salloc: Pending job allocation 36496753
salloc: job 36496753 queued and waiting for resources
salloc: job 36496753 has been allocated resources
salloc: Granted job allocation 36496753
salloc: Waiting for resource configuration
salloc: Nodes nid200421 are ready for job
student@perlmutter:nid200421:~/hw2-3/build> ./gpu -n 10000000
Simulation Time = 10.0758 seconds for 10000000 particles.
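For the batch route mentioned above, job-gpu is an ordinary Slurm script. The exact contents of the provided file may differ, but a typical single-GPU script for Perlmutter looks roughly like the following (the account and constraint values mirror the salloc example above; treat this only as a sketch and use the provided job-gpu as your starting point):

#!/bin/bash
#SBATCH -A mp309            # account, as in the salloc example above
#SBATCH -C gpu              # request a GPU node
#SBATCH -q debug            # queue; adjust as appropriate
#SBATCH -t 00:10:00         # wall-clock limit
#SBATCH -N 1                # one node
#SBATCH --gpus-per-node=1   # one A100 is all this assignment targets

srun ./gpu -n 10000000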
While the scripts we provide use a small number of particles (1,000) so that the O(n^2) algorithm finishes in a reasonable time, your final code should be tested with much larger values (100,000 to 10,000,000) to better see its performance.
We will grade your assignment by reviewing your write-up, measuring the scaling of your implementation, and benchmarking your code's raw performance. To benchmark your code, we will compile it with the exact process detailed above. We will run your submissions on exactly one of Perlmutter's A100 GPUs.
Suppose you are Group #XY; then follow these steps to create an appropriate submission archive:
Ensure that your write-up is located in your source directory. It should be named cs267XY_hw2_3.pdf.
From your build directory, run:
student@perlmutter:login39:~/hw2-3/build> cmake -DGROUP_NAME=XY ..
student@perlmutter:login39:~/hw2-3/build> make package
This second command will fail if the PDF is not present.
Confirm that it worked using the following command. You should see output like:
student@perlmutter:login39:~/hw2-3/build> tar tfz cs267XY_hw2_3.tar.gz
cs267XY_hw2_3/cs267XY_hw2_3.pdf
cs267XY_hw2_3/gpu.cu
Submit your .tar.gz through bCourses.
Write-up Details
Your write-up should contain:
The names of the people in your group.
Each member's contribution.
A plot in log-log scale showing your parallel code's performance and a description of the data structures that you used to achieve it.
You should benchmark your GPU implementation against the starter code, varying the number of particles and comparing both to linear scaling.
A description of the synchronization you used in the GPU implementation.
A description of the design choices that you tried and how they affected performance.
Please focus on parallelizing your code using only the GPU, i.e., using only one CPU core and one CPU thread.
You should pick your OpenMP and MPI implementations at a fixed number of nodes/threads (perhaps the setting with the best performance) and compare them to your GPU code as the number of particles increases.
You should break down the runtime into computation time, synchronization time, and/or communication time, and describe how they scale with the number of particles.
Notes:
Your grade will mostly depend on three factors:
Scaling sustained by your codes on the Perlmutter supercomputer (varying n).
Performance sustained by your codes on the Perlmutter supercomputer.
Explanations of your methodologies and the performance features you observed (including what didn't work).
If your code produces incorrect results, it will not be graded.
You must target Perlmutter's A100 GPU for this assignment.
If you observe that it takes a long time to generate outputs, please store the outputs under your $SCRATCH directory.
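For example, assuming the driver accepts the same -o option for the particle-output file as in HW 2-1 and HW 2-2 (check main.cu for the exact flags), writing the output to scratch would look like:

student@perlmutter:nid200421:~/hw2-3/build> ./gpu -n 100000 -o $SCRATCH/gpu.parts.out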
Programming in CUDA is introduced in the lecture "An Introduction to CUDA/OpenCL and Manycore Graphics Processors".