Homework 2 (Part 3):

Parallelizing a Particle Simulation


This assignment is an introduction to parallel programming using GPUs. Most of this page will be similar to the HW 2-1 page.

In this assignment, we will be parallelizing a toy particle simulation (similar simulations are used in mechanics, biology, and astronomy).  In our simulation, particles interact by repelling one another.  A run of our simulation is shown here:

The particles repel one another, but only when closer than a cutoff distance highlighted around one particle in grey.

Asymptotic Complexity

Serial Solution Time Complexity

If we were to naively compute the forces on the particles by iterating through every pair of particles, then we would expect the asymptotic complexity of our simulation to be O(n^2).
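To make the O(n^2) cost concrete, here is a minimal serial sketch of the naive all-pairs loop. The Particle struct, the apply_forces_naive name, and the 1/r repulsion are assumptions made for illustration; they are not taken from the starter code, whose types and force law may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle record (assumption); the real type lives in common.h.
struct Particle {
    double x, y;    // position
    double ax, ay;  // acceleration accumulated during this step
};

// Naive all-pairs force computation: every particle tests every other
// particle, so the loop does O(n^2) work no matter how sparse the system is.
// Returns the number of ordered pairs that fell within the cutoff.
std::size_t apply_forces_naive(std::vector<Particle>& ps, double cutoff) {
    assert(cutoff > 0.0);
    for (auto& p : ps) {
        p.ax = 0.0;
        p.ay = 0.0;
    }
    std::size_t interactions = 0;
    for (std::size_t i = 0; i < ps.size(); ++i) {
        for (std::size_t j = 0; j < ps.size(); ++j) {
            if (i == j) continue;
            const double dx = ps[i].x - ps[j].x;  // points away from j
            const double dy = ps[i].y - ps[j].y;
            const double r2 = dx * dx + dy * dy;
            if (r2 > cutoff * cutoff || r2 == 0.0) continue;
            // Placeholder repulsion (assumption): (dx, dy) / r^2 has magnitude 1/r.
            ps[i].ax += dx / r2;
            ps[i].ay += dy / r2;
            ++interactions;
        }
    }
    return interactions;
}
```

With a small cutoff, most of the n^2 distance tests in this loop reject the pair immediately, which is exactly the wasted work a better algorithm avoids.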

However, in our simulation, we have chosen a density of particles sufficiently low so that with n particles, we expect only O(n) interactions.  An efficient implementation can reach this time complexity.
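One common way to reach O(n) is spatial binning: sort particles into a grid of cells at least as wide as the cutoff, then compare each particle only against its own cell and the eight neighboring cells. The serial sketch below shows the idea under stated assumptions: the Particle struct, the function name, and the square domain [0, size) x [0, size) are all hypothetical, and it only counts interacting pairs rather than computing forces.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical particle record (assumption); the real type lives in common.h.
struct Particle {
    double x, y;  // position inside [0, size) x [0, size)
};

// Counts ordered pairs within `cutoff` using a uniform grid. With bounded
// density each cell holds O(1) particles, so the total work is O(n).
std::size_t count_interactions_binned(const std::vector<Particle>& ps,
                                      double size, double cutoff) {
    assert(cutoff > 0.0 && size >= cutoff);
    const int nbins = static_cast<int>(size / cutoff);  // cells per side, >= 1
    const double cell = size / nbins;                    // cell side >= cutoff
    std::vector<std::vector<std::size_t>> bins(
        static_cast<std::size_t>(nbins) * nbins);

    // Map a coordinate to a cell index, clamped to the grid.
    auto cell_of = [&](double v) {
        const int c = static_cast<int>(v / cell);
        return c < 0 ? 0 : (c >= nbins ? nbins - 1 : c);
    };

    // Pass 1: place every particle index into its cell.
    for (std::size_t i = 0; i < ps.size(); ++i)
        bins[cell_of(ps[i].y) * nbins + cell_of(ps[i].x)].push_back(i);

    // Pass 2: test each particle against its 3x3 block of cells only.
    std::size_t interactions = 0;
    for (std::size_t i = 0; i < ps.size(); ++i) {
        const int cx = cell_of(ps[i].x), cy = cell_of(ps[i].y);
        for (int ny = cy - 1; ny <= cy + 1; ++ny) {
            for (int nx = cx - 1; nx <= cx + 1; ++nx) {
                if (nx < 0 || ny < 0 || nx >= nbins || ny >= nbins) continue;
                for (std::size_t j : bins[ny * nbins + nx]) {
                    if (j == i) continue;
                    const double dx = ps[i].x - ps[j].x;
                    const double dy = ps[i].y - ps[j].y;
                    if (dx * dx + dy * dy <= cutoff * cutoff) ++interactions;
                }
            }
        }
    }
    return interactions;
}
```

Because the cell side is never smaller than the cutoff, any two particles within the cutoff of each other must sit in the same or adjacent cells, so the 3x3 search misses no interaction. On a GPU the nested vectors would need to be replaced by flat arrays, but the binning idea carries over.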

Parallel Speedup

Suppose we have a code that runs in time T = O(n) on a single processor. Then we'd hope to run in time close to T/p when using p processors.  You will attempt to reach this speedup with a GPU.
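When you measure your own runs, the T/p ideal can be checked with two derived numbers: speedup (serial time divided by parallel time) and efficiency (speedup divided by p). The helper below is a small sketch; the Scaling struct and measure_scaling name are hypothetical, and what counts as p on a GPU is itself a choice you would need to state.

```cpp
#include <cassert>

// Hypothetical result type for illustration.
struct Scaling {
    double speedup;     // t1 / tp; the ideal value is p
    double efficiency;  // speedup / p; the ideal value is 1.0
};

// t1: measured time on one processor; tp: measured time on p processors.
Scaling measure_scaling(double t1, double tp, int p) {
    assert(t1 > 0.0 && tp > 0.0 && p > 0);
    const double s = t1 / tp;
    return {s, s / p};
}
```

For instance, a run that takes 8 seconds serially and 2 seconds on 8 processors has a speedup of 4 and an efficiency of 0.5, half of the T/p ideal.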

For Remote Students

Getting Set Up

The starter code is available on GitHub at https://github.com/Berkeley-CS267/hw2-3 and should work out of the box.  To get started, we recommend you log in to Perlmutter and clone the starter code. This will look something like the following:

student@local:~> ssh demmel@perlmutter-p1.nersc.gov

student@perlmutter:login005:~> git clone https://github.com/Berkeley-CS267/hw2-3

student@perlmutter:login005:~> cd hw2-3

student@perlmutter:login005:~/hw2-3> ls

CMakeLists.txt common.h job-gpu main.cu gpu.cu

There are five files in the base repository. Their purposes are as follows:

CMakeLists.txt

The build system that manages compiling your code.

main.cu

A driver program that runs your code.

common.h

A header file with shared declarations.

job-gpu

A sample job script to run the gpu executable.

gpu.cu - - - You may modify this file.

A skeleton file where you will implement your GPU simulation algorithm. Your job is to write the algorithm within the simulate_one_step function.

Please do not modify any of the files besides gpu.cu.

Building our Code

First, we need to make sure that both the CMake module and the CUDA module are loaded. The CMake module should already be loaded, but we recommend running cmake --version as a sanity check.

student@perlmutter:login005:~/hw2-3> cmake --version

cmake version 3.20.4

If you need to load any modules by hand, put those commands in your ~/.bash_profile file to avoid typing them every time you log in.

Next, let's build the code. CMake prefers out-of-tree builds, so we start by creating a build directory.

student@perlmutter:login005:~/hw2-3> mkdir build

student@perlmutter:login005:~/hw2-3> cd build


Next, we have to configure our build. We can either build our code in Debug mode or Release mode. In debug mode, optimizations are disabled and debug symbols are embedded in the binary for easier debugging with GDB. In release mode, optimizations are enabled, and debug symbols are omitted. For example:

student@perlmutter:login005:~/hw2-3/build> cmake -DCMAKE_BUILD_TYPE=Release ..

-- The C compiler identification is GNU 8.3.0


-- Configuring done

-- Generating done

-- Build files have been written to: /global/homes/s/student/hw2-3/build

Once our build is configured, we may actually execute the build:

student@perlmutter:login005:~/hw2-3/build> make

Scanning dependencies of target gpu


student@perlmutter:login005:~/hw2-3/build> ls

CMakeCache.txt  CMakeFiles  cmake_install.cmake  Makefile  gpu job-gpu

We now have a binary (gpu) and a job script (job-gpu). You must run your code through the job-gpu script; you cannot (correctly) run the gpu binary on the login nodes.

Important notes for Performance:

The scripts we provide use a small number of particles (1000) so that the O(n^2) algorithm can finish executing. You should test your final code with much larger values (50000-1000000) to get a clearer picture of its performance.


We will grade your assignment by reviewing your assignment write-up, measuring the scaling of the implementation, and benchmarking your code's raw performance. To benchmark your code, we will compile it with the exact process detailed above, with the CUDA compiler. We will run your submissions on Perlmutter's A100 GPUs.

Submission Details

Supposing your custom group name is XYZ, follow these steps to create an appropriate submission archive:

student@perlmutter:login005:~/hw2-3/build> cmake -DGROUP_NAME=XYZ ..

student@perlmutter:login005:~/hw2-3/build> make package

This second command will fail if the PDF is not present.

student@perlmutter:login005:~/hw2-3/build> tar tfz cs267XYZ_hw2_3.tar.gz 



Write-up Details


