Homework 2 (Part 3):

Parallelizing a Particle Simulation

Overview

This assignment is an introduction to parallel programming using GPUs. Most of this page will be similar to the HW 2-1 page.

In this assignment, we will be parallelizing a toy particle simulation (similar simulations are used in mechanics, biology, and astronomy). In our simulation, particles interact by repelling one another. A run of our simulation is shown here:

[Animation: the particles repel one another, but only when closer than a cutoff distance, highlighted around one particle in grey.]

Asymptotic Complexity

Serial Solution Time Complexity

If we were to naively compute the forces on the particles by iterating through every pair of particles, then we would expect the asymptotic complexity of our simulation to be O(n^2).

However, in our simulation, we have chosen a density of particles sufficiently low so that with n particles, we expect only O(n) interactions. An efficient implementation can reach this time complexity.
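To see why: each particle interacts only with particles inside its cutoff radius r, so if the density is held constant as n grows, the expected number of neighbors per particle is roughly density × πr^2 = O(1), and the total number of interactions is n × O(1) = O(n). Reaching this bound requires a data structure, for example binning particles into cells of side length at least r, so that each particle examines only nearby candidates instead of all n particles.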

Parallel Speedup

Suppose we have a code that runs in time T = O(n) on a single processor. Then we would hope to run in time close to T/p when using p processors. You will attempt to reach this speedup with a GPU.

For Remote Students

Dear remote students, we are thrilled to be a part of your parallel computing learning experience and to share these resources with you! To avoid confusion, please note that the assignment instructions, deadlines, and other assignment details posted here were designed for the local students. You should check with your local instruction team about submission, deadlines, job-running details, etc. and utilize Moodle for questions. With that in mind, the problem statement, source code, and references should still help you get started (just beware of institution-specific instructions). Best of luck and we hope you enjoy the assignment!

Due Date: Thursday, March 19th (11:59 PM PST)

Instructions

Teams

You must use the same groups as with HW2-1. If this is a problem, please privately contact the GSIs.

Getting Connected to Bridges

This part of Homework 2 will be run on the Bridges supercomputer at the Pittsburgh Supercomputing Center. To connect, you will need to create an account with XSEDE, request to have your account added to our allocation, and finally set up a password:


  1. Make an account at portal.xsede.org.
  2. Tell us your new account username by filling out this form.
  3. Once your account is approved, set your PSC password so that you can log in to Bridges.

After that, you can log in via SSH with the following command (substituting your own username): ssh -p 2222 username@bridges.psc.xsede.org

Getting Set Up

The starter code is available on Bitbucket at https://bitbucket.org/Berkeley-CS267/hw2-3.git and should work out of the box. To get started, we recommend you log in to Bridges and download the assignment. This will look something like the following:

student@local:~> ssh -p 2222 demmel@bridges.psc.xsede.org
student@login005:~> git clone https://bitbucket.org/Berkeley-CS267/hw2-3.git
student@login005:~> cd hw2-3
student@login005:~/hw2-3> ls
CMakeLists.txt common.h job-gpu main.cu gpu.cu

There are five files in the base repository. Their purposes are as follows:

CMakeLists.txt

The build system that manages compiling your code.

main.cu

A driver program that runs your code.

common.h

A header file with shared declarations.

job-gpu

A sample job script to run the gpu executable.

gpu.cu (you may modify this file)

A skeleton file where you will implement your GPU simulation algorithm. It is your job to write an algorithm within the simulate_one_step function (a rough sketch of its shape is given after this file list).

Please do not modify any of the files besides gpu.cu.
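For orientation, here is a minimal sketch of the shape a first, naive O(n^2) version of simulate_one_step might take. The particle_t fields, the apply_force_gpu and move_gpu helpers, and the NUM_THREADS constant are assumptions modeled on the HW2-1 interface; check common.h and the starter gpu.cu for the actual declarations.

__global__ void compute_forces_gpu(particle_t* parts, int num_parts) {
    // One thread per particle
    int tid = blockIdx.x * blockDim.x + threadIdx.x;
    if (tid >= num_parts)
        return;

    parts[tid].ax = parts[tid].ay = 0;             // assumed acceleration fields
    for (int j = 0; j < num_parts; j++)
        apply_force_gpu(parts[tid], parts[j]);     // assumed __device__ helper
}

void simulate_one_step(particle_t* parts, int num_parts, double size) {
    int blks = (num_parts + NUM_THREADS - 1) / NUM_THREADS;
    compute_forces_gpu<<<blks, NUM_THREADS>>>(parts, num_parts);
    move_gpu<<<blks, NUM_THREADS>>>(parts, num_parts, size);  // assumed kernel that integrates positions
}

A naive version like this does O(n^2) work per step; the goal of the assignment is to restructure it (for example, with binning) to reach the O(n) behavior discussed above.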

Building our Code

First, we need to make sure that both the CMake and CUDA modules are loaded. NOTE: CMake version 3.11.4 is required.

student@login005:~/hw2-3> module load cmake/3.11.4
student@login005:~/hw2-3> module load cuda

You should put these commands in your ~/.bash_profile.ext file to avoid typing them every time you log in.

Next, let's build the code. CMake prefers out of tree builds, so we start by creating a build directory.

student@login005:~/hw2-3> mkdir build
student@login005:~/hw2-3> cd build
student@login005:~/hw2-3/build>

Next, we have to configure our build. We can either build our code in Debug mode or Release mode. In Debug mode, optimizations are disabled and debug symbols are embedded in the binary for easier debugging with GDB. In Release mode, optimizations are enabled and debug symbols are omitted. For example:

student@login005:~/hw2-3/build> cmake -DCMAKE_BUILD_TYPE=Release ..
-- The C compiler identification is GNU 8.3.0
...
-- Configuring done
-- Generating done
-- Build files have been written to: /global/homes/s/student/hw2-3/build
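If you instead want a Debug build (for example, for use with GDB or cuda-gdb), configure with:

student@login005:~/hw2-3/build> cmake -DCMAKE_BUILD_TYPE=Debug ..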

Once our build is configured, we may actually execute the build:

student@login005:~/hw2-3/build> make
Scanning dependencies of target gpu
...
student@login005:~/hw2-3/build> ls
CMakeCache.txt  CMakeFiles  cmake_install.cmake  Makefile  gpu job-gpu

We now have a binary (gpu) and a job script (job-gpu). You must run via the job-gpu script; you cannot (correctly) run the gpu binary on the login nodes.
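As a quick reminder, submitting and monitoring a job looks like the following (the job ID shown is illustrative):

student@login005:~/hw2-3/build> sbatch job-gpu
Submitted batch job 1234567
student@login005:~/hw2-3/build> squeue -u $USER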

For info on running jobs and editing the code, refer to the HW1 page (sbatch works the same here).

For info on running the simulation program, refer to the HW2-1 page.

Important Notes for Performance

While the provided scripts use a small number of particles (1,000) so that the O(n^2) algorithm can finish execution, your final code should be tested with much larger values (50,000 to 1,000,000) to better see its performance.
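For example, assuming the driver accepts the same -n particle-count flag as the HW2-1 codes (check main.cu or the HW2-1 page for the exact options), you can edit the run line in job-gpu to something like:

./gpu -n 100000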

Grading

We will grade your assignment by reviewing your write-up, measuring the scaling of your implementation, and benchmarking your code's raw performance. To benchmark your code, we will compile it with the exact process detailed above, using the CUDA compiler. We will run your submissions on Bridges' P100 GPUs.

Submission Details

Supposing your custom group name is XYZ, follow these steps to create an appropriate submission archive:

  • Ensure that your write-up is located in your source directory. It should be named cs267XYZ_hw2_3.pdf
  • From your build directory, run:
student@login005:~/hw2-3/build> cmake -DGROUP_NAME=XYZ ..
student@login005:~/hw2-3/build> make package

This second command will fail if the PDF is not present.

  • Confirm that it worked using the following command. You should see output like:
student@login005:~/hw2-3/build> tar tfz cs267XYZ_hw2_3.tar.gz 
cs267XYZ_hw2_3/cs267XYZ_hw2_3.pdf
cs267XYZ_hw2_3/gpu.cu
  • Submit your .tar.gz through bCourses.

Write-up Details

  • Your write-up should contain:
    • The names of the people in your group.
    • Each member's contribution.
    • A log-log scale plot showing your parallel code's performance, and a description of the data structures that you used to achieve it.
    • A description of the synchronization you used in the GPU implementation.
    • A description of the design choices that you tried and how they affected performance.
    • Please focus on parallelizing your code using only the GPU, i.e., only one CPU core and one CPU thread.
      • You should benchmark your GPU implementation against the starter code, varying the number of particles and comparing both to linear behavior.
      • You should pick a fixed number of nodes/threads for your OpenMP and MPI codes (perhaps the setting with the best performance) and compare them to your GPU code as the number of particles increases.
    • You should break down the runtime into computation time, synchronization time, and/or communication time, and explain how each scales with the number of particles (one way to measure this is sketched below).
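One way to collect such a breakdown is to bracket each kernel launch with CUDA events; below is a minimal sketch, reusing the hypothetical kernel and launch names from the earlier gpu.cu sketch:

// Time one phase of the step with CUDA events (kernel/variable names are assumptions)
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);

cudaEventRecord(start);
compute_forces_gpu<<<blks, NUM_THREADS>>>(parts, num_parts);  // phase being timed
cudaEventRecord(stop);
cudaEventSynchronize(stop);          // wait for the kernel to finish

float ms = 0;
cudaEventElapsedTime(&ms, start, stop);  // elapsed time in milliseconds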

Notes:

  • Your grade will mostly depend on three factors:
    • Scaling sustained by your codes on the Bridges supercomputer (varying n).
    • Performance sustained by your codes on the Bridges supercomputer.
    • Explanations of your methodologies and the performance features you observed (including what didn't work).
  • You must use the CUDA Compiler for this assignment. If your code does not compile and run with CUDA, it will not be graded.
  • If your code produces incorrect results, it will not be graded.
  • You must target Bridges' P100 GPUs for this assignment.

For info on running the rendering output, refer to the HW2-1 page.

For info on checking output correctness, refer to the HW2-1 page.

Resources