This assignment is an introduction to parallel programming using shared memory and distributed memory programming models.
In this assignment, we will parallelize a toy particle simulation (similar simulations are used in mechanics, biology, and astronomy). In our simulation, particles interact by repelling one another, but only when they are closer than a cutoff distance, which is highlighted in grey around one particle in the animation of a sample run shown here.
If we were to naively compute the forces on the particles by iterating through every pair of particles, then we would expect the asymptotic complexity of our simulation to be O(n^2).
However, in our simulation, we have chosen a density of particles sufficiently low so that with n particles, we expect only O(n) interactions. An efficient implementation can reach this time complexity. The first part of your assignment will be to implement this linear time solution in a serial code, given a naive O(n^2) implementation.
Suppose we have a code that runs in time T = O(n) on a single processor. Then we'd hope to run close to time T/p when using p processors. After implementing an efficient serial O(n) solution, you will attempt to reach this speedup using a number of different programming models.
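One standard way to reach O(n) (not the only acceptable approach, and not necessarily how your code must be organized) is to bin particles into a uniform grid of cells whose side length is at least the cutoff, so each particle only needs to be compared against particles in its own and the eight neighboring cells. A minimal serial sketch is below; particle_t, cutoff, and the force routine are illustrative stand-ins that are assumed to resemble, but may not match, the definitions in the starter code's common.h.

// Minimal sketch of O(n) force computation via spatial binning.
// particle_t, cutoff, and apply_force below are illustrative stand-ins for the
// starter code's versions (common.h), whose signatures may differ.
#include <algorithm>
#include <cmath>
#include <vector>

struct particle_t { double x, y, vx, vy, ax, ay; };
const double cutoff = 0.01;  // placeholder value; use the constant from the starter code

void apply_force(particle_t& p, const particle_t& neighbor) {
    double dx = neighbor.x - p.x, dy = neighbor.y - p.y;
    double r2 = dx * dx + dy * dy;
    if (r2 > cutoff * cutoff || r2 == 0) return;  // only nearby, distinct particles interact
    double r = std::sqrt(r2);
    double coef = (1 - cutoff / r) / r2;          // short-range repulsion (mass omitted)
    p.ax += coef * dx;
    p.ay += coef * dy;
}

void compute_forces_binned(particle_t* parts, int n, double size) {
    int nbins = std::max(1, (int)(size / cutoff));  // nbins x nbins cells, each >= cutoff wide
    double bin_size = size / nbins;

    // Assign each particle index to its cell: O(n) total work.
    std::vector<std::vector<int>> bins(nbins * nbins);
    for (int i = 0; i < n; i++) {
        int bx = std::min(nbins - 1, (int)(parts[i].x / bin_size));
        int by = std::min(nbins - 1, (int)(parts[i].y / bin_size));
        bins[by * nbins + bx].push_back(i);
    }

    // With O(1) particles per cell on average, each particle does O(1) work.
    for (int i = 0; i < n; i++) {
        parts[i].ax = parts[i].ay = 0;
        int bx = std::min(nbins - 1, (int)(parts[i].x / bin_size));
        int by = std::min(nbins - 1, (int)(parts[i].y / bin_size));
        for (int dy = -1; dy <= 1; dy++)
            for (int dx = -1; dx <= 1; dx++) {
                int nx = bx + dx, ny = by + dy;
                if (nx < 0 || nx >= nbins || ny < 0 || ny >= nbins) continue;
                for (int j : bins[ny * nbins + nx])
                    apply_force(parts[i], parts[j]);
            }
    }
}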
Dear remote students, we are thrilled to be a part of your parallel computing learning experience and to share these resources with you! To avoid confusion, please note that the assignment instructions, deadlines, and other assignment details posted here were designed for the local students. You should check with your local instruction team about submission, deadlines, job-running details, etc. and utilize Moodle for questions. With that in mind, the problem statement, source code, and references should still help you get started (just beware of institution-specific instructions). Best of luck and we hope you enjoy the assignment!
Starter code for parts one and two is available here: http://cs.berkeley.edu/~brock/cs267/hw/cs267_hw2_2018_src.tar.gz
In this assignment, you will write two versions of our simulation. First, you will write an O(n) serial code. Then, you will write a parallel version of this O(n) code for shared memory architectures using OpenMP.
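For the shared-memory version, one common approach (a sketch that assumes you keep a binned force loop like the serial sketch above; the helper names below are hypothetical and not part of the starter code) is to parallelize the per-particle loops with OpenMP, since each iteration writes only to its own particle:

// Sketch of OpenMP worksharing over the per-particle loops.
// build_bins() and apply_binned_forces() are hypothetical helpers wrapping the
// serial logic sketched above; move() is assumed to come from the starter code.
#include <omp.h>

void simulate_step_omp(particle_t* parts, int n, double size) {
    // Rebuilding the bins each step can be done serially, or in parallel with
    // per-thread buffers merged afterwards (a shared push_back would be a data race).
    build_bins(parts, n, size);

    #pragma omp parallel
    {
        // Forces: iteration i writes only parts[i].ax / parts[i].ay,
        // so no locks are needed here.
        #pragma omp for
        for (int i = 0; i < n; i++)
            apply_binned_forces(parts, i);

        // Moves: again each thread updates a disjoint set of particles.
        #pragma omp for
        for (int i = 0; i < n; i++)
            move(parts[i]);
    }
}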
You need to submit two executables and one short report. At a minimum, you must write one serial code (serial.cpp) that runs in O(n) time and one shared-memory implementation (openmp.cpp) using OpenMP.
Your submission should be a single <teamname>_hw2.tar.gz that, when unzipped, gives us a folder named <teamname>_hw2. In this folder there should be one subfolder, part1, for source files, a report file named report.pdf, and a text file named members.txt containing your teammates' names, one line for each member.
For example, profs_hw2.tar.gz should have
profs_hw2 (a folder)
|--part1 (a folder)
|----source files for part 1
|----Makefile with make targets serial, openmp
|--report.pdf
|--members.txt
And members.txt should contain a line for each member and nothing else, for example,
Aydin Buluc
James Demmel
Kathy Yelick
We need to be able to build and execute your implementations for you to receive credit. Spell out in your report what Makefile targets we are to build for the different parts of your report.
In addition to the files you will write, the starter code provides the following supporting files:
common.cpp, common.h ; an implementation of common functionality, such as I/O, numerics and timing
autograder.cpp ; a code that helps calculate performance for both serial and parallel implementations
Makefile ; a makefile that should work on all NERSC machines using slurm if you uncomment appropriate lines
A simple correctness check is provided, which tracks the smallest distance between any two particles over the entire simulation. In a correct simulation, particles stay farther apart than 0.4 (of the cutoff), with typical values between 0.7 and 0.8. In a simulation where particles don't interact correctly, particles come closer than 0.4 (of the cutoff), with typical values between 0.01 and 0.05. More details, as well as an average distance metric, are described in the source file.
The code we are providing performs this distance check based on the calls to the interaction function, but the full autograder does the checks based on the output txt file. We recommend keeping the correctness checker in your code (for the OpenMP and MPI codes the overhead isn't significant), but depending on your performance goals it can be removed as long as the simulation can still output a correct txt file.
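For intuition, the check boils down to recording the smallest pairwise separation, in units of the cutoff, seen over every interaction; a hypothetical illustration (the starter code's actual bookkeeping lives in its interaction function and in the autograder) looks like:

// Hypothetical illustration of the min-distance check (not the starter code's API).
double min_ratio = 1e9;  // smallest observed distance / cutoff

void record_interaction(double r, double cutoff) {
    if (r > 0 && r / cutoff < min_ratio)
        min_ratio = r / cutoff;
    // End of run: min_ratio above 0.4 (typically 0.7-0.8) suggests correct repulsion;
    // below 0.4 (typically 0.01-0.05) suggests particles are passing through each other.
}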
The performance of your code is determined by doing multiple runs of the simulation with increasing particle counts and comparing the performance against the provided autograder.cpp. This can be done automatically with the auto-* scripts.
There will be two types of scaling tested for your parallel codes: strong scaling, where the number of particles is held fixed as processors are added, and weak scaling, where the number of particles grows proportionally with the number of processors.
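As a reminder of what these measure (standard definitions, not assignment-specific requirements): with T(p) the run time on p processors, strong scaling looks at the speedup T(1)/T(p) for a fixed number of particles, which is ideally close to p, while weak scaling looks at the efficiency T(1)/T(p) as the particle count grows with p, which is ideally close to 1.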
For more details on the options provided by each executable, use the -h flag on the compiled executables.
While the scripts we are providing use small numbers of particles (500-1000) to allow the O(n^2) algorithm to finish execution, the final codes should be tested with values 100 times larger (50000-100000) to better see their performance.
Don't forget to use the no-output (-no) option for the actual timing runs you use in these calculations and plots.
You are responsible for finding a group. You may work in groups of 2 or 3. If you choose to work in a group of 3, then one person in your group should be a non-EECS/CS student; 2-person teams have no such restriction. After you have chosen a group, please self-sign up for a group on bCourses.
There is one executable (mpi.cpp) you need to submit, along with a short report. You need to create one distributed-memory implementation using MPI that runs in O(n) time and, ideally, scales as O(n/p).
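One common way to structure the MPI code (one option among several; the names below are illustrative and not part of the starter code) is a 1-D domain decomposition: each rank owns a horizontal strip of the domain, exchanges "ghost" copies of the particles within one cutoff of its strip boundaries with its two neighbors before each force computation, and migrates particles that cross a boundary after the move. A sketch of the ghost exchange, assuming the particle_t struct from the earlier sketch:

// Sketch of the ghost exchange for a 1-D strip decomposition (illustrative names).
// Particles are sent as raw bytes for brevity; a derived MPI datatype built with
// MPI_Type_create_struct is the more careful option.
#include <mpi.h>
#include <vector>

void exchange_ghosts(std::vector<particle_t>& mine,
                     std::vector<particle_t>& ghosts,
                     double strip_lo, double strip_hi, double cutoff,
                     int rank, int nprocs, MPI_Comm comm) {
    ghosts.clear();
    int up   = (rank + 1 < nprocs) ? rank + 1 : MPI_PROC_NULL;
    int down = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;

    // Particles within one cutoff of a boundary must be visible to that neighbor.
    std::vector<particle_t> to_up, to_down;
    for (const particle_t& p : mine) {
        if (p.y > strip_hi - cutoff) to_up.push_back(p);
        if (p.y < strip_lo + cutoff) to_down.push_back(p);
    }

    for (int dir = 0; dir < 2; dir++) {
        int dst = (dir == 0) ? up : down;
        int src = (dir == 0) ? down : up;
        std::vector<particle_t>& out = (dir == 0) ? to_up : to_down;

        // Exchange counts first, then the particle payloads.
        int nsend = (int)out.size(), nrecv = 0;
        MPI_Sendrecv(&nsend, 1, MPI_INT, dst, 0,
                     &nrecv, 1, MPI_INT, src, 0, comm, MPI_STATUS_IGNORE);
        std::vector<particle_t> in(nrecv);
        MPI_Sendrecv(out.data(), nsend * (int)sizeof(particle_t), MPI_BYTE, dst, 1,
                     in.data(),  nrecv * (int)sizeof(particle_t), MPI_BYTE, src, 1,
                     comm, MPI_STATUS_IGNORE);
        ghosts.insert(ghosts.end(), in.begin(), in.end());
    }
    // After forces and moves, particles whose new y leaves [strip_lo, strip_hi)
    // must be migrated to the neighboring rank in the same fashion.
}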
Your submission should be a single <teamname>_hw2.tar.gz that, when unzipped, gives us a folder named <teamname>_hw2. In this folder there should be one subfolder named part1, a report file named report.pdf, and a text file named members.txt containing your teammates' names, one line for each member.
For example, gsi_hw2.tar.gz should have
gsi_hw2 (a folder)
|--part1 (a folder)
|----source files for serial, openmp, mpi
|----Makefile with make targets serial, openmp, mpi
|--report.pdf
|--members.txt
And members.txt should contain a line for each member and nothing else, for example:
Benjamin Brock
Aditya Devarakonda
Grace Dinh
We need to be able to build and execute your implementations for you to receive credit. Spell out in your report what Makefile targets we are to build for the different parts of your report.
Here are some items you might add to your report:
You will require access to XSEDE resources, specifically the Pittsburgh Supercomputing Center's (PSC) Bridges cluster, in order to complete Part 3.
You should now have access to Pittsburgh Supercomputing Center's Bridges cluster. We will be using the Bridges Regular Shared Memory GPU partition (16 nodes) with each node equipped with 2 Intel Haswell processors and 2 Nvidia K80 GPUs. You will be using only 1 K80 GPU which means that compute node CPUs will be shared between 2 jobs (since there are 2 GPUs per node). Refer to the Bridges User Guide for how to ssh into the system, how to compile CUDA code, and how to submit jobs, etc.
Starter code for part 3 is available here: https://people.eecs.berkeley.edu/~brock/cs267/hw/xsede_hw3_gpu_2019.tar.gz
We have provided a naive O(n^2) GPU implementation, similar to the OpenMP and MPI codes listed above. It will be your task to make the necessary algorithmic changes to obtain an O(n) GPU implementation, along with other machine optimizations, to achieve favorable performance across a range of problem sizes.
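At a high level, the same binning idea carries over to the GPU. One common structure (a sketch only, using illustrative names and a CSR-style bin layout that is not part of the starter code) is to bin particle indices on the device and then launch one thread per particle that scans the 3x3 neighborhood of cells:

// Sketch of an O(n) GPU force kernel (illustrative, not the starter code's structure).
// Assumes particle indices have already been binned into a CSR-like layout:
// bin_start[c] .. bin_start[c+1] give the range of bin_ids entries for cell c.
__global__ void compute_forces_gpu(particle_t* parts, int n,
                                   const int* bin_start, const int* bin_ids,
                                   int nbins, double bin_size) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;   // one thread per particle
    if (i >= n) return;

    parts[i].ax = parts[i].ay = 0;
    int bx = min(nbins - 1, (int)(parts[i].x / bin_size));
    int by = min(nbins - 1, (int)(parts[i].y / bin_size));

    // Scan only the 3x3 neighborhood of cells around particle i.
    for (int dy = -1; dy <= 1; dy++)
        for (int dx = -1; dx <= 1; dx++) {
            int nx = bx + dx, ny = by + dy;
            if (nx < 0 || nx >= nbins || ny < 0 || ny >= nbins) continue;
            int c = ny * nbins + nx;
            for (int k = bin_start[c]; k < bin_start[c + 1]; k++)
                apply_force_gpu(parts[i], parts[bin_ids[k]]);  // device force routine,
                // e.g. factored out of whatever the provided naive kernel does
        }
}
// Launched as, for example: compute_forces_gpu<<<(n + 255) / 256, 256>>>(...)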
A simple correctness check which computes the minimal distance between any two particles during the entire simulation is provided. A correct simulation will have particles stay farther apart than 0.4 (of the cutoff), with typical values between 0.7 and 0.8. A simulation where particles don't interact correctly will have values less than 0.4 (of the cutoff), with typical values between 0.01 and 0.05.
Adding these checks inside the GPU code introduces too much overhead, so an autocorrect executable is provided that checks the output txt file for the values mentioned above. We also provide a clean O(n^2) serial implementation, serial.cu.
Submit your report and source codes to bCourses.
Your submission should be a single <teamname>_hw2_gpu.tar.gz that, when unzipped, gives us a folder named <teamname>_hw2_gpu. In this folder there should be the source files and Makefile for part 3, a report file named report.pdf, and a text file named members.txt containing your teammates' names, one line for each member.
For example, gsi_hw2_gpu.tar.gz should have
gsi_hw2_gpu (a folder)
|--source files for part 3
|--Makefile with make target gpu
|--report.pdf
|--members.txt
And members.txt should contain a line for each member and nothing else, for example,
Benjamin Brock
Aditya Devarakonda
Grace Dinh
We need to be able to build and execute your implementations for you to receive credit. Spell out in your report what Makefile targets we are to build for the different parts of your report.
Please include a section in your report detailing your GPU implementation, as well as its performance over varying numbers of particles. Here is the list of items you might show in your report: