Homework 2: Parallelizing a Particle Simulation

Overview

This assignment is an introduction to parallel programming using shared memory and distributed memory programming models.

In this assignment, we will be parallelizing a toy particle simulation (similar simulations are used in mechanics, biology, and astronomy). In our simulation, particles interact by repelling one another. A run of our simulation is shown here:

The particles repel one another, but only when they are closer than a cutoff distance, which is highlighted in grey around one particle.

Asymptotic Complexity

Serial Solution Time Complexity

If we were to naively compute the forces on the particles by iterating through every pair of particles, then we would expect the asymptotic complexity of our simulation to be O(n^2).

However, in our simulation, we have chosen a density of particles sufficiently low so that with n particles, we expect only O(n) interactions. An efficient implementation can reach this time complexity. The first part of your assignment will be to implement this linear time solution in a serial code, given a naive O(n^2) implementation.
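
One common way to reach O(n) time, sketched below under assumed names rather than the starter code's actual API, is to bin particles into a uniform grid of cells that are at least one cutoff wide, so each particle only needs to be checked against particles in its own and the eight neighboring cells. The particle_t struct and apply_force below are simplified stand-ins for the versions in the provided common.h/common.cpp.

    // Sketch of an O(n) force pass using spatial binning (C++). The struct and the
    // force law are stand-ins so the sketch compiles on its own; use the starter
    // code's particle_t and apply_force in your actual implementation.
    #include <algorithm>
    #include <cmath>
    #include <vector>

    struct particle_t { double x, y, vx, vy, ax, ay; };

    // Stand-in repulsive interaction: zero outside the cutoff, pushes particles
    // apart inside it (the provided apply_force differs in detail).
    void apply_force(particle_t &p, const particle_t &neighbor, double cutoff) {
        double dx = neighbor.x - p.x, dy = neighbor.y - p.y;
        double r2 = dx * dx + dy * dy;
        if (r2 == 0 || r2 > cutoff * cutoff) return;       // self-interaction or out of range
        double r = std::sqrt(r2);
        double coef = (1.0 - cutoff / r) / r2;             // negative inside the cutoff: repulsion
        p.ax += coef * dx;
        p.ay += coef * dy;
    }

    void compute_forces_binned(particle_t *parts, int n, double size, double cutoff) {
        int nbins = std::max(1, (int)(size / cutoff));     // cells at least one cutoff wide
        double bin_size = size / nbins;
        auto bin_of = [&](double c) {                      // clamp boundary coordinates into range
            return std::min(nbins - 1, std::max(0, (int)(c / bin_size)));
        };

        // Pass 1: assign every particle to the cell containing it (O(n)).
        std::vector<std::vector<int>> bins(nbins * nbins);
        for (int i = 0; i < n; i++)
            bins[bin_of(parts[i].x) + bin_of(parts[i].y) * nbins].push_back(i);

        // Pass 2: each particle interacts only with its own and the 8 adjacent cells.
        for (int i = 0; i < n; i++) {
            parts[i].ax = parts[i].ay = 0;
            int bx = bin_of(parts[i].x), by = bin_of(parts[i].y);
            for (int dx = -1; dx <= 1; dx++)
                for (int dy = -1; dy <= 1; dy++) {
                    int nx = bx + dx, ny = by + dy;
                    if (nx < 0 || nx >= nbins || ny < 0 || ny >= nbins) continue;
                    for (int j : bins[nx + ny * nbins])
                        apply_force(parts[i], parts[j], cutoff);
                }
        }
    }

Because the density is fixed, each cell holds O(1) particles on average, so both passes do O(n) work per step; the same data structure carries over to the OpenMP, MPI, and GPU versions.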

Parallel Speedup

Suppose we have a code that runs in time T = O(n) on a single processor. Then we'd hope to run close to time T/p when using p processors. After implementing an efficient serial O(n) solution, you will attempt to reach this speedup using a number of different programming models.

For Remote Students

Dear remote students, we are thrilled to be a part of your parallel computing learning experience and to share these resources with you! To avoid confusion, please note that the assignment instructions, deadlines, and other assignment details posted here were designed for the local students. You should check with your local instruction team about submission, deadlines, job-running details, etc. and utilize Moodle for questions. With that in mind, the problem statement, source code, and references should still help you get started (just beware of institution-specific instructions). Best of luck and we hope you enjoy the assignment!

Starter Code

Starter code for parts one and two is available here: http://cs.berkeley.edu/~brock/cs267/hw/cs267_hw2_2018_src.tar.gz

Part 1: Serial and OpenMP

Due date: Thursday, February 21

In this assignment, you will write two versions of our simulation. First, you will write an O(n) serial code. Then, you will write a parallel version of this O(n) code for shared memory architectures using OpenMP.
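
A natural starting point for the shared-memory version, shown below as a sketch under the same assumed names as the serial sketch in the previous section (not the required design), is to keep the binned force computation and parallelize the per-particle loops with OpenMP worksharing. How you handle the bin rebuild, whether serially, with per-bin omp_lock_t locks, with atomics, or with per-thread bins that are merged afterwards, is exactly the kind of design and synchronization choice to discuss in your report.

    // Minimal OpenMP sketch of one simulation step (compile with -fopenmp).
    // particle_t and apply_force follow the serial sketch; move() stands in for
    // the integrator in common.cpp (all names assumed, not the starter code's API).
    #include <omp.h>
    #include <algorithm>
    #include <vector>

    struct particle_t { double x, y, vx, vy, ax, ay; };
    void apply_force(particle_t &p, const particle_t &neighbor, double cutoff);
    void move(particle_t &p);   // integrator provided by the starter code (assumed name)

    void simulate_step_omp(particle_t *parts, int n, double size, double cutoff) {
        int nbins = std::max(1, (int)(size / cutoff));
        double bin_size = size / nbins;
        auto bin_of = [&](double c) {
            return std::min(nbins - 1, std::max(0, (int)(c / bin_size)));
        };

        // Rebuild bins serially for simplicity; a parallel rebuild needs per-bin
        // locks (omp_lock_t), atomic updates, or per-thread bins merged afterwards.
        std::vector<std::vector<int>> bins(nbins * nbins);
        for (int i = 0; i < n; i++)
            bins[bin_of(parts[i].x) + bin_of(parts[i].y) * nbins].push_back(i);

        // Force pass: iteration i only writes parts[i].ax/ay, so no locks are needed here.
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++) {
            parts[i].ax = parts[i].ay = 0;
            int bx = bin_of(parts[i].x), by = bin_of(parts[i].y);
            for (int dx = -1; dx <= 1; dx++)
                for (int dy = -1; dy <= 1; dy++) {
                    int nx = bx + dx, ny = by + dy;
                    if (nx < 0 || nx >= nbins || ny < 0 || ny >= nbins) continue;
                    for (int j : bins[nx + ny * nbins])
                        apply_force(parts[i], parts[j], cutoff);
                }
        }

        // Move pass: also independent per particle.
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < n; i++)
            move(parts[i]);
    }

With this structure the only contended writes are the bin insertions, so the synchronization discussion in the report largely reduces to how the re-binning step is handled and what that costs as p grows.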

There are two executables and one short report you need to submit. You need to create, at a minimum, one serial code (serial.cpp) that runs in O(n) time and one shared-memory implementation (openmp.cpp) using OpenMP.

Your submission should be a single <teamname>_hw2.tar.gz that, when unzipped, gives us a folder named <teamname>_hw2. In this folder there should be one subfolder, part1, for source files, a report file named report.pdf, and a text file named members.txt containing your teammates' names, one line for each member.

For example, profs_hw2.tar.gz should have

profs_hw2 (a folder)
  |--part1 (a folder)
  |----source files for part 1
  |----Makefile with make targets serial, openmp
  |--report.pdf
  |--members.txt

And members.txt should contain a line for each member and nothing else, for example,

Aydin Buluc
James Demmel
Kathy Yelick

We need to be able to build and execute your implementations for you to receive credit. Spell out in your report which Makefile targets we should build for each part of your report.

Here are things you can show in your report:

  • A plot in log-log scale that shows that your serial and parallel codes run in O(n) time and a description of the data structures that you used to achieve it.
  • A description of the synchronization you used in the shared memory implementation.
  • A description of the design choices that you tried and how they affected the performance.
  • Speedup plots that show how closely your OpenMP code approaches the idealized p-times speedup and a discussion on whether it is possible to do better.
  • Where does the time go? Consider breaking down the runtime into computation time, synchronization time and/or communication time. How do they scale with p?
  • A discussion on using OpenMP.

Source Code

common.cpp, common.h ; an implementation of common functionality, such as I/O, numerics and timing

autograder.cpp ; a code that helps calculate performance for both serial and parallel implementations

Makefile ; a makefile that should work on all NERSC machines using slurm if you uncomment appropriate lines

Correctness and Performance

A simple correctness check, which checks the smallest distance between any two particles during the entire simulation, is provided. A correct simulation will have particles stay at greater than 0.4 (of the cutoff), with typical values between 0.7 and 0.8. A simulation where particles don't interact correctly will have particles closer than 0.4 (of the cutoff), with typical values between 0.01 and 0.05. More details, as well as an average distance, are described in the source file.

The code we are providing does this distance check based on the calls to the interaction function, but the full autograder does the checks based on the output txt file. We recommend keeping the correctness checker in your code (for the OpenMP and MPI codes the overhead isn't significant), but depending on your performance goals it can be removed, as long as the simulation can still output a correct txt file.

The performance of the code is determined by doing multiple runs of the simulation with increasing particle numbers and comparing the performance with the autograder.cpp provided. This can be done automatically with the auto-* scripts.

There will be two types of scaling that are tested for your parallel codes:

  • In strong scaling we keep the problem size constant but increase the number of processors.
  • In weak scaling we increase the problem size proportionally to the number of processors, so the work per processor stays the same (note that for the purposes of this assignment we will assume a linear scaling between work and processors); one common way to quantify both is given below.
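
One common way to quantify both kinds of scaling (a standard convention, not something the autograder checks): writing T_p(n) for the time to simulate n particles on p processors,

    % Scaling efficiencies; T_p(n) = time to simulate n particles on p processors.
    \[
      E_{\text{strong}}(p) = \frac{T_1(n)}{p \, T_p(n)},
      \qquad
      E_{\text{weak}}(p) = \frac{T_1(n)}{T_p(p \, n)}.
    \]

An efficiency close to 1 corresponds to the ideal p-times speedup discussed earlier; how far below 1 these fall as p grows is what the speedup plots in your report are meant to show.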

For more details on the options provided by each file you can use the -h flag on the compiled executables.

Important notes for Performance:

While the scripts we are providing use small numbers of particles (500-1000) to allow the O(n^2) algorithm to finish execution, the final codes should be tested with values 100 times larger (50000-100000) to better see their performance.

Don't forget to use the no-output option (-no) for the actual timing runs you use in these calculations and plots.

Homework 2 Team

You're responsible for finding a group. You may work in groups of 2 or 3. If you choose to work in a group of 3 people, then one person in your group should be a non-EECS/CS student. Two-person teams do not have this restriction. After you have chosen a group, please self-sign up for a group on bCourses.

Part 2: MPI

Due date: Thursday, March 7

There is one new executable (mpi.cpp) you need to submit, along with a short report. You need to create one distributed-memory implementation (MPI) that runs in O(n) time, ideally with O(n/p) scaling.
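
One design that can reach this scaling, sketched below under assumed names rather than the starter code's API, is a 1D domain decomposition: each rank owns a horizontal strip of the domain and, every step, exchanges only the ghost particles that lie within one cutoff of its strip boundaries with its two neighboring ranks, for example with nonblocking MPI_Isend/MPI_Irecv.

    // Sketch of a ghost-particle exchange for a 1D (row-strip) decomposition.
    // Compile with mpic++; the particle layout (6 doubles) matches the sketches
    // above and is an assumption about the starter code, not its actual API.
    #include <mpi.h>
    #include <vector>

    struct particle_t { double x, y, vx, vy, ax, ay; };

    // Exchange particles near the strip boundaries with the ranks above and below.
    // 'to_up'/'to_down' hold local particles within one cutoff of each boundary;
    // received ghosts are appended to 'ghosts'.
    void exchange_ghosts(const std::vector<particle_t> &to_up,
                         const std::vector<particle_t> &to_down,
                         std::vector<particle_t> &ghosts,
                         int rank, int nprocs, MPI_Datatype PARTICLE) {
        int up   = (rank > 0)          ? rank - 1 : MPI_PROC_NULL;
        int down = (rank < nprocs - 1) ? rank + 1 : MPI_PROC_NULL;

        // First exchange counts so each side knows how much to receive.
        int send_counts[2] = { (int)to_up.size(), (int)to_down.size() };
        int recv_counts[2] = { 0, 0 };
        MPI_Request reqs[4];
        MPI_Irecv(&recv_counts[0], 1, MPI_INT, up,   0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(&recv_counts[1], 1, MPI_INT, down, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(&send_counts[0], 1, MPI_INT, up,   0, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(&send_counts[1], 1, MPI_INT, down, 0, MPI_COMM_WORLD, &reqs[3]);
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        // Then exchange the particles themselves; nonblocking calls let the
        // messages in both directions overlap.
        std::vector<particle_t> from_up(recv_counts[0]), from_down(recv_counts[1]);
        MPI_Irecv(from_up.data(),   recv_counts[0], PARTICLE, up,   1, MPI_COMM_WORLD, &reqs[0]);
        MPI_Irecv(from_down.data(), recv_counts[1], PARTICLE, down, 1, MPI_COMM_WORLD, &reqs[1]);
        MPI_Isend(const_cast<particle_t*>(to_up.data()),   send_counts[0], PARTICLE, up,   1, MPI_COMM_WORLD, &reqs[2]);
        MPI_Isend(const_cast<particle_t*>(to_down.data()), send_counts[1], PARTICLE, down, 1, MPI_COMM_WORLD, &reqs[3]);
        MPI_Waitall(4, reqs, MPI_STATUSES_IGNORE);

        ghosts.insert(ghosts.end(), from_up.begin(),   from_up.end());
        ghosts.insert(ghosts.end(), from_down.begin(), from_down.end());
    }

    // Typical setup: MPI_Type_contiguous(6, MPI_DOUBLE, &PARTICLE); MPI_Type_commit(&PARTICLE);

Particles that cross a strip boundary have to be migrated to the neighboring rank in the same point-to-point fashion; as long as both the ghost exchange and the migration talk only to neighbors, per-rank work and message volume stay O(n/p).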

Your submission should be a single <teamname>_hw2.tar.gz that, when unzipped, gives us a folder named <teamname>_hw2. In this folder there should be one subfolder named "part1", a report file named report.pdf, and a text file named members.txt containing your teammates' names, one line for each member.

For example, gsi_hw2.tar.gz should have

gsi_hw2 (a folder)
  |--part1 (a folder)
  |----source files for serial, openmp, mpi 
  |----Makefile with make targets serial, openmp, mpi
  |--report.pdf
  |--members.txt

And members.txt should contain a line for each member and nothing else, for example:

Benjamin Brock
Aditya Devarakonda
Grace Dinh

We need to be able to build and execute your implementations for you to receive credit. Spell out in your report which Makefile targets we should build for each part of your report.

Here are some items you might add to your report:

  • A plot in log-linear scale that shows your performance as a percent of peak performance for different numbers of processors. You can use a tool like CrayPat to determine how many flops are performed for different values of n.
  • A description of the communication you used in the distributed memory implementation.
  • A description of the design choices that you tried and how they affected the performance.
  • Speedup plots that show how closely your MPI code approaches the idealized p-times speedup and a discussion on whether it is possible to do better.
  • Where does the time go? Consider breaking down the runtime into computation time, synchronization time and/or communication time. How do they scale with p?
  • A discussion on using MPI.

Part 3: GPU

Due Date: Wednesday, March 20

You will need access to XSEDE resources, specifically the Pittsburgh Supercomputing Center's (PSC) Bridges cluster, in order to complete Part 3.

  1. If you do not have an account, please create one here: https://portal.xsede.org/#/guest
  2. Please fill out this form so that we can give you access to our allocation.
  3. Once your XSEDE account has been created, set your PSC password: https://portal.xsede.org/psc-bridges#access

You should now have access to Pittsburgh Supercomputing Center's Bridges cluster. We will be using the Bridges Regular Shared Memory GPU partition (16 nodes), with each node equipped with 2 Intel Haswell processors and 2 Nvidia K80 GPUs. You will be using only 1 K80 GPU, which means that the compute node's CPUs will be shared between 2 jobs (since there are 2 GPUs per node). Refer to the Bridges User Guide for how to ssh into the system, compile CUDA code, submit jobs, etc.

Starter Code

Starter code for part 3 is available here: https://people.eecs.berkeley.edu/~brock/cs267/hw/xsede_hw3_gpu_2019.tar.gz

We have provided a naive O(n^2) GPU implementation, similar to the OpenMP and MPI codes listed above. It will be your task to make the necessary algorithmic changes to obtain an O(n) GPU implementation, and to apply other machine-specific optimizations, to achieve favorable performance across a range of problem sizes.

A simple correctness check which computes the minimal distance between any two particles during the entire simulation is provided. A correct simulation will have particles stay at greater than 0.4 (of the cutoff), with typical values between 0.7 and 0.8. In a simulation where particles don't interact correctly, this distance will be less than 0.4 (of the cutoff), with typical values between 0.01 and 0.05.

Adding the checks inside the GPU code introduces too much overhead, so an autocorrect executable is provided that checks the output txt file for the values mentioned above. We also provide a clean O(n^2) serial implementation, serial.cu.

Submission

Submit your report and source codes to bCourses.

Your submission should be a single <teamname>_hw2.tar.gz that, when unzipped, gives us a folder named <teamname>_hw2. In this folder there should be two subfolders, part1 and part2, for source files, a report file named report.pdf, and a text file named members.txt containing your teammates' names, one line for each member.

For example, gsi_hw2_gpu.tar.gz should have

gsi_hw2_gpu (a folder)
  |----source files for part 3
  |----Makefile with make target gpu
  |--report.pdf
  |--members.txt

And members.txt should contain a line for each member and nothing else, for example,

Benjamin Brock
Aditya Devarakonda
Grace Dinh

We need to be able to build and execute your implementations for you to receive credit. Spell out in your report which Makefile targets we should build for each part of your report.

Please include a section in your report detailing your GPU implementation, as well as its performance over varying numbers of particles. Here is the list of items you might show in your report:

  • A plot in log-log scale that shows the performance of your code versus the naive GPU code
  • A plot of the speedup of the GPU code versus the serial, openmp, mpi runs on the CPU of the node
  • A description of any synchronization needed
  • A description of any GPU-specific optimizations you tried
  • A discussion on the strengths and weaknesses of CUDA and the current GPU architecture


Resources

  • NERSC is running a half-day OpenMP tutorial on Feb 5
  • Programming in shared and distributed memory models are introduced in Lectures.
  • Shared memory implementations may require using locks that are available as omp_lock_t in OpenMP (requires omp.h) and pthread_mutex_t in pthreads (requires pthread.h).
  • You may consider using atomic operations such as __sync_lock_test_and_set with the GNU compiler. This syntax changes between compilers.
  • Distributed memory implementations may benefit from overlapping communication and computation, which is provided by nonblocking MPI routines such as MPI_Isend and MPI_Irecv.
  • Other useful resources: pthreads tutorial, OpenMP tutorial, OpenMP specifications and MPI specifications.
  • It can be very useful to use a performance measuring tool in this homework. Parallel profiling is a complicated business but there are a couple of tools that can help.
  • IPM is a profiling tool that is inserted into your link command in your Makefile (after you module load ipm), and instrumented versions of your MPI calls are put into your program for you.
  • TAU (Tuning and Analysis Utilities) is a source code instrumentation system to gather profiling information. You need module load tau to access these capabilities. This system can profile MPI, OpenMP and PThread code, and mixtures, but it has a learning curve.
  • HPCToolkit is a sampling profiler for parallel programs. You need to module load hpctoolkit. You can install hpcviewer on your own computer for offline analysis, or use the one on NERSC by using the NX client to get X windows displayed back to your own machine.
  • If you are using TAU or HPCToolkit you should run in your $SCRATCH directory which has faster disk access to the compute nodes (profilers can generate big profile files).
  • Hints on getting O(n) serial and Shared memory and MPI implementations (pdf)
  • NVIDIA CUDA Programming Guide v6.5 (Note that Stampede's default CUDA module is CUDA 5.5, and it only has up to CUDA 6.0, so some of the features you found in this guide might be unavailable.)
  • CUDA - Wikipedia
  • An Introduction to CUDA/OpenCL and Manycore Graphics Processors