Your task in this assignment is to write an optimized matrix multiplication function for NERSC's Cori supercomputer. We will give you a generic matrix multiplication code (also called matmul or dgemm), and it will be your job to tune our code to run efficiently on Cori's processors. We are asking you to write an optimized single-threaded matrix multiply kernel. This will run on only one core.
We consider a special case of matmul:
C := C + A*B
where A, B, and C are n x n matrices. This can be performed using 2n^3 floating point operations (n^3 adds, n^3 multiplies), as in the following pseudocode:
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    end
  end
end
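The pseudocode above translates almost line for line into C. The following is a minimal sketch, not the contents of dgemm-naive.c: the function name naive_dgemm is hypothetical, and it assumes the matrices are stored in column-major order, so that element (i,j) lives at index i + j*n.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical reference kernel: C := C + A*B for n x n matrices.
 * Assumption: column-major storage, so element (i,j) is at index i + j*n. */
void naive_dgemm(int n, const double *A, const double *B, double *C)
{
    assert(n >= 0);
    for (int i = 0; i < n; ++i) {
        for (int j = 0; j < n; ++j) {
            /* Start from the existing C(i,j), since we accumulate into C. */
            double cij = C[i + (size_t)j * n];
            for (int k = 0; k < n; ++k) {
                cij += A[i + (size_t)k * n] * B[k + (size_t)j * n];
            }
            C[i + (size_t)j * n] = cij;
        }
    }
}
```

Each of the n^2 entries of C needs n multiplies and n adds, which is where the 2n^3 operation count comes from.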
Dear Remote Students, we are thrilled to be a part of your parallel computing learning experience and to share these resources with you! To avoid confusion, please note that the assignment instructions, deadlines, and other assignment details posted here were designed for the local UC Berkeley students. You should check with your local instruction team about submission, deadlines, job-running details, etc. and utilize Moodle for questions. With that in mind, the problem statement, source code, and references should still help you get started (just beware of institution-specific instructions). Best of luck and we hope you enjoy the assignment!
Note that you will work in assigned teams for this assignment. See bCourses for your group assignments.
Please read through the Cori tutorial, available here: https://bitbucket.org/Berkeley-CS267/cori-tutorial/src/master/cori-tutorial.md
The starter code is available on Bitbucket at https://bitbucket.org/Berkeley-CS267/hw1.git and should work out of the box. To get started, we recommend you log in to Cori and download the first part of the assignment. This will look something like the following:
student@local:~> ssh student@cori.nersc.gov
student@cori04:~> git clone https://bitbucket.org/Berkeley-CS267/hw1.git
student@cori04:~> cd hw1
student@cori04:~/hw1> ls
CMakeLists.txt README.md benchmark.c dgemm-blas.c dgemm-blocked.c dgemm-naive.c job.in
There are seven files in the base repository. Their purposes are as follows:
CMakeLists.txt
The build system that manages compiling your code.
README.md
README file explaining the build system in more detail.
benchmark.c
A driver program that runs your code.
dgemm-blas.c
A wrapper which calls the vendor's optimized BLAS implementation of matrix multiply (here, MKL).
dgemm-blocked.c (this is the only file you may modify)
A simple blocked implementation of matrix multiply. It is your job to optimize the square_dgemm() function in this file.
dgemm-naive.c
For illustrative purposes, a naive implementation of matrix multiply using three nested loops.
job.in
Template job script that is filled in by the build system for each of blas, blocked, and naive.
Please do not modify any of the files besides dgemm-blocked.c.
First, we need to make sure that the CMake module is loaded and that the GNU compiler is selected.
student@cori04:~/hw1> module load cmake
student@cori04:~/hw1> module swap PrgEnv-intel PrgEnv-gnu
You should put these commands in your ~/.bash_profile.ext file to avoid typing them every time you log in.
Next, let's build the code. CMake prefers out-of-tree builds, so we start by creating a build directory.
student@cori04:~/hw1> mkdir build
student@cori04:~/hw1> cd build
student@cori04:~/hw1/build>
Next, we have to configure our build. We can build our code in either Debug mode or Release mode. In Debug mode, optimizations are disabled and debug symbols are embedded in the binary for easier debugging with GDB. In Release mode, optimizations are enabled and debug symbols are omitted. For example:
student@cori04:~/hw1/build> cmake -DCMAKE_BUILD_TYPE=Release ..
-- The C compiler identification is GNU 8.3.0
...
-- Configuring done
-- Generating done
-- Build files have been written to: /global/homes/s/student/hw1/build
Once our build is configured, we may actually execute the build:
student@cori04:~/hw1/build> make
Scanning dependencies of target benchmark
[ 14%] Building C object CMakeFiles/benchmark.dir/benchmark.c.o
[ 14%] Built target benchmark
...
[ 85%] Building C object CMakeFiles/benchmark-naive.dir/dgemm-naive.c.o
[100%] Linking C executable benchmark-naive
[100%] Built target benchmark-naive
student@cori04:~/hw1/build> ls
benchmark-blas benchmark-naive CMakeFiles job-blas job-naive
benchmark-blocked CMakeCache.txt cmake_install.cmake job-blocked Makefile
We now have three binaries (benchmark-blas, benchmark-blocked, and benchmark-naive) and three corresponding job scripts (job-blas, job-blocked, and job-naive). Feel free to create your own job scripts by copying one of these to the parent source directory (~/hw1).
You might find that your code works in Debug mode, but not Release mode. To add debug symbols to a release build, run
student@cori04:~/hw1/build> cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-g3" ..
You can add arbitrary extra compiler flags in this way. Just remember to re-run make after you do this.
The easiest way to run the code is to submit a batch job. We've already provided batch files which will launch jobs for each matrix multiply version.
student@cori04:~/hw1/build> sbatch job-blas
Submitted batch job 27505251
student@cori04:~/hw1/build> sbatch job-blocked
Submitted batch job 27505253
student@cori04:~/hw1/build> sbatch job-naive
Submitted batch job 27505255
Our jobs are now submitted to Cori's job queue. We can now check on the status of our submitted jobs using a few different commands.
student@cori04:~/hw1/build> squeue -u student
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
27505255 debug_hsw job-naiv student PD 0:00 1 (QOSMaxJobsPerUserLimit)
27505253 debug_hsw job-bloc student R 0:32 1 nid00545
27505251 debug_hsw job-blas student R 0:39 1 nid12790
student@cori04:~/hw1/build> sqs
JOBID ST USER NAME NODES REQUESTED USED SUBMIT QOS ESTIMATED_START FEATURES REASON
27505255 R student job-naive 1 1:00 0:36 2020-01-19T10:56:43 debug_hsw 2020-01-19T10:57:25 haswell None
27505253 R student job-blocked 1 1:00 1:19 2020-01-19T10:56:39 debug_hsw 2020-01-19T10:56:42 haswell None
When our job is finished, we'll find new files in our directory containing the output of our program. For example, for the job-blas job submitted above, we'll find the files job-blas.o27505251 and job-blas.e27505251. The first file contains the standard output of our program, and the second file contains the standard error.
You may find it useful to launch an interactive session when developing your code. This lets you compile and run code interactively on a compute node that you've reserved. In addition, running interactively lets you use the special interactive queue, which means you'll receive your allocation more quickly.
One of the easiest ways to implement your homework is to directly change the code on the server. For this you need to use a command line editor like nano or vim.
For beginners we recommend taking your first steps with nano. You can use it on Cori like this:
student@cori04:~/hw1> module load nano
student@cori04:~/hw1> nano dgemm-blocked.c
Use Ctrl+X to exit.
For a more complete code editor try vim which is loaded by default:
student@cori04:~/hw1> vim dgemm-blocked.c
Press Esc and type :q to exit (or :q! if you want to discard changes). Try out the interactive vim tutorial to learn more.
If you're more familiar with a graphical environment, many popular IDEs can use the provided CMakeLists.txt as a project definition. Refer to the documentation of your particular IDE for help setting this up. Using hosted version control like GitHub or Bitbucket makes uploading your changes much easier. If you're in a Windows environment, consider using the Windows Subsystem for Linux (WSL) for development.
The benchmark.c file generates matrices of a number of different sizes and benchmarks the performance. It outputs the performance in FLOPS and in a percentage of theoretical peak attained. Your job is to get your matrix-multiply's performance as close to the theoretical peak as possible.
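The two reported numbers follow from the 2n^3 operation count and a measured run time. The following is a minimal sketch of that arithmetic, not the code in benchmark.c: the helper names dgemm_gflops and percent_of_peak are hypothetical, and the rate is assumed to be expressed in Gflop/s.

```c
#include <assert.h>

/* Hypothetical helper: Gflop/s achieved by one n x n multiply-accumulate
 * that took `seconds`, counting 2*n^3 floating point operations. */
double dgemm_gflops(int n, double seconds)
{
    assert(n > 0 && seconds > 0.0);
    double flops = 2.0 * n * n * n;   /* n^3 multiplies plus n^3 adds */
    return flops / seconds / 1e9;
}

/* Hypothetical helper: achieved rate as a percentage of the peak rate. */
double percent_of_peak(double gflops, double peak_gflops)
{
    assert(peak_gflops > 0.0);
    return 100.0 * gflops / peak_gflops;
}
```

For instance, a 1000 x 1000 multiply that finishes in one second runs at 2 Gflop/s, which is a little over 5% of a 36.8 Gflop/s peak.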
When you run your code on a different computer, you will likely need to adjust the MAX_SPEED variable. This can be done like so:
student@cori04:~/hw1/build> cmake -DCMAKE_BUILD_TYPE=Debug -DMAX_SPEED=36.8 ..
student@cori04:~/hw1/build> make
On Cori, this value is computed as 2.3 GHz * 8 vector width * 2 flops for FMA = 36.8 GF/s.
Cori is actually partitioned in two: Cori Phase I contains nodes with Intel Xeon CPUs with the Haswell microarchitecture, and Cori Phase II contains Intel Xeon Phi nodes. In this assignment, we will only be using Cori Phase I. Be sure you use the flag '-C haswell' on any jobs that you run. The job files included with the starter code do this automatically.
Our benchmark harness reports numbers as a percentage of theoretical peak. Here, we show you how we calculate the theoretical peak of Cori's Haswell processors. If you'd like to run the assignment on your own processor, you should follow this process to arrive at the theoretical peak of your own machine, and then replace the MAX_SPEED constant in benchmark.c with the theoretical peak of your machine. Be sure to change it back if you run your code on Cori again.
One core has a clock rate of 2.3 GHz, so it runs 2.3 billion cycles per second. Cori's processors also have a 256-bit vector width, meaning each instruction can operate on 8 32-bit data elements at a time. (For the 64-bit doubles that dgemm uses, a 256-bit vector holds 4 elements, but a Haswell core can issue two vector FMA instructions per cycle, so the count of 8 elements per cycle still applies.) Furthermore, the Haswell microarchitecture includes a fused multiply-add (FMA) instruction, which means 2 floating point operations can be performed in a single instruction.
So, the theoretical peak of a single core on Cori's Haswell nodes is:
2.3 GHz * 8 vector width * 2 flops for FMA = 36.8 GF/s
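If you repeat this calculation for your own machine, the same three factors apply. The following is a small sketch of the arithmetic; the function name peak_gflops and its parameter names are hypothetical, and you would need to look up the clock rate, elements processed per cycle, and FMA support for your own processor.

```c
#include <assert.h>

/* Hypothetical helper: theoretical single-core peak in Gflop/s.
 *   clock_ghz             clock rate in GHz
 *   vector_width          data elements processed per cycle
 *   flops_per_instruction 2 if the core has FMA, otherwise 1 */
double peak_gflops(double clock_ghz, int vector_width, int flops_per_instruction)
{
    assert(clock_ghz > 0.0 && vector_width > 0 && flops_per_instruction > 0);
    return clock_ghz * vector_width * flops_per_instruction;
}
```

Calling peak_gflops(2.3, 8, 2) gives the 36.8 GF/s figure used for Cori's Haswell cores.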
Now, it's time to optimize! A few optimizations you might consider adding:
You may, of course, proceed however you wish. We recommend you look through the lecture notes as reference material to guide your optimization process, as well as the references at the bottom of this write-up.
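As a starting point, recall that dgemm-blocked.c is described as a simple blocked implementation. The following is a minimal sketch of what blocking (tiling) looks like, not a copy of the starter code: the name blocked_dgemm_sketch, the BLOCK_SIZE value of 32, and the column-major layout are all assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define BLOCK_SIZE 32  /* assumed tile edge; tune it for the cache you target */

static int min_int(int a, int b) { return a < b ? a : b; }

/* Hypothetical sketch: C := C + A*B, column-major, computed tile by tile
 * so that each tile of A, B, and C is reused while it is still in cache. */
void blocked_dgemm_sketch(int n, const double *A, const double *B, double *C)
{
    assert(n >= 0);
    for (int j0 = 0; j0 < n; j0 += BLOCK_SIZE) {
        for (int k0 = 0; k0 < n; k0 += BLOCK_SIZE) {
            for (int i0 = 0; i0 < n; i0 += BLOCK_SIZE) {
                /* Clamp tile bounds so n need not be a multiple of BLOCK_SIZE. */
                int jmax = min_int(j0 + BLOCK_SIZE, n);
                int kmax = min_int(k0 + BLOCK_SIZE, n);
                int imax = min_int(i0 + BLOCK_SIZE, n);
                for (int j = j0; j < jmax; ++j) {
                    for (int k = k0; k < kmax; ++k) {
                        double bkj = B[k + (size_t)j * n];
                        /* Innermost loop walks down a column: stride 1 in A and C. */
                        for (int i = i0; i < imax; ++i) {
                            C[i + (size_t)j * n] += A[i + (size_t)k * n] * bkj;
                        }
                    }
                }
            }
        }
    }
}
```

The loop order inside a tile is chosen so the innermost loop has unit stride under the column-major assumption; a different storage order would call for a different loop order.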
The development environment on Cori supplies several different compilers, including Intel, Cray, and LLVM. However, we want you to use the GNU C compiler for this assignment. You can make sure you are using GNU by running:
student@cori04:~> module swap PrgEnv-intel PrgEnv-gnu
Still, you might want to try your code with different compilers to see if one outperforms the other. If the difference is significant, consider using the Compiler Explorer to figure out why GCC isn't optimizing your code as well. For more information on compilers available on Cori, see the NERSC docs.
We will grade your assignment by reviewing your assignment write-up, looking at the optimization methods you attempted, and benchmarking your code's performance. To benchmark your code, we will compile it with the exact process detailed above, with the GNU compiler. Note that code that does not return correct results will receive significant penalties.
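Because incorrect results are penalized, it can help to compare your kernel's output against a straightforward reference before you submit. The following is a minimal sketch of one such comparison; max_abs_diff is a hypothetical helper, the acceptable tolerance is up to you, and the check that benchmark.c actually performs may differ.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: largest absolute entrywise difference between
 * two n x n matrices stored as flat arrays of n*n doubles. */
double max_abs_diff(int n, const double *X, const double *Y)
{
    assert(n >= 0);
    double worst = 0.0;
    size_t count = (size_t)n * (size_t)n;
    for (size_t idx = 0; idx < count; ++idx) {
        double d = X[idx] - Y[idx];
        if (d != d) return d;      /* propagate NaN so a broken kernel is not missed */
        if (d < 0.0) d = -d;
        if (d > worst) worst = d;
    }
    return worst;
}
```

Run your optimized kernel and a simple triple-loop version on identical inputs, then check that the returned difference is small relative to the magnitude of the entries. Some difference is expected because reordering floating point additions changes rounding.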
Supposing you are Group #04, follow these steps to create an appropriate submission archive:
student@cori04:~/hw1/build> cmake -DGROUP_NO=04 ..
student@cori04:~/hw1/build> make package
This second command will fail if the PDF is not present.
student@cori04:~/hw1/build> tar tfz cs267Group04_hw1.tar.gz
cs267Group04_hw1/cs267Group04_hw1.pdf
cs267Group04_hw1/dgemm-blocked.c
These parts are not graded. You should be satisfied with your square_dgemm results and write-up before beginning an optional part.
If you wish to submit optional parts, send them to us via email, rather than through the bCourses system.
You are also welcome to learn from the source code of state-of-the-art BLAS implementations such as GotoBLAS and ATLAS. However, you should not reuse that code in your submission.
* We emphasize these are example scripts because for these as well as all other assignment scripts we provide, you may need to adjust the number of requested nodes and cores and amount of time according to your needs (your allocation and the total class allocation is limited). To understand how you are charged, READ THIS alongside the given scripts. For testing (1) try running and debugging on your laptop first, (2) try running with the minimum resources you need for testing and debugging, (3) once your code is fully debugged, use the amount of resources you need to collect the final results for the assignment. This will become more important for later assignments, but it is good to get in the habit now.