Your task in this assignment is to write an optimized matrix multiplication function for NERSC's Cori supercomputer. We will give you a generic matrix multiplication code (also called matmul or dgemm), and it will be your job to tune our code to run efficiently on Cori's processors. We are asking you to write an optimized single-threaded matrix multiply kernel. This will run on only one core.
We consider a special case of matmul:
C := C + A*B
where A, B, and C are n x n matrices. This can be performed using 2n^3 floating point operations (n^3 adds, n^3 multiplies), as in the following pseudocode:
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    end
  end
end
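In C, the same triple loop looks roughly like the square_dgemm() you will find in dgemm-naive.c. The sketch below is only illustrative: it assumes the matrices are stored in column-major order, so that the (i,j) entry of an n x n matrix M lives at M[i + j*n]; check the convention benchmark.c actually uses before relying on it.

/* C := C + A*B for n x n matrices, assuming column-major storage. */
void square_dgemm(int n, double* A, double* B, double* C)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
    {
      /* Accumulate C(i,j) in a local variable to avoid repeated stores. */
      double cij = C[i + j*n];
      for (int k = 0; k < n; ++k)
        cij += A[i + k*n] * B[k + j*n];
      C[i + j*n] = cij;
    }
}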
Dear Remote Students, we are thrilled to be a part of your parallel computing learning experience and to share these resources with you! To avoid confusion, please note that the assignment instructions, deadlines, and other assignment details posted here were designed for the local UC Berkeley students. You should check with your local instruction team about submission, deadlines, job-running details, etc. and utilize Moodle for questions. With that in mind, the problem statement, source code, and references should still help you get started (just beware of institution-specific instructions). Best of luck and we hope you enjoy the assignment!
Note that you will work in assigned teams for this assignment. See bCourses for your group assignments.
Download the starter code.
The tarball contains starter code for the serial matrix multiply, with the following source files.
dgemm-blocked.c
A simple blocked implementation of matrix multiply. It is your job to optimize the square_dgemm() function in this file.
dgemm-blas.c
A wrapper which calls the vendor's optimized BLAS implementation of matrix multiply (here, MKL).
dgemm-naive.c
For illustrative purposes, a naive implementation of matrix multiply using three nested loops.
benchmark.c
A driver program that runs your code. You will not modify this file, except perhaps to change the MAX_SPEED constant if you wish to test on another computer (more about this below).
Makefile
A simple makefile to build the executables. It has support for the PrgEnv-gnu and PrgEnv-intel (default) compilers.
job-blas, job-blocked, job-naive
Example job scripts* to run the executables on Cori compute nodes. For example, you can type "sbatch job-blas" to benchmark the BLAS version.
The starter code should work out of the box. To get started, we recommend you log in to Cori and download the first part of the assignment. This will look something like the following:
[demmel@blasboss ~]$ ssh demmel@cori.nersc.gov
[demmel@cori10 ~]$ wget "https://cs.berkeley.edu/~brock/cs267/hw/cs267_hw1_2019.tar.gz"
[demmel@cori10 ~]$ tar -xf cs267_hw1_2019.tar.gz
[demmel@cori10 ~]$ cd cs267_hw1_2019
[demmel@cori10 cs267_hw1_2019]$ ls
Makefile benchmark.c dgemm-blas.c dgemm-blocked.c dgemm-naive.c job-blas job-blocked job-naive
Next let's build the code.
[demmel@cori10 cs267_hw1_2019]$ make
rm -f benchmark-naive benchmark-blocked benchmark-blas benchmark.o dgemm-naive.o dgemm-blocked.o dgemm-blas.o
cc -c -Wall -std=gnu99 -O2 benchmark.c
cc -c -Wall -std=gnu99 -O2 dgemm-naive.c
cc -o benchmark-naive benchmark.o dgemm-naive.o -lrt
cc -c -Wall -std=gnu99 -O2 dgemm-blocked.c
cc -o benchmark-blocked benchmark.o dgemm-blocked.o -lrt
cc -c -Wall -std=gnu99 -O2 dgemm-blas.c
cc -o benchmark-blas benchmark.o dgemm-blas.o -lrt
[demmel@cori10 cs267_hw1_2019]$
We now have three binaries, benchmark-blas, benchmark-blocked, and benchmark-naive. The easiest way to run the code is to submit a batch job. We've already provided batch files which will launch jobs for each matrix multiply version.
[demmel@cori10 cs267_hw1_2019]$ sbatch job-blas
Submitted batch job 9637621
[demmel@cori10 cs267_hw1_2019]$ sbatch job-blocked
Submitted batch job 9637622
[demmel@cori10 cs267_hw1_2019]$ sbatch job-naive
Submitted batch job 9637623
Our jobs are now submitted to Cori's job queue. We can check on their status using a few different commands.
[demmel@cori10 cs267_hw1_2019]$ squeue -u demmel
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
9637621 debug job-blas demmel PD 0:00 1 (Priority)
9637622 debug job-bloc demmel PD 0:00 1 (Priority)
9637623 debug job-naiv demmel PD 0:00 1 (Priority)
[demmel@cori10 cs267_hw1_2019]$ sqs
JOBID ST USER NAME NODES REQUESTED USED SUBMIT QOS SCHEDULED_START REASON
9637621 PD demmel job-blas 1 1:00 0:00 2018-01-19T11:42:15 debug_hsw avail_in_~1.0_hrs Priority
9637622 PD demmel job-blocked 1 1:00 0:00 2018-01-19T11:42:17 debug_hsw avail_in_~1.0_hrs Priority
9637623 PD demmel job-naive 1 1:00 0:00 2018-01-19T11:42:20 debug_hsw avail_in_~1.0_hrs Priority
When our job is finished, we'll find new files in our directory containing the output of our program. For example, we'll find the files job-blas.o9637621 and job-blas.e9637621. The first file contains the standard output of our program, and the second file contains the standard error.
You may find it useful to launch an interactive session when developing your code. This lets you compile and run code interactively on a compute node that you've reserved. In addition, running interactively lets you use the special interactive queue, which means you'll receive your allocation more quickly.
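For example, on Cori an interactive Haswell node can usually be requested with salloc (the exact flags, QOS name, and time limits depend on NERSC's current policy, so check the NERSC docs):

[demmel@cori10 cs267_hw1_2019]$ salloc -N 1 -C haswell -q interactive -t 01:00:00

Once the allocation starts you get a shell with the node reserved, where you can rebuild with make and launch the benchmarks (for example with srun ./benchmark-blocked).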
One of the easiest ways to work on the homework is to edit the code directly on the server. For this you need a command-line editor such as nano or vim.
If you're new to command-line editors, we recommend starting with nano. You can use it on Cori like this:
[demmel@cori10 cs267_hw1_2019]$ module load nano
[demmel@cori10 cs267_hw1_2019]$ nano dgemm-blocked.c
Use Ctrl+X to exit.
For a more complete code editor try vim which is loaded by default:
[demmel@cori10 cs267_hw1_2019]$ vim dgemm-blocked.c
Press Esc, then type :q to exit (:wq to save and exit, :q! to discard changes). Try out the interactive vim tutorial to learn more.
The benchmark.c file generates matrices of a number of different sizes and benchmarks the performance. It outputs the performance in FLOPS and as a percentage of the theoretical peak attained. Your job is to get your matrix multiply's performance as close to the theoretical peak as possible.
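For a rough sense of scale: multiplying two 1000 x 1000 matrices takes 2 * 1000^3 = 2 x 10^9 floating point operations, so a kernel sustaining half of the roughly 36.8 GFlops/s per-core peak derived below would finish one such multiply in a bit over 0.1 seconds. The percentage the benchmark reports is simply your measured rate divided by that theoretical peak.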
Cori is actually partitioned into two: Cori Phase I contains nodes with Intel Xeon CPUs based on the Haswell microarchitecture, and Cori Phase II contains Intel Xeon Phi nodes. In this assignment, we will only be using Cori Phase I. Be sure to use the flag '-C haswell' on any jobs that you run. The job files included with the starter code do this automatically.
Our benchmark harness reports numbers as a percentage of theoretical peak. Here, we show you how we calculate the theoretical peak of Cori's Haswell processors. If you'd like to run the assignment on your own processor, you should follow this process to arrive at the theoretical peak of your own machine, and then replace the MAX_SPEED constant in benchmark.c with the theoretical peak of your machine. Be sure to change it back if you run your code on Cori again.
One core has a clock rate of 2.3 GHz, so it completes 2.3 billion cycles per second. Cori's Haswell processors have a 256-bit vector width, and dgemm works in double precision, so each vector instruction operates on 4 64-bit data elements at a time. Furthermore, the Haswell microarchitecture includes fused multiply-add (FMA) instructions, which means 2 floating point operations can be performed per element in a single instruction, and each core has two vector FMA units, so two such instructions can issue per cycle. So, the theoretical peak of one core of Cori's Haswell nodes comes from multiplying these factors together.
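As a worked calculation (the result should match the MAX_SPEED constant shipped in benchmark.c; double-check the value in your copy of the starter code):

2.3 GHz x 4 elements per vector x 2 flops per FMA x 2 FMA units per core = 36.8 GFlops/s per core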
Now, it's time to optimize! A few optimizations you might consider adding (see the vectorization sketch below this list):
- Cache blocking (tiling): dgemm-blocked.c already blocks the loops once; try tuning the block size and adding a second level of blocking for the L1 and L2 caches.
- Register blocking: a small fixed-size inner kernel (for example 4x4) whose operands stay in registers.
- Vectorization: help the compiler auto-vectorize the inner kernel, or write it directly with AVX/FMA intrinsics.
- Loop unrolling and reordering the i/j/k loops to improve spatial locality.
- Copy optimization: repacking blocks of A and B into contiguous (and aligned) buffers before multiplying them.
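To make the vectorization idea concrete, here is a minimal sketch of an AVX2/FMA micro-step. It is not part of the starter code: the helper name fma_4x1 is hypothetical, it assumes column-major storage with leading dimension lda (check the convention benchmark.c actually uses), and it assumes i+3 < n so a full vector of rows exists. You would also need to build with AVX2/FMA enabled (for example -march=haswell or -mavx2 -mfma with GCC).

#include <immintrin.h>

/* Hypothetical micro-step (not from the starter code): computes
   C(i..i+3, j) += A(i..i+3, k) * B(k, j) with one fused multiply-add,
   assuming column-major storage with leading dimension lda.           */
static inline void fma_4x1(int lda, const double* A, const double* B, double* C,
                           int i, int j, int k)
{
  __m256d a = _mm256_loadu_pd(&A[i + k*lda]);  /* four consecutive rows of column k of A */
  __m256d b = _mm256_set1_pd(B[k + j*lda]);    /* broadcast the scalar B(k,j)            */
  __m256d c = _mm256_loadu_pd(&C[i + j*lda]);  /* four consecutive rows of column j of C */
  c = _mm256_fmadd_pd(a, b, c);                /* c += a * b                             */
  _mm256_storeu_pd(&C[i + j*lda], c);
}

A real register-blocked kernel would keep several such accumulators live across the k loop, store C back only once per block, and fall back to scalar code for the leftover rows and columns when n is not a multiple of the vector width.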
You may, of course, proceed however you wish. We recommend you look through the lecture notes as reference material to guide your optimization process, as well as the references at the bottom of this write-up.
You will probably want to try your code with different compilers to see if you can get a performance boost for free. The default compiler on Cori is Intel's compiler. GNU, Cray, and LLVM compilers are also available. We recommend you use the Intel or GNU compilers. For more information on compilers available on Cori, see the NERSC docs.
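On Cray systems such as Cori, switching compilers typically means swapping the programming environment module and rebuilding; for example (module names may differ from NERSC's current configuration):

[demmel@cori10 cs267_hw1_2019]$ module swap PrgEnv-intel PrgEnv-gnu
[demmel@cori10 cs267_hw1_2019]$ make

The cc wrapper seen in the build output above then invokes the GNU compiler instead of Intel's.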
We will grade your assignment by reviewing your assignment write-up, looking at the optimization methods you attempted, and benchmarking your code's performance.
To benchmark your code, we will compile it with the Intel and GNU compilers, run the binaries, and take the best performance result. If you wish to be graded using a different compiler, you must contact the TAs directly. We'll allow it, but will have to grade your assignment separately.
These optional parts are not graded. You should be satisfied with your square_dgemm results and write-up before beginning an optional part.
You are also welcome to learn from the source code of state-of-the-art BLAS implementations such as GotoBLAS and ATLAS. However, you should not reuse that code in your submission.
* We emphasize that these are example scripts: for these, as well as all other assignment scripts we provide, you may need to adjust the number of requested nodes and cores and the amount of time according to your needs (your allocation and the total class allocation are limited). To understand how you are charged, read NERSC's documentation on job charging alongside the given scripts. For testing: (1) try running and debugging on your laptop first; (2) run with the minimum resources you need for testing and debugging; (3) once your code is fully debugged, use the amount of resources you need to collect the final results for the assignment. This will become more important for later assignments, but it is good to get into the habit now.