Your task in this assignment is to write an optimized matrix multiplication function for NERSC's Cori supercomputer. We will give you a generic matrix multiplication code (also called matmul or dgemm), and it will be your job to tune our code to run efficiently on Cori's processors. We are asking you to write an optimized single-threaded matrix multiply kernel. This will run on only one core.
We consider a special case of matmul:
C := C + A*B
where A, B, and C are n x n matrices. This can be performed using 2n^3 floating point operations (n^3 adds, n^3 multiplies), as in the following pseudocode:
for i = 1 to n
  for j = 1 to n
    for k = 1 to n
      C(i,j) = C(i,j) + A(i,k) * B(k,j)
    end
  end
end
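In C, the same triple loop looks roughly like the square_dgemm() you will find in dgemm-naive.c. The sketch below is only illustrative: it assumes the matrices are stored in column-major order, so that the (i,j) entry of an n x n matrix M lives at M[i + j*n]; check the convention benchmark.c actually uses before relying on it.

/* C := C + A*B for n x n matrices, assuming column-major storage. */
void square_dgemm(int n, double* A, double* B, double* C)
{
  for (int i = 0; i < n; ++i)
    for (int j = 0; j < n; ++j)
    {
      /* Accumulate C(i,j) in a local variable to avoid repeated stores. */
      double cij = C[i + j*n];
      for (int k = 0; k < n; ++k)
        cij += A[i + k*n] * B[k + j*n];
      C[i + j*n] = cij;
    }
}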
Dear Remote Students, we are thrilled to be a part of your parallel computing learning experience and to share these resources with you! To avoid confusion, please note that the assignment instructions, deadlines, and other assignment details posted here were designed for the local UC Berkeley students. You should check with your local instruction team about submission, deadlines, job-running details, etc. and utilize Moodle for questions. With that in mind, the problem statement, source code, and references should still help you get started (just beware of institution-specific instructions). Best of luck and we hope you enjoy the assignment!
Note that you will work in assigned teams for this assignment. See bCourses for your group assignments.
Download the starter code.
The tarball contains starter code for the serial matrix multiply, with the following source files.
dgemm-blocked.c
A simple blocked implementation of matrix multiply. It is your job to optimize the square_dgemm() function in this file.
dgemm-blas.c
A wrapper which calls the vendor's optimized BLAS implementation of matrix multiply (here, MKL).
dgemm-naive.c
For illustrative purposes, a naive implementation of matrix multiply using three nested loops.
benchmark.c
A driver program that runs your code. You will not modify this file, except perhaps to change the MAX_SPEED constant if you wish to test on another computer (more about this below).
Makefile
A simple makefile to build the executables. It has support for the PrgEnv-gnu and PrgEnv-intel (default) compilers.
job-blas, job-blocked, job-naive
Example job scripts* to run the executables on Cori compute nodes. For example, you can type "sbatch job-blas" to benchmark the BLAS version.
The starter code should work out of the box. To get started, we recommend you log in to Cori and download the first part of the assignment. This will look something like the following:
[demmel@blasboss ~]$ ssh demmel@cori.nersc.gov
[demmel@cori10 ~]$ wget "https://cs.berkeley.edu/~brock/cs267/hw/cs267_hw1_2019.tar.gz"
[demmel@cori10 ~]$ tar -xf cs267_hw1_2019.tar.gz
[demmel@cori10 ~]$ cd cs267_hw1_2019
[demmel@cori10 cs267_hw1_2019]$ ls
Makefile benchmark.c dgemm-blas.c dgemm-blocked.c dgemm-naive.c job-blas job-blocked job-naive
Next let's build the code.
[demmel@cori10 cs267_hw1_2019]$ make
rm -f benchmark-naive benchmark-blocked benchmark-blas benchmark.o dgemm-naive.o dgemm-blocked.o dgemm-blas.o
cc -c -Wall -std=gnu99 -O2 benchmark.c
cc -c -Wall -std=gnu99 -O2 dgemm-naive.c
cc -o benchmark-naive benchmark.o dgemm-naive.o -lrt
cc -c -Wall -std=gnu99 -O2 dgemm-blocked.c
cc -o benchmark-blocked benchmark.o dgemm-blocked.o -lrt
cc -c -Wall -std=gnu99 -O2 dgemm-blas.c
cc -o benchmark-blas benchmark.o dgemm-blas.o -lrt
[demmel@cori10 cs267_hw1_2019]$
We now have three binaries, benchmark-blas, benchmark-blocked, and benchmark-naive. The easiest way to run the code is to submit a batch job. We've already provided batch files which will launch jobs for each matrix multiply version.
[demmel@cori10 cs267_hw1_2019]$ sbatch job-blas
Submitted batch job 9637621
[demmel@cori10 cs267_hw1_2019]$ sbatch job-blocked
Submitted batch job 9637622
[demmel@cori10 cs267_hw1_2019]$ sbatch job-naive
Submitted batch job 9637623
Our jobs are now submitted to Cori's job queue. We can check on their status using a few different commands.
[demmel@cori10 cs267_hw1_2019]$ squeue -u demmel
JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)
9637621 debug job-blas demmel PD 0:00 1 (Priority)
9637622 debug job-bloc demmel PD 0:00 1 (Priority)
9637623 debug job-naiv demmel PD 0:00 1 (Priority)
[demmel@cori10 cs267_hw1_2019]$ sqs
JOBID ST USER NAME NODES REQUESTED USED SUBMIT QOS SCHEDULED_START REASON
9637621 PD demmel job-blas 1 1:00 0:00 2018-01-19T11:42:15 debug_hsw avail_in_~1.0_hrs Priority
9637622 PD demmel job-blocked 1 1:00 0:00 2018-01-19T11:42:17 debug_hsw avail_in_~1.0_hrs Priority
9637623 PD demmel job-naive 1 1:00 0:00 2018-01-19T11:42:20 debug_hsw avail_in_~1.0_hrs Priority
When our job is finished, we'll find new files in our directory containing the output of our program. For example, we'll find the files job-blas.o9637621 and job-blas.e9637621. The first file contains the standard output of our program, and the second file contains the standard error.
You may find it useful to launch an interactive session when developing your code. This lets you compile and run code interactively on a compute node that you've reserved. In addition, running interactively lets you use the special interactive queue, which means you'll receive your allocation more quickly.
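For example, on Cori an interactive Haswell node can usually be requested with salloc (the exact flags, QOS name, and time limits depend on NERSC's current policy, so check the NERSC docs):

[demmel@cori10 cs267_hw1_2019]$ salloc -N 1 -C haswell -q interactive -t 01:00:00

Once the allocation starts you get a shell with the node reserved, where you can rebuild with make and launch the benchmarks (for example with srun ./benchmark-blocked).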
One of the easiest ways to work on the homework is to edit the code directly on the server. For this you need a command-line editor such as nano or vim.
If you're new to command-line editors, we recommend starting with nano. You can use it on Cori like this:
[demmel@cori10 cs267_hw1_2019]$ module load nano
[demmel@cori10 cs267_hw1_2019]$ nano dgemm-blocked.c
Use Ctrl+X to exit.
For a more complete code editor try vim which is loaded by default:
[demmel@cori10 cs267_hw1_2019]$ vim dgemm-blocked.c
Press Esc, then type :q to exit (:wq to save and exit, :q! to discard changes). Try out the interactive vim tutorial to learn more.
The benchmark.c file generates matrices of a number of different sizes and benchmarks the performance. It outputs the performance in FLOPS and as a percentage of the theoretical peak attained. Your job is to get your matrix multiply's performance as close to the theoretical peak as possible.
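For a rough sense of scale: multiplying two 1000 x 1000 matrices takes 2 * 1000^3 = 2 x 10^9 floating point operations, so a kernel sustaining half of the roughly 36.8 GFlops/s per-core peak derived below would finish one such multiply in a bit over 0.1 seconds. The percentage the benchmark reports is simply your measured rate divided by that theoretical peak.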
Cori is actually partitioned into two: Cori Phase I contains nodes with Intel Xeon CPUs based on the Haswell microarchitecture, and Cori Phase II contains Intel Xeon Phi nodes. In this assignment, we will only be using Cori Phase I. Be sure to use the flag '-C haswell' on any jobs that you run. The job files included with the starter code do this automatically.
Our benchmark harness reports numbers as a percentage of theoretical peak. Here, we show you how we calculate the theoretical peak of Cori's Haswell processors. If you'd like to run the assignment on your own processor, you should follow this process to arrive at the theoretical peak of your own machine, and then replace the MAX_SPEED constant in benchmark.c with the theoretical peak of your machine. Be sure to change it back if you run your code on Cori again.
One core has a clock rate of 2.3 GHz, so it completes 2.3 billion cycles per second. Cori's Haswell processors have a 256-bit vector width, and dgemm works in double precision, so each vector instruction operates on 4 64-bit data elements at a time. Furthermore, the Haswell microarchitecture includes fused multiply-add (FMA) instructions, which means 2 floating point operations can be performed per element in a single instruction, and each core has two vector FMA units, so two such instructions can issue per cycle. So, the theoretical peak of one core of Cori's Haswell nodes comes from multiplying these factors together.
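As a worked calculation (the result should match the MAX_SPEED constant shipped in benchmark.c; double-check the value in your copy of the starter code):

2.3 GHz x 4 elements per vector x 2 flops per FMA x 2 FMA units per core = 36.8 GFlops/s per core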
Now, it's time to optimize! A few optimizations you might consider adding (see the vectorization sketch below this list):
- Cache blocking (tiling): dgemm-blocked.c already blocks the loops once; try tuning the block size and adding a second level of blocking for the L1 and L2 caches.
- Register blocking: a small fixed-size inner kernel (for example 4x4) whose operands stay in registers.
- Vectorization: help the compiler auto-vectorize the inner kernel, or write it directly with AVX/FMA intrinsics.
- Loop unrolling and reordering the i/j/k loops to improve spatial locality.
- Copy optimization: repacking blocks of A and B into contiguous (and aligned) buffers before multiplying them.
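To make the vectorization idea concrete, here is a minimal sketch of an AVX2/FMA micro-step. It is not part of the starter code: the helper name fma_4x1 is hypothetical, it assumes column-major storage with leading dimension lda (check the convention benchmark.c actually uses), and it assumes i+3 < n so a full vector of rows exists. You would also need to build with AVX2/FMA enabled (for example -march=haswell or -mavx2 -mfma with GCC).

#include <immintrin.h>

/* Hypothetical micro-step (not from the starter code): computes
   C(i..i+3, j) += A(i..i+3, k) * B(k, j) with one fused multiply-add,
   assuming column-major storage with leading dimension lda.           */
static inline void fma_4x1(int lda, const double* A, const double* B, double* C,
                           int i, int j, int k)
{
  __m256d a = _mm256_loadu_pd(&A[i + k*lda]);  /* four consecutive rows of column k of A */
  __m256d b = _mm256_set1_pd(B[k + j*lda]);    /* broadcast the scalar B(k,j)            */
  __m256d c = _mm256_loadu_pd(&C[i + j*lda]);  /* four consecutive rows of column j of C */
  c = _mm256_fmadd_pd(a, b, c);                /* c += a * b                             */
  _mm256_storeu_pd(&C[i + j*lda], c);
}

A real register-blocked kernel would keep several such accumulators live across the k loop, store C back only once per block, and fall back to scalar code for the leftover rows and columns when n is not a multiple of the vector width.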
You may, of course, proceed however you wish. We recommend you look through the lecture notes as reference material to guide your optimization process, as well as the references at the bottom of this write-up.
You will probably want to try your code with different compilers to see if you can get a performance boost for free. The default compiler on Cori is Intel's compiler. GNU, Cray, and LLVM compilers are also available. We recommend you use the Intel or GNU compilers. For more information on compilers available on Cori, see the NERSC docs.
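On Cray systems such as Cori, switching compilers typically means swapping the programming environment module and rebuilding; for example (module names may differ from NERSC's current configuration):

[demmel@cori10 cs267_hw1_2019]$ module swap PrgEnv-intel PrgEnv-gnu
[demmel@cori10 cs267_hw1_2019]$ make

The cc wrapper seen in the build output above then invokes the GNU compiler instead of Intel's.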
We will grade your assignment by reviewing your assignment write-up, looking at the optimization methods you attempted, and benchmarking your code's performance.
To benchmark your code, we will compile it with the Intel and GNU compilers, run the binaries, and take the best performance result. If you wish to be graded using a different compiler, you must contact the TAs directly. We'll allow it, but will have to grade your assignment separately.
These optional parts are not graded. You should be satisfied with your square_dgemm results and write-up before beginning an optional part.
You are also welcome to learn from the source code of state-of-the-art BLAS implementations such as GotoBLAS and ATLAS. However, you should not reuse that code in your submission.
* We emphasize that these are example scripts: for these, as well as all other assignment scripts we provide, you may need to adjust the number of requested nodes and cores and the amount of time according to your needs (your allocation and the total class allocation are limited). To understand how you are charged, read NERSC's documentation on job charging alongside the given scripts. For testing: (1) try running and debugging on your laptop first; (2) run with the minimum resources you need for testing and debugging; (3) once your code is fully debugged, use the amount of resources you need to collect the final results for the assignment. This will become more important for later assignments, but it is good to get into the habit now.