Assignment 1: Optimizing Matrix Multiplication

Problem statement

Your task in this assignment is to write an optimized matrix multiplication function for NERSC's Cori supercomputer. We will give you a generic matrix multiplication code (also called matmul or dgemm), and it will be your job to tune our code to run efficiently on Cori's processors. We are asking you to write an optimized single-threaded matrix multiply kernel. This will run on only one core.

We consider a special case of matmul:

C := C + A*B

where A, B, and C are n x n matrices. This can be performed using 2n^3 floating point operations (n^3 adds, n^3 multiplies), as in the following pseudocode:

for i = 1 to n
    for j = 1 to n
        for k = 1 to n
            C(i,j) = C(i,j) + A(i,k) * B(k,j)
        end
    end
end
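
For concreteness, here is a sketch of the same triple loop in C, assuming the column-major, 0-based layout described in the Notes below (C(i,j) == C[i + j*n]). The starter file dgemm-naive.c contains the actual reference implementation; the name square_dgemm_naive here is just illustrative:

/* Naive matmul: C := C + A*B for n x n column-major matrices. */
void square_dgemm_naive(int n, double* A, double* B, double* C) {
    for (int j = 0; j < n; ++j)          /* column of C */
        for (int i = 0; i < n; ++i) {    /* row of C */
            double cij = C[i + j * n];   /* C(i,j) */
            for (int k = 0; k < n; ++k)
                cij += A[i + k * n] * B[k + j * n]; /* A(i,k)*B(k,j) */
            C[i + j * n] = cij;
        }
}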

Remote XSEDE/Moodle Students: Please Read

Dear Remote Students, we are thrilled to be a part of your parallel computing learning experience and to share these resources with you! To avoid confusion, please note that the assignment instructions, deadlines, and other assignment details posted here were designed for the local UC Berkeley students. You should check with your local instruction team about submission, deadlines, job-running details, etc. and utilize Moodle for questions. With that in mind, the problem statement, source code, and references should still help you get started (just beware of institution-specific instructions). Best of luck and we hope you enjoy the assignment!

Due Date: Thursday, February 3rd, 2022 at 11:59 PM PST

Instructions

Teams

Note that you will work in assigned teams for this assignment. See bCourses for your group assignments.

Getting Started with Cori

Please read through the Cori tutorial, available here: https://bitbucket.org/Berkeley-CS267/cori-tutorial/src/master/cori-tutorial.md

Getting Set Up

The starter code is available on Bitbucket at https://bitbucket.org/Berkeley-CS267/hw1-knl.git and should work out of the box. To get started, we recommend you log in to Cori and download the first part of the assignment. This will look something like the following:

student@local:~> ssh student@cori.nersc.gov

student@cori04:~> git clone https://bitbucket.org/Berkeley-CS267/hw1-knl.git hw1

student@cori04:~> cd hw1

student@cori04:~/hw1> ls

CMakeLists.txt README.md benchmark.c dgemm-blas.c dgemm-blocked.c dgemm-naive.c job.in

There are seven files in the base repository. Their purposes are as follows:

CMakeLists.txt

The build system that manages compiling your code.

README.md

README file explaining the build system in more detail.

benchmark.c

A driver program that runs your code.

dgemm-blas.c

A wrapper which calls the vendor's optimized BLAS implementation of matrix multiply (here, MKL).

dgemm-blocked.c (the only file you may modify)

A simple blocked implementation of matrix multiply. It is your job to optimize the square_dgemm() function in this file.

dgemm-naive.c

For illustrative purposes, a naive implementation of matrix multiply using three nested loops.

job.in

Template job script that is filled in by the build system for each of blas, blocked, and naive.

Please do not modify any of the files besides dgemm-blocked.c.

Building our Code

First, we need to make sure that the CMake module is loaded and that the GNU compiler is selected.

student@cori04:~/hw1> module load cmake

student@cori04:~/hw1> module swap PrgEnv-intel PrgEnv-gnu

You should put these commands in your ~/.bash_profile file to avoid typing them every time you log in.

Next, let's build the code. CMake prefers out-of-tree builds, so we start by creating a build directory.

student@cori04:~/hw1> mkdir build

student@cori04:~/hw1> cd build

student@cori04:~/hw1/build>

Next, we have to configure our build. We can build our code in either Debug mode or Release mode. In Debug mode, optimizations are disabled and debug symbols are embedded in the binary for easier debugging with GDB. In Release mode, optimizations are enabled and debug symbols are omitted. For example:

student@cori04:~/hw1/build> cmake -DCMAKE_BUILD_TYPE=Release ..

-- The C compiler identification is GNU 8.3.0

...

-- Configuring done

-- Generating done

-- Build files have been written to: /global/homes/s/student/hw1/build

Once our build is configured, we may actually execute the build:

student@cori04:~/hw1/build> make

Scanning dependencies of target benchmark

[ 14%] Building C object CMakeFiles/benchmark.dir/benchmark.c.o

[ 14%] Built target benchmark

...

[ 85%] Building C object CMakeFiles/benchmark-naive.dir/dgemm-naive.c.o

[100%] Linking C executable benchmark-naive

[100%] Built target benchmark-naive

student@cori04:~/hw1/build> ls

benchmark-blas benchmark-naive CMakeFiles job-blas job-naive

benchmark-blocked CMakeCache.txt cmake_install.cmake job-blocked Makefile

We now have three binaries (benchmark-blas, benchmark-blocked, and benchmark-naive) and three corresponding job scripts (job-blas, job-blocked, and job-naive). Feel free to create your own job scripts by copying one of these into the source directory.

Other build considerations

You might find that your code works in Debug mode, but not Release mode. To add debug symbols to a release build, run

student@cori04:~/hw1/build> cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_C_FLAGS="-g3" ..

You can add arbitrary extra compiler flags this way; just remember to re-run make afterwards.

Running our Code

The easiest way to run the code is to submit a batch job. We've already provided batch files which will launch jobs for each matrix multiply version.

student@cori04:~/hw1/build> sbatch job-blas

Submitted batch job 27505251

student@cori04:~/hw1/build> sbatch job-blocked

Submitted batch job 27505253

student@cori04:~/hw1/build> sbatch job-naive

Submitted batch job 27505255

Our jobs are now submitted to Cori's job queue. We can check on their status using a few different commands.

student@cori04:~/hw1/build> squeue -u student

JOBID PARTITION NAME USER ST TIME NODES NODELIST(REASON)

27505255 debug_knl job-naiv student PD 0:00 1 (QOSMaxJobsPerUserLimit)

27505253 debug_knl job-bloc student R 0:32 1 nid00545

27505251 debug_knl job-blas student R 0:39 1 nid12790

student@cori04:~/hw1/build> sqs

JOBID ST USER NAME NODES REQUESTED USED SUBMIT QOS ESTIMATED_START FEATURES REASON

27505255 R student job-naive 1 1:00 0:36 2020-01-19T10:56:43 debug_knl 2020-01-19T10:57:25 knl&quad&cache None

27505253 R student job-blocked 1 1:00 1:19 2020-01-19T10:56:39 debug_knl 2020-01-19T10:56:42 knl&quad&cache None


When our jobs finish, we'll find new files in our directory containing the output of our programs. For example, for the job-blas job above we'll find the files job-blas.o27505251 and job-blas.e27505251. The first file contains the standard output of our program, and the second contains the standard error.

Interactive Session

You may find it useful to launch an interactive session when developing your code. This lets you compile and run code interactively on a compute node that you've reserved. In addition, running interactively lets you use the special interactive queue, which means you'll receive your allocation more quickly. To launch an interactive KNL node for 1 hour and run a benchmark (after you've built your code), try this:

student@cori04:~/hw1> salloc -N 1 -C knl -q interactive -t 01:00:00

salloc: Granted job allocation 53324632

salloc: Waiting for resource configuration

salloc: Nodes nid02346 are ready for job

student@nid02346:~/hw1> cd build

student@nid02346:~/hw1/build> ./benchmark-blocked

Interactive nodes may not always be available, depending on system demand. Be aware that compiling your code is much slower on a KNL interactive node compared to a login node.

Editing the Code

One of the easiest ways to work on your homework is to edit the code directly on the server. For this you need a command-line editor like nano or vim.

If you're a beginner, we recommend taking your first steps with nano. You can use it on Cori like this:

student@cori04:~/hw1> module load nano

student@cori04:~/hw1> nano dgemm-blocked.c

Use Ctrl+X to exit.

For a more complete code editor try vim which is loaded by default:

student@cori04:~/hw1> vim dgemm-blocked.c

Use Esc and :q to exit. (:q! if you want to discard changes). Try out the interactive vim tutorial to learn more.

If you're more familiar with a graphical environment, many popular IDEs can use the provided CMakeLists.txt as a project definition. Refer to the documentation of your particular IDE for help setting this up. Using hosted version control like GitHub or Bitbucket makes uploading your changes much easier. If you're in a Windows environment, consider using the Windows Subsystem for Linux (WSL) for development.

Our Harness

The benchmark.c file generates matrices of a number of different sizes and benchmarks your code's performance on each. It reports the performance in Gflop/s and as a percentage of the theoretical peak attained. Your job is to get your matrix multiply's performance as close to the theoretical peak as possible.
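
If you want to sanity-check timings outside the harness, the measurement boils down to timing one call and dividing the 2n^3 flop count by the elapsed seconds. Here is a minimal sketch (not the actual benchmark.c code; measure_gflops is a hypothetical helper, and the exact square_dgemm signature may differ slightly from this declaration):

#include <time.h>

void square_dgemm(int n, double* A, double* B, double* C);

/* Time one call to square_dgemm and convert to Gflop/s,
   counting the standard 2n^3 floating point operations. */
double measure_gflops(int n, double* A, double* B, double* C) {
    struct timespec t0, t1;
    clock_gettime(CLOCK_MONOTONIC, &t0);
    square_dgemm(n, A, B, C);
    clock_gettime(CLOCK_MONOTONIC, &t1);
    double secs = (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec);
    return 2.0 * n * n * n / secs / 1e9;
}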

When you run your code on a different computer, you will likely need to adjust the MAX_SPEED variable. This can be done like so:

student@cori04:~/hw1/build> cmake -DCMAKE_BUILD_TYPE=Debug -DMAX_SPEED=44.8 ..

student@cori04:~/hw1/build> make

On Cori, this value is computed as 1.4 GHz * 8 vector width * 2 vector pipelines * 2 flops for FMA = 44.8 GF/s.

Cori's Processors

Cori Phase I and Phase II

Cori is actually partitioned in two: Cori Phase I contains nodes with Intel Xeon CPUs, and Cori Phase II contains Intel Xeon Phi nodes with the KNL microarchitecture. In this assignment, we will only be using Cori Phase II. Be sure you use the flag '-C knl' on any jobs that you run. The job files included with the starter code do this automatically.

Theoretical Peak

Our benchmark harness reports numbers as a percentage of theoretical peak. Here, we show you how we calculate the theoretical peak of Cori's KNL processors. If you'd like to run the assignment on your own processor, you should follow this process to arrive at the theoretical peak of your own machine, and then replace the MAX_SPEED constant in benchmark.c with the theoretical peak of your machine. Be sure to change it back if you run your code on Cori again.

One Core

One core has a clock rate of 1.4 GHz, so it can issue 1.4 billion instructions per second. Cori's processors also have a 512-bit vector width, meaning each instruction can operate on 8 64-bit data elements at a time. Furthermore, the KNL microarchitecture includes a fused multiply-add (FMA) instruction, which means 2 floating point operations can be performed in a single instruction.

So, the theoretical peak of Cori's KNL nodes is:

  • 1.4 GHz * 8-element (512-bit) vector * 2 vector pipelines * 2 ops in an FMA = 44.8 GFlops/s

Optimizing

Now, it's time to optimize! A few optimizations you might consider adding:

  1. Perform blocking. The dgemm-blocked.c already gets you started with this, although you'll need to tune block sizes.

  2. Write a register-blocked kernel, either by writing an inner-level fixed-size matrix multiply and hoping (and maybe checking) that the compiler inlines it, by writing AVX intrinsics, or even by writing inline assembly instructions. (See the sketch after this list.)

  3. Add manual prefetching.

You may, of course, proceed however you wish. We recommend you look through the lecture notes as reference material to guide your optimization process, as well as the references at the bottom of this write-up.
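
To make item 2 concrete, here is a minimal sketch of one possible register-blocked micro-kernel for KNL written with AVX-512 intrinsics: an 8x4 tile of C stays in vector registers across the k loop, with one FMA per tile column per iteration. The tile shape, the microkernel_8x4 name, and the prefetch distance are all illustrative assumptions rather than the required solution; your cache-blocked loops (item 1) would call such a kernel for each full tile and handle edge tiles separately. The _mm_prefetch line shows where a manual prefetch (item 3) might go. With GCC you may need -march=knl or -mavx512f, though Cori's compiler wrappers typically set the target architecture for you.

#include <immintrin.h>

/* C[0:8,0:4] += A[0:8,0:K] * B[0:K,0:4] for column-major matrices
   with leading dimension lda. Assumes the tile is fully in bounds. */
static void microkernel_8x4(int lda, int K, const double* A,
                            const double* B, double* C) {
    /* Hold the whole 8x4 tile of C in registers across the k loop. */
    __m512d c0 = _mm512_loadu_pd(C + 0 * lda);
    __m512d c1 = _mm512_loadu_pd(C + 1 * lda);
    __m512d c2 = _mm512_loadu_pd(C + 2 * lda);
    __m512d c3 = _mm512_loadu_pd(C + 3 * lda);
    for (int k = 0; k < K; ++k) {
        /* One column of A is 8 contiguous doubles in column-major order. */
        __m512d a = _mm512_loadu_pd(A + k * lda);
        /* Manual prefetch of a later column of A; the distance (4) is a
           tunable guess, and prefetching past the end of A is harmless. */
        _mm_prefetch((const char*)(A + (k + 4) * lda), _MM_HINT_T0);
        /* c_j += A(:,k) * B(k,j): broadcast one element of B, then FMA. */
        c0 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[k + 0 * lda]), c0);
        c1 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[k + 1 * lda]), c1);
        c2 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[k + 2 * lda]), c2);
        c3 = _mm512_fmadd_pd(a, _mm512_set1_pd(B[k + 3 * lda]), c3);
    }
    _mm512_storeu_pd(C + 0 * lda, c0);
    _mm512_storeu_pd(C + 1 * lda, c1);
    _mm512_storeu_pd(C + 2 * lda, c2);
    _mm512_storeu_pd(C + 3 * lda, c3);
}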

Available Compilers

The development environment on Cori supplies several different compilers, including Intel, Cray, and LLVM. However, we want you to use the GNU C compiler for this assignment. You can make sure you are using GNU by running:

student@cori04:~> module swap PrgEnv-intel PrgEnv-gnu

Still, you might want to try your code with different compilers to see if one outperforms another. If the difference is significant, consider using the Compiler Explorer to figure out why GCC isn't optimizing your code as well. For more information on the compilers available on Cori, see the NERSC docs.

Grading

We will grade your assignment by reviewing your assignment write-up, looking at the optimization methods you attempted, and benchmarking your code's performance. To benchmark your code, we will compile it with the exact process detailed above, with the GNU compiler. Note that code that does not return correct results will receive significant penalties.

Submission Details

Supposing you are Group #04, follow these steps to create an appropriate submission archive:

  • Ensure that your write-up is located in your source directory, next to dgemm-blocked.c. It should be named cs267Group04_hw1.pdf

  • From your build directory, run:

student@cori04:~/hw1/build> cmake -DGROUP_NO=04 ..

student@cori04:~/hw1/build> make package

This second command will fail if the PDF is not present.

  • Confirm that it worked using the following command. You should see output like:

student@cori04:~/hw1/build> tar tfz cs267Group04_hw1.tar.gz

cs267Group04_hw1/cs267Group04_hw1.pdf

cs267Group04_hw1/dgemm-blocked.c

  • Submit your .tar.gz through bCourses.

Write-up Details

  • Your write-up should contain:

    • the names of the people in your group (and each member's contribution),

    • the optimizations used or attempted,

    • the results of those optimizations,

    • the reason for any odd behavior (e.g., dips) in performance, and

    • how the performance changed when running your optimized code on a different machine.

  • Please carefully read the notes for implementation details. Stay tuned to Piazza (signup) for updates and clarifications, as well as discussion.

  • If you are new to optimizing numerical codes, we recommend reading the papers in the references section.

Notes:

  • Your grade will mostly depend on three factors:

    • whether or not it is correct (i.e., finishes running without exiting early),

    • performance sustained by your codes on the Cori supercomputer,

    • explanations of the performance features you observed (including what didn't work)

  • There are other formulations of matmul (e.g., Strassen) that are mathematically equivalent but perform asymptotically fewer computations; we will not grade submissions that do fewer computations than the 2n^3 algorithm. Strassen is actually an optional part of HW1.

  • You must use the GNU C Compiler 8.3 for this assignment. If your code does not compile and run with GCC 8.3, it will not be graded.

  • Besides compiler intrinsic functions and built-ins, your code (dgemm-blocked.c) must only call into the C standard library.

  • GNU C provides many extensions, which include intrinsics for vector (SIMD) instructions and data alignment. (Other compilers may have different interfaces.)

    • To manually vectorize, you should prefer to add compiler intrinsics to your code; avoid using inline assembly, at least at first.

    • The Compiler Explorer project will be useful for exploring the relationship between your C code and its corresponding assembly. Release mode builds compile with -O3.

  • You may assume that A and B do not alias C; however, A and B may alias each other. It is semantically correct to qualify C (the last argument to square_dgemm) with the C99 restrict keyword; see the sketch after these notes. There is a lot online about restrict and pointer aliasing; this is a good article to start with, along with the Wikipedia article on the restrict keyword.

  • The matrices are all stored in column-major order, i.e. C(i,j) == C[(i-1) + (j-1)*n] for i = 1:n, where n is the number of rows in C. Note that we use 1-based indexing when using mathematical symbols and MATLAB index notation (parentheses), and 0-based indexing when using C index notation (square brackets).

  • We will check correctness by the following component-wise error bound: |square_dgemm(n,A,B,0) - A*B| < eps * n * |A| * |B|, where eps is machine epsilon and |A| denotes the matrix of element-wise absolute values of A.
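
To tie the last three notes together, here is a sketch in the spirit of that check (check_correctness and the use of DBL_EPSILON for eps are our assumptions, not the harness's exact code; the restrict qualifier on the declaration is legal because A and B never alias C):

#include <float.h>
#include <math.h>

void square_dgemm(int n, double* A, double* B, double* restrict C);

/* Verify |C - A*B| < eps * n * |A| * |B| element by element, where C
   holds the result of square_dgemm(n, A, B, 0) and |A| is the matrix
   of element-wise absolute values. Returns 1 if every entry passes. */
static int check_correctness(int n, const double* A, const double* B,
                             const double* C) {
    for (int j = 0; j < n; ++j)
        for (int i = 0; i < n; ++i) {
            double ref = 0.0, bound = 0.0;
            for (int k = 0; k < n; ++k) {
                ref   += A[i + k * n] * B[k + j * n];             /* (A*B)(i,j) */
                bound += fabs(A[i + k * n]) * fabs(B[k + j * n]); /* (|A|*|B|)(i,j) */
            }
            if (fabs(C[i + j * n] - ref) >= DBL_EPSILON * n * bound)
                return 0;
        }
    return 1;
}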

Optional Parts

These parts are not graded. You should be satisfied with your square_dgemm results and write-up before beginning an optional part.

  • Implement Strassen matmul. Consider switching over to the three-nested-loops algorithm when the recursive subproblems are small enough.

  • Support the full dgemm interface (i.e., rectangular matrices, transposing, and scalar multiples).

  • Try float (single-precision).

  • Try complex numbers (single- and double-precision) - note that complex numbers are part of C99 and supported in gcc. This forum thread gives advice on vectorizing complex multiplication with the conventional approach - but note that there are other algorithms for this operation.

  • Optimize your matmul for the case when the inputs are symmetric. Consider conventional and packed symmetric storage.

If you wish to submit optional parts, send them to us via email, rather than through the bCourses system.

Documentation

You are also welcome to learn from the source code of state-of-the-art BLAS implementations such as GotoBLAS and ATLAS. However, you should not reuse those codes in your submission.

* We emphasize that these are example scripts: for these, as well as all other assignment scripts we provide, you may need to adjust the number of requested nodes and cores and the amount of time according to your needs (your allocation and the total class allocation are limited). To understand how you are charged, READ THIS alongside the given scripts. For testing: (1) try running and debugging on your laptop first; (2) run with the minimum resources you need for testing and debugging; (3) once your code is fully debugged, use the amount of resources you need to collect the final results for the assignment. This will become more important for later assignments, but it is good to get into the habit now.

References

  • Goto, K., and van de Geijn, R. A. 2008. Anatomy of High-Performance Matrix Multiplication, ACM Transactions on Mathematical Software 34, 3, Article 12.

    • (Note: explains the design decisions for the GotoBLAS dgemm implementation, which also apply to your code.)

  • Chellappa, S., Franchetti, F., and Püschel, M. 2008. How To Write Fast Numerical Code: A Small Introduction, Lecture Notes in Computer Science 5235, 196-259.

    • (Note: how to write C code for modern compilers and memory hierarchies, so that it runs fast. Recommended reading, especially for newcomers to code optimization.)

  • Bilmes, et al. The PHiPAC (Portable High Performance ANSI C) Page for BLAS3 Compatible Fast Matrix Matrix Multiply.

    • (Note: PHiPAC is a code-generating autotuner for matmul that started as a submission for this HW in a previous semester of CS267. Also see ATLAS; both are good examples if you are considering code generation strategies.)

  • Lam, M. S., Rothberg, E. E, and Wolf, M. E. 1991. The Cache Performance and Optimization of Blocked Algorithms, ASPLOS'91, 63-74.

    • (Note: clearly explains cache blocking, supported by performance models.)