EE P 590A: Applied Parallel Programming on GPU
I'm told this is one of the most popular courses in ECE in spring.
Enrollment shifts often happen in week 1 of the quarter across all classes, so students who wait a bit are often able to register for their first choice. Sign up for NotifyUW, which will send you an email/text when a spot opens up: https://itconnect.uw.edu/tools-services-support/academic-planning/notify-uw/.
Why it is important to learn the GPU software-hardware stack for AI optimization:
The demand for compute from LLMs has been outpacing what Moore's Law and heterogeneous hardware designs can deliver by several orders of magnitude. This gap can only be bridged by an aggressive wave of software optimizations. Unfortunately, it is not easy for a fresh graduate to step into an industry role and excel at backend AI library optimization: there is a significant gap between coursework and rapidly evolving AI practice, which is one of the main reasons AI library teams worldwide prefer experienced candidates over fresh graduates. It is therefore important for graduates to bridge this knowledge gap, and EE 590A is a milestone in that direction.
What are we doing differently:
EE 590A is designed with strong relevance to industry and combines elements of industry practice with academia. Instead of the traditional redo of an age-old problem statement, students form groups of 3 and pick their course projects from a set of hand-picked real-world problems drawn from industry, National Labs, and academia, so they get to learn about and work on something that is both novel and useful to the AI community. The best project is then recognized by AMD. Additionally, instead of exams or quizzes, we take the practice-oriented route: students solve programming assignments on our AMD HPC & AI clusters.
Course description [Video description]
This course focuses on programming massively parallel processors such as GPUs. LLMs, self-driving cars, and augmented reality are all examples of applications that rely on parallel computing. Through a deep dive into parallel computing concepts, programming models, and case studies, students will develop a strong foundation in GPU programming and gain valuable skills in efficient parallel algorithm design and implementation.
The course will be based on AMD’s HIP (ROCm) programming interface because of its many open-source components and its portability. We will cover the internal architecture of GPUs and high-performance implementations of parallel algorithms. The curriculum will be delivered in 10 lectures. The class has 6 programming assignments and a final project (teams of 2). The Best Project Award for this course is sponsored by AMD.
Course Objective:
Develop a thorough understanding of parallel computing principles and their relevance to modern applications.
Gain familiarity with GPU programming using AMD’s HIP (ROCm) interface and its open-source components.
Understand the internal architecture of GPUs.
Learn to map computations to parallel hardware and optimize performance.
Analyze cutting-edge applications in deep learning and genomics.
Course Topics:
Introduction
Course Overview
Introduction to GPUs and their importance in parallel computing
Introduction to data parallel programming with HIP (ROCm)
Programming Assignment: HIP Device query & vector addition
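As a flavor of what the first assignment involves, here is a minimal plain-C++ sketch (not the actual assignment code) of the data-parallel decomposition a HIP vector-addition kernel uses: each GPU thread computes one output element, indexed by its block and thread IDs. The block size of 256 is an illustrative choice.

```cpp
#include <vector>

// CPU emulation of the GPU thread grid. In a HIP kernel the two loops vanish:
// each (block, thread) pair runs concurrently and computes
//   i = blockIdx.x * blockDim.x + threadIdx.x.
std::vector<float> vector_add(const std::vector<float>& a,
                              const std::vector<float>& b,
                              int block_dim = 256) {
    int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    int num_blocks = (n + block_dim - 1) / block_dim;  // ceil-divide, as in a kernel launch
    for (int block = 0; block < num_blocks; ++block)
        for (int thread = 0; thread < block_dim; ++thread) {
            int i = block * block_dim + thread;
            if (i < n) c[i] = a[i] + b[i];  // bounds check, since n may not fill the last block
        }
    return c;
}
```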
GPU Architecture and Profiling (2 lectures)
Study of GPU memory model, hierarchy, and locality
Microarchitecture
Profiling- Roofline, Tools & Techniques
Programming Assignment: Naive matrix multiplication
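To make the "naive" label concrete, here is a hedged plain-C++ sketch of the decomposition the naive GPU matmul uses: one thread per output element, with every thread re-reading a full row of A and column of B from global memory (no data reuse — the inefficiency the later tiled version fixes). Square N x N row-major matrices are assumed for brevity.

```cpp
#include <vector>

// One GPU thread per output element C[row][col]; this CPU sketch mirrors
// that thread-per-element decomposition with two loops.
std::vector<float> matmul_naive(const std::vector<float>& A,
                                const std::vector<float>& B, int N) {
    std::vector<float> C(N * N, 0.0f);
    for (int row = 0; row < N; ++row)        // on the GPU: blockIdx.y / threadIdx.y
        for (int col = 0; col < N; ++col) {  // on the GPU: blockIdx.x / threadIdx.x
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)      // each thread reads 2N global values
                acc += A[row * N + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    return C;
}
```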
Parallel Computation Patterns (2 lectures)
Reduction trees
Scans
Atomic operations and Histogramming
Programming Assignments: List reduction, Histogram
Parallel Algorithm Implementation (2 lectures)
Matrix Multiplication- naive & tiled, reuse analysis
Matrix Transpose
Convolution- naive and tiled
Programming Assignments: Tiled matrix multiplication, 3D convolution
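To illustrate the reuse analysis behind tiling, here is a hedged plain-C++ sketch: the computation is blocked into TILE x TILE phases, so each loaded sub-matrix of A and B is reused TILE times instead of once. On the GPU, the inner phase corresponds to a block staging both tiles in shared memory; the TILE value here is illustrative, and N need not be a multiple of TILE.

```cpp
#include <algorithm>
#include <vector>

// Tiled matmul over row-major N x N matrices. The k0 loop is one "phase":
// on the GPU, the block loads A[r0..][k0..] and B[k0..][c0..] into shared
// memory, synchronizes, then every thread reuses those TILE-wide strips.
std::vector<float> matmul_tiled(const std::vector<float>& A,
                                const std::vector<float>& B, int N,
                                int TILE = 2) {
    std::vector<float> C(N * N, 0.0f);
    for (int r0 = 0; r0 < N; r0 += TILE)          // tile row (blockIdx.y)
        for (int c0 = 0; c0 < N; c0 += TILE)      // tile col (blockIdx.x)
            for (int k0 = 0; k0 < N; k0 += TILE)  // phase: one tile pair
                for (int r = r0; r < std::min(r0 + TILE, N); ++r)
                    for (int c = c0; c < std::min(c0 + TILE, N); ++c)
                        for (int k = k0; k < std::min(k0 + TILE, N); ++k)
                            C[r * N + c] += A[r * N + k] * B[k * N + c];
    return C;
}
```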
Application Case Study
Genomics
Deep Learning
Miscellaneous Topics
Streams
HIP Graphs
Sorting
Graph Traversal
Sparse Matrix-Vector Multiplication
Class Review & Project Discussion
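Of the miscellaneous topics above, sparse matrix-vector multiplication is easy to preview in a few lines. Here is a hedged plain-C++ sketch of SpMV over the CSR format, where on a GPU one thread (or one wavefront) is typically assigned per row; the example matrix in the usage note is illustrative.

```cpp
#include <vector>

// CSR SpMV: row_ptr[i]..row_ptr[i+1] delimits row i's nonzeros, stored as
// (col_idx[j], vals[j]) pairs. The outer loop is the per-row parallelism a
// GPU kernel exploits; its irregular row lengths cause load imbalance.
std::vector<float> spmv_csr(const std::vector<int>& row_ptr,
                            const std::vector<int>& col_idx,
                            const std::vector<float>& vals,
                            const std::vector<float>& x) {
    int n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<float> y(n, 0.0f);
    for (int row = 0; row < n; ++row)  // on the GPU: one thread per row
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            y[row] += vals[j] * x[col_idx[j]];
    return y;
}
```

For example, the 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]] is stored as row_ptr = {0,2,3,5}, col_idx = {0,2,1,0,2}, vals = {1,2,3,4,5}.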
Assessment:
Programming assignments will be individually assessed and graded. There will be a final project whose code and report will be graded.
Pre-requisites:
Familiarity with C++ programming, including writing loops, is required.
An understanding of algorithms is preferred.
Fundamental knowledge of computer organization is helpful but optional. We cover it in a capsule format.
About the course instructor:
Dr. Hari Sadasivan is a Staff Engineer in the AI Group at AMD and the instructor for EE P 590A. He works on optimizing irregular AI and genomics workloads on GPUs and drives multiple research collaborations between AMD and academia. Hari holds a PhD in CSE from the University of Michigan, Ann Arbor; his thesis is on portable and programmable solutions for accelerated DNA sequencing. His PhD research has won several accolades, including an ACM MICRO Top Picks Honorable Mention. Hari has previously worked at NVIDIA and Samsung R&D.
Check out Hari's home page.