EE P 590A: Applied Parallel Programming on GPU
I'm told this is one of the most popular courses in ECE in spring.
Enrollment shifts often happen in week 1 of the quarter across all classes, so students who wait a bit are often able to register for their first choice. Sign up for NotifyUW, which will send you an email/text when a spot opens up: https://itconnect.uw.edu/tools-services-support/academic-planning/notify-uw/.
Why it is important to learn the GPU software-hardware stack for AI optimization:
The demand for compute from LLMs has been outpacing what Moore's Law and heterogeneous hardware designs can deliver by several orders of magnitude. This gap can only be bridged by an aggressive wave of software optimizations. Unfortunately, it is not easy for a fresh graduate to step into an industry role and excel at backend AI library optimization: there is a significant gap between coursework and rapidly evolving AI practice, which is one of the main reasons AI library teams worldwide prefer experienced candidates over fresh graduates. It is therefore important for graduates to bridge this knowledge gap, and EE 590A is a milestone in that direction.
What are we doing differently:
EE 590A is designed with strong relevance to industry and combines elements of industry practice with academia. Instead of the traditional redo of an age-old problem statement, students form groups of 3 and pick their course projects from a set of hand-picked real-world problems drawn from industry, National Labs, and academia, so they get to learn about and work on something that is both novel and useful to the AI community. The best project is then recognized by AMD. Additionally, instead of exams or quizzes, we take the practice-oriented route: students solve programming assignments on our AMD HPC & AI clusters.
Course description [Video description]
This course focuses on programming massively parallel processors such as GPUs. LLMs, self-driving cars, and augmented reality are all examples of applications that rely on parallel computing. Through a deep dive into parallel computing concepts, programming models, and case studies, students will develop a strong foundation in GPU programming and gain valuable skills in efficient parallel algorithm design and implementation.
The course will be based on AMD’s HIP (ROCm) programming interface because of its many open-source components and its portability. We will cover the internal architecture of GPUs and high-performance implementations of parallel algorithms. The curriculum will be delivered in 10 lectures. The class has 6 programming assignments and a final project (teams of 2). The Best Project Award for this course is sponsored by AMD.
Course Objective:
Develop a thorough understanding of parallel computing principles and their relevance to modern applications.
Gain familiarity with GPU programming using AMD’s HIP (ROCm) interface and its open-source components.
Understand the internal architecture of GPUs.
Learn to map computations to parallel hardware and optimize performance.
Analyze cutting-edge applications in deep learning and genomics.
Course Topics:
Introduction
Course Overview
Introduction to GPUs and their importance in parallel computing
Introduction to data parallel programming with HIP (ROCm)
Programming Assignment: HIP Device query & vector addition
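As a flavor of what the first assignment involves, here is a minimal plain-C++ sketch (not the actual assignment code) of the data-parallel decomposition a HIP vector-addition kernel uses: each GPU thread computes one output element, indexed by its block and thread IDs. The block size of 256 is an illustrative choice.

```cpp
#include <vector>

// CPU emulation of the GPU thread grid. In a HIP kernel the two loops vanish:
// each (block, thread) pair runs concurrently and computes
//   i = blockIdx.x * blockDim.x + threadIdx.x.
std::vector<float> vector_add(const std::vector<float>& a,
                              const std::vector<float>& b,
                              int block_dim = 256) {
    int n = static_cast<int>(a.size());
    std::vector<float> c(n);
    int num_blocks = (n + block_dim - 1) / block_dim;  // ceil-divide, as in a kernel launch
    for (int block = 0; block < num_blocks; ++block)
        for (int thread = 0; thread < block_dim; ++thread) {
            int i = block * block_dim + thread;
            if (i < n) c[i] = a[i] + b[i];  // bounds check, since n may not fill the last block
        }
    return c;
}
```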
GPU Architecture and Profiling (2 lectures)
Study of GPU memory model, hierarchy, and locality
Microarchitecture
Profiling- Roofline, Tools & Techniques
Programming Assignment: Naive matrix multiplication
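To make the "naive" label concrete, here is a hedged plain-C++ sketch of the decomposition the naive GPU matmul uses: one thread per output element, with every thread re-reading a full row of A and column of B from global memory (no data reuse — the inefficiency the later tiled version fixes). Square N x N row-major matrices are assumed for brevity.

```cpp
#include <vector>

// One GPU thread per output element C[row][col]; this CPU sketch mirrors
// that thread-per-element decomposition with two loops.
std::vector<float> matmul_naive(const std::vector<float>& A,
                                const std::vector<float>& B, int N) {
    std::vector<float> C(N * N, 0.0f);
    for (int row = 0; row < N; ++row)        // on the GPU: blockIdx.y / threadIdx.y
        for (int col = 0; col < N; ++col) {  // on the GPU: blockIdx.x / threadIdx.x
            float acc = 0.0f;
            for (int k = 0; k < N; ++k)      // each thread reads 2N global values
                acc += A[row * N + k] * B[k * N + col];
            C[row * N + col] = acc;
        }
    return C;
}
```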
Parallel Computation Patterns (2 lectures)
Reduction trees
Scans
Atomic operations and Histogramming
Programming Assignments: List reduction, Histogram
Parallel Algorithm Implementation (2 lectures)
Matrix Multiplication- naive & tiled, reuse analysis
Matrix Transpose
Convolution- naive and tiled
Programming Assignments: Tiled matrix multiplication, 3D convolution
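To illustrate the reuse analysis behind tiling, here is a hedged plain-C++ sketch: the computation is blocked into TILE x TILE phases, so each loaded sub-matrix of A and B is reused TILE times instead of once. On the GPU, the inner phase corresponds to a block staging both tiles in shared memory; the TILE value here is illustrative, and N need not be a multiple of TILE.

```cpp
#include <algorithm>
#include <vector>

// Tiled matmul over row-major N x N matrices. The k0 loop is one "phase":
// on the GPU, the block loads A[r0..][k0..] and B[k0..][c0..] into shared
// memory, synchronizes, then every thread reuses those TILE-wide strips.
std::vector<float> matmul_tiled(const std::vector<float>& A,
                                const std::vector<float>& B, int N,
                                int TILE = 2) {
    std::vector<float> C(N * N, 0.0f);
    for (int r0 = 0; r0 < N; r0 += TILE)          // tile row (blockIdx.y)
        for (int c0 = 0; c0 < N; c0 += TILE)      // tile col (blockIdx.x)
            for (int k0 = 0; k0 < N; k0 += TILE)  // phase: one tile pair
                for (int r = r0; r < std::min(r0 + TILE, N); ++r)
                    for (int c = c0; c < std::min(c0 + TILE, N); ++c)
                        for (int k = k0; k < std::min(k0 + TILE, N); ++k)
                            C[r * N + c] += A[r * N + k] * B[k * N + c];
    return C;
}
```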
Application Case Study
Genomics
Deep Learning
Miscellaneous Topics
Streams
HIP Graphs
Sorting
Graph Traversal
Sparse Matrix-Vector Multiplication
Class Review & Project Discussion
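Of the miscellaneous topics above, sparse matrix-vector multiplication is easy to preview in a few lines. Here is a hedged plain-C++ sketch of SpMV over the CSR format, where on a GPU one thread (or one wavefront) is typically assigned per row; the example matrix in the usage note is illustrative.

```cpp
#include <vector>

// CSR SpMV: row_ptr[i]..row_ptr[i+1] delimits row i's nonzeros, stored as
// (col_idx[j], vals[j]) pairs. The outer loop is the per-row parallelism a
// GPU kernel exploits; its irregular row lengths cause load imbalance.
std::vector<float> spmv_csr(const std::vector<int>& row_ptr,
                            const std::vector<int>& col_idx,
                            const std::vector<float>& vals,
                            const std::vector<float>& x) {
    int n = static_cast<int>(row_ptr.size()) - 1;
    std::vector<float> y(n, 0.0f);
    for (int row = 0; row < n; ++row)  // on the GPU: one thread per row
        for (int j = row_ptr[row]; j < row_ptr[row + 1]; ++j)
            y[row] += vals[j] * x[col_idx[j]];
    return y;
}
```

For example, the 3x3 matrix [[1,0,2],[0,3,0],[4,0,5]] is stored as row_ptr = {0,2,3,5}, col_idx = {0,2,1,0,2}, vals = {1,2,3,4,5}.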
Assessment:
Programming assignments will be individually assessed and graded. There will be a final project whose code and report will be graded.
Pre-requisites:
Familiarity with C++ programming, including writing loops, is required.
An understanding of algorithms is preferred.
Fundamental knowledge of computer organization is helpful but optional. We cover it in a capsule format.
About the course instructor:
Dr. Hari Sadasivan is a Staff Engineer in the AI Group at AMD and the instructor for EE P 590A. He works on optimizing irregular AI and genomics workloads on GPUs and drives multiple research collaborations between AMD and academia. Hari holds a PhD in CSE from the University of Michigan, Ann Arbor; his thesis is on portable and programmable solutions for accelerated DNA sequencing. His PhD research has won several accolades, including an ACM MICRO Top Picks Honorable Mention. Hari has previously worked at NVIDIA and Samsung R&D.
Check out Hari's home page.