Parallel Computing
Course Objective:
The objective of this course is to teach students how to launch jobs on an HPC system and how to write simple programs for single- and multi-threaded environments.
Learning Outcome:
LO-1: The student will learn the memory architectures and thread structures of shared-memory as well as distributed-memory computation.
LO-2: The student will be introduced to basic debugging and profiling tools and to computation on shared- and distributed-memory platforms.
Course Content:
Unit I: Computation on Shared Memory Architecture
Basic concepts of parallel computing
The OpenMP execution model
Compiler directives, clauses, “sentinels” and pragmas
Data sharing of variables (shared, private, default)
Race conditions
Constructs and Regions
Parallel loops, Parallel sections, Load balancing, Scheduling of parallel operations
Collapsing loops, Orphan directives, Environment variables
Hands-on and Optimization
Unit II: Parallel Computing on Distributed Memory
Memory classification, Message passing (MPI) vs shared memory (OpenMP) parallel computing
Rank and size; error checking
MPI datatypes, Blocking communication, deadlocks and Non-blocking communication
Barriers, Broadcasts, Gathering and Scattering data
Constructing MPI datatypes for Fortran types and C structures
Generating shell-scripts for HPC
Basic HPC commands
Hands-on, Performance optimization tools and techniques
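For the "Generating shell-scripts for HPC" and "Basic HPC commands" topics above, a typical workflow wraps the MPI run in a batch script handed to the cluster scheduler. The sketch below is a scheduler configuration fragment, not runnable outside a cluster; it assumes a SLURM-based system, and the job name, module name, and executable (mpi_hello) are placeholders rather than anything from the course material.

```shell
#!/bin/bash
#SBATCH --job-name=mpi_hello       # job name shown by squeue
#SBATCH --nodes=2                  # number of compute nodes
#SBATCH --ntasks-per-node=4        # MPI ranks per node (8 ranks total)
#SBATCH --time=00:10:00            # wall-clock limit, hh:mm:ss
#SBATCH --output=mpi_hello.%j.out  # stdout file (%j expands to the job id)

# Load the site's MPI environment (the module name is site-specific).
module load openmpi

# Launch one MPI process per allocated task.
mpirun ./mpi_hello
```

The matching basic HPC commands are `sbatch job.sh` to submit, `squeue -u $USER` to watch the queue, and `scancel <jobid>` to kill a job.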
Unit III: Computer Simulation using a single GPU device
Evolution of GPU architectures
Understanding Parallelism with GPU
CUDA Hardware Overview
Accelerators, Kernels Launch parameters
Thread hierarchy, Blocks, Grids, Warps
1D/2D/3D thread mapping
Memory hierarchy: DRAM/global, local/shared, private/local, textures, constant memory
Pointers, parameter passing, arrays and dynamic memory, multi-dimensional arrays, memory allocation, memory copying across devices
Unit IV: Parallel Computation with multiple GPUs and FPGA
Programming on Heterogeneous Cluster
Common Problems: CUDA Error Handling, Parallel Programming Issues, Synchronization, Algorithmic Issues, Finding and Avoiding Errors
Optimizing CUDA Applications: Problem Decomposition, Memory Considerations, Transfers, Thread Usage, Resource Contentions
Debugging GPU Programs: Profiling, Profile tools, Performance aspects
Introduction to FPGA: OpenCL Standard
Kernels
Host Device Interaction
Execution Environment
Memory Model
Basic OpenCL Examples
Course Evaluation:
Internal Test 1 (25 Marks)
Internal Test 2 (25 Marks)
Internal Test 3 (25 Marks)
Final Exam (50 Marks)
The final evaluation sheet will be prepared from the best TWO of the three Internal Tests (25 + 25 = 50 Marks) plus the Final Exam (50 Marks).
Textbooks:
Parallel Programming Patterns: Working with Concurrency in OpenMP, MPI, Java, and OpenCL – by Timothy G. Mattson, Berna Massingill and Beverly Sanders; Pearson Press
An Introduction to Parallel Programming with OpenMP, PThreads and MPI – by Robert Cook; Cook's Books (2011)
The OpenMP Common Core: Making OpenMP Simple Again (Scientific and Engineering Computation), by Timothy G. Mattson, Yun He, Alice E. Koniges; The MIT Press (2019)
Using MPI: Portable Parallel Programming with the Message-Passing Interface, by William Gropp, Ewing Lusk, Anthony Skjellum; The MIT Press (2014)
MPI: The Complete Reference, by Marc Snir, Jack Dongarra, Janusz S. Kowalik, The MIT Press
CUDA Programming: A Developer's Guide to Parallel Computing with GPUs, by Shane Cook; Elsevier Science (2012)
CUDA by Example: An Introduction to General-Purpose GPU Programming, by Jason Sanders and Edward Kandrot; Pearson (2010)
The CUDA Handbook: A Comprehensive Guide to GPU Programming, by Nicholas Wilt; Pearson (2013)
Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, by Andre DeHon, Scott Hauck; Morgan Kaufmann Publishers (2007)
Introduction to Reconfigurable Computing: Architectures, Algorithms, and Applications, by Christophe Bobda; Springer (2007)