Parallel Computing

Course Objective:

The objective of this course is to teach students how to launch jobs on an HPC system and how to write simple programs for single- and multi-threaded environments.

Learning Outcome:

LO-1: The student will learn the different memory architectures and thread structures of shared-memory as well as distributed-memory computation.

LO-2: The student will be introduced to basic debugging and profiling tools and to computation on shared- and distributed-memory platforms.

Course Content:

Unit I: Computation on Shared Memory Architecture

Basic concepts of parallel computing

The OpenMP execution model
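
To make the execution model concrete, here is a minimal C/OpenMP sketch of the fork-join pattern (illustrative, not part of the prescribed text; assumes a compiler flag such as gcc's -fopenmp):

    #include <stdio.h>
    #include <omp.h>

    int main(void) {
        /* Fork-join: the initial thread forks a team of threads at the
           parallel region and joins them again at the closing brace. */
        #pragma omp parallel
        {
            printf("hello from thread %d of %d\n",
                   omp_get_thread_num(), omp_get_num_threads());
        }
        return 0;
    }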

Compiler directives, clauses, sentinels (Fortran) and pragmas (C/C++)

Data sharing of variables (shared, private, default)
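
A small C sketch of the data-sharing clauses (illustrative; variable names are arbitrary): default(none) forces an explicit decision for every variable, which is a common teaching discipline.

    #include <stdio.h>

    int main(void) {
        int n = 8, last = -1, t;

        /* default(none) requires every variable's sharing to be declared:
           n and last stay shared, t is private so each thread keeps its
           own scratch copy; the loop index i is private by rule. */
        #pragma omp parallel for default(none) shared(n, last) private(t)
        for (int i = 0; i < n; i++) {
            t = i * i;
            if (i == n - 1) last = t;   /* only one iteration writes last */
        }
        printf("last = %d\n", last);
        return 0;
    }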

Race conditions
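
An illustrative C sketch of the classic race condition and its standard fix (not from the syllabus text; compile with -fopenmp):

    #include <stdio.h>

    int main(void) {
        const int N = 1000000;
        long sum = 0;

        /* Racy: every thread updates the shared variable sum without
           synchronization, so increments are lost; the result varies
           from run to run. */
        #pragma omp parallel for
        for (int i = 0; i < N; i++)
            sum += 1;                              /* data race on sum */
        printf("racy sum    = %ld\n", sum);

        /* Fixed: reduction(+:sum) gives each thread a private copy and
           combines the copies safely when the loop ends. */
        sum = 0;
        #pragma omp parallel for reduction(+:sum)
        for (int i = 0; i < N; i++)
            sum += 1;
        printf("reduced sum = %ld\n", sum);        /* always N */
        return 0;
    }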

Constructs and Regions

Parallel loops, Parallel sections, Load balancing, Scheduling of parallel operations

Collapsing loops, Orphan directives, Environment variables
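
A hedged C sketch of loop collapsing and scheduling (illustrative sizes and chunk values); the thread count is set through the OMP_NUM_THREADS environment variable, e.g. OMP_NUM_THREADS=8 ./a.out:

    #include <stdio.h>

    #define ROWS 512
    #define COLS 512

    int main(void) {
        static double a[ROWS][COLS];

        /* collapse(2) fuses both loops into one iteration space, so all
           ROWS*COLS iterations are shared among the threads; this helps
           when ROWS alone is too small to balance the load.
           schedule(dynamic, 64) hands out chunks of 64 iterations on
           demand, smoothing out iterations of uneven cost. */
        #pragma omp parallel for collapse(2) schedule(dynamic, 64)
        for (int i = 0; i < ROWS; i++)
            for (int j = 0; j < COLS; j++)
                a[i][j] = (double)i * j;

        printf("a[ROWS-1][COLS-1] = %f\n", a[ROWS - 1][COLS - 1]);
        return 0;
    }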

Hands-on and Optimization

Unit II: Parallel Computing on Distributed Memory

Memory classification, Message passing (MPI) vs shared memory (OpenMP) parallel computing 

Rank and size; error checking
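
A minimal C/MPI sketch of rank, size, and explicit error checking (illustrative; build with mpicc and run with e.g. mpirun -np 4 ./a.out):

    #include <stdio.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);

        /* By default MPI aborts on error; switching the handler lets
           the program inspect return codes itself. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        int rank, size;
        int err = MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (err != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING]; int len;
            MPI_Error_string(err, msg, &len);
            fprintf(stderr, "MPI_Comm_rank failed: %s\n", msg);
            MPI_Abort(MPI_COMM_WORLD, err);
        }
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        printf("Hello from rank %d of %d\n", rank, size);
        MPI_Finalize();
        return 0;
    }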

MPI datatypes; blocking communication, deadlocks, and non-blocking communication
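
An illustrative C sketch of the deadlock risk in blocking exchanges and the non-blocking fix (assumes exactly two ranks):

    #include <stdio.h>
    #include <mpi.h>

    #define N (1 << 20)

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        int peer = 1 - rank;               /* run with exactly 2 ranks */

        static double out[N], in[N];

        /* DEADLOCK-PRONE pattern: if both ranks call blocking MPI_Send
           first, each waits for the other's MPI_Recv; for large messages
           neither send can complete:
             MPI_Send(out, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD);
             MPI_Recv(in,  N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);                              */

        /* Safe exchange with non-blocking calls: post both operations,
           then wait for completion. */
        MPI_Request reqs[2];
        MPI_Irecv(in,  N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[0]);
        MPI_Isend(out, N, MPI_DOUBLE, peer, 0, MPI_COMM_WORLD, &reqs[1]);
        MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

        printf("rank %d: exchange complete\n", rank);
        MPI_Finalize();
        return 0;
    }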

Barriers, Broadcasts, Gathering and Scattering data
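
A hedged C sketch combining these collectives (illustrative chunk size; any number of ranks):

    #include <stdio.h>
    #include <stdlib.h>
    #include <mpi.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank, size;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        MPI_Comm_size(MPI_COMM_WORLD, &size);

        /* Broadcast a parameter known only at the root. */
        int chunk = 0;
        if (rank == 0) chunk = 4;
        MPI_Bcast(&chunk, 1, MPI_INT, 0, MPI_COMM_WORLD);

        /* The root scatters one chunk of the array to every rank... */
        double *full = NULL;
        if (rank == 0) {
            full = malloc((size_t)size * chunk * sizeof(double));
            for (int i = 0; i < size * chunk; i++) full[i] = i;
        }
        double local[4];                  /* matches chunk in this sketch */
        MPI_Scatter(full, chunk, MPI_DOUBLE,
                    local, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        for (int i = 0; i < chunk; i++) local[i] *= 2.0;   /* local work */

        MPI_Barrier(MPI_COMM_WORLD);      /* illustrative synchronization */

        /* ...then gathers the processed chunks back in rank order. */
        MPI_Gather(local, chunk, MPI_DOUBLE,
                   full, chunk, MPI_DOUBLE, 0, MPI_COMM_WORLD);

        if (rank == 0) {
            printf("full[%d] = %g\n", size * chunk - 1,
                   full[size * chunk - 1]);
            free(full);
        }
        MPI_Finalize();
        return 0;
    }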

Constructing MPI datatypes for Fortran types and C structures
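
An illustrative C sketch of a derived datatype for a struct (the Particle type is invented for the example; run with two ranks):

    #include <stdio.h>
    #include <stddef.h>
    #include <mpi.h>

    /* A C struct we want to send as a single message. */
    typedef struct {
        double x, y, z;
        int    id;
    } Particle;

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Describe the struct layout to MPI: block lengths, byte
           offsets, and element types for each member. */
        int          blocklens[2] = {3, 1};
        MPI_Aint     displs[2]    = {offsetof(Particle, x),
                                     offsetof(Particle, id)};
        MPI_Datatype types[2]     = {MPI_DOUBLE, MPI_INT};
        MPI_Datatype mpi_particle;
        MPI_Type_create_struct(2, blocklens, displs, types, &mpi_particle);
        MPI_Type_commit(&mpi_particle);

        Particle p = {1.0, 2.0, 3.0, 42};
        if (rank == 0)
            MPI_Send(&p, 1, mpi_particle, 1, 0, MPI_COMM_WORLD);
        else if (rank == 1) {
            MPI_Recv(&p, 1, mpi_particle, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("received particle %d at (%g, %g, %g)\n",
                   p.id, p.x, p.y, p.z);
        }

        MPI_Type_free(&mpi_particle);
        MPI_Finalize();
        return 0;
    }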

Generating shell scripts for HPC job submission

Basic HPC commands

Hands-on, Performance optimization tools and techniques

Unit III: Computer Simulation using a single GPU device

Evolution of GPU architectures 

Understanding Parallelism with GPU 

CUDA Hardware Overview 

Accelerators, Kernels Launch parameters

Thread hierarchy, Blocks, Grids, Warps

1D/2D/3D thread mapping
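
A CUDA C sketch of 2D thread mapping (illustrative kernel name; a naive transpose where each thread handles one element):

    /* 2D mapping: one thread per element (x, y) of an ny-by-nx matrix;
       the naive transpose writes in[y][x] to out[x][y]. */
    __global__ void transpose_naive(const float *in, float *out,
                                    int nx, int ny) {
        int x = blockIdx.x * blockDim.x + threadIdx.x;   /* column */
        int y = blockIdx.y * blockDim.y + threadIdx.y;   /* row    */
        if (x < nx && y < ny)
            out[x * ny + y] = in[y * nx + x];
    }

    /* Launch with a 2D grid of 2D blocks:
           dim3 block(16, 16);
           dim3 grid((nx + 15) / 16, (ny + 15) / 16);
           transpose_naive<<<grid, block>>>(d_in, d_out, nx, ny);       */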

Memory hierarchy: DRAM/global, local/shared, private/local, texture, and constant memory; pointers and parameter passing; arrays and dynamic memory; multi-dimensional arrays; memory allocation; memory copying across devices
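
A complete CUDA C vector-add sketch tying these topics together: launch parameters, 1D thread mapping, global-memory allocation, and host-device copies (illustrative names; compile with nvcc):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* One thread per element: the global index is built from the block
       index, block size, and thread index (1D thread mapping). */
    __global__ void vec_add(const float *a, const float *b, float *c, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n)                     /* guard: the grid may overshoot n */
            c[i] = a[i] + b[i];
    }

    int main(void) {
        const int n = 1 << 20;
        size_t bytes = n * sizeof(float);

        float *ha = (float *)malloc(bytes);
        float *hb = (float *)malloc(bytes);
        float *hc = (float *)malloc(bytes);
        for (int i = 0; i < n; i++) { ha[i] = (float)i; hb[i] = 2.0f * i; }

        /* Allocate device (global) memory and copy inputs host -> device. */
        float *da, *db, *dc;
        cudaMalloc((void **)&da, bytes);
        cudaMalloc((void **)&db, bytes);
        cudaMalloc((void **)&dc, bytes);
        cudaMemcpy(da, ha, bytes, cudaMemcpyHostToDevice);
        cudaMemcpy(db, hb, bytes, cudaMemcpyHostToDevice);

        /* Launch parameters: enough blocks of 256 threads to cover n. */
        int threads = 256;
        int blocks  = (n + threads - 1) / threads;
        vec_add<<<blocks, threads>>>(da, db, dc, n);

        cudaMemcpy(hc, dc, bytes, cudaMemcpyDeviceToHost);
        printf("hc[n-1] = %f\n", hc[n - 1]);   /* expect 3*(n-1) */

        cudaFree(da); cudaFree(db); cudaFree(dc);
        free(ha); free(hb); free(hc);
        return 0;
    }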

Unit IV: Parallel Computation with multiple GPUs and FPGA

Programming on Heterogeneous Cluster
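
A minimal CUDA C sketch of enumerating and selecting multiple GPUs, the starting point for multi-device and cluster work (illustrative output only):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void) {
        int ndev = 0;
        cudaGetDeviceCount(&ndev);
        printf("%d CUDA device(s) visible\n", ndev);

        /* Select each visible GPU in turn; allocations, copies, and
           kernel launches issued after cudaSetDevice(d) target device d.
           On a heterogeneous cluster this is typically combined with
           MPI, e.g. one rank per GPU. */
        for (int d = 0; d < ndev; d++) {
            cudaSetDevice(d);
            cudaDeviceProp prop;
            cudaGetDeviceProperties(&prop, d);
            printf("device %d: %s, %d SMs\n",
                   d, prop.name, prop.multiProcessorCount);
        }
        return 0;
    }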

Common Problems: CUDA Error Handling, Parallel Programming Issues, Synchronization, Algorithmic Issues, Finding and Avoiding Errors
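
A common CUDA error-handling idiom, sketched in C (the CUDA_CHECK macro name is illustrative):

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Wrap every runtime call: report file/line and the human-readable
       error string, then abort.  Kernel launches need an explicit
       cudaGetLastError() because they return no status themselves. */
    #define CUDA_CHECK(call)                                               \
        do {                                                               \
            cudaError_t err_ = (call);                                     \
            if (err_ != cudaSuccess) {                                     \
                fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
                        cudaGetErrorString(err_));                         \
                exit(EXIT_FAILURE);                                        \
            }                                                              \
        } while (0)

    int main(void) {
        float *d = NULL;
        CUDA_CHECK(cudaMalloc((void **)&d, 1024 * sizeof(float)));
        /* after a kernel launch: CUDA_CHECK(cudaGetLastError()); */
        CUDA_CHECK(cudaFree(d));
        puts("ok");
        return 0;
    }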

Optimizing CUDA Applications: Problem Decomposition, Memory Considerations, Transfers, Thread Usage, Resource Contentions
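
A hedged CUDA C sketch of two of these transfer optimizations, pinned host memory and an asynchronous stream (illustrative sizes and kernel):

    #include <stdio.h>
    #include <cuda_runtime.h>

    __global__ void scale(float *x, int n) {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] *= 2.0f;
    }

    int main(void) {
        const int n = 1 << 22;
        size_t bytes = n * sizeof(float);

        /* Pinned (page-locked) host memory speeds up host<->device
           transfers and is required for truly asynchronous copies. */
        float *h;
        cudaMallocHost((void **)&h, bytes);
        for (int i = 0; i < n; i++) h[i] = 1.0f;

        float *d;
        cudaMalloc((void **)&d, bytes);

        /* Queue copy-in, kernel, and copy-out on one stream; with the
           data split across several streams, transfers of one chunk can
           overlap with computation on another. */
        cudaStream_t s;
        cudaStreamCreate(&s);
        cudaMemcpyAsync(d, h, bytes, cudaMemcpyHostToDevice, s);
        scale<<<(n + 255) / 256, 256, 0, s>>>(d, n);
        cudaMemcpyAsync(h, d, bytes, cudaMemcpyDeviceToHost, s);
        cudaStreamSynchronize(s);

        printf("h[0] = %f\n", h[0]);   /* expect 2.0 */
        cudaStreamDestroy(s);
        cudaFree(d);
        cudaFreeHost(h);
        return 0;
    }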

Debugging GPU Programs: Profiling, Profile tools, Performance aspects

Introduction to FPGA: OpenCL Standard 

Kernels 

Host Device Interaction 

Execution Environment 

Memory Model 

Basic OpenCL Examples
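
A condensed C sketch of the OpenCL host-device pattern, covering kernel, host interaction, execution environment, and memory model in one example (illustrative; assumes an OpenCL runtime is installed, link with -lOpenCL; error checking omitted for brevity):

    #include <stdio.h>
    #include <CL/cl.h>

    /* OpenCL kernels are compiled at run time from source strings. */
    static const char *src =
        "__kernel void vadd(__global const float *a,\n"
        "                   __global const float *b,\n"
        "                   __global float *c) {\n"
        "    int i = get_global_id(0);\n"
        "    c[i] = a[i] + b[i];\n"
        "}\n";

    int main(void) {
        enum { N = 1024 };
        float a[N], b[N], c[N];
        for (int i = 0; i < N; i++) { a[i] = i; b[i] = 2.0f * i; }

        /* Pick the first platform and device the runtime reports. */
        cl_platform_id plat; cl_device_id dev;
        clGetPlatformIDs(1, &plat, NULL);
        clGetDeviceIDs(plat, CL_DEVICE_TYPE_DEFAULT, 1, &dev, NULL);

        cl_context ctx = clCreateContext(NULL, 1, &dev, NULL, NULL, NULL);
        /* OpenCL 1.x queue API, still the common choice on FPGA SDKs. */
        cl_command_queue q = clCreateCommandQueue(ctx, dev, 0, NULL);

        /* Build the kernel from source. */
        cl_program prog = clCreateProgramWithSource(ctx, 1, &src, NULL, NULL);
        clBuildProgram(prog, 1, &dev, NULL, NULL, NULL);
        cl_kernel k = clCreateKernel(prog, "vadd", NULL);

        /* Device buffers; COPY_HOST_PTR initializes them from a and b. */
        cl_mem da = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof a, a, NULL);
        cl_mem db = clCreateBuffer(ctx, CL_MEM_READ_ONLY | CL_MEM_COPY_HOST_PTR,
                                   sizeof b, b, NULL);
        cl_mem dc = clCreateBuffer(ctx, CL_MEM_WRITE_ONLY, sizeof c, NULL, NULL);

        clSetKernelArg(k, 0, sizeof da, &da);
        clSetKernelArg(k, 1, sizeof db, &db);
        clSetKernelArg(k, 2, sizeof dc, &dc);

        /* Launch N work-items and read the result back. */
        size_t global = N;
        clEnqueueNDRangeKernel(q, k, 1, NULL, &global, NULL, 0, NULL, NULL);
        clEnqueueReadBuffer(q, dc, CL_TRUE, 0, sizeof c, c, 0, NULL, NULL);

        printf("c[N-1] = %f\n", c[N - 1]);   /* expect 3*(N-1) */

        clReleaseMemObject(da); clReleaseMemObject(db); clReleaseMemObject(dc);
        clReleaseKernel(k); clReleaseProgram(prog);
        clReleaseCommandQueue(q); clReleaseContext(ctx);
        return 0;
    }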

Course Evaluation:

Internal Test 1 (25 Marks)

Internal Test 2 (25 Marks)

Internal Test 3 (25 Marks)

Final Exam (50 Marks)

The final evaluation sheet will be prepared using the best two of the three internal tests (25 + 25 = 50 marks) plus the final exam (50 marks).

Textbooks:

Parallel Programming Patterns: Working with Concurrency in OpenMP, MPI, Java, and OpenCL, by Timothy G. Mattson, Berna Massingill, and Beverly Sanders; Pearson

An Introduction to Parallel Programming with OpenMP, PThreads and MPI, by Robert Cook; Cook's Books (2011)

The OpenMP Common Core: Making OpenMP Simple Again (Scientific and Engineering Computation), by Timothy G. Mattson, Yun He, Alice E. Koniges;  The MIT Press (2019)

Using MPI: Portable Parallel Programming with the Message-Passing Interface, by William Gropp, Ewing Lusk, Anthony Skjellum; The MIT Press (2014)

MPI: The Complete Reference, by Marc Snir, Steve Otto, Steven Huss-Lederman, David Walker, and Jack Dongarra; The MIT Press

CUDA Programming: A Developer's Guide to Parallel Computing with GPUs, by Shane Cook; Elsevier Science (2012)

CUDA by Example: An Introduction to General-Purpose GPU Programming, by Jason Sanders and Edward Kandrot; Pearson (2010)

The CUDA Handbook: A Comprehensive Guide to GPU Programming, by Nicholas Wilt; Pearson (2013)

Reconfigurable Computing: The Theory and Practice of FPGA-Based Computation, by Andre DeHon, Scott Hauck; Morgan Kaufmann Publishers (2007)

Introduction to Reconfigurable Computing: Architectures, Algorithms, and Applications, by Christophe Bobda; Springer (2007)