Teaching
In Spring (Jan-May) 2019, Dr Sparsh taught the course "Hardware Architectures for Deep Learning". Its contents were as follows.
Overview and motivation for designing hardware accelerators for deep learning
Background:
Approximate computing and storage
Roofline Model
Cache tiling (blocking)
GPU architecture, CUDA programming, and understanding shared/global memory bottlenecks in GPUs (a tiled matrix-multiplication sketch illustrating these ideas appears after this background list)
FPGA architecture
Matrix multiplication using a systolic array
3D/2.5D stacked DRAM for high bandwidth
DRAM architecture
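As a concrete illustration of the cache tiling and shared/global-memory topics above, the following is a minimal sketch (not part of the course materials) of a tiled matrix-multiplication CUDA kernel. It assumes square N x N row-major matrices and an illustrative tile size of 16; the point is that each value loaded from global memory into shared memory is reused TILE times.

#include <cuda_runtime.h>

#define TILE 16  // illustrative tile (block) size

__global__ void tiledMatMul(const float *A, const float *B, float *C, int N) {
    // Per-block shared-memory tiles of A and B.
    __shared__ float As[TILE][TILE];
    __shared__ float Bs[TILE][TILE];

    int row = blockIdx.y * TILE + threadIdx.y;  // output row for this thread
    int col = blockIdx.x * TILE + threadIdx.x;  // output column for this thread
    float acc = 0.0f;

    // Walk over the reduction dimension one tile at a time.
    for (int t = 0; t < (N + TILE - 1) / TILE; ++t) {
        int aCol = t * TILE + threadIdx.x;
        int bRow = t * TILE + threadIdx.y;
        // Cooperative load into shared memory, guarding the matrix edges.
        As[threadIdx.y][threadIdx.x] = (row < N && aCol < N) ? A[row * N + aCol] : 0.0f;
        Bs[threadIdx.y][threadIdx.x] = (bRow < N && col < N) ? B[bRow * N + col] : 0.0f;
        __syncthreads();

        // Each shared-memory element is reused TILE times, cutting global-memory traffic.
        for (int k = 0; k < TILE; ++k)
            acc += As[threadIdx.y][k] * Bs[k][threadIdx.x];
        __syncthreads();
    }

    if (row < N && col < N)
        C[row * N + col] = acc;
}

The same blocking idea applies on CPUs, where the tile is sized to fit the L1/L2 cache rather than shared memory; a launch such as tiledMatMul<<<dim3((N+TILE-1)/TILE, (N+TILE-1)/TILE), dim3(TILE, TILE)>>>(dA, dB, dC, N) computes C = A*B.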
Deep learning:
Deep learning on FPGAs
Case study of Microsoft's Brainwave
Deep learning on embedded systems (especially NVIDIA's Jetson platform)
Deep learning on edge devices (smartphones). Review of "Machine Learning at Facebook: Understanding Inference at the Edge"
Study of Google's Tensor Processing Unit
Memristor-based accelerators for deep learning
Intel's Xeon Phi architecture and deep learning on Xeon Phi
Convolution strategies: direct, FFT-based, Winograd-based, and matrix-multiplication-based (a minimal im2col sketch appears at the end of this list). Review of "Performance Analysis of GPU-based Convolutional Neural Networks"
Addressing memory bottleneck during DNN training. Review of "vDNN: Virtualized Deep Neural Networks for Scalable, Memory-Efficient Neural Network Design"
Hardware-aware pruning of DNNs. Review of "Scalpel: Customizing DNN Pruning to the Underlying Hardware Parallelism."
Distributed training of DNNs. Review of "Optimizing Network Performance for Distributed DNN Training on GPU Clusters: ImageNet/AlexNet Training in 1.5 Minutes"
Review of “Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding”
Hardware/system-challenges in autonomous driving. Review of "The Architectural Implications of Autonomous Driving: Constraints and Acceleration".
Neural branch predictor. Review of "Using Branch Predictors to Predict Brain Activity in Brain-Machine Implants"
Data compression and its use for addressing the memory bottleneck in deep learning
Comparison of memory technologies (SRAM, DRAM, eDRAM, STT-RAM, PCM, Flash) and their suitability for designing memory elements in DNN accelerators
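Of the convolution strategies listed above, the matrix-multiplication approach is the easiest to sketch. The snippet below (an illustration, not course code) lowers a single-channel convolution with stride 1 and no padding into an im2col buffer followed by a plain GEMM-style loop; real frameworks hand the resulting matrix product to a tuned library or to systolic-array hardware such as the TPU.

#include <vector>

// Unroll every k x k input patch into one row of a (outH*outW) x (k*k) matrix,
// so that convolution becomes an ordinary matrix product.
std::vector<float> im2col(const std::vector<float> &in, int H, int W, int k) {
    int outH = H - k + 1, outW = W - k + 1;
    std::vector<float> cols(outH * outW * k * k);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            for (int dy = 0; dy < k; ++dy)
                for (int dx = 0; dx < k; ++dx)
                    cols[(y * outW + x) * k * k + dy * k + dx] =
                        in[(y + dy) * W + (x + dx)];
    return cols;
}

// Convolution as a GEMM: out[p] = sum_j cols[p][j] * filter[j].
std::vector<float> convAsGemm(const std::vector<float> &in, int H, int W,
                              const std::vector<float> &filter, int k) {
    int outH = H - k + 1, outW = W - k + 1;
    std::vector<float> cols = im2col(in, H, W, k);
    std::vector<float> out(outH * outW, 0.0f);
    for (int p = 0; p < outH * outW; ++p)
        for (int j = 0; j < k * k; ++j)
            out[p] += cols[p * k * k + j] * filter[j];
    return out;
}

Because the lowered problem is just matrix multiplication, the systolic-array and tiling material from the background portion of the course applies directly to convolution layers.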