VLSI Architectures for Deep Learning
Introduction
- Deep learning achieves great success in various domains
  - Playing Go, personal assistants, search engines
- Deep neural networks (DNNs) are the core of deep learning
  - Repeated hierarchy with tens to hundreds of layers
  - Modular hierarchy of CONV, POOL, and FC layers (see the sketch below)
- Large computational complexity and huge memory footprint
  - About 10 GOPs per frame
  - Over 50M parameters to store
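A minimal NumPy sketch of one CONV -> POOL -> FC stage, to make the modular hierarchy concrete; all layer sizes here are toy values chosen for illustration, not those of any particular network:

```python
import numpy as np

def conv2d(x, w):
    """Naive valid convolution: x is (C_in, H, W), w is (C_out, C_in, K, K)."""
    c_out, _, k, _ = w.shape
    h, wd = x.shape[1] - k + 1, x.shape[2] - k + 1
    y = np.zeros((c_out, h, wd))
    for o in range(c_out):
        for i in range(h):
            for j in range(wd):
                y[o, i, j] = np.sum(x[:, i:i + k, j:j + k] * w[o])
    return np.maximum(y, 0)  # ReLU

def maxpool2x2(x):
    """2x2 max pooling with stride 2."""
    c, h, w = x.shape
    return x[:, :h // 2 * 2, :w // 2 * 2].reshape(c, h // 2, 2, w // 2, 2).max(axis=(2, 4))

# One CONV -> POOL -> FC stage on a toy 8x8 single-channel input.
x = np.random.randn(1, 8, 8)
w_conv = np.random.randn(4, 1, 3, 3)      # CONV: 4 filters of 3x3
feat = maxpool2x2(conv2d(x, w_conv))      # POOL output: (4, 3, 3)
w_fc = np.random.randn(10, feat.size)     # FC: 10 output classes
logits = w_fc @ feat.ravel()
print(logits.shape)                       # (10,)
```

Stacking tens to hundreds of such stages is what drives the op and parameter counts quoted above.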
CMOS-based Deep Learning Accelerator
Network Compression
- Blocked hash compression (see the sketch below)
  - Extra block constraint preserves spatial locality under hash compression
  - Compresses the network by 16x to 32x
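The slides do not spell out the exact scheme, so the sketch below assumes the general hash-trick weight-sharing idea (as in HashedNets) with a block constraint added on top: positions are hashed per tile rather than per element, so each tile of the virtual weight matrix maps to a contiguous run of shared entries and keeps its spatial locality. The tile size, hash function, and compression ratio are illustrative assumptions:

```python
import numpy as np

def blocked_hash_weights(rows, cols, shared, block=4, seed=0):
    """Reconstruct a (rows, cols) weight matrix from a small shared vector.

    Each (block x block) tile hashes to one base offset into `shared`, and the
    tile's entries are laid out contiguously from there, so a tile stays
    spatially coherent instead of scattering across the shared vector.
    """
    n = len(shared)
    w = np.empty((rows, cols))
    for bi in range(0, rows, block):
        for bj in range(0, cols, block):
            base = hash((bi // block, bj // block, seed)) % n  # one hash per tile
            for i in range(bi, min(bi + block, rows)):
                for j in range(bj, min(bj + block, cols)):
                    w[i, j] = shared[(base + (i - bi) * block + (j - bj)) % n]
    return w

# A 256x256 virtual weight matrix backed by 1/16 as many shared values.
shared = np.random.randn(256 * 256 // 16)
w = blocked_hash_weights(256, 256, shared, block=4)
print(w.shape, shared.size)  # (256, 256) 4096  -> 16x compression
```

Because a whole tile resolves to one contiguous run, a hardware fetch of a tile touches one region of the shared memory instead of 16 scattered locations.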
Sparsity-based Hardware Accelerator
- Dedicated sparsity predictor to bypass unnecessary operations (see the sketch below)
- Network-on-Chip (NoC) based hardware architecture
- Throughput improvement: 10% to 70%
- Power reduction: 50%
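A behavioral sketch of the predict-and-skip idea, assuming the common approach of a cheap low-precision pre-computation: outputs the predictor expects ReLU to zero out never get the full-precision multiply-accumulate. The 4-bit predictor and the layer sizes are illustrative assumptions:

```python
import numpy as np

def quantize(x, bits=4):
    """Coarse uniform quantization, used only by the cheap predictor pass."""
    scale = np.max(np.abs(x)) / (2 ** (bits - 1) - 1) + 1e-12
    return np.round(x / scale) * scale

def predicted_relu_matvec(w, x, bits=4):
    # Cheap pass: low-precision estimate of every output. In hardware this
    # would read only the high-order bits, costing a small fraction of a MAC.
    estimate = quantize(w, bits) @ quantize(x, bits)
    active = estimate > 0                 # predicted to survive ReLU
    y = np.zeros(w.shape[0])
    y[active] = w[active] @ x             # full-precision MACs only where needed
    return np.maximum(y, 0), active.mean()

w = np.random.randn(512, 512)
x = np.random.randn(512)
y, frac = predicted_relu_matvec(w, x)
print(f"full-precision rows computed: {frac:.0%}")  # roughly half are skipped
```

Mispredictions only drop activations that were already near zero, which is why the accuracy cost of such predictors is typically small relative to the operations saved.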
Memristor-based Deep Learning Accelerator
In-situ Analog Computation Based on RRAM
- Matrix-vector multiplication is done in situ on resistive random-access memory (RRAM) to address the memory-wall issue (see the sketch below)
- Energy and timing overheads lie mainly in the analog computing unit and the analog-to-digital interface
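A behavioral sketch of crossbar matrix-vector multiplication: weights map to cell conductances, inputs drive the wordlines as voltages, and each bitline current sums its column's products by Ohm's and Kirchhoff's laws, so a whole matrix-vector product completes in one analog step. The conductance range and ADC resolution below are illustrative assumptions:

```python
import numpy as np

G_MIN, G_MAX = 1e-6, 1e-4     # assumed cell conductance range, in siemens

def map_to_conductance(w):
    """Affine-map weights in [-1, 1] onto the conductance range."""
    return G_MIN + (np.clip(w, -1, 1) + 1) / 2 * (G_MAX - G_MIN)

def crossbar_mvm(w, v, adc_bits=8):
    g = map_to_conductance(w)
    i_bl = g.T @ v                       # each bitline current is one dot product
    # ADC stage: quantize every bitline current to adc_bits of resolution.
    i_max = G_MAX * np.sum(np.abs(v))    # full-scale current of the array
    levels = 2 ** adc_bits - 1
    return np.round(i_bl / i_max * levels) / levels * i_max

w = np.random.uniform(-1, 1, (128, 64))  # 128 wordlines x 64 bitlines
v = np.random.uniform(0, 1, 128)         # inputs applied as wordline voltages
i = crossbar_mvm(w, v)
print(i.shape)  # (64,) -- 64 dot products in a single analog step
```

A real design would cancel the offset introduced by the affine weight mapping (e.g., with a reference column or differential cell pairs); the ADC stage here is exactly where the overheads noted above arise and where the bit-resolution reduction of the next subsection pays off.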
High-throughput and Energy-efficient Accelerator Design
- Dedicated encoding of synaptic weights and activations improves the energy efficiency of the analog computation
- Distribution analysis of crossbar bitline outputs reduces the required ADC bit-resolution
- Weights and activations are dynamically quantized according to the significance of fine-grained partial products (see the sketch below)
- Throughput, energy efficiency, and area efficiency improvements: 2x to 4x
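A sketch of the significance-aware quantization idea, assuming a bit-serial crossbar in which weights and activations are split into bit planes: every bit-plane pair produces a bitline partial sum, and pairs of low combined significance, which contribute little to the final result, are digitized with fewer ADC bits. The bit widths and the significance threshold are illustrative assumptions:

```python
import numpy as np

def adc(x, bits, full_scale):
    """Quantize an analog value to `bits` of resolution."""
    levels = 2 ** bits - 1
    return np.round(np.clip(x / full_scale, 0, 1) * levels) / levels * full_scale

def bitserial_mvm(w_bits, x_bits, rows=128):
    """w_bits: (planes, rows, cols) weight bit planes, LSB first.
    x_bits: (planes, rows) activation bit planes, LSB first."""
    y = np.zeros(w_bits.shape[2])
    for wb in range(w_bits.shape[0]):
        for xb in range(x_bits.shape[0]):
            partial = w_bits[wb].T @ x_bits[xb]       # 0/1 partial sums per bitline
            sig = wb + xb                             # significance in the final sum
            bits = 8 if sig >= 5 else 4               # high-significance partials
            y += adc(partial, bits, rows) * 2 ** sig  # get a finer ADC
    return y

w_bits = np.random.randint(0, 2, (4, 128, 64))    # 4-bit weights, 128x64 array
x_bits = np.random.randint(0, 2, (4, 128))        # 4-bit activations
y = bitserial_mvm(w_bits, x_bits)
print(y.shape)  # (64,)
```

Since ADC energy grows steeply with resolution, spending full precision only on the few high-significance partial products is where the claimed efficiency gains would come from.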