VLSI Architectures for Deep Learning

Introduction

  • Deep learning has achieved great success in various domains
    • Playing Go, personal assistants, and search engines
  • Deep neural networks (DNNs) are the cores of deep learning
    • Repeated hierarchy with tens to hundreds of layers
    • Modular hierarchy of convolutional (CONV), pooling (POOL), and fully connected (FC) layers
  • Large computational complexity and huge memory footprint (see the cost sketch below)
    • Roughly 10 GOPs per frame
    • Over 50M parameters to store
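
To make these complexity figures concrete, the short Python sketch below estimates the multiply-accumulate (MAC) count and parameter count of a single convolutional layer; the layer shape is an illustrative assumption, not taken from a specific network.

    # Back-of-the-envelope cost of one CONV layer; the shape below is
    # illustrative, roughly a VGG-style 3x3 layer on a 56x56 feature map.
    def conv_layer_cost(h_out, w_out, c_in, c_out, k):
        """Return (MAC count, parameter count) of a k x k CONV layer."""
        macs = h_out * w_out * c_out * c_in * k * k   # one MAC per output pixel per filter tap
        params = c_out * c_in * k * k + c_out         # weights plus biases
        return macs, params

    macs, params = conv_layer_cost(h_out=56, w_out=56, c_in=256, c_out=256, k=3)
    print(f"MACs: {macs / 1e9:.2f} G, parameters: {params / 1e6:.2f} M")

Summed over all layers of a modern network, such per-layer costs add up to the roughly 10 GOPs per frame and tens of millions of parameters quoted above.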

CMOS-based Deep Learning Accelerator

Network Compression

  • Blocked hash compression
  • Extra block constraint to preserve spatial locality during hash-based weight sharing (sketched below)
  • Compresses the network by 16x ~ 32x
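
A minimal sketch of the idea, assuming a HashedNets-style shared parameter pool: every virtual weight is looked up through a hash function, and an extra block constraint maps spatially adjacent weights to consecutive buckets so locality survives compression. The block size, pool size, and matrix shape are illustrative assumptions.

    import numpy as np

    BLOCK = 4        # hypothetical block size along the input dimension
    BUCKETS = 512    # shared parameter pool (the compressed storage)

    rng = np.random.default_rng(0)
    shared = rng.standard_normal(BUCKETS).astype(np.float32)  # trainable shared weights

    def hashed_weight(out_idx, in_idx):
        """Look up the virtual weight W[out_idx, in_idx] in the shared pool.

        All weights inside the same 1 x BLOCK block hash to consecutive
        buckets, so spatially adjacent weights stay adjacent in memory.
        """
        block_id = (out_idx, in_idx // BLOCK)          # which block this weight is in
        base = hash(block_id) % (BUCKETS - BLOCK + 1)  # bucket shared by the whole block
        return shared[base + in_idx % BLOCK]           # offset inside the block

    # A 128 x 64 virtual layer (8192 weights) backed by 512 shared values: 16x compression.
    W = np.array([[hashed_weight(o, i) for i in range(64)] for o in range(128)])
    print(W.shape, "virtual weights from", BUCKETS, "stored parameters")

In hardware, the hash would be a fixed low-cost index function rather than Python's built-in hash; the point is only that one stored value serves many virtual weights while the block constraint keeps neighboring weights together.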

Sparsity-based Hardware Accelerator

  • Dedicated sparsity predictor to bypass unnecessary operations (see the sketch below)
  • Network-on-Chip (NoC) based hardware architecture
  • Throughput improvement: 10% ~ 70%
  • Power reduction: 50%
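
The sketch below illustrates one plausible form of such a predictor, assuming a ReLU-activated fully connected layer: a low-precision shadow computation guesses which outputs the ReLU will zero out, and full-precision MACs are issued only for the rest. The bit widths and layer size are illustrative assumptions.

    import numpy as np

    rng = np.random.default_rng(0)
    W = rng.standard_normal((256, 256)).astype(np.float32)           # dense FC weights
    x = np.maximum(rng.standard_normal(256), 0).astype(np.float32)   # ReLU-ed input activations

    def quantize(a, bits=4):
        """Crude uniform quantizer used only by the low-cost predictor."""
        scale = np.abs(a).max() / (2 ** (bits - 1) - 1) or 1.0
        return np.round(a / scale) * scale

    # Predictor pass: a 4-bit approximation of W and x estimates each
    # pre-activation's sign; outputs predicted non-positive are bypassed,
    # since the following ReLU would zero them anyway.
    y_pred = quantize(W) @ quantize(x)
    compute_mask = y_pred > 0                      # rows worth computing exactly

    y = np.zeros(W.shape[0], dtype=np.float32)
    y[compute_mask] = W[compute_mask] @ x          # full-precision MACs only where needed
    y = np.maximum(y, 0)                           # ReLU

    print(f"bypassed {100 * (1 - compute_mask.mean()):.1f}% of output neurons")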

Memristor-based Deep Learning Accelerator

In-situ Analog Computation Based on RRAM

  • Matrix-vector multiplication is performed in situ on resistive random-access memory (RRAM) to address the memory-wall issue (behavioral sketch below)
  • Energy and timing overheads are dominated by the analog computing units and the analog-to-digital interface
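
A behavioral Python model of a single crossbar tile, under assumed device parameters (conductance range, read voltage, ADC resolution): weights are stored as conductances, the input vector is applied as wordline voltages, and each bitline sums currents per Kirchhoff's current law before being digitized.

    import numpy as np

    rng = np.random.default_rng(0)

    G_MIN, G_MAX = 1e-6, 1e-4                  # assumed conductance range (siemens)
    ADC_BITS = 6                               # assumed ADC resolution

    W = rng.uniform(-1, 1, size=(64, 64))      # signed weights of one layer tile
    G = G_MIN + (W + 1) / 2 * (G_MAX - G_MIN)  # map weights to conductances (offset
                                               # scheme; the offset is removed digitally)
    v = rng.uniform(0.0, 0.2, size=64)         # wordline read voltages encode the input

    i_bitline = G.T @ v                        # bitline currents: analog current summation

    # The ADC quantizes each bitline current before digital post-processing;
    # its resolution drives the interface energy mentioned above.
    levels = 2 ** ADC_BITS - 1
    codes = np.round((i_bitline - i_bitline.min()) / np.ptp(i_bitline) * levels)
    print("first bitline codes:", codes[:8].astype(int))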

High-throughput and Energy-efficient Accelerator Design

  • Dedicated encoding of synaptic weights and activations to improve the energy efficiency of the analog computation
  • Distribution analysis of crossbar bitline outputs enables reduced ADC bit resolution
  • Weights and activations are dynamically quantized according to the significance of fine-grained partial products (sketched below)
  • Throughput, energy efficiency, and area efficiency improvements: 2x ~ 4x
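
A sketch of the significance-aware quantization idea, assuming bit-serial activation feeding: the crossbar produces one partial sum per activation bit plane, and higher-significance planes are digitized with more ADC bits than lower ones. The bit allocation and operand widths are illustrative assumptions, not the original design's exact settings.

    import numpy as np

    rng = np.random.default_rng(1)
    W = rng.integers(0, 16, size=(32, 32))     # 4-bit unsigned weights on the crossbar
    x = rng.integers(0, 16, size=32)           # 4-bit unsigned activations

    # Illustrative ADC bit allocation: more resolution for the more
    # significant activation bit planes, less for the least significant.
    adc_bits_per_plane = {3: 8, 2: 7, 1: 6, 0: 5}

    def adc(values, bits):
        """Quantize a vector of analog bitline sums to the given resolution."""
        levels = 2 ** bits - 1
        vmax = values.max() or 1
        return np.round(values / vmax * levels) / levels * vmax

    y = np.zeros(W.shape[0])
    for plane, bits in adc_bits_per_plane.items():
        x_bit = (x >> plane) & 1                # one activation bit plane
        partial = W @ x_bit                     # crossbar MVM for this plane (modeled digitally)
        y += adc(partial, bits) * (1 << plane)  # weight by bit significance and accumulate

    print("max relative error vs. exact:", np.abs(y - W @ x).max() / (W @ x).max())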