DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks

Yonggan Fu, Haichuan Yang, Jiayi Yuan, Meng Li, Cheng Wan, Raghuraman Krishnamoorthi, Vikas Chandra, Yingyan Lin

Rice University, Meta Reality Labs

PAPER | CODE | SLIDES | VIDEO

DepthShrinker: Abstract

Efficient deep neural network (DNN) models equipped with compact operators (e.g., depthwise convolutions) have shown great potential in reducing DNNs' theoretical complexity (e.g., the total number of weights/operations) while maintaining a decent model accuracy. However, existing efficient DNNs are still limited in fulfilling their promise in boosting real-hardware efficiency, due to their commonly adopted compact operators' low hardware utilization. In this work, we open up a new compression paradigm for developing real-hardware efficient DNNs, leading to boosted hardware efficiency while maintaining model accuracy. Interestingly, we observe that while some DNN layers' activation functions help DNNs' training optimization and achievable accuracy, they can be properly removed after training without compromising the model accuracy. Inspired by this observation, we propose a framework dubbed DepthShrinker, which develops hardware-friendly compact networks via shrinking the basic building blocks of existing efficient DNNs that feature irregular computation patterns into dense ones with much improved hardware utilization and thus real-hardware efficiency. Excitingly, our DepthShrinker framework delivers hardware-friendly compact networks that outperform both state-of-the-art efficient DNNs and compression techniques, e.g., a 3.06% higher accuracy and 1.53× throughput on a Tesla V100 GPU over the SOTA channel-wise pruning method MetaPruning.

DepthShrinker: Motivation

There is a dilemma between the trends in modern computing platform advances and efficient deep neural network (DNN) design:

  • Modern computing platforms: evolving towards a higher degree of parallel computing.

  • Efficient DNNs: adopting lightweight operators that suffer from low hardware utilization.

DepthShrinker: Key Idea

Goal: Build compact DNNs that achieve high utilization on modern hardware featuring increased parallelism.

Key idea: Merge consecutive operations, between which the non-linear activation functions have been properly removed, into a single dense operation.
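To make the merging concrete, below is a minimal PyTorch sketch (not the authors' code; shapes and names are purely illustrative) showing that a 1×1 convolution followed by a k×k convolution with no nonlinearity in between computes the same function as one dense k×k convolution with a composed weight, which is exactly what enables the shrinking step.

```python
import torch
import torch.nn as nn

# Illustrative shapes only; DepthShrinker merges, e.g., the 1x1 expansion conv and
# the following kxk conv of a block once the activation between them is removed.
c_in, c_mid, c_out, k = 8, 32, 16, 3

conv1 = nn.Conv2d(c_in, c_mid, kernel_size=1, bias=False)                   # 1x1 expansion
conv2 = nn.Conv2d(c_mid, c_out, kernel_size=k, padding=k // 2, bias=False)  # kxk conv

# With no nonlinearity in between, conv2(conv1(x)) is itself linear in x, so it
# equals a single dense kxk convolution with the composed weight
#   W_merged[o, i, :, :] = sum_m W2[o, m, :, :] * W1[m, i]
w1 = conv1.weight.squeeze(-1).squeeze(-1)                   # (c_mid, c_in)
w_merged = torch.einsum('omhw,mi->oihw', conv2.weight, w1)  # (c_out, c_in, k, k)

merged = nn.Conv2d(c_in, c_out, kernel_size=k, padding=k // 2, bias=False)
with torch.no_grad():
    merged.weight.copy_(w_merged)

x = torch.randn(1, c_in, 14, 14)
print(torch.allclose(conv2(conv1(x)), merged(x), atol=1e-4))  # True
```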

DepthShrinker: Framework

To push forward the accuracy-efficiency frontier after shrinking, two research questions need to be answered: (1) which activation functions to remove, and (2) how to restore the accuracy. To this end, we implement DepthShrinker as a three-stage framework:

  • Stage 1: Identify redundant activation functions by learning binary masks on top of the activation functions under an L0-sparsity constraint (a mask-learning sketch follows this list).

  • Stage 2: Fine-tune the network after the removal, optionally equipped with a self-distillation mechanism (a distillation-loss sketch also follows this list).

  • Stage 3: Merge consecutive linear operators into a single dense convolution (see the merging sketch above) whose size is determined only by the input/output channels of the first/last convolution in a block; this makes inverted residual blocks particularly favorable, since their wide intermediate layers are shrunk away.
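For Stage 1, one possible realization (a minimal sketch under our own assumptions, not the paper's implementation; names such as GatedActivation and sparsity_penalty are hypothetical) is to attach a learnable gate to each activation function, binarize it with a straight-through estimator, and penalize the number of kept activations as a relaxed stand-in for the L0 constraint:

```python
import torch
import torch.nn as nn

class GatedActivation(nn.Module):
    """Hypothetical Stage-1 module: a learnable gate decides whether an
    activation function is kept (gate = 1) or removed (gate = 0)."""

    def __init__(self, act=None):
        super().__init__()
        self.act = act if act is not None else nn.ReLU6()
        self.logit = nn.Parameter(torch.zeros(1))  # one gate parameter per activation

    def forward(self, x):
        soft = torch.sigmoid(self.logit)
        # Straight-through estimator: binary in the forward pass,
        # differentiable (sigmoid) in the backward pass.
        gate = (soft > 0.5).float() + soft - soft.detach()
        return gate * self.act(x) + (1.0 - gate) * x

def sparsity_penalty(model, target_kept):
    """Encourage only `target_kept` activations to survive -- a relaxed
    stand-in for the L0-style sparsity constraint described above."""
    kept = torch.stack([torch.sigmoid(m.logit) for m in model.modules()
                        if isinstance(m, GatedActivation)]).sum()
    return (kept - target_kept).abs()

# Training sketch: loss = task_loss + lambda_sparsity * sparsity_penalty(model, target_kept)
```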
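For Stage 2, one common way to instantiate the optional distillation (again only an illustrative sketch; the paper's exact self-distillation formulation may differ) is to fine-tune the shrunk network as a student against the original, unshrunk network serving as its own teacher:

```python
import torch.nn.functional as F

def self_distillation_loss(student_logits, teacher_logits, labels,
                           alpha=0.5, temperature=4.0):
    """Fine-tune the shrunk network (student) against the original, unshrunk
    network (teacher). alpha and temperature are illustrative hyper-parameters."""
    ce = F.cross_entropy(student_logits, labels)
    kd = F.kl_div(
        F.log_softmax(student_logits / temperature, dim=1),
        F.softmax(teacher_logits.detach() / temperature, dim=1),
        reduction='batchmean',
    ) * temperature ** 2
    return (1.0 - alpha) * ce + alpha * kd
```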

DepthShrinker: Experimental Results

Experiment setup

  • Models & Datasets: MobileNetV2 / EfficientNet-Lite families on ImageNet.

  • Considered devices: Desktop (Tesla V100 / RTX 2080Ti GPU) + Edge (Jetson TX2 Edge GPU / Google Pixel 3 / Raspberry Pi 4).

Benchmark with channel-wise pruning

  • DepthShrinker achieves a consistently better accuracy-FPS trade-off and scales better to extremely efficient cases, e.g., a 3.06% higher accuracy and 1.53× throughput on a Tesla V100 GPU over the SOTA channel-wise pruning method MetaPruning.

Benchmark with layer-wise pruning

  • DepthShrinker achieves much better accuracy under large compression ratios, indicating that shrinking acts as a soft version of layer pruning and outperforms hard layer pruning.

Visualization of block-wise latency breakdown

  • DepthShrinker can reduce the block-wise latency by up to 96.1%.

  • DepthShrinker can successfully identify bottleneck layers in terms of latency.

Acknowledgement

The work performed by Yonggan Fu, Jiayi Yuan, Cheng Wan, and Yingyan Lin is supported by the National Science Foundation (NSF) through the SCH, NeTS, and MLWiNS programs.

Citation

@article{fu2022depthshrinker,
  title={DepthShrinker: A New Compression Paradigm Towards Boosting Real-Hardware Efficiency of Compact Neural Networks},
  author={Fu, Yonggan and Yang, Haichuan and Yuan, Jiayi and Li, Meng and Wan, Cheng and Krishnamoorthi, Raghuraman and Chandra, Vikas and Lin, Yingyan},
  journal={arXiv preprint arXiv:2206.00843},
  year={2022}
}