Efficient AI Algorithms
Quantization for Efficient Large Language Model Inference
Token-Scaled Logit Distillation for Ternary Weight Generative Language Models
NeurIPS 2023 [paper]
We investigated the challenge of applying quantization-aware training (QAT) to decoder-based Transformer models such as LLaMA and proposed a novel knowledge distillation (KD) method to overcome it. With our new KD, we maintain LLM inference accuracy while compressing the model size by roughly 16×.
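A minimal sketch of the token-scaled distillation idea, assuming a PyTorch-style training loop; the entropy-based token weighting below is an illustrative choice, not necessarily the paper's exact scaling rule.

```python
import torch
import torch.nn.functional as F

def token_scaled_kd_loss(student_logits, teacher_logits, temperature=1.0):
    """Per-token logit distillation with token-wise scaling.

    Inputs have shape [batch, seq_len, vocab]. The weights derived from
    teacher entropy below are an illustrative scaling choice, not the
    paper's exact formulation.
    """
    t_prob = F.softmax(teacher_logits / temperature, dim=-1)
    s_logprob = F.log_softmax(student_logits / temperature, dim=-1)

    # Per-token KL divergence between teacher and student next-token distributions.
    kl_per_token = (t_prob * (t_prob.clamp_min(1e-9).log() - s_logprob)).sum(dim=-1)

    # Token-wise scale: emphasize tokens the teacher is confident about.
    entropy = -(t_prob * t_prob.clamp_min(1e-9).log()).sum(dim=-1)
    weight = 1.0 / (1.0 + entropy)
    weight = weight / weight.sum(dim=-1, keepdim=True)

    return (weight * kl_per_token).sum(dim=-1).mean()

# Example usage with random logits standing in for a ternary student and FP teacher.
student = torch.randn(2, 16, 1000, requires_grad=True)
teacher = torch.randn(2, 16, 1000)
token_scaled_kd_loss(student, teacher).backward()
```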
Enhancing Computation Efficiency in Large Language Models through Weight and Activation Quantization
EMNLP 2023 [paper]
In this work, we analyzed the shortcomings of state-of-the-art post-training quantization (PTQ) techniques for quantizing both weights and activations, and we proposed intuitive ideas as well as a new INT4 data format to improve quantized LLM performance. We demonstrated that the combined efforts help recover accuracy for 4-bit weight and 8-bit activation quantization (W4A8), saving about 50% of hardware cost compared to the INT8 MAC unit.
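A minimal sketch of W4A8 fake quantization, assuming symmetric uniform quantizers in PyTorch; it illustrates only the weight/activation bit split, not the proposed INT4 data format or the PTQ recipe.

```python
import torch

def fake_quant(x, num_bits, per_channel_dim=None):
    """Symmetric uniform fake quantization (quantize then dequantize)."""
    qmax = 2 ** (num_bits - 1) - 1
    if per_channel_dim is None:
        scale = x.abs().max() / qmax
    else:
        dims = [d for d in range(x.dim()) if d != per_channel_dim]
        scale = x.abs().amax(dim=dims, keepdim=True) / qmax
    scale = scale.clamp_min(1e-8)
    return torch.round(x / scale).clamp(-qmax - 1, qmax) * scale

# Example: 4-bit per-channel weights and 8-bit per-tensor activations for a linear layer.
w = torch.randn(128, 256)
a = torch.randn(16, 256)
y = fake_quant(a, 8) @ fake_quant(w, 4, per_channel_dim=0).t()
```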
Algorithm-Hardware Co-Design
Sparse Convolution Accelerator with Pruning Algorithm
SPADE: Sparse Pillar-based 3D Object Detection Accelerator for Autonomous Driving
HPCA 2024 [paper]
This paper proposes SPADE, an algorithm-hardware co-design strategy to maximize vector sparsity in pillar-based 3D object detection and accelerate vector-sparse convolution commensurate with the improved sparsity. SPADE consists of three components: (1) a dynamic vector pruning algorithm balancing accuracy and computation savings from vector sparsity, (2) a sparse coordinate management hardware transforming a 2D systolic array into a vector-sparse convolution accelerator, and (3) sparsity-aware dataflow optimization tailoring sparse convolution schedules for hardware efficiency. Taped out with a commercial technology, SPADE reduces the amount of computation by 36.3–89.2% for representative 3D object detection networks and benchmarks, leading to 1.3–10.9× speedup and 1.5–12.6× energy savings compared to the ideal dense accelerator design. These sparsity-proportional performance gains equate to 4.1–28.8× speedup and 90.2–372.3× energy savings compared to the counterpart server and edge platforms.
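A toy illustration of dynamic vector pruning on pillar features, assuming a PyTorch tensor of per-pillar feature vectors; the L2-norm criterion and fixed keep ratio are stand-ins for the paper's accuracy/computation balancing rule.

```python
import torch

def prune_pillar_vectors(features, keep_ratio=0.5):
    """Zero out whole pillar feature vectors with small L2 norm.

    features: [num_pillars, channels]. Pillars whose norm falls below the
    dynamically chosen threshold are zeroed, so a downstream vector-sparse
    convolution can skip them entirely.
    """
    norms = features.norm(dim=1)
    k = max(1, int(keep_ratio * features.shape[0]))
    threshold = norms.kthvalue(features.shape[0] - k + 1).values  # k-th largest norm
    mask = norms >= threshold
    return features * mask.unsqueeze(1), mask

# Example usage on random pillar features.
pruned, kept = prune_pillar_vectors(torch.randn(1024, 64), keep_ratio=0.3)
```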
Look-Up Table Approximation for Efficient Transformer Model Inference
NN-LUT: Neural Approximation of Non-Linear Operations for Efficient Transformer Inference
DAC 2022 [paper]
This paper proposes an accurate and hardware-friendly approximation framework for efficient Transformer inference. Our framework employs a simple neural network as a universal approximator with its structure equivalently transformed into a LUT. The proposed framework called NN-LUT can accurately replace all the non-linear operations in popular BERT models with significant reductions in area, power consumption, and latency.
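A minimal sketch of replacing a non-linear operation with a piecewise-linear look-up table, built here with NumPy over a fixed input range; the breakpoint placement and entry count are illustrative, not NN-LUT's learned parameters.

```python
import numpy as np

def build_lut(fn, lo, hi, num_entries=16):
    """Build a piecewise-linear LUT (breakpoints, slopes, intercepts) for fn."""
    xs = np.linspace(lo, hi, num_entries + 1)
    ys = fn(xs)
    slopes = (ys[1:] - ys[:-1]) / (xs[1:] - xs[:-1])
    intercepts = ys[:-1] - slopes * xs[:-1]
    return xs, slopes, intercepts

def lut_eval(x, xs, slopes, intercepts):
    """Evaluate the piecewise-linear LUT; inputs outside the range are clamped."""
    x = np.clip(x, xs[0], xs[-1])
    idx = np.clip(np.searchsorted(xs, x, side="right") - 1, 0, len(slopes) - 1)
    return slopes[idx] * x + intercepts[idx]

# Example: approximate GELU on [-4, 4] with a 16-segment LUT.
def gelu(x):
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x ** 3)))

xs, slopes, intercepts = build_lut(gelu, -4.0, 4.0)
approx = lut_eval(np.linspace(-6, 6, 10), xs, slopes, intercepts)
```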
Range-Invariant Approximation of Non-Linear Operations for Efficient BERT Fine-Tuning
DAC 2023 [paper]
This paper proposes a range-invariant approximation of non-linear operations for training computations of Transformer-based large language models. The proposed method decomposes the approximation into the scaling and the range-invariant resolution for LUT approximation during task-dependent BERT fine-tuning. We demonstrate that the proposed method robustly approximates all the non-linear operations of BERT without score degradation on GLUE benchmarks using only a single-entry LUT, facilitating 52% area savings in hardware implementation.
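A simplified illustration of the range-invariance idea for one operation, softmax: subtracting the row maximum pins the exp input to a fixed interval, so a single small LUT suffices no matter how activation ranges drift during fine-tuning. This NumPy sketch is not the paper's exact scaling/resolution decomposition.

```python
import numpy as np

def softmax_with_exp_lut(x, num_entries=64, clip=-10.0):
    """Softmax whose exp() is replaced by a fixed-range LUT.

    Max subtraction keeps the exp input in [clip, 0], so one LUT built once
    covers all inputs regardless of the original activation range.
    """
    grid = np.linspace(clip, 0.0, num_entries)
    lut = np.exp(grid)

    z = x - x.max(axis=-1, keepdims=True)          # inputs now lie in (-inf, 0]
    z = np.clip(z, clip, 0.0)
    idx = np.round((z - clip) / (0.0 - clip) * (num_entries - 1)).astype(int)
    e = lut[idx]                                   # nearest-entry lookup for exp(z)
    return e / e.sum(axis=-1, keepdims=True)

# Example usage on random attention scores.
probs = softmax_with_exp_lut(np.random.randn(2, 8) * 5.0)
```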
AI Software Optimization
Software Optimization for Efficient DNN Inference
Understanding and Optimizing INT4 Convolution for Accelerated DNN Inference on Tensor Cores
SiPS 2022 [paper]
This work proposes three techniques to enhance INT4 WMMA utilization on Tensor Cores: duplicate-aware load for increasing the reuse of convolution input, register-level packing for alleviating overhead of handling INT4 data, and data layout optimization for coalesced data transfer. The proposed INT4 WMMA optimization techniques are evaluated on convolution operations of popular neural networks to demonstrate substantial speedup on Tensor Core compared to the state of the art.
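A NumPy sketch of the storage side of register-level INT4 packing, i.e., two signed 4-bit values per byte; the actual optimization operates on GPU registers feeding WMMA fragments, together with the load and layout techniques described above.

```python
import numpy as np

def pack_int4_pairs(vals):
    """Pack pairs of signed 4-bit values (range [-8, 7]) into single bytes."""
    assert vals.size % 2 == 0
    nibbles = (np.clip(vals, -8, 7).astype(np.int8) & 0x0F).astype(np.uint8)
    lo, hi = nibbles[0::2], nibbles[1::2]
    return lo | (hi << 4)

def unpack_int4_pairs(packed):
    """Recover the signed 4-bit values from packed bytes (sign-extended)."""
    lo = (packed & 0x0F).astype(np.int16)
    hi = ((packed >> 4) & 0x0F).astype(np.int16)
    lo = np.where(lo > 7, lo - 16, lo)
    hi = np.where(hi > 7, hi - 16, hi)
    out = np.empty(packed.size * 2, dtype=np.int8)
    out[0::2], out[1::2] = lo, hi
    return out

vals = np.array([-3, 7, -8, 1], dtype=np.int8)
assert np.array_equal(unpack_int4_pairs(pack_int4_pairs(vals)), vals)
```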
Architecture-Aware Optimization of Layer Fusion for Latency-Optimal CNN Inference
AICAS 2023 [paper]
This study presents an analytical latency model for a 2D systolic array accelerator, taking into account various hardware factors such as array dimensions, buffer size, and bandwidth. We then investigate the influence of hardware architecture and fusion strategies, including weight and overlap reuse, on performance; these aspects are insufficiently addressed in existing access-based fusion models. We achieve up to a 53.1% reduction in end-to-end network latency compared to an access-based model.
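A toy analytical latency model in the same spirit, assuming a weight-stationary mapping on a rows×cols systolic array; the parameter names and the compute/traffic split are illustrative, and the paper's model additionally accounts for buffer capacity and fusion-induced reuse.

```python
from math import ceil

def conv_layer_latency(out_h, out_w, in_c, out_c, k, array_rows, array_cols,
                       bandwidth_bytes_per_cycle, bytes_per_elem=1):
    """Toy latency model for one conv layer on a 2D systolic array.

    Output channels map to array columns and the (in_c * k * k) reduction
    maps to array rows; latency is the max of compute and off-chip traffic.
    """
    macs_per_output = in_c * k * k
    compute_cycles = (ceil(out_c / array_cols)
                      * ceil(macs_per_output / array_rows)
                      * out_h * out_w)
    traffic_bytes = (out_h * out_w * macs_per_output   # input reads, upper bound without reuse
                     + out_c * macs_per_output          # weight reads
                     + out_h * out_w * out_c) * bytes_per_elem
    memory_cycles = ceil(traffic_bytes / bandwidth_bytes_per_cycle)
    return max(compute_cycles, memory_cycles)

# Example: a 3x3, 64->128 channel layer on a 32x32 array with 16 B/cycle off-chip bandwidth.
print(conv_layer_latency(56, 56, 64, 128, 3, 32, 32, 16))
```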