Model Compression

When deploying machine learning models, the time and space cost of the prediction phase usually matters more than that of the training phase. For example, real-time applications often need to make predictions within milliseconds, and on-device apps have strict constraints on model size. Our goal is to develop algorithms that shrink model size and speed up prediction for machine learning models, including deep neural networks, latent factor models (e.g., matrix factorization), extreme classification, and kernel machines. Some of our previous work is listed below. We are currently focusing on compression algorithms for LSTM, transformer, and CNN models.

  • Compressing LSTM models by exploiting word frequency information (see the block-wise low-rank sketch after this list):

GroupReduce: Block-Wise Low-Rank Approximation for Neural Language Model Shrinking (NIPS '18)

  • Faster inner product search (can be applied to matrix factorization or the final layer of a neural network; see the budgeted MIPS sketch after this list):

A Greedy Approach for Budgeted Maximum Inner Product Search (NIPS '17)

  • Tree-based approaches:

Gradient Boosted Decision Trees for High Dimensional Sparse Output (ICML '17)

  • Kernel machines (see the Nystrom sketch after this list):

Computationally Efficient Nystrom Approximation using Fast Transforms (ICML '16)

Fast Prediction for Large-Scale Kernel Machines (NIPS '14)
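
The GroupReduce paper above builds on the observation that word frequency is highly skewed, so the embedding and softmax matrices can be split into blocks of words with similar frequency, and each block given its own low-rank factorization with frequent words keeping more capacity. The sketch below only illustrates that block-wise low-rank idea with plain SVD on random data; the blockwise_low_rank helper, the group sizes, and the ranks are made up for illustration and this is not the paper's full algorithm (which also refines the grouping and ranks).

```python
import numpy as np

def blockwise_low_rank(embedding, group_sizes, ranks):
    """Factor an embedding matrix block by block (illustrative only).

    Rows are assumed to be sorted by word frequency, so earlier
    (more frequent) blocks can be given a larger rank than rare ones.
    """
    factors, start = [], 0
    for size, r in zip(group_sizes, ranks):
        block = embedding[start:start + size]
        U, s, Vt = np.linalg.svd(block, full_matrices=False)
        # Keep only the top-r singular directions of this block.
        factors.append((U[:, :r] * s[:r], Vt[:r]))
        start += size
    return factors

def reconstruct(factors):
    return np.vstack([A @ B for A, B in factors])

# Toy example: 1,000-word vocabulary, 64-dimensional embeddings.
rng = np.random.default_rng(0)
E = rng.standard_normal((1000, 64))
factors = blockwise_low_rank(E, group_sizes=[100, 300, 600], ranks=[32, 16, 8])
print("relative error:", np.linalg.norm(E - reconstruct(factors)) / np.linalg.norm(E))
```

The compression comes from replacing each size-by-64 block with a pair of factors of total size (size + 64) * rank, with smaller ranks assigned to the blocks of rare words.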
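
The NIPS '17 paper above tackles maximum inner product search (MIPS): given a query vector (e.g., a user embedding or an LSTM hidden state) and a large matrix of candidate vectors, find the rows with the largest inner products without exactly scoring everything. The toy code below shows only the screen-then-rerank structure with a hypothetical budgeted_mips function; the actual greedy algorithm screens candidates using pre-sorted per-coordinate lists, so its screening step is far cheaper than the naive pass used here.

```python
import numpy as np

def budgeted_mips(items, query, budget, k=5):
    """Approximate top-k maximum inner product search under a budget.

    Screen candidates by their largest single query-weighted coordinate,
    then compute exact inner products only for the surviving candidates.
    (Illustrative only; the real greedy algorithm avoids this full
    n-by-d screening pass by using pre-sorted coordinate lists.)
    """
    screen = (items * query).max(axis=1)             # cheap proxy score per item
    candidates = np.argpartition(-screen, budget)[:budget]
    exact = items[candidates] @ query                # exact scores on survivors only
    return candidates[np.argsort(-exact)[:k]]

rng = np.random.default_rng(1)
H = rng.standard_normal((100_000, 64))   # e.g. item embeddings or a softmax layer
q = rng.standard_normal(64)              # e.g. user embedding or hidden state
print(budgeted_mips(H, q, budget=2000))
```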
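
For kernel machines, the Nystrom method approximates the full n-by-n kernel matrix from a small set of m landmark points, which reduces both training and prediction cost. Below is a minimal sketch of plain uniform-sampling Nystrom with an RBF kernel on random data; the ICML '16 paper goes further and imposes fast-transform structure to make the approximation cheaper to compute and store, which is not attempted here.

```python
import numpy as np

def rbf_kernel(X, Y, gamma=0.1):
    # Pairwise squared distances, then the Gaussian kernel.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-gamma * d2)

def nystrom(X, m, gamma=0.1, seed=0):
    """Rank-m Nystrom approximation of K(X, X) via uniform landmark sampling."""
    rng = np.random.default_rng(seed)
    idx = rng.choice(len(X), size=m, replace=False)
    C = rbf_kernel(X, X[idx], gamma)   # n x m columns touching the landmarks
    W_pinv = np.linalg.pinv(C[idx])    # pseudo-inverse of the m x m landmark block
    return C, W_pinv                   # K is approximated by C @ W_pinv @ C.T

rng = np.random.default_rng(2)
X = rng.standard_normal((500, 10))
C, W_pinv = nystrom(X, m=50)
K = rbf_kernel(X, X)
print("relative error:", np.linalg.norm(K - C @ W_pinv @ C.T) / np.linalg.norm(K))
```

At prediction time a new point only needs kernel evaluations against the m landmarks rather than all n training points, which is where the speedup comes from.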


Thanks to my students (Patrick Chen, Huan Zhang) and collaborators (Si Si, Yang Li, Ciprian Chelba, Hsiang-Fu Yu, Sathiya Keerthi, Dhruv Mahajan, Inderjit Dhillon) for all of this interesting work.