A Programming System for Model Compression
Explaining the Novelty of A Programmable approach and
Bayesian Optimization of Black-Box User Objectives
For Automated DNN Model Compression
Surpassing the State-of-the-Art in DNN Model Compression
Using Condensa we demonstrate that one can Surpass the State-of-the-Art in DNN model compression and we have compared our approach with two prior works:
- Reinforcement Learning based Automatic Model Compression, AMC (He et. al. 2018) and
- Alternating Direction Method of Multipliers based on Simulated Annealing, AutoSlim (Liu et. al. 2019) in Table 2 on CIFAR-10 with ResNet56 and ImageNet datasets with VGG16-BN.
- On ImageNet, we obtain compression ratio of 25.59x as compared to 6.4x (AutoSlim Liu et.al. 2019) surpassing the state-of-the art and
- A throughput improvement of 1.16x using Filter Pruning Scheme as compared to AMC which uses theoretical FLOPs (1.25x).
Fully Automated DNN Model Compression Capabilities for Practical Usage
Illustration of the level of automation provided by CONDENSA
- Problem: Improve the inference throughput of VGG-19 (Simonyan & Zisserman, 2014) on the CIFAR-10 image classification task (Krizhevsky et al., 2014).
- VGG-19 is a convolutional neural network, one way to improve its inference performance on modern hardware such as GPUs is by pruning away individual convolutional filters (He et al., 2018a).
- Condensa's solution corresponds to a sparsity ratio of 0.72 and is depicted as the vertical dashed line. This result is significant for three reasons:
- Using the CONDENSA library, the filter pruning strategy employed for this experiment was expressed in less than 10 lines of Python code,
- The optimal sparsity ratio of 0.72 (shown as the vertical dashed line in the Figure) that achieves a state-of-the-art throughput of 2130 images/sec (2.22× improvement) and a top-1 accuracy improvement of 0.5% was obtained automatically by CONDENSA using a sample-efficient constrained Bayesian optimization algorithm.
- For this to work, the user didn’t have to specify any sparsity ratios manually, and instead only had to define a domain-specific objective function to maximize (inference throughput, in this case).
Demonstrated "Actual" TensorRT Speedups
- Layer-Wise Runtime Performance
- Table 3 on the left shows layer-by-layer compression ratios and mean runtimes collected over 100 runs for filter pruning.
- The columns labeled R and C represent results for the reference, and filter-pruned models, respectively.
- We only show data for convolutional layers as they dominate computation time for this network.
- We observe large inference runtime speedups in later layers of the network. This result helps us gain more insight into how the L-C algorithm distributes global sparsity ratios to each layer, resulting in actual hard- ware speedups.
Explaining the Novelty of using a Programmable Approach
"Natual Expression of Compression Intent based on DNN Architecture & Inference Hardware"
Selectively compress DNN layers based on
- The graph properties of the network for example BatchNormalization or Projection Layer etc.
- Targeted Inference Hardware using
Optimization on "any" User defined Black Box function indicating a performance metric of the DNN to improve, some examples are:
- Actual Inference throughput
- Memory Footprint
- Floating point operations per second
Easily and Correctly compose Condensa schemes
- Interesting Condensa schemes can be easily constructed using the
- For example easily obtain an additional 2x compression by adding
Quantize(float16)if platform supports IEEE half precision floating point computations
Ability to "Programmatically Induce" Sparsity Patterns into DNN for "Inference Accelaration"
- In line 6 below, the user of Condensa writes a compression program/scheme, we have a pre-written "scheme" library of popular compression techniques.
- The novelty we want to highlight in Condensa is the "Ability to express sparsity patterns" like the ones shown below, imperatively in
PyThoneasily, in the compression program of line-6.
- Having this ability is very useful for Inference Acceleration, Here is the a use case for this approach:
- Numerous sparse formats have already been investigated to improve the overall performance of operations such as Sparse Matrix-Vector multiply including focusing on vectorizability of the final DNN.
- One could automatically build sets of regular subcomputations by mining for regular sub-regions in the irregular data structure and generate GEMM codes that is specialized to the sparsity structure of the input matrix (or tensor), but which does not need anymore any indirection array, thereby improving SIMD vectorizability.
- This mining is done at compile-time, and requires the sparsity structure (e.g., the nonzero coordinates) to be both known at compile-time and invariant during the computation, as the GEMM code generated is valid only for one specific sparsity structure.
- As far as we are aware, no existing framework or approach provides this feature, of Programmatically Inducing Particularly Acceleratable Sparsity Patterns into the DNN Compression workflow.
- Line 6 : Condensa user can develop his/her own scheme, or choose from Condensa library provides well tested and useful schemes.
- Line 8 : Express black-box function that represents performance metric of the DNN to optimize for, in this listing we show throughput maximization.
Explaining the Novelty of using Implicit Level Sets for Black-Box Optimization
There are two major choices that must be made when performing Bayesian optimization.
- Phase 1 : Develop a prior over functions that will express assumptions about the function being optimized. For this we choose the Gaussian process prior, due to its flexibility and tractability.
- Phase 2: Develop an acquisition function, which is used to construct a utility function from the model posterior, allowing us to determine the next point to evaluate.
Phase 1: Systematically Placing Priors over Black Box Functions
- Any Bayesian method depends on a prior distribution, by definition.
- A Bayesian optimization method will converge to the optimum if
- The acquisition function is continuous and approximately minimizes the risk (defined as the expected deviation from the global minimum at a fixed point x); and
- Conditional variance converges to zero (or appropriate positive minimum value in the presence of noise) if and only if the distance to the nearest observation is zero.
- Many models could be used for this prior, By default we use Gaussian Process (GP) prior in Condensa
This plot illustrates the prior we use in Condensa when searching for the global sparsity value, By default we setup a simple 1D Gaussian process. In this plot above, you will notice 3 past observations. The solid black line is the GP surrogate mean prediction of the objective function given the data, and the shaded area shows the mean plus and mean minus the variance. The superimposed Gaussians corresponds to the GP mean and standard deviation of the prediction at the 3 previously sampled points.
Here there are 2 plots, the plot on the left illustrates Condensa sparsity profile for ResNet 110 (CIFAR-10) Filter Pruning and the Goal for Bayesian optimization is search for a global sparsity value that satisfies a user provided accuracy constraint, for example line 17 in the listing above (- 2.00%).
The plot on the right, illustrates how we modeled our prior belief of this expensive black-box function (L-C , the Green Curve) to be a Gaussian Process, with kernel functions that have default values, and can be easily modified by the Condensa user, depending on the dataset-network under compression.
Gaussian Processes (GP) are a generic supervised learning method designed to solve regression and probabilistic classification problems.
The advantages of Gaussian processes are:
- The prediction interpolates the observations for regular kernels.
- The prediction is probabilistic (Gaussian) so that we can compute empirical confidence intervals and decide based on those if one should refit the prediction in some region of interest.
- Versatile: different kernels can be specified. Common kernels are provided, in Condensa but it is also possible to specify custom kernels.
The choice of co variance functions for the Gaussian Process is crucial, as it determines the smoothness properties of samples drawn from it.
- The squared exponential kernel is naive, in that divergences of all features of sparsity affect the co-variance equally.
- In Condensa, we use the Mat´ern [Mat´ern, 1960, Stein, 1999] kernel by default, which incorporates a smoothness parameter, ς to permit greater flexibility in modeling functions.
- The Mat´ern kernel reduces to the squared exponential kernel as ς tends it infinity and when ς = 0.5, it reduces to the unsquared exponential kernel.
- While the squared exponential and Mat´ern are the most common kernels for GPs. "Appropriate covariance functions can also be used in Condensa to extend the model in other interesting ways"
Phase 2: Analytically Developing Acquisition Components to Guide the Search for Optimum
- Now that we have discussed placing priors over smooth functions and how to update these priors in light of new observations
- The role of the acquisition function is to guide the search for the optimum.
- Typically, acquisition functions are defined such that high acquisition corresponds to potentially high values of the objective function, whether because
- the prediction is high,
- the uncertainty is great,
- or both.
- Maximizing the acquisition function is used to select the next point at which to evaluate the function.
- One intuitive strategy is to maximize the probability of improving over the best current value (incumbent).
- This function is also sometimes called MPI, for “maximum probability of improvement” or “the P-algorithm”,since the utility is the probability of improvement.
- A somewhat more satisfying alternative acquisition function would be one that takes into account not only the probability of improvement, but the magnitude of the improvement a point can potentially yield. In particular, one could minimize the expected deviation from the true maximum, when choosing a new trial point.
- A more recent development is the idea of exploiting lower confidence bounds (upper, when considering maximization) to construct acquisition functions that minimize regret over the course of their optimization.
- Like other parameterized acquisition models we have seen, the parameter κ is could be decided by user.
- PI, EI and GP-UCB give rise to distinct sampling behaviour over time, as we illustrate in the figure below.
- With several different parameterized acquisition functions in the literature, it is often unclear which one to use. For the problem of Automated DNN model compression which is the goal of Condensa, we propose ILS-UCB, that is described next.
Novelty of Using "Level Sets" to represent User Constraints
A level set of a real-valued function f of n real variables is a set of the form:
- L_c(f) is, a set where the function takes on a given constant value c.
- When the number of variables is two, a level set is generically a curve, called a level curve, contour line, or isoline. So a level curve is the set of all real-valued solutions of an equation in two variables x1 and x2.
- When n = 3, a level set is called a level surface, So a level surface is the set of all real-valued roots of an equation in three variables x1, x2 and x3, and
- a level hypersurface is the set of all real-valued roots of an equation in n (n > 3) variables., and for higher values of n the level set is a level hypersurface
- An "implicit curve" is a "level curve", which is considered independently of its neighbor curves, emphasizing that such a curve is defined by an implicit equation. Analogously, a level surface is sometimes called an implicit surface or an isosurface.
- Level sets show up in many applications, often under different names. "We are the first to introduce this novel idea to the field of Automated DNN Model Compression."
- In addition to unconstrained optimization, to enable CONDENSA to achieve constraint satisfaction we build on top of level-set black-box optimization (Bogunovic et al., 2016; Garg et al., 2016; Zanette et al., 2018).
- We leverage a Gaussian Process Adaptive Sampling criterion called Implicit Level Set Upper Confidence Bound (ILS-UCB) (Garg et al., 2016), that prioritizes sampling near a level set of the estimate.
- This model prioritizes searching the expected L-C curve intersection with user accuracy constraints, conditional on estimated uncertainty, and does not seek to precisely learn the shape of the entire L-C curve.
- Intuitively, by reducing the estimation space to specifically localize the sparsity that meets user accuracy constraints, we can "Reduce the total number of measurements"
- Consequently the "time required to achieve an optimal value for the sparsity".
- Hence, rather than prioritizing both high variance and high mean like UCB, ILS-UCB prioritizes sampling in areas near a level set of the mean represented by the Gaussian Process Implicit Surface.
- We propose to minimize the implicit potential defined by µ(x) − L, and where the confidence interval is large:
Ablation Study: Effect of Varying Acquisition Models
- Running Condensa Bayesian Optimizer with PI, EI, UCB as the acquisition model after 15 steps and ILS-UC after 5 samples.
- ILS-UCB gets a good estimate around the requested level set quickly with 3x fewer samples, while UCB and EI also perform reasonably but with many more samples.
- In this domain, where each sample is very expensive this difference is substantial !
Condensa Compression Experiments on Real world Tasks
- On three real-world image classification and language modeling tasks, CONDENSA achieves memory footprint reductions of up to 65× and runtime throughput improvements of up to 2.22× using at most 10 samples per search.
- With the initial framework in place, we envision a number of directions to expand on CONDENSA’s capability. For example, we plan to "augment automatic sparsity ratio inference with support for additional compression hyperparameters such as block sizes in block-sparsification" (Gray et al., 2017), and data types for quantization.
- Our long-term goal is "a framework that makes model compression easier, more flexible, and accessible to a wide range of users".
- Carl Eduard Rasmussen and Christopher K.I. Williams, “Gaussian Processes for Machine Learning”, MIT Press 2006,
- [Mat´ern, 1960] B. Mat´ern. Spatial Variation. Springer-Verlag, 2nd (1986) edition, 1960.
- [Stein, 1999] M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer Series in Statistics. Springer, 1999
- Topological shape optimization of microstructural metamaterials using a level set method
- GPU Kernels for Block-Sparse Weights OpenAI https://d4mucfpksywv.cloudfront.net/blocksparse/blocksparsepaper.pdf
- Generating Piecewise-Regular Code from Irregular Structures, Travis Augustine et al . PLDI 2019