# Condensa

# A Programming System for Model Compression

## Explaining the Novelty of A Programmable approach and

## Bayesian Optimization of Black-Box User Objectives

## For Automated DNN Model Compression

## Surpassing the State-of-the-Art in DNN Model Compression

Using *Condensa* we demonstrate that one can * Surpass the State-of-the-Art in DNN model compression* and we have compared our approach with two prior works:

- Reinforcement Learning based Automatic Model Compression, AMC (He et. al. 2018) and
- Alternating Direction Method of Multipliers based on Simulated Annealing, AutoSlim (Liu et. al. 2019) in Table 2 on
*CIFAR-10*with ResNet56 and*ImageNet*datasets with VGG16-BN. - On
*ImageNet, w*e obtain compression ratio of**25.59x**as compared to**6.4x**(AutoSlim Liu et.al. 2019) surpassing the state-of-the art and - A throughput improvement of
**1.16x**using Filter Pruning Scheme as compared to AMC which uses theoretical FLOPs**(1.25x)**.

## Fully Automated DNN Model Compression Capabilities for Practical Usage

**Illustration of the level of automation provided by CONDENSA**

- Problem:
**Improve the inference throughput**of VGG-19 (Simonyan & Zisserman, 2014) on the CIFAR-10 image classification task (Krizhevsky et al., 2014). - VGG-19 is a convolutional neural network, one way to improve its inference performance on modern hardware such as GPUs is by pruning away
*individual**convolutional**filters*(He et al., 2018a). - Condensa's solution corresponds to a sparsity ratio of
**0.72**and is depicted as the*vertical dashed line*.**This result is significant for three reasons:**

- Using the
*CONDENSA*library, the filter pruning strategy employed for this experiment was expressed in*less than*,**10 lines of Python code** - The optimal sparsity ratio of
**0.72**(shown as the vertical dashed line in the Figure) that achieves a state-of-the-art throughput of 2130 images/sec (**2.22×**improvement) and a top-1 accuracyof 0.5% was obtained automatically by CONDENSA using a**improvement***sample-efficient constrained Bayesian optimization algorithm*. - For this to work, the
**user didn’t have to specify any sparsity ratios manually,**and instead only had to define a*domain-specific objective function*to*maximize*(inference throughput, in this case).

## Demonstrated "Actual" TensorRT Speedups

**Layer-Wise Runtime Performance**

- Table 3 on the left shows
**layer-by-layer**compression ratios and mean runtimes collected over 100 runs for filter pruning. - The columns labeled
*R and C*represent results for the reference, and filter-pruned models, respectively. - We only show data for convolutional layers as they
**dominate computation**time for this network. - We observe large inference runtime speedups in later layers of the network.
*This result helps us gain more insight*into how the L-C algorithm distributes global sparsity ratios to each layer, resulting in actual hard- ware speedups.

## Explaining the Novelty of using a Programmable Approach

### "Natual Expression of Compression Intent based on DNN Architecture & Inference Hardware"

**Selectively** compress DNN layers based on

- The
**graph properties**of the network for example BatchNormalization or Projection Layer etc.`layer.<PROPERTY_NAME>`

methods. - Targeted Infer
*e*nce Hardware using`platform_has_<QUERY_NAME>`

methods.

Optimization on "**any**" User defined** Black Box **function indicating a *performance metric* of the DNN to improve, some examples are:

- Actual Inference throughput
- Memory Footprint
- Floating point operations per second

**Easily** and **Correctly **compose Condensa *schemes*

*Interesting*Condensa*schemes*can be easily**constructed**using the`Compose[S<n]`

method- For example easily obtain an additional
**2x**compression by adding`Quantize(float16)`

if platform supports IEEE half precision floating point computations

### Ability to "Programmatically Induce" Sparsity Patterns into DNN for "Inference Accelaration"

- In line 6 below, the user of
writes a**Condensa***compression**program*/*scheme*, we have a pre-written "" library of popular**scheme***compression**techniques*. - The novelty we want to highlight in
is the "**Condensa**" like the ones shown below,**Ability to express sparsity patterns***imperatively*in`PyThon`

*easily*, in the*compression**program*of line-6. - Having this ability is very useful for Inference Acceleration, Here is the a use case for this approach:
- Numerous sparse formats have already been investigated to improve the overall performance of operations such as Sparse Matrix-Vector multiply including focusing on vectorizability of the final DNN.
- One could automatically build sets of regular subcomputations by mining for regular sub-regions in the irregular data structure and generate GEMM codes that is specialized to the sparsity structure of the input matrix (or tensor), but which does not need anymore any indirection array, thereby improving SIMD vectorizability.
- This mining is done at compile-time, and
and**requires the sparsity structure (e.g., the nonzero coordinates) to be both known at compile-time**, as the GEMM code generated is valid only for**invariant during the computation****one specific sparsity structure.**

**As far as we are aware, no existing framework or approach provides this feature, of Programmatically Inducing Particularly Acceleratable Sparsity Patterns into the DNN Compression workflow.**

**Line 6 :**Condensa user can**develop**his/her own scheme, or choose from Condensa library provides*well tested*and*useful*schemes.**Line 8 :**Express**black-box function**that represents performance metric of the DNN to optimize for, in this listing we show*throughput maximization.*

## Explaining the Novelty of using Implicit Level Sets for Black-Box Optimization

There are * two major choices* that must be made when performing Bayesian optimization.

- Phase 1 : Develop a
that will express assumptions about the function being optimized. For this we choose the**prior over functions**, due to its**Gaussian process prior***flexibility*and*tractability*. - Phase 2: Develop an
, which is used to construct a utility function from the model posterior, allowing us to determine the next point to evaluate.**acquisition function**

### Phase 1: Systematically Placing Priors over Black Box Functions

*Any Bayesian method*, by definition.**depends on a prior distribution***A Bayesian optimization method*if**will converge to the optimum**- The acquisition function is continuous and approximately minimizes the risk (defined as the expected deviation from the global minimum at a fixed point x); and
- Conditional variance converges to zero (or appropriate positive minimum value in the presence of noise) if and only if the distance to the nearest observation is zero.

*Many models could be used for this prior*, By default we use Gaussian Process (GP) prior in**Condensa**

This plot illustrates the prior we use in *Condensa* when searching for the global sparsity value, By default we setup **a simple 1D Gaussian process**. In this plot above, you will notice 3 past observations. The solid black line is the **GP surrogate** mean prediction of the objective function given the data, and the shaded area shows the mean plus and mean minus the variance. The superimposed Gaussians corresponds to the GP mean and standard deviation of the prediction at the 3 previously sampled points.

Here there are 2 plots, the plot on the left illustrates *Condensa* sparsity profile for ResNet 110 (CIFAR-10) Filter Pruning and the Goal for Bayesian optimization is **search for a global sparsity value** that satisfies a user provided accuracy constraint, for example line 17 in the listing above (- 2.00%).

The plot on the right, illustrates **how we modeled our prior belief of this expensive black-box function **(L-C , the Green Curve) to be a Gaussian Process, with kernel functions that have default values, and can be *easily modified* by the Condensa user, depending on the dataset-network under compression.

**Gaussian Processes (GP)** are a generic supervised learning method designed to solve *regression* and *probabilistic classification* problems.

The *advantages* of Gaussian processes are:

- The prediction
**interpolates**the observations for regular kernels. - The prediction is probabilistic (Gaussian) so that we can compute empirical
**confidence intervals**and decide based on those if one should refit the prediction in some region of interest. **Versatile**: different kernels can be specified. Common kernels are provided, in*Condensa*but it is also possible to.**specify custom kernels**

The choice of co variance functions for the Gaussian Process is crucial, as it determines the smoothness properties of samples drawn from it.

- The squared exponential kernel is naive, in that divergences of all features of sparsity affect the co-variance equally.
- In
*Condensa,*we use the[Mat´ern, 1960, Stein, 1999] kernel by default, which incorporates a smoothness parameter, ς to permit greater flexibility in modeling functions.**Mat´ern** - The
**Mat´ern****kernel**reduces to the*squared exponential kernel*as ς*unsquared exponential kernel.* - While the squared exponential and
**Mat´ern***Appropriate covariance functions can also be used in Condensa to extend the model in other interesting ways"*

### Phase 2: Analytically Developing Acquisition Components to Guide the Search for Optimum

- Now that we have discussed
*placing priors*over smooth functions and how to*update these priors*in light of**new****observations** - The role of the acquisition function is to
.**guide the search for the optimum** - Typically, acquisition functions are defined such that high acquisition corresponds to potentially high values of the objective function, whether because
- the prediction is high,
- the uncertainty is great,
- or both.

- Maximizing the acquisition function is used to select the next point at which to evaluate the function.

- One intuitive strategy is to maximize the probability of improving over the best current value (incumbent).
- This function is also sometimes called MPI, for “maximum probability of improvement” or “the P-algorithm”,since the utility is the probability of improvement.
- A somewhat more satisfying alternative acquisition function would be one that takes into account not only the probability of improvement, but the magnitude of the improvement a point can potentially yield. In particular, one could minimize the expected deviation from the true maximum, when choosing a new trial point.

- A more recent development is the idea of exploiting lower confidence bounds (upper, when considering maximization) to construct acquisition functions that minimize regret over the course of their optimization.
- Like other parameterized acquisition models we have seen, the parameter κ is could be decided by user.
- PI, EI and GP-UCB give rise to distinct sampling behaviour over time, as we illustrate in the figure below.
- With several different parameterized acquisition functions in the literature, it is often unclear which one to use. For the problem of Automated DNN model compression which is the goal of
*Condensa,*we propose ILS-UCB, that is described next.

## Novelty of Using "Level Sets" to represent User Constraints

A level set of a real-valued function f of n real variables is a set of the form:

- L_c(f) is, a set where the function takes on a
**given constant value c.** - When the number of variables is two, a
is generically a curve, called a**level set**,**level curve**, or**contour line**. So a level curve is the set of all real-valued solutions of an equation in two variables x1 and x2.**isoline** - When n = 3, a
is called a**level set**, So a**level surface**is the set of all real-valued roots of an equation in three variables x1, x2 and x3, and**level surface** - a
is the set of all real-valued roots of an equation in n (n > 3) variables., and for higher values of n the level set is a**level hypersurface****level hypersurface** - An "
, emphasizing that such a curve is defined by an implicit equation. Analogously, a level surface is sometimes called an**implicit curve" is a "level curve", which is considered independently of its neighbor curves**or an**implicit surface**.**isosurface** show up in many applications, often under different names. "**Level sets****We are the first to introduce this novel idea to the field of Automated DNN Model Compression."**

- In addition to
*unconstrained**optimization*, to enableto achieve**CONDENSA****constraint**we build on top of**satisfaction**(Bogunovic et al., 2016; Garg et al., 2016; Zanette et al., 2018).**level-set black-box optimization** - We leverage a Gaussian Process Adaptive Sampling criterion called Implicit Level Set Upper Confidence Bound (ILS-UCB) (Garg et al., 2016), t
**hat prioritizes sampling near a level set of the estimate.** - This model prioritizes searching the expected L-C curve intersection with
**user****accuracy constraints**, conditional on estimated uncertainty, and.**does not seek to precisely learn the shape of the entire L-C curve**

- Intuitively, by reducing the estimation space to s
,**pecifically localize the sparsity that meets user accuracy constraints****we can "Reduce the total number of measurements"** - Consequently the
**"time required to achieve an optimal value for the sparsity".** - Hence, rather than prioritizing both high variance and high mean like UCB, ILS-UCB prioritizes
represented by the Gaussian Process Implicit Surface.**sampling in areas near a level set of the mean** - We propose to
by µ(x) − L, and where the confidence interval is large:**minimize the implicit potential defined**

## Ablation Study: Effect of Varying Acquisition Models

- Running Condensa Bayesian Optimizer with PI, EI, UCB as the acquisition model after 15 steps and ILS-UC after 5 samples.
- ILS-UCB gets a good estimate around the requested level set quickly with 3x fewer samples, while UCB and EI also perform reasonably but with many more samples.
- In this domain, where each sample is very expensive this difference is substantial !

## Condensa Compression Experiments on Real world Tasks

- On three real-world image classification and language modeling tasks,
achieves memory footprint reductions of up to**CONDENSA**and runtime throughput improvements of up to**65×**using at most**2.22×**samples per search.**10** - With the initial framework in place, we envision a number of directions to expand on CONDENSA’s capability. For example, we plan to "
(Gray et al., 2017), and data types for quantization.**augment automatic sparsity ratio inference with support for additional compression hyperparameters such as block sizes in block-sparsification"** **Our long-term goal is "a framework that makes model compression easier, more flexible, and accessible to a wide range of users".**

### References

- https://scikit-learn.org/stable/modules/gaussian_process.html
- Carl Eduard Rasmussen and Christopher K.I. Williams, “Gaussian Processes for Machine Learning”, MIT Press 2006,
- [Mat´ern, 1960] B. Mat´ern. Spatial Variation. Springer-Verlag, 2nd (1986) edition, 1960.
- [Stein, 1999] M. L. Stein. Interpolation of Spatial Data: Some Theory for Kriging. Springer Series in Statistics. Springer, 1999
- Topological shape optimization of microstructural metamaterials using a level set method
- GPU Kernels for Block-Sparse Weights OpenAI https://d4mucfpksywv.cloudfront.net/blocksparse/blocksparsepaper.pdf
- Generating Piecewise-Regular Code from Irregular Structures, Travis Augustine et al . PLDI 2019