Quantization

Aim of this page

I would like to summarize the major quantization techniques used in the industry. By "quantization" I mean converting a float (32- or 64-bit) ML model into an integer (8-bit or fewer) model. Every big company nowadays has a team specializing in this area (Facebook, Google, Apple, ...). Interestingly, hardware companies (especially AI-oriented ones) also invest in quantization to boost the attractiveness of their products. These pages will try to make sense of the main ideas from all the existing groups and present each of them in up to a paragraph.

I would also like to exchange ideas through a blog: https://andreymath-quantizing.blogspot.com/

Background

There are numerous basic quantization techniques. Let us have a look at some of them and their properties.

Before we do that, let us examine a simple example that explains where the whole theory comes from. Most NN layers have a basic operation consisting of a linear operator applied to the input: sum_i w_i * input_i, where the numbers w_i are usually called weights and the input is presented as a vector input_i. We then apply one of the common activation functions (tanh, sigmoid, etc.) to this result. So, in the case of floating point computations, the output vector at place j is

output_j = activation(sum_i w_ji * input_i)
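For reference, here is a minimal numpy sketch of this floating point layer; the function name dense_layer_float and the choice of tanh as the activation are assumptions made for illustration only.

    import numpy as np

    def dense_layer_float(weights, inputs, activation=np.tanh):
        # weights: matrix of w_ji with one row per output neuron,
        # inputs: vector input_i; computes activation(sum_i w_ji * input_i).
        return activation(weights @ inputs)

    # Example: 3 inputs, 4 outputs.
    w = np.random.randn(4, 3).astype(np.float32)
    x = np.random.randn(3).astype(np.float32)
    y = dense_layer_float(w, x)   # output_j for j = 0..3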

A natural question arises: how can we perform all these computations if each variable holds an integer value? That is, if input_i, w_ji and output_j are all translated to integers, how do we compute?

Also, why would we want to do that?

The "why" question is pretty simple: a silicon implementation of an integer operation (plus, minus, multiplication, division, etc.) is far more power efficient and cheaper than the same operation implemented for floats! As a result, int-based chips simply consume less power and are fast; the Syntiant audio chip works at micro-ampere levels!

The question of how we can address in numerous ways. The basic idea is to create bins (quantization levels) for the inputs and weights, converting all the operations into integer computations. The activation function then becomes a table (from integers to integers) or a simple approximation (see https://nervanasystems.github.io/distiller/quantization.html). To create a network that is even better suited for quantization, one can avoid complicated activation functions and use clipping instead, as is done for mobilenet_V2 (https://github.com/tensorflow/models/tree/master/research/slim/nets/mobilenet). In general, you want to stick to operations that are natural in the integer world: addition, multiplication, clamping, etc. A minimal sketch of the binning idea is given below.
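To make the binning idea concrete, here is a minimal sketch of symmetric linear quantization with numpy. The helper names (quantize_symmetric, dense_layer_int), the 8-bit signed range, and the choice of tanh as the activation are my own assumptions for illustration, not the API of any particular library; a truly integer-only pipeline would also replace the float activation below with a precomputed integer-to-integer lookup table.

    import numpy as np

    def quantize_symmetric(x, num_bits=8):
        # Map floats to signed integers in [-(2^(b-1) - 1), 2^(b-1) - 1]
        # using a single per-tensor scale (hypothetical helper, for illustration).
        qmax = 2 ** (num_bits - 1) - 1
        scale = max(np.max(np.abs(x)), 1e-8) / qmax
        q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int32)
        return q, scale

    def dense_layer_int(w_q, x_q, w_scale, x_scale, activation=np.tanh):
        # Integer accumulation of sum_i w_ji * input_i, done entirely in int32.
        acc = w_q @ x_q
        # Rescale the accumulator back to real values before the activation.
        # An integer-only deployment would instead fold this rescaling into a
        # lookup table from integer accumulator values to integer outputs.
        return activation(acc * (w_scale * x_scale))

    # Usage: compare the quantized layer against the float reference.
    w = np.random.randn(4, 3).astype(np.float32)   # weights w_ji
    x = np.random.randn(3).astype(np.float32)      # input vector input_i
    w_q, w_s = quantize_symmetric(w)
    x_q, x_s = quantize_symmetric(x)
    y_int = dense_layer_int(w_q, x_q, w_s, x_s)
    y_float = np.tanh(w @ x)                       # should be close to y_int

The design choice here is "symmetric" quantization (scale only, no zero point), which keeps the sketch short; asymmetric schemes with a zero point are also common in practice.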