Binary-code based quantization (sub-3bit!) [code]

TL;DR

This article showcases the benefits of BCQ: 1) BCQ's ability to achieve extremely low bit quantization, and 2) readily available mixed precision matmul scheme BiQGEMM allowing fast on-device inference. For reference, Chung & Kim et al. (2020) demonstrated 3.5× speedup, 8.3× runtime memory reduction and 11.8× model size compression with sub-3bit transformers on the on-device translation task — while preserving BLEU scores. And check out the code here.

Models are larger than ever

The recent influx of large language models (LLMs) has shown us that there is no limit to how large a model will go for that sweet performance. These models continue to grow in size, pushing the boundaries of what is considered possible. However, this size expansion poses a challenge for the general public, as it becomes increasingly difficult to harness the power of these machine learning (ML) models on personal devices. Consequently, utilizing the latest ML products can potentially jeopardize user privacy. For instance, imagine seeking solace in consulting ChatGPT about your mental hardships, only to realize that doing so requires uploading your entire conversation to external servers managed by OpenAI.

Additionally, the cost of training large language models (LLMs) poses significant challenges for small companies, creating barriers to entry and concentrating power within the industry. The immense financial requirements associated with training LLMs make it difficult for smaller businesses with limited budgets to compete with larger organizations that can afford the investment. Consequently, this concentration of power among tech giants and well-funded corporations stifles innovation, limits access to state-of-the-art language models, and hampers the ability of smaller companies to offer unique features and capabilities.

In the context of neural network compression, the significance of compression techniques becomes increasingly crucial. However, neural network compression is a challenging multi-objective optimization problem that aims to reduce inference latency, runtime memory usage, and model size while maintaining accuracy. Among these techniques, binary-code-based quantization emerges as a highly effective method that achieves a balanced outcome across all these objectives. This method proves particularly valuable when applied to transformers, which serve as the foundational element for AI advancements across various domains.

In this article, our focus centers on highlighting the importance of binary-code-based quantization as a means to optimize neural networks, specifically transformers. By harnessing this technique, we strive to enhance the performance and accessibility of large language models while simultaneously addressing the challenges associated with their considerable size and demanding resource requirements.

Sub 3-bit transformer quantization

Binary-code-based quantization (BCQ) emerges as a highly effective quantization method specifically designed for transformers. Chung & Kim et al. (2020) demonstrated its potential by achieving sub 3-bit quantization for transformer models in multiple translation tasks, with negligible loss in BLEU score—within -0.5 BLEU across 3 translation tasks. Furthermore, BCQ also delivers a 3.5X speedup and an 8.3X reduction in runtime memory footprint compared to full precision 32-bit models.

BCQ offers distinct advantages over the well-known uniform 8-bit quantization:

Low bit weights can be efficiently multiplied with full precision activation, resulting in substantial speed gains with minimal impact on task performance. In contrast, uniform quantization requires activation quantization for quantized inference, often leading to higher quantization errors and latency overhead.
BCQ enables extremely low bit quantization, as empirically demonstrated (Chung & Kim et al. 2020).

In the following content, we will discuss why BCQ is capable of achieving these outcomes through its unique advantages of efficient mixed precision multiplication, resulting in significant speed gains with minimal impact on task performance, as well as its ability to enable extremely low bit quantization.

High inference overhead of uniform 8-bit quantization

Before we talk about BCQ, it is crucial to shed light on the overheads associated with uniform quantization. Uniform quantization entails the mapping of 32-bit full precision (FP32) weights to 8-bit integer (INT8) weights. This process involves several steps for each weight matrix: determining the min-max value range, followed by mapping each FP value within the matrix to a corresponding value ranging from 0 to 255 (or sometimes -127 to 127). The objective is to ensure that the smallest value is mapped to 0 and the largest value is mapped to 255 (refer to Figure 1 and 2).

Figure 1. uniform 8-bit quantization (src).

Figure 2. quantization equation of uniform 8-bit quantiation.

While the process of uniform quantization is straightforward and effective for various neural networks, the predicament arises during the inference phase. The crux of the issue lies in the absence of a known effective scheme for multiplying INT8 weights with FP32 activation values. To perform matrix multiplication between the quantized weights and float input values, one of two approaches must be taken.

Runtime weight dequantization
- This involves dequantizing the quantized weight values to FP32. Since the matrix multiplication is performed between dequantized FP32 weights and FP32 activation values, no speed improvement can be expected.
Quantized matmul by quantizing FP32 activation to INT8: This approach introduces notable problems.
- First, quantizing the FP32 activation introduces additional noise to the overall quantization scheme, potentially leading to further degradation in task performance.
- Secondly, the matmul operation between INT8 weights and activation values results in an INT32 output matrix. To ensure compatibility, the output range should be pre-established, which often involves approximations that may not be fully accurate.
- Additionally, the INT32 result must be converted back to INT8 to be passed on to the next layer, typically involving a two-stage data conversion process: INT32 👉 FP32 (or FP16) 👉 INT8, further increasing to overall latency (see Figure 3).

Figure 3. Illustration of quantized matmul w/ uniform quantization (src).

The lack of a direct "as-is" multiplication scheme between INT8 weights and FP activation values forces uniform quantization's inference to suffer from high memory and latency overhead and additional quantization noise.

Furthermore, studies have highlighted the varying significance of computations in full precision (Bhandare et al., 2019)—as a consequence, achieving an optimal trade-off between accuracy and latency often requires a combination of full precision and quantized operations. Uniform quantization's lack of flexible mixed precision matmul approach poses a challenge in this regard, limiting the ability to strike the desired balance.

Binary code based quantization

The implementation of BCQ involves the conversion of FP weights into a set of 1-bit weights, represented by B matrices, and corresponding FP scale parameters grouped in α vectors. Notably, each parameter in the α vector corresponds to a row of the original FP weight matrix W. For instance, when applying a 2-bit BCQ to an FP matrix with shape m×n, we obtain two m×n 1-bit matrices B₁ and B₂, accompanied by two FP vectors α₁ and α₂ of length m. Figure 4 provides a clearer visual representation of the quantized weights.

Figure 4. Example of 2-bit BCQ weights (src).

Figure 5. Example of a binary tree showing values representable by 3-bit BCQ (src).

Now, let's discuss how we can transform W into B's and α's. There exist several methods to approximate the B matrices and α vectors, and here we introduce the greedy approximation scheme:

Initialize the residual matrix R to be equal to the original FP weight matrix W.
Determine the parameters of the Bₖ matrix by assigning either +1 or -1 based on the sign of the corresponding value in W.
Compute the values in αₖ such that it minimize the difference between R and αₖ ◦ B, where each parameter in α is multiplied to a row in B.
Update R by subtracting the product of B and α: R = R - αₖ ◦ Bₖ.
Repeat steps 2-4 n times for n-bit BCQ.

Conversely, dequantization process would be W' = α₁ ◦ B₁ + α₂ ◦ B₂ + … + αₖ ◦ Bₖ, such that an element in W' at position (i, j) is as follows:

W'ⁱʲ= α₁ⁱ B₁ⁱʲ + α₂ⁱʲ B₂ⁱʲ + … + αₖⁱ Bₖⁱʲ

Efficient "as-is" matmul between binary weights and FP values

One of the advantages of BCQ is its effortless matrix multiplication between quantized weights and FP activation values during inference. This process eliminates the need for runtime data conversion, thereby avoiding redundant overhead and the introduction of additional errors.

The key to achieving this seamless operation lies in BiQGEMM (Jeon et al. 2020), a dynamic programming-based matrix multiplication approach that enables a seamless mixed precision approach. To understand the inner workings of BiQGEMM, consider the matrix multiplication between the quantized weights W' and the FP-valued vector x (of length n), which can be represented as follows:

y = W'x = α₁ ◦ (B₁⋅x) + α₂ ◦ (B₂⋅x) + … + αₖ ◦ (Bₖ⋅x)

As discussed, the elements in B are either 1 or -1, indicating that the output elements of B⋅x are always derived from the set of all possible 2ⁿ add-subtract combinations of x's elements. Exploiting this property, we can 1) divide the B⋅x workload into sub-regions, and 2) precompute all possible outcomes of B'⋅x' per sub-region and reuse these values for efficient mixed precision matrix multiplication.

Figure 6. Splitting B and x into sub-region for efficient pre-computation (src).

Consider the hypothetical scenario presented in Figure 6, where the B matrix has dimensions of 12×8, and x is a vector consisting of 12 elements. To obtain the resulting vector y, with a length of 8, each element in the i-th row of B (Bᵢ) is multiplied by the corresponding element in the x vector. To optimize the computation, we can divide x and B into three segments, as indicated by the white and gray highlights in the figure. This allows us to express y as the sum of three sub-problems: B′ ⋅ x′, B′′ ⋅ x′′, and B′′' ⋅ x′′′. By addressing these sub-problems, each involving a subregion of x and B, we can efficiently perform the original matrix multiplication using 3 * 2⁴ pre-computed intermediate values.

Once all the intermediate values are computed, the computation of B⋅x simplifies to a sequence of indexing and addition operations, exploiting the reuse of the same intermediate values multiple times for improved efficiency. Furthermore, the subsequent multiplication between scale vectors (α's) and B⋅x is straightforward.

The principles we discussed for vector-matrix level matrix multiplication using BiQGEMM can be seamlessly extended to matrix-matrix multiplication scenarios. Moreover, BiQGEMM incorporates additional optimizations, such as reusing negated pre-computation values to reduce computational overhead by half and utilizing bit-packed indices for faster indexing of pre-computed values. For more detailed explanations of these optimizations, refer to the original paper.

As demonstrated by Chung & Kim et al. (2020), BiQGEMM's efficient matrix multiplication method provides fast and efficient inference rendering BCQ a highly beneficial approach for on-device transformer use cases, such as translation. Also, the ability to perform matrix multiplication between quantized weights and FP activation values allows BCQ to have a flexible approach (e.g. only quantize layers that aren't subject to large loss in task performance when pertubated).

Recovering from high quantization error

When mapping FP values to extremely low-bit datatypes (e.g., sub-3 bits), significant quantization errors are likely to occur, resulting in a sharp decline in task performance. To address this issue, an effective approach is to incorporate a period of non-regularization, or pNR (Lee et al. 2019). By considering quantization as a form of weight regularization, we recognize that it limits the potential range of model weights within the loss surface. Introducing pNR allows the target model to freely explore the loss space, occasionally enforcing the quantization constraint to strike a balance between quantization requirements and optimal task performance. For instance, in the case of transformers, a simple implementation could involve training the model with FP weights and applying quantization every 1000 batches to evaluate its performance. The figure below illustrates the training process with pNR.

Figure 7. Illustration of the application pNR during training (src).

Experiments show pNR constants ranging from 500 to 2000 yielded satisfactory results for transformers, with little variation between them (Chung & Kim et al., 2020). The use of a large pNR value implies that the quantization overhead is rarely incurred, resulting in only a marginal increase in overall training time.

By integrating a period of non-regularization into the training process, the adverse effects of severe quantization errors can be mitigated, allowing for improved task performance while minimizing the impact on training efficiency.

Fine-grained compression approach for transformers

Fine-grained compression is essential when dealing with transformers, as compressing them presents significant challenges. These challenges arise from various factors. Firstly, different layers within a transformer exhibit varying sensitivity to compression errors. Some layers can tolerate quantization errors without a substantial impact on task performance, while others are more sensitive to such errors. Secondly, real-world applications often have highly variable word frequency distributions. As a result, applying a uniform compression approach to all word vectors in an embedding may not yield optimal results. Lastly, compression rates and latency improvements vary across different parts of transformers. For instance, the autoregressive inference scheme requires multiple decoding steps but only a single encoding step.

Addressing these challenges, Chung & Kim et al. (2020) proposed a fine-grained quantization scheme for transformers. This approach takes advantage of BiQGEMM, which allows for a flexible quantization scheme where the quantization configuration of one layer does not impact the inference scheme of downstream layers. The proposed scheme enables a granular approach to quantization, assigning varying bit precision to different parts of the transformer. This fine-grained quantization approach can be categorized based on the level of granularity in the quantization scheme:

Block-wise scheme:

Table 1. FLOPs and on-device latency on a Galaxy N10+ contribution of each transformer block (src).

The contribution of each block in a transformer (encoder, decoder, and embedding) to the total latency is depicted in Table 1. While the encoder block demonstrates minimal overall overhead, the decoder and embedding-related functions account for the majority of the inference time. It is important to note that the actual latency overhead differs from FLOPs measurements, as factors such as parameter loading-unloading and cache utilization significantly impact overall latency.

Layer-wise scheme:

Table 2. Each layer's sensitivity to quantization (src).

Table 2 illustrates the varying sensitivity of different layers to quantization. Certain layers exhibit higher sensitivity, resulting in a steep degradation in BLEU score when subjected to quantization errors. Taking this into consideration, we propose a quantization scheme that assigns a higher number of bit-precision to sensitive layers while allocating lower bit-precision to more robust layers. In general, we assign higher bit precision to encoder sub-layers compared to decoder sub-layers, considering the block-wise contribution to the overall latency.

Vector-wise scheme:

Figure 8. Frequency of word occurrence in WMT19 translation datasets (src).

Figure 9. BLEU score of transformers -
same bit assignment vs fine grained precision (src).

In real-world text, words exhibit a power law frequency distribution, where certain words are highly frequent while others are rarely encountered. Building upon this observation, Chung and Kim (2020) propose a vector-wise scheme that assigns varying numbers of bits based on word frequency. Specifically, a group of the most frequent words is allocated a higher number of bits (e.g., 4 bits), while subsequent groups receive progressively fewer bits (3, 2, and 1 bits). This fine-grained approach enables efficient compression of the embedding block while preserving accuracy. Notably, in Figure 9, it is demonstrated that employing this fine-grained approach maintains BLEU scores more effectively at lower average bit-precision levels.

Application: on-device translation with sub-3bit transformer

Table 3. BCQ-ed transformers show high compression rate while managing on-par translation performance compared to the FP baseline (src).

Table 3 showcases the results from the study conducted by Chung & Kim et al. (2020). In their research, they introduce a sub-3-bit transformer model. The experimental findings reveal that this transformer model offers remarkable benefits when applied to on-device translation tasks.

Table 4. On-device (i.e. Galaxy N10+) latency and memory footprint of BCQ models. Corresponding BLEU scores are shown in Table 3 (src).

The 2.7-bit transformer implementation brings substantial improvements. It accelerates inference by 3.5x, enhances translation performance on smartphones. It achieves an impressive 8.3x memory compression, reducing runtime memory footprint. Additionally, the model size is reduced by 11.8x, improving on-device translation efficiency.

These findings highlight the potential of the sub-3-bit transformer in enabling faster and more compact translation models, making on-device translation more efficient and accessible to smartphone users.

Conclusion

This article explores the possibilities and benefits of binary code-based quantization. BCQ shines in its ability to facilitate extremely low bit quantization and offers a flexible mixed precision matrix multiplication scheme between binary weights and full precision activation values. Recent research, such as Park et al. (2022) and Kwon et al. (2022), further emphasizes the efficacy of BCQ for LLMs, a topic more relevant today. These studies provide valuable insights into the promising results achieved with BCQ in optimizing LLMs.

P.S. Implementation of BCQ is not as complex as it reads. This repository demonstrates the implementation of 3-bit BCQ on a transformer model for a translation task. It may give you some insight to how you can apply BCQ to your own project. Hope this helps!

References

Chung, I.*, Kim, B.*, Choi, Y., Kwon, S. J., Jeon, Y., Park, B., ... & Lee, D. (2020, November). Extremely Low Bit Transformer Quantization for On-Device Neural Machine Translation. In Findings of the Association for Computational Linguistics: EMNLP 2020 (pp. 4812-4826).
Bhandare, A., Sripathi, V., Karkada, D., Menon, V., Choi, S., Datta, K., & Saletore, V. (2019). Efficient 8-bit quantization of transformer neural machine language translation model. arXiv preprint arXiv:1906.00532.
Jeon, Y.*, Park, B*., Kwon, S. J., Kim, B., Yun, J., & Lee, D. (2020, November). BiQGEMM: matrix multiplication with lookup table for binary-coding-based quantized DNNs. In Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis (pp. 1-16).
Dongsoo Lee, Se Jung Kwon, Byeongwook Kim, Yongkweon Jeon, Baeseong Park, Jeongin Yun, & Gu-Yeon Wei. (2020). Decoupling Weight Regularization from Batch Size for Model Compression.
Park, G.*, Park, B.*, Kwon, S. J., Kim, B., Lee, Y., & Lee, D. (2022). nuQmm: Quantized matmul for efficient inference of large-scale generative language models. arXiv preprint arXiv:2206.09557.
Kwon, S. J., Kim, J., Bae, J., Yoo, K. M., Kim, J. H., Park, B., ... & Lee, D. (2022, December). AlphaTuning: Quantization-Aware Parameter-Efficient Adaptation of Large-Scale Pre-Trained Language Models. In Findings of the Association for Computational Linguistics: EMNLP 2022 (pp. 3288-3305).

* Equal contribution.

Page updated

Google Sites

Report abuse