While the fundamental goal of quantization remains the same, namely reducing the numerical precision of model parameters and activations to improve computational efficiency, applying these techniques effectively to Large Language Models (LLMs) requires adapting core principles to their unique scale and structure. Let's revisit these foundational concepts through the lens of multi-billion-parameter Transformer models.
At its heart, quantization maps high-precision floating-point values (like FP32 or FP16) to lower-precision representations, typically integers (like INT8 or INT4) or specialized low-bit floating-point formats. This reduction is the source of the efficiency gains: smaller model sizes lead to reduced memory bandwidth requirements and faster computations on hardware optimized for lower-precision arithmetic.
The process of mapping a floating-point value x to its quantized counterpart q involves defining a mapping function. Two primary schemes are common:
Symmetric Quantization: This scheme maps floating-point values symmetrically around zero. The quantization formula is typically:
$$q = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{s}\right), Q_{\min}, Q_{\max}\right)$$

Here, s is the scaling factor, calculated based on the range of the absolute values observed in the tensor (e.g., $s = \max(|x|) / Q_{\max}$, where $Q_{\max}$ is the maximum representable integer value). The clip function ensures the result stays within the valid range of the target integer type (e.g., $[-128, 127]$ for INT8). A significant property of symmetric quantization is that the floating-point value 0.0 maps precisely to the integer 0. This is beneficial for operations involving padding or sparsity.
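To make the mapping concrete, here is a minimal NumPy sketch of per-tensor symmetric quantization and dequantization. The function names, the small epsilon guard against all-zero tensors, and the INT8 range are illustrative choices, not part of any particular library.

```python
import numpy as np

def quantize_symmetric(x: np.ndarray, num_bits: int = 8):
    """Per-tensor symmetric quantization: 0.0 maps exactly to integer 0."""
    q_max = 2 ** (num_bits - 1) - 1                         # 127 for INT8
    scale = max(float(np.max(np.abs(x))) / q_max, 1e-12)    # s = max(|x|) / Q_max
    q = np.clip(np.round(x / scale), -q_max - 1, q_max).astype(np.int8)
    return q, scale

def dequantize_symmetric(q: np.ndarray, scale: float) -> np.ndarray:
    """Map the integers back to (approximate) floating-point values."""
    return q.astype(np.float32) * scale
```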
Asymmetric Quantization: This scheme handles value ranges that are not centered around zero by introducing a zero-point (or offset) z:
$$q = \mathrm{clip}\left(\mathrm{round}\left(\frac{x}{s}\right) + z, Q_{\min}, Q_{\max}\right)$$

The scaling factor s is calculated based on the full range of values ($s = (\max(x) - \min(x)) / (Q_{\max} - Q_{\min})$), and the zero-point z represents the integer value corresponding to the floating-point 0.0 ($z = -\mathrm{round}(\min(x)/s) + Q_{\min}$). This allows asymmetric quantization to utilize the full integer range more effectively when the data distribution is skewed or shifted.
Difference between symmetric and asymmetric quantization mappings. Symmetric maps 0.0 to integer 0, while asymmetric uses a zero-point z to handle offset ranges.
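A corresponding sketch for the asymmetric scheme, again with hypothetical function names, shows how the zero-point enters both the quantization and dequantization steps.

```python
import numpy as np

def quantize_asymmetric(x: np.ndarray, num_bits: int = 8):
    """Per-tensor asymmetric quantization with a zero-point offset z."""
    q_min, q_max = -(2 ** (num_bits - 1)), 2 ** (num_bits - 1) - 1   # -128, 127 for INT8
    x_min, x_max = float(np.min(x)), float(np.max(x))
    scale = max((x_max - x_min) / (q_max - q_min), 1e-12)  # s = (max(x) - min(x)) / (Q_max - Q_min)
    zero_point = int(round(-x_min / scale)) + q_min        # z = -round(min(x)/s) + Q_min
    zero_point = int(np.clip(zero_point, q_min, q_max))    # keep z inside the integer range
    q = np.clip(np.round(x / scale) + zero_point, q_min, q_max).astype(np.int8)
    return q, scale, zero_point

def dequantize_asymmetric(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """Subtract the zero-point before rescaling to recover approximate values."""
    return (q.astype(np.float32) - zero_point) * scale
```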
For LLMs, weights often have distributions reasonably centered around zero, making symmetric quantization a common choice. However, activations, particularly after ReLU or GeLU functions, can be strictly non-negative or have highly asymmetric distributions. In such cases, asymmetric quantization might offer better representation fidelity by utilizing the available integer range more efficiently, although it introduces the zero-point parameter z, adding slight complexity to computations.
Another fundamental choice is the granularity at which the scaling factor s (and zero-point z, if asymmetric) is applied:
Per-Tensor Quantization: A single s and z are used for the entire tensor (e.g., all weights in a specific linear layer). This is the simplest approach, minimizing the overhead of storing quantization parameters. However, if different parts of the tensor have vastly different value ranges, a single scaling factor might lead to poor precision for the lower-range values or clipping for the higher-range values.
Per-Channel (or Per-Axis/Per-Group) Quantization: Separate s and z values are computed for specific dimensions of the tensor. For the weight matrix of a linear layer (shape [output_features, input_features]), per-channel quantization typically means calculating a unique s (and z) for each output channel (i.e., each row). This allows the quantization range to adapt more closely to the distribution within each channel or group of parameters.
Example of weight value ranges varying significantly between two hypothetical channels in a linear layer. Per-tensor quantization would struggle to accurately represent both ranges simultaneously.
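The following sketch applies symmetric quantization per output channel by computing one scale per row of the weight matrix. The function name and the epsilon guard are illustrative assumptions.

```python
import numpy as np

def quantize_per_channel_symmetric(w: np.ndarray, num_bits: int = 8):
    """Symmetric quantization with one scale per output channel (row) of a
    [output_features, input_features] weight matrix."""
    q_max = 2 ** (num_bits - 1) - 1
    # One scale per row; keepdims makes it broadcast across the input dimension.
    scales = np.max(np.abs(w), axis=1, keepdims=True) / q_max
    scales = np.maximum(scales, 1e-12)                      # guard against all-zero rows
    q = np.clip(np.round(w / scales), -q_max - 1, q_max).astype(np.int8)
    return q, scales                                        # scales has shape [output_features, 1]
```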
In LLMs, linear layers constitute the vast majority of parameters. The weights within these layers often exhibit significant variations in range across different output channels (neurons). Consequently, per-channel quantization is frequently the preferred method for quantizing LLM weights, as it generally yields better accuracy preservation compared to per-tensor quantization for the same bit-width. Per-tensor quantization might still be considered for activations, where the overhead of per-channel parameters could be more substantial relative to the computation, or where distributions are more uniform. Variations like per-group quantization (applying scales to blocks of weights within a channel) also exist as a compromise.
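For completeness, a per-group variant might look like the sketch below, which assumes the input dimension divides evenly by the group size; the 128-element group size and the INT4 range are common but arbitrary choices here.

```python
import numpy as np

def quantize_per_group_symmetric(w: np.ndarray, group_size: int = 128, num_bits: int = 4):
    """Symmetric quantization with one scale per contiguous block of `group_size`
    weights inside each output channel (assumes input_features % group_size == 0)."""
    q_max = 2 ** (num_bits - 1) - 1                         # 7 for INT4
    out_f, in_f = w.shape
    groups = w.reshape(out_f, in_f // group_size, group_size)
    scales = np.maximum(np.max(np.abs(groups), axis=-1, keepdims=True) / q_max, 1e-12)
    q = np.clip(np.round(groups / scales), -q_max - 1, q_max).astype(np.int8)
    return q.reshape(out_f, in_f), scales.squeeze(-1)       # one scale per (row, group)
```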
Whichever scheme (symmetric or asymmetric) and granularity (per-tensor or per-channel) you choose, determining the optimal quantization parameters (s and z) is essential. This process, known as calibration, involves analyzing the statistical distribution of the floating-point values you intend to quantize. For Post-Training Quantization (PTQ), this typically requires running inference on a small, representative dataset (the calibration dataset) to observe the typical ranges of weights and, more importantly, activations. The choice of calibration data and the method used to derive s and z from the observed ranges (e.g., min/max values, percentile ranges) significantly impact the accuracy of the final quantized model.
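As an example of how calibration statistics can translate into quantization parameters, the sketch below derives a symmetric per-tensor scale from a percentile of observed activation magnitudes; the percentile value and function name are assumptions for illustration.

```python
import numpy as np

def calibrate_symmetric_scale(activation_samples, percentile: float = 99.9, num_bits: int = 8):
    """Derive a per-tensor symmetric scale from activations observed on a
    calibration dataset. A high percentile (rather than the absolute max)
    discards rare outliers that would otherwise inflate the scale."""
    q_max = 2 ** (num_bits - 1) - 1
    abs_vals = np.abs(np.concatenate([a.ravel() for a in activation_samples]))
    clip_value = np.percentile(abs_vals, percentile)        # e.g. 99.9th percentile of |x|
    return max(float(clip_value) / q_max, 1e-12)

# Hypothetical usage: run a few calibration batches through the model,
# collect a layer's activations, then compute that layer's scale.
# scale = calibrate_symmetric_scale(collected_activations_for_layer)
```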
Why is revisiting these basics so important for LLMs?
Understanding these fundamental trade-offs between symmetric/asymmetric mapping and per-tensor/per-channel granularity provides the necessary context for evaluating and implementing the more advanced PTQ and QAT techniques tailored specifically for LLMs, which we will cover in subsequent sections. These advanced methods often refine how scaling factors are calculated or selectively apply different granularities to optimize the accuracy-performance balance.