While the core idea of quantization (reducing numerical precision to save memory and potentially speed up computation) is straightforward, the specific choices made during this process have significant implications, especially for large language models. Before we analyze advanced calibration methods or training-aware techniques, let's refine our understanding of the fundamental building blocks: how we map values, and the scope over which we apply these mappings. Revisiting them ensures we share the precise vocabulary and grounding needed to tackle the complexities of quantizing massive transformer architectures.
The Quantization Mapping
At its heart, quantization maps a continuous or large set of values (typically 32-bit or 16-bit floating-point numbers, FP32 or FP16/BF16) onto a smaller, discrete set of values, often represented by low-bit integers (like INT8, INT4). The most common approach is affine quantization, defined by a scale factor (S) and a zero-point (Z).
For a given real-valued input X, its quantized representation $X_q$ is calculated as:
$$X_q = \mathrm{clip}\left(\mathrm{round}\left(\frac{X}{S} + Z\right)\right)$$
Where:
- X: The original high-precision value (e.g., an FP32 weight or activation).
- S: The scale factor, a positive floating-point number that determines the step size between quantized levels. It maps the range of real values to the range of quantized integers.
- Z: The zero-point, an integer offset that ensures the real value zero maps correctly to a value within the quantized range. This is crucial for accurately representing distributions that are not centered around zero.
- round(⋅): Typically rounding to the nearest integer.
- clip(⋅): Clamps the result within the representable range of the target integer type (e.g., [0,255] for unsigned INT8, or [−128,127] for signed INT8).
The inverse operation, dequantization, approximates the original value:
$$X_{dq} = S\,(X_q - Z)$$
$X_{dq}$ denotes the dequantized value, which approximates the original X. While computations can often be performed directly with integer arithmetic on $X_q$ for maximum efficiency (leveraging specialized hardware instructions), dequantization is necessary when interfacing with layers that require higher precision, or for analysis.
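As a concrete sketch (not a reference implementation), the NumPy helpers below implement this affine mapping for signed INT8; the function names affine_quantize and affine_dequantize are our own.

```python
import numpy as np

def affine_quantize(x: np.ndarray, scale: float, zero_point: int,
                    qmin: int = -128, qmax: int = 127) -> np.ndarray:
    """X_q = clip(round(X / S + Z)), targeting signed INT8 by default."""
    q = np.round(x / scale + zero_point)
    return np.clip(q, qmin, qmax).astype(np.int8)

def affine_dequantize(q: np.ndarray, scale: float, zero_point: int) -> np.ndarray:
    """X_dq = S * (X_q - Z), an approximation of the original X."""
    return scale * (q.astype(np.float32) - zero_point)

x = np.array([-0.62, 0.0, 0.35, 1.2], dtype=np.float32)
q = affine_quantize(x, scale=0.01, zero_point=10)
x_dq = affine_dequantize(q, scale=0.01, zero_point=10)
# q ~ [-52, 10, 45, 127]; x_dq ~ [-0.62, 0.0, 0.35, 1.17].
# Values inside the representable range round-trip within S/2;
# 1.2 exceeds S * (qmax - Z) = 1.17 and is clipped instead.
print(q, x_dq)
```

Note that 0.0 dequantizes exactly back to 0.0 (it maps to the integer zero-point), while the out-of-range value 1.2 incurs clipping error rather than rounding error.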
Affine vs. Symmetric Quantization
Two primary schemes exist for determining S and Z:
- Affine Quantization: Uses both a scale factor (S) and a zero-point (Z), as defined above. This allows an asymmetric mapping in which the real value 0.0 is represented exactly by one of the quantized integer values (specifically, Z). It is generally preferred for quantizing activations (such as ReLU or GeLU outputs), which often have non-negative or otherwise asymmetric distributions.
- Symmetric Quantization: Simplifies the mapping by setting the zero-point Z implicitly or explicitly to zero. The formula becomes:
$$X_q = \mathrm{clip}\left(\mathrm{round}\left(\frac{X}{S}\right)\right)$$
The dequantization simplifies to $X_{dq} = S\,X_q$. Symmetric quantization is often used for model weights, whose distributions tend to be centered around zero, and it always represents the real value 0.0 exactly (as the integer 0). Its main drawbacks are that the signed INT8 range [-128, 127] is not perfectly symmetric (implementations often restrict it to [-127, 127]) and that it wastes representational range when the values are not centered on zero. In exchange, dropping the zero-point can offer slight computational advantages on certain hardware.
The choice between affine and symmetric depends on the distribution of the values being quantized and the target hardware's capabilities.
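How S and Z are chosen is a calibration question that later sections treat in more depth, but the simplest and most common recipe is min-max calibration. The sketch below contrasts the affine and symmetric derivations for signed INT8; the helper names and the example tensor are our own illustrative choices.

```python
import numpy as np

def affine_params(x: np.ndarray, qmin: int = -128, qmax: int = 127):
    """Min-max calibration: fit [min(x), max(x)] onto [qmin, qmax]."""
    x_min = min(float(x.min()), 0.0)   # extend the range so real 0.0 stays representable
    x_max = max(float(x.max()), 0.0)
    scale = (x_max - x_min) / (qmax - qmin)
    zero_point = int(round(qmin - x_min / scale))
    return scale, zero_point

def symmetric_params(x: np.ndarray, qmax: int = 127):
    """Symmetric calibration: fit [-max|x|, +max|x|] onto [-qmax, qmax], with Z fixed at 0."""
    return float(np.abs(x).max()) / qmax, 0

# A non-negative, ReLU-like activation tensor:
act = np.abs(np.random.randn(1024)).astype(np.float32)
print(affine_params(act))      # spreads [0, max] across the full integer range
print(symmetric_params(act))   # leaves the entire negative half of the range unused
```

Because the symmetric variant keys off the maximum magnitude only, a non-negative activation tensor leaves half of the integer levels unused, which is the practical reason affine quantization is usually preferred for activations, while symmetric quantization remains a good fit for roughly zero-centered weights.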
Quantization Granularity: Per-Tensor vs. Finer Scopes
A critical decision is the granularity or scope over which the quantization parameters (S and Z) are calculated and applied.
- Per-Tensor Quantization: The simplest approach. A single pair of (S, Z) values is computed for an entire tensor (e.g., all weights in a linear layer's weight matrix, or all activation values in a feature map).
  - Pros: Minimal overhead; only one S and one Z need to be stored per tensor.
  - Cons: Can suffer significant accuracy loss if the range of values varies substantially across the tensor. A few outliers force a large scale factor, squeezing the remaining values into a handful of quantization levels and leading to poor resolution.
- Per-Channel (or Per-Axis) Quantization: A more refined approach, particularly common for weight tensors in convolutional and linear layers. Instead of one (S, Z) pair for the whole weight matrix, a separate pair is calculated for each output channel (or along a chosen axis). For a weight matrix W of shape [output_channels, input_channels], you would compute output_channels pairs of (S, Z).
  - Pros: Significantly improves accuracy over per-tensor quantization, since it adapts to the potentially different value distributions across channels.
  - Cons: Increased metadata overhead; multiple S and Z values must be stored per tensor.
- Per-Group (or Block-wise) Quantization: An even finer granularity, primarily applied to weights, especially in very low-bit quantization (e.g., INT4, NF4). The tensor is divided into smaller blocks or groups (e.g., groups of 64 or 128 consecutive values), and a separate (S, Z) pair is calculated for each block.
  - Pros: Offers the highest potential accuracy among these granularities and is particularly effective at isolating the impact of outliers within a tensor. Essential for aggressive quantization such as 4-bit.
  - Cons: Highest metadata overhead; the storage cost of the S and Z values becomes more substantial relative to the quantized weights themselves. Computational overhead during quantization and dequantization can also increase.
Comparison of quantization granularities for a weight tensor. Per-tensor uses one scale/zero-point pair, per-channel uses one pair per row (output channel), and per-group uses one pair for smaller blocks within the tensor.
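To make the three granularities tangible, here is a minimal NumPy sketch that quantizes a toy [output_channels, input_channels] weight matrix symmetrically at each granularity; the matrix shape, the group size of 64, and the helper name are illustrative choices rather than fixed conventions.

```python
import numpy as np

def quantize_symmetric(x, scale, qmax=127):
    """Symmetric INT8 quantization; `scale` broadcasts against `x`."""
    return np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)

w = np.random.randn(4, 256).astype(np.float32)    # [output_channels, input_channels]

# Per-tensor: one scale for the whole matrix.
s_tensor = np.abs(w).max() / 127.0                               # scalar -> 1 parameter

# Per-channel: one scale per output channel (row).
s_channel = np.abs(w).max(axis=1, keepdims=True) / 127.0         # shape [4, 1] -> 4 parameters

# Per-group: one scale per block of 64 consecutive weights within each row.
group_size = 64
w_blocks = w.reshape(4, -1, group_size)                          # shape [4, 4, 64]
s_group = np.abs(w_blocks).max(axis=-1, keepdims=True) / 127.0   # shape [4, 4, 1] -> 16 parameters

q_tensor = quantize_symmetric(w, s_tensor)
q_channel = quantize_symmetric(w, s_channel)
q_group = quantize_symmetric(w_blocks, s_group).reshape(4, 256)
```

The metadata grows with the granularity: 1, 4, and 16 scale values in this toy case (plus zero-points in the affine case). For 4-bit weights with a group size of 64 and FP16 scales, the scales alone add roughly 16/64 = 0.25 bits per weight, which is the kind of overhead referred to above.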
Relevance to LLM Quantization
Revisiting these fundamentals is necessary because the sheer scale and specific characteristics of LLMs amplify the importance of these choices:
- Sensitivity: The massive parameter counts and complex interactions within transformers mean that naive per-tensor quantization often leads to unacceptable accuracy degradation. Per-channel quantization is typically the baseline for weights.
- Outliers: LLM activations, particularly after LayerNorm or within attention mechanisms, can exhibit extreme outlier values. These outliers pose a significant challenge for standard quantization calibration, often necessitating finer granularity (such as per-group for weights) or more advanced techniques (such as outlier handling during PTQ, or QAT), which we will cover shortly; the sketch after this list illustrates the effect of a single outlier.
- Hardware Support: Different hardware platforms (CPUs, GPUs, TPUs, NPUs) have varying levels of optimized support for different quantization schemes (symmetric vs. affine) and granularities. Efficient deployment requires understanding this interplay.
- Mixed Precision: The varying sensitivity of different LLM components (e.g., attention vs. feed-forward networks) suggests that applying a uniform quantization strategy across the entire model might be suboptimal. This motivates mixed-precision approaches, using different bit widths and granularities for different layers, a topic explored later in this chapter.
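To make the outlier point concrete, the following toy experiment (our own, with arbitrary numbers, not a benchmark) injects one extreme value into an otherwise small-magnitude weight vector and compares the mean round-trip error of per-tensor versus per-group symmetric INT8 quantization.

```python
import numpy as np

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=1024).astype(np.float32)
w[10] = 5.0                                   # a single extreme outlier

def roundtrip_error(x, scale):
    """Mean absolute error after symmetric INT8 quantize -> dequantize."""
    q = np.clip(np.round(x / scale), -127, 127)
    return float(np.abs(scale * q - x).mean())

# Per-tensor: the outlier alone dictates the scale for all 1024 values.
err_tensor = roundtrip_error(w, np.abs(w).max() / 127.0)

# Per-group (group size 128): only the outlier's own group pays for it.
blocks = w.reshape(-1, 128)
scales = np.abs(blocks).max(axis=1, keepdims=True) / 127.0
err_group = roundtrip_error(blocks, scales)

print(err_tensor, err_group)   # the per-group error is several times smaller here
```

Under per-tensor quantization the step size is about 5.0/127 ≈ 0.04, larger than most of the weights themselves, whereas per-group confines that coarse step to the one group containing the outlier.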
Understanding these basic choices, from affine versus symmetric mapping to the trade-offs between per-tensor, per-channel, and per-group granularity, provides the essential framework for appreciating the advanced Post-Training Quantization (PTQ), Quantization-Aware Training (QAT), and extreme quantization methods that are critical for effectively optimizing large language models. We will now build on this foundation to explore how these techniques address the unique challenges posed by LLMs.