Okay, let's map the high-precision world of floating-point numbers to the compact realm of integers. As we discussed, quantization involves converting floating-point values (r, for "real") into lower-precision integer values (q, for "quantized"). This isn't just a simple type cast; it requires a defined mapping process. The core of this mapping involves two parameters:
- Scale (s): A positive floating-point number that determines the step size between adjacent quantized integer values. It relates the magnitude of the real numbers to the integer range.
- Zero-point (z): An integer value that corresponds to the real number 0.0. This ensures that the floating-point zero is correctly represented in the quantized domain.
The fundamental relationship connecting the real value r, the quantized value q, the scale s, and the zero-point z is:
r≈s⋅(q−z)
This is the dequantization formula, showing how to approximate the original real value from the quantized integer. The quantization process, going from r to q, involves dividing by the scale, rounding to the nearest integer, adding the zero-point, and clamping the result to the allowed integer range (e.g., [−128,127] for INT8):
q=clip(round(r/s)+z,qmin,qmax)
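Before moving on, here is what these two formulas look like in code. A minimal NumPy sketch; the function names and the scale/zero-point values are made up for illustration, not taken from any particular library:

```python
import numpy as np

def quantize(r, s, z, qmin, qmax):
    # q = clip(round(r / s) + z, qmin, qmax)
    return np.clip(np.round(r / s) + z, qmin, qmax).astype(np.int32)

def dequantize(q, s, z):
    # r ≈ s · (q − z)
    return s * (q.astype(np.float32) - z)

r = np.array([-1.37, 0.0, 0.42, 2.3], dtype=np.float32)
s, z = 0.02, 10                      # illustrative scale and zero-point
q = quantize(r, s, z, -128, 127)
print(q)                             # [-58  10  31 125]
print(dequantize(q, s, z))           # ≈ [-1.36  0.    0.42  2.3 ]
```

As long as nothing is clipped, the recovered values differ from the originals by at most s/2, the rounding error.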
How we determine the scale (s) and zero-point (z) defines the quantization scheme. The two primary schemes are Symmetric and Asymmetric quantization.
Symmetric Quantization
Symmetric quantization, as the name suggests, maps the floating-point values symmetrically around zero. It assumes that the range of important values in your tensor (weights or activations) is centered near zero.
Mapping: It maps a floating-point range [−R,R] to a signed integer range [qmin,qmax], typically the restricted range [−127,127] (exactly symmetric) or the full [−128,127] for INT8. The key characteristic is that the floating-point value 0.0 maps directly to the integer value 0.
Scale and Zero-Point:
- The zero-point (z) is fixed at 0.
- The scale (s) is determined by the maximum absolute value (R) in the range you want to represent. Specifically, R=max(∣min(r)∣,∣max(r)∣) or simply R=max(∣r∣) across the tensor/channel/group. The scale is then calculated to map this maximum absolute value R to the edge of the integer range. For signed INT8 mapping to [−127,127], the scale would be s=R/127.
Formulas:
- Quantization: q=clip(round(r/s),qmin,qmax) (with z=0)
- Dequantization: r≈s⋅q
Example: Imagine quantizing values in the range [−3.8,3.2] to signed INT8 (using the range [−127,127] for symmetry).
- Determine range limit R: R=max(∣−3.8∣,∣3.2∣)=3.8.
- Calculate scale s: s=3.8/127≈0.0299.
- Set zero-point z: z=0.
- Quantize a value, e.g., r=1.5: q=round(1.5/0.0299)=round(50.167)=50.
- Quantize a value, e.g., r=−0.8: q=round(−0.8/0.0299)=round(−26.75)=−27.
- Dequantize q=50: r≈0.0299⋅50=1.495.
- Dequantize q=−27: r≈0.0299⋅(−27)=−0.8073.
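A small, self-contained sketch that reproduces this worked example (the function name is illustrative):

```python
import numpy as np

def symmetric_quantize(r, s, qmin=-127, qmax=127):
    # Symmetric scheme: z = 0, so q = clip(round(r / s), qmin, qmax).
    return np.clip(np.round(r / s), qmin, qmax).astype(np.int32)

R = 3.8                  # max(|-3.8|, |3.2|)
s = R / 127              # ≈ 0.0299

q1 = symmetric_quantize(np.float32(1.5), s)    # -> 50
q2 = symmetric_quantize(np.float32(-0.8), s)   # -> -27
print(q1, q2)
print(s * q1, s * q2)    # dequantize: ≈ 1.496 and ≈ -0.808
```

The code carries the exact scale 3.8/127 rather than the rounded 0.0299, so its dequantized values differ from the hand calculation above in the third decimal place.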
Figure: Symmetric quantization mapping. The floating-point range is centered around 0.0, which maps directly to the integer zero-point (z=0). The scale (s) ensures the maximum absolute value maps to the edge of the integer range.
Pros:
- Simplicity: The zero-point is fixed at 0, simplifying calculations.
- Efficiency: Some hardware accelerators can perform integer computations more efficiently when the zero-point is 0.
Cons:
- Potential Inefficiency: If the original floating-point distribution is skewed (not centered around zero), one side of the integer range might be underutilized or unused, effectively reducing the available precision compared to asymmetric quantization for the same number of bits.
Asymmetric Quantization
Asymmetric quantization maps the exact minimum and maximum floating-point values observed in the tensor to the minimum and maximum values of the integer range. This is particularly useful when the data distribution is not centered around zero.
Mapping: It maps a floating-point range [min(r),max(r)] to the full integer range [qmin,qmax], such as [0,255] for unsigned INT8 (UINT8) or [−128,127] for signed INT8.
Scale and Zero-Point:
- The scale (s) is calculated based on the full floating-point range: s=(max(r)−min(r))/(qmax−qmin).
- The zero-point (z) is calculated so that the floating-point value 0.0 maps correctly to its corresponding quantized value. Rearranging r=s⋅(q−z) at the bottom of the range gives min(r)=s⋅(qmin−z), hence z=round(qmin−min(r)/s) as an integer offset. Note that z might not be 0; it is the integer value that corresponds to the real number 0.0.
Formulas:
- Quantization: q=clip(round(r/s)+z,qmin,qmax)
- Dequantization: r≈s⋅(q−z)
Example: Imagine quantizing activation values known to be in the range [0.5,6.2] to unsigned INT8 (UINT8, range [0, 255]).
- Determine range: min(r)=0.5, max(r)=6.2.
- Calculate scale s: s=(6.2−0.5)/(255−0)=5.7/255≈0.02235.
- Calculate zero-point z: z=round(qmin−min(r)/s)=round(0−0.5/0.02235)=round(−22.37)=−22. Notice that z falls outside the UINT8 storage range [0,255]; this happens because the float range [0.5,6.2] does not contain 0.0. Practical frameworks handle this by slightly extending the float range to include 0.0 and adjusting s and z accordingly, so that the zero-point is always representable and the real value 0.0 maps exactly to an integer. For this illustrative example, we proceed with s≈0.02235 and z=−22.
- Quantize a value, e.g., r=1.5: q=clip(round(1.5/0.02235)+(−22),0,255)=clip(round(67.11)−22,0,255)=clip(67−22,0,255)=45.
- Quantize a value, e.g., r=5.0: q=clip(round(5.0/0.02235)−22,0,255)=clip(round(223.7)−22,0,255)=clip(224−22,0,255)=202.
- Dequantize q=45: r≈0.02235⋅(45−(−22))=0.02235⋅67=1.497.
- Dequantize q=202: r≈0.02235⋅(202−(−22))=0.02235⋅224≈5.006.
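The same example end-to-end in code (a self-contained sketch; asymmetric_params is a hypothetical helper, and the range nudging that real frameworks perform is only noted in a comment):

```python
import numpy as np

def asymmetric_params(r_min, r_max, qmin=0, qmax=255):
    # Scale from the full observed float range.
    s = (r_max - r_min) / (qmax - qmin)
    # Zero-point from r = s·(q − z) at the bottom of the range.
    z = round(qmin - r_min / s)
    # Real frameworks would nudge r_min/r_max here so that z lands
    # inside [qmin, qmax]; this sketch keeps the raw value to match
    # the worked example above.
    return s, z

def quantize(r, s, z, qmin=0, qmax=255):
    return np.clip(np.round(r / s) + z, qmin, qmax).astype(np.int32)

def dequantize(q, s, z):
    return s * (np.float32(q) - z)

s, z = asymmetric_params(0.5, 6.2)       # s ≈ 0.02235, z = -22
print(quantize(np.float32(1.5), s, z))   # -> 45
print(quantize(np.float32(5.0), s, z))   # -> 202
print(dequantize(45, s, z))              # ≈ 1.498
print(dequantize(202, s, z))             # ≈ 5.007
```

Because the code carries the exact scale 5.7/255, its dequantized values differ from the rounded hand calculation in the last decimal place.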
Figure: Asymmetric quantization mapping. The exact floating-point minimum and maximum values map to the minimum and maximum of the integer range. The scale (s) and zero-point (z) accommodate this mapping, even if the float range doesn't include 0.0.
Pros:
- Flexibility: Can accurately represent data distributions that are skewed or not centered around zero (e.g., outputs of ReLU activation functions are always non-negative).
- Precision: Maximizes the use of the available integer values, potentially leading to lower quantization error for non-symmetric distributions compared to symmetric quantization using the same number of bits.
Cons:
- Complexity: Requires calculating, storing, and using both a scale factor and a zero-point for dequantization and subsequent computations. This can introduce a small computational overhead compared to the symmetric case where z=0.
Choosing the Right Scheme
The choice between symmetric and asymmetric quantization depends on several factors:
| Feature | Symmetric Quantization | Asymmetric Quantization |
| --- | --- | --- |
| FP Range | Maps [−R,R] | Maps [min(r),max(r)] |
| Zero Mapping | 0.0→0 | 0.0→z (integer zero-point) |
| Zero-Point | z=0 (typically) | z calculated, can be non-zero |
| Parameters | Scale (s) | Scale (s) and Zero-Point (z) |
| Best For | Distributions centered around zero (e.g., weights) | Skewed distributions (e.g., activations after ReLU) |
| Computation | Potentially simpler (if z=0) | Requires handling non-zero z |
- Weights: In many neural networks, weight distributions tend to be roughly symmetric around zero. Therefore, symmetric quantization is often a good default choice for weights, offering simplicity and potential hardware advantages.
- Activations: Activation values, especially after functions like ReLU (f(x)=max(0,x)), often have highly asymmetric distributions (e.g., always non-negative). Asymmetric quantization is usually preferred for activations as it can better utilize the integer range to represent these skewed values, minimizing accuracy loss.
However, these are general guidelines. The optimal choice might vary depending on the specific layer type, model architecture, target hardware capabilities (some hardware might be optimized for one scheme over the other), and empirical evaluation of the accuracy/performance trade-off. Modern quantization libraries and frameworks often allow you to configure these schemes independently for weights and activations.
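To see these guidelines in action, here is a hedged sketch (helper names invented for illustration) that compares the round-trip error of the two schemes on skewed, post-ReLU-style data:

```python
import numpy as np

def symmetric_roundtrip(r, qmax=127):
    # z = 0; scale from the max absolute value.
    s = np.max(np.abs(r)) / qmax
    q = np.clip(np.round(r / s), -qmax, qmax)
    return s * q

def asymmetric_roundtrip(r, qmin=0, qmax=255):
    # Scale and zero-point from the full observed range.
    s = (r.max() - r.min()) / (qmax - qmin)
    z = round(qmin - r.min() / s)
    q = np.clip(np.round(r / s) + z, qmin, qmax)
    return s * (q - z)

# Post-ReLU activations: non-negative, hence skewed.
rng = np.random.default_rng(0)
acts = np.maximum(rng.normal(size=100_000), 0.0).astype(np.float32)

for name, roundtrip in [("symmetric", symmetric_roundtrip),
                        ("asymmetric", asymmetric_roundtrip)]:
    mse = np.mean((acts - roundtrip(acts)) ** 2)
    print(f"{name:>10} MSE: {mse:.2e}")
```

On data like this, the asymmetric scheme typically reports a noticeably lower mean squared error, because symmetric INT8 leaves the entire negative half of its code space unused.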
Understanding these schemes is foundational because the choice directly impacts how accurately the low-precision integers represent the original floating-point values, which in turn affects the overall performance and accuracy of the quantized LLM. Next, we'll look at another dimension of quantization: granularity, which defines how broadly these scale and zero-point parameters are applied.