This chapter establishes the foundation for understanding advanced quantization methods as applied to Large Language Models (LLMs). We begin by reviewing core quantization principles, adapting concepts such as symmetric versus asymmetric quantization (q = round(x/s) versus q = round(x/s) + z, where s is the scale and z is the zero-point) and per-tensor versus per-channel scaling to the characteristics of LLMs.
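As a concrete reference for these formulas, the short sketch below implements symmetric, asymmetric, and per-channel symmetric quantization in PyTorch. The function and variable names are illustrative choices for this chapter, not part of any particular library.

```python
import torch

def quantize_symmetric(x: torch.Tensor, bits: int = 8):
    """Symmetric quantization: q = round(x / s), zero-point fixed at 0."""
    qmax = 2 ** (bits - 1) - 1                      # e.g. 127 for INT8
    s = x.abs().max() / qmax                        # one per-tensor scale
    q = torch.clamp(torch.round(x / s), -qmax - 1, qmax)
    return q.to(torch.int8), s

def quantize_asymmetric(x: torch.Tensor, bits: int = 8):
    """Asymmetric quantization: q = round(x / s) + z, with zero-point z."""
    qmin, qmax = 0, 2 ** bits - 1                   # e.g. [0, 255]
    s = (x.max() - x.min()) / (qmax - qmin)         # per-tensor scale
    z = qmin - torch.round(x.min() / s)             # zero-point
    q = torch.clamp(torch.round(x / s) + z, qmin, qmax)
    return q.to(torch.uint8), s, z

def quantize_symmetric_per_channel(w: torch.Tensor, bits: int = 8):
    """Per-channel variant: one scale per output row of a weight matrix."""
    qmax = 2 ** (bits - 1) - 1
    s = w.abs().amax(dim=1, keepdim=True) / qmax    # shape: (out_features, 1)
    q = torch.clamp(torch.round(w / s), -qmax - 1, qmax)
    return q.to(torch.int8), s

# Dequantization reverses the mapping: x_hat = s * q (symmetric)
# or x_hat = s * (q - z) (asymmetric).
w = torch.randn(4, 8)
q_sym, s_sym = quantize_symmetric(w)
q_pc, s_pc = quantize_symmetric_per_channel(w)
print((w - s_sym * q_sym.float()).abs().max())      # per-tensor error
print((w - s_pc * q_pc.float()).abs().max())        # typically smaller per-channel
```

Per-channel scaling usually reduces reconstruction error because each row of the weight matrix gets its own dynamic range, a point that matters for the outlier-heavy weight distributions found in LLMs.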
You will learn about techniques operating below 8-bit precision, such as INT4 and specialized formats like NF4 and FP4, and examine their mathematical properties and their implications for model accuracy and performance. We then cover key post-training quantization (PTQ) algorithms developed for LLMs, including GPTQ and AWQ, and explain how they preserve model fidelity without retraining, relying only on a small calibration set. We also discuss considerations for quantization-aware training (QAT) in the context of large models.
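To preview the idea behind non-uniform low-bit formats such as NF4, the sketch below maps normalized values to the nearest entry of a 16-level codebook. The codebook here is an evenly spaced placeholder for simplicity; the actual NF4 table places its levels at quantiles of a standard normal distribution, and real implementations share one absmax scale per small block of weights rather than per tensor.

```python
import torch

def quantize_to_codebook(x: torch.Tensor, codebook: torch.Tensor):
    """Map each normalized value to the index of its nearest codebook level."""
    absmax = x.abs().max()
    x_norm = x / absmax                              # scale into [-1, 1]
    # Distance from every value to every codebook entry, then nearest index.
    idx = torch.argmin((x_norm.unsqueeze(-1) - codebook).abs(), dim=-1)
    return idx.to(torch.uint8), absmax               # 4-bit index plus one scale

def dequantize_from_codebook(idx, absmax, codebook):
    return codebook[idx.long()] * absmax

# Placeholder 16-entry codebook (uniform grid over [-1, 1]), for illustration only.
codebook = torch.linspace(-1.0, 1.0, 16)

w = torch.randn(256)
idx, absmax = quantize_to_codebook(w, codebook)
w_hat = dequantize_from_codebook(idx, absmax, codebook)
print((w - w_hat).abs().mean())                      # reconstruction error
```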
The chapter also covers mixed-precision quantization strategies and the selection of appropriate calibration data for PTQ methods, and it concludes with a hands-on exercise applying GPTQ to a sample LLM. By the end, you will have a firm grasp of the theoretical underpinnings and the common algorithms used in modern LLM quantization.
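For a taste of the workflow the hands-on section walks through, the sketch below uses the Hugging Face transformers GPTQ integration (which depends on the optimum and auto-gptq packages). Treat the model name, dataset choice, and exact arguments as assumptions to verify against your installed library versions.

```python
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"                      # small model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

# Calibration data can be a named dataset (e.g. "c4") or a list of raw text
# strings; a few hundred representative samples are typical.
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# Quantization runs during loading; a GPU is expected for the calibration pass.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    quantization_config=gptq_config,
)

model.save_pretrained("opt-125m-gptq-4bit")
tokenizer.save_pretrained("opt-125m-gptq-4bit")
```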
1.1 Revisiting Quantization Principles for Large Models
1.2 Low-Bit Quantization Techniques (Below INT8)
1.3 Understanding Quantization Data Types and Formats
1.4 Post-Training Quantization (PTQ) Algorithms for LLMs
1.5 Quantization-Aware Training (QAT) Considerations
1.6 Mixed-Precision Quantization Strategies
1.7 Calibration Data Selection and Preparation
1.8 Hands-on Practical: Applying GPTQ to an LLM