Quantization for Deep Learning Models, PyTorch Contributors, 2023 - Official documentation explaining PyTorch's quantization support, including affine and symmetric schemes, as well as per-tensor and per-channel granularities, providing practical implementation details.
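The affine and symmetric schemes mentioned above can be sketched in a few lines. This is an illustrative plain-Python version of per-tensor quantization, not PyTorch's actual implementation; function names and the toy inputs are my own.

```python
# Per-tensor affine vs. symmetric int8 quantization (illustrative sketch).

def affine_quantize(xs, num_bits=8):
    """Affine (asymmetric): maps [min, max] onto [0, 2^b - 1] with a zero point."""
    qmin, qmax = 0, 2 ** num_bits - 1
    lo, hi = min(xs), max(xs)
    scale = (hi - lo) / (qmax - qmin) or 1.0  # guard against constant tensors
    zero_point = round(qmin - lo / scale)
    q = [max(qmin, min(qmax, round(x / scale) + zero_point)) for x in xs]
    return q, scale, zero_point

def symmetric_quantize(xs, num_bits=8):
    """Symmetric: zero point fixed at 0, range [-(2^(b-1) - 1), 2^(b-1) - 1]."""
    qmax = 2 ** (num_bits - 1) - 1
    scale = max(abs(x) for x in xs) / qmax or 1.0
    q = [max(-qmax, min(qmax, round(x / scale))) for x in xs]
    return q, scale

def dequantize(q, scale, zero_point=0):
    """Map integer codes back to approximate float values."""
    return [(qi - zero_point) * scale for qi in q]
```

Per-channel quantization applies the same idea with a separate `scale` (and zero point) per output channel instead of one per tensor, which reduces error when channel ranges differ widely.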
LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale, Tim Dettmers, Mike Lewis, Younes Belkada, and Luke Zettlemoyer, NeurIPS 2022, DOI: 10.48550/arXiv.2208.07339 - Introduces 8-bit matrix multiplication for large language models, addressing the challenge of outliers in transformer activations through a mixed-precision approach, highlighting its importance for LLM compression.
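The mixed-precision idea in LLM.int8() can be sketched as follows: activation columns whose magnitude exceeds a threshold (6.0 in the paper) are multiplied in float precision, while the remaining dimensions go through symmetric int8 quantization with integer accumulation. This plain-Python sketch is illustrative only; the helper names and toy matrices are my own, and the real method uses fp16 kernels and vector-wise scaling on GPU.

```python
# Sketch of LLM.int8()-style mixed-precision matmul (illustrative only).

def quantize_sym(xs, qmax=127):
    """Symmetric int8 quantization of a vector; returns codes and scale."""
    if not xs:
        return [], 1.0
    scale = max(abs(x) for x in xs) / qmax or 1.0
    return [round(x / scale) for x in xs], scale

def mixed_precision_matmul(X, W, threshold=6.0):
    """Multiply X (m x n) by W (n x p), splitting outlier dims from int8 dims."""
    n = len(X[0])
    out_idx = [j for j in range(n) if max(abs(r[j]) for r in X) >= threshold]
    reg_idx = [j for j in range(n) if j not in out_idx]
    m, p = len(X), len(W[0])
    result = [[0.0] * p for _ in range(m)]
    # Float path: outlier columns of X (rows of W) stay in full precision.
    for i in range(m):
        for k in range(p):
            result[i][k] += sum(X[i][j] * W[j][k] for j in out_idx)
    # Int8 path: quantize X row-wise and W column-wise over regular dims,
    # accumulate in integers, then rescale back to float.
    for i in range(m):
        xq, sx = quantize_sym([X[i][j] for j in reg_idx])
        for k in range(p):
            wq, sw = quantize_sym([W[j][k] for j in reg_idx])
            acc = sum(a * b for a, b in zip(xq, wq))  # integer accumulation
            result[i][k] += acc * sx * sw
    return result
```

Because the handful of outlier dimensions dominate quantization error, routing only them through the float path preserves accuracy while the bulk of the multiply stays in int8.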