QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, arXiv preprint arXiv:2305.14314, 2023. DOI: 10.48550/arXiv.2305.14314 - Introduces QLoRA and the 4-bit NormalFloat (NF4) data type, a significant method for 4-bit quantization in large language models, enabling efficient fine-tuning on consumer hardware.
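The core idea behind NF4 is information-theoretically motivated: quantization levels are placed at quantiles of a standard normal distribution, matching the empirical distribution of neural network weights. The sketch below is a simplified illustration of quantile-based 4-bit block quantization in that spirit; it is not the paper's exact level construction (real NF4 uses an asymmetric scheme that includes an exact zero), and the block size and helper names are illustrative only.

```python
from statistics import NormalDist

# Simplified, quantile-based 4-bit levels in the spirit of NF4
# (Dettmers et al., 2023). We take 16 quantile midpoints of N(0, 1),
# avoiding the infinite tails, and normalize them into [-1, 1].
# The actual NF4 construction differs in detail.
nd = NormalDist()
levels = [nd.inv_cdf((i + 0.5) / 16) for i in range(16)]
max_abs = max(abs(v) for v in levels)
levels = [v / max_abs for v in levels]

def quantize_block(weights):
    """Absmax-scale a weight block into [-1, 1], then map each value
    to the nearest of the 16 levels, keeping a 4-bit index per weight
    plus one float scale per block."""
    scale = max(abs(w) for w in weights) or 1.0
    idx = [min(range(16), key=lambda k: abs(w / scale - levels[k]))
           for w in weights]
    return scale, idx

def dequantize_block(scale, idx):
    """Reconstruct approximate weights from indices and the block scale."""
    return [scale * levels[k] for k in idx]

w = [0.4, -1.2, 0.05, 0.9]
scale, idx = quantize_block(w)
approx = dequantize_block(scale, idx)
```

Storing one 4-bit index per weight plus a per-block scale is what brings memory down to roughly 4 bits per parameter; QLoRA additionally quantizes the scales themselves ("double quantization"), which is omitted here.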
Quantization for Model Optimization, PyTorch Documentation, 2019 - Official PyTorch documentation providing practical guidance and APIs for implementing post-training quantization and quantization-aware training.
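Of the workflows covered in the PyTorch documentation, post-training dynamic quantization is the simplest entry point: weights are converted to int8 ahead of time, while activations are quantized on the fly at inference. A minimal sketch, assuming a small hypothetical model (the layer sizes here are arbitrary):

```python
import torch
import torch.nn as nn

# Hypothetical float model for illustration.
model = nn.Sequential(
    nn.Linear(128, 64),
    nn.ReLU(),
    nn.Linear(64, 10),
)

# Post-training dynamic quantization: Linear weights become int8,
# activations are quantized dynamically at inference time (CPU only).
quantized = torch.ao.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

x = torch.randn(1, 128)
out = quantized(x)  # output is a regular float tensor of shape (1, 10)
```

Static quantization and quantization-aware training, also described in the documentation, add calibration or training steps but follow a similar prepare/convert pattern.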