Mixed Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, 2018. International Conference on Learning Representations (ICLR) 2018. DOI: 10.48550/arXiv.1710.03740 - Foundational paper introducing mixed-precision training techniques, including loss scaling, to significantly reduce memory usage and speed up deep learning model training.
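To make the loss-scaling idea concrete, here is a minimal, illustrative sketch using PyTorch's automatic mixed precision utilities (`torch.cuda.amp`); the toy model, data, and hyperparameters are placeholders and not taken from the paper.

```python
import torch
from torch import nn

# Toy model and data; sizes and hyperparameters are arbitrary placeholders.
device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
loss_fn = nn.CrossEntropyLoss()

# GradScaler implements dynamic loss scaling: the loss is multiplied by a scale
# factor so small FP16 gradients do not underflow, then gradients are unscaled
# before the optimizer step. Master weights stay in FP32.
scaler = torch.cuda.amp.GradScaler(enabled=(device == "cuda"))

for _ in range(3):
    x = torch.randn(32, 512, device=device)
    y = torch.randint(0, 10, (32,), device=device)

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast(enabled=(device == "cuda")):
        loss = loss_fn(model(x), y)   # eligible ops run in FP16

    scaler.scale(loss).backward()     # backprop through the scaled loss
    scaler.step(optimizer)            # unscale; skip the step if inf/NaN gradients appear
    scaler.update()                   # adapt the scale factor for the next iteration
```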
Training Deep Nets with Sublinear Memory Cost, Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin, 2016. arXiv preprint arXiv:1604.06174. DOI: 10.48550/arXiv.1604.06174 - Original research paper proposing gradient checkpointing (also known as activation checkpointing), which trades recomputation for memory and enables the training of deeper neural networks.
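As a rough illustration of the compute-for-memory trade-off, the sketch below uses PyTorch's `torch.utils.checkpoint`; the block count and layer sizes are arbitrary placeholders.

```python
import torch
from torch import nn
from torch.utils.checkpoint import checkpoint

# A deep stack of identical blocks; depth and widths are arbitrary placeholders.
blocks = nn.ModuleList(
    [nn.Sequential(nn.Linear(256, 256), nn.ReLU()) for _ in range(16)]
)
x = torch.randn(8, 256, requires_grad=True)

# Running each block under checkpoint() discards its intermediate activations
# after the forward pass and recomputes them during backward, trading extra
# compute for a much smaller activation-memory footprint.
out = x
for block in blocks:
    out = checkpoint(block, out, use_reentrant=False)

out.sum().backward()
```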
QLoRA: Efficient Finetuning of Quantized LLMs, Tim Dettmers, Artidoro Pagnoni, Ari Holtzman, Luke Zettlemoyer, 2023. arXiv preprint arXiv:2305.14314. DOI: 10.48550/arXiv.2305.14314 - Introduces QLoRA, a method for fine-tuning 4-bit quantized large language models with LoRA adapters, relying on 4-bit NormalFloat quantization, double quantization, and paged optimizers for memory savings.
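A heavily condensed sketch of this style of setup using the `transformers`, `peft`, and `bitsandbytes` libraries is shown below; the checkpoint name and LoRA hyperparameters are placeholders, and a CUDA GPU is assumed.

```python
import torch
import bitsandbytes as bnb
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model, prepare_model_for_kbit_training

# 4-bit NF4 quantization with double quantization for the frozen base model.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
)

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",   # placeholder checkpoint
    quantization_config=bnb_config,
    device_map="auto",
)
model = prepare_model_for_kbit_training(model)

# Small trainable LoRA adapters are attached on top of the frozen 4-bit weights.
lora_config = LoraConfig(r=16, lora_alpha=32, lora_dropout=0.05, task_type="CAUSAL_LM")
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()

# Paged optimizer (bitsandbytes) to absorb optimizer-state memory spikes.
optimizer = bnb.optim.PagedAdamW8bit(model.parameters(), lr=2e-4)
```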
DeepSpeed: Large Scale Distributed Training of DL Models with System Optimizations, Saurabh Agarwal, Shuai Che, Michael Gschwind, Hanwen Chang, Minjia Zhang, Reza Yazdani, Jeff Rasley, Elton Zheng, Minmin Gong, Xinggang Wang, Hao Liu, Bo Li, Yuxiong He, 2021. Proceedings of the VLDB Endowment, Vol. 15 (VLDB Endowment). DOI: 10.14778/3554821.3554867 - Presents DeepSpeed, a deep learning optimization library that enables efficient large-scale distributed training through various system optimizations, including memory management.
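The sketch below shows one way to hand a model to DeepSpeed with a minimal config (ZeRO stage 2 plus FP16); the values are illustrative placeholders, and a script like this is normally launched on GPUs via the `deepspeed` launcher.

```python
import torch
from torch import nn
import deepspeed

ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},            # mixed-precision training
    "zero_optimization": {"stage": 2},    # partition optimizer state and gradients
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-4}},
}

model = nn.Linear(512, 10)  # toy placeholder model

# deepspeed.initialize wraps the model in an engine that owns the optimizer,
# loss scaling, ZeRO partitioning, and gradient accumulation.
engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)

# With fp16 enabled the engine holds FP16 parameters, so inputs are cast to match.
x = torch.randn(32, 512, dtype=torch.half, device=engine.device)
y = torch.randint(0, 10, (32,), device=engine.device)
loss = nn.functional.cross_entropy(engine(x), y)
engine.backward(loss)   # scaled backward pass
engine.step()           # optimizer step plus ZeRO bookkeeping
```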
Hugging Face Accelerate Documentation, Hugging Face, 2024 (Hugging Face) - Official documentation for Hugging Face Accelerate, a library that simplifies mixed-precision training, gradient accumulation, and distributed training setups.
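As a small, hedged example of the patterns this documentation covers, the sketch below combines mixed precision and gradient accumulation with Accelerate; the model, data, and step counts are placeholders.

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset
from accelerate import Accelerator

# fp16 mixed precision is requested only when a GPU is available.
accelerator = Accelerator(
    mixed_precision="fp16" if torch.cuda.is_available() else "no",
    gradient_accumulation_steps=4,
)

model = nn.Linear(512, 10)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
dataset = TensorDataset(torch.randn(256, 512), torch.randint(0, 10, (256,)))
dataloader = DataLoader(dataset, batch_size=16)

# prepare() moves everything to the right device(s) and wraps the objects for
# mixed precision and, when launched with `accelerate launch`, distributed training.
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for x, y in dataloader:
    # accumulate() skips gradient sync and delays the effective update until
    # `gradient_accumulation_steps` micro-batches have been processed.
    with accelerator.accumulate(model):
        loss = nn.functional.cross_entropy(model(x), y)
        accelerator.backward(loss)  # handles loss scaling under fp16
        optimizer.step()
        optimizer.zero_grad()
```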