Training Deep Nets with Sublinear Memory Cost, Tianqi Chen, Bing Xu, Chiyuan Zhang, Carlos Guestrin, 2016, arXiv preprint arXiv:1604.06174 (arXiv), DOI: 10.48550/arXiv.1604.06174 - A foundational paper that introduces gradient checkpointing, a technique for reducing memory consumption during deep neural network training by recomputing intermediate activations during the backward pass instead of storing them.
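A minimal sketch of the checkpointing idea using PyTorch's torch.utils.checkpoint utilities (an assumed setup for illustration, not the paper's original implementation): activations inside checkpointed segments are discarded after the forward pass and recomputed during backward, so only segment-boundary activations persist.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint_sequential

# A stack of 8 identical blocks; the model and shapes are placeholders.
model = nn.Sequential(
    *[nn.Sequential(nn.Linear(1024, 1024), nn.ReLU()) for _ in range(8)]
)
x = torch.randn(32, 1024, requires_grad=True)

# Split the 8 blocks into 2 segments: only the activations at segment boundaries
# are kept, and the rest are recomputed during backward, trading extra forward
# compute for a much smaller activation memory footprint.
out = checkpoint_sequential(model, 2, x, use_reentrant=False)
out.sum().backward()
```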
Mixed-Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, 2018, ICLR 2018, DOI: 10.48550/arXiv.1710.03740 - The seminal paper that details the methodology of mixed-precision training, including the use of 16-bit floating-point numbers and loss scaling to improve training speed and reduce memory usage.
Automatic Mixed Precision package - torch.cuda.amp, PyTorch Contributors, 2024 (PyTorch) - Official PyTorch documentation providing practical guidance and examples for implementing mixed-precision training using the torch.cuda.amp package, including details on GradScaler for loss scaling.
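A minimal sketch of the recipe described in these two entries, using torch.cuda.amp as documented by PyTorch: the forward and backward passes run under autocast in reduced precision, and GradScaler applies the loss scaling from the paper so small FP16 gradients do not underflow. The model, data, and hyperparameters below are placeholders.

```python
import torch
import torch.nn as nn

device = "cuda"
model = nn.Linear(512, 10).to(device)
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2)
scaler = torch.cuda.amp.GradScaler()

for _ in range(10):
    x = torch.randn(64, 512, device=device)
    y = torch.randint(0, 10, (64,), device=device)

    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():      # ops run in FP16/FP32 as appropriate
        loss = nn.functional.cross_entropy(model(x), y)

    scaler.scale(loss).backward()        # backward on the scaled loss
    scaler.step(optimizer)               # unscales grads; skips the step on inf/NaN
    scaler.update()                      # adjusts the scale factor dynamically
```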
Adafactor: Adaptive Learning Rates with Sublinear Memory Cost, Noam Shazeer, Mitchell Stern, 2018, Proceedings of the 35th International Conference on Machine Learning (ICML), Vol. 80 (PMLR), DOI: 10.5555/3295304.3295415 - Introduces Adafactor, an adaptive learning rate optimizer designed to significantly reduce memory consumption for optimizer states, making it suitable for training very large models.
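A toy illustration (not a full optimizer) of the paper's central memory-saving device: for an n-by-m weight matrix, Adafactor keeps only per-row and per-column running sums of squared gradients, O(n + m) state instead of Adam's full O(nm) second-moment matrix, and reconstructs the preconditioner from their outer product.

```python
import torch

n, m = 1024, 4096
beta2, eps = 0.999, 1e-30

grad = torch.randn(n, m)      # gradient for an n x m weight matrix
row = torch.zeros(n)          # running row sums of squared gradients, O(n) state
col = torch.zeros(m)          # running column sums, O(m) state

# Exponential moving averages of row/column sums of grad**2.
sq = grad.pow(2) + eps
row = beta2 * row + (1 - beta2) * sq.sum(dim=1)
col = beta2 * col + (1 - beta2) * sq.sum(dim=0)

# Reconstruct the full second-moment estimate as a rank-1 outer product
# (only `row` and `col` persist between steps) and form the preconditioned update.
v_hat = torch.outer(row, col) / row.sum()
update = grad / v_hat.sqrt()
```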
8-bit Optimizers via Block-wise Quantization, Tim Dettmers, Mike Lewis, Sam Shleifer, Luke Zettlemoyer, 2021, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.2110.02861 - Presents a method for quantizing optimizer states to 8-bit precision, which substantially decreases the memory footprint of optimizers like Adam while preserving their performance characteristics.
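A minimal sketch using the authors' bitsandbytes library (assumed to be installed and running on a CUDA device): swapping torch.optim.Adam for the 8-bit variant keeps the moment estimates in block-wise quantized 8-bit buffers rather than FP32, while the training loop is otherwise unchanged.

```python
import torch
import torch.nn as nn
import bitsandbytes as bnb

model = nn.Linear(4096, 4096).cuda()
# Drop-in replacement for torch.optim.Adam with 8-bit optimizer states.
optimizer = bnb.optim.Adam8bit(model.parameters(), lr=1e-4, betas=(0.9, 0.999))

x = torch.randn(8, 4096, device="cuda")
loss = model(x).pow(2).mean()
loss.backward()
optimizer.step()              # moment estimates are stored and updated in 8-bit blocks
optimizer.zero_grad(set_to_none=True)
```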