LoRA: Low-Rank Adaptation of Large Language Models, Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, Weizhu Chen, 2021. arXiv preprint arXiv:2106.09685. DOI: 10.48550/arXiv.2106.09685 - Presents Low-Rank Adaptation (LoRA), a parameter-efficient fine-tuning technique that freezes the pretrained weights and trains only small low-rank update matrices, significantly reducing memory and computational requirements, especially for PPO with LLMs.
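The core mechanism is a frozen pretrained weight plus a trainable low-rank update, W + (alpha/r)·BA. Below is a minimal PyTorch sketch of that idea; the class name LoRALinear and the default r and alpha values are illustrative choices, not the paper's reference implementation.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Illustrative LoRA wrapper around a frozen nn.Linear (not the official loralib code)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pretrained weight stays frozen
        # Low-rank factors: A is small Gaussian, B starts at zero so the
        # adapted model initially matches the base model.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scaling = alpha / r

    def forward(self, x):
        # W x + (alpha/r) * B A x; only lora_A and lora_B receive gradients
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scaling
```

In practice such a wrapper is applied to selected projection layers (e.g. the attention query and value projections), so only the A and B matrices carry gradients and optimizer state.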
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020. SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis. DOI: 10.48550/arXiv.1910.02054 - Introduces ZeRO (Zero Redundancy Optimizer), a set of memory optimizations that partition optimizer states, gradients, and parameters across data-parallel workers, enabling distributed training of models with billions to trillions of parameters.
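As a rough sketch of how ZeRO is commonly enabled in practice, the snippet below assumes the DeepSpeed library's deepspeed.initialize API and its zero_optimization config section; the toy model and batch size are placeholders, and the script would be started with DeepSpeed's distributed launcher rather than run standalone.

```python
import torch
import deepspeed

model = torch.nn.Linear(1024, 1024)  # placeholder model

# ZeRO stage 1 partitions optimizer states across data-parallel workers,
# stage 2 additionally partitions gradients, stage 3 also partitions parameters.
ds_config = {
    "train_batch_size": 32,
    "fp16": {"enabled": True},
    "zero_optimization": {"stage": 2},
}

engine, optimizer, _, _ = deepspeed.initialize(
    model=model,
    model_parameters=model.parameters(),
    config=ds_config,
)
```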
Mixed Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, 2018. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1710.03740 - Describes mixed-precision training, which performs most computation in FP16 while keeping an FP32 master copy of the weights and using loss scaling, reducing memory footprint and speeding up training on compatible hardware.
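A minimal sketch of that recipe (FP16 compute, FP32 master weights, loss scaling) using PyTorch's automatic mixed precision utilities; the toy model and training loop are illustrative and assume a CUDA device.

```python
import torch

model = torch.nn.Linear(1024, 1024).cuda()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()  # loss scaling to keep small FP16 gradients from underflowing

for _ in range(10):
    x = torch.randn(32, 1024, device="cuda")
    optimizer.zero_grad()
    with torch.cuda.amp.autocast():   # forward pass runs in reduced precision
        loss = model(x).pow(2).mean()
    scaler.scale(loss).backward()     # backward on the scaled loss
    scaler.step(optimizer)            # gradients are unscaled before the FP32 weight update
    scaler.update()
```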