Scaling laws for neural language models, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, Dario Amodei, 2020, arXiv preprint arXiv:2001.08361, DOI: 10.48550/arXiv.2001.08361 - This foundational work introduces empirical scaling laws relating model size, dataset size, and training compute to model performance.
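For orientation, the single-variable power laws reported in that paper take roughly the form below; the exponents and constants are the approximate values quoted in the paper (they depend on tokenizer and architectural details) and should be read as illustrative rather than exact.

```latex
% Approximate single-variable scaling laws from Kaplan et al. (2020).
% N: non-embedding parameters, D: dataset size in tokens, C_min: compute (PF-days).
L(N) \approx \left(\tfrac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076,\; N_c \approx 8.8\times 10^{13}
L(D) \approx \left(\tfrac{D_c}{D}\right)^{\alpha_D}, \qquad \alpha_D \approx 0.095,\; D_c \approx 5.4\times 10^{13}
L(C_{\min}) \approx \left(\tfrac{C_c}{C_{\min}}\right)^{\alpha_C^{\min}}, \qquad \alpha_C^{\min} \approx 0.050,\; C_c \approx 3.1\times 10^{8}
```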
Training compute-optimal large language models, Jordan Hoffmann, Sebastian Borgeaud, Arthur Mensch, Elena Buchatskaya, Trevor Cai, Eliza Rutherford, Diego de Las Casas, Lisa Anne Hendricks, Johannes Welbl, Aidan Clark, Tom Hennigan, Eric Noland, Katie Millican, George van den Driessche, Bogdan Damoc, Aurelia Guy, Simon Osindero, Karen Simonyan, Erich Elsen, Jack W. Rae, Oriol Vinyals, Laurent Sifre, 2022, arXiv preprint arXiv:2203.15556, DOI: 10.48550/arXiv.2203.15556 - This paper refines those scaling laws, demonstrating how to optimally allocate a fixed training compute budget between model size and dataset size for large language models.
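A minimal sketch of the resulting allocation rule, assuming the common C ≈ 6·N·D approximation for training FLOPs and the roughly 20-tokens-per-parameter ratio often distilled from the paper's results; both are simplifications of the paper's fitted laws, not the exact result.

```python
import math

def chinchilla_allocation(compute_flops: float, tokens_per_param: float = 20.0):
    """Split a training compute budget between model size and data size.

    Assumes C ~= 6 * N * D (training FLOPs per token ~= 6N) and a fixed
    tokens-per-parameter ratio; the ~20:1 ratio is a popular rule of thumb
    distilled from the Chinchilla results, not the paper's exact fitted law.
    """
    # C = 6 * N * D with D = r * N  =>  N = sqrt(C / (6 * r))
    n_params = math.sqrt(compute_flops / (6.0 * tokens_per_param))
    n_tokens = tokens_per_param * n_params
    return n_params, n_tokens

if __name__ == "__main__":
    # ~5.8e23 FLOPs is roughly the budget of the 70B-parameter Chinchilla run;
    # the heuristic recovers approximately 70B parameters and 1.4T tokens.
    n, d = chinchilla_allocation(5.8e23)
    print(f"~{n/1e9:.0f}B parameters trained on ~{d/1e12:.1f}T tokens")
```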
Attention Is All You Need, Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin, 2017, arXiv preprint arXiv:1706.03762, DOI: 10.48550/arXiv.1706.03762 - This paper introduces the Transformer architecture, the basis of most large language models, and describes the computational and memory characteristics of its attention mechanism.
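Since the annotation points at the attention mechanism's compute and memory behavior, a minimal NumPy sketch of scaled dot-product attention, Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V, may help; the (n, n) score matrix it materializes is the source of the quadratic cost in sequence length.

```python
import numpy as np

def scaled_dot_product_attention(Q, K, V):
    """Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V  (Vaswani et al., 2017).

    Q, K: (n, d_k); V: (n, d_v). The (n, n) score matrix makes self-attention
    O(n^2) in both time and memory over sequence length n.
    """
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)                  # (n, n)
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # row-wise softmax
    return weights @ V                               # (n, d_v)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    n, d_k, d_v = 8, 64, 64
    Q = rng.normal(size=(n, d_k))
    K = rng.normal(size=(n, d_k))
    V = rng.normal(size=(n, d_v))
    print(scaled_dot_product_attention(Q, K, V).shape)  # (8, 64)
```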
ZeRO: Memory Optimizations Toward Training Trillion Parameter Models, Samyam Rajbhandari, Jeff Rasley, Olatunji Ruwase, Yuxiong He, 2020, SC '20: Proceedings of the International Conference for High Performance Computing, Networking, Storage and Analysis; preprint arXiv:1910.02054, DOI: 10.48550/arXiv.1910.02054 - This paper presents ZeRO, a set of memory optimizations that significantly reduces the memory required for large-model training by partitioning model states, most notably optimizer states, across data-parallel devices.
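To make the memory argument concrete, here is a small calculator in the spirit of the paper's per-parameter accounting for mixed-precision Adam (2 bytes of fp16 weights, 2 bytes of fp16 gradients, and 12 bytes of optimizer state per parameter). The stage breakdown paraphrases the paper's progressive partitioning of optimizer states, then gradients, then parameters, and deliberately ignores activations and fragmentation.

```python
def zero_memory_per_gpu(num_params: float, num_gpus: int, stage: int = 1) -> float:
    """Approximate model-state memory per GPU (bytes) for mixed-precision Adam.

    Per-parameter costs, following the ZeRO paper's accounting: 2 bytes fp16
    weights, 2 bytes fp16 gradients, K = 12 bytes optimizer state (fp32 master
    weights, momentum, variance). Stage 1 partitions optimizer states, stage 2
    also partitions gradients, stage 3 also partitions the parameters.
    Activation memory and fragmentation are not included.
    """
    K = 12.0
    if stage == 0:   # plain data parallelism: everything replicated
        return (2 + 2 + K) * num_params
    if stage == 1:   # optimizer states partitioned
        return (2 + 2) * num_params + K * num_params / num_gpus
    if stage == 2:   # + gradients partitioned
        return 2 * num_params + (2 + K) * num_params / num_gpus
    if stage == 3:   # + parameters partitioned
        return (2 + 2 + K) * num_params / num_gpus
    raise ValueError("stage must be 0, 1, 2, or 3")

if __name__ == "__main__":
    params, gpus = 7.5e9, 64   # a 7.5B-parameter model on 64 data-parallel GPUs
    for s in range(4):
        gib = zero_memory_per_gpu(params, gpus, s) / 2**30
        print(f"stage {s}: {gib:.1f} GiB of model state per GPU")
```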
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism, Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, Bryan Catanzaro, 2019, arXiv preprint arXiv:1909.08053, DOI: 10.48550/arXiv.1909.08053 - This paper describes Megatron-LM, an early system for training multi-billion-parameter language models that combines intra-layer (tensor) model parallelism with data parallelism to work around per-device compute and memory limits.
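As a single-process illustration of the tensor-parallel MLP split described in that paper, the sketch below splits the first weight matrix by columns and the second by rows, then sums the per-rank partial outputs where a real multi-GPU implementation would perform an all-reduce. This is a simulation of the idea under those assumptions, not Megatron-LM's actual code.

```python
import numpy as np

def gelu(x):
    # tanh approximation of GELU, as used in many Transformer implementations
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def tensor_parallel_mlp(x, A, B, world_size):
    """Simulate Megatron-style tensor parallelism for an MLP block Y = GeLU(x A) B.

    A (d_model, d_ff) is split column-wise and B (d_ff, d_model) row-wise across
    `world_size` ranks; each rank computes a partial output and the partials are
    summed, standing in for the all-reduce of a real distributed run.
    """
    A_shards = np.split(A, world_size, axis=1)   # column-parallel first GEMM
    B_shards = np.split(B, world_size, axis=0)   # row-parallel second GEMM
    partials = [gelu(x @ A_i) @ B_i for A_i, B_i in zip(A_shards, B_shards)]
    return sum(partials)                          # "all-reduce" over ranks

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff, world_size = 16, 64, 4
    x = rng.normal(size=(2, d_model))
    A = rng.normal(size=(d_model, d_ff))
    B = rng.normal(size=(d_ff, d_model))
    y_parallel = tensor_parallel_mlp(x, A, B, world_size)
    y_serial = gelu(x @ A) @ B                   # unsharded reference
    print(np.allclose(y_parallel, y_serial))     # True: same result, sharded weights
```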