Decoupled Weight Decay Regularization, Ilya Loshchilov and Frank Hutter, 2019. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1711.05101 - Introduces AdamW, which decouples weight decay from the adaptive gradient update and is now a standard optimizer choice for LLM training.
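A minimal NumPy sketch of the decoupled update described in this paper (hyperparameter defaults and variable names here are illustrative, not taken from the paper): the decay term `weight_decay * w` is applied directly to the weights instead of being folded into the gradient, which is what distinguishes AdamW from Adam with L2 regularization.

```python
import numpy as np

def adamw_step(w, g, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update step (illustrative sketch, not the paper's reference code)."""
    m = beta1 * m + (1 - beta1) * g          # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * g ** 2     # second-moment estimate
    m_hat = m / (1 - beta1 ** t)             # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: decay acts on w directly, outside the adaptive term.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v
```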
Scaling Laws for Neural Language Models, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, 2020. arXiv preprint. DOI: 10.48550/arXiv.2001.08361 - Shows that language-model loss scales as a power law with model size, dataset size, and compute, informing hyperparameter and budget choices for LLMs.
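A tiny sketch of the kind of power-law fit the paper proposes, of the form L(N) = (N_c / N)^alpha_N for loss versus parameter count; the constants below are illustrative placeholders, not exact values from the paper.

```python
def loss_vs_params(n_params: float, n_c: float = 8.8e13, alpha_n: float = 0.076) -> float:
    """Kaplan-style power law for loss vs. parameter count:
    L(N) = (N_c / N) ** alpha_N.  Constants are illustrative placeholders."""
    return (n_c / n_params) ** alpha_n

# Example: doubling the parameter count lowers predicted loss by roughly 5% under this fit.
print(loss_vs_params(1e9), loss_vs_params(2e9))
```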