Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Provides core theory on neural network training, including optimization, loss functions, and issues like exploding/vanishing gradients.
CS224n: Natural Language Processing with Deep Learning, Diyi Yang and Tatsunori Hashimoto, 2023 (Stanford University) - Course materials for deep learning in NLP, covering training practices, optimization, and monitoring for language models.
Scaling Laws for Neural Language Models, Jared Kaplan, Sam McCandlish, Tom Henighan, Tom B. Brown, Benjamin Chess, Rewon Child, Scott Gray, Alec Radford, Jeffrey Wu, and Dario Amodei, 2020 (arXiv preprint arXiv:2001.08361, DOI: 10.48550/arXiv.2001.08361) - Research paper showing that training loss in large language models follows power laws in model size, dataset size, and compute, setting expectations for achievable performance.
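For readers who want the last entry's headline result in closed form, the paper reports that test loss follows a simple power law in the number of non-embedding parameters $N$ (constants as quoted in the paper; dataset size and compute obey analogous laws):

$$L(N) \approx \left(\frac{N_c}{N}\right)^{\alpha_N}, \qquad \alpha_N \approx 0.076, \quad N_c \approx 8.8 \times 10^{13}$$

In practice this means that, in the regime the paper studies, each order-of-magnitude increase in parameter count buys a predictable multiplicative reduction in loss, which is what makes the law useful for setting performance expectations before a training run.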