On the difficulty of training recurrent neural networks, Razvan Pascanu, Tomas Mikolov, Yoshua Bengio, 2013. Journal of Machine Learning Research: Proceedings Track, Vol. 28 (JMLR). DOI: 10.5555/3042079.3042211 - Explains the theoretical causes of exploding and vanishing gradients and introduces gradient clipping as a mitigation strategy; highly relevant to loss spikes.
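
To make the gradient-clipping remedy from this paper concrete, here is a minimal PyTorch sketch that rescales the global gradient norm before each optimizer step. The LSTM model, dummy data, and the max_norm of 1.0 are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x = torch.randn(8, 10, 32)       # dummy batch: (batch, seq_len, features)
target = torch.randn(8, 10, 64)  # dummy targets matching the LSTM output shape

optimizer.zero_grad()
output, _ = model(x)
loss = criterion(output, target)
loss.backward()

# Rescale the global gradient norm to at most max_norm before the update,
# which bounds the step size when gradients explode.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```
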
Mixed Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, 2018. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1710.03740 - Introduces mixed-precision training with FP16 and details the necessity and mechanism of loss scaling to prevent gradient underflow and overflow, which is critical for understanding mixed-precision-related loss spikes.
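
As a rough illustration of the static loss scaling the paper describes, the sketch below multiplies the loss by a fixed factor before the backward pass (so small FP16 gradients do not underflow to zero) and divides the gradients by the same factor before the update. The tiny FP16 linear model, the scale of 1024, and the assumption of a CUDA device are placeholders, and the FP32 master-weight copy the paper recommends is omitted for brevity.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda().half()   # FP16 weights, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_scale = 1024.0                        # fixed (static) scale factor

x = torch.randn(32, 128, device="cuda", dtype=torch.float16)
target = torch.randn(32, 10, device="cuda", dtype=torch.float16)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
(loss * loss_scale).backward()             # scaled loss -> scaled gradients

# Unscale gradients in place so the optimizer sees true gradient magnitudes.
for p in model.parameters():
    if p.grad is not None:
        p.grad.div_(loss_scale)
optimizer.step()
```
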
Automatic Mixed Precision examples, PyTorch Team, 2024 - Provides practical examples and explanations for using Automatic Mixed Precision (AMP) in PyTorch, including considerations for numerical stability and loss scaling, which directly address the mixed-precision issues that cause loss spikes.
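
Below is a minimal sketch of the autocast/GradScaler pattern those examples document, combining dynamic loss scaling with gradient unscaling before clipping. The model, data, and hyperparameters are placeholders, and a CUDA device is assumed.

```python
import torch
import torch.nn as nn

device = "cuda"                               # AMP with float16 assumes a GPU
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaler

x = torch.randn(32, 128, device=device)
target = torch.randn(32, 10, device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()             # backward on the scaled loss

    scaler.unscale_(optimizer)                # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)                    # skips the step if gradients overflowed
    scaler.update()                           # adjusts the scale factor for the next step
```

Calling scaler.unscale_ before clipping matters because otherwise the clipping threshold would be compared against scaled gradients, silently changing its effective value.
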
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016. MIT Press - Offers a comprehensive theoretical and practical foundation for training deep neural networks, including discussions of optimization challenges, gradient instability, and general debugging strategies relevant to loss spikes.