On the difficulty of training recurrent neural networks, Razvan Pascanu, Tomas Mikolov, Yoshua Bengio, 2013. Journal of Machine Learning Research: Proceedings Track, Vol. 28 (JMLR). DOI: 10.5555/3042079.3042211 - Explains the theoretical causes of exploding and vanishing gradients and introduces gradient clipping as a mitigation strategy; highly relevant to loss spikes.
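
To make the gradient-clipping remedy from this paper concrete, here is a minimal PyTorch sketch that rescales the global gradient norm before each optimizer step. The LSTM model, dummy data, and the max_norm of 1.0 are illustrative assumptions, not values taken from the paper.

```python
import torch
import torch.nn as nn

model = nn.LSTM(input_size=32, hidden_size=64, batch_first=True)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x = torch.randn(8, 10, 32)       # dummy batch: (batch, seq_len, features)
target = torch.randn(8, 10, 64)  # dummy targets matching the LSTM output shape

optimizer.zero_grad()
output, _ = model(x)
loss = criterion(output, target)
loss.backward()

# Rescale the global gradient norm to at most max_norm before the update,
# which bounds the step size when gradients explode.
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()
```
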
Mixed Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, Hao Wu, 2018. International Conference on Learning Representations (ICLR). DOI: 10.48550/arXiv.1710.03740 - Introduces mixed-precision training with FP16 and details the necessity and mechanism of loss scaling to prevent gradient underflow and overflow, which is critical for understanding mixed-precision-related loss spikes.
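
As a rough illustration of the static loss scaling the paper describes, the sketch below multiplies the loss by a fixed factor before the backward pass (so small FP16 gradients do not underflow to zero) and divides the gradients by the same factor before the update. The tiny FP16 linear model, the scale of 1024, and the assumption of a CUDA device are placeholders, and the FP32 master-weight copy the paper recommends is omitted for brevity.

```python
import torch
import torch.nn as nn

model = nn.Linear(128, 10).cuda().half()   # FP16 weights, for illustration only
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
loss_scale = 1024.0                        # fixed (static) scale factor

x = torch.randn(32, 128, device="cuda", dtype=torch.float16)
target = torch.randn(32, 10, device="cuda", dtype=torch.float16)

optimizer.zero_grad()
loss = nn.functional.mse_loss(model(x), target)
(loss * loss_scale).backward()             # scaled loss -> scaled gradients

# Unscale gradients in place so the optimizer sees true gradient magnitudes.
for p in model.parameters():
    if p.grad is not None:
        p.grad.div_(loss_scale)
optimizer.step()
```
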
Automatic Mixed Precision examples, PyTorch Team, 2024 - Provides practical examples and explanations for using Automatic Mixed Precision (AMP) in PyTorch, including considerations for numerical stability and loss scaling, which directly address the mixed-precision issues that cause loss spikes.
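
Below is a minimal sketch of the autocast/GradScaler pattern those examples document, combining dynamic loss scaling with gradient unscaling before clipping. The model, data, and hyperparameters are placeholders, and a CUDA device is assumed.

```python
import torch
import torch.nn as nn

device = "cuda"                               # AMP with float16 assumes a GPU
model = nn.Linear(128, 10).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
scaler = torch.cuda.amp.GradScaler()          # dynamic loss scaler

x = torch.randn(32, 128, device=device)
target = torch.randn(32, 10, device=device)

for step in range(10):
    optimizer.zero_grad(set_to_none=True)
    with torch.cuda.amp.autocast():           # run the forward pass in mixed precision
        loss = nn.functional.mse_loss(model(x), target)
    scaler.scale(loss).backward()             # backward on the scaled loss

    scaler.unscale_(optimizer)                # unscale gradients before clipping
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

    scaler.step(optimizer)                    # skips the step if gradients overflowed
    scaler.update()                           # adjusts the scale factor for the next step
```

Calling scaler.unscale_ before clipping matters because otherwise the clipping threshold would be compared against scaled gradients, silently changing its effective value.
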
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016. MIT Press - Offers a comprehensive theoretical and practical foundation for training deep neural networks, including discussions of optimization challenges, gradient instability, and general debugging strategies relevant to loss spikes.