Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A comprehensive textbook covering foundational concepts of deep learning, including backpropagation, gradient issues, activation functions, and initialization techniques.
On the difficulty of training Recurrent Neural Networks, Razvan Pascanu, Tomas Mikolov, and Yoshua Bengio, 2013, Proceedings of the 30th International Conference on Machine Learning, Vol. 28 (PMLR) - This paper analyzes the vanishing and exploding gradient problems in recurrent neural networks and proposes gradient clipping as a remedy for exploding gradients.
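As a rough illustration of the clipping rule discussed in the paper (not the authors' code), a minimal NumPy sketch that rescales a gradient whose L2 norm exceeds a threshold; the names `clip_gradient_by_norm` and `max_norm` are chosen here for illustration:

```python
import numpy as np

def clip_gradient_by_norm(grad, max_norm=1.0):
    """Rescale grad so its L2 norm never exceeds max_norm (clip-by-norm)."""
    norm = np.linalg.norm(grad)
    if norm > max_norm:
        grad = grad * (max_norm / norm)
    return grad

# An "exploded" gradient is pulled back onto the threshold sphere;
# its direction is preserved, only its magnitude changes.
g = np.array([300.0, -400.0])              # L2 norm = 500
print(clip_gradient_by_norm(g, 5.0))       # [ 3. -4.], norm = 5
```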
Mixed Precision Training, Paulius Micikevicius, Sharan Narang, Jonah Alben, Gregory Diamos, Erich Elsen, David Garcia, Boris Ginsburg, Michael Houston, Oleksii Kuchaiev, Ganesh Venkatesh, and Hao Wu, 2018, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1710.03740 - The seminal work introducing mixed precision training for deep neural networks, detailing techniques like loss scaling to leverage FP16 efficiently.
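A minimal NumPy sketch of the loss-scaling idea: gradients that underflow in FP16 survive if the loss (and hence the gradients) is scaled up before the backward pass and the scale is divided out again in FP32 before the weight update. The scale value 1024 and the variable names are chosen here for illustration only:

```python
import numpy as np

loss_scale = 1024.0                       # constant scale chosen for this example
true_grad = np.float32(2e-8)              # below FP16's smallest subnormal (~6e-8)

naive_fp16  = np.float16(true_grad)                    # underflows to 0.0: the update is lost
scaled_fp16 = np.float16(true_grad * loss_scale)       # ~2e-5, representable in FP16
recovered   = np.float32(scaled_fp16) / loss_scale     # unscale in FP32 before the weight update

print(naive_fp16, recovered)              # 0.0 vs. ~2e-08
```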
Layer Normalization, Jimmy Lei Ba, Jamie Ryan Kiros, and Geoffrey E. Hinton, 2016, arXiv preprint arXiv:1607.06450, DOI: 10.48550/arXiv.1607.06450 - Introduces Layer Normalization as an alternative to Batch Normalization; its independence from batch size makes it particularly effective for recurrent neural networks and Transformers.
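A minimal NumPy sketch of the layer normalization computation, normalizing each example over its feature dimension and then applying a gain `gamma` and bias `beta`; the function name is chosen here for illustration:

```python
import numpy as np

def layer_norm(x, gamma, beta, eps=1e-5):
    """Normalize each example over its feature axis, then apply gain and bias."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return gamma * (x - mean) / np.sqrt(var + eps) + beta

# The statistics are per example, so the result does not depend on the batch size.
x = np.random.randn(2, 4)                                  # (batch, features)
y = layer_norm(x, gamma=np.ones(4), beta=np.zeros(4))
print(y.mean(axis=-1), y.std(axis=-1))                     # per-example mean ~0, std ~1
```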
Understanding the difficulty of training deep feedforward neural networks, Xavier Glorot and Yoshua Bengio, 2010, Proceedings of the Thirteenth International Conference on Artificial Intelligence and Statistics (AISTATS), Vol. 9 (JMLR Workshop and Conference Proceedings), DOI: 10.5555/3104322.3104327 - This paper analyzes the distribution of activations and gradients in deep networks at initialization and proposes Xavier/Glorot initialization to mitigate vanishing and exploding gradients.
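A minimal NumPy sketch of the Xavier/Glorot uniform initializer, drawing weights from U(-sqrt(6/(fan_in+fan_out)), +sqrt(6/(fan_in+fan_out))); the helper name `xavier_uniform` is chosen here for illustration:

```python
import numpy as np

def xavier_uniform(fan_in, fan_out, rng=None):
    """Draw weights from U(-limit, +limit) with limit = sqrt(6 / (fan_in + fan_out))."""
    rng = np.random.default_rng() if rng is None else rng
    limit = np.sqrt(6.0 / (fan_in + fan_out))
    return rng.uniform(-limit, limit, size=(fan_in, fan_out))

# The variance of U(-limit, limit) is limit**2 / 3 = 2 / (fan_in + fan_out),
# which keeps activation and gradient variances roughly constant across layers.
W = xavier_uniform(256, 128)
print(W.var(), 2.0 / (256 + 128))
```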