Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour, Priya Goyal, Piotr Dollár, Ross Girshick, Pieter Noordhuis, Lukasz Wesolowski, Aapo Kyrola, Andrew Tulloch, Yangqing Jia, Kaiming He, 2017 (arXiv preprint), DOI: 10.48550/arXiv.1706.02677 - Introduces the linear scaling rule for learning rates with large batch sizes and discusses the importance of learning rate warmup.
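The linear scaling rule from this paper sets the learning rate proportional to the minibatch size (lr = base_lr × batch_size / 256 in the paper's ResNet experiments), combined with a gradual warmup that ramps the rate up over the first few epochs. A minimal sketch, assuming a 5-epoch linear warmup as in the paper; the function name and signature are illustrative, not from the paper:

```python
def scaled_lr_with_warmup(base_lr, batch_size, epoch,
                          warmup_epochs=5, reference_batch=256):
    """Linear scaling rule with gradual warmup (illustrative sketch).

    The target rate is base_lr * batch_size / reference_batch; during the
    first warmup_epochs the rate ramps linearly from base_lr up to it.
    """
    target_lr = base_lr * batch_size / reference_batch
    if epoch < warmup_epochs:
        # Gradual warmup: linear ramp from base_lr toward target_lr.
        alpha = epoch / warmup_epochs
        return base_lr + (target_lr - base_lr) * alpha
    return target_lr
```

For example, with base_lr = 0.1 and a batch of 8192, the schedule starts at 0.1 and reaches 3.2 after warmup, matching the rule's 32× scaling for a 32× larger batch.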
Deep Learning, Ian Goodfellow, Yoshua Bengio, Aaron Courville, 2016 (MIT Press) - A foundational textbook covering various optimization algorithms and their properties in deep learning.
Don't Decay the Learning Rate, Increase the Batch Size, Samuel L. Smith, Pieter-Jan Kindermans, Chris Ying, Quoc V. Le, 2017 (published at the International Conference on Learning Representations, ICLR 2018), DOI: 10.48550/arXiv.1711.00489 - Explores the alternative strategy of increasing the batch size instead of decaying the learning rate for efficient training.
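The strategy above replaces each learning-rate decay step with a proportional batch-size increase, falling back to conventional decay once the batch size reaches a hardware limit. A minimal sketch under those assumptions; the milestone values, decay factor, and cap below are illustrative, not from the paper:

```python
def batch_increase_schedule(milestones, init_lr, init_batch,
                            decay=0.1, max_batch=8192):
    """At each milestone epoch, grow the batch size by 1/decay instead of
    decaying the learning rate; once the batch would exceed max_batch,
    decay the learning rate as usual. Returns (epoch, lr, batch) tuples."""
    lr, batch = init_lr, init_batch
    plan = [(0, lr, batch)]
    for epoch in milestones:
        if batch / decay <= max_batch:
            batch = round(batch / decay)  # round to avoid float drift
        else:
            lr *= decay
        plan.append((epoch, lr, batch))
    return plan
```

For example, `batch_increase_schedule([30, 60, 80], 0.1, 256)` keeps the learning rate at 0.1 while growing the batch from 256 to 2560 at epoch 30, then decays the rate at epochs 60 and 80 once the cap is hit.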