Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Provides a foundational understanding of stochastic gradient descent variants, including mini-batch gradient descent, and discusses the impact of batch size on optimization.
On Large-Batch Training for Deep Learning: Generalization Gap and Sharp Minima, Nitish Shirish Keskar, Dheevatsa Mudigere, Jorge Nocedal, Mikhail Smelyanskiy, and Ping Tak Peter Tang, 2017, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1609.04836 - Presents empirical and theoretical evidence for the generalization gap between large-batch and small-batch methods, linking it to the sharpness of local minima.
Optimization Methods for Large-Scale Machine Learning, Léon Bottou, Frank E. Curtis, and Jorge Nocedal, 2018, SIAM Review, Vol. 60 (Society for Industrial and Applied Mathematics), DOI: 10.1137/16M1080173 - A comprehensive survey of optimization methods for large-scale machine learning, including a detailed analysis of stochastic gradient methods and the implications of batch size.