Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - A foundational textbook covering the theory and methods of deep learning, including comprehensive chapters on optimization algorithms and the challenges of non-convex loss landscapes.
A Stochastic Approximation Method, Herbert Robbins and Sutton Monro, 1951, The Annals of Mathematical Statistics, Vol. 22 (Institute of Mathematical Statistics), DOI: 10.1214/aoms/1177729586 - The seminal paper introducing stochastic approximation, which laid the mathematical groundwork for Stochastic Gradient Descent.
Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization, Yann N. Dauphin, Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio, 2014, Advances in Neural Information Processing Systems, Vol. 27 (Curran Associates, Inc.), DOI: 10.48550/arXiv.1406.2572 - This paper argues that saddle points, rather than local minima, are the main obstacle in high-dimensional non-convex optimization, significantly hindering the convergence of optimization algorithms.