Adaptive Subgradient Methods for Online Learning and Stochastic Optimization, John Duchi, Elad Hazan, and Yoram Singer, 2011. Journal of Machine Learning Research, Vol. 12 (Microtome Publishing). DOI: 10.5555/1953048.2078174 - Introduces AdaGrad, a foundational adaptive learning rate algorithm that scales each parameter's update in inverse proportion to the square root of the sum of its past squared gradients. This paper is useful for understanding the origin of per-parameter learning rate scaling.
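To make the scaling rule above concrete, here is a minimal sketch of a single AdaGrad step in NumPy; the function name, signature, and default hyperparameters are illustrative choices, not taken from the paper.

```python
import numpy as np

def adagrad_update(param, grad, grad_sq_sum, lr=0.01, eps=1e-8):
    """One AdaGrad step (illustrative sketch).

    Accumulates the squared gradient per parameter and divides the step
    by the square root of that running sum, so frequently updated
    parameters receive smaller effective learning rates.
    """
    grad_sq_sum = grad_sq_sum + grad ** 2              # sum of past squared gradients
    param = param - lr * grad / (np.sqrt(grad_sq_sum) + eps)
    return param, grad_sq_sum
```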
Adam: A Method for Stochastic Optimization, Diederik P. Kingma and Jimmy Ba, 2014. 3rd International Conference for Learning Representations (ICLR), San Diego, 2015. DOI: 10.48550/arXiv.1412.6980 - Presents Adam, a widely used adaptive optimization algorithm that combines the benefits of RMSprop and momentum, maintaining an individual adaptive learning rate for each parameter by estimating the first and second moments of the gradients.
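As a companion to the description above, the sketch below shows one Adam step with exponential moving averages of the gradient (first moment) and squared gradient (second moment) plus bias correction; the function name and defaults are illustrative, though the default hyperparameter values match those suggested in the paper.

```python
import numpy as np

def adam_update(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam step (illustrative sketch); t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad             # first-moment (mean) estimate
    v = beta2 * v + (1 - beta2) * grad ** 2        # second-moment (uncentered variance) estimate
    m_hat = m / (1 - beta1 ** t)                   # bias-corrected first moment
    v_hat = v / (1 - beta2 ** t)                   # bias-corrected second moment
    param = param - lr * m_hat / (np.sqrt(v_hat) + eps)  # per-parameter adaptive step
    return param, m, v
```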
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016 (MIT Press) - Chapter 8, 'Optimization for Training Deep Models', provides a comprehensive overview of optimization algorithms, including the motivation for adaptive learning rates and detailed discussions of AdaGrad, RMSprop, and Adam.