Decoupled Weight Decay Regularization, Ilya Loshchilov and Frank Hutter, 2019, International Conference on Learning Representations (ICLR 2019), DOI: 10.48550/arXiv.1711.05101 - Proposes AdamW, a variant of Adam that decouples weight decay from the adaptive learning rate mechanism, improving regularization (a minimal update-rule sketch follows this list).
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016, MIT Press - Comprehensive textbook covering various optimization algorithms, including SGD, Momentum, RMSprop, and Adam, along with their theoretical foundations and practical considerations.
The Marginal Value of Adaptive Gradient Methods in Machine Learning, Ashia C. Wilson, Rebecca Roelofs, Mitchell Stern, Nathan Srebro, and Benjamin Recht, 2017, Advances in Neural Information Processing Systems 30 (NeurIPS 2017), DOI: 10.48550/arXiv.1705.08292 - Critically analyzes the generalization performance of adaptive gradient methods (like Adam) compared to SGD, suggesting that SGD often finds solutions that generalize better.
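
The sketch below illustrates the decoupling described in the AdamW entry above. It is a minimal, illustrative comparison of the two update rules, not the authors' reference implementation; the function names, hyperparameter defaults, and NumPy-based single-step formulation are assumptions made for clarity.

```python
import numpy as np

def adam_l2_step(p, g, m, v, t, lr=1e-3, wd=1e-2,
                 beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam with L2 regularization: the decay term is folded into the
    gradient, so it gets rescaled by the adaptive denominator."""
    g = g + wd * p                        # L2 penalty enters the gradient
    m = beta1 * m + (1 - beta1) * g       # first-moment estimate
    v = beta2 * v + (1 - beta2) * g**2    # second-moment estimate
    m_hat = m / (1 - beta1**t)            # bias correction
    v_hat = v / (1 - beta2**t)
    p = p - lr * m_hat / (np.sqrt(v_hat) + eps)
    return p, m, v

def adamw_step(p, g, m, v, t, lr=1e-3, wd=1e-2,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """AdamW: weight decay is applied directly to the parameters and
    bypasses the adaptive rescaling entirely (decoupled)."""
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g**2
    m_hat = m / (1 - beta1**t)
    v_hat = v / (1 - beta2**t)
    p = p - lr * (m_hat / (np.sqrt(v_hat) + eps) + wd * p)
    return p, m, v
```

The only difference is where the decay term `wd * p` enters: inside the gradient (and hence the moment estimates) for L2-regularized Adam, versus directly in the parameter update for AdamW, which is what keeps the effective decay independent of the per-parameter adaptive step size.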