Adam: A Method for Stochastic Optimization, Diederik P. Kingma and Jimmy Ba, 2015, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1412.6980 - Introduces Adam, a widely used optimization algorithm combining adaptive per-parameter learning rates with momentum-like averaging of gradients.
Decoupled Weight Decay Regularization, Ilya Loshchilov and Frank Hutter, 2019, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1711.05101 - Proposes AdamW, a modification of Adam that decouples weight decay from the adaptive gradient updates, improving regularization (see the update-rule sketch after this list).
Deep Learning, Ian Goodfellow, Yoshua Bengio, and Aaron Courville, 2016, MIT Press - A textbook covering optimization algorithms used in deep learning, including SGD, momentum, AdaGrad, RMSprop, and Adam.
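The first two references describe closely related update rules, so a minimal sketch may help make the distinction concrete: Adam's bias-corrected moment estimates versus AdamW's weight decay applied directly to the parameters rather than through the gradient. The function name `adam_step` and the `decoupled` flag are illustrative choices for this sketch, not code from either paper.

```python
# Minimal NumPy sketch of a single Adam / AdamW parameter update,
# following the notation of Kingma & Ba (2015) and Loshchilov & Hutter (2019).
import numpy as np

def adam_step(theta, grad, m, v, t,
              lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One update step (names here are illustrative, not from the papers).

    decoupled=False adds the weight-decay term to the gradient (classic L2
    regularization inside Adam); decoupled=True applies it directly to the
    parameters, as proposed for AdamW.
    """
    if weight_decay and not decoupled:
        grad = grad + weight_decay * theta            # L2 term flows through the adaptive statistics
    m = beta1 * m + (1 - beta1) * grad                # first moment: momentum-like gradient average
    v = beta2 * v + (1 - beta2) * grad ** 2           # second moment: drives adaptive learning rates
    m_hat = m / (1 - beta1 ** t)                      # bias correction for initialization at zero
    v_hat = v / (1 - beta2 ** t)
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    if weight_decay and decoupled:
        theta = theta - lr * weight_decay * theta     # AdamW: decay decoupled from the adaptive update
    return theta, m, v
```

A caller would keep `m`, `v`, and the step counter `t` per parameter tensor and pass them back in on each step; setting `decoupled=True` with a nonzero `weight_decay` switches the sketch from Adam-with-L2 to the AdamW variant.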