Decoupled Weight Decay Regularization, Ilya Loshchilov and Frank Hutter, 2019, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1711.05101 - Demonstrates how decoupled weight decay improves performance and generalization for adaptive optimizers like Adam, leading to AdamW.
Lookahead Optimizer: k steps forward, 1 step back, Michael R. Zhang, James Lucas, Geoffrey Hinton, and Jimmy Ba, 2019, Advances in Neural Information Processing Systems (NeurIPS), DOI: 10.48550/arXiv.1907.08610 - Proposes the Lookahead optimizer, a wrapper that improves stability and convergence through interpolation of fast and slow weights.
Adam: A Method for Stochastic Optimization, Diederik P. Kingma and Jimmy Ba, 2015, International Conference on Learning Representations (ICLR), DOI: 10.48550/arXiv.1412.6980 - The foundational paper introducing the Adam optimizer, which uses adaptive estimates of first and second moments of gradients.
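To make the three updates above concrete, here is a minimal NumPy sketch of the core ideas: Adam's bias-corrected first and second moment estimates, AdamW's weight decay applied directly to the weights rather than folded into the gradient, and Lookahead's periodic interpolation between fast and slow weights. The hyperparameter names and values (lr, beta1, beta2, eps, weight_decay, k, alpha) are common illustrative defaults, not values prescribed by the papers, and the toy objective is chosen only to make the loop runnable.

```python
import numpy as np


def adamw_step(w, grad, m, v, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=1e-2):
    """One AdamW step: Adam moments on the raw gradient, decay applied to w directly."""
    m = beta1 * m + (1 - beta1) * grad           # first-moment estimate (Adam)
    v = beta2 * v + (1 - beta2) * grad ** 2      # second-moment estimate (Adam)
    m_hat = m / (1 - beta1 ** t)                 # bias correction
    v_hat = v / (1 - beta2 ** t)
    # Decoupled weight decay: the decay term is added outside the adaptive
    # rescaling; setting weight_decay=0 recovers plain Adam.
    w = w - lr * (m_hat / (np.sqrt(v_hat) + eps) + weight_decay * w)
    return w, m, v


def lookahead_sync(slow_w, fast_w, alpha=0.5):
    """Lookahead '1 step back': pull the slow weights toward the fast weights."""
    slow_w = slow_w + alpha * (fast_w - slow_w)
    return slow_w, slow_w.copy()                 # fast weights restart from the slow weights


# Toy usage: minimize ||w||^2 with AdamW as the inner (fast) optimizer,
# synchronizing Lookahead's slow weights every k steps.
rng = np.random.default_rng(0)
w = rng.normal(size=5)
slow_w = w.copy()
m = np.zeros_like(w)
v = np.zeros_like(w)
k = 5
for t in range(1, 51):
    grad = 2 * w                                 # gradient of the toy objective
    w, m, v = adamw_step(w, grad, m, v, t)
    if t % k == 0:                               # "k steps forward, 1 step back"
        slow_w, w = lookahead_sync(slow_w, w)
print("final slow weights:", slow_w)
```

In practice the same ideas are available off the shelf (for example, torch.optim.Adam and torch.optim.AdamW in PyTorch); the sketch is only meant to show where the moment estimates, the decoupled decay term, and the slow/fast interpolation enter the update.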