Decoupled Weight Decay Regularization. Ilya Loshchilov, Frank Hutter. International Conference on Learning Representations (ICLR), 2019. DOI: 10.48550/arXiv.1711.05101 - Introduces the concept of decoupling weight decay from adaptive optimizers like Adam, leading to the AdamW algorithm.
Lookahead Optimizer: k steps forward, 1 step back. Michael R. Zhang, James Lucas, Geoffrey Hinton, Jimmy Ba. Advances in Neural Information Processing Systems (NeurIPS), 2019. DOI: 10.48550/arXiv.1907.08610 - Presents the Lookahead optimization algorithm, which improves stability and accelerates convergence by combining fast and slow weights.
On the Variance of the Adaptive Learning Rate and Beyond. Liyuan Liu, Haoming Jiang, Pengcheng He, Weizhu Chen, Xiaodong Liu, Jianfeng Gao, Jiawei Han. International Conference on Learning Representations (ICLR), 2020. DOI: 10.48550/arXiv.1908.03265 - Introduces Rectified Adam (RAdam), which addresses the high variance of the adaptive learning rate during early training and is a core component of the Ranger optimizer (see the sketch after this list).
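The last two entries are the ingredients of the Ranger combination: RAdam as the inner optimizer, with Lookahead's "k steps forward, 1 step back" rule wrapped around it. The following is a minimal sketch of that idea, assuming a PyTorch version that ships torch.optim.RAdam (1.10+); the Lookahead class, its k and alpha defaults, and the toy training loop are illustrative assumptions, not the papers' reference implementations.

```python
# Illustrative sketch only: Lookahead ("k steps forward, 1 step back") around RAdam.
import torch


class Lookahead:
    """Keeps a slow copy of the weights and, every k inner steps, interpolates
    slow <- slow + alpha * (fast - slow), then resets the fast weights to slow."""

    def __init__(self, base_optimizer, k=5, alpha=0.5):
        self.base = base_optimizer
        self.k = k
        self.alpha = alpha
        self.step_count = 0
        # Snapshot the slow weights from the current (fast) parameters.
        self.slow_weights = [
            [p.detach().clone() for p in group["params"]]
            for group in base_optimizer.param_groups
        ]

    def zero_grad(self):
        self.base.zero_grad()

    def step(self):
        self.base.step()  # "k steps forward": one inner (RAdam) update
        self.step_count += 1
        if self.step_count % self.k == 0:
            # "1 step back": pull the slow weights toward the fast weights,
            # then restart the fast weights from the updated slow weights.
            for group, slow_group in zip(self.base.param_groups, self.slow_weights):
                for fast, slow in zip(group["params"], slow_group):
                    slow += self.alpha * (fast.detach() - slow)
                    fast.data.copy_(slow)


# Usage sketch on a toy regression problem.
model = torch.nn.Linear(10, 1)
optimizer = Lookahead(torch.optim.RAdam(model.parameters(), lr=1e-3), k=5, alpha=0.5)

x, y = torch.randn(32, 10), torch.randn(32, 1)
for _ in range(20):
    optimizer.zero_grad()
    loss = torch.nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```

Weight decay, if desired, would be applied in the inner optimizer in the decoupled form introduced by the first entry (as in torch.optim.AdamW), rather than folded into the gradient as L2 regularization.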