While the Adam optimizer, discussed previously, has become a default choice for many deep learning tasks due to its efficiency and effectiveness, later analysis revealed problems with its convergence proof. Specifically, the exponentially decaying average of past squared gradients ($v_t$) is not guaranteed to be non-decreasing. In some scenarios, particularly late in training or when highly informative gradients appear after a run of less informative ones, $v_t$ can shrink significantly. Because this term sits in the denominator of the parameter update, $\frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}$, a shrinking $v_t$ can produce undesirably large learning steps, potentially causing the optimizer to diverge or fail to converge to an optimal solution.
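To make this failure mode concrete, the toy snippet below tracks the raw second-moment estimate $v_t$ and the resulting per-step scaling factor when one large gradient is followed by many small ones. The gradient values and hyperparameters are arbitrary illustrative choices, and bias correction is omitted for simplicity.

```python
import numpy as np

beta2, eps, lr = 0.999, 1e-8, 0.001
v = 0.0
# One large, informative gradient followed by many small ones.
grads = [10.0] + [0.01] * 500

for t, g in enumerate(grads, start=1):
    v = beta2 * v + (1 - beta2) * g**2
    scale = lr / (np.sqrt(v) + eps)  # effective per-step scaling of the gradient
    if t in (1, 100, 500):
        print(f"t={t:3d}  v={v:.6f}  scale={scale:.5f}")

# v decays toward the magnitude of the small gradients, so the scaling factor
# grows over time. This is the behavior AMSGrad is designed to rule out.
```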
AMSGrad (Adaptive Moment Estimation with improved convergence guarantees) was proposed to directly address this theoretical flaw in Adam's convergence analysis. The core idea is simple yet effective: ensure that the adaptive learning rate term used for scaling gradients does not increase over time.
Instead of directly using the current estimate of the second moment ($v_t$), AMSGrad maintains the maximum value of this estimate seen so far. Let's break down the update steps (a short code sketch follows the list):

1. Compute Gradient: Calculate the gradient $g_t$ of the loss function with respect to the parameters $\theta_t$ at timestep $t$.

   $$g_t = \nabla_\theta L(\theta_t)$$

2. Update Biased First Moment Estimate: This is the same as in Adam, using the hyperparameter $\beta_1$.

   $$m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t$$

3. Update Biased Second Moment Estimate: This is also the same as in Adam, using the hyperparameter $\beta_2$. (Note: $g_t^2$ denotes element-wise squaring.)

   $$v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2$$

4. Maintain Maximum of Second Moment Estimates: This is the key difference introduced by AMSGrad. A new variable, $\hat{v}_t$, keeps track of the maximum value of $v$ encountered up to timestep $t$. Initialize $\hat{v}_0 = 0$; the max operation is performed element-wise.

   $$\hat{v}_t = \max(\hat{v}_{t-1}, v_t)$$

5. Compute Bias-Corrected First Moment Estimate: Same as Adam.

   $$\hat{m}_t = \frac{m_t}{1 - \beta_1^t}$$

6. Update Parameters: Use the maximum second moment estimate $\hat{v}_t$ in the denominator, in place of the bias-corrected second moment estimate that Adam uses. Notice that $\hat{v}_t$ itself is not bias-corrected in the standard AMSGrad formulation of the update step. Here, $\eta$ is the step size (learning rate) and $\epsilon$ is a small constant for numerical stability.

   $$\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t$$
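The steps above translate almost directly into code. Below is a minimal NumPy sketch of a single AMSGrad update; the function name `amsgrad_step`, the state layout, and the default hyperparameter values are illustrative choices, not taken from any particular library.

```python
import numpy as np

def amsgrad_step(theta, grad, m, v, v_hat, t,
                 lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AMSGrad update; all operations are element-wise."""
    # Biased first and second moment estimates (identical to Adam).
    m = beta1 * m + (1 - beta1) * grad
    v = beta2 * v + (1 - beta2) * grad**2

    # Key AMSGrad change: track the element-wise maximum of v seen so far,
    # so the denominator below can never shrink between steps.
    v_hat = np.maximum(v_hat, v)

    # Bias-correct only the first moment; v_hat is used as-is.
    m_hat = m / (1 - beta1**t)

    # Parameter update with the non-decreasing denominator.
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v, v_hat
```

Because `v_hat` only ever takes element-wise maxima, the scaling term $\frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}$ cannot grow over time, which is exactly the property Adam's update lacks.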
By ensuring that the denominator term $\sqrt{\hat{v}_t} + \epsilon$ is non-decreasing, AMSGrad prevents the large, potentially destabilizing steps that could occur in Adam if $v_t$ were to shrink unexpectedly. This modification provides stronger theoretical convergence guarantees, particularly in online and non-convex settings.
Does the theoretical improvement translate to better practical performance? The results are mixed and often problem-dependent.
When to Consider AMSGrad:
If Adam training becomes unstable or diverges late in training, behavior consistent with a shrinking $v_t$, enabling AMSGrad is a cheap experiment. It is also a reasonable choice when theoretical convergence guarantees matter more than squeezing out the best empirical performance.
Most deep learning frameworks (like TensorFlow and PyTorch) offer AMSGrad as a simple boolean flag within their Adam optimizer implementation (e.g., `tf.keras.optimizers.Adam(amsgrad=True)` or `torch.optim.Adam(amsgrad=True)`). This makes experimentation straightforward.
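For example, here is a minimal PyTorch sketch of enabling the flag; the tiny linear model and random data are placeholders for your own architecture and batches.

```python
import torch
import torch.nn as nn

# Placeholder model and data; substitute your own setup.
model = nn.Linear(10, 1)
x, y = torch.randn(32, 10), torch.randn(32, 1)

# Enabling AMSGrad is a one-argument change to the usual Adam constructor.
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3, amsgrad=True)

# Standard training step: forward pass, loss, backward pass, parameter update.
loss = nn.functional.mse_loss(model(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()
```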
While Adam remains a very strong baseline and often the first choice, AMSGrad provides a valuable alternative, grounded in addressing a specific theoretical shortcoming of Adam. Knowing its mechanism and motivation allows you to make informed decisions when tuning optimization strategies for complex models.