As we saw with Newton's method, incorporating curvature information via the Hessian matrix can significantly accelerate convergence towards a minimum. However, the practical application of Newton's method in large-scale machine learning often hits a wall due to the computational burden of forming and inverting the Hessian matrix at every iteration. For a model with $N$ parameters, the Hessian is an $N \times N$ matrix, requiring $O(N^2)$ storage and typically $O(N^3)$ computation to invert. When $N$ is in the millions or billions, this becomes completely infeasible.
This is where Quasi-Newton methods come into play. They offer a clever compromise: retain the superlinear convergence benefits associated with second-order information without explicitly computing, storing, or inverting the Hessian. The central idea is to build up an approximation of the Hessian, or more commonly, its inverse, using only information from successive gradient evaluations and parameter updates.
Instead of calculating $\nabla^2 f(x_k)$ directly, Quasi-Newton methods maintain a matrix $B_k$ that approximates the Hessian $\nabla^2 f(x_k)$, or a matrix $H_k$ that approximates the inverse Hessian $(\nabla^2 f(x_k))^{-1}$. The update step then looks similar to Newton's method, but uses the approximation:

$$x_{k+1} = x_k - \alpha_k B_k^{-1} \nabla f(x_k) \quad \text{or, equivalently,} \quad x_{k+1} = x_k - \alpha_k H_k \nabla f(x_k).$$
The matrix $H_k$ is often preferred in practice because it avoids solving a linear system at each step, replacing it with a simple matrix-vector multiplication. The step size $\alpha_k$ is usually determined by a line search procedure, ensuring sufficient decrease in the objective function.
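To make the update rule concrete, here is a minimal sketch of a single quasi-Newton step, assuming an inverse-Hessian approximation $H_k$ is already available. The function and parameter names (`quasi_newton_step`, `alpha_init`, `rho`, `c`) are our own, and the line search is a plain backtracking (Armijo) rule rather than the more careful Wolfe search used in production implementations.

```python
import numpy as np

def quasi_newton_step(f, grad_f, x_k, H_k, alpha_init=1.0, rho=0.5, c=1e-4):
    """One generic quasi-Newton step (illustrative sketch).

    H_k is the current inverse-Hessian approximation; the step size is
    chosen by a simple backtracking line search (Armijo condition).
    """
    g_k = grad_f(x_k)
    p_k = -H_k @ g_k                      # search direction from the approximation
    alpha = alpha_init
    # Backtrack until the sufficient-decrease (Armijo) condition holds.
    while f(x_k + alpha * p_k) > f(x_k) + c * alpha * (g_k @ p_k):
        alpha *= rho
    return x_k + alpha * p_k

# Tiny usage example on a quadratic bowl; with H_k = I this reduces to
# gradient descent, which is the trivial special case of the scheme.
f      = lambda x: 0.5 * x @ x
grad_f = lambda x: x
x_next = quasi_newton_step(f, grad_f, np.array([1.0, -2.0]), np.eye(2))
print(x_next)
```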
How is the approximation $B_k$ or $H_k$ updated from one iteration to the next? The key lies in the secant condition (also known as the Quasi-Newton condition).
Let's define the change in parameters and gradients between two iterations:
$$s_k = x_{k+1} - x_k, \qquad y_k = \nabla f(x_{k+1}) - \nabla f(x_k).$$

Recall from calculus that for a quadratic function, the (constant) Hessian relates the change in gradient to the change in position: $\nabla f(x_{k+1}) - \nabla f(x_k) = \nabla^2 f \,(x_{k+1} - x_k)$. Quasi-Newton methods enforce a similar relationship for the next approximation, $B_{k+1}$. We require that the new approximation correctly maps the step just taken ($s_k$) to the observed change in gradient ($y_k$):
$$B_{k+1} s_k = y_k \qquad \text{(secant condition for the Hessian approximation)}$$

Equivalently, if we are approximating the inverse Hessian with $H_k$, the condition becomes:
$$H_{k+1} y_k = s_k \qquad \text{(secant condition for the inverse Hessian approximation)}$$

This condition uses the readily available gradient information ($y_k$) and parameter change ($s_k$) from the latest step to impose a constraint on the updated approximation. It effectively incorporates information about the function's curvature along the direction $s_k$.
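The following short check (a toy example of our own, with an arbitrarily chosen matrix $A$ and points $x_k$, $x_{k+1}$) makes the secant condition tangible: for a quadratic objective the exact Hessian already satisfies it, which is exactly the behaviour the update imposes on $B_{k+1}$.

```python
import numpy as np

# Illustrative check on the quadratic f(x) = 0.5 x^T A x - b^T x,
# whose Hessian is the constant matrix A.
A = np.array([[3.0, 1.0],
              [1.0, 2.0]])
b = np.array([1.0, -1.0])
grad = lambda x: A @ x - b

x_k      = np.array([0.5, -0.2])
x_k_next = np.array([1.3,  0.7])

s_k = x_k_next - x_k                 # change in parameters
y_k = grad(x_k_next) - grad(x_k)     # change in gradients

# For a quadratic, y_k = A s_k exactly, so the true Hessian satisfies
# B s_k = y_k; quasi-Newton updates impose the same equation on B_{k+1}.
print(np.allclose(y_k, A @ s_k))     # True
```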
The secant condition provides only a single (vector) constraint, so it does not uniquely determine the entire matrix $B_{k+1}$ or $H_{k+1}$. Different Quasi-Newton methods arise from different choices for how to update the matrix while satisfying the secant condition. Generally, the update is formulated as adding a low-rank correction to the current approximation:
$$B_{k+1} = B_k + \Delta B_k, \qquad H_{k+1} = H_k + \Delta H_k.$$

The specific form of the update term ($\Delta B_k$ or $\Delta H_k$) defines the particular algorithm. Popular choices aim to satisfy the secant condition while preserving symmetry, maintaining positive definiteness (so the resulting direction remains a descent direction), and keeping the new approximation as close as possible to the current one.
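As one concrete instance of such a low-rank correction (shown purely for illustration; it is not the update developed in the next section), the classical symmetric rank-one (SR1) formula adds a single rank-one term chosen so that the secant condition holds exactly. A minimal sketch:

```python
import numpy as np

def sr1_update(B_k, s_k, y_k, eps=1e-8):
    """Symmetric rank-one (SR1) update: a classical example of a low-rank
    correction that keeps B symmetric and satisfies B_{k+1} s_k = y_k.
    The update is skipped when its denominator is close to zero."""
    r = y_k - B_k @ s_k
    denom = r @ s_k
    if abs(denom) <= eps * np.linalg.norm(r) * np.linalg.norm(s_k):
        return B_k                                  # safeguard: no update
    return B_k + np.outer(r, r) / denom

# Quick sanity check that the secant condition holds after the update.
B_k = np.eye(2)
s_k = np.array([1.0, 0.5])
y_k = np.array([2.0, 1.5])
B_next = sr1_update(B_k, s_k, y_k)
print(np.allclose(B_next @ s_k, y_k))               # True
```

Unlike SR1, the BFGS update discussed next applies a rank-two correction, which additionally preserves positive definiteness whenever the curvature condition $s_k^\top y_k > 0$ holds.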
The most successful and widely used update formula resulting from these principles is the Broyden–Fletcher–Goldfarb–Shanno (BFGS) update, which we will detail in the next section. Its memory-efficient variant, Limited-memory BFGS (L-BFGS), is a workhorse algorithm for large-scale machine learning optimization problems.
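As a brief usage illustration (assuming SciPy is installed; the ill-conditioned quadratic below is our own toy objective), L-BFGS is available off the shelf through `scipy.optimize.minimize`:

```python
import numpy as np
from scipy.optimize import minimize

# Ill-conditioned quadratic: curvature differs strongly across coordinates,
# the kind of problem where plain first-order methods slow down.
A = np.diag([1.0, 10.0, 100.0])
f    = lambda x: 0.5 * x @ A @ x
grad = lambda x: A @ x

result = minimize(f, x0=np.ones(3), jac=grad, method="L-BFGS-B")
print(result.x)          # close to the minimizer at the origin
print(result.nit)        # number of iterations taken
```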
In essence, Quasi-Newton methods cleverly sidestep the computational bottleneck of Newton's method by building curvature information incrementally through gradient differences. They represent a powerful class of algorithms that often converge much faster than first-order methods, especially on problems where curvature plays a significant role, without incurring the prohibitive cost of exact second-order approaches.