In the previous sections, we established how gradient descent uses the calculated gradients (the direction of steepest ascent of the loss function) to guide the adjustment of network parameters. The core idea is captured in the parameter update rules:
$$W_{\text{new}} = W_{\text{old}} - \eta \frac{\partial \text{Loss}}{\partial W_{\text{old}}} \qquad\qquad b_{\text{new}} = b_{\text{old}} - \eta \frac{\partial \text{Loss}}{\partial b_{\text{old}}}$$
Here, W represents the weights, b represents the biases, and $\frac{\partial \text{Loss}}{\partial \text{parameter}}$ is the gradient of the loss with respect to that parameter. But what exactly is the role of the symbol η (eta)? This is the learning rate, a small positive value that acts as a scaling factor for the gradients.
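To make the notation concrete, here is a minimal sketch of a single update step in NumPy. The gradient arrays below are random placeholders standing in for whatever backpropagation would actually produce, and the shapes are arbitrary example values:

```python
import numpy as np

eta = 0.01                        # learning rate, chosen before training begins

W = np.random.randn(3, 2)         # example weight matrix
b = np.zeros(2)                    # example bias vector

# Placeholder gradients; in practice these come from backpropagation.
dLoss_dW = np.random.randn(3, 2)
dLoss_db = np.random.randn(2)

# The update rules: step against the gradient, scaled by eta.
W = W - eta * dLoss_dW
b = b - eta * dLoss_db
```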
Think of the training process as trying to find the bottom of a valley (the minimum loss). The gradient tells you which direction is downhill. The learning rate, η, determines how large a step you take in that downhill direction at each iteration. It's a hyperparameter, meaning it's a configuration setting you choose before the training process begins, rather than a parameter learned during training.
Choosing an appropriate learning rate is essential for successful model training. The value of η directly influences both how quickly the loss decreases and whether the optimization converges at all.
Learning Rate Too High: If η is too large, the steps taken during gradient descent can be excessive. Imagine taking giant leaps down the valley; you might completely overshoot the bottom and end up on the other side, potentially even higher up than where you started. This can cause the loss function to fluctuate wildly, failing to decrease consistently, or even diverge (increase indefinitely). The optimization process becomes unstable.
Learning Rate Too Low: Conversely, if η is too small, the steps taken are tiny. While this ensures you're moving cautiously downhill, progress can be incredibly slow. It might take an impractically long time to reach the minimum loss value. Furthermore, very small steps might make the optimizer more likely to get stuck in shallow local minima, failing to find a better, deeper minimum elsewhere in the loss landscape.
The goal is to find a learning rate that is "just right" – large enough to make reasonable progress towards the minimum in a decent amount of time, but small enough to avoid overshooting and instability.
Consider how the loss might change over training iterations with different learning rates:
Hypothetical loss curves illustrating different learning rate scenarios. A high learning rate causes oscillations and potential divergence. A low learning rate leads to very slow improvement. An appropriate learning rate shows steady convergence.
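The effect is easy to reproduce on a toy problem. The sketch below runs gradient descent on a one-dimensional quadratic loss, a deliberately simple stand-in for a real loss surface, with a learning rate that is too high, too low, and roughly right; the specific values are illustrative assumptions:

```python
def loss(w):
    # One-dimensional quadratic loss with its minimum at w = 3.
    return (w - 3.0) ** 2

def grad(w):
    # Derivative of the loss with respect to w.
    return 2.0 * (w - 3.0)

for eta in (1.1, 0.001, 0.1):      # too high, too low, reasonable
    w = 0.0                         # same starting point each time
    for _ in range(20):
        w = w - eta * grad(w)       # gradient descent update
    print(f"eta={eta:<6}  loss after 20 steps: {loss(w):.6f}")
```

Running this prints a huge loss for the high rate (the iterates diverge), a barely reduced loss for the tiny rate, and a loss near zero for the moderate rate.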
There's no single "best" learning rate; the optimal value depends heavily on the specific dataset, network architecture, and loss function. Common starting points often range from 0.1 down to 0.0001. Finding a good value typically involves experimentation, trying several candidates and comparing how the loss behaves over training.
While using a fixed learning rate is the simplest approach, more advanced techniques involve adjusting η during the training process. A common strategy is learning rate decay (or scheduling), where you start with a relatively higher learning rate to make rapid initial progress and then gradually decrease it over time. This allows the optimizer to take smaller, finer steps as it gets closer to the minimum, potentially leading to better final results. We won't detail specific scheduling algorithms here, but be aware that adaptive learning rates are a standard practice in modern deep learning.
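As an illustration, a simple step-decay schedule might look like the following sketch. The initial rate, decay factor, and decay interval are arbitrary example values, not recommendations:

```python
initial_eta = 0.1                  # relatively high starting learning rate
decay_factor = 0.5                 # halve the rate at each decay step
decay_every = 10                   # apply the decay every 10 epochs

for epoch in range(30):
    eta = initial_eta * decay_factor ** (epoch // decay_every)
    # ... run one epoch of training, using this eta in the parameter updates ...
    if epoch % decay_every == 0:
        print(f"epoch {epoch:2d}: learning rate = {eta}")
```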
In summary, the learning rate η is a single hyperparameter with an outsized influence on gradient descent: it dictates the step size taken during parameter updates. Selecting an appropriate value, often through experimentation, is necessary for achieving stable convergence and training effective neural networks; an incorrect choice can lead to slow training or a complete failure to converge.