In the additive framework of Gradient Boosting, we build the model sequentially, adding one base learner (typically a decision tree) at each step to correct the errors made by the existing ensemble. The update rule at step m is generally formulated as:
$$ F_m(x) = F_{m-1}(x) + h_m(x) $$

Here, $F_{m-1}(x)$ is the model ensemble built over the previous $m-1$ steps, and $h_m(x)$ is the new base learner trained to fit the negative gradient (pseudo-residuals) of the loss function with respect to $F_{m-1}(x)$.
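To make the update concrete, here is a minimal from-scratch sketch of this additive loop for squared-error loss, where the negative gradient reduces to the ordinary residuals $y - F_{m-1}(x)$. The helper names (`fit_gbm`, `n_rounds`) are illustrative, not taken from any particular library.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm(X, y, n_rounds=100, max_depth=3):
    """Plain additive boosting for squared-error loss (no shrinkage yet)."""
    # F_0(x): the constant that minimizes squared error is the mean of y.
    f0 = y.mean()
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for m in range(n_rounds):
        # For squared error, the negative gradient is simply the residual.
        residuals = y - pred
        # h_m(x): fit a small tree to the current pseudo-residuals.
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Update rule: F_m(x) = F_{m-1}(x) + h_m(x)
        pred += tree.predict(X)
        trees.append(tree)
    return f0, trees

def predict_gbm(f0, trees, X):
    return f0 + sum(tree.predict(X) for tree in trees)
```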
While this seems straightforward, adding the full prediction of the newly trained tree hm(x) at each step can be overly aggressive. If a tree perfectly fits the current pseudo-residuals, it might overfit to the specific errors present at that stage, potentially harming the model's ability to generalize to unseen data. This is where shrinkage, also known as the learning rate, comes into play.
Shrinkage introduces a scaling factor, typically denoted by η (eta) or sometimes ν (nu), which reduces the contribution of each new tree added to the ensemble. The modified update rule becomes:
$$ F_m(x) = F_{m-1}(x) + \eta \, h_m(x) $$

where $0 < \eta \leq 1$.
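In the sketch above, shrinkage amounts to scaling each tree's contribution before it is added to the running prediction. The `learning_rate` name mirrors the common library convention; the rest is illustrative.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

def fit_gbm_shrinkage(X, y, n_rounds=500, max_depth=3, learning_rate=0.1):
    """Additive boosting with shrinkage: each tree's contribution is scaled by eta."""
    f0 = y.mean()
    pred = np.full_like(y, f0, dtype=float)
    trees = []
    for m in range(n_rounds):
        residuals = y - pred
        tree = DecisionTreeRegressor(max_depth=max_depth)
        tree.fit(X, residuals)
        # Shrunken update: F_m(x) = F_{m-1}(x) + eta * h_m(x)
        pred += learning_rate * tree.predict(X)
        trees.append(tree)
    return f0, trees
```

At prediction time the same scaling applies: each stored tree's output is multiplied by the learning rate before being summed with the initial constant.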
The primary purpose of shrinkage is regularization. By reducing the influence of each individual tree (η<1), we slow down the learning process. Instead of allowing a single tree to make a large correction based on the current pseudo-residuals, shrinkage forces the model to take smaller steps in the function space towards minimizing the loss.
Consider the implications: because each tree contributes less, more boosting iterations (a larger `n_estimators`) are typically needed to achieve a similar level of performance on the training data compared to a model without shrinkage (η=1).

Think of it like gradient descent optimization. The learning rate η in GBM serves a purpose analogous to the step size in numerical optimization algorithms. A smaller step size requires more iterations to reach a minimum but can help avoid overshooting the optimal point and may find a better, more stable minimum, especially in complex loss landscapes.
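As a rough illustration (assuming scikit-learn's `GradientBoostingRegressor` and a synthetic dataset; the sizes and seeds are arbitrary), lowering the learning rate while increasing the number of trees tends to reach a comparable training fit:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=2000, n_features=20, noise=10.0, random_state=0)

# No shrinkage: large corrections, so relatively few trees fit the training data tightly.
fast = GradientBoostingRegressor(learning_rate=1.0, n_estimators=100,
                                 random_state=0).fit(X, y)

# Strong shrinkage: small corrections, so many more trees are needed for a similar fit.
slow = GradientBoostingRegressor(learning_rate=0.1, n_estimators=1000,
                                 random_state=0).fit(X, y)

print("train R^2, learning_rate=1.0, 100 trees :", fast.score(X, y))
print("train R^2, learning_rate=0.1, 1000 trees:", slow.score(X, y))
```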
There's an inherent trade-off between the learning rate (η) and the number of boosting iterations (M or `n_estimators`).
The following chart illustrates this concept. With a lower learning rate, both training and validation error decrease more slowly, but the validation error often reaches a lower minimum or starts increasing later (indicating overfitting onset) compared to a higher learning rate.
Comparison of error curves for different learning rates. A smaller learning rate (e.g., 0.1) converges more slowly but can potentially achieve better generalization (lower validation error) before overfitting, compared to a larger learning rate (e.g., 0.5). The optimal number of estimators differs significantly.
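Curves like these can be traced with scikit-learn's `staged_predict`, which yields the ensemble's prediction after every boosting stage. The sketch below (dataset and values are illustrative) records the validation error for two learning rates and reports where each curve bottoms out:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

X, y = make_regression(n_samples=3000, n_features=20, noise=15.0, random_state=0)
X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.3, random_state=0)

for lr in (0.5, 0.1):
    model = GradientBoostingRegressor(learning_rate=lr, n_estimators=500, random_state=0)
    model.fit(X_train, y_train)
    # staged_predict gives the prediction after each boosting iteration,
    # so the full validation-error curve comes from a single fit.
    val_errors = [mean_squared_error(y_val, pred)
                  for pred in model.staged_predict(X_val)]
    best_iter = int(np.argmin(val_errors)) + 1
    print(f"learning_rate={lr}: lowest validation MSE at iteration {best_iter}")
```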
In practice, the learning rate is tuned jointly with `n_estimators`. Techniques like grid search, random search, or Bayesian optimization are used, usually evaluating performance via cross-validation. Early stopping is often employed to find the optimal number of iterations for a given learning rate.
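For example, scikit-learn's gradient boosting estimators support a simple form of early stopping through `validation_fraction` and `n_iter_no_change`; the specific values below are only illustrative.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor

X, y = make_regression(n_samples=3000, n_features=20, noise=15.0, random_state=0)

model = GradientBoostingRegressor(
    learning_rate=0.1,
    n_estimators=2000,        # generous budget; early stopping decides when to halt
    validation_fraction=0.2,  # hold out 20% of the training data internally
    n_iter_no_change=20,      # stop if the validation score stalls for 20 rounds
    random_state=0,
)
model.fit(X, y)

# n_estimators_ reports how many trees were actually fitted before stopping.
print("Trees used:", model.n_estimators_)
```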
A lower learning rate can also tolerate deeper trees (`max_depth`) or less aggressive subsampling (`subsample`) without immediately overfitting, as the impact of each potentially more complex tree is dampened.

In summary, shrinkage is a simple yet highly effective technique for regularizing Gradient Boosting models. By scaling down the contribution of each newly added tree, it prevents individual trees from dominating the prediction, encourages the model to use more boosting rounds for a refined fit, and significantly improves its ability to generalize to new data. Finding the right balance between the learning rate and the number of estimators is a fundamental aspect of building high-performing GBMs.