In the previous chapter, we reviewed the additive modeling framework where models are built sequentially. Now, let's formalize how the standard Gradient Boosting Machine (GBM) algorithm constructs this sequence by minimizing a loss function using a process analogous to gradient descent, but operating in function space.
Our objective is to find a function $F(x)$ that minimizes the expected value of a specified loss function $L(y, F(x))$, where $y$ is the true target value and $F(x)$ is our model's prediction. Since we work with a finite training set $\{(x_i, y_i)\}_{i=1}^{N}$, we aim to minimize the empirical loss:
$$L(F) = \sum_{i=1}^{N} L(y_i, F(x_i))$$

Gradient Boosting builds the model $F(x)$ in an additive manner:
$$F(x) = F_M(x) = \sum_{m=0}^{M} h_m(x) = F_0(x) + \sum_{m=1}^{M} h_m(x)$$

Here, $F_0(x)$ is an initial guess for the model (often the mean of the target values for regression, or the log-odds for classification), and the $h_m(x)$ are the base learners (typically decision trees) added sequentially at each iteration $m$. The weight $\beta_m$ representing the step size is often absorbed into the learner $h_m(x)$ or handled by a separate learning rate parameter, as we will see.
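As a brief sketch of this additive structure (names like `f0`, `trees`, and `learning_rate` are illustrative placeholders, not a specific library's API), the final prediction is just the initial constant plus the scaled outputs of the fitted base learners:

```python
import numpy as np

def predict_ensemble(X, f0, trees, learning_rate=0.1):
    """Evaluate F_M(x) = F_0(x) + nu * sum_m h_m(x) for already-fitted learners."""
    pred = np.full(X.shape[0], f0)           # F_0(x): constant initial guess
    for tree in trees:                        # each base learner h_m
        pred += learning_rate * tree.predict(X)
    return pred
```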
Suppose we have already built the model up to iteration $m-1$, resulting in $F_{m-1}(x)$. We want to find the next base learner $h_m(x)$ such that adding it to the current model improves the fit and reduces the overall loss:
$$F_m(x) = F_{m-1}(x) + h_m(x)$$

Ideally, we want $h_m(x)$ to point in the direction that maximally reduces the loss function $L$. Consider the loss at iteration $m$:
$$L(F_m) = \sum_{i=1}^{N} L(y_i, F_{m-1}(x_i) + h_m(x_i))$$

Think of this as taking a step $h_m(x)$ from the current position $F_{m-1}(x)$ in function space. In standard gradient descent, we update parameters by moving in the direction of the negative gradient of the loss with respect to those parameters. In Gradient Boosting, we do something analogous: we find a function $h_m(x)$ that points in the direction of the negative gradient of the loss $L$, evaluated with respect to the current model's predictions $F_{m-1}(x_i)$ at each data point $i$.
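To make the function-space analogy concrete, here is a minimal sketch (assuming squared error loss) that runs plain gradient descent directly on the prediction values themselves, with no model involved; boosting replaces this raw gradient step with a tree trained to approximate it:

```python
import numpy as np

# Gradient descent directly on the predictions F(x_i) for squared error loss.
y = np.array([3.0, -0.5, 2.0, 7.0])   # targets y_i
F = np.zeros_like(y)                   # initial predictions F(x_i) = 0
eta = 0.5                              # step size

for _ in range(20):
    grad = -(y - F)                    # dL/dF for L = 0.5 * (y - F)^2
    F -= eta * grad                    # move along the negative gradient

# F is now very close to y. GBM takes the same kind of step, but the step for
# unseen inputs must come from a fitted function h_m(x), not raw numbers.
```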
Let's calculate the gradient of the loss function $L$ with respect to the function values $F(x_i)$ at each point $x_i$, evaluated at the current model $F_{m-1}$:
$$g_{im} = \left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}$$

The direction we want to move in for each observation $i$ is the negative gradient, $-g_{im}$. These negative gradient values are often called pseudo-residuals:
$$r_{im} = -g_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}$$

Why pseudo-residuals? If we use the squared error loss $L(y, F) = \frac{1}{2}(y - F)^2$, the gradient is $\frac{\partial L}{\partial F} = -(y - F)$. The negative gradient is then $-(-(y - F)) = y - F$, which is exactly the residual (the difference between the true value and the prediction). For other loss functions, $-g_{im}$ behaves like a residual, indicating the direction and magnitude of the error for each point according to that specific loss function.
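As a concrete sketch, the pseudo-residuals for two common losses can be written directly in code; the log-loss case assumes binary labels $y \in \{0, 1\}$ and that $f$ holds raw log-odds scores:

```python
import numpy as np

def pseudo_residuals_squared_error(y, f):
    """Negative gradient of 0.5 * (y - F)^2 with respect to F: the ordinary residual."""
    return y - f

def pseudo_residuals_log_loss(y, f):
    """Negative gradient of the binomial log loss with respect to the raw score F.

    With y in {0, 1} and F interpreted as log-odds, this is y - sigmoid(F).
    """
    return y - 1.0 / (1.0 + np.exp(-f))
```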
The core idea is to fit the next base learner, $h_m(x)$, to approximate these pseudo-residuals $\{r_{im}\}_{i=1}^{N}$. That is, we train $h_m(x)$ using the original features $x_i$ but with $r_{im}$ as the target variable:
$$h_m \approx \arg\min_{h} \sum_{i=1}^{N} \left( r_{im} - h(x_i) \right)^2$$

Usually, $h_m(x)$ is constrained to be a shallow decision tree (e.g., CART). The tree is built to predict the pseudo-residuals from the input features.
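In scikit-learn terms, this fitting step could look like the sketch below, where `X` is the training feature matrix and `residuals` holds the pseudo-residuals $r_{im}$ computed above; `max_depth=3` is just an illustrative depth for a shallow tree:

```python
from sklearn.tree import DecisionTreeRegressor

# Fit a shallow regression tree h_m(x) to the pseudo-residuals.
# The tree's splits define the terminal regions R_jm.
h_m = DecisionTreeRegressor(max_depth=3)
h_m.fit(X, residuals)
```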
Once the structure of the tree $h_m(x)$ (i.e., its splits) has been determined by fitting to the pseudo-residuals, we still need the optimal values $\gamma_{jm}$ for its terminal nodes (leaves) $R_{jm}$. Instead of simply using the average pseudo-residual in each leaf, we can find the constant value $\gamma_{jm}$ for each leaf $j$ that minimizes the original loss function $L$ over the samples falling into that leaf:
$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$$

This step ensures that the contribution from the new tree directly minimizes the loss function, given the current model $F_{m-1}$. For some loss functions (like squared error), this simplifies to the average of the pseudo-residuals in the leaf, but for others (like log loss or absolute error), it requires a different calculation (e.g., the median for absolute error). The base learner $h_m(x)$ is then defined such that $h_m(x_i) = \gamma_{jm}$ if $x_i \in R_{jm}$.
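Continuing the same sketch, and assuming absolute error as the loss (so the per-leaf minimizer is a median), the leaf values can be recomputed from the samples that land in each leaf; here `F_prev` denotes the current training-set predictions $F_{m-1}(x_i)$:

```python
import numpy as np

# Overwrite each leaf's output with the gamma_jm that minimizes the original
# loss over the samples in that leaf. For absolute error this is the median of
# y_i - F_{m-1}(x_i); for squared error it would be the mean (the default fit).
leaf_ids = h_m.apply(X)                    # terminal node index for each sample
gamma = {}
for leaf in np.unique(leaf_ids):
    in_leaf = leaf_ids == leaf
    gamma[leaf] = np.median(y[in_leaf] - F_prev[in_leaf])
```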
Finally, we update the overall model $F(x)$ by adding the newly trained base learner $h_m(x)$, scaled by a learning rate $\nu$ (also known as shrinkage):
$$F_m(x) = F_{m-1}(x) + \nu \, h_m(x)$$

The learning rate $\nu$ (typically a small value between 0.01 and 0.3) scales the contribution of each new tree. This reduces the impact of individual trees and helps prevent overfitting, forcing the model to build its prediction more gradually over many iterations. It effectively provides regularization.
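Picking up the same sketch, the update step applies the shrunken leaf values to obtain the new training-set predictions; `learning_rate = 0.1` is an illustrative value for $\nu$:

```python
import numpy as np

# F_m(x_i) = F_{m-1}(x_i) + nu * h_m(x_i), where h_m(x_i) is the optimized
# leaf value gamma_jm of the leaf that x_i falls into.
learning_rate = 0.1
leaf_output = np.array([gamma[leaf] for leaf in leaf_ids])
F_current = F_prev + learning_rate * leaf_output
```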
We can now outline the steps of the generic Gradient Boosting algorithm:
Initialize Model: Start with an initial constant prediction $F_0(x) = \arg\min_{\gamma} \sum_{i=1}^{N} L(y_i, \gamma)$.
Iterate for $m = 1$ to $M$ (Number of Trees):

a. Compute Pseudo-Residuals: For each sample $i = 1, \dots, N$:
$$r_{im} = -\left[ \frac{\partial L(y_i, F(x_i))}{\partial F(x_i)} \right]_{F(x) = F_{m-1}(x)}$$

b. Fit Base Learner: Fit a base learner (e.g., a regression tree) $h_m(x)$ to the pseudo-residuals $\{(x_i, r_{im})\}_{i=1}^{N}$. This determines the regions (leaves) $R_{jm}$ of the tree.

c. Compute Optimal Leaf Values: For each terminal node (leaf) $j = 1, \dots, J_m$ of the tree $h_m(x)$:
$$\gamma_{jm} = \arg\min_{\gamma} \sum_{x_i \in R_{jm}} L(y_i, F_{m-1}(x_i) + \gamma)$$

d. Update Model: Update the ensemble model using the learning rate $\nu$:
$$F_m(x) = F_{m-1}(x) + \nu \sum_{j=1}^{J_m} \gamma_{jm} \, I(x \in R_{jm})$$
(Here $I(\cdot)$ is the indicator function; effectively, $h_m(x)$ now represents the tree with optimal leaf values $\gamma_{jm}$.) A compact code sketch of this loop follows the outline.
Output Final Model: The final prediction is given by $F_M(x)$. For classification, this raw score might be converted to probabilities (e.g., via the sigmoid function for binary classification).
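Putting the pieces together, the following is a compact, illustrative sketch of the full loop for squared error regression (the class name and parameters are ours, not a particular library's API). For squared error, the optimal leaf values coincide with the mean residuals the tree already predicts, so no separate leaf adjustment is needed:

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

class SimpleGBMRegressor:
    """Minimal gradient boosting sketch for squared error loss."""

    def __init__(self, n_estimators=100, learning_rate=0.1, max_depth=3):
        self.n_estimators = n_estimators
        self.learning_rate = learning_rate   # nu
        self.max_depth = max_depth
        self.trees = []

    def fit(self, X, y):
        # Step 1: initialize with the constant minimizing squared error (the mean).
        self.f0 = np.mean(y)
        F = np.full(len(y), self.f0)

        for _ in range(self.n_estimators):
            # Step 2a: pseudo-residuals (negative gradient of squared error).
            residuals = y - F
            # Step 2b: fit a shallow regression tree to the pseudo-residuals.
            tree = DecisionTreeRegressor(max_depth=self.max_depth)
            tree.fit(X, residuals)
            # Step 2c: for squared error, each leaf's optimal gamma_jm is the
            # mean residual in that leaf, which the tree already outputs.
            # Step 2d: update the model with the shrunken contribution.
            F = F + self.learning_rate * tree.predict(X)
            self.trees.append(tree)
        return self

    def predict(self, X):
        # F_M(x) = F_0 + nu * sum_m h_m(x)
        pred = np.full(X.shape[0], self.f0)
        for tree in self.trees:
            pred += self.learning_rate * tree.predict(X)
        return pred
```

For binary classification with log loss, the same skeleton applies with a log-odds initialization, pseudo-residuals $y_i - \sigma(F_{m-1}(x_i))$, and a dedicated leaf-value calculation, with the final score passed through the sigmoid to obtain probabilities.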
Iterative process of the Gradient Boosting Machine (GBM) algorithm. Each iteration involves computing gradients (pseudo-residuals), fitting a base learner to these gradients, optimizing the learner's contribution, and updating the ensemble model.
This step-by-step derivation shows how GBM iteratively refines its predictions by sequentially adding base learners trained to correct the errors of the existing ensemble, as defined by the negative gradient of the loss function. The choice of loss function $L$ dictates the specific form of the pseudo-residuals and potentially the leaf optimization step, allowing GBM to be adapted to various regression and classification tasks, which we will explore next.