In the previous section, we saw how gradient descent could be used to learn the parameters $\theta$ of our function approximator $\hat{v}(s, \theta)$. The idea was to minimize the squared error between the predicted value $\hat{v}(S_t, \theta)$ and some target value $U_t$. For Monte Carlo methods, this target $U_t$ was the actual return $G_t$ from that episode, which doesn't depend on our current value estimates. This allows us to perform true gradient descent.
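As a point of reference for the contrast that follows, here is a minimal sketch of that Monte Carlo update for the linear case introduced earlier, $\hat{v}(s, \theta) = \theta^\top \mathbf{x}(s)$; the function name and arguments are illustrative, not part of any library.

```python
import numpy as np

def mc_gradient_update(theta, x_s, G_t, alpha):
    """One Monte Carlo gradient-descent step on the squared error
    between the observed return G_t and the linear prediction theta @ x(s)."""
    prediction = theta @ x_s            # v_hat(S_t, theta)
    error = G_t - prediction            # the target G_t is a fixed, observed number
    return theta + alpha * error * x_s  # for a linear v_hat, its gradient is x(S_t)
```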
Now, let's consider Temporal Difference (TD) learning. Recall from Chapter 5 that TD methods update the value estimate for a state $S_t$ based on the observed reward $R_{t+1}$ and the estimated value of the next state $S_{t+1}$. The TD target for the update at time step $t$ is:

$$Y_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t)$$

Here, $\hat{v}(S_{t+1}, \theta_t)$ is the current estimate of the value of the next state, using the current parameters $\theta_t$. This is where things get interesting when combined with function approximation.
If we try to apply the same gradient descent approach as before, aiming to minimize the Mean Squared Error (MSE) between the prediction $\hat{v}(S_t, \theta)$ and the TD target $Y_t$, we run into a subtle issue. The loss for a single transition looks like:

$$L(\theta) = \frac{1}{2}\big[Y_t - \hat{v}(S_t, \theta)\big]^2 = \frac{1}{2}\big[R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta) - \hat{v}(S_t, \theta)\big]^2$$

Notice that the target $Y_t$ itself depends on the parameters $\theta$ because it includes the term $\hat{v}(S_{t+1}, \theta)$. A true gradient descent update would require taking the gradient of the entire expression with respect to $\theta$. This involves calculating the gradient of the target value $\hat{v}(S_{t+1}, \theta)$, which can be complex and computationally expensive.
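To see exactly what gets ignored later, we can write out that full gradient. Applying the chain rule to both appearances of $\theta$ gives:

$$\nabla_\theta L(\theta) = -\big[R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta) - \hat{v}(S_t, \theta)\big]\big[\nabla_\theta \hat{v}(S_t, \theta) - \gamma \nabla_\theta \hat{v}(S_{t+1}, \theta)\big]$$

The extra $\gamma \nabla_\theta \hat{v}(S_{t+1}, \theta)$ term is the contribution of the target; it is exactly this term that the semi-gradient approach below drops.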
More significantly, the target $Y_t$ is based on an estimate, which is inherently noisy and biased (since it depends on the current, possibly inaccurate, weights $\theta$). Updating our parameters based on the gradient of this potentially flawed target can lead to instability or slow convergence.
To overcome this, TD methods with function approximation typically employ what's called a semi-gradient method. The core idea is simple but effective: when calculating the gradient for the update, we treat the TD target $Y_t$ as if it were a fixed, observed value, just like the return $G_t$ in Monte Carlo methods. We ignore the fact that $Y_t$ depends on the current parameters $\theta_t$.

Essentially, we compute the gradient only with respect to our prediction $\hat{v}(S_t, \theta)$, not the target part $R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t)$.
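In automatic-differentiation frameworks, this "treat the target as a constant" rule is usually implemented by blocking gradients through the target. The following is a minimal PyTorch sketch under that assumption; `value_net` is a hypothetical differentiable module mapping a state tensor to a scalar value estimate, not something defined in this text.

```python
import torch

def semi_gradient_td_step(value_net, optimizer, state, reward, next_state, gamma):
    """One semi-gradient TD(0) update: gradients flow through the prediction only."""
    prediction = value_net(state)                     # v_hat(S_t, theta), tracked by autograd
    with torch.no_grad():                             # the target is treated as a fixed number
        target = reward + gamma * value_net(next_state)
    loss = 0.5 * (target - prediction).pow(2).mean()  # squared TD error for this transition
    optimizer.zero_grad()
    loss.backward()                                   # backpropagates only through the prediction
    optimizer.step()                                  # theta <- theta - alpha * semi-gradient
    return loss.item()
```

Calling `.detach()` on the target tensor would have the same effect as the `torch.no_grad()` block.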
The gradient of the simplified loss (treating $Y_t$ as constant) is:

$$\nabla_\theta L(\theta) \approx \nabla_\theta \frac{1}{2}\big[(R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t)) - \hat{v}(S_t, \theta)\big]^2$$

$$\nabla_\theta L(\theta) \approx -\big[R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta)\big]\nabla_\theta \hat{v}(S_t, \theta)$$

Remember that gradient descent updates parameters in the opposite direction of the gradient. So, the update rule for the weights $\theta$ becomes:
$$\theta_{t+1} \leftarrow \theta_t - \alpha \nabla_\theta L(\theta)$$

$$\theta_{t+1} \leftarrow \theta_t + \alpha\big[R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta_t)\big]\nabla_\theta \hat{v}(S_t, \theta_t)$$

Let's break this down:

- $\alpha$ is the step size, controlling how far the weights move on each update.
- The term in brackets, $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta_t)$, is the TD error: the difference between the (fixed) TD target and the current prediction.
- $\nabla_\theta \hat{v}(S_t, \theta_t)$ is the gradient of the prediction only, telling us how to adjust each weight to increase the predicted value of $S_t$.
This is called a "semi-gradient" method because we are only using part of the true gradient. We're taking the gradient of our prediction $\hat{v}(S_t, \theta)$ but ignoring the gradient of the target $\hat{v}(S_{t+1}, \theta)$.
Let's make this concrete for the case of linear function approximation, where our value estimate is $\hat{v}(s, \theta) = \theta^\top \mathbf{x}(s)$, and $\mathbf{x}(s)$ is the feature vector for state $s$.

As we saw earlier, the gradient of the linear value function with respect to the parameters $\theta$ is simply the feature vector itself:

$$\nabla_\theta \hat{v}(s, \theta) = \mathbf{x}(s)$$

Substituting this into the general semi-gradient TD update rule gives us the update rule for Linear Semi-gradient TD(0):
$$\theta_{t+1} \leftarrow \theta_t + \alpha\big[R_{t+1} + \gamma\,\theta_t^\top \mathbf{x}(S_{t+1}) - \theta_t^\top \mathbf{x}(S_t)\big]\mathbf{x}(S_t)$$

This update rule is computationally efficient and often performs very well in practice. After observing a transition $(S_t, A_t, R_{t+1}, S_{t+1})$, we:

1. Compute the feature vectors $\mathbf{x}(S_t)$ and $\mathbf{x}(S_{t+1})$.
2. Form the TD error $\delta_t = R_{t+1} + \gamma\,\theta_t^\top \mathbf{x}(S_{t+1}) - \theta_t^\top \mathbf{x}(S_t)$, where the value of a terminal $S_{t+1}$ is taken to be zero.
3. Update the weights: $\theta_{t+1} = \theta_t + \alpha\,\delta_t\,\mathbf{x}(S_t)$.
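The sketch below spells out these three steps in NumPy; the function name and the one-hot features in the usage snippet are illustrative choices for this example, not part of any library.

```python
import numpy as np

def linear_semi_gradient_td0(theta, x_s, x_s_next, reward, gamma, alpha, terminal=False):
    """One step of linear semi-gradient TD(0): theta <- theta + alpha * delta * x(S_t)."""
    v_s = theta @ x_s                                 # theta^T x(S_t)
    v_s_next = 0.0 if terminal else theta @ x_s_next  # terminal states have value 0
    td_error = reward + gamma * v_s_next - v_s        # delta_t, with the target held fixed
    return theta + alpha * td_error * x_s             # gradient of theta^T x(S_t) is x(S_t)

# Illustrative usage with one-hot features for a small 5-state problem.
d = 5
theta = np.zeros(d)
x = np.eye(d)  # x(s) is the one-hot vector for state index s
theta = linear_semi_gradient_td0(theta, x[2], x[3], reward=1.0, gamma=0.9, alpha=0.1)
```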
The diagram below illustrates the flow of information in one step of a semi-gradient TD update.
Flow diagram for a single update step in Semi-gradient TD(0) with function approximation.
While semi-gradient methods are widely used and often effective, it's important to understand that they are not true gradient descent methods on the Bellman error. Because we ignore the gradient dependency in the target, we lose the theoretical convergence guarantees associated with standard gradient descent on a fixed objective function.
In some cases, particularly when using non-linear function approximators or off-policy learning (like trying to learn Q-values for a target policy $\pi$ while following a different behavior policy $b$), semi-gradient methods can become unstable and the parameters might diverge. This is often referred to as the "deadly triad": function approximation, bootstrapping (TD updates), and off-policy learning.
However, for on-policy TD learning with linear function approximation, semi-gradient methods are generally stable and converge reliably. They do not necessarily reach the best possible weights, but they settle near the optimal solution achievable with the chosen features. The convergence is typically to a fixed point that minimizes the Projected Bellman Error; a detailed discussion of this objective is beyond our current scope, but it is a well-defined quantity.
Semi-gradient methods provide a pragmatic and computationally feasible way to combine the power of TD learning with the necessity of function approximation for large-scale problems. They form the foundation for many advanced algorithms, including the Deep Q-Networks we will touch upon later. Next, we'll briefly consider using more powerful, non-linear function approximators like neural networks.