In the previous section, we saw how gradient descent could be used to learn the parameters $\theta$ of our function approximator $\hat{v}(s, \theta)$. The idea was to minimize the squared error between the predicted value $\hat{v}(S_t, \theta)$ and some target value $U_t$. For Monte Carlo methods, this target $U_t$ was the actual return $G_t$ from that episode, which doesn't depend on our current value estimates. This allows us to perform true gradient descent.
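As a point of reference for the contrast that follows, here is a minimal sketch of that Monte Carlo update for the linear case introduced earlier, $\hat{v}(s, \theta) = \theta^\top \mathbf{x}(s)$; the function name and arguments are illustrative, not part of any library.

```python
import numpy as np

def mc_gradient_update(theta, x_s, G_t, alpha):
    """One Monte Carlo gradient-descent step on the squared error
    between the observed return G_t and the linear prediction theta @ x(s)."""
    prediction = theta @ x_s            # v_hat(S_t, theta)
    error = G_t - prediction            # the target G_t is a fixed, observed number
    return theta + alpha * error * x_s  # for a linear v_hat, its gradient is x(S_t)
```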
Now, let's consider Temporal Difference (TD) learning. Recall from Chapter 5 that TD methods update the value estimate for a state $S_t$ based on the observed reward $R_{t+1}$ and the estimated value of the next state $S_{t+1}$. The TD target for the update at time step $t$ is:

$$Y_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t)$$

Here, $\hat{v}(S_{t+1}, \theta_t)$ is the current estimate of the value of the next state, using the current parameters $\theta_t$. This is where things get interesting when combined with function approximation.
If we try to apply the same gradient descent approach as before, aiming to minimize the Mean Squared Error (MSE) between the prediction $\hat{v}(S_t, \theta)$ and the TD target $Y_t$, we run into a subtle issue. The loss for a single transition looks like:

$$L(\theta) = \frac{1}{2}\big[Y_t - \hat{v}(S_t, \theta)\big]^2 = \frac{1}{2}\big[R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta) - \hat{v}(S_t, \theta)\big]^2$$

Notice that the target $Y_t$ itself depends on the parameters $\theta$ because it includes the term $\hat{v}(S_{t+1}, \theta)$. A true gradient descent update would require taking the gradient of the entire expression with respect to $\theta$. This involves calculating the gradient of the target value $\hat{v}(S_{t+1}, \theta)$, which can be complex and computationally expensive.
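To see exactly what gets ignored later, we can write out that full gradient. Applying the chain rule to both appearances of $\theta$ gives:

$$\nabla_\theta L(\theta) = -\big[R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta) - \hat{v}(S_t, \theta)\big]\big[\nabla_\theta \hat{v}(S_t, \theta) - \gamma \nabla_\theta \hat{v}(S_{t+1}, \theta)\big]$$

The extra $\gamma \nabla_\theta \hat{v}(S_{t+1}, \theta)$ term is the contribution of the target; it is exactly this term that the semi-gradient approach below drops.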
More significantly, the target $Y_t$ is based on an estimate, which is inherently noisy and biased (since it depends on the current, possibly inaccurate, weights $\theta$). Updating our parameters based on the gradient of this potentially flawed target can lead to instability or slow convergence.
To overcome this, TD methods with function approximation typically employ what's called a semi-gradient method. The core idea is simple but effective: when calculating the gradient for the update, we treat the TD target $Y_t$ as if it were a fixed, observed value, just like the return $G_t$ in Monte Carlo methods. We ignore the fact that $Y_t$ depends on the current parameters $\theta_t$.

Essentially, we compute the gradient only with respect to our prediction $\hat{v}(S_t, \theta)$, not the target part $R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t)$.
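In automatic-differentiation frameworks, this "treat the target as a constant" rule is usually implemented by blocking gradients through the target. The following is a minimal PyTorch sketch under that assumption; `value_net` is a hypothetical differentiable module mapping a state tensor to a scalar value estimate, not something defined in this text.

```python
import torch

def semi_gradient_td_step(value_net, optimizer, state, reward, next_state, gamma):
    """One semi-gradient TD(0) update: gradients flow through the prediction only."""
    prediction = value_net(state)                     # v_hat(S_t, theta), tracked by autograd
    with torch.no_grad():                             # the target is treated as a fixed number
        target = reward + gamma * value_net(next_state)
    loss = 0.5 * (target - prediction).pow(2).mean()  # squared TD error for this transition
    optimizer.zero_grad()
    loss.backward()                                   # backpropagates only through the prediction
    optimizer.step()                                  # theta <- theta - alpha * semi-gradient
    return loss.item()
```

Calling `.detach()` on the target tensor would have the same effect as the `torch.no_grad()` block.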
The gradient of the simplified loss (treating $Y_t$ as constant) is:

$$\nabla_\theta L(\theta) \approx \nabla_\theta \frac{1}{2}\big[(R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t)) - \hat{v}(S_t, \theta)\big]^2$$

$$\nabla_\theta L(\theta) \approx -\big[R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta)\big]\nabla_\theta \hat{v}(S_t, \theta)$$

Remember that gradient descent updates parameters in the opposite direction of the gradient. So, the update rule for the weights $\theta$ becomes:
$$\theta_{t+1} \leftarrow \theta_t - \alpha \nabla_\theta L(\theta)$$

$$\theta_{t+1} \leftarrow \theta_t + \alpha\big[R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta_t)\big]\nabla_\theta \hat{v}(S_t, \theta_t)$$

Let's break this down:

- $\alpha$ is the step size, controlling how far the weights move on each update.
- The term in brackets, $\delta_t = R_{t+1} + \gamma \hat{v}(S_{t+1}, \theta_t) - \hat{v}(S_t, \theta_t)$, is the TD error: the difference between the (fixed) TD target and the current prediction.
- $\nabla_\theta \hat{v}(S_t, \theta_t)$ is the gradient of the prediction only, telling us how to adjust each weight to increase the predicted value of $S_t$.
This is called a "semi-gradient" method because we are only using part of the true gradient. We're taking the gradient of our prediction $\hat{v}(S_t, \theta)$ but ignoring the gradient of the target $\hat{v}(S_{t+1}, \theta)$.
Let's make this concrete for the case of linear function approximation, where our value estimate is $\hat{v}(s, \theta) = \theta^\top \mathbf{x}(s)$, and $\mathbf{x}(s)$ is the feature vector for state $s$.

As we saw earlier, the gradient of the linear value function with respect to the parameters $\theta$ is simply the feature vector itself:

$$\nabla_\theta \hat{v}(s, \theta) = \mathbf{x}(s)$$

Substituting this into the general semi-gradient TD update rule gives us the update rule for Linear Semi-gradient TD(0):
$$\theta_{t+1} \leftarrow \theta_t + \alpha\big[R_{t+1} + \gamma\,\theta_t^\top \mathbf{x}(S_{t+1}) - \theta_t^\top \mathbf{x}(S_t)\big]\mathbf{x}(S_t)$$

This update rule is computationally efficient and often performs very well in practice. After observing a transition $(S_t, A_t, R_{t+1}, S_{t+1})$, we:

1. Compute the feature vectors $\mathbf{x}(S_t)$ and $\mathbf{x}(S_{t+1})$.
2. Form the TD error $\delta_t = R_{t+1} + \gamma\,\theta_t^\top \mathbf{x}(S_{t+1}) - \theta_t^\top \mathbf{x}(S_t)$, where the value of a terminal $S_{t+1}$ is taken to be zero.
3. Update the weights: $\theta_{t+1} = \theta_t + \alpha\,\delta_t\,\mathbf{x}(S_t)$.
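The sketch below spells out these three steps in NumPy; the function name and the one-hot features in the usage snippet are illustrative choices for this example, not part of any library.

```python
import numpy as np

def linear_semi_gradient_td0(theta, x_s, x_s_next, reward, gamma, alpha, terminal=False):
    """One step of linear semi-gradient TD(0): theta <- theta + alpha * delta * x(S_t)."""
    v_s = theta @ x_s                                 # theta^T x(S_t)
    v_s_next = 0.0 if terminal else theta @ x_s_next  # terminal states have value 0
    td_error = reward + gamma * v_s_next - v_s        # delta_t, with the target held fixed
    return theta + alpha * td_error * x_s             # gradient of theta^T x(S_t) is x(S_t)

# Illustrative usage with one-hot features for a small 5-state problem.
d = 5
theta = np.zeros(d)
x = np.eye(d)  # x(s) is the one-hot vector for state index s
theta = linear_semi_gradient_td0(theta, x[2], x[3], reward=1.0, gamma=0.9, alpha=0.1)
```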
The diagram below illustrates the flow of information in one step of a semi-gradient TD update.
Flow diagram for a single update step in Semi-gradient TD(0) with function approximation.
While semi-gradient methods are widely used and often effective, it's important to understand that they are not true gradient descent methods on the Bellman error. Because we ignore the gradient dependency in the target, we lose the theoretical convergence guarantees associated with standard gradient descent on a fixed objective function.
In some cases, particularly when using non-linear function approximators or off-policy learning (like trying to learn Q-values for a target policy $\pi$ while following a different behavior policy $b$), semi-gradient methods can become unstable and the parameters might diverge. This is often referred to as the "deadly triad": function approximation, bootstrapping (TD updates), and off-policy learning.
However, for on-policy TD learning with linear function approximation, semi-gradient methods are generally stable and converge reliably. They do not necessarily reach the best possible weights, but they settle near the optimal solution achievable with the chosen features. The convergence is typically to a fixed point that minimizes the Projected Bellman Error; a detailed discussion of this objective is beyond our current scope, but it is a well-defined quantity.
Semi-gradient methods provide a pragmatic and computationally feasible way to combine the power of TD learning with the necessity of function approximation for large-scale problems. They form the foundation for many advanced algorithms, including the Deep Q-Networks we will touch upon later. Next, we'll briefly consider using more powerful, non-linear function approximators like neural networks.