As we saw in the introduction to this chapter, storing values for every state or state-action pair in a table becomes infeasible when dealing with large or continuous state spaces. Imagine trying to create a Q-table for a self-driving car whose state includes sensor readings such as camera images and lidar data: the state space is practically infinite. Tabular methods simply don't scale.
The solution is to move from explicit storage to estimation. Instead of learning the exact value vπ(s) or qπ(s,a) for every state or state-action pair, we learn a parameterized function that approximates these values. This technique is known as Value Function Approximation (VFA).
We introduce a function, let's call it v^, which takes a state s and a parameter vector w as input, and outputs an estimated state value:
$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$$

Similarly, we can approximate the action-value function qπ(s,a) with a function q^ that takes the state s, action a, and parameters w:

$$\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$$

Here, w is a vector of weights or parameters (e.g., coefficients in a linear model or weights in a neural network). The key idea is that the number of parameters in w is significantly smaller than the total number of states |S| or state-action pairs |S| × |A|. For instance, we might have millions of states but only need a few hundred or a few thousand parameters in w.
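To make this concrete, here is a minimal sketch of a linear value function approximator in Python with NumPy. The number of features and the particular feature encoding are illustrative assumptions, not something defined in this section; real feature construction is discussed later in the chapter.

```python
import numpy as np

# Illustrative assumption: a small, fixed number of features per state.
NUM_FEATURES = 8

# Parameter vector w: far fewer entries than the number of states |S|.
w = np.zeros(NUM_FEATURES)

def features(state):
    """Map a raw state to a fixed-length feature vector x(s).

    Hypothetical placeholder encoding for a scalar state; real feature
    construction depends on the problem and is covered in later sections.
    """
    x = np.zeros(NUM_FEATURES)
    x[0] = 1.0                   # bias feature
    x[1] = float(state) / 100.0  # e.g., a scaled component of the state
    return x

def v_hat(state, w):
    """Linear value estimate: v_hat(s, w) = w . x(s)."""
    return np.dot(w, features(state))
```

Note how the estimate for any state is computed on demand from its features, rather than looked up in a table indexed by the state itself.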
The primary advantage of using function approximation is generalization. Because the function approximator learns a relationship based on the parameters w, it can estimate values even for states it hasn't encountered before, or hasn't encountered very often. If two states s1 and s2 are considered "similar" (often determined by how we represent them, which we'll discuss soon), the function approximator will likely produce similar value estimates v^(s1,w) and v^(s2,w). This allows the agent to leverage experience gained in one part of the state space to make better decisions in other, similar parts. Tabular methods, in contrast, treat each state independently; learning about s1 tells you nothing about s2.
We can use various types of functions for v^ and q^. Common choices include linear combinations of state features and non-linear approximators such as neural networks.
In this course, we will primarily focus on linear methods and introduce the concepts behind using neural networks.
Our objective when using VFA is to find the parameter vector w that makes our approximation v^(s,w) or q^(s,a,w) as close as possible to the true value function vπ(s) or qπ(s,a) (or the optimal v∗(s) or q∗(s,a)). This is typically framed as minimizing an error objective, such as the Mean Squared Value Error (MSVE) over the distribution of states encountered:
$$\text{MSVE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} d(s)\,\big[v_\pi(s) - \hat{v}(s, \mathbf{w})\big]^2$$

where d(s) is a weighting indicating how much we care about the error in state s, often the fraction of time the agent spends in s under the policy.
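As an idealized sketch, the MSVE could be computed over a small finite set of states if we had oracle access to vπ(s). Here `true_values` and `state_weights` are hypothetical dictionaries used purely for illustration, and `v_hat` is the linear estimator sketched above.

```python
def msve(states, true_values, state_weights, w):
    """Mean Squared Value Error over a finite set of states.

    true_values[s]   : hypothetical oracle giving v_pi(s) (unknown in practice)
    state_weights[s] : the weighting d(s), e.g., visitation frequency
    """
    return sum(
        state_weights[s] * (true_values[s] - v_hat(s, w)) ** 2
        for s in states
    )
```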
While minimizing the MSVE directly is often the goal, the algorithms we'll use (like adaptations of TD learning) actually optimize slightly different objectives due to the nature of RL updates.
You might notice that finding the parameters w sounds similar to supervised learning. We have inputs (states s, or state-action pairs (s,a)) and we want to predict target outputs (the true values vπ(s) or qπ(s,a)). Indeed, we will use techniques like gradient descent, familiar from supervised learning, to update w.
However, there's a significant difference: in RL, we usually don't know the true target values vπ(s) or qπ(s,a). Instead, we use estimates of these values derived from interaction with the environment (e.g., observed rewards and subsequent estimated values, as in TD learning). This means our target values are often noisy, biased, and non-stationary (they change as the policy and value estimates improve), which presents unique challenges compared to standard supervised learning.
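As a sketch of how this plays out, a semi-gradient TD(0) update for the linear estimator above replaces the unknown vπ(s) with the bootstrapped target r + γ·v^(s′, w). The values of `alpha` and `gamma` are hypothetical hyperparameters, and `features`/`v_hat` are reused from the earlier sketch.

```python
def td0_update(w, state, reward, next_state, done, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) step for a linear value approximator.

    The target reward + gamma * v_hat(next_state, w) stands in for the
    unknown v_pi(state): it is noisy and shifts as w itself improves.
    """
    target = reward if done else reward + gamma * v_hat(next_state, w)
    td_error = target - v_hat(state, w)
    # For a linear v_hat, the gradient with respect to w is just x(s).
    return w + alpha * td_error * features(state)
```

This is the kind of gradient-based update we will derive properly in the following sections; the point here is that the "label" used in the update is itself an estimate produced by the current parameters.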
In the following sections, we'll look at how to represent states using features and how to apply gradient-based methods to learn the parameters w for both linear and non-linear function approximators within the RL framework.