As we saw in the introduction to this chapter, storing values for every state or state-action pair in a table becomes infeasible when dealing with large or continuous state spaces. Imagine trying to create a Q-table for a self-driving car whose state includes sensor readings such as camera images and lidar data: the state space is practically infinite. Tabular methods simply don't scale.
The solution is to move from explicit storage to estimation. Instead of learning the exact value vπ(s) or qπ(s,a) for every state or state-action pair, we learn a parameterized function that approximates these values. This technique is known as Value Function Approximation (VFA).
We introduce a function, let's call it v^, which takes a state s and a parameter vector w as input, and outputs an estimated state value:
$$\hat{v}(s, \mathbf{w}) \approx v_\pi(s)$$

Similarly, we can approximate the action-value function qπ(s,a) with a function q^ that takes the state s, action a, and parameters w:

$$\hat{q}(s, a, \mathbf{w}) \approx q_\pi(s, a)$$

Here, w is a vector of weights or parameters (e.g., coefficients in a linear model or weights in a neural network). The key idea is that the number of parameters in w is significantly smaller than the total number of states |S| or state-action pairs |S| × |A|. For instance, we might have millions of states but only need a few hundred or a few thousand parameters in w.
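To make this concrete, here is a minimal sketch of a linear value function approximator in Python with NumPy. The number of features and the particular feature encoding are illustrative assumptions, not something defined in this section; real feature construction is discussed later in the chapter.

```python
import numpy as np

# Illustrative assumption: a small, fixed number of features per state.
NUM_FEATURES = 8

# Parameter vector w: far fewer entries than the number of states |S|.
w = np.zeros(NUM_FEATURES)

def features(state):
    """Map a raw state to a fixed-length feature vector x(s).

    Hypothetical placeholder encoding for a scalar state; real feature
    construction depends on the problem and is covered in later sections.
    """
    x = np.zeros(NUM_FEATURES)
    x[0] = 1.0                   # bias feature
    x[1] = float(state) / 100.0  # e.g., a scaled component of the state
    return x

def v_hat(state, w):
    """Linear value estimate: v_hat(s, w) = w . x(s)."""
    return np.dot(w, features(state))
```

Note how the estimate for any state is computed on demand from its features, rather than looked up in a table indexed by the state itself.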
The primary advantage of using function approximation is generalization. Because the function approximator learns a relationship based on the parameters w, it can estimate values even for states it hasn't encountered before, or hasn't encountered very often. If two states s1 and s2 are considered "similar" (often determined by how we represent them, which we'll discuss soon), the function approximator will likely produce similar value estimates v^(s1,w) and v^(s2,w). This allows the agent to leverage experience gained in one part of the state space to make better decisions in other, similar parts. Tabular methods, in contrast, treat each state independently; learning about s1 tells you nothing about s2.
We can use various types of functions for v^ and q^. Common choices include linear combinations of state features and non-linear approximators such as neural networks.
In this course, we will primarily focus on linear methods and introduce the concepts behind using neural networks.
Our objective when using VFA is to find the parameter vector w that makes our approximation v^(s,w) or q^(s,a,w) as close as possible to the true value function vπ(s) or qπ(s,a) (or the optimal v∗(s) or q∗(s,a)). This is typically framed as minimizing an error objective, such as the Mean Squared Value Error (MSVE) over the distribution of states encountered:
$$\text{MSVE}(\mathbf{w}) = \sum_{s \in \mathcal{S}} d(s)\,\big[v_\pi(s) - \hat{v}(s, \mathbf{w})\big]^2$$

where d(s) is a weighting indicating how much we care about the error in state s, often the fraction of time the agent spends in s under the policy.
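As an idealized sketch, the MSVE could be computed over a small finite set of states if we had oracle access to vπ(s). Here `true_values` and `state_weights` are hypothetical dictionaries used purely for illustration, and `v_hat` is the linear estimator sketched above.

```python
def msve(states, true_values, state_weights, w):
    """Mean Squared Value Error over a finite set of states.

    true_values[s]   : hypothetical oracle giving v_pi(s) (unknown in practice)
    state_weights[s] : the weighting d(s), e.g., visitation frequency
    """
    return sum(
        state_weights[s] * (true_values[s] - v_hat(s, w)) ** 2
        for s in states
    )
```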
While minimizing the MSVE directly is often the goal, the algorithms we'll use (like adaptations of TD learning) actually optimize slightly different objectives due to the nature of RL updates.
You might notice that finding the parameters w sounds similar to supervised learning. We have inputs (states s, or state-action pairs (s,a)) and we want to predict target outputs (the true values vπ(s) or qπ(s,a)). Indeed, we will use techniques like gradient descent, familiar from supervised learning, to update w.
However, there's a significant difference: in RL, we usually don't know the true target values vπ(s) or qπ(s,a). Instead, we use estimates of these values derived from interaction with the environment (e.g., observed rewards and subsequent estimated values, as in TD learning). This means our target values are often noisy, biased, and non-stationary (they change as the policy and value estimates improve), which presents unique challenges compared to standard supervised learning.
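As a sketch of how this plays out, a semi-gradient TD(0) update for the linear estimator above replaces the unknown vπ(s) with the bootstrapped target r + γ·v^(s′, w). The values of `alpha` and `gamma` are hypothetical hyperparameters, and `features`/`v_hat` are reused from the earlier sketch.

```python
def td0_update(w, state, reward, next_state, done, alpha=0.01, gamma=0.99):
    """One semi-gradient TD(0) step for a linear value approximator.

    The target reward + gamma * v_hat(next_state, w) stands in for the
    unknown v_pi(state): it is noisy and shifts as w itself improves.
    """
    target = reward if done else reward + gamma * v_hat(next_state, w)
    td_error = target - v_hat(state, w)
    # For a linear v_hat, the gradient with respect to w is just x(s).
    return w + alpha * td_error * features(state)
```

This is the kind of gradient-based update we will derive properly in the following sections; the point here is that the "label" used in the update is itself an estimate produced by the current parameters.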
In the following sections, we'll look at how to represent states using features and how to apply gradient-based methods to learn the parameters w for both linear and non-linear function approximators within the RL framework.