Now that we've reviewed the essential components of Markov Decision Processes (MDPs) and the Bellman equations, let's examine two fundamental dynamic programming algorithms used to solve MDPs when the model (transitions and rewards) is known: Policy Iteration and Value Iteration. These methods provide a basis for many reinforcement learning techniques that operate without a known model.
Policy Iteration (PI) is an approach that finds the optimal policy $\pi^*$ by iteratively performing two steps: evaluating the current policy and then improving it based on that evaluation. It starts with an arbitrary initial policy $\pi_0$ and alternates between these two phases until the policy no longer changes.
In the policy evaluation step, we calculate the state-value function $V^\pi(s)$ for the current policy $\pi$. Recall the Bellman expectation equation for $V^\pi$:
$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$
For a finite MDP with $|S|$ states, this equation defines a system of $|S|$ linear equations in the $|S|$ unknowns $V^\pi(s)$, one for each $s \in S$. This system can be solved directly, for example with a standard linear solver.
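For a very small MDP, that direct solve can be written out explicitly. The sketch below uses a made-up two-state example; `P_pi` and `R_pi` stand for the transition matrix and expected one-step rewards induced by some fixed policy $\pi$, and the numbers are illustrative only.

```python
import numpy as np

gamma = 0.9

# Hypothetical quantities induced by a fixed policy pi:
# P_pi[s, s'] = sum_a pi(a|s) * P(s'|s, a)                        (policy transition matrix)
# R_pi[s]     = sum_a pi(a|s) * sum_s' P(s'|s, a) * R(s, a, s')   (expected one-step reward)
P_pi = np.array([[0.8, 0.2],
                 [0.1, 0.9]])
R_pi = np.array([1.0, 0.0])

# In matrix form the Bellman expectation equation reads V = R_pi + gamma * P_pi @ V,
# i.e. (I - gamma * P_pi) V = R_pi, an ordinary linear system.
V_pi = np.linalg.solve(np.eye(2) - gamma * P_pi, R_pi)
print(V_pi)  # exact V^pi for this fixed policy
```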
In practice, however, a common iterative approach is to repeatedly apply the Bellman expectation backup operator to an initial estimate $V_k(s)$ until the values converge:
$$V_{k+1}(s) \leftarrow \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_k(s') \right]$$
This iterative update is performed for all states $s$ until the change in the value function between iterations is sufficiently small (e.g., $\max_s |V_{k+1}(s) - V_k(s)| < \theta$). The result is an accurate estimate of $V^\pi$.
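A minimal sketch of this iterative evaluation is shown below. It assumes tabular arrays `P[s, a, s']` for transition probabilities, `R[s, a, s']` for rewards, and `pi[s, a]` for the policy; these names and shapes are illustrative conventions, not fixed by the text.

```python
import numpy as np

def policy_evaluation(P, R, pi, gamma=0.9, theta=1e-8):
    """Iteratively apply the Bellman expectation backup until V stops changing.

    P[s, a, s'] : transition probabilities, R[s, a, s'] : rewards,
    pi[s, a]    : probability of taking action a in state s.
    """
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)
    while True:
        V_new = np.zeros(n_states)
        for s in range(n_states):
            for a in range(n_actions):
                # Expected return of taking a in s, then following pi (via V)
                V_new[s] += pi[s, a] * np.sum(P[s, a] * (R[s, a] + gamma * V))
        if np.max(np.abs(V_new - V)) < theta:  # max_s |V_{k+1}(s) - V_k(s)| < theta
            return V_new
        V = V_new
```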
Once we have the value function $V^\pi$ for the current policy $\pi$, we check whether we can improve it. We do this by considering, for each state $s$, whether acting differently than prescribed by $\pi$ would yield a better expected return. This involves calculating the action-value function $Q^\pi(s, a)$ using the computed $V^\pi$:
$$Q^\pi(s, a) = \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^\pi(s') \right]$$
Then we define a new, potentially improved policy $\pi'$ that acts greedily with respect to $Q^\pi(s, a)$:
$$\pi'(s) = \arg\max_{a} Q^\pi(s, a)$$
This means $\pi'(s)$ selects the action that maximizes the expected one-step lookahead value, assuming we follow the original policy $\pi$ thereafter (as captured by $V^\pi$). The Policy Improvement Theorem guarantees that this new policy $\pi'$ is at least as good as, and possibly better than, $\pi$. That is, $V^{\pi'}(s) \geq V^\pi(s)$ for all $s \in S$.
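Under the same array conventions as the evaluation sketch above, greedy improvement might look like this; the function name and the deterministic-policy representation (one action index per state) are choices made for illustration.

```python
import numpy as np

def policy_improvement(P, R, V, gamma=0.9):
    """Return the deterministic policy that is greedy with respect to V."""
    n_states, n_actions, _ = P.shape
    Q = np.zeros((n_states, n_actions))
    for s in range(n_states):
        for a in range(n_actions):
            # One-step lookahead: Q(s,a) = sum_s' P(s'|s,a) [R(s,a,s') + gamma V(s')]
            Q[s, a] = np.sum(P[s, a] * (R[s, a] + gamma * V))
    return np.argmax(Q, axis=1)  # pi'(s) = argmax_a Q(s, a)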
The overall algorithm alternates these two steps until the policy stops changing; a minimal sketch combining them follows.
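This sketch reuses the hypothetical `policy_evaluation` and `policy_improvement` helpers defined above. The alternation itself is the standard structure; the initialization and stopping test are illustrative choices.

```python
import numpy as np

def policy_iteration(P, R, gamma=0.9):
    """Alternate evaluation and greedy improvement until the policy is stable."""
    n_states, n_actions, _ = P.shape
    policy = np.zeros(n_states, dtype=int)               # arbitrary initial policy pi_0
    while True:
        pi = np.eye(n_actions)[policy]                   # one-hot encoding pi[s, a]
        V = policy_evaluation(P, R, pi, gamma)           # evaluate current policy
        new_policy = policy_improvement(P, R, V, gamma)  # act greedily w.r.t. V^pi
        if np.array_equal(new_policy, policy):           # policy stable => optimal
            return policy, V
        policy = new_policy
```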
Because a finite MDP has only finitely many deterministic policies, and each improvement step yields a strictly better policy unless the current one is already optimal, Policy Iteration is guaranteed to converge to the optimal policy $\pi^*$ in a finite number of iterations.
The Policy Iteration process alternates between evaluating the current policy and improving it greedily based on the evaluation.
A potential drawback of PI is that the policy evaluation step can be computationally intensive, requiring many sweeps through the state space before $V^\pi$ converges to sufficient accuracy.
Value Iteration (VI) offers a different approach. Instead of alternating between full policy evaluation and improvement, it computes the optimal value function $V^*$ directly by repeatedly applying the Bellman optimality backup operator.
Recall the Bellman optimality equation for $V^*$:
$$V^*(s) = \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V^*(s') \right]$$
Value Iteration turns this equation into an update rule. It starts with an initial guess $V_0(s)$ (often all zeros) and iteratively refines it:
$$V_{k+1}(s) \leftarrow \max_{a} \sum_{s'} P(s' \mid s, a) \left[ R(s, a, s') + \gamma V_k(s') \right]$$
This update is applied synchronously (or asynchronously) to all states $s$ in each iteration $k$. Notice how the maximization over actions (the policy improvement aspect) is folded directly into the value update (the policy evaluation aspect).
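A minimal sketch of this loop, using the same hypothetical `P[s, a, s']` and `R[s, a, s']` array conventions as before, might look as follows.

```python
import numpy as np

def value_iteration(P, R, gamma=0.9, theta=1e-8):
    """Repeatedly apply the Bellman optimality backup until V stops changing."""
    n_states, n_actions, _ = P.shape
    V = np.zeros(n_states)                    # initial guess V_0 = 0
    while True:
        Q = np.zeros((n_states, n_actions))
        for s in range(n_states):
            for a in range(n_actions):
                Q[s, a] = np.sum(P[s, a] * (R[s, a] + gamma * V))
        V_new = Q.max(axis=1)                 # max over actions inside the backup
        if np.max(np.abs(V_new - V)) < theta:
            return V_new
        V = V_new
```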
Value Iteration is guaranteed to converge to the optimal value function $V^*$ because the Bellman optimality operator is a contraction mapping (for $\gamma < 1$), so the value estimates get progressively closer to the true $V^*$ with each iteration. Once $V^*$ (or a close approximation) has been found, the optimal policy $\pi^*$ can be extracted by performing a single step of greedy policy improvement with respect to it.
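As a usage example, the snippet below reuses the `value_iteration` and `policy_improvement` sketches from earlier; the two-state, two-action transition and reward arrays are entirely made up for illustration.

```python
import numpy as np

# A made-up 2-state, 2-action MDP purely for illustration.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s=0, a, s']
              [[0.7, 0.3], [0.05, 0.95]]])  # P[s=1, a, s']
R = np.zeros((2, 2, 2))
R[:, :, 1] = 1.0                            # reward 1 for landing in state 1

V_star = value_iteration(P, R, gamma=0.9)
pi_star = policy_improvement(P, R, V_star, gamma=0.9)  # single greedy step
print(V_star, pi_star)
```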
Example convergence of the value function for a specific state during Value Iteration. The value approaches the optimal value $V^*(s)$ over iterations.
In practice, each Value Iteration sweep is much cheaper than a full policy evaluation, so VI is often faster overall, particularly when the number of possible policies is very large. Variations exist, such as modified policy iteration, which performs only a fixed number of Bellman expectation backups in the evaluation phase instead of iterating to full convergence.
Both VI and PI are fundamental dynamic programming methods for finding optimal policies in known MDPs. They illustrate the core concepts of evaluating policies and improving them using Bellman backups, which are central ideas adapted by many model-free reinforcement learning algorithms we will study later.