Implementing the First-Visit Monte Carlo Prediction algorithm means estimating the state-value function $V^\pi$ for a given policy $\pi$ using only sample episodes generated by following that policy in the environment. No knowledge of the environment's transition probabilities or reward dynamics is required. For this exercise we'll use the classic Blackjack environment, available through the Gymnasium library. Recall from the previous section that MC Prediction works by averaging the returns observed after the first visit to each state across many episodes.

### The Blackjack Environment

First, make sure Gymnasium is installed (`pip install gymnasium`; `Blackjack-v1` is one of the built-in toy-text environments, so no extras are needed for this exercise). The Blackjack environment (`Blackjack-v1`) simulates the popular card game.

- **State:** A tuple `(player_sum, dealer_showing_card, usable_ace)`.
  - `player_sum`: The current sum of the player's cards (integer between 4 and 21).
  - `dealer_showing_card`: The value of the dealer's face-up card (integer between 1 and 10, where 1 represents an Ace).
  - `usable_ace`: Whether the player has an Ace that can count as 11 without busting (boolean, 0 or 1).
- **Actions:** Two actions are available:
  - `0`: Stick (stop receiving cards).
  - `1`: Hit (take another card).
- **Rewards:**
  - `+1`: Player wins.
  - `-1`: Player loses.
  - `0`: Draw (push).
- **Episode end:** An episode ends when the player busts (sum > 21) or, after the player sticks, when the dealer finishes playing out their hand according to standard Blackjack rules (the dealer hits until their sum is 17 or more).
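Before writing the algorithm, it can help to interact with the environment once to see the observation, action, and reward formats described above. The short sketch below is only for orientation and is not needed by the rest of the section; the exact observation and reward you see depend on the shuffled deck.

```python
import gymnasium as gym

# Create the environment; sab=True selects the Sutton & Barto rule variant.
env = gym.make('Blackjack-v1', sab=True)

# Reset returns the initial observation: (player_sum, dealer_showing, usable_ace)
obs, info = env.reset(seed=0)
print("Initial state:", obs)

# Take one 'hit' (action 1) and inspect the result
obs, reward, terminated, truncated, info = env.step(1)
print("Next state:", obs, "reward:", reward, "terminated:", terminated)

env.close()
```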
### The Policy to Evaluate

We need a policy $\pi$ to generate episodes. We'll use a simple, fixed policy that is often used as a baseline:

**Policy $\pi$:** Stick if the player's current sum is 20 or 21; otherwise, hit.

This policy is reasonable but not optimal. Our goal is not to find the best policy yet, but simply to evaluate this specific policy using MC Prediction.

### Implementing First-Visit MC Prediction

Let's outline the steps and then implement them in Python.

1. **Initialization:** Create dictionaries to store the sum of returns for each state (`state_returns_sum`) and the count of visits to each state (`state_visit_count`). Initialize the state-value function $V(s)$ as a dictionary, with all values starting at 0.
2. **Episode loop:** Repeat for a specified number of episodes.
3. **Generate an episode:** Play one full game of Blackjack using the fixed policy $\pi$. Store the sequence of states and rewards: $(S_0, R_1, S_1, R_2, \dots, S_{T-1}, R_T)$.
4. **Calculate returns:** Iterate backwards from the end of the episode ($t = T-1, T-2, \dots, 0$) and compute the return from step $t$: $G_t = R_{t+1} + \gamma R_{t+2} + \dots + \gamma^{T-t-1} R_T$. Blackjack is episodic and we care only about the final outcome, so we use $\gamma = 1$; since all intermediate rewards are 0, $G_t$ is simply the final reward $R_T$.
5. **First-visit check:** Update a state only for its first occurrence in the episode; returns following later occurrences of the same state are ignored.
6. **Update values:** Increment the visit count (`state_visit_count[S_t] += 1`), add the return $G$ to the running sum (`state_returns_sum[S_t] += G`), and update the estimate as the average: `V[S_t] = state_returns_sum[S_t] / state_visit_count[S_t]`.

Now, let's translate this into Python code.

```python
import gymnasium as gym
import numpy as np
from collections import defaultdict

# Initialize the environment
env = gym.make('Blackjack-v1', sab=True)  # sab=True follows the Sutton & Barto rule variant

# Define the simple policy: stick on 20 or 21, otherwise hit.
def simple_policy(state):
    """
    Args:
        state: Tuple (player_sum, dealer_showing, usable_ace)
    Returns:
        action: 0 (stick) or 1 (hit)
    """
    player_sum, _, _ = state
    if player_sum >= 20:
        return 0  # Stick
    else:
        return 1  # Hit

# MC Prediction Algorithm (First-Visit)
def mc_prediction_first_visit(policy, env, num_episodes, gamma=1.0):
    """
    Performs First-Visit Monte Carlo Prediction.

    Args:
        policy: A function that takes a state and returns an action.
        env: The Gymnasium environment.
        num_episodes: Number of episodes to generate.
        gamma: Discount factor.

    Returns:
        V: A dictionary mapping state -> value estimate.
    """
    # defaultdict avoids having to check whether a key already exists
    state_returns_sum = defaultdict(float)
    state_visit_count = defaultdict(int)
    V = defaultdict(float)  # Final value estimates

    for i_episode in range(num_episodes):
        # Print progress
        if (i_episode + 1) % 10000 == 0:
            print(f"\rEpisode {i_episode + 1}/{num_episodes}", end="")

        # Generate an episode following the policy
        episode = []
        state, _ = env.reset()  # Get initial state
        terminated = False
        truncated = False  # Handle potential truncation by Gymnasium
        while not terminated and not truncated:
            action = policy(state)
            next_state, reward, terminated, truncated, _ = env.step(action)
            episode.append((state, reward))  # Store S_t and the *next* reward R_{t+1}
            state = next_state

        # Record the time step of the first visit to each state, so that we
        # only use the return following the *first* occurrence (first-visit MC)
        first_visit_time = {}
        for t, (s, _) in enumerate(episode):
            if s not in first_visit_time:
                first_visit_time[s] = t

        # Loop backwards through the episode to accumulate
        # G_t = R_{t+1} + gamma * R_{t+2} + ...
        G = 0.0
        for t in range(len(episode) - 1, -1, -1):
            state, reward = episode[t]
            G = gamma * G + reward

            # Only update at the first visit to 'state' in this episode
            if first_visit_time[state] == t:
                state_returns_sum[state] += G
                state_visit_count[state] += 1
                # Value estimate = average of observed first-visit returns
                V[state] = state_returns_sum[state] / state_visit_count[state]

    print(f"\nFinished {num_episodes} episodes.")
    return V

# Run the prediction
num_episodes = 500000
V_simple_policy = mc_prediction_first_visit(simple_policy, env, num_episodes)

# Print some example values (optional)
# Example: Value of state (18, 7, False) - player sum 18, dealer shows 7, no usable ace
print(f"\nExample state value V(18, 7, False): {V_simple_policy.get((18, 7, False), 'Not Visited')}")
# Example: Value of state (21, 10, True) - player sum 21, dealer shows 10, usable ace
print(f"Example state value V(21, 10, True): {V_simple_policy.get((21, 10, True), 'Not Visited')}")

env.close()
```
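A brief aside on the update step: keeping a running sum and count is mathematically equivalent to nudging the current estimate toward each new return. The listing above does not rely on this identity, but it is worth seeing once because the incremental form reappears later (for example with constant step sizes):

$$V(S_t) \leftarrow V(S_t) + \frac{1}{N(S_t)}\bigl(G - V(S_t)\bigr),$$

where $N(S_t)$ is the visit count after it has been incremented for the current first visit.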
### Code Explanation

- `simple_policy(state)`: Implements our fixed strategy. It takes the current state tuple and returns 0 (stick) if the player's sum is 20 or 21, and 1 (hit) otherwise.
- `mc_prediction_first_visit(...)`:
  - Initializes `state_returns_sum`, `state_visit_count`, and `V` as `defaultdict`s for convenience: `defaultdict(float)` initializes new keys with `0.0`, and `defaultdict(int)` with `0`.
  - The main loop runs for `num_episodes`.
  - Inside the loop, an episode is generated step by step using `env.step(policy(state))`. We store tuples `(state, reward)`, where `reward` is $R_{t+1}$, received after being in state $S_t$.
  - After the episode ends, `first_visit_time` records the time step of each state's first occurrence; this is what makes the method first-visit rather than every-visit.
  - The inner loop iterates backwards through the episode steps ($t = T-1, \dots, 0$).
  - $G$ accumulates the discounted reward from step $t$ onwards. Since $\gamma = 1$ and intermediate rewards are 0 in Blackjack, $G$ is simply the final reward of the episode ($+1$, $-1$, or $0$).
  - The `if first_visit_time[state] == t:` check ensures we only use the return that follows the first occurrence of the state in this episode. (In Blackjack the player's sum or usable-ace flag changes after every hit, so states essentially never repeat within an episode and first-visit and every-visit MC coincide here; the check keeps the implementation faithful to the algorithm.)
  - At the first visit, we update the returns sum and visit count for that state and recompute the average return, which is our estimate $V(state)$.
- **Running the code:** We instantiate the environment, run the prediction function for a large number of episodes (MC methods often need many samples), and print a couple of example state values.

### Visualizing Results

Estimating values for all reachable Blackjack states is useful, but visualizing them provides more insight. Since the state space is three-dimensional (`player_sum`, `dealer_showing`, `usable_ace`), we can create 2D plots by fixing one dimension. Let's plot the estimated state value $V(s)$ against `player_sum`, separately for states with and without a usable ace, holding the dealer's showing card constant (e.g., dealer shows a 7).

```python
import matplotlib.pyplot as plt

def plot_blackjack_values(V, dealer_card=7):
    """Plot V(s) against player sum for a fixed dealer card."""
    player_sums = np.arange(12, 22)  # Relevant player sums

    # Values for states without a usable ace
    values_no_ace = [V.get((s, dealer_card, False), 0) for s in player_sums]
    # Values for states with a usable ace
    values_usable_ace = [V.get((s, dealer_card, True), 0) for s in player_sums]

    fig, ax = plt.subplots(figsize=(10, 6))
    ax.plot(player_sums, values_no_ace, label='No Usable Ace', marker='o')
    ax.plot(player_sums, values_usable_ace, label='Usable Ace', marker='x')
    ax.set_xlabel("Player Sum")
    ax.set_ylabel("State Value (Estimated V)")
    ax.set_title(f"Value Function vs Player Sum (Dealer Showing {dealer_card})")
    ax.legend()
    ax.grid(True)
    plt.show()

# Generate the plots using V_simple_policy calculated earlier
plot_blackjack_values(V_simple_policy, dealer_card=7)
plot_blackjack_values(V_simple_policy, dealer_card=2)  # A different dealer card
```

The plots show the estimated value (expected return) for different player sums, given that the dealer shows a specific card (e.g., 7 or 2). You'll typically observe that higher player sums have higher values, and that a usable ace often increases the value, especially at lower sums where busting is not an immediate risk. The simple policy's effect (sticking only at 20 or 21) should be visible in the values near those sums.
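If you would rather see the whole value function at once instead of one dealer card at a time, a heatmap over (`player_sum`, `dealer_showing`) with the usable-ace flag fixed works well; it is essentially a flattened version of the surface plots often drawn for this problem. The helper below is a sketch that reuses `V_simple_policy` and the `numpy`/`matplotlib` imports from above; the function name `plot_value_heatmap` is just illustrative.

```python
def plot_value_heatmap(V, usable_ace=False):
    """Heatmap of estimated V over (player_sum, dealer_showing) for a fixed ace flag."""
    player_sums = np.arange(12, 22)   # rows: player sums 12..21
    dealer_cards = np.arange(1, 11)   # columns: dealer card 1 (Ace)..10

    grid = np.zeros((len(player_sums), len(dealer_cards)))
    for i, ps in enumerate(player_sums):
        for j, dc in enumerate(dealer_cards):
            # States never visited under the policy default to 0
            grid[i, j] = V.get((int(ps), int(dc), usable_ace), 0.0)

    fig, ax = plt.subplots(figsize=(8, 6))
    im = ax.imshow(grid, origin='lower', aspect='auto',
                   extent=[0.5, 10.5, 11.5, 21.5])
    ax.set_xlabel("Dealer Showing")
    ax.set_ylabel("Player Sum")
    ax.set_title(f"Estimated V (usable ace: {usable_ace})")
    fig.colorbar(im, ax=ax, label="Estimated V")
    plt.show()

plot_value_heatmap(V_simple_policy, usable_ace=False)
plot_value_heatmap(V_simple_policy, usable_ace=True)
```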
This practical exercise demonstrates how MC Prediction learns state values directly from experience, without needing a model of the environment. By running episodes and averaging the observed returns for each state, we can effectively estimate $V^\pi$. This forms the foundation for MC Control methods, where we'll use similar ideas to estimate action-values ($Q^\pi$) and find optimal policies.