The REINFORCE algorithm, also known as Monte Carlo Policy Gradient, directly adjusts the parameters $\theta$ of a policy $\pi_\theta(a|s)$ to maximize expected future rewards. The main idea is to sample complete episodes, calculate the return $G_t$ for each step, and then update the policy parameters in the direction suggested by the policy gradient theorem: the log probability of an action taken at state $s_t$ should increase if it led to a high return $G_t$, and decrease otherwise.

The update rule, in its simplest form (without a baseline), is:

$$ \theta \leftarrow \theta + \alpha \nabla_\theta \log \pi_\theta(a_t|s_t) G_t $$

Here, $\alpha$ is the learning rate, $\nabla_\theta \log \pi_\theta(a_t|s_t)$ is the gradient of the log probability of the action taken, and $G_t$ is the return from time step $t$ onwards.

We'll use the classic CartPole-v1 environment from the Gymnasium library. The goal is to balance a pole on a cart by moving the cart left or right. The state consists of four continuous values (cart position, cart velocity, pole angle, pole angular velocity), and the action space is discrete (0 for left, 1 for right). This is a good fit for REINFORCE because the policy needs to map continuous states to discrete actions.

### Setting Up the Environment and Policy Network

First, ensure you have Gymnasium and a deep learning library such as PyTorch installed.

```python
import gymnasium as gym
import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
import torch.nn.functional as F
import matplotlib.pyplot as plt

# Use the GPU if one is available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Using device: {device}")

# Create the environment
env = gym.make('CartPole-v1')

# Get state and action dimensions
state_dim = env.observation_space.shape[0]
action_dim = env.action_space.n
print(f"State dimensions: {state_dim}")
print(f"Action dimensions: {action_dim}")
```

Now, let's define our policy network. A simple Multi-Layer Perceptron (MLP) suffices for CartPole. The network takes the state as input and outputs a probability for each possible action (left or right).

```python
class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super(PolicyNetwork, self).__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, action_dim)

    def forward(self, state):
        x = F.relu(self.fc1(state))
        # Output raw scores (logits) for the actions
        action_scores = self.fc2(x)
        # Convert scores to probabilities with softmax;
        # dim=-1 applies softmax across the action dimension
        return F.softmax(action_scores, dim=-1)

# Instantiate the policy network and optimizer
policy_net = PolicyNetwork(state_dim, action_dim).to(device)
optimizer = optim.Adam(policy_net.parameters(), lr=1e-3)  # learning rate
```
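Before wiring up the training loop, it can help to sanity-check the untrained network: a single observation pushed through it should yield a valid probability distribution over the two actions, roughly uniform at initialization. The minimal check below is an optional sketch that assumes the `env`, `policy_net`, and `device` defined above are in scope.

```python
# Sanity check: the untrained policy should output a valid probability
# distribution over the two actions (roughly uniform at initialization).
sample_state, _ = env.reset()
with torch.no_grad():
    state_tensor = torch.from_numpy(sample_state).float().unsqueeze(0).to(device)
    sample_probs = policy_net(state_tensor)
print(f"Action probabilities: {sample_probs.cpu().numpy()}")
print(f"Sum of probabilities: {sample_probs.sum().item():.4f}")  # should be ~1.0
```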
### The REINFORCE Training Loop

The core of REINFORCE is to interact with the environment to collect trajectories (sequences of states, actions, and rewards) and then use those trajectories to update the policy network:

1. **Collect trajectories:** Run one or more complete episodes using the current policy. For each step, store the state, the action taken, and the reward received.
2. **Calculate returns:** For each step $t$ in an episode, calculate the discounted return $G_t = \sum_{k=t}^{T} \gamma^{k-t} r_k$, where $T$ is the episode length and $\gamma$ is the discount factor (often 0.99, or 1.0 for episodic tasks).
3. **Compute the policy loss:** For the episode, this is the negative sum of $\log \pi_\theta(a_t|s_t) \, G_t$ over all steps $t$. We use the negative because optimizers perform gradient descent, while we want to perform gradient ascent on the expected return.
4. **Update the policy network:** Backpropagate the loss and update the network parameters $\theta$ with the optimizer.

Let's structure the training loop:

```python
def select_action(state):
    """Samples an action from the policy and stores its log probability."""
    state = torch.from_numpy(state).float().unsqueeze(0).to(device)
    probs = policy_net(state)
    # Build a categorical distribution over the action probabilities
    m = torch.distributions.Categorical(probs)
    # Sample an action from the distribution
    action = m.sample()
    # Save the log probability of the sampled action for the policy update
    policy_net.saved_log_probs.append(m.log_prob(action))
    return action.item()

def calculate_returns(rewards, gamma=0.99):
    """Calculates discounted returns for an episode."""
    R = 0.0
    returns = []
    # Iterate backwards through the rewards
    for r in reversed(rewards):
        R = r + gamma * R
        returns.insert(0, R)  # prepend to maintain chronological order
    # Normalize returns (optional but often helpful)
    returns = torch.tensor(returns, dtype=torch.float32, device=device)
    returns = (returns - returns.mean()) / (returns.std() + np.finfo(np.float32).eps.item())
    return returns

def finish_episode(episode_rewards):
    """Performs the REINFORCE update at the end of an episode."""
    policy_loss = []
    returns = calculate_returns(episode_rewards)
    # Pair each saved log probability with its calculated return
    for log_prob, G_t in zip(policy_net.saved_log_probs, returns):
        # REINFORCE objective: maximize log_prob * G_t, so the loss is its negative
        policy_loss.append(-log_prob * G_t)
    optimizer.zero_grad()                  # reset gradients
    loss = torch.stack(policy_loss).sum()  # combine losses for the episode
    loss.backward()                        # compute gradients
    optimizer.step()                       # update network weights
    # Clear saved log probabilities for the next episode
    del policy_net.saved_log_probs[:]

# Training parameters
num_episodes = 1000
gamma = 0.99
log_interval = 50              # print status every 50 episodes
max_steps_per_episode = 1000   # prevent excessively long episodes

all_episode_rewards = []
episode_durations = []

# Main training loop
for i_episode in range(num_episodes):
    state, _ = env.reset()
    episode_rewards = []
    policy_net.saved_log_probs = []  # log probabilities for this episode

    for t in range(max_steps_per_episode):
        action = select_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        episode_rewards.append(reward)
        state = next_state
        if done:
            break

    # Episode finished, update the policy
    finish_episode(episode_rewards)

    # Logging
    total_reward = sum(episode_rewards)
    all_episode_rewards.append(total_reward)
    episode_durations.append(t + 1)

    if i_episode % log_interval == 0:
        avg_reward = np.mean(all_episode_rewards[-log_interval:])
        print(f'Episode {i_episode}\tAverage Reward (last {log_interval}): {avg_reward:.2f}\tLast Duration: {t+1}')

    # Optional: stop training once the problem is solved.
    # CartPole-v1 is considered solved when the average reward exceeds 475
    # over 100 consecutive episodes.
    if len(all_episode_rewards) > 100:
        if np.mean(all_episode_rewards[-100:]) > 475:
            print(f'\nSolved in {i_episode} episodes!')
            break

env.close()
```
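Once training finishes (or is stopped early), you may want to see how the learned policy behaves. The snippet below is a minimal evaluation sketch, assuming `policy_net` and `device` from above are still in memory: it creates a fresh environment, acts greedily (argmax over action probabilities instead of sampling), and reports the total reward per episode. Passing `render_mode="human"` opens a window so you can watch the cart, but it requires a display.

```python
# Optional: evaluate the trained policy greedily for a few episodes.
eval_env = gym.make('CartPole-v1', render_mode='human')
for _ in range(3):
    state, _ = eval_env.reset()
    done, total_reward = False, 0.0
    while not done:
        with torch.no_grad():
            probs = policy_net(torch.from_numpy(state).float().unsqueeze(0).to(device))
        action = probs.argmax(dim=-1).item()  # greedy action instead of sampling
        state, reward, terminated, truncated, _ = eval_env.step(action)
        total_reward += reward
        done = terminated or truncated
    print(f"Evaluation episode reward: {total_reward}")
eval_env.close()
```

Greedy evaluation removes the exploration noise of sampling, so it usually gives a cleaner picture of what the policy has actually learned.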
### Visualizing Results

After training, it's useful to plot the reward per episode to see whether the agent learned effectively.

```python
# Plot the results
plt.figure(figsize=(12, 6))
plt.plot(all_episode_rewards)
plt.title('Episode Rewards over Time (REINFORCE)')
plt.xlabel('Episode')
plt.ylabel('Total Reward')

# Calculate and plot a rolling average
rolling_avg = np.convolve(all_episode_rewards, np.ones(100) / 100, mode='valid')
plt.plot(np.arange(99, len(all_episode_rewards)), rolling_avg,
         label='100-episode rolling average', color='orange')

plt.legend()
plt.grid(True)
plt.show()
```

*Figure: Total reward obtained in each episode during training, with a 100-episode rolling average to visualize the learning trend.*

### Discussion

This implementation demonstrates the basic REINFORCE algorithm. You should see the agent's performance gradually improve, indicated by longer episode durations and higher total rewards.

**High variance:** Learning with REINFORCE is noisy. Sometimes it learns quickly; other times it struggles or takes much longer, and separate runs with the same hyperparameters can behave quite differently. This is because the gradient estimate relies on Monte Carlo sampling of full episodes, which is inherently high-variance.

**Importance of normalization:** Normalizing the returns (`returns = (returns - returns.mean()) / (returns.std() + eps)`) is a common technique that helps stabilize training. It centers the returns around zero, so actions leading to above-average returns increase in probability while those leading to below-average returns decrease.

**Baseline:** As mentioned earlier in the chapter, subtracting a baseline (such as the state-value estimate $V(s_t)$) from the return $G_t$ can significantly reduce variance without introducing bias. The update then uses $(G_t - b(s_t))$ instead of $G_t$. Implementing this typically leads to an Actor-Critic method, where one network (the critic) estimates the baseline $V(s_t)$ and another (the actor) updates the policy based on the advantage $G_t - V(s_t)$.

This hands-on example provides a foundation for understanding and implementing policy gradient methods. Feel free to experiment with different hyperparameters (learning rate, network architecture, discount factor) or try implementing a baseline (sketched below) to see how it affects performance.
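As a starting point for that experiment, here is a minimal sketch of the baseline variant. It assumes the `policy_net`, `optimizer`, `state_dim`, and `device` defined above are in scope, and that the training loop is extended to also record the visited states of each episode; `ValueNetwork` and `finish_episode_with_baseline` are illustrative names, not part of the implementation above.

```python
# Sketch: REINFORCE update with a learned state-value baseline.

class ValueNetwork(nn.Module):
    """Small MLP that estimates V(s), used as the baseline (critic)."""
    def __init__(self, state_dim, hidden_dim=128):
        super().__init__()
        self.fc1 = nn.Linear(state_dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, 1)

    def forward(self, state):
        return self.fc2(F.relu(self.fc1(state))).squeeze(-1)

value_net = ValueNetwork(state_dim).to(device)
value_optimizer = optim.Adam(value_net.parameters(), lr=1e-3)

def finish_episode_with_baseline(episode_states, episode_rewards, saved_log_probs, gamma=0.99):
    """REINFORCE update using the advantage G_t - V(s_t) instead of G_t."""
    # Unnormalized discounted returns; the baseline now provides the centering.
    R = 0.0
    returns = []
    for r in reversed(episode_rewards):
        R = r + gamma * R
        returns.insert(0, R)
    returns = torch.tensor(returns, dtype=torch.float32, device=device)

    states = torch.tensor(np.array(episode_states), dtype=torch.float32, device=device)
    values = value_net(states)

    # Advantage: how much better the observed return was than the baseline.
    # detach() keeps the policy gradient from flowing into the value network.
    advantages = returns - values.detach()

    policy_loss = torch.stack(
        [-log_prob * adv for log_prob, adv in zip(saved_log_probs, advantages)]
    ).sum()
    value_loss = F.mse_loss(values, returns)  # regress V(s_t) toward G_t

    optimizer.zero_grad()
    value_optimizer.zero_grad()
    (policy_loss + value_loss).backward()
    optimizer.step()
    value_optimizer.step()
```

Because the value prediction is detached when forming the advantage, the policy gradient is unchanged in expectation; the critic is trained separately by regressing $V(s_t)$ toward the observed return $G_t$.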