In the preceding chapters, we established the framework of Markov Decision Processes (MDPs) and used dynamic programming (DP) methods to find optimal policies. A key limitation of DP is that it requires a complete model of the environment, including state transition probabilities and reward functions. Often, such a model is unavailable.
This chapter introduces Monte Carlo (MC) methods, a class of model-free Reinforcement Learning algorithms. MC methods learn directly from episodes of experience, without needing prior knowledge of the environment's dynamics. They operate by averaging the sample returns obtained from complete interaction sequences (episodes).
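To make the idea of averaging sample returns concrete, here is a minimal sketch of first-visit MC prediction, which is treated in detail later in the chapter. It assumes a hypothetical `generate_episode` helper that follows the policy π and returns a list of (state, reward) pairs; treat it as an illustration under those assumptions, not a reference implementation.

```python
from collections import defaultdict

def first_visit_mc_prediction(generate_episode, gamma=1.0, num_episodes=5000):
    """Estimate the state-value function V_pi by averaging sampled returns.

    `generate_episode` (an assumed helper) follows the policy pi and returns
    a list of (state, reward) pairs, where reward is R_{t+1} received after
    leaving `state`.
    """
    returns_sum = defaultdict(float)   # total first-visit return observed per state
    returns_count = defaultdict(int)   # number of first visits per state
    V = defaultdict(float)             # current value estimates

    for _ in range(num_episodes):
        episode = generate_episode()
        # Index of each state's first occurrence in this episode.
        first_visit = {}
        for t, (state, _) in enumerate(episode):
            first_visit.setdefault(state, t)

        # Walk backwards so G accumulates the discounted return from time t onward.
        G = 0.0
        for t in reversed(range(len(episode))):
            state, reward = episode[t]
            G = gamma * G + reward
            if first_visit[state] == t:  # first-visit MC: count each state once per episode
                returns_sum[state] += G
                returns_count[state] += 1
                V[state] = returns_sum[state] / returns_count[state]
    return V
```

The estimate for each state is simply the running average of the returns observed after its first visit in each episode; no transition probabilities or reward model are used.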
By working through MC methods, you'll gain insight into learning optimal behavior purely from sampled experience, a necessary step toward tackling problems where the environment's rules are unknown.
In this chapter, you will focus on the following sections:
4.1 Learning from Complete Episodes
4.2 Monte Carlo Prediction: Estimating Vπ
4.3 Monte Carlo Control: Estimating Qπ
4.4 On-Policy vs Off-Policy Learning
4.5 MC Control without Exploring Starts
4.6 On-Policy First-Visit MC Control Implementation
4.7 Off-Policy MC Prediction and Control Intro
4.8 Practice: Implementing MC Prediction