
Monte Carlo Methods

Monte Carlo methods are the first class of practical algorithms for estimating value functions and finding optimal policies.

Review of the Basic RL Framework

Model-Free

Starting from Monte Carlo methods, we enter the domain of classical reinforcement learning. Reinforcement learning is generally divided into three eras:

  1. The Dynamic Programming era, where the agent has full knowledge of all environment states
  2. The Tabular Methods era, which is model-free and based on sampling
  3. The Function Approximation and Deep Reinforcement Learning era (which of course contains many further subdivisions)

The Tabular Methods era is based on sampling and no longer requires complete knowledge of the environment. Instead, we can iteratively improve policies through sampling.

Model-free means that the agent learns to evaluate values and find optimal policies directly through "trial and error" (interacting with the environment and collecting real experience trajectories), without any knowledge of how the environment works. It does not attempt to understand "how the world operates" — it only cares about "what action to take in what situation to get the highest score."

Policy and Value Functions

Before this chapter, we have already discussed the multi-armed bandit problem, finite MDPs, and dynamic programming. Let us briefly review the two most important concepts: policy and value functions.

A policy is a mapping from states to actions. It is like an "action manual" that tells the agent what action to take in a given situation.

Policies are typically divided into two types:

  • Deterministic Policy: For each state, the agent always selects the same specific action. $$ \pi(s) = a $$
  • Stochastic Policy: For each state, the policy specifies a probability for each action being selected. $$ \pi(a|s) = P(A_t = a | S_t = s) $$

Analogy: If you are playing a video game, the policy is your "move list." When you see an enemy jump (state), you press the block button (action).
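For concreteness, here is a minimal sketch of how these two kinds of policies could be represented in code; the states, actions, and probabilities are hypothetical placeholders:

import random

# Deterministic policy: each state maps to exactly one action, pi(s) = a.
deterministic_policy = {"enemy_jumps": "block", "enemy_idle": "attack"}

# Stochastic policy: each state maps to a probability distribution over actions, pi(a|s).
stochastic_policy = {
    "enemy_jumps": {"block": 0.8, "attack": 0.1, "dodge": 0.1},
    "enemy_idle":  {"block": 0.1, "attack": 0.7, "dodge": 0.2},
}

def act(policy, state):
    """Pick an action: directly for a deterministic policy, by sampling for a stochastic one."""
    entry = policy[state]
    if isinstance(entry, str):                   # deterministic: return the fixed action
        return entry
    actions, probs = zip(*entry.items())         # stochastic: sample a ~ pi(.|s)
    return random.choices(actions, weights=probs, k=1)[0]

print(act(deterministic_policy, "enemy_jumps"))  # always "block"
print(act(stochastic_policy, "enemy_jumps"))     # "block" about 80% of the time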


A value function is a long-term prediction of future rewards. It does not merely consider the immediate reward; rather, it focuses on the cumulative sum of rewards obtainable from the current time step onward.

There are two common types of value functions:

1. State-Value Function (\(V\))

It represents the expected cumulative return an agent can obtain when starting in state \(s\) and acting according to policy \(\pi\). It measures how good it is to be in a given state.

\[ V_\pi(s) = \mathbb{E}_\pi [ \sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t = s ] \]

(where \(\gamma\) is the discount factor, determining how much we value future rewards)

2. Action-Value Function (\(Q\))

Commonly referred to as the Q-function. It represents the expected cumulative return when, in state \(s\), the agent first takes action \(a\) and then follows policy \(\pi\) thereafter. It measures how good it is to take a particular action in a given state.

\[ Q_\pi(s, a) = \mathbb{E}_\pi [ \sum_{k=0}^\infty \gamma^k R_{t+k+1} | S_t = s, A_t = a ] \]

In short, the policy determines what to do, while the value function evaluates how good that policy is.
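To make the "cumulative sum of rewards" concrete, here is a minimal sketch that computes the discounted return \(G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\) for every time step of a hypothetical, finite reward sequence:

def discounted_returns(rewards, gamma=0.9):
    """Compute G_t for each step t of a finite reward sequence R_1, ..., R_T.

    Works backwards using the recursion G_t = R_{t+1} + gamma * G_{t+1}.
    """
    returns = [0.0] * len(rewards)
    g = 0.0
    for t in reversed(range(len(rewards))):
        g = rewards[t] + gamma * g
        returns[t] = g
    return returns

# Hypothetical episode with rewards R_1..R_4
print(discounted_returns([0, 0, 1, 10], gamma=0.9))  # approximately [8.1, 9.0, 10.0, 10.0]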

Generalized Policy Iteration

Generalized Policy Iteration (GPI) is not a specific algorithm but rather a general framework and core idea for finding optimal policies in reinforcement learning. It describes two processes that alternate continuously, competing with yet complementing each other:

  1. Policy Evaluation: Given the current policy, compute the value of each state (this is the "prediction" task).
  2. Policy Improvement: Based on the computed values, greedily select actions that yield the greatest value, thereby generating a better policy.

These two processes alternate: the policy is improved based on the values, and the values are re-evaluated under the new policy, until both converge to the optimum (the optimal value function \(V^*\) and the optimal policy \(\pi^*\)).


Suppose the current policy is \(\pi\) and evaluation yields the action-value function \(Q_\pi(s, a)\).

When improving the policy, we simply choose the action that maximizes the \(Q\) value:

\[ \pi'(s) = \arg\max_a Q_\pi(s, a) \]

Since \(\pi'\) selects the best-looking action in every state, \(\pi'\) is always at least as good as \(\pi\) (and often strictly better).
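A minimal sketch of this improvement step, assuming \(Q_\pi\) has already been estimated and is stored in a nested dictionary (the states, actions, and numbers below are hypothetical):

# Hypothetical action values Q_pi(s, a) produced by the evaluation step
Q = {
    "s1": {"left": 1.0, "right": 2.5},
    "s2": {"left": 0.3, "right": -1.0},
}

# Policy improvement: in every state, act greedily with respect to Q_pi
pi_prime = {s: max(actions, key=actions.get) for s, actions in Q.items()}
print(pi_prime)  # {'s1': 'right', 's2': 'left'}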

Prediction and Control

In reinforcement learning, the term "Prediction" can be somewhat misleading — it is better understood as "Evaluation." In simple terms:

  • Prediction is about computing the value of each state under a given policy.
  • Control is about modifying the policy based on those computed values to find the optimal policy.

The objective of Prediction is: given a specific policy \(\pi\), compute the value function (\(V_\pi\) or \(Q_\pi\)) under that policy.

  • Core goal: Value estimation.
  • Question answered: "If I keep following the current policy, what is the average score I can expect from a given state?"
  • Input: Environment \(M\) + Policy \(\pi\).
  • Output: Value function \(V_\pi\).
  • Example: In weather forecasting, predicting an 80% chance of rain tomorrow.

The objective of Control is: without fixing any policy, find an optimal policy \(\pi^*\) that maximizes cumulative reward.

  • Core goal: Policy optimization.
  • Question answered: "What actions should I take to achieve the highest score?"
  • Input: Environment \(M\).
  • Output: Optimal policy \(\pi^*\) and optimal value function \(V^*\).
  • Example: In an autonomous driving system, upon predicting an obstacle ahead, immediately deciding to brake.

During prediction, you may only care about how much a state is worth (\(V\)). But in the control (optimization) phase, knowing a state's value is not enough — you need to know which action leads to that high score.

  • Key insight: In the model-free setting, you must estimate the value of state-action pairs \(Q(s, a)\).
  • Takeaway: Understand why, in model-free reinforcement learning, the \(Q\) function is more practical than the \(V\) function.

On-Policy vs Off-Policy

When performing "control," the agent needs to interact with the environment to collect data (play the game). This raises a fundamental tension: to find the optimal policy, we should always select the action with the highest current value (exploitation); but if we never try unexplored paths, we may never discover that they are better (exploration). How we generate interaction data gives rise to the distinction between on-policy and off-policy methods.

The core idea of on-policy is: the policy used to interact with the environment and collect data (the behavior policy) is the same policy being evaluated and optimized (the target policy).

To ensure both optimization and exploration, on-policy algorithms typically employ an \(\epsilon\)-greedy policy: with probability \(1-\epsilon\), select the current best action; with probability \(\epsilon\), select a random action.

  • Mathematical manifestation: When updating the \(Q\) value, the update uses the action \(A_{t+1}\) actually taken by the agent to update \(Q(S_t, A_t)\).
  • Analogy: "Learning from your own mistakes." You are learning to play basketball (target policy), and you personally play in the game (behavior policy is also you). You miss a shot, learn from it, and adjust your shooting form. You are both the data generator and the policy optimizer.

The core idea of off-policy is: the policy that collects data (the behavior policy \(b\)) and the policy being optimized (the target policy \(\pi\)) are separate.

  • Target policy \(\pi\): This is the optimal policy we want to learn; it is typically purely greedy (always selects the best action, no exploration).
  • Behavior policy \(b\): This is the policy that interacts with the environment to generate data; it must be highly exploratory (e.g., fully random) to ensure all states can be visited.

Analogy:

"Standing on the shoulders of giants" or "the bystander sees more clearly." You want to learn to play Go (optimizing the target policy \(\pi\)), but instead of playing yourself, you watch recordings of Go masters' games (data generated by the masters' behavior policy \(b\)), or watch two beginners play randomly. By observing others' actions and outcomes, you develop an optimal strategy in your own mind.

Exploration vs Exploitation

This is the central challenge of the control task. If you discover that action A yields 10 points, you might always choose A, thereby missing action B which could yield 100 points.

  • Solution A: Exploring Starts (ES)
    • Force each episode to begin from a random state and action.
    • Takeaway: Understand its mathematical elegance and the practical limitation that "teleporting" to a random state is often infeasible.
  • Solution B: \(\epsilon\)-Greedy Policy
    • Instead of requiring random starting points, when choosing an action, there is an \(\epsilon\) probability of taking a random action (exploration) and a \(1-\epsilon\) probability of choosing the best action (see the sketch after this list).
    • Takeaway: Master how this simple probabilistic mechanism balances the search space.
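Below is a minimal sketch of \(\epsilon\)-greedy action selection over a Q-table; the table entries and state/action names are hypothetical:

import random

def epsilon_greedy(Q, state, actions, epsilon=0.1):
    """With probability epsilon explore (random action); otherwise exploit (greedy action)."""
    if random.random() < epsilon:
        return random.choice(actions)                           # explore
    return max(actions, key=lambda a: Q.get((state, a), 0.0))   # exploit

# Hypothetical Q-table keyed by (state, action)
Q = {("s0", "A"): 10.0, ("s0", "B"): 3.0}
print(epsilon_greedy(Q, "s0", ["A", "B"], epsilon=0.1))  # "A" about 95% of the time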

Monte Carlo Prediction

Underlying Principle

Monte Carlo is not a specific algorithm but rather a class of methods that rely on repeated random sampling to solve mathematical problems. For example, suppose you want to determine the area of an irregularly shaped lake but have no formula. You can randomly "scatter beans" over a square region containing the lake. As long as enough beans are scattered, the proportion of beans landing in the water \(\times\) the total area of the square gives an approximation of the lake's area.

Why is "scattering beans" reliable? It is entirely thanks to the Law of Large Numbers: in a random experiment, as the number of trials \(n\) approaches infinity, the sample arithmetic mean converges to the theoretical expected value \(\mathbb{E}[X]\) (often written \(\mu\)):

\[ \lim_{n \to \infty} \frac{1}{n} \sum_{i=1}^{n} X_i = \mathbb{E}[X] \]

Therefore, Monte Carlo methods can substitute computation with sampling and approximate probability with frequency.
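As a quick illustration of "substituting computation with sampling," the following minimal sketch estimates \(\pi\) by scattering random points over the unit square and counting how many land inside the quarter circle, exactly the "scattering beans" idea:

import random

def estimate_pi(n_samples=1_000_000):
    """Monte Carlo estimate of pi: 4 * (fraction of points inside the quarter circle)."""
    inside = 0
    for _ in range(n_samples):
        x, y = random.random(), random.random()   # scatter a "bean" in the unit square
        if x * x + y * y <= 1.0:                  # did it land inside the quarter circle?
            inside += 1
    return 4.0 * inside / n_samples               # proportion inside * total area

print(estimate_pi())  # approaches 3.14159... as n_samples grows (Law of Large Numbers)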

In reinforcement learning, we want to know the value \(V(s)\) of state \(s\), and \(V(s)\) is defined as the expected return starting from state \(s\). Since we cannot compute this expectation directly, we invoke the Law of Large Numbers: run multiple episodes and average the returns.

In reinforcement learning, the goal of prediction is: given a policy \(\pi\), estimate the value \(V_\pi(s)\) of each state under that policy.

Core Workflow:

  1. Complete sampling: The agent must follow policy \(\pi\) and run a complete episode until reaching a terminal state.
  2. Record returns: Compute the total (discounted) return \(G_t\) following each state \(s_t\) in the episode.
  3. Average: As more episodes are played, store all \(G_t\) values received for each state and compute their average.

Note: Monte Carlo methods are model-free. They do not need to know the environment's transition probabilities \(P\); learning happens entirely through firsthand experience.

First-Visit Variant

First-visit MC prediction is the most classic variant of Monte Carlo prediction. It addresses a specific detail: what if the agent visits the same state multiple times within a single episode?

For example, suppose you are navigating a maze: you pass through the "living room" state, wander around, and return to the "living room" before finally exiting the maze. Under the first-visit rule, within this episode we only count the return \(G\) obtained after the first entry into the "living room." Returns from subsequent visits to the "living room" in the same episode are ignored.

This is because each episode's "first visit" return is an independent, identically distributed sample of the return whose expectation is \(V_\pi(s)\); this makes it statistically straightforward to prove convergence to the true expected value (the estimate is unbiased).

# First-visit MC Prediction
# Goal: Estimate the state-value function V(s) under policy pi

Initialize:
    gamma = 0.9                # Discount factor (set as appropriate)
    For all states s:
        V(s) = 0               # Initial value
        Returns(s) = []        # List storing the "total return" for state s across episodes

Loop forever (for each episode):
    1. Generate an episode following policy pi, recording:
       State sequence: S0, S1, S2, ..., ST
       Reward sequence: R1, R2, ..., RT

    2. G = 0                   # Initialize return

    3. Traverse the episode backwards (t = T-1, T-2, ..., 0):
       G = gamma * G + R_{t+1} # Compute discounted return from time t

       # [KEY: First-visit check]
       If state St did not appear in the preceding steps (S0, S1, ..., S_{t-1}):
           Append G to Returns(St)
           V(St) = average(Returns(St))  # Update value: average of all returns for this state

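As a runnable counterpart to the pseudocode above, here is a minimal sketch of first-visit MC prediction on a tiny hypothetical environment: a 5-state random walk where the policy moves left or right with equal probability and only the rightmost terminal state pays a reward of 1. The environment is an illustrative stand-in, not an example from this text:

import random
from collections import defaultdict

# Hypothetical environment: states 0..4, where 0 and 4 are terminal.
# Reaching state 4 yields reward +1; every other transition yields 0.
TERMINAL = {0, 4}

def generate_episode(start=2):
    """Roll out one complete episode under pi (uniform left/right), returning (states, rewards)."""
    states, rewards = [start], []
    s = start
    while s not in TERMINAL:
        s += random.choice([-1, 1])             # action sampled from pi
        rewards.append(1.0 if s == 4 else 0.0)  # R_{t+1}
        states.append(s)
    return states, rewards

def first_visit_mc_prediction(n_episodes=10_000, gamma=1.0):
    returns = defaultdict(list)                 # Returns(s)
    V = defaultdict(float)
    for _ in range(n_episodes):
        states, rewards = generate_episode()
        G = 0.0
        for t in range(len(rewards) - 1, -1, -1):   # traverse the episode backwards
            G = gamma * G + rewards[t]
            if states[t] not in states[:t]:         # first-visit check: S_t not seen earlier
                returns[states[t]].append(G)
                V[states[t]] = sum(returns[states[t]]) / len(returns[states[t]])
    return dict(V)

print(first_visit_mc_prediction())  # V(1), V(2), V(3) approach 0.25, 0.5, 0.75

With \(\gamma = 1\), the true value of each non-terminal state is simply the probability of exiting on the right, which is why \(V(1), V(2), V(3)\) converge to 0.25, 0.5, and 0.75.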

Monte Carlo Control

Off-Policy via Importance Sampling

In off-policy learning, we collect data (e.g., returns \(G\)) using policy \(b\), but we want to evaluate policy \(\pi\). Because the two policies assign different probabilities to the same action, we cannot simply treat the average score from policy \(b\) as the score for policy \(\pi\). This is where importance sampling is needed to correct the bias.

Importance sampling is a statistical technique for estimating expected values under one distribution using data from another distribution. In reinforcement learning, it computes the probability ratio of the two policies producing the same trajectory (the importance sampling ratio) and uses it to weight the return \(G\).

Let the target policy's probability of selecting an action be \(\pi(a|s)\) and the behavior policy's be \(b(a|s)\). The importance sampling ratio for the trajectory from time \(t\) to \(T-1\) is:

\[ \rho_{t:T-1} = \prod_{k=t}^{T-1} \frac{\pi(A_k|S_k)}{b(A_k|S_k)} \]

We then multiply the actual return \(G_t\) by this ratio. If the target policy assigns a higher probability to this trajectory than the behavior policy, \(\rho > 1\), the return's weight is amplified; otherwise, it is diminished.

The estimated value is:

\[ V(s) = \frac{\sum \rho_{t:T-1} G_t}{\text{number of samples}} \]

For example, suppose you want to estimate the "average height of professional basketball players (policy \(\pi\))" but you only have "height data from ordinary college students (policy \(b\))." You cannot simply compute the average height of college students. Here you need importance sampling: when you draw a 2-meter-tall person from the college students, you assign their data a very high weight (because this is common among professional players — \(\pi\) is large, \(b\) is small, so the ratio exceeds 1); when you draw a 1.7-meter-tall person, you assign a very low weight (because this is rare among professional players). Through this weighting, you can accurately estimate the average height of basketball players using college student data.
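A minimal sketch of ordinary importance sampling on a deliberately tiny, hypothetical one-step problem: episodes are generated by a uniform behavior policy \(b\), and the ratio \(\rho = \pi(a|s)/b(a|s)\) re-weights each return so that the average estimates the value under the target policy \(\pi\):

import random

# Hypothetical one-step problem: in the single state "s", action "A" pays 1 and "B" pays 0.
REWARD = {"A": 1.0, "B": 0.0}

b  = {"A": 0.5, "B": 0.5}   # behavior policy: exploratory (uniform)
pi = {"A": 0.9, "B": 0.1}   # target policy we actually want to evaluate

def ordinary_importance_sampling(n_episodes=100_000):
    total = 0.0
    for _ in range(n_episodes):
        a = random.choices(list(b), weights=list(b.values()))[0]  # data collected by b
        G = REWARD[a]                                             # return of this episode
        rho = pi[a] / b[a]                                        # importance sampling ratio
        total += rho * G                                          # re-weight the return
    return total / n_episodes                                     # divide by number of samples

print(ordinary_importance_sampling())  # approaches V_pi(s) = 0.9 * 1 + 0.1 * 0 = 0.9

Averaging the unweighted returns would instead give about 0.5, the value under \(b\); the ratio is exactly what corrects this mismatch.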

Monte Carlo ES

Monte Carlo ES essentially combines "prediction" and "improvement" together, following the logic of Generalized Policy Iteration (GPI):

MC Prediction estimates V or Q under a given policy; MC Control does not fix the policy — its goal is to learn the optimal policy.

In the Model-Free + Control setting, we almost always learn \(Q(s, a)\) directly, and once we have Q, the policy can be derived greedily:

\[ \pi(s) = \arg\max_{a} Q(s, a) \]

If every episode starts from a fixed starting point and the policy becomes increasingly greedy, the agent may never reach certain state-action combinations, leaving some \((s,a)\) pairs unsampled. This means Q will never be accurate, and the policy is very likely to get stuck at a local optimum.

We address this problem with ES (Exploring Starts), which requires that every state-action pair has a nonzero probability of being selected as the episode's initial state \(S_0\) and initial action \(A_0\). In other words, any \((s,a)\) pair has the chance to be the "opening move." This ensures that all \((s,a)\) pairs are continually explored, allowing the Q estimates to converge to their true values and the policy to genuinely reach optimality.
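The requirement itself is simple to state in code. A minimal sketch, assuming hypothetical state and action sets, is just to draw the opening pair uniformly at random:

import random

# Hypothetical state and action sets
STATES  = ["s0", "s1", "s2"]
ACTIONS = ["up", "down", "left", "right"]

def exploring_start():
    """Exploring starts: every (s, a) pair has a nonzero chance of opening the episode."""
    return random.choice(STATES), random.choice(ACTIONS)

s0, a0 = exploring_start()   # the episode then follows pi from the second step onward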

MC-ES performs two tasks in each episode:

(1) Policy Evaluation: Estimate \(Q(s,a)\) using "First-visit MC." If the same \((s,a)\) appears multiple times in an episode trajectory, only the return G following its first occurrence is used:

\[ Q(s, a) \leftarrow \text{average of returns following first visit of } (s, a) \]

(2) Policy Improvement: Immediately after the episode ends, make \(\pi\) greedy:

\[ \pi(s) \leftarrow \arg\max_{a} Q(s, a) \]

Pseudocode:

Initialize:
    For all s, a:
        Q(s,a) = 0
        Returns(s,a) = []
    For all s:
        π(s) = random (or arbitrarily initialized)

Repeat for each Episode:
    1) ES random start:
        Randomly select S0
        Randomly select A0
    2) Run a complete episode using "A0 + subsequently follow π," recording the trajectory:
        (S0,A0,R1), (S1,A1,R2), ..., (ST-1,AT-1,RT)
    3) Compute returns G backwards and perform First-visit (s,a) updates:
        G=0
        for t=T-1..0:
            G = γG + R_{t+1}
            If (St,At) did not appear in (S0,A0)...(S_{t-1},A_{t-1}):
                Returns(St,At).append(G)
                Q(St,At)=average(Returns(St,At))
    4) Policy improvement (greedy):
        For each state s visited in the episode:
            π(s)=argmax_a Q(s,a)


\[ \text{MC ES} = \underbrace{\text{MC Prediction}}_\text{estimate Q} + \underbrace{\text{Policy Improvement}}_\text{greedy policy} + \underbrace{\text{Exploring Starts}}_\text{exploration mechanism} \]

Specifically:

  • Prediction component: It uses the First-visit MC logic described earlier to estimate the value of state-action pairs \(Q(s, a)\).
  • Improvement component: After each episode, it immediately updates the policy so that each state selects the action with the highest \(Q\) value (greedy).
  • ES component: To prevent the agent from being "stuck in its ways" (only running along previously discovered good paths), it mandates that each episode's starting point (state and action) is randomly chosen.


On-Policy First-Visit MC Control

In the experiment in programming/grid_world.py, we observed that:

  1. Since the policy only stores the single best action for each state, if we do not use ES and the agent strictly follows the policy, it becomes extremely short-sighted and stubborn.
  2. Even with random initialization, there is still a high chance that the agent gets stuck (e.g., loops indefinitely), so a maximum step limit must be imposed.

Due to these issues, in practice we almost always use \(\epsilon\)-greedy rather than ES. The agent dutifully starts from the initial state each time. However, at each cell it visits, instead of strictly following the highest-scoring recommendation in the Q-table, it does the following:

  • With probability \(1 - \epsilon\) (e.g., 90%), it behaves obediently and selects the action with the highest \(Q\) value (greedy).
  • With probability \(\epsilon\) (e.g., 10%), it suddenly "rebels" and blindly picks a random action (exploration).

This method is highly practical: it does not require manipulating the environment's initial state and maintains exploration at all times. However, note that its optimal solution is not the "absolutely perfect" optimum, but rather the best solution under the constraint of having a 10% chance of acting randomly.

MC Control using the \(\epsilon\)-greedy mechanism is formally known as On-policy First-Visit MC Control.
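Here is a minimal sketch of the improvement step this algorithm performs after each episode: instead of making \(\pi\) purely greedy, every visited state keeps an \(\epsilon\)-soft distribution over actions, so the behavior policy and the target policy remain the same exploratory policy (all names and numbers below are hypothetical):

def improve_policy_epsilon_soft(Q, policy, visited_states, actions, epsilon=0.1):
    """Make pi epsilon-soft greedy w.r.t. the updated Q for every state visited this episode."""
    for s in visited_states:
        best = max(actions, key=lambda a: Q.get((s, a), 0.0))
        for a in actions:
            if a == best:
                policy[(s, a)] = 1 - epsilon + epsilon / len(actions)  # bulk of the probability
            else:
                policy[(s, a)] = epsilon / len(actions)                # exploration share
    return policy

# Hypothetical Q-table after an episode
Q = {("s0", "A"): 5.0, ("s0", "B"): 1.0}
print(improve_policy_epsilon_soft(Q, {}, ["s0"], ["A", "B"]))
# {('s0', 'A'): 0.95, ('s0', 'B'): 0.05}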

In reinforcement learning, the agent effectively has two "minds" (i.e., two policies):

  1. Behavior Policy: The policy that actually controls how the agent moves through the maze.
  2. Target Policy: The policy the agent silently evaluates and optimizes in its head.

It is called on-policy because, in this algorithm, the behavior policy = the target policy = the \(\epsilon\)-greedy policy.

  • When the agent explores the maze, it uses the slightly randomized \(\epsilon\)-greedy policy to collect experience (ensuring sufficient exploration and avoiding infinite loops).
  • After completing an episode, it evaluates and updates this same randomized policy.

This is "unity of knowledge and action" — it learns from exactly how it acts, hence the name on-policy.

Where there is "on," there is naturally "off." Off-policy means a kind of "split personality": the agent uses a randomized policy to wander through the maze (to encounter more pitfalls and collect more data), but it is secretly optimizing and recording a separate perfectly deterministic, purely greedy policy. The famous Q-Learning algorithm is the quintessential off-policy method. We will study off-policy methods in later chapters.

