Learning and Planning

The elegance of Dyna-Q lies in the fact that it uses the same update rule to handle two fundamentally different sources of information:

  • Learning: Update the Q-table using real samples \((s, a, r, s')\) obtained from the actual environment.
  • Planning: Update the Q-table using simulated samples \((s, a, r, s')\) generated by the learned model.

Both share the same TD update formula in the tabular setting:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha [r + \gamma \max_{a'} Q(s', a') - Q(s, a)] \]

The textbook uses Dyna-Q to convey a key insight: planning is essentially learning from "simulated experience." This unification is most clearly demonstrated in the tabular methods chapter.
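
As a minimal sketch of this unification (in Python, with made-up states, rewards, and hyperparameters), the snippet below applies one and the same tabular backup to a transition observed in the real environment and to a transition produced by a learned model; the only difference is where \((s, a, r, s')\) comes from.

```python
from collections import defaultdict

# Tabular Q-function and hyperparameters (illustrative values).
ACTIONS = [0, 1, 2, 3]
Q = defaultdict(float)                 # (state, action) -> value
alpha, gamma = 0.1, 0.95

def td_update(s, a, r, s_next):
    """One Q-learning backup, applied identically to real and simulated samples."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

# Learning: a transition actually observed in the environment (made-up values).
td_update(s=(0, 0), a=1, r=0.0, s_next=(0, 1))

# Planning: the same update, but (r, s') now comes from a learned model.
model = {((0, 0), 1): (0.0, (0, 1))}   # (s, a) -> (r, s'), remembered earlier
r_sim, s_next_sim = model[((0, 0), 1)]
td_update(s=(0, 0), a=1, r=r_sim, s_next=s_next_sim)
```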

Dyna-Q is not an entirely "new algorithm" but rather an architecture.

  • It wraps a model-based shell around a model-free core (such as Q-learning).
  • The reason it is introduced in the tabular methods chapter is to illustrate how introducing "bias (from model estimates)" can reduce "variance (from sampling noise)," thereby improving sample efficiency.

Core Concepts

Trade-off between Model-based and Model-free

The sole criterion for deciding whether an algorithm is model-free is whether it acquires information by sampling through interaction with the real environment, rather than relying on an environment model (e.g., transition probabilities \(P\)) for mental planning. The methods we discussed earlier, MC and TD (Q-learning and SARSA), are all model-free:

  • MC approach: The agent interacts with the environment under a given policy until the episode terminates, recording the actual rewards at every step. It does not care how the environment transitions; it only cares about the sequence of rewards that actually occurred.
  • TD approach: Although it updates after just one step, that step's reward \(R_t\) and next state \(S_{t+1}\) are both obtained from the real environment. It does not need to know "what is the probability of ending up in each state given this action" — it only observes where it actually ended up.

The trade-off between model-based and model-free algorithms is primarily reflected in the tension between sample complexity (learning speed) and expected return (final performance).

Sample Complexity — "How fast does it learn?"

  • Model-based advantage: With access to an environment model, the agent can conduct additional simulated interactions. This "mental rehearsal" dramatically reduces the need for real-world samples, typically resulting in lower sample complexity.
  • Model-free disadvantage: The agent relies entirely on interactions with the real environment. Without model assistance, it requires massive amounts of real data to overcome the high variance introduced by stochasticity, leading to higher sample complexity.

Expected Return — "How well does it learn?"

  • Model-free advantage: It updates directly from real environment feedback without relying on any subjective assumptions. Although slower, the converged policy tends to be closest to the true optimum, typically achieving higher expected return.
  • Model-based disadvantage: The learned environment model may be inaccurate (model bias) and cannot fully substitute for the real environment. If the model itself is flawed, the optimal policy learned in the "simulated environment" will perform poorly in the real one, so the expected return may be inferior to that of model-free methods.

Summary:

| Dimension | Model-free | Model-based |
| --- | --- | --- |
| Interaction Target | Real environment only | Real environment + learned model |
| Sample Requirement | Large (high sample complexity) | Small (low sample complexity) |
| Policy Quality | High (unbiased, higher ceiling) | Potentially lower (limited by model accuracy) |
| Computational Cost | Lower (direct value-function updates) | Higher (must learn and maintain a model) |


Dyna-Q

Dyna-Q is an architecture proposed by Richard Sutton, one of the founding figures of reinforcement learning. Its core logic can be summarized in one sentence: While sampling from the real environment, simultaneously learn an environment model, then use that model for "daydream"-style simulated training.

Its workflow:

  1. Direct RL: Just like standard Q-learning, take one step in the real environment, receive a real reward \(R\), and perform one update to the \(Q\) table.
  2. Model Learning: Based on that real experience, update the agent's understanding of the world. For example, memorize: "In state \(s\), taking action \(a\) leads to state \(s'\) with reward \(R\)."
  3. Planning: This is the soul of Dyna-Q. During idle time, the agent randomly samples previously experienced state-action pairs from memory and uses the learned model to conduct simulated interactions, updating the \(Q\) table with each simulated transition.

The update target remains the familiar TD form:

\[ Q(s, a) \leftarrow Q(s, a) + \alpha [R + \gamma \max_{a'} Q(s', a') - Q(s, a)] \]

The only difference is that \((s, a, R, s')\) can come from either real sampling or model simulation.
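
To make the three steps above concrete, here is a minimal tabular Dyna-Q loop in Python. The environment interface (`env.reset()` and `env.step(a)` returning `(s', r, done)`), the action set, and constants such as `N_PLANNING` are assumptions for illustration rather than a fixed API; the backup function is the same one used for both real and simulated transitions.

```python
import random
from collections import defaultdict

ACTIONS = [0, 1, 2, 3]                 # assumed discrete action set
alpha, gamma, epsilon = 0.1, 0.95, 0.1
N_PLANNING = 10                        # simulated backups per real step (assumed)

Q = defaultdict(float)                 # (s, a) -> value
model = {}                             # (s, a) -> (r, s'); deterministic tabular model

def td_update(s, a, r, s_next):
    """Same Q-learning backup for real and simulated transitions."""
    best_next = max(Q[(s_next, a2)] for a2 in ACTIONS)
    Q[(s, a)] += alpha * (r + gamma * best_next - Q[(s, a)])

def epsilon_greedy(s):
    """Behaviour policy used in the real environment."""
    if random.random() < epsilon:
        return random.choice(ACTIONS)
    return max(ACTIONS, key=lambda a: Q[(s, a)])

def dyna_q_episode(env):
    """One episode of tabular Dyna-Q against an env with reset()/step(a) -> (s', r, done)."""
    s = env.reset()
    done = False
    while not done:
        a = epsilon_greedy(s)
        s_next, r, done = env.step(a)          # assumed environment interface

        # 1. Direct RL: learn from the real transition.
        td_update(s, a, r, s_next)

        # 2. Model learning: memorize what the environment just did.
        model[(s, a)] = (r, s_next)

        # 3. Planning: replay N_PLANNING simulated transitions from the model.
        for _ in range(N_PLANNING):
            s_p, a_p = random.choice(list(model.keys()))
            r_p, s_next_p = model[(s_p, a_p)]
            td_update(s_p, a_p, r_p, s_next_p)

        s = s_next
```

The only structural addition relative to plain Q-learning is the planning loop: after every real step, `N_PLANNING` extra backups are replayed from the model, which is where the gain in sample efficiency comes from.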


Although Dyna-Q is theoretically elegant, it faces several practical bottlenecks in the modern deep learning (Deep RL) era:

  1. It is a product of the "tabular era." Dyna-Q was originally designed for discrete, finite state spaces (e.g., grid mazes). In simple grid worlds, building a model (storing a table) is straightforward. However, in modern complex tasks (such as Atari games or robotic control), the state space is continuous and effectively infinite, making it extremely difficult to build an accurate environment model.
  2. The fatal blow of model bias. As the trade-off discussion above suggests, environment models are often inaccurate. In Dyna-Q, if the model is poorly learned, the agent drifts further and further down the wrong path during the "daydream" (planning) phase. With deep function approximation, this bias can be further amplified by the neural network, eventually causing policy collapse.
  3. The tension between sample efficiency and computational cost. Modern model-free algorithms (such as PPO and SAC) have become powerful enough through massively parallel sampling. Practitioners have found that rather than expending enormous computational resources to train a potentially flawed model (the Dyna-Q approach), it is often more effective to simply run more parallel environment instances and learn directly from real experience (the model-free approach).

Nevertheless, Dyna-Q has not disappeared — it has evolved into more advanced forms:

  • MBPO (Model-Based Policy Optimization): Can be viewed as the deep learning version of Dyna-Q.
  • World Models: Not only learn a model, but also use it to generate "dreams."
  • DreamerV3: One of the most powerful model-based algorithms to date, whose core idea is essentially a modern, enhanced version of Dyna-Q.
