
Offline Reinforcement Learning

In standard reinforcement learning, an agent must continuously interact with the environment to collect experience and improve its policy. However, in many real-world scenarios, interacting with the environment is extremely expensive, dangerous, or even impossible. Offline RL (also known as Batch RL) poses a fundamentally different problem: Can we learn a good policy entirely from a pre-collected, static dataset without any additional environment interaction?

The appeal of this problem setting is enormous — the medical field has vast patient treatment records, autonomous driving has massive volumes of human driving data, robotics has extensive teleoperation data, and recommendation systems have billions of user interaction logs. If we could learn good decision-making policies directly from such data, there would be no need to risk patient lives or vehicle collisions for trial-and-error learning.

Problem Formulation

The input to Offline RL is a static dataset \(\mathcal{D}\), collected by some (unknown) behavior policy \(\pi_\beta\) interacting with the environment:

\[ \mathcal{D} = \{(s_i, a_i, r_i, s_i')\}_{i=1}^{N} \]

where \(s_i\) is the state, \(a_i\) is the action selected by the behavior policy, \(r_i\) is the reward received, and \(s_i'\) is the next state transitioned to. The behavior policy \(\pi_\beta\) can be a human expert, a previous RL policy, or even a random policy — it can also be a mixture of multiple policies.

Objective: Using only the dataset \(\mathcal{D}\), learn a policy \(\pi\) that maximizes expected return when deployed in the real environment:

\[ \pi^* = \arg\max_\pi \mathbb{E}_\pi \left[ \sum_{t=0}^{T} \gamma^t r_t \right] \]

The key constraint is: no interaction with the environment is allowed during learning.
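
To make the setting concrete, here is a minimal sketch of how such a static dataset is typically stored and sampled during training; the array shapes, field names, and the `sample_batch` helper are illustrative assumptions rather than any particular library's API.

```python
import numpy as np

# A static offline dataset: N transitions (s, a, r, s') stored as parallel arrays.
# Shapes and field names are illustrative (17-dim states, 6-dim continuous actions).
N, state_dim, action_dim = 100_000, 17, 6
dataset = {
    "observations":      np.zeros((N, state_dim),  dtype=np.float32),
    "actions":           np.zeros((N, action_dim), dtype=np.float32),
    "rewards":           np.zeros((N,),            dtype=np.float32),
    "next_observations": np.zeros((N, state_dim),  dtype=np.float32),
    "terminals":         np.zeros((N,),            dtype=bool),
}

rng = np.random.default_rng(0)

def sample_batch(data, batch_size=256):
    """Sample a random minibatch of transitions: the only data an offline RL
    algorithm ever sees during training (no environment interaction)."""
    idx = rng.integers(len(data["rewards"]), size=batch_size)
    return {k: v[idx] for k, v in data.items()}
```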

The overall methodological landscape can be understood through the following diagram:

+-----------------------------------------------------+
|              Offline RL Method Taxonomy              |
+-----------------------------------------------------+
|                                                     |
|  1. Conservative Value Methods                      |
|     +--------+    +---------+    +----------+       |
|     |  CQL   |    |   IQL   |    |  TD3+BC  |       |
|     +--------+    +---------+    +----------+       |
|     Penalize OOD  Avoid          Add BC             |
|     Q-values      querying OOD   regularizer        |
|                                                     |
|  2. Sequence Modeling                               |
|     +------------------------+                      |
|     |  Decision Transformer  |                      |
|     +------------------------+                      |
|     Treat RL as a sequence prediction problem       |
|                                                     |
+-----------------------------------------------------+

The Core Challenge: Distribution Shift

Intuitively, you might think: "What's so hard about Offline RL? Just take an off-policy algorithm (like SAC or DQN) and train it on the dataset, right?"

The answer is: no. Directly applying standard off-policy algorithms to a static dataset almost always leads to catastrophic failure. The fundamental cause is distribution shift, which manifests at several levels.

Out-of-Distribution (OOD) Actions

In standard Q-learning, we need to compute the target value:

\[ y = r + \gamma \max_{a'} Q_\theta(s', a') \]

The term \(\max_{a'} Q_\theta(s', a')\) requires maximizing over all possible actions. But in Offline RL, the dataset \(\mathcal{D}\) only contains actions that the behavior policy \(\pi_\beta\) actually executed. For actions that \(\pi_\beta\) never selected (i.e., OOD actions), the values of \(Q_\theta\) are entirely determined by the neural network's extrapolation, with no data to support them.

Q-Value Overestimation

The problem goes beyond mere inaccuracy. The \(\max\) operator systematically selects actions whose Q-values are overestimated — this is a form of selection bias. In online RL, this issue is naturally mitigated: when an agent selects and executes an action with an overestimated Q-value, it receives the true (lower) reward, which corrects the Q-value estimate. But in Offline RL, we never get to execute these actions, so overestimation is never corrected.
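
A tiny numerical experiment illustrates this selection bias: even if each per-action Q-estimate is merely noisy (unbiased for every action), taking the max over the noisy estimates is biased upward. The numbers below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

true_q = np.array([1.0, 2.0, 3.0])   # true Q(s', a') for three candidate actions
noise_std = 1.0                       # per-action estimation error (e.g. OOD extrapolation)

# Draw many independent noisy Q-estimates and take the max over actions each time.
noisy_q = true_q + rng.normal(0.0, noise_std, size=(100_000, 3))
print("max of true Q-values:            ", true_q.max())                 # 3.0
print("mean of max over noisy estimates:", noisy_q.max(axis=1).mean())   # noticeably above 3.0
```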

Extrapolation Error Accumulation

Worse still, Q-learning is a bootstrapping process — current Q-values are updated based on Q-values at the next state. If the Q-value at the next state is overestimated due to OOD actions, this overestimation propagates backward through Bellman updates, eventually corrupting the entire Q-function.

Mathematically, suppose the estimation error at some state-action pair \((s', a')\) is \(\epsilon(s', a')\). After one Bellman update:

\[ Q(s, a) \leftarrow r + \gamma \left[ Q^*(s', a') + \epsilon(s', a') \right] \]

After \(k\) steps of backward propagation through the Bellman updates, these per-step errors compound: the accumulated error can grow on the order of \(\sum_{i=1}^{k} \gamma^i \epsilon_{\max} = O\!\left(\frac{\epsilon_{\max}}{1-\gamma}\right)\), where \(\epsilon_{\max}\) is the maximum single-step error. In deep neural networks, \(\epsilon_{\max}\) in OOD regions can be very large, and because bootstrapping keeps reusing these corrupted estimates, the entire Q-function can diverge.

An Intuitive Analogy

This is like a student who has only studied the worked examples in a textbook (the dataset) and then encounters a completely unfamiliar type of problem on an exam (OOD actions). The student might fabricate a confident-looking answer (Q-value overestimation), and there is no way to receive feedback from the exam results to correct themselves (no environment interaction).

Conservative Methods: Constraining Policy Deviation

To address distribution shift, the first class of methods follows a core idea: prevent the learned policy from deviating too far from the behavior policy in the dataset. This can be achieved by constraining the Q-function, constraining the policy itself, or adding regularization terms to the policy optimization objective.

CQL (Conservative Q-Learning)

CQL was proposed by Kumar et al. in 2020 ("Conservative Q-Learning for Offline Reinforcement Learning") and is one of the most important works in Offline RL. Its core idea is remarkably elegant: rather than trying hard to estimate Q-values accurately, learn Q-values that are systematically lower — a lower bound on the true Q-values.

Why is a lower bound desirable? Consider this: if our Q-estimates are a lower bound on the true Q-values, then the optimal action selected under this pessimistic Q-function can only perform better in the real environment, never worse. This is the pessimism principle at work in Offline RL.

CQL Loss Function

CQL adds a conservative regularizer on top of the standard TD loss:

\[ L_{\text{CQL}}(\theta) = \underbrace{\alpha \left( \mathbb{E}_{s \sim \mathcal{D}, a \sim \mu(a|s)} [Q_\theta(s, a)] - \mathbb{E}_{s, a \sim \mathcal{D}} [Q_\theta(s, a)] \right)}_{\text{conservative regularizer}} + \underbrace{\frac{1}{2} \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}} \left[ \left( Q_\theta(s,a) - \hat{\mathcal{B}}^\pi \hat{Q}^k(s,a) \right)^2 \right]}_{\text{standard TD loss}} \]

where \(\mu(a|s)\) is a distribution used for sampling OOD actions (typically uniform or the current policy), \(\hat{\mathcal{B}}^\pi\) is the Bellman operator, and \(\alpha > 0\) is a hyperparameter controlling the degree of conservatism.

Intuition: The regularizer accomplishes two things:

  1. First term \(\mathbb{E}_{a \sim \mu}[Q_\theta(s, a)]\): For actions sampled from \(\mu\) (which may include many OOD actions), penalize their Q-values — push them down.
  2. Second term \(\mathbb{E}_{a \sim \mathcal{D}}[Q_\theta(s, a)]\): For actions that actually appear in the dataset, boost their Q-values — push them up.

This push-and-pull achieves the effect of "high Q-values for in-distribution actions, low Q-values for out-of-distribution actions."

Q(s, a)
  ^
  |     * (OOD actions: pushed down)
  |   *
  | *
  |           +----------+
  |           | dataset  |  <-- Q-values pushed up
  |           | actions  |
  |           +----------+
  +--*--*--*-----------*--*---> a
        OOD            OOD

CQL(H) Variant

In practice, using a uniform distribution to sample OOD actions is inefficient: in high-dimensional action spaces, uniform sampling is unlikely to hit the "dangerous" OOD regions. The CQL(H) variant instead treats \(\mu\) as an adversarial distribution, maximizing the regularizer over \(\mu\) with an added entropy term; the optimal \(\mu\) then concentrates on actions with high Q-values (for discrete actions, the first term reduces to a log-sum-exp over Q-values):

\[ \min_Q \max_\mu \; \alpha \left( \mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu(a|s)} [Q(s,a)] - \mathbb{E}_{s, a \sim \mathcal{D}} [Q(s,a)] \right) + \frac{1}{2} L_{\text{TD}}(Q) + \mathcal{H}(\mu) \]

By maximizing over \(\mu\), CQL(H) automatically identifies the most "dangerous" OOD actions (those with the highest Q-values) and specifically suppresses them. This is far more efficient than blindly suppressing all out-of-distribution actions.
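
As a rough illustration, here is a minimal PyTorch-style sketch of a CQL(H)-flavored loss for a discrete action space, where the inner maximization over \(\mu\) has a closed form and the conservative term becomes a log-sum-exp over Q-values. `q_net`, `target_q_net`, and the batch layout are hypothetical placeholders, not the authors' reference implementation.

```python
import torch
import torch.nn.functional as F

def cql_loss(q_net, target_q_net, batch, alpha=1.0, gamma=0.99):
    """CQL(H)-style loss sketch for discrete actions.

    q_net(s) is assumed to return Q-values of shape [batch, num_actions].
    batch = (s, a, r, s_next, done) with a as integer action indices.
    """
    s, a, r, s_next, done = batch
    q_all = q_net(s)                                            # [B, num_actions]
    q_data = q_all.gather(1, a.long().unsqueeze(1)).squeeze(1)  # Q of dataset actions

    # Conservative regularizer: with the entropy regularizer, the inner max over mu
    # becomes a log-sum-exp over all actions; dataset actions are pushed back up.
    conservative = (torch.logsumexp(q_all, dim=1) - q_data).mean()

    # Standard bootstrapped TD loss (a plain Q-learning target, for simplicity).
    with torch.no_grad():
        target = r + gamma * (1.0 - done) * target_q_net(s_next).max(dim=1).values
    td_loss = F.mse_loss(q_data, target)

    return alpha * conservative + 0.5 * td_loss
```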

Theoretical Guarantees of CQL

Kumar et al. proved that, under certain conditions, the Q-function learned by CQL satisfies:

\[ \hat{Q}^{\text{CQL}}(s, a) \leq Q^\pi(s, a), \quad \forall s, a \]

That is, CQL's Q-values are a lower bound on the true Q-values. This means that policy evaluation using CQL's Q-values yields conservative performance estimates — actual performance will never be worse than estimated.

Sensitivity to \(\alpha\)

A practical issue with CQL is that the hyperparameter \(\alpha\) is highly sensitive:

  • \(\alpha\) too large: Q-values are excessively suppressed, the policy becomes overly conservative, and nearly degenerates into Behavioral Cloning.
  • \(\alpha\) too small: The conservative penalty is insufficient, Q-values still suffer from overestimation, and the policy may select OOD actions.
  • The optimal \(\alpha\) depends on dataset quality and coverage, requiring extensive tuning.

IQL (Implicit Q-Learning)

IQL was proposed by Kostrikov et al. in 2022 ("Offline Reinforcement Learning with Implicit Q-Learning") and takes a fundamentally different approach to circumventing distribution shift.

CQL's approach: Penalize Q-values for OOD actions. IQL's approach: Never query Q-values for OOD actions in the first place.

This is a fundamental shift. In standard Q-learning, computing target values requires \(\max_{a'} Q(s', a')\), which involves maximizing over all actions — precisely the step that introduces the OOD problem. IQL completely bypasses this operation.

Expectile Regression

The key technique in IQL is expectile regression, which uses a separate value function \(V_\psi(s)\) to approximate \(\max_a Q(s, a)\) using only \((s, a)\) pairs present in the dataset.

The expectile regression loss is:

\[ L_\tau^V(\psi) = \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ L_\tau^2 \left( Q_{\hat{\theta}}(s, a) - V_\psi(s) \right) \right] \]

where the asymmetric loss is:

\[ L_\tau^2(u) = |\tau - \mathbf{1}(u < 0)| \cdot u^2 \]

Here \(\tau \in (0, 1)\) is a key hyperparameter.

Intuition: The effect of this asymmetric quadratic loss is that as \(\tau\) approaches \(1\), \(V_\psi(s)\) is "pulled toward" the upper expectile of \(Q(s, a)\). Imagine that for a given state \(s\), the dataset may contain multiple different actions \(a_1, a_2, \ldots\) with corresponding Q-values. Ordinary mean squared error would cause \(V(s)\) to learn the mean of these Q-values, whereas expectile regression causes \(V(s)\) to learn a value close to the maximum (but not the exact max).

         Q(s, a1) = 5.0
         Q(s, a2) = 3.0    Ordinary MSE: V(s) ≈ 4.5 (the mean)
         Q(s, a3) = 8.0    Expectile (τ=0.9): V(s) ≈ 6.8 (close to the maximum)
         Q(s, a4) = 2.0    Expectile (τ=0.99): V(s) ≈ 7.9 (even closer to the maximum)

The crucial point is that this "approximate max" operation is performed only over actions that appear in the dataset — it never queries the Q-value of an unseen OOD action.
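
The following self-contained sketch fits a scalar \(V\) to the four Q-values above by gradient descent on the expectile loss, showing how the fitted value moves from the mean toward the maximum as \(\tau\) grows (the optimizer settings are arbitrary).

```python
import numpy as np

def expectile_fit(q_values, tau, lr=0.05, steps=5000):
    """Fit a scalar v to q_values under the asymmetric loss
    L_tau(u) = |tau - 1(u < 0)| * u^2, where u = q - v."""
    v = float(np.mean(q_values))
    for _ in range(steps):
        u = q_values - v
        weight = np.where(u < 0, 1.0 - tau, tau)   # |tau - 1(u < 0)|
        v -= lr * np.mean(-2.0 * weight * u)       # gradient of the loss w.r.t. v
    return v

q = np.array([5.0, 3.0, 8.0, 2.0])
print(expectile_fit(q, tau=0.5))    # ≈ 4.5, the ordinary mean
print(expectile_fit(q, tau=0.9))    # ≈ 6.8, pulled toward the maximum
print(expectile_fit(q, tau=0.99))   # ≈ 7.9, very close to max(q) = 8
```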

Q-Function Update

With \(V_\psi(s)\) in hand, the Q-function update becomes straightforward:

\[ L^Q(\theta) = \mathbb{E}_{(s, a, r, s') \sim \mathcal{D}} \left[ \left( r + \gamma V_\psi(s') - Q_\theta(s, a) \right)^2 \right] \]

Note that \(V_\psi(s')\) replaces \(\max_{a'} Q(s', a')\). Since \(V_\psi\) is learned via expectile regression over in-distribution actions, no OOD queries are involved.

Policy Extraction: AWR (Advantage Weighted Regression)

After learning the Q-function and V-function, IQL needs to extract an executable policy. It uses Advantage Weighted Regression (AWR):

\[ L^\pi(\phi) = \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \exp \left( \beta \cdot (Q_{\hat{\theta}}(s, a) - V_\psi(s)) \right) \cdot \log \pi_\phi(a|s) \right] \]

where \(\beta\) is an inverse temperature parameter. This amounts to weighted behavioral cloning: state-action pairs with high advantage values receive higher weights, so the policy preferentially imitates "good" actions from the dataset.
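
A minimal sketch of this policy-extraction step might look as follows; `policy`, `q_net`, and `v_net` are placeholder modules, and the weight-clipping constant is a common practical detail rather than part of the formula above.

```python
import torch

def awr_policy_loss(policy, q_net, v_net, batch, beta=3.0, max_weight=100.0):
    """Advantage-weighted regression: weighted behavioral cloning on dataset actions."""
    s, a = batch["observations"], batch["actions"]
    with torch.no_grad():
        advantage = q_net(s, a) - v_net(s)                  # A(s, a) = Q(s, a) - V(s)
        weight = torch.exp(beta * advantage).clamp(max=max_weight)
    log_prob = policy.log_prob(s, a)                        # log pi_phi(a | s), placeholder API
    return -(weight * log_prob).mean()                      # minimize negative weighted log-likelihood
```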

IQL vs CQL

| Dimension | CQL | IQL |
| --- | --- | --- |
| Handling OOD actions | Explicitly penalizes OOD Q-values | Completely avoids querying OOD actions |
| Requires max operation | Yes (in the regularizer) | No |
| Implementation complexity | Higher | Lower |
| Hyperparameter sensitivity | \(\alpha\) is very sensitive | \(\tau\) is relatively stable |
| Theoretical guarantees | Q-value lower bound | No strict lower-bound guarantee |
| Practical performance | Strong, but hard to tune | Strong and stable |

TD3+BC (TD3 with Behavioral Cloning)

TD3+BC was proposed by Fujimoto and Gu in 2021 ("A Minimalist Approach to Offline Reinforcement Learning"). Its design philosophy is: Offline RL does not require complex algorithm design — simply adding a behavioral cloning regularizer to a good online algorithm suffices.

Algorithm Design

TD3+BC is simply the standard TD3 algorithm (Twin Delayed DDPG), with the sole modification of adding a BC term to the actor's policy optimization objective:

\[ \pi = \arg\max_\pi \mathbb{E}_{(s, a) \sim \mathcal{D}} \left[ \lambda Q_\theta(s, \pi(s)) - (\pi(s) - a)^2 \right] \]

The first term is standard Q-value maximization; the second term is behavioral cloning — keeping the policy's output close to actions in the dataset.

Normalization Trick

Naively adding Q-values and BC loss together causes a scale mismatch. TD3+BC uses a simple but effective normalization:

\[ \lambda = \frac{\alpha}{\frac{1}{N} \sum_{(s_i, a_i)} |Q(s_i, \pi(s_i))|} \]

This normalizes by the mean absolute Q-value in the batch, keeping the Q term and BC term at comparable scales. \(\alpha\) is typically set to \(2.5\).
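
Putting the two pieces together, a sketch of the resulting actor loss could look like this; `actor` and `critic` are placeholder modules and the batch layout is assumed.

```python
import torch
import torch.nn.functional as F

def td3_bc_actor_loss(actor, critic, batch, alpha=2.5):
    """TD3+BC actor objective: lambda * Q(s, pi(s)) minus a behavioral-cloning penalty."""
    s, a_data = batch["observations"], batch["actions"]
    a_pi = actor(s)                                   # pi(s)
    q = critic(s, a_pi)                               # Q(s, pi(s))

    # lambda = alpha / mean|Q|: keeps the Q term and the BC term on comparable scales.
    lam = alpha / q.abs().mean().detach()

    bc_loss = F.mse_loss(a_pi, a_data)                # (pi(s) - a)^2
    return -lam * q.mean() + bc_loss                  # ascend on Q, stay close to dataset actions
```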

Why TD3+BC Works

The success of TD3+BC reveals an important insight about Offline RL: much of the complex design in offline RL algorithms (CQL's conservative regularizer, IQL's expectile regression) is fundamentally doing the same thing — preventing the policy from deviating too far from the dataset. And the simplest way to achieve this is to directly add a BC loss.

That said, TD3+BC has limitations. It assumes the action distribution in the dataset is unimodal (since it uses mean squared error as the BC loss), which can be problematic when the dataset contains a mixture of different behavior policies.

Sequence Modeling Methods

Decision Transformer

Decision Transformer (DT) was proposed by Chen et al. in 2021 ("Decision Transformer: Reinforcement Learning via Sequence Modeling") and represents a paradigm shift — completely reframing the reinforcement learning problem as a sequence modeling problem, discarding all traditional RL concepts such as Bellman equations, TD learning, and Q-functions.

Core Idea

The traditional RL approach:

\[ \text{Learn value functions/policies} \xrightarrow{\text{Bellman equation}} \text{Optimal decisions} \]

The Decision Transformer approach:

\[ \text{Sequence prediction} \xrightarrow{\text{Transformer}} \text{Conditionally generated decisions} \]

DT treats a trajectory \((s_0, a_0, r_0, s_1, a_1, r_1, \ldots)\) as a "sentence" and uses a GPT-style Transformer to learn the patterns in this sequence, then generates good decisions via conditional generation.

Input Representation

One of DT's key innovations is its input design. Rather than using the reward \(r_t\) directly, it uses the return-to-go \(\hat{R}_t\):

\[ \hat{R}_t = \sum_{t'=t}^{T} r_{t'} \]

The input sequence is organized as a sequence of triplets:

\[ \tau = (\hat{R}_1, s_1, a_1, \hat{R}_2, s_2, a_2, \ldots, \hat{R}_T, s_T, a_T) \]

Each element in the triplet is embedded into the same dimension using separate linear layers, then fed into a GPT architecture.

Input sequence (3 tokens per timestep):

+------+    +------+    +------+    +------+    +------+    +------+
| R̂_1  |    |  s_1 |    |  a_1 |    | R̂_2  |    |  s_2 |    |  a_2 |
+------+    +------+    +------+    +------+    +------+    +------+
   |            |            |          |            |            |
   v            v            v          v            v            v
+------+    +------+    +------+    +------+    +------+    +------+
| Embed|    | Embed|    | Embed|    | Embed|    | Embed|    | Embed|
+------+    +------+    +------+    +------+    +------+    +------+
   |            |            |          |            |            |
   +------+-----+------+-----+----+-----+------+-----+------+----+
                                  |
                          +-------v--------+
                          |   GPT (Causal  |
                          |  Transformer)  |
                          +-------+--------+
                                  |
                     +------+-----+-----+------+
                     |      |           |      |
                     v      v           v      v
                  (pred_a1)(...)     (pred_a2)(...)
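
Computing the return-to-go targets for a trajectory is just a reverse cumulative sum over its rewards; a minimal sketch (undiscounted, as in the formula above):

```python
import numpy as np

def returns_to_go(rewards):
    """R_hat_t = sum_{t'=t}^{T} r_{t'} for every timestep of one trajectory."""
    return np.cumsum(rewards[::-1])[::-1]

print(returns_to_go(np.array([1.0, 0.0, 2.0, 1.0])))   # [4. 3. 3. 1.]
```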

Inference (Test Time)

At test time, DT is used in an elegant manner:

  1. Set a desired return-to-go \(\hat{R}_1\) (e.g., the highest trajectory return in the dataset)
  2. Observe the current state \(s_1\)
  3. Use the Transformer to autoregressively generate action \(a_1\)
  4. Execute \(a_1\), receive reward \(r_1\), transition to \(s_2\)
  5. Update \(\hat{R}_2 = \hat{R}_1 - r_1\)
  6. Repeat

Intuition: This is like telling the model "I want a total return of 100," and the model generates "if you want 100 points, here is the action you should take in the current state." At its core, this is conditional generation.
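
The rollout loop described above can be sketched as follows; `model`, `env`, and the context-window handling are hypothetical placeholders for an actual Decision Transformer implementation.

```python
def dt_rollout(model, env, target_return, max_steps=1000, context_len=20):
    """Return-conditioned autoregressive rollout (model/env interfaces are hypothetical)."""
    state = env.reset()
    rtg, states, actions = [target_return], [state], []
    total_reward = 0.0
    for _ in range(max_steps):
        # Predict the next action from the recent window of (R_hat, s, a) tokens.
        action = model.predict_action(rtg[-context_len:],
                                      states[-context_len:],
                                      actions[-context_len:])
        state, reward, done = env.step(action)
        actions.append(action)
        states.append(state)
        rtg.append(rtg[-1] - reward)      # R_hat_{t+1} = R_hat_t - r_t
        total_reward += reward
        if done:
            break
    return total_reward
```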

Advantages of DT

  1. Simplicity: No TD learning, no Q-function, no Bellman equation — just a standard Transformer with cross-entropy or mean squared error loss.
  2. Long-horizon credit assignment: The Transformer's self-attention mechanism naturally models long-range dependencies, without needing to propagate information step-by-step through bootstrapping as in TD learning.
  3. Conditional control: By varying the return-to-go, one can flexibly control how conservative or aggressive the policy is.

Limitations of DT

1. Inability to Perform Trajectory Stitching

This is DT's most fundamental limitation. Consider a simple scenario with two trajectories in the dataset:

  • Trajectory A: Performs well from start to midpoint, poorly from midpoint to end.
  • Trajectory B: Performs poorly from start to midpoint, well from midpoint to end.

An ideal policy should be able to "stitch" the strengths of both trajectories: learn the first half from A and the second half from B. Traditional TD-based methods (e.g., CQL, IQL) can achieve this because Bellman equations naturally support cross-trajectory information propagation. But DT merely imitates trajectories in the dataset and cannot automatically perform such stitching.

2. Dependence on Dataset Quality

DT is essentially an advanced form of behavioral cloning. If the dataset contains no high-return trajectories, DT cannot generate good actions even with a high return-to-go target.

3. Limitations of Return-to-Go

In stochastic environments, the same action sequence may lead to different returns. Conditioning the policy on a scalar return-to-go may be insufficient.

Comparison of Offline RL Methods

| Dimension | CQL | IQL | TD3+BC | Decision Transformer |
| --- | --- | --- | --- | --- |
| Core idea | Penalize Q-values for OOD actions | Avoid querying OOD actions | Add BC regularizer | Sequence modeling + conditional generation |
| Q-function | Conservative Q (lower bound) | Standard Q + expectile V | Standard twin Q | No Q-function |
| Policy extraction | Max from conservative Q | AWR weighted cloning | Direct Q+BC optimization | Transformer autoregression |
| Uses Bellman equation | Yes | Yes | Yes | No |
| Trajectory stitching | Supported | Supported | Supported | Not supported |
| Hyperparameter sensitivity | High (\(\alpha\)) | Medium (\(\tau\)) | Low (\(\alpha\) fixed at 2.5) | Low |
| Implementation complexity | High | Medium | Low | Medium (requires Transformer) |
| Computational cost | High | Medium | Low | High (large model) |
| Suitable scenarios | General-purpose | General-purpose | High-quality datasets | Long-sequence decision-making |

Selection guidelines:

  • Simplicity and baselines: TD3+BC
  • Stability and generality: IQL
  • Theoretical guarantees: CQL
  • Long-sequence datasets of high quality: Decision Transformer

Offline-to-Online Fine-tuning

Policies learned through pure Offline RL are limited by the quality and coverage of the dataset. A natural question arises: Can we first use Offline RL to learn a decent initial policy from the dataset, and then further improve it through a small amount of online interaction?

This is the research direction of Offline-to-Online (O2O) fine-tuning.

Why Pure Offline Is Not Enough

  1. Limited dataset coverage: The behavior policy may not have visited certain critical state-action pairs, causing the learned policy to perform poorly in those regions.
  2. Excessive conservatism: Methods like CQL may be overly conservative for safety, potentially missing better policies.
  3. Inability to adapt to environmental changes: If there are discrepancies between the deployment environment and the data collection environment, a purely offline policy may fail.

Core Challenges

The main challenge in O2O fine-tuning is: when switching from offline to online training, policy performance may first drop sharply before slowly recovering. This is known as "initial performance collapse."

The reason is that the conservative Q-values learned during offline training are rapidly "corrected" (typically adjusted upward) in the first few steps of online training, and this sudden change in Q-values causes dramatic policy oscillations.

Representative Methods

Cal-QL (Calibrated CQL): Calibrates CQL's Q-values so that the Q-values learned during the offline phase are not excessively conservative, thereby reducing performance collapse at the beginning of online fine-tuning.

IQL-to-Online: IQL is naturally suited for O2O fine-tuning because its training process does not depend on interaction with the current policy. In the online phase, one simply continues training by adding newly collected data to the replay buffer.

Balanced Replay: During the online phase, training uses a mixture of offline data and newly collected online data to prevent catastrophic forgetting.
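
A simple form of balanced replay just draws a fixed fraction of every training batch from the offline dataset and the rest from the online buffer; the 50/50 split and the helper below are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample(data, n):
    """Uniformly sample n transitions from a dict of parallel arrays."""
    idx = rng.integers(len(data["rewards"]), size=n)
    return {k: v[idx] for k, v in data.items()}

def sample_mixed_batch(offline_data, online_data, batch_size=256, offline_frac=0.5):
    """Draw a fixed fraction of each training batch from the static offline dataset
    and the rest from newly collected online transitions, to limit forgetting."""
    n_off = int(batch_size * offline_frac)
    off = sample(offline_data, n_off)
    on = sample(online_data, batch_size - n_off)
    return {k: np.concatenate([off[k], on[k]]) for k in off}
```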

Offline-to-Online pipeline:

Phase 1: Offline Training                Phase 2: Online Fine-tuning
+----------------------------+          +----------------------------+
|                            |          |                            |
|  Static dataset D          |          |  Environment interaction   |
|       |                    |          |       |                    |
|       v                    |          |       v                    |
|  Offline RL algorithm      |  ------> |  Online RL algorithm       |
|  (CQL/IQL/TD3+BC)          |initialize|  + Replay Buffer           |
|       |                    |          |  (offline + online data)   |
|       v                    |          |       |                    |
|  Initial policy π_0        |          |       v                    |
|                            |          |  Improved policy π*        |
+----------------------------+          +----------------------------+

Summary and Outlook

Offline RL is a critical step toward bringing reinforcement learning into practical applications. It enables us to leverage vast amounts of existing data to learn decision-making policies without incurring the enormous costs of online trial-and-error.

The key methodological differences lie in how distribution shift is handled:

  • Conservative value methods (CQL, IQL) constrain the Q-function to avoid excessive optimism about OOD actions.
  • Policy constraint methods (TD3+BC) use behavioral cloning regularization to limit policy deviation.
  • Sequence modeling methods (Decision Transformer) bypass the problem entirely by abandoning the Bellman framework.

The future trend is to combine Offline RL with large-scale pretraining and foundation models. Just as language models can learn general linguistic capabilities from massive text corpora, decision models may similarly learn general decision-making capabilities from massive interaction data. Decision Transformer has taken the first step by recasting RL as a sequence modeling problem, opening the possibility that Transformer scaling laws may be replicated in the decision-making domain.

