Multi-task Learning and Generalization

Overview

One of the primary limitations of current robot learning is insufficient generalization: policies trained in the lab can typically only complete specific tasks with specific objects in specific environments. Facing the infinite diversity of the real world — different object shapes, materials, layouts, lighting, task descriptions — single-task policies are far from sufficient.

This article discusses core methods and challenges for achieving multi-task capability and generalization in robot learning.


Dimensions of Generalization

Robot generalization must simultaneously handle variations across multiple dimensions:

| Generalization Dimension | Source of Variation | Difficulty |
|---|---|---|
| Intra-object generalization | Different instances of the same object class (e.g., different cups) | Medium |
| Inter-object generalization | Different object categories (cup vs. bowl) | High |
| Scene generalization | Different table layouts and backgrounds | Medium |
| Lighting generalization | Different light source directions and intensities | Low–Medium |
| Task generalization | Different skills (grasping vs. placing vs. pouring) | High |
| Instruction generalization | Different language expressions for the same task | Medium |
| Robot generalization | Different robot morphologies and sensors | Very High |

Combinatorial Explosion

With \(N_o\) object types, \(N_e\) environments, and \(N_t\) tasks, the number of combinations requiring generalization is \(O(N_o \times N_e \times N_t)\). Even with only 100 variations per dimension, the total combinations reach \(10^6\) — far exceeding the capability of manually collecting data for each case.


Multi-task Learning

Problem Formulation

The goal of multi-task learning is to train a single policy \(\pi_\theta\) to simultaneously accomplish \(M\) tasks \(\{\mathcal{T}_1, \ldots, \mathcal{T}_M\}\). Each task is defined by a reward function \(r_m\) or demonstration dataset \(\mathcal{D}_m\).

Objective Function:

\[ \min_\theta \sum_{m=1}^M w_m \mathbb{E}_{(o, a) \sim \mathcal{D}_m} \left[ \mathcal{L}(\pi_\theta(o, c_m), a) \right] \]

where \(c_m\) is the task condition (e.g., language instruction, one-hot encoding, goal image), and \(w_m\) is the task weight.
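
To make the objective concrete, here is a minimal PyTorch sketch of one step over this weighted sum. The `policy` callable, per-task batches, and weights are placeholder assumptions, not a specific published implementation:

```python
import torch
import torch.nn.functional as F

def multitask_bc_loss(policy, batches, task_conds, task_weights):
    """Weighted behavior-cloning loss over M tasks.

    batches[m]     : (obs, action) tensors sampled from D_m
    task_conds[m]  : conditioning vector c_m (e.g., a language embedding)
    task_weights[m]: scalar w_m
    """
    total = 0.0
    for m, (obs, act) in enumerate(batches):
        pred = policy(obs, task_conds[m])  # pi_theta(o, c_m)
        total = total + task_weights[m] * F.mse_loss(pred, act)
    return total
```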

Task Conditioning Methods

Language conditioning (most common):

\[ \pi_\theta(a | o, l), \quad l = \text{``pick up the red cup''} \]

Language instructions are encoded into vectors \(e_l \in \mathbb{R}^d\) by pretrained text encoders (e.g., CLIP's text encoder, T5).

Goal image conditioning:

\[ \pi_\theta(a | o_t, o_{\text{goal}}) \]

A goal image of the completed task is provided, which avoids the ambiguity of language.

Task embedding conditioning:

\[ \pi_\theta(a | o, z_m), \quad z_m \in \mathbb{R}^d \]

A trainable embedding vector is learned for each task.
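
All three conditioning schemes reduce to feeding the policy one extra vector alongside the observation. A minimal PyTorch sketch with illustrative names (`ConditionedPolicy` and all sizes are assumptions for exposition):

```python
import torch
import torch.nn as nn

class ConditionedPolicy(nn.Module):
    """Toy policy that accepts any conditioning vector: a language
    embedding e_l, an encoded goal image, or a learned task embedding z_m."""

    def __init__(self, obs_dim, cond_dim, act_dim, num_tasks):
        super().__init__()
        self.task_embed = nn.Embedding(num_tasks, cond_dim)  # learned z_m
        self.net = nn.Sequential(
            nn.Linear(obs_dim + cond_dim, 256), nn.ReLU(),
            nn.Linear(256, act_dim),
        )

    def forward(self, obs, cond=None, task_id=None):
        if cond is None:  # fall back to task-embedding conditioning
            cond = self.task_embed(task_id)
        return self.net(torch.cat([obs, cond], dim=-1))
```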

Shared Representations

The core challenge of multi-task learning is negative transfer: different tasks may require conflicting features, and sharing parameters can actually degrade performance on each individual task.

Shared encoder + task-specific heads:

\[ h = f_{\text{shared}}(o), \quad a_m = g_m(h, c_m) \]

Progressive networks: Add a new column for each new task, reusing existing knowledge through lateral connections.

Soft Modular Networks: Learn module selection weights \(w_{m,i}\) for each task:

\[ h = \sum_{i=1}^K w_{m,i} \cdot \text{Module}_i(o) \]
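
A hedged sketch of one soft modular layer in PyTorch; the module count, sizes, and the softmax over per-task mixing logits are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SoftModularLayer(nn.Module):
    """Computes h = sum_i w_{m,i} * Module_i(o) with per-task mixing weights."""

    def __init__(self, obs_dim, hidden_dim, num_modules, num_tasks):
        super().__init__()
        # Trailing underscore avoids shadowing nn.Module.modules().
        self.modules_ = nn.ModuleList(
            nn.Sequential(nn.Linear(obs_dim, hidden_dim), nn.ReLU())
            for _ in range(num_modules)
        )
        # One row of mixing logits per task; softmax yields w_{m,i}.
        self.mix_logits = nn.Parameter(torch.zeros(num_tasks, num_modules))

    def forward(self, obs, task_id):
        w = torch.softmax(self.mix_logits[task_id], dim=-1)          # (K,)
        outs = torch.stack([m(obs) for m in self.modules_], dim=-1)  # (B, H, K)
        return (outs * w).sum(dim=-1)                                # (B, H)
```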

Task Balancing

Different tasks vary greatly in difficulty and data volume; uniform sampling leads to overfitting on easy tasks and underfitting on hard ones.

Dynamic weight adjustment (Uncertainty Weighting, Kendall et al., 2018):

\[ \mathcal{L}_{\text{total}} = \sum_m \frac{1}{2\sigma_m^2} \mathcal{L}_m + \log \sigma_m \]

where \(\sigma_m\) is a learnable task uncertainty parameter. Tasks with higher uncertainty automatically receive lower weights.
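
A minimal PyTorch sketch of uncertainty weighting; following common practice we parameterize \(\log \sigma_m^2\) for numerical stability (an implementation choice, not part of the formula above):

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Kendall et al. (2018): L = sum_m L_m / (2 sigma_m^2) + log sigma_m."""

    def __init__(self, num_tasks):
        super().__init__()
        self.log_var = nn.Parameter(torch.zeros(num_tasks))  # log sigma_m^2

    def forward(self, task_losses):
        losses = torch.stack(task_losses)
        precision = torch.exp(-self.log_var)       # 1 / sigma_m^2
        # Note log sigma_m = 0.5 * log sigma_m^2.
        return (0.5 * precision * losses + 0.5 * self.log_var).sum()
```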


Few-Shot Adaptation

Problem Setting

Given a pretrained policy \(\pi_\theta\) and a small number of demonstrations for a new task \(\mathcal{D}_{\text{new}} = \{(o_i, a_i)\}_{i=1}^K\) (\(K = 1\text{-}10\)), rapidly adapt to the new task.

Fine-tuning Methods

Full parameter fine-tuning:

\[ \theta_{\text{new}} = \theta - \eta \nabla_\theta \mathcal{L}(\pi_\theta; \mathcal{D}_{\text{new}}) \]

Risk: Easy to overfit when \(K\) is small.

LoRA adaptation: Only train low-rank incremental matrices:

\[ W_{\text{new}} = W_0 + \Delta W, \quad \Delta W = BA, \quad B \in \mathbb{R}^{d \times r}, A \in \mathbb{R}^{r \times d}, r \ll d \]

For a \(d \times d\) weight matrix, the trainable parameter count drops from \(d^2\) to \(2rd\), i.e., to \(\frac{2r}{d}\) of the original, effectively preventing overfitting.
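
A minimal LoRA wrapper around a frozen `nn.Linear` in PyTorch; the rank, the \(\alpha/r\) scaling, and the initialization (random \(A\), zero \(B\), so \(\Delta W = 0\) at the start) follow common practice but are assumptions here:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen pretrained W_0 plus a trainable low-rank update BA (r << d)."""

    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                     # freeze W_0
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # A in R^{r x d}
        self.B = nn.Parameter(torch.zeros(d_out, r))         # B in R^{d x r}
        self.scale = alpha / r

    def forward(self, x):
        # W_new x = W_0 x + scale * B A x
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```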

Adapter fine-tuning: Small trainable modules are inserted into the frozen pretrained network, and only these adapters are updated.

Meta-Learning

MAML (Model-Agnostic Meta-Learning): Learn a good initialization \(\theta_0\) such that a few gradient descent steps from this initialization can adapt to a new task.

Outer-loop optimization:

\[ \theta_0^* = \arg\min_{\theta_0} \sum_{m=1}^M \mathcal{L}_m \left( \theta_0 - \alpha \nabla_{\theta_0} \mathcal{L}_m^{\text{support}}(\theta_0) \right) \]

Inner-loop adaptation (for new task \(m\)):

\[ \theta_m = \theta_0^* - \alpha \nabla_{\theta_0^*} \mathcal{L}_m^{\text{support}}(\theta_0^*) \]

where the support set consists of a few demonstrations, and the query set is used to evaluate adaptation performance.
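
A compact sketch of one MAML meta-update in PyTorch (assuming PyTorch ≥ 2.0 for `torch.func.functional_call`); the single inner step and the task tuple structure are simplifying assumptions:

```python
import torch
from torch.func import functional_call

def maml_outer_step(model, tasks, alpha=0.01,
                    loss_fn=torch.nn.functional.mse_loss):
    """One meta-update. Each task is ((o_s, a_s), (o_q, a_q)):
    support set for the inner step, query set for the outer loss."""
    names = [n for n, _ in model.named_parameters()]
    theta0 = dict(model.named_parameters())
    meta_loss = 0.0
    for (o_s, a_s), (o_q, a_q) in tasks:
        # Inner loop: one gradient step on the support set.
        support_loss = loss_fn(functional_call(model, theta0, (o_s,)), a_s)
        grads = torch.autograd.grad(support_loss, list(theta0.values()),
                                    create_graph=True)  # keep 2nd-order terms
        theta_m = {n: theta0[n] - alpha * g for n, g in zip(names, grads)}
        # Outer loss: evaluate adapted parameters on the query set.
        meta_loss = meta_loss + loss_fn(
            functional_call(model, theta_m, (o_q,)), a_q)
    return meta_loss  # backprop through this to update theta_0
```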

In-Context Learning

Inspired by in-context learning in LLMs, a few demonstrations are directly fed as context to the policy network without gradient updates:

\[ a_t = \pi_\theta(o_t | \text{demo}_1, \text{demo}_2, \ldots, \text{demo}_K) \]

Representative work: In-Context Robot Transformer (Zitkovich et al., 2023).
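
A toy sketch of such a policy: demonstrations are tokenized and prepended as context, and a transformer predicts the current action from the final position, with no gradient updates at adaptation time. All names and sizes are illustrative:

```python
import torch
import torch.nn as nn

class InContextPolicy(nn.Module):
    """Transformer policy conditioned on K demonstrations as prompt tokens."""

    def __init__(self, token_dim, act_dim, nhead=4, nlayers=4):
        super().__init__()
        layer = nn.TransformerEncoderLayer(token_dim, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, nlayers)
        self.head = nn.Linear(token_dim, act_dim)

    def forward(self, demo_tokens, obs_token):
        # demo_tokens: (B, K*T, D) tokenized demos; obs_token: (B, 1, D)
        seq = torch.cat([demo_tokens, obs_token], dim=1)
        h = self.encoder(seq)
        return self.head(h[:, -1])  # predict a_t from the final position
```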


Zero-Shot Transfer

Foundation Model-Driven Zero-Shot Capabilities

Foundation models pretrained on massive data can transfer directly to new tasks without any target-task data.

Vision-Language-Action Models (VLA):

VLA models (e.g., RT-2, OpenVLA) achieve a degree of zero-shot generalization through joint training on large-scale robot data and internet data:

  • Novel objects (object categories never seen during training)
  • Novel instructions (language expressions never seen during training)
  • Some degree of novel skill composition

Semantic transfer pathway:

```mermaid
graph LR
    A[Internet Knowledge<br/>Billions of Image-Text Data] --> B[Visual-Semantic Understanding<br/>Object Recognition/Spatial Relations/Common Sense]
    B --> C[VLA Model<br/>RT-2 / OpenVLA]
    D[Robot Data<br/>Millions of Trajectories] --> C
    C --> E[Zero-shot Execution<br/>Novel Objects/Novel Instructions]

    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style D fill:#e1f5fe
    style E fill:#fce4ec
```

Language-Guided Skill Composition

Through the planning capabilities of LLMs, novel tasks are decomposed into combinations of existing skills:

  1. Task description: "Tidy up the table" → LLM decomposes into:

    • Find all items
    • Place each into a storage box
    • Wipe the table surface
  2. Skill library matching: Map sub-tasks to trained atomic skills

  3. Sequential execution: Call each atomic policy in order

Representative work: SayCan (Ahn et al., 2022) combines the semantic knowledge of LLMs with robot affordance evaluation:

\[ \pi_{\text{skill}}^* = \arg\max_{\text{skill}} \underbrace{P_{\text{LLM}}(\text{skill} | \text{context})}_{\text{semantic relevance}} \cdot \underbrace{P_{\text{affordance}}(\text{skill} | o_t)}_{\text{physical feasibility}} \]
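
A minimal sketch of this selection rule, assuming hypothetical `llm_logprob` and `affordance` scoring callables (scoring in log space, which is equivalent to maximizing the product above):

```python
import math

def saycan_select(skills, llm_logprob, affordance, context, obs):
    """Pick the skill maximizing P_LLM(skill|context) * P_afford(skill|obs).

    llm_logprob(skill, context) -> log-probability under the LLM (assumed)
    affordance(skill, obs)      -> success probability in (0, 1]  (assumed)
    """
    def score(skill):
        # log of the product = sum of the logs
        return llm_logprob(skill, context) + math.log(affordance(skill, obs))

    return max(skills, key=score)  # argmax over the skill library
```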

Benchmarks

SIMPLER

SIMPLER (Li et al., 2024) is a simulation-based benchmark for robot policy evaluation:

  • Design goal: Provide simulation evaluation highly correlated with real-world performance
  • Tasks: Manipulation tasks on Google Robot and WidowX platforms
  • Evaluation protocol: Standardized success rate and generalization metrics

LIBERO

LIBERO (Liu et al., 2023) is a multi-task robot learning benchmark:

  • 130 tasks: Spanning 5 task suites
  • Generalization tests:
    • LIBERO-Spatial: Spatial relation generalization
    • LIBERO-Object: Object generalization
    • LIBERO-Goal: Goal generalization
    • LIBERO-Long: Long-horizon tasks

Typical results (success rate, %):

| Method | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long |
|---|---|---|---|---|
| BC (ResNet) | 78.5 | 82.1 | 68.3 | 42.1 |
| BC (ViT) | 81.2 | 85.4 | 73.1 | 48.7 |
| Diffusion Policy | 86.3 | 89.1 | 80.2 | 56.4 |
| ACT | 84.1 | 87.2 | 77.8 | 52.3 |

RLBench

RLBench (James et al., 2020) is a large-scale simulation benchmark:

  • 100+ tasks: Ranging from simple to complex
  • Multimodal inputs: RGB, depth, point clouds
  • Language conditioning: Each task is paired with a natural language description

Open Problems

Long-Horizon Tasks

Current methods perform well on short-horizon tasks (10–50 steps), but long-horizon tasks (100+ steps) remain challenging:

Error accumulation: In a \(T\)-step task, if the per-step success rate is \(p\), the overall success rate is \(p^T\). When \(p = 0.99, T = 100\), the success rate is only \(0.99^{100} \approx 0.37\).

Hierarchical methods:

\[ \text{High-level policy: } g_t = \pi_{\text{high}}(o_t, l) \quad \text{(sub-goal)} \]

\[ \text{Low-level policy: } a_t = \pi_{\text{low}}(o_t, g_t) \quad \text{(atomic action)} \]

The high-level policy runs at a lower frequency, while the low-level policy executes atomic skills at high frequency.
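
A toy rollout loop illustrating the two control frequencies, assuming a gym-style environment and hypothetical `pi_high` / `pi_low` callables:

```python
def hierarchical_rollout(env, pi_high, pi_low, instruction,
                         horizon=500, k=25):
    """pi_high re-plans a sub-goal every k steps (low frequency);
    pi_low outputs an atomic action at every step (high frequency)."""
    obs = env.reset()
    goal = None
    for t in range(horizon):
        if t % k == 0:                       # low-frequency sub-goal update
            goal = pi_high(obs, instruction)
        action = pi_low(obs, goal)           # high-frequency atomic control
        obs, reward, done, info = env.step(action)
        if done:
            break
    return obs
```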

Deformable Objects

Manipulating ropes, cloth, liquids, and other deformable objects is a major challenge:

  • State space: The state dimension of deformable objects is theoretically infinite
  • Physical modeling: Deformable body simulation is computationally expensive and imprecise
  • Representation learning: How to compactly represent the state of deformable objects

Tool Use

Tool use requires understanding the functional affordance of tools, including:

  • Tool selection: Choosing the appropriate tool for the task
  • Grasp planning: Grasping the tool in a functionally appropriate manner
  • Skill transfer: Transferring experience from one tool to another

This requires deep causal understanding of the physical world, far beyond pattern matching.

Generalization Under Safety Constraints

Generalized policies have uncertain behavior in new environments, making safety guarantees more difficult:

  • Uncertainty estimation: When is the policy "uncertain"?
  • Safe fallback: Trigger a safety stop when anomalies are detected
  • Human-robot collaboration: Request human assistance in uncertain situations

Standardized Evaluation Framework

To fairly compare the generalization capabilities of different methods, the community is establishing standardized evaluation:

| Evaluation Dimension | Metric | Measurement Method |
|---|---|---|
| Training efficiency | Data needed to reach threshold performance | Data scaling curves |
| In-distribution performance | Success rate under training conditions | Standard test set |
| OOD generalization | Success rate under out-of-distribution conditions | Systematic environment variations |
| Adaptation speed | Number of samples needed for few-shot adaptation | K-shot curves |
| Transfer ratio | Real-to-sim success rate ratio | Sim2Real evaluation |

Connections to Other Chapters

  • Foundation models: Models and Algorithms — VLA and world models are key to achieving zero-shot generalization
  • Imitation learning: Imitation Learning — BC and diffusion policies are the underlying algorithms for multi-task learning
  • Diffusion policy: Diffusion Policy — its multimodal modeling capability is particularly important for multi-task learning
  • Data collection: Teleoperation and Data Collection — data scaling directly affects generalization capability

References

  1. Reed, S., et al. (2022). A Generalist Agent. TMLR (Gato).
  2. Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. CoRL.
  3. Liu, B., et al. (2023). LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. NeurIPS.
  4. James, S., et al. (2020). RLBench: The Robot Learning Benchmark & Learning Environment. IEEE RA-L.
  5. Ahn, M., et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. CoRL.
  6. Li, X., et al. (2024). Evaluating Real-World Robot Manipulation Policies in Simulation. CoRL (SIMPLER).
  7. Finn, C., Abbeel, P., & Levine, S. (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML.
  8. Kendall, A., Gal, Y., & Cipolla, R. (2018). Multi-Task Learning Using Uncertainty to Weigh Losses for Scene Geometry and Semantics. CVPR.
