Multi-task Learning and Generalization
Overview
One of the primary limitations of current robot learning is insufficient generalization: policies trained in the lab typically complete only specific tasks, with specific objects, in specific environments. In the face of the real world's effectively unbounded variation (object shapes, materials, layouts, lighting, task descriptions), single-task policies fall far short.
This article discusses core methods and challenges for achieving multi-task capability and generalization in robot learning.
Dimensions of Generalization
Robot generalization must simultaneously handle variations across multiple dimensions:
| Generalization Dimension | Source of Variation | Difficulty |
|---|---|---|
| Intra-object generalization | Different instances of the same object class (different cups) | Medium |
| Inter-object generalization | Different object categories (cup vs. bowl) | High |
| Scene generalization | Different table layouts, backgrounds | Medium |
| Lighting generalization | Different light source directions and intensities | Low–Medium |
| Task generalization | Different skills (grasping vs. placing vs. pouring) | High |
| Instruction generalization | Different language expressions for the same task | Medium |
| Robot generalization | Different robot morphologies and sensors | Very High |
Combinatorial Explosion
With \(N_o\) object types, \(N_e\) environments, and \(N_t\) tasks, the number of combinations requiring generalization is \(O(N_o \times N_e \times N_t)\). Even with only 100 variations per dimension, the total reaches \(100^3 = 10^6\) combinations, far beyond what can be covered by manually collecting data for each case.
Multi-task Learning
Problem Formulation
The goal of multi-task learning is to train a single policy \(\pi_\theta\) to simultaneously accomplish \(M\) tasks \(\{\mathcal{T}_1, \ldots, \mathcal{T}_M\}\). Each task is defined by a reward function \(r_m\) or demonstration dataset \(\mathcal{D}_m\).
Objective Function:

\[
\min_\theta \; \sum_{m=1}^{M} w_m \, \mathbb{E}_{(o, a) \sim \mathcal{D}_m}\!\left[ -\log \pi_\theta(a \mid o, c_m) \right]
\]

where \(c_m\) is the task condition (e.g., language instruction, one-hot encoding, goal image) and \(w_m\) is the task weight.
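A minimal PyTorch-style sketch of this objective follows. The `policy(obs, cond)` interface is an assumed one, and the MSE term stands in for \(-\log \pi_\theta\) (its fixed-variance Gaussian special case); neither is a specific codebase's API:

```python
import torch.nn.functional as F

def multitask_bc_loss(policy, batches, weights):
    """Weighted multi-task behavior cloning loss.

    batches: dict mapping task id m -> (obs, action, cond) tensors
    weights: dict mapping task id m -> scalar weight w_m
    `policy(obs, cond)` returning predicted actions is an assumed interface.
    """
    loss = 0.0
    for m, (obs, action, cond) in batches.items():
        pred = policy(obs, cond)  # the task condition c_m enters the policy
        # MSE corresponds to -log pi under a fixed-variance Gaussian policy.
        loss = loss + weights[m] * F.mse_loss(pred, action)
    return loss
```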
Task Conditioning Methods
Language conditioning (most common):
Language instructions are encoded as vectors \(e_l \in \mathbb{R}^d\) through pretrained language models (e.g., CLIP, T5).
Goal image conditioning:
Providing a goal image of the completed task, avoiding language ambiguity.
Task embedding conditioning:
Learning trainable task embedding vectors.
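As an illustration of the language-conditioning path above, a frozen CLIP text encoder can supply \(e_l\). This is a sketch using the Hugging Face `transformers` CLIP classes; the checkpoint name is one example choice:

```python
import torch
from transformers import CLIPTokenizer, CLIPTextModel

# Frozen CLIP text encoder producing the instruction embedding e_l.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-base-patch32")
text_encoder = CLIPTextModel.from_pretrained("openai/clip-vit-base-patch32").eval()

@torch.no_grad()
def encode_instruction(instruction: str) -> torch.Tensor:
    tokens = tokenizer(instruction, return_tensors="pt", padding=True)
    return text_encoder(**tokens).pooler_output  # shape (1, 512) for this checkpoint

e_l = encode_instruction("pick up the red cup")
# e_l is then concatenated with (or cross-attended to) the visual features.
```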
Shared Representations
The core challenge of multi-task learning is negative transfer: different tasks may require conflicting features, and sharing parameters can actually degrade performance on each individual task.
Shared encoder + task-specific heads: a shared backbone extracts features common to all tasks, while each task gets its own lightweight output head (see the sketch below).
Progressive networks: Add a new column for each new task, reusing existing knowledge through lateral connections.
Soft Modular Networks: Learn module selection weights \(w_{m,i}\) for each task, mixing the outputs of parallel modules \(f_i\) layer by layer:

\[
h^{(l+1)} = \sum_i w_{m,i}^{(l)} \, f_i^{(l)}\!\left(h^{(l)}\right)
\]
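A minimal sketch of the shared-encoder architecture referenced above; the MLP structure and dimensions are illustrative placeholders:

```python
import torch
import torch.nn as nn

class MultiTaskPolicy(nn.Module):
    """Shared encoder with one action head per task (illustrative dims)."""
    def __init__(self, obs_dim: int, act_dim: int, num_tasks: int, hidden: int = 256):
        super().__init__()
        self.encoder = nn.Sequential(           # shared across all tasks
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.heads = nn.ModuleList(             # task-specific output heads
            [nn.Linear(hidden, act_dim) for _ in range(num_tasks)]
        )

    def forward(self, obs: torch.Tensor, task_id: int) -> torch.Tensor:
        return self.heads[task_id](self.encoder(obs))
```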
Task Balancing
Different tasks vary greatly in difficulty and data volume; uniform sampling leads to overfitting on easy tasks and underfitting on hard ones.
Dynamic weight adjustment (Uncertainty Weighting, Kendall et al., 2018):

\[
\mathcal{L}(\theta, \sigma) = \sum_{m=1}^{M} \frac{1}{2\sigma_m^2} \mathcal{L}_m(\theta) + \log \sigma_m
\]

where \(\sigma_m\) is a learnable task uncertainty parameter. Tasks with higher uncertainty automatically receive lower effective weights.
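A small PyTorch module implementing this weighting, assuming the per-task losses arrive as a tensor. Parameterizing \(\log \sigma_m\) rather than \(\sigma_m\) directly is a common numerical choice that keeps the weights positive:

```python
import torch
import torch.nn as nn

class UncertaintyWeighting(nn.Module):
    """Kendall et al. (2018) style loss balancing with learned log sigma_m."""
    def __init__(self, num_tasks: int):
        super().__init__()
        self.log_sigma = nn.Parameter(torch.zeros(num_tasks))  # log sigma_m, init 0

    def forward(self, task_losses: torch.Tensor) -> torch.Tensor:
        # sum_m [ 1/(2 sigma_m^2) * L_m + log sigma_m ]
        precision = torch.exp(-2.0 * self.log_sigma)
        return (0.5 * precision * task_losses + self.log_sigma).sum()
```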
Few-Shot Adaptation
Problem Setting
Given a pretrained policy \(\pi_\theta\) and a small number of demonstrations for a new task \(\mathcal{D}_{\text{new}} = \{(o_i, a_i)\}_{i=1}^K\) (\(K = 1\text{-}10\)), rapidly adapt to the new task.
Fine-tuning Methods
Full parameter fine-tuning: continue optimizing all weights on the new demonstrations:

\[
\theta' = \arg\min_{\theta} \; \sum_{i=1}^{K} -\log \pi_\theta(a_i \mid o_i)
\]

Risk: easy to overfit when \(K\) is small.
LoRA adaptation: Only train low-rank incremental matrices:

\[
W' = W_0 + \Delta W = W_0 + BA, \qquad B \in \mathbb{R}^{d \times r},\; A \in \mathbb{R}^{r \times d},\; r \ll d
\]

Only \(A\) and \(B\) are trained while \(W_0\) stays frozen; for a square \(d \times d\) weight, the trainable parameter count drops to \(\frac{2rd}{d^2} = \frac{2r}{d}\) of the original, effectively preventing overfitting.
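A sketch of a LoRA-wrapped linear layer under these definitions. The \(\alpha/r\) scaling follows the common LoRA convention, and zero-initializing \(B\) makes the adapted layer start out identical to the pretrained one:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update W0 + B @ A."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                # freeze pretrained W0
        d_out, d_in = base.weight.shape
        self.A = nn.Parameter(torch.randn(r, d_in) * 0.01)  # low-rank factor A
        self.B = nn.Parameter(torch.zeros(d_out, r))        # zero init: no-op at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.base(x) + self.scale * (x @ self.A.T @ self.B.T)
```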
Adapter fine-tuning: Inserting small trainable modules into the frozen pretrained network.
Meta-Learning
MAML (Model-Agnostic Meta-Learning): Learn a good initialization \(\theta_0\) such that a few gradient descent steps from this initialization can adapt to a new task.
Outer-loop optimization:

\[
\theta_0 \leftarrow \theta_0 - \beta \, \nabla_{\theta_0} \sum_{m} \mathcal{L}_m^{\text{query}}\!\left(\theta_m'\right)
\]

Inner-loop adaptation (for new task \(m\)):

\[
\theta_m' = \theta_0 - \alpha \, \nabla_{\theta_0} \mathcal{L}_m^{\text{support}}(\theta_0)
\]

where the support set consists of a few demonstrations, and the query set is used to evaluate adaptation performance.
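A first-order MAML (FOMAML) sketch, which drops the second-order term for brevity. The functional loss interface (`support_loss_fn` / `query_loss_fn` taking a parameter list) is an assumption made for illustration:

```python
import torch

def maml_step(policy, tasks, alpha=0.01, beta=0.001):
    """One first-order MAML meta-update.

    tasks: list of (support_loss_fn, query_loss_fn) pairs, each mapping a
    list of parameters to a scalar loss (illustrative interface).
    """
    theta0 = list(policy.parameters())
    meta_grads = [torch.zeros_like(p) for p in theta0]
    for support_loss_fn, query_loss_fn in tasks:
        # Inner loop: one gradient step on the support set.
        grads = torch.autograd.grad(support_loss_fn(theta0), theta0)
        theta_m = [p - alpha * g for p, g in zip(theta0, grads)]
        # Evaluate adapted parameters on the query set.
        q_grads = torch.autograd.grad(query_loss_fn(theta_m), theta_m)
        for mg, g in zip(meta_grads, q_grads):
            mg += g                               # first-order approximation
    with torch.no_grad():
        for p, mg in zip(theta0, meta_grads):
            p -= beta * mg / len(tasks)           # outer-loop update
    return policy
```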
In-Context Learning
Inspired by in-context learning in LLMs, a few demonstrations are fed directly to the policy network as context, with no gradient updates:

\[
a_t \sim \pi_\theta\!\left(\cdot \mid o_t,\; (o_1, a_1), \ldots, (o_K, a_K)\right)
\]
Representative work: In-Context Robot Transformer (ICRT; Fu et al., 2024).
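A sketch of how demonstration tokens can be prepended as context to a transformer policy. Tokenizing the \((o_i, a_i)\) pairs into a flat sequence is assumed to happen upstream, and all dimensions are placeholders:

```python
import torch
import torch.nn as nn

class InContextPolicy(nn.Module):
    """Transformer policy conditioned on demonstration tokens (sketch)."""
    def __init__(self, token_dim: int = 256, act_dim: int = 7):
        super().__init__()
        layer = nn.TransformerEncoderLayer(d_model=token_dim, nhead=8, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, num_layers=4)
        self.action_head = nn.Linear(token_dim, act_dim)

    def forward(self, demo_tokens: torch.Tensor, obs_token: torch.Tensor) -> torch.Tensor:
        # demo_tokens: (B, 2K, D) interleaved (o_i, a_i) pairs; obs_token: (B, 1, D)
        seq = torch.cat([demo_tokens, obs_token], dim=1)  # demos act as plain context
        out = self.backbone(seq)
        return self.action_head(out[:, -1])               # action for the current obs
```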
Zero-Shot Transfer
Foundation Model-Driven Zero-Shot Capabilities
Leveraging foundation models pretrained on massive data to achieve direct transfer without any target task data.
Vision-Language-Action Models (VLA):
VLA models (e.g., RT-2, OpenVLA) achieve a degree of zero-shot generalization through joint training on large-scale robot data and internet data:
- Novel objects (object categories never seen during training)
- Novel instructions (language expressions never seen during training)
- Some degree of novel skill composition
Semantic transfer pathway:
```mermaid
graph LR
    A[Internet Knowledge<br/>Billions of Image-Text Data] --> B[Visual-Semantic Understanding<br/>Object Recognition/Spatial Relations/Common Sense]
    B --> C[VLA Model<br/>RT-2 / OpenVLA]
    D[Robot Data<br/>Millions of Trajectories] --> C
    C --> E[Zero-shot Execution<br/>Novel Objects/Novel Instructions]
    style A fill:#e1f5fe
    style B fill:#fff3e0
    style C fill:#e8f5e9
    style D fill:#e1f5fe
    style E fill:#fce4ec
```
Language-Guided Skill Composition
Through the planning capabilities of LLMs, novel tasks are decomposed into combinations of existing skills:
1. Task decomposition: "Tidy up the table" → the LLM decomposes it into:
   - Find all items
   - Place each into a storage box
   - Wipe the table surface
2. Skill library matching: Map sub-tasks to trained atomic skills
3. Sequential execution: Call each atomic policy in order
Representative work: SayCan (Ahn et al., 2022) combines the semantic knowledge of LLMs with robot affordance evaluation, selecting the skill that maximizes the product of the two scores:

\[
\pi^* = \arg\max_{\pi \in \Pi} \; p_{\text{LLM}}(\pi \mid \text{instruction}) \cdot p_{\text{can}}(\text{success} \mid s, \pi)
\]
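A sketch of this selection rule. Here `llm_logprob` (scoring a skill description under the instruction) and `affordance` (a learned value estimate), along with `skill.description`, are assumed stand-ins for the paper's components, not a released API:

```python
import math

def saycan_select(instruction, state, skills, llm_logprob, affordance):
    """SayCan-style skill selection: maximize p_say * p_can (sketch)."""
    best_skill, best_score = None, -math.inf
    for skill in skills:
        # "Say": how useful the LLM deems this skill for the instruction.
        p_say = math.exp(llm_logprob(instruction, skill.description))
        # "Can": value-function estimate that the skill succeeds from `state`.
        p_can = affordance(state, skill)
        score = p_say * p_can
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill
```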
Benchmarks
SIMPLER
SIMPLER (Li et al., 2024) is a simulation-based benchmark for robot policy evaluation:
- Design goal: Provide simulation evaluation highly correlated with real-world performance
- Tasks: Manipulation tasks on Google Robot and WidowX platforms
- Evaluation protocol: Standardized success rate and generalization metrics
LIBERO
LIBERO (Liu et al., 2023) is a multi-task robot learning benchmark:
- 130 tasks: Spanning 5 task suites
- Generalization tests:
- LIBERO-Spatial: Spatial relation generalization
- LIBERO-Object: Object generalization
- LIBERO-Goal: Goal generalization
- LIBERO-Long: Long-horizon tasks
Typical results (success rate, %):
| Method | LIBERO-Spatial | LIBERO-Object | LIBERO-Goal | LIBERO-Long |
|---|---|---|---|---|
| BC (ResNet) | 78.5 | 82.1 | 68.3 | 42.1 |
| BC (ViT) | 81.2 | 85.4 | 73.1 | 48.7 |
| Diffusion Policy | 86.3 | 89.1 | 80.2 | 56.4 |
| ACT | 84.1 | 87.2 | 77.8 | 52.3 |
RLBench
RLBench (James et al., 2020) is a large-scale simulation benchmark:
- 100+ tasks: Ranging from simple to complex
- Multimodal inputs: RGB, depth, point clouds
- Language conditioning: Each task is paired with a natural language description
Open Problems
Long-Horizon Tasks
Current methods perform well on short-horizon tasks (10–50 steps), but long-horizon tasks (100+ steps) remain challenging:
Error accumulation: In a \(T\)-step task, if the per-step success rate is \(p\), the overall success rate is \(p^T\). When \(p = 0.99, T = 100\), the success rate is only \(0.99^{100} \approx 0.37\).
Hierarchical methods decompose the policy into a high-level policy that selects subgoals or skills and a low-level policy that executes them:

\[
g_t = \pi_{\text{high}}(o_t), \qquad a_t = \pi_{\text{low}}(o_t, g_t)
\]

The high-level policy runs at a lower frequency, while the low-level policy executes atomic skills at high frequency.
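A sketch of the resulting two-rate control loop; `pi_high`, `pi_low`, and the `env` interface are illustrative stand-ins:

```python
def hierarchical_control(pi_high, pi_low, env, horizon=1000, high_period=50):
    """Two-rate loop: pi_high re-plans every `high_period` steps,
    pi_low acts at every step (all interfaces are assumed)."""
    obs = env.reset()
    subgoal = None
    for t in range(horizon):
        if t % high_period == 0:
            subgoal = pi_high(obs)        # low-frequency subgoal selection
        action = pi_low(obs, subgoal)     # high-frequency atomic control
        obs, done = env.step(action)
        if done:
            break
    return obs
```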
Deformable Objects
Manipulating ropes, cloth, liquids, and other deformable objects is a major challenge:
- State space: The state dimension of deformable objects is theoretically infinite
- Physical modeling: Deformable body simulation is computationally expensive and imprecise
- Representation learning: How to compactly represent the state of deformable objects
Tool Use
Tool use requires understanding the functional affordance of tools, including:
- Tool selection: Choosing the appropriate tool for the task
- Grasp planning: Grasping the tool in a functionally appropriate manner
- Skill transfer: Transferring experience from one tool to another
This requires deep causal understanding of the physical world, far beyond pattern matching.
Generalization Under Safety Constraints
Generalized policies have uncertain behavior in new environments, making safety guarantees more difficult:
- Uncertainty estimation: When is the policy "uncertain"?
- Safe fallback: Trigger a safety stop when anomalies are detected
- Human-robot collaboration: Request human assistance in uncertain situations
Standardized Evaluation Framework
To fairly compare the generalization capabilities of different methods, the community is establishing standardized evaluation:
| Evaluation Dimension | Metric | Measurement Method |
|---|---|---|
| Training efficiency | Data needed to reach threshold performance | Data scaling curves |
| In-distribution performance | Success rate under training conditions | Standard test set |
| OOD generalization | Success rate under out-of-distribution conditions | Systematic environment variations |
| Adaptation speed | Number of samples needed for few-shot adaptation | K-shot curves |
| Transfer ratio | Real/sim success rate ratio | Sim2Real evaluation |
Connections to Other Chapters
- Foundation models: Models and Algorithms — VLA and world models are key to achieving zero-shot generalization
- Imitation learning: Imitation Learning — BC and diffusion policies are the underlying algorithms for multi-task learning
- Diffusion policy: Diffusion Policy — its multimodal modeling capability is particularly important for multi-task learning
- Data collection: Teleoperation and Data Collection — data scaling directly affects generalization capability
References
- Reed, S., et al. (2022). A Generalist Agent. TMLR (Gato).
- Brohan, A., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. CoRL.
- Liu, B., et al. (2023). LIBERO: Benchmarking Knowledge Transfer for Lifelong Robot Learning. NeurIPS.
- James, S., et al. (2020). RLBench: The Robot Learning Benchmark. IEEE RA-L.
- Ahn, M., et al. (2022). Do As I Can, Not As I Say: Grounding Language in Robotic Affordances. CoRL.
- Li, X., et al. (2024). Evaluating Real-World Robot Manipulation Policies in Simulation (SIMPLER). CoRL.
- Finn, C., Abbeel, P., & Levine, S. (2017). Model-Agnostic Meta-Learning for Fast Adaptation of Deep Networks. ICML.