Large Model-Driven Robots
Large Language Models (LLMs) and Vision-Language Models (VLMs) possess rich commonsense knowledge and reasoning capabilities, but they do not directly understand the constraints of the physical world. How can the "brain" of LLMs/VLMs be connected to the "body" of robots? This article systematically reviews the core methods for LLM/VLM-driven robotics.
Related notes: Introduction to Robot Foundation Models | VLA Models | Multi-task Learning and Generalization
1. Problem Definition: The Grounding Problem
The core challenge of LLM-driven robotics is the Grounding Problem:
An LLM can generate instructions like "open the refrigerator door," but how do we ensure:
- Physical feasibility: Can the robot execute this action in its current state?
- Semantic accuracy: Does the LLM's understanding of "open" match the robot's action primitives?
- Environment awareness: Does the LLM know where the refrigerator is? The position of the door handle?
2. SayCan: Feasibility-Constrained Language Planning
2.1 Core Idea
SayCan (Ahn et al., Google, 2022) is a foundational work in LLM-driven robotics. The next skill to execute is chosen by jointly scoring how useful the LLM considers a skill ("say") and how likely the robot is to succeed at it in the current state ("can"):

\[
c^* = \arg\max_{c_i \in \mathcal{C}} \; p(c_i \mid l, h) \cdot V^{c_i}(s)
\]

where:
- \(\mathcal{C}\): Predefined skill library (e.g., "pick up X", "go to Y", "place X on Y")
- \(l\): User's natural language instruction
- \(h\): Dialogue history (sequence of completed actions)
- \(V^{c_i}(s)\): Value function of skill \(c_i\) in current state \(s\), serving as the affordance score
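As a concrete reading of the formula, here is a minimal scoring sketch; the skill strings and the `llm_log_prob` / `value_fn` callables are hypothetical placeholders standing in for the LLM scoring and the trained value functions, not SayCan's actual implementation:

import math

# Hypothetical skill library: natural-language descriptions of pretrained skills
SKILLS = ["pick up the sponge", "go to the counter", "place the sponge in the sink"]

def saycan_step(instruction, history, state, llm_log_prob, value_fn):
    """Select the skill maximizing p(skill | instruction, history) * V_skill(state).

    llm_log_prob(instruction, history, skill) -> log p(skill | instruction, history)
    value_fn(skill, state)                    -> affordance score in [0, 1]
    """
    scores = {}
    for skill in SKILLS:
        p_say = math.exp(llm_log_prob(instruction, history, skill))  # language relevance
        p_can = value_fn(skill, state)                               # physical feasibility
        scores[skill] = p_say * p_can
    return max(scores, key=scores.get)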
2.2 Architecture Flow
graph TB
USER["User Instruction<br/>'I spilled my drink, can you help?'"] --> LLM[LLM<br/>PaLM 540B]
LLM --> SCORE["Score each skill<br/>p(skill | instruction, history)"]
ENV[Environment State<br/>Robot Observations] --> AFF["Affordance Evaluation<br/>Value function for each skill"]
SCORE --> MUL["Joint Scoring<br/>p(skill) × V(skill)"]
AFF --> MUL
MUL --> SELECT["Select Best Skill<br/>argmax"]
SELECT --> EXEC["Execute Skill<br/>Low-level RL/BC Policy"]
EXEC --> FB["Execution Result<br/>Success/Failure"]
FB --> LLM
style LLM fill:#e3f2fd
style AFF fill:#e8f5e9
style MUL fill:#fff3e0
2.3 Key Details
Skill library design: SayCan predefines 551 short-horizon skills, each corresponding to a pretrained RL policy. Skills are described in natural language:
"pick up the red bull can"
"go to the counter"
"place the sponge in the sink"
"open the top drawer"
Affordance evaluation: the affordance of each skill is estimated by a value function trained alongside that skill's RL policy, which approximates the probability that the skill succeeds from the current state:

\[
V^{c_i}(s) \approx p(\text{skill } c_i \text{ succeeds} \mid s)
\]

When the robot is far from the target object, or the target is not in view, \(V^{c_i}(s)\) is naturally low.
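In SayCan these value functions come from RL training on each skill; purely as an illustration of the same quantity, here is a toy Monte Carlo estimate in which the hypothetical `rollout_succeeds` simulator stands in for the trained critic:

import random

def estimate_affordance(skill, state, rollout_succeeds, n_rollouts=100):
    """Monte Carlo estimate of V_skill(state), i.e. p(skill succeeds | state)."""
    successes = sum(rollout_succeeds(skill, state) for _ in range(n_rollouts))
    return successes / n_rollouts

# Toy stand-in: success becomes unlikely as the robot gets farther from the target
def rollout_succeeds(skill, state):
    distance = state["distance_to_target"]
    return random.random() < max(0.0, 1.0 - distance)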
Limitations:
- Depends on a predefined skill library; cannot handle tasks outside the library
- Each skill requires training a separate RL policy
- Cannot handle tasks requiring creative solutions
3. Code as Policies: LLM-Generated Code
3.1 Core Idea
Code as Policies (Liang et al., 2023) has the LLM directly generate Python code to control the robot, rather than selecting from a predefined skill library.
3.2 Example
User instruction: "Line up all the blue blocks in a row"
LLM-generated code:
# Get positions of all blue blocks
blue_blocks = detect_objects("blue block")
# Sort by x-coordinate
blue_blocks.sort(key=lambda b: b.position[0])
# Set target arrangement positions
start_x = 0.3
spacing = 0.08
for i, block in enumerate(blue_blocks):
    target_pos = [start_x + i * spacing, 0.5, block.position[2]]
    pick_and_place(block.id, target_pos)
3.3 API Design
The key is designing appropriate robot APIs for the LLM to call:
| API Level | Example | Description |
|---|---|---|
| High-level semantic | pick_and_place(obj, target) | Abstract operations |
| Mid-level motion | move_to(position, orientation) | Motion planning |
| Low-level control | set_joint_angles(angles) | Joint control |
| Perception | detect_objects(description) | Visual detection |
| Spatial reasoning | get_position(obj) | Position queries |
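One common way to expose such an API is to give the LLM a prompt preamble of stub signatures and docstrings. A minimal sketch, with illustrative function names rather than the actual Code as Policies API:

import inspect

def detect_objects(description: str) -> list:
    """Return detected objects matching a language description."""
    raise NotImplementedError  # backed by an open-vocabulary detector on the robot

def pick_and_place(obj_id: str, target_pos: list) -> bool:
    """Pick up an object and place it at a 3D target position."""
    raise NotImplementedError  # backed by a grasping / motion-planning stack

API = [detect_objects, pick_and_place]

def build_api_prompt(api):
    """Concatenate signatures and docstrings so the LLM knows what it may call."""
    lines = []
    for fn in api:
        lines.append(f"def {fn.__name__}{inspect.signature(fn)}:")
        lines.append(f'    """{fn.__doc__}"""')
    return "\n".join(lines)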
Advantages:
- Strong compositionality: Code naturally supports loops, conditionals, function calls
- Interpretable: Generated code can be manually reviewed
- Flexible: Not constrained by a predefined skill library
Disadvantages:
- LLM may generate syntactically correct but semantically wrong code
- Requires careful API interface design
- Error handling and exception recovery are difficult
4. ProgPrompt: Programmatic Prompting
ProgPrompt (Singh et al., 2023) combines structured, program-like prompts with natural language. The core innovation is to provide directly in the prompt:
- List of available objects in the environment
- Function signatures of available actions
- Assertion functions (for checking preconditions)
Example prompt structure:
# Available objects: mug, plate, table, sink, sponge
# Available actions: find(obj), pick_up(obj), place(obj, location),
# open(obj), close(obj)
# Assertions: is_holding(obj), is_at(location), is_open(obj)
def clean_the_mug():
    find("mug")
    pick_up("mug")
    find("sink")
    place("mug", "sink")
    # Check precondition
    assert is_at("sink")
    open("faucet")
    ...
Differences from Code as Policies:
- ProgPrompt places more emphasis on precondition checking and assertions
- Provides systematic prompt engineering templates
- Better suited for tasks requiring strict execution order
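As an illustration of the assertion-driven style (not ProgPrompt's exact DSL), here is a small sketch in which hypothetical `find` / `pick_up` / `is_holding` primitives are retried when a condition check fails:

def grab(obj, find, pick_up, is_holding, max_retries=2):
    """Pick up `obj`, re-localizing and retrying when the effect check fails."""
    for _ in range(max_retries + 1):
        pick_up(obj)
        if is_holding(obj):   # assertion-style check on the action's effect
            return True
        find(obj)             # recovery: re-localize the object before retrying
    return False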
5. VoxPoser: 3D Spatial Reasoning
5.1 Core Idea
VoxPoser (Huang et al., 2023) maps the language reasoning capabilities of LLMs into 3D voxel space in the form of composable value maps.
5.2 Workflow
- Language understanding: LLM parses the instruction and generates code describing target regions and constraints
- Visual localization: VLM (e.g., OWL-ViT) localizes objects in the scene
- Voxel map generation:
  - Affordance Map \(\mathbf{V}_{\text{aff}} \in \mathbb{R}^{X \times Y \times Z}\): target regions score high
  - Constraint Map \(\mathbf{V}_{\text{con}} \in \mathbb{R}^{X \times Y \times Z}\): obstacles/forbidden zones score high
- Motion planning: Run MPC on the voxel maps to generate collision-free trajectories
Combined score: the motion planner optimizes trajectories against a weighted combination of the two maps, of the form

\[
\mathbf{V}(x, y, z) = \mathbf{V}_{\text{aff}}(x, y, z) - \lambda \, \mathbf{V}_{\text{con}}(x, y, z)
\]

where \(\lambda\) trades off reaching the target region against avoiding constraint regions.
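A toy sketch of composing the two maps into one score to maximize; the grid size, Gaussian shaping, and \(\lambda\) below are illustrative choices, not VoxPoser's actual map construction:

import numpy as np

X = Y = Z = 50                       # voxel grid resolution (illustrative)
grid = np.stack(np.meshgrid(np.arange(X), np.arange(Y), np.arange(Z),
                            indexing="ij"), axis=-1)

def gaussian_bump(center, sigma=5.0):
    """Voxel map that peaks at `center` and decays with distance."""
    d2 = np.sum((grid - np.asarray(center)) ** 2, axis=-1)
    return np.exp(-d2 / (2 * sigma ** 2))

V_aff = gaussian_bump(center=[40, 25, 10])     # high near the target region
V_con = gaussian_bump(center=[25, 25, 10])     # high near an obstacle

lam = 2.0                                      # constraint weight lambda
V = V_aff - lam * V_con                        # combined score to maximize
best_voxel = np.unravel_index(np.argmax(V), V.shape)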
Advantages:
- Requires no robot training data
- Can handle complex spatial constraints ("place the cup between the plate and the vase")
- Combines language reasoning with 3D spatial reasoning
6. Inner Monologue: Closed-Loop Feedback
6.1 Core Idea
Inner Monologue (Huang et al., Google, 2022) introduces perceptual feedback loops into LLM planning:
Earlier methods such as SayCan are essentially open-loop: the LLM selects skills and the robot executes them without checking whether each step actually succeeded or whether the environment changed unexpectedly. Inner Monologue lets the LLM continuously receive environmental feedback and dynamically adjust its plan during execution.
6.2 Feedback Types
| Feedback Type | Source | Example |
|---|---|---|
| Success/failure detection | Visual classifier | "pick up succeeded" |
| Scene description | Image captioning model | "I see a red cup on the table" |
| Object detection | Object detection model | "Objects: cup(0.3, 0.5), plate(0.6, 0.4)" |
| Human feedback | Human operator | "No, I meant the other cup" |
6.3 Closed-Loop Execution Flow
graph TB
INST[User Instruction] --> LLM[LLM Planner]
LLM --> PLAN[Sub-task Plan]
PLAN --> EXEC[Execute Current Sub-task]
EXEC --> SENSE[Perception System]
SENSE --> FB["Feedback Summary<br/>1. Success/Failure<br/>2. Scene Description<br/>3. Object Detection"]
FB --> LLM
LLM --> |"Adjust plan based on feedback"| PLAN
EXEC --> |"Task complete"| DONE[Done]
style LLM fill:#e3f2fd
style SENSE fill:#e8f5e9
style FB fill:#fff3e0
Key difference from SayCan:
- SayCan: LLM selects skill → execute → select next skill (no environmental feedback, only the history of executed skills)
- Inner Monologue: LLM selects skill → execute → collect environment feedback → LLM adjusts based on feedback (with feedback)
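A minimal sketch of this closed loop, where the hypothetical `llm_plan`, `execute`, and `get_feedback` callables stand in for the planner, skill execution, and the perception-to-text summarizers:

def inner_monologue_loop(instruction, llm_plan, execute, get_feedback, max_steps=20):
    """Interleave LLM planning with textual environment feedback after every skill."""
    transcript = [f"Human: {instruction}"]
    for _ in range(max_steps):
        # Re-plan with the full transcript, which now includes feedback lines
        skill = llm_plan("\n".join(transcript))
        if skill == "done":
            break
        execute(skill)
        feedback = get_feedback()  # e.g. "pick up failed", a scene description, detections
        transcript += [f"Robot: {skill}", f"Scene: {feedback}"]
    return transcript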
7. Statler: State Maintenance
Statler (Yoneda et al., 2023) introduces explicit state representation into LLM planning:
Problem: LLMs have limited context windows and may "forget" the environment state during long-sequence tasks.
Solution: Maintain a structured world state representation:
World State:
- Robot Location: kitchen counter
- Holding: nothing
- Objects:
- red_cup: on table, upright, empty
- blue_plate: on counter, clean
- sponge: in sink, wet
Two LLM modules:
- World-state writer: updates the structured world state based on each action's execution result
- World-state reader: generates the next action based on the current world state
This separation allows the LLM to more reliably track environmental changes during long-horizon tasks.
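A sketch of this two-module pattern, assuming a JSON world state and hypothetical `policy_llm` / `state_llm` / `execute` callables (not Statler's exact prompts):

import json

def statler_step(instruction, world_state, policy_llm, state_llm, execute):
    """One planning step with an explicit, LLM-maintained world state.

    policy_llm plays the role of the world-state reader, state_llm the writer.
    """
    state_text = json.dumps(world_state, indent=2)

    # Reader: decide the next action from the current state, not the full history
    action = policy_llm(f"State:\n{state_text}\nInstruction: {instruction}\nNext action:")

    # Writer: fold the execution result back into a fresh world state
    result = execute(action)
    updated = state_llm(
        f"State:\n{state_text}\nAction: {action}\nResult: {result}\nUpdated state (JSON):"
    )
    return action, json.loads(updated)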
8. Method Comparison Summary
| Method | Year | Action Output Method | Closed-Loop | Requires Training | 3D Reasoning |
|---|---|---|---|---|---|
| SayCan | 2022 | Select predefined skills | Partial | Each skill needs training | No |
| Code as Policies | 2023 | Generate Python code | No | No | Limited |
| ProgPrompt | 2023 | Generate programs + assertions | Partial (assertions) | No | No |
| VoxPoser | 2023 | 3D voxel value maps → MPC | No | No | Yes |
| Inner Monologue | 2022 | Select skills + feedback | Yes | Low-level skills need training | No |
| Statler | 2023 | Select skills + state | Yes | No | No |
Key Trade-offs
Flexibility vs. Reliability:
- Code as Policies is most flexible (can generate arbitrary code) but least reliable (code may have bugs)
- SayCan is most reliable (each skill is trained) but least flexible (constrained by skill library)
Zero-shot vs. Training Required:
- VoxPoser, Code as Policies, ProgPrompt, and Statler are all zero-shot methods, requiring no robot training data
- SayCan and Inner Monologue require pretrained low-level skills
9. Limitations and Future Directions
9.1 Current Limitations
- Inference latency: LLM inference typically takes hundreds of milliseconds to seconds, unsuitable for real-time control
- Hallucination problem: LLMs may generate plans that seem reasonable but are physically infeasible
- Weak spatial reasoning: Pure text LLMs have limited spatial reasoning capabilities
- Error accumulation: Errors compound during long-sequence tasks
- Safety: LLM-generated code may lead to dangerous actions
9.2 Development Trends
- VLM replacing LLM: Using Vision-Language Models (e.g., GPT-4V) to directly understand scenes from images, reducing information loss from verbal descriptions
- Fusion with VLA: High-level LLM planning, low-level VLA execution (e.g., pi0.5 architecture)
- Active perception: LLMs not only passively receive feedback but actively guide robots to collect information ("first look at what's on the table")
- Multi-agent collaboration: Multiple LLM agents coordinating control of multiple robots
- Continual learning: Improving LLM planning capabilities from execution experience
References:
- Ahn et al., "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances" (SayCan), CoRL 2022
- Liang et al., "Code as Policies: Language Model Programs for Embodied Control", ICRA 2023
- Singh et al., "ProgPrompt: Generating Situated Robot Task Plans using Large Language Models", ICRA 2023
- Huang et al., "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", CoRL 2023
- Huang et al., "Inner Monologue: Embodied Reasoning through Planning with Language Models", CoRL 2022
- Yoneda et al., "Statler: State-Maintaining Language Models for Embodied Reasoning", CoRL 2023