
Large Model-Driven Robots

Large Language Models (LLMs) and Vision-Language Models (VLMs) possess rich commonsense knowledge and reasoning capabilities, but they do not directly understand the constraints of the physical world. How can the "brain" of LLMs/VLMs be connected to the "body" of robots? This article systematically reviews the core methods for LLM/VLM-driven robotics.

Related notes: Introduction to Robot Foundation Models | VLA Models | Multi-task Learning and Generalization


1. Problem Definition: The Grounding Problem

The core challenge of LLM-driven robotics is the Grounding Problem:

\[\text{Language Space} \xrightarrow{?} \text{Physical World}\]

An LLM can generate instructions like "open the refrigerator door," but how do we ensure:

  1. Physical feasibility: Can the robot execute this action in its current state?
  2. Semantic accuracy: Does the LLM's understanding of "open" match the robot's action primitives?
  3. Environment awareness: Does the LLM know where the refrigerator is? The position of the door handle?

2. SayCan: Feasibility-Constrained Language Planning

2.1 Core Idea

SayCan (Ahn et al., Google, 2022) is a foundational work in LLM-driven robotics. Core formula:

\[c^* = \arg\max_{c_i \in \mathcal{C}} \underbrace{p_{\text{LLM}}(c_i | l, h)}_{\text{task relevance}} \cdot \underbrace{V^{c_i}(s)}_{\text{affordance score}}\]

where:

  • \(\mathcal{C}\): Predefined skill library (e.g., "pick up X", "go to Y", "place X on Y")
  • \(l\): User's natural language instruction
  • \(h\): Dialogue history (sequence of completed actions)
  • \(V^{c_i}(s)\): Value function of skill \(c_i\) in current state \(s\), serving as the affordance score
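To make the scoring concrete, here is a minimal sketch of the argmax above (the llm_log_prob and value_fn callables are hypothetical stand-ins for the PaLM scoring call and the pretrained value functions; this is not the authors' code):

import math

def saycan_select(skills, instruction, history, llm_log_prob, value_fn, state):
    """Pick the skill maximizing p_LLM(skill | instruction, history) * V_skill(state)."""
    best_skill, best_score = None, -math.inf
    for skill in skills:
        relevance = math.exp(llm_log_prob(skill, instruction, history))  # task relevance
        affordance = value_fn(skill, state)                              # affordance score
        score = relevance * affordance
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill

The product form means a skill is only chosen if it is both useful for the instruction (high LLM probability) and currently executable (high value).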

2.2 Architecture Flow

graph TB
    USER[User Instruction<br/>"I spilled my drink, can you help?"] --> LLM[LLM<br/>PaLM 540B]

    LLM --> SCORE["Score each skill<br/>p(skill | instruction, history)"]

    ENV[Environment State<br/>Robot Observations] --> AFF["Affordance Evaluation<br/>Value function for each skill"]

    SCORE --> MUL["Joint Scoring<br/>p(skill) × V(skill)"]
    AFF --> MUL

    MUL --> SELECT["Select Best Skill<br/>argmax"]
    SELECT --> EXEC["Execute Skill<br/>Low-level RL/BC Policy"]
    EXEC --> FB["Execution Result<br/>Success/Failure"]
    FB --> LLM

    style LLM fill:#e3f2fd
    style AFF fill:#e8f5e9
    style MUL fill:#fff3e0

2.3 Key Details

Skill library design: SayCan predefines 551 short-horizon skills, each implemented by a pretrained low-level policy (RL or behavior cloning). Skills are described in natural language:

"pick up the red bull can"
"go to the counter"
"place the sponge in the sink"
"open the top drawer"

Affordance evaluation: The affordance of each skill is estimated by training a value function:

\[V^{c_i}(s) \approx \mathbb{E}\left[\sum_{t=0}^T \gamma^t r_t \mid s_0 = s, \pi = \pi_{c_i}\right]\]

When the robot is far from the target object or the target is not in view, \(V^{c_i}(s)\) is naturally low.

Limitations:

  • Depends on a predefined skill library; cannot handle tasks outside the library
  • Each skill requires training a separate RL policy
  • Cannot handle tasks requiring creative solutions

3. Code as Policies: LLM-Generated Code

3.1 Core Idea

Code as Policies (Liang et al., 2023) has the LLM directly generate Python code to control the robot, rather than selecting predefined skills:

\[\text{LLM}(\text{instruction}, \text{API docs}, \text{examples}) \rightarrow \text{Python code}\]

3.2 Example

User instruction: "Line up all the blue blocks in a row"

LLM-generated code:

# Get positions of all blue blocks
blue_blocks = detect_objects("blue block")

# Sort by x-coordinate
blue_blocks.sort(key=lambda b: b.position[0])

# Set target arrangement positions
start_x = 0.3
spacing = 0.08

for i, block in enumerate(blue_blocks):
    target_pos = [start_x + i * spacing, 0.5, block.position[2]]
    pick_and_place(block.id, target_pos)

3.3 API Design

The key is designing appropriate robot APIs for the LLM to call:

| API Level | Example | Description |
|---|---|---|
| High-level semantic | pick_and_place(obj, target) | Abstract operations |
| Mid-level motion | move_to(position, orientation) | Motion planning |
| Low-level control | set_joint_angles(angles) | Joint control |
| Perception | detect_objects(description) | Visual detection |
| Spatial reasoning | get_position(obj) | Position queries |
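As an illustration, a minimal sketch of what such a layered API surface might look like when exposed to the LLM (all names and signatures here are assumptions for illustration, not the paper's actual interface):

from dataclasses import dataclass

@dataclass
class DetectedObject:
    id: str
    position: tuple  # (x, y, z) in meters, robot base frame

def detect_objects(description: str) -> list[DetectedObject]:
    """Perception: return objects matching a natural-language description."""
    raise NotImplementedError  # backed by an open-vocabulary detector in practice

def move_to(position, orientation) -> None:
    """Mid-level motion: plan and execute a collision-free end-effector motion."""
    raise NotImplementedError  # backed by a motion planner in practice

def pick_and_place(obj_id: str, target_position) -> None:
    """High-level semantic skill composed from perception + motion primitives."""
    raise NotImplementedError

The LLM only ever sees these signatures and docstrings in its prompt; the implementations are bound to the actual perception and control stack at runtime.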

Advantages:

  • Strong compositionality: Code naturally supports loops, conditionals, function calls
  • Interpretable: Generated code can be manually reviewed
  • Flexible: Not constrained by a predefined skill library

Disadvantages:

  • LLM may generate syntactically correct but semantically wrong code
  • Requires careful API interface design
  • Error handling and exception recovery are difficult

4. ProgPrompt: Programmatic Prompting

ProgPrompt (Singh et al., 2023) combines structured programs with natural language prompts:

Core innovation: the prompt provides:

  1. List of available objects in the environment
  2. Function signatures of available actions
  3. Assertion functions (for checking preconditions)

Example prompt structure:

# Available objects: mug, plate, table, sink, sponge
# Available actions: find(obj), pick_up(obj), place(obj, location), 
#                    open(obj), close(obj)
# Assertions: is_holding(obj), is_at(location), is_open(obj)

def clean_the_mug():
    find("mug")
    pick_up("mug")
    find("sink")
    place("mug", "sink")
    # Check precondition
    assert is_at("sink")
    open("faucet")
    ...
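The assertion primitives can also drive recovery rather than just failing. A sketch in the style of the prompt above, using only the actions and assertions listed there (this pattern is illustrative, not code from the paper):

def place_in_sink(obj):
    pick_up(obj)
    # Precondition check: we must actually be holding the object before moving on
    if not is_holding(obj):
        find(obj)       # recover: re-locate the object
        pick_up(obj)    # and retry the grasp
    find("sink")
    place(obj, "sink")
    assert is_at("sink")  # verify the navigation succeeded before the next step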

Differences from Code as Policies:

  • ProgPrompt places more emphasis on precondition checking and assertions
  • Provides systematic prompt engineering templates
  • Better suited for tasks requiring strict execution order

5. VoxPoser: 3D Spatial Reasoning

5.1 Core Idea

VoxPoser (Huang et al., 2023) maps the language reasoning capabilities of LLMs into 3D voxel space:

\[\text{LLM} + \text{VLM} \rightarrow \text{3D Affordance/Constraint Maps} \rightarrow \text{Motion Planning}\]

5.2 Workflow

  1. Language understanding: LLM parses the instruction and generates code describing target regions and constraints
  2. Visual localization: VLM (e.g., OWL-ViT) localizes objects in the scene
  3. Voxel map generation:
     • Affordance Map \(\mathbf{V}_{\text{aff}} \in \mathbb{R}^{X \times Y \times Z}\): Target regions score high
     • Constraint Map \(\mathbf{V}_{\text{con}} \in \mathbb{R}^{X \times Y \times Z}\): Obstacles/forbidden zones score high
  4. Motion planning: Run MPC on the voxel maps to generate collision-free trajectories

Combined score:

\[\mathbf{V}_{\text{total}}(x, y, z) = \mathbf{V}_{\text{aff}}(x, y, z) - \lambda \mathbf{V}_{\text{con}}(x, y, z)\]
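A minimal numpy sketch of combining the two maps and picking the highest-value voxel (the grid resolution, regions, and λ are arbitrary; the actual system runs an MPC planner over the maps rather than a single argmax):

import numpy as np

X, Y, Z = 64, 64, 32            # voxel grid resolution (arbitrary for illustration)
V_aff = np.zeros((X, Y, Z))     # affordance map: high near target regions
V_con = np.zeros((X, Y, Z))     # constraint map: high near obstacles / forbidden zones
lam = 1.0                       # constraint weight λ

# Made-up target and obstacle regions, purely for illustration
V_aff[40:48, 30:38, 5:10] = 1.0
V_con[20:30, 20:30, 0:15] = 1.0

V_total = V_aff - lam * V_con
best_voxel = np.unravel_index(np.argmax(V_total), V_total.shape)
print("best target voxel:", best_voxel)  # the planner would generate a trajectory toward it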

Advantages:

  • Requires no robot training data
  • Can handle complex spatial constraints ("place the cup between the plate and the vase")
  • Combines language reasoning with 3D spatial reasoning

6. Inner Monologue: Closed-Loop Feedback

6.1 Core Idea

Inner Monologue (Huang et al., Google, 2022) introduces perceptual feedback loops into LLM planning:

Previous methods (like SayCan) were open-loop — the LLM generates a plan and executes it directly without considering unexpected events during execution. Inner Monologue lets the LLM continuously receive environmental feedback and dynamically adjust plans.

6.2 Feedback Types

| Feedback Type | Source | Example |
|---|---|---|
| Success/failure detection | Visual classifier | "pick up succeeded" |
| Scene description | Image captioning model | "I see a red cup on the table" |
| Object detection | Object detection model | "Objects: cup(0.3, 0.5), plate(0.6, 0.4)" |
| Human feedback | Human operator | "No, I meant the other cup" |

6.3 Closed-Loop Execution Flow

graph TB
    INST[User Instruction] --> LLM[LLM Planner]
    LLM --> PLAN[Sub-task Plan]
    PLAN --> EXEC[Execute Current Sub-task]
    EXEC --> SENSE[Perception System]
    SENSE --> FB["Feedback Summary<br/>1. Success/Failure<br/>2. Scene Description<br/>3. Object Detection"]
    FB --> LLM
    LLM --> |"Adjust plan based on feedback"| PLAN

    EXEC --> |"Task complete"| DONE[Done]

    style LLM fill:#e3f2fd
    style SENSE fill:#e8f5e9
    style FB fill:#fff3e0

Key difference from SayCan:

  • SayCan: LLM selects skill → execute → select next skill (no feedback)
  • Inner Monologue: LLM selects skill → execute → collect environment feedback → LLM adjusts based on feedback (with feedback)
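A minimal sketch of this closed-loop pattern (llm_plan, execute, and get_feedback are hypothetical placeholders for the planner prompt, the low-level skill executor, and the perception models):

def inner_monologue_loop(instruction, llm_plan, execute, get_feedback, max_steps=20):
    history = [f"Human: {instruction}"]
    for _ in range(max_steps):
        action = llm_plan(history)      # LLM proposes the next sub-task given the dialogue so far
        if action == "done":
            break
        success = execute(action)       # run the low-level skill
        feedback = get_feedback()       # success detector, scene description, detected objects, ...
        history.append(f"Robot: {action}")
        history.append(f"Success: {success}. Scene: {feedback}")
    return history

The key design choice is that feedback is converted into text and appended to the prompt, so the LLM can replan using the same interface it uses for planning.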

7. Statler: State Maintenance

Statler (Yoneda et al., 2023) introduces explicit state representation into LLM planning:

Problem: LLMs have limited context windows and may "forget" the environment state during long-sequence tasks.

Solution: Maintain a structured world state representation:

World State:
- Robot Location: kitchen counter
- Holding: nothing
- Objects:
  - red_cup: on table, upright, empty
  - blue_plate: on counter, clean
  - sponge: in sink, wet

Two LLM modules:

  1. World-state writer LLM: Updates the world state based on execution results
  2. World-state reader LLM: Decides the next action based on the current world state

This separation allows the LLM to more reliably track environmental changes during long-horizon tasks.
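A minimal sketch of the writer/reader split (prompt contents and helper names are illustrative assumptions, not the paper's prompts):

def statler_step(world_state: str, instruction: str, llm, execute):
    # Reader: decide the next action from the current structured state
    action = llm(f"World state:\n{world_state}\nInstruction: {instruction}\nNext action:")
    result = execute(action)
    # Writer: rewrite the world state to reflect what the action changed
    world_state = llm(f"World state:\n{world_state}\nAction: {action}\nResult: {result}\n"
                      "Updated world state:")
    return world_state, action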


8. Method Comparison Summary

| Method | Year | Action Output | Closed-Loop | Requires Training | 3D Reasoning |
|---|---|---|---|---|---|
| SayCan | 2022 | Select predefined skills | Partial | Each skill needs training | No |
| Code as Policies | 2023 | Generate Python code | No | No | Limited |
| ProgPrompt | 2023 | Generate programs + assertions | Partial (assertions) | No | No |
| VoxPoser | 2023 | 3D voxel value maps → MPC | No | No | Yes |
| Inner Monologue | 2022 | Select skills + feedback | Yes | Low-level skills need training | No |
| Statler | 2023 | Select skills + state | Yes | No | No |

Key Trade-offs

Flexibility vs. Reliability:

  • Code as Policies is most flexible (can generate arbitrary code) but least reliable (code may have bugs)
  • SayCan is most reliable (each skill is trained) but least flexible (constrained by skill library)

Zero-shot vs. Training Required:

  • VoxPoser, Code as Policies, ProgPrompt, and Statler are all zero-shot methods, requiring no robot training data
  • SayCan and Inner Monologue require pretrained low-level skills

9. Limitations and Future Directions

9.1 Current Limitations

  1. Inference latency: LLM inference typically takes hundreds of milliseconds to seconds, unsuitable for real-time control
  2. Hallucination problem: LLMs may generate plans that seem reasonable but are physically infeasible
  3. Weak spatial reasoning: Pure text LLMs have limited spatial reasoning capabilities
  4. Error accumulation: Errors compound during long-sequence tasks
  5. Safety: LLM-generated code may lead to dangerous actions
9.2 Future Directions

  1. VLM replacing LLM: Using Vision-Language Models (e.g., GPT-4V) to directly understand scenes from images, reducing information loss from verbal descriptions
  2. Fusion with VLA: High-level LLM planning, low-level VLA execution (e.g., pi0.5 architecture)
  3. Active perception: LLMs not only passively receive feedback but actively guide robots to collect information ("first look at what's on the table")
  4. Multi-agent collaboration: Multiple LLM agents coordinating control of multiple robots
  5. Continual learning: Improving LLM planning capabilities from execution experience

References:

  • Ahn et al., "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances" (SayCan), CoRL 2022
  • Liang et al., "Code as Policies: Language Model Programs for Embodied Control", ICRA 2023
  • Singh et al., "ProgPrompt: Generating Situated Robot Task Plans using Large Language Models", ICRA 2023
  • Huang et al., "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", CoRL 2023
  • Huang et al., "Inner Monologue: Embodied Reasoning through Planning with Language Models", CoRL 2022
  • Yoneda et al., "Statler: State-Maintaining Language Models for Embodied Reasoning", CoRL 2023
