
Large Model-Driven Robots

Large Language Models (LLMs) and Vision-Language Models (VLMs) possess rich commonsense knowledge and reasoning capabilities, but they do not directly understand the constraints of the physical world. How can the "brain" of LLMs/VLMs be connected to the "body" of robots? This article systematically reviews the core methods for LLM/VLM-driven robotics.

Related notes: Introduction to Robot Foundation Models | VLA Models | Multi-task Learning and Generalization


1. Problem Definition: The Grounding Problem

The core challenge of LLM-driven robotics is the Grounding Problem:

\[\text{Language Space} \xrightarrow{?} \text{Physical World}\]

An LLM can generate instructions like "open the refrigerator door," but how do we ensure:

  1. Physical feasibility: Can the robot execute this action in its current state?
  2. Semantic accuracy: Does the LLM's understanding of "open" match the robot's action primitives?
  3. Environment awareness: Does the LLM know where the refrigerator is? The position of the door handle?

2. SayCan: Feasibility-Constrained Language Planning

2.1 Core Idea

SayCan (Ahn et al., Google, 2022) is a foundational work in LLM-driven robotics. Core formula:

\[c^* = \arg\max_{c_i \in \mathcal{C}} \underbrace{p_{\text{LLM}}(c_i | l, h)}_{\text{task relevance}} \cdot \underbrace{V^{c_i}(s)}_{\text{affordance score}}\]

where:

  • \(\mathcal{C}\): Predefined skill library (e.g., "pick up X", "go to Y", "place X on Y")
  • \(l\): User's natural language instruction
  • \(h\): Dialogue history (sequence of completed actions)
  • \(V^{c_i}(s)\): Value function of skill \(c_i\) in current state \(s\), serving as the affordance score
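To make the scoring concrete, here is a minimal sketch of the argmax above (the llm_log_prob and value_fn callables are hypothetical stand-ins for the PaLM scoring call and the pretrained value functions; this is not the authors' code):

import math

def saycan_select(skills, instruction, history, llm_log_prob, value_fn, state):
    """Pick the skill maximizing p_LLM(skill | instruction, history) * V_skill(state)."""
    best_skill, best_score = None, -math.inf
    for skill in skills:
        relevance = math.exp(llm_log_prob(skill, instruction, history))  # task relevance
        affordance = value_fn(skill, state)                              # affordance score
        score = relevance * affordance
        if score > best_score:
            best_skill, best_score = skill, score
    return best_skill

The product form means a skill is only chosen if it is both useful for the instruction (high LLM probability) and currently executable (high value).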

2.2 Architecture Flow

graph TB
    USER[User Instruction<br/>"I spilled my drink, can you help?"] --> LLM[LLM<br/>PaLM 540B]

    LLM --> SCORE["Score each skill<br/>p(skill | instruction, history)"]

    ENV[Environment State<br/>Robot Observations] --> AFF["Affordance Evaluation<br/>Value function for each skill"]

    SCORE --> MUL["Joint Scoring<br/>p(skill) × V(skill)"]
    AFF --> MUL

    MUL --> SELECT["Select Best Skill<br/>argmax"]
    SELECT --> EXEC["Execute Skill<br/>Low-level RL/BC Policy"]
    EXEC --> FB["Execution Result<br/>Success/Failure"]
    FB --> LLM

    style LLM fill:#e3f2fd
    style AFF fill:#e8f5e9
    style MUL fill:#fff3e0

2.3 Key Details

Skill library design: SayCan predefines 551 short-horizon skills, each implemented by a pretrained low-level policy (RL or behavior cloning). Skills are described in natural language:

"pick up the red bull can"
"go to the counter"
"place the sponge in the sink"
"open the top drawer"

Affordance evaluation: The affordance of each skill is estimated by training a value function:

\[V^{c_i}(s) \approx \mathbb{E}\left[\sum_{t=0}^T \gamma^t r_t \mid s_0 = s, \pi = \pi_{c_i}\right]\]

When the robot is far from the target object or the target is not in view, \(V^{c_i}(s)\) is naturally low.

Limitations:

  • Depends on a predefined skill library; cannot handle tasks outside the library
  • Each skill requires training a separate RL policy
  • Cannot handle tasks requiring creative solutions

3. Code as Policies: LLM-Generated Code

3.1 Core Idea

Code as Policies (Liang et al., 2023) has the LLM directly generate Python code to control the robot, rather than selecting predefined skills:

\[\text{LLM}(\text{instruction}, \text{API docs}, \text{examples}) \rightarrow \text{Python code}\]

3.2 Example

User instruction: "Line up all the blue blocks in a row"

LLM-generated code:

# Get positions of all blue blocks
blue_blocks = detect_objects("blue block")

# Sort by x-coordinate
blue_blocks.sort(key=lambda b: b.position[0])

# Set target arrangement positions
start_x = 0.3
spacing = 0.08

for i, block in enumerate(blue_blocks):
    target_pos = [start_x + i * spacing, 0.5, block.position[2]]
    pick_and_place(block.id, target_pos)

3.3 API Design

The key is designing appropriate robot APIs for the LLM to call:

| API Level | Example | Description |
|---|---|---|
| High-level semantic | pick_and_place(obj, target) | Abstract operations |
| Mid-level motion | move_to(position, orientation) | Motion planning |
| Low-level control | set_joint_angles(angles) | Joint control |
| Perception | detect_objects(description) | Visual detection |
| Spatial reasoning | get_position(obj) | Position queries |
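As an illustration, a minimal sketch of what such a layered API surface might look like when exposed to the LLM (all names and signatures here are assumptions for illustration, not the paper's actual interface):

from dataclasses import dataclass

@dataclass
class DetectedObject:
    id: str
    position: tuple  # (x, y, z) in meters, robot base frame

def detect_objects(description: str) -> list[DetectedObject]:
    """Perception: return objects matching a natural-language description."""
    raise NotImplementedError  # backed by an open-vocabulary detector in practice

def move_to(position, orientation) -> None:
    """Mid-level motion: plan and execute a collision-free end-effector motion."""
    raise NotImplementedError  # backed by a motion planner in practice

def pick_and_place(obj_id: str, target_position) -> None:
    """High-level semantic skill composed from perception + motion primitives."""
    raise NotImplementedError

The LLM only ever sees these signatures and docstrings in its prompt; the implementations are bound to the actual perception and control stack at runtime.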

Advantages:

  • Strong compositionality: Code naturally supports loops, conditionals, function calls
  • Interpretable: Generated code can be manually reviewed
  • Flexible: Not constrained by a predefined skill library

Disadvantages:

  • LLM may generate syntactically correct but semantically wrong code
  • Requires careful API interface design
  • Error handling and exception recovery are difficult

4. ProgPrompt: Programmatic Prompting

ProgPrompt (Singh et al., 2023) combines structured programs with natural language prompts:

Core innovation: the prompt provides:

  1. List of available objects in the environment
  2. Function signatures of available actions
  3. Assertion functions (for checking preconditions)

Example prompt structure:

# Available objects: mug, plate, table, sink, sponge
# Available actions: find(obj), pick_up(obj), place(obj, location), 
#                    open(obj), close(obj)
# Assertions: is_holding(obj), is_at(location), is_open(obj)

def clean_the_mug():
    find("mug")
    pick_up("mug")
    find("sink")
    place("mug", "sink")
    # Check precondition
    assert is_at("sink")
    open("faucet")
    ...
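The assertion primitives can also drive recovery rather than just failing. A sketch in the style of the prompt above, using only the actions and assertions listed there (this pattern is illustrative, not code from the paper):

def place_in_sink(obj):
    pick_up(obj)
    # Precondition check: we must actually be holding the object before moving on
    if not is_holding(obj):
        find(obj)       # recover: re-locate the object
        pick_up(obj)    # and retry the grasp
    find("sink")
    place(obj, "sink")
    assert is_at("sink")  # verify the navigation succeeded before the next step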

Differences from Code as Policies:

  • ProgPrompt places more emphasis on precondition checking and assertions
  • Provides systematic prompt engineering templates
  • Better suited for tasks requiring strict execution order

5. VoxPoser: 3D Spatial Reasoning

5.1 Core Idea

VoxPoser (Huang et al., 2023) maps the language reasoning capabilities of LLMs into 3D voxel space:

\[\text{LLM} + \text{VLM} \rightarrow \text{3D Affordance/Constraint Maps} \rightarrow \text{Motion Planning}\]

5.2 Workflow

  1. Language understanding: LLM parses the instruction and generates code describing target regions and constraints
  2. Visual localization: VLM (e.g., OWL-ViT) localizes objects in the scene
  3. Voxel map generation:
     • Affordance Map \(\mathbf{V}_{\text{aff}} \in \mathbb{R}^{X \times Y \times Z}\): Target regions score high
     • Constraint Map \(\mathbf{V}_{\text{con}} \in \mathbb{R}^{X \times Y \times Z}\): Obstacles/forbidden zones score high
  4. Motion planning: Run MPC on the voxel maps to generate collision-free trajectories

Combined score:

\[\mathbf{V}_{\text{total}}(x, y, z) = \mathbf{V}_{\text{aff}}(x, y, z) - \lambda \mathbf{V}_{\text{con}}(x, y, z)\]
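A minimal numpy sketch of combining the two maps and picking the highest-value voxel (the grid resolution, regions, and λ are arbitrary; the actual system runs an MPC planner over the maps rather than a single argmax):

import numpy as np

X, Y, Z = 64, 64, 32            # voxel grid resolution (arbitrary for illustration)
V_aff = np.zeros((X, Y, Z))     # affordance map: high near target regions
V_con = np.zeros((X, Y, Z))     # constraint map: high near obstacles / forbidden zones
lam = 1.0                       # constraint weight λ

# Made-up target and obstacle regions, purely for illustration
V_aff[40:48, 30:38, 5:10] = 1.0
V_con[20:30, 20:30, 0:15] = 1.0

V_total = V_aff - lam * V_con
best_voxel = np.unravel_index(np.argmax(V_total), V_total.shape)
print("best target voxel:", best_voxel)  # the planner would generate a trajectory toward it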

Advantages:

  • Requires no robot training data
  • Can handle complex spatial constraints ("place the cup between the plate and the vase")
  • Combines language reasoning with 3D spatial reasoning

6. Inner Monologue: Closed-Loop Feedback

6.1 Core Idea

Inner Monologue (Huang et al., Google, 2022) introduces perceptual feedback loops into LLM planning:

Previous methods (like SayCan) were open-loop — the LLM generates a plan and executes it directly without considering unexpected events during execution. Inner Monologue lets the LLM continuously receive environmental feedback and dynamically adjust plans.

6.2 Feedback Types

| Feedback Type | Source | Example |
|---|---|---|
| Success/failure detection | Visual classifier | "pick up succeeded" |
| Scene description | Image captioning model | "I see a red cup on the table" |
| Object detection | Object detection model | "Objects: cup(0.3, 0.5), plate(0.6, 0.4)" |
| Human feedback | Human operator | "No, I meant the other cup" |

6.3 Closed-Loop Execution Flow

graph TB
    INST[User Instruction] --> LLM[LLM Planner]
    LLM --> PLAN[Sub-task Plan]
    PLAN --> EXEC[Execute Current Sub-task]
    EXEC --> SENSE[Perception System]
    SENSE --> FB["Feedback Summary<br/>1. Success/Failure<br/>2. Scene Description<br/>3. Object Detection"]
    FB --> LLM
    LLM --> |"Adjust plan based on feedback"| PLAN

    EXEC --> |"Task complete"| DONE[Done]

    style LLM fill:#e3f2fd
    style SENSE fill:#e8f5e9
    style FB fill:#fff3e0

Key difference from SayCan:

  • SayCan: LLM selects skill → execute → select next skill (no feedback)
  • Inner Monologue: LLM selects skill → execute → collect environment feedback → LLM adjusts based on feedback (with feedback)
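A minimal sketch of this closed-loop pattern (llm_plan, execute, and get_feedback are hypothetical placeholders for the planner prompt, the low-level skill executor, and the perception models):

def inner_monologue_loop(instruction, llm_plan, execute, get_feedback, max_steps=20):
    history = [f"Human: {instruction}"]
    for _ in range(max_steps):
        action = llm_plan(history)      # LLM proposes the next sub-task given the dialogue so far
        if action == "done":
            break
        success = execute(action)       # run the low-level skill
        feedback = get_feedback()       # success detector, scene description, detected objects, ...
        history.append(f"Robot: {action}")
        history.append(f"Success: {success}. Scene: {feedback}")
    return history

The key design choice is that feedback is converted into text and appended to the prompt, so the LLM can replan using the same interface it uses for planning.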

7. Statler: State Maintenance

Statler (Yoneda et al., 2023) introduces explicit state representation into LLM planning:

Problem: LLMs have limited context windows and may "forget" the environment state during long-sequence tasks.

Solution: Maintain a structured world state representation:

World State:
- Robot Location: kitchen counter
- Holding: nothing
- Objects:
  - red_cup: on table, upright, empty
  - blue_plate: on counter, clean
  - sponge: in sink, wet

Two LLM modules:

  1. World-state writer LLM: Updates the world state based on execution results
  2. World-state reader LLM: Decides the next action based on the current world state

This separation allows the LLM to more reliably track environmental changes during long-horizon tasks.
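A minimal sketch of the writer/reader split (prompt contents and helper names are illustrative assumptions, not the paper's prompts):

def statler_step(world_state: str, instruction: str, llm, execute):
    # Reader: decide the next action from the current structured state
    action = llm(f"World state:\n{world_state}\nInstruction: {instruction}\nNext action:")
    result = execute(action)
    # Writer: rewrite the world state to reflect what the action changed
    world_state = llm(f"World state:\n{world_state}\nAction: {action}\nResult: {result}\n"
                      "Updated world state:")
    return world_state, action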


8. Method Comparison Summary

| Method | Year | Action Output | Closed-Loop | Requires Training | 3D Reasoning |
|---|---|---|---|---|---|
| SayCan | 2022 | Select predefined skills | Partial | Each skill needs training | No |
| Code as Policies | 2023 | Generate Python code | No | No | Limited |
| ProgPrompt | 2023 | Generate programs + assertions | Partial (assertions) | No | No |
| VoxPoser | 2023 | 3D voxel value maps → MPC | No | No | Yes |
| Inner Monologue | 2022 | Select skills + feedback | Yes | Low-level skills need training | No |
| Statler | 2023 | Select skills + state | Yes | No | No |

Key Trade-offs

Flexibility vs. Reliability:

  • Code as Policies is most flexible (can generate arbitrary code) but least reliable (code may have bugs)
  • SayCan is most reliable (each skill is trained) but least flexible (constrained by skill library)

Zero-shot vs. Training Required:

  • VoxPoser, Code as Policies, ProgPrompt, and Statler are all zero-shot methods, requiring no robot training data
  • SayCan and Inner Monologue require pretrained low-level skills

9. Limitations and Future Directions

9.1 Current Limitations

  1. Inference latency: LLM inference typically takes hundreds of milliseconds to seconds, unsuitable for real-time control
  2. Hallucination problem: LLMs may generate plans that seem reasonable but are physically infeasible
  3. Weak spatial reasoning: Pure text LLMs have limited spatial reasoning capabilities
  4. Error accumulation: Errors compound during long-sequence tasks
  5. Safety: LLM-generated code may lead to dangerous actions
9.2 Future Directions

  1. VLM replacing LLM: Using Vision-Language Models (e.g., GPT-4V) to directly understand scenes from images, reducing information loss from verbal descriptions
  2. Fusion with VLA: High-level LLM planning, low-level VLA execution (e.g., pi0.5 architecture)
  3. Active perception: LLMs not only passively receive feedback but actively guide robots to collect information ("first look at what's on the table")
  4. Multi-agent collaboration: Multiple LLM agents coordinating control of multiple robots
  5. Continual learning: Improving LLM planning capabilities from execution experience

References:

  • Ahn et al., "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances" (SayCan), CoRL 2022
  • Liang et al., "Code as Policies: Language Model Programs for Embodied Control", ICRA 2023
  • Singh et al., "ProgPrompt: Generating Situated Robot Task Plans using Large Language Models", ICRA 2023
  • Huang et al., "VoxPoser: Composable 3D Value Maps for Robotic Manipulation with Language Models", CoRL 2023
  • Huang et al., "Inner Monologue: Embodied Reasoning through Planning with Language Models", CoRL 2022
  • Yoneda et al., "Statler: State-Maintaining Language Models for Embodied Reasoning", CoRL 2023
