
Perception-Action Loop

Overview

The Perception-Action Loop is the core computational framework of embodied intelligence. It describes how an agent continuously interacts with its environment through cycles of perception, decision-making, action, and feedback. This article traces the idea from its roots in ecological psychology through Gibson's affordance theory, active perception, and sensorimotor integration, to the distinction between open-loop and closed-loop control.


1. Definition of the Perception-Action Loop

1.1 Basic Framework

The perception-action loop is a continuous closed process:

flowchart TD
    A[Sense] --> B[Model]
    B --> C[Plan]
    C --> D[Act]
    D --> E[Environment]
    E -->|Sensory Feedback| A

    style A fill:#e3f2fd
    style B fill:#f3e5f5
    style C fill:#fff8e1
    style D fill:#e8f5e9
    style E fill:#fce4ec

Formal Description:

At time \(t\), the agent is in state \(s_t\) and performs the following steps:

  1. Sense: Obtain observation \(o_t = h(s_t) + \epsilon_t\) from the environment
  2. Model/Estimate: Update belief based on observations \(b_t = p(s_t | o_{1:t}, a_{1:t-1})\)
  3. Plan/Decide: Select action \(a_t = \pi(b_t)\) or \(a_t = \pi(o_{1:t})\)
  4. Execute: Execute the action; environment transitions \(s_{t+1} \sim p(s_{t+1} | s_t, a_t)\)
  5. Feedback: Environmental changes produce new sensory input; the loop repeats
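
The five steps above can be sketched as a minimal simulated loop. Everything here is an illustrative assumption: a one-dimensional state, an identity observation model with Gaussian noise, a crude belief update, and a proportional policy.

```python
import random

def sense(state, noise=0.05):
    # Sense: o_t = h(s_t) + eps_t, with h taken as the identity
    return state + random.gauss(0.0, noise)

def update_belief(belief, obs, gain=0.5):
    # Model/Estimate: nudge the belief toward the new observation
    return belief + gain * (obs - belief)

def policy(belief, target=1.0, k=0.3):
    # Plan/Decide: proportional controller pushing the state to the target
    return k * (target - belief)

def step_env(state, action):
    # Execute: deterministic transition s_{t+1} = s_t + a_t
    return state + action

random.seed(0)
state, belief = 0.0, 0.0
for t in range(200):
    obs = sense(state)                   # 1. Sense
    belief = update_belief(belief, obs)  # 2. Model
    action = policy(belief)              # 3. Plan
    state = step_env(state, action)      # 4/5. Act; feedback closes the loop

print(round(state, 2))  # the loop settles near the target of 1.0
```

Note that no single step "solves" the task; convergence emerges from repeating the cycle.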

1.2 Timescales

The perception-action loop operates simultaneously across multiple timescales:

| Timescale | Period | Level | Example |
| --- | --- | --- | --- |
| Reflex | 1-10 ms | Spinal/low-level control | Joint servo, force feedback |
| Motor control | 10-100 ms | Motor cortex/mid-level control | Trajectory tracking, impedance regulation |
| Behavioral | 0.1-10 s | Behavior planning | Grasping objects, obstacle avoidance |
| Task | 10-1000 s | Task planning | Completing a manipulation task |
| Strategic | Minutes-hours | High-level decision-making | Choosing a strategy to complete a task |

2. Gibson's Ecological Psychology and Affordances

2.1 The Ecological Approach

In The Ecological Approach to Visual Perception, James J. Gibson (1979) proposed a theory of perception fundamentally different from the information-processing paradigm.

Core Claims:

  • Perception is direct, requiring no mediation by internal representations
  • The environment contains rich information waiting to be "picked up"
  • Perception serves action; perception and action are unified

2.2 Affordances

Definition: Affordances are the action opportunities that the environment offers relative to a given agent.

\[\mu(e, a) = f(\text{environment properties}, \text{agent capabilities})\]

where \(e\) is an environmental element and \(a\) is the agent.

Key Properties:

  • Relational: Affordances are neither purely environmental nor purely agent properties, but a relationship between the two
  • Directly Perceived: Affordances can be directly perceived without inference
  • Action-Oriented: Affordances point directly to possible actions

Examples:

| Environmental Element | Human Affordances | Robot Affordances |
| --- | --- | --- |
| Chair | Sittable, standable, carriable | Pushable, navigable around |
| Cup | Graspable, drinkable, pourable | Graspable (depends on gripper morphology) |
| Stairs | Climbable | Depends on legged/wheeled design |
| Door handle | Turnable/pressable | Depends on end-effector |

2.3 Affordances in Robotics

SayCan (Ahn et al., 2022) Affordance Scoring:

\[\text{score}(a) = p(\text{useful} | a, l) \cdot p(\text{possible} | a, s)\]

where:

  • \(p(\text{useful} | a, l)\): Language model evaluates the usefulness of action \(a\) for instruction \(l\)
  • \(p(\text{possible} | a, s)\): Value function evaluates the feasibility of action \(a\) in state \(s\)
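
A toy version of this scoring is easy to write down; the candidate actions and probabilities below are made-up numbers for illustration, not outputs of the actual SayCan models.

```python
def saycan_score(p_useful, p_possible):
    # score(a) = p(useful | a, l) * p(possible | a, s)
    return p_useful * p_possible

# Hypothetical candidates for the instruction "bring me a drink"
candidates = {
    "pick up the can": saycan_score(p_useful=0.9, p_possible=0.8),
    "open the fridge": saycan_score(p_useful=0.7, p_possible=0.9),
    "wipe the table":  saycan_score(p_useful=0.1, p_possible=0.9),
}
best_action = max(candidates, key=candidates.get)
print(best_action)  # -> pick up the can (score 0.72)
```

The product form matters: an action the language model loves but the robot cannot execute (or vice versa) scores low.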

Learning Affordances:

  • Visual affordance prediction: Given an image, predict possible interaction regions and modes
  • Contact affordances: Where to grasp, push, or place
  • Where2Act (Mo et al., 2021): Learning manipulation points on 3D objects

3. Active Perception

3.1 Definition

Active Perception refers to the agent's use of purposeful perceptual actions to acquire information and reduce uncertainty.

Contrast with passive perception:

  • Passive Perception: Process current sensory input \(\rightarrow\) \(o_t \xrightarrow{\text{process}} \hat{s}_t\)
  • Active Perception: Select perceptual actions to maximize information gain \(\rightarrow\) \(a_t^{\text{percept}} = \arg\max_a I(s; o_{t+1} | a)\)

3.2 Information Gain Optimization

Active perception can be formalized as an information gain maximization problem:

\[a^* = \arg\max_a \mathbb{E}_{o \sim p(o|a)} \left[ D_{KL}\left[ p(s|o, a) \| p(s) \right] \right]\]

That is, select the perceptual action that maximizes information gain about the world state.

Equivalent Form (Mutual Information Maximization):

\[a^* = \arg\max_a I(S; O | a) = \arg\max_a \left[ H(O|a) - H(O|S, a) \right]\]
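
For discrete states and observations, the expected-KL objective above can be evaluated exactly. The two-state example and the sensor likelihoods below are illustrative assumptions.

```python
import math

def info_gain(prior, likelihood):
    # E_{o ~ p(o|a)}[ KL( p(s|o,a) || p(s) ) ] for discrete s and o;
    # likelihood[o][s] = p(o | s, a)
    gain = 0.0
    for lik_o in likelihood:
        p_o = sum(l * p for l, p in zip(lik_o, prior))  # p(o | a)
        if p_o == 0.0:
            continue
        post = [l * p / p_o for l, p in zip(lik_o, prior)]  # Bayes posterior
        kl = sum(q * math.log(q / p) for q, p in zip(post, prior) if q > 0.0)
        gain += p_o * kl
    return gain

prior = [0.5, 0.5]
sharp = [[0.9, 0.1], [0.1, 0.9]]       # informative viewpoint
blurry = [[0.55, 0.45], [0.45, 0.55]]  # barely informative viewpoint
gain_sharp = info_gain(prior, sharp)
gain_blurry = info_gain(prior, blurry)
print(gain_sharp > gain_blurry)  # the sharper sensor is worth choosing
```

An active perceiver would enumerate candidate sensing actions this way and pick the argmax.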

3.3 Application Scenarios

Observation Planning Before Grasping:

Before grasping, a robot needs to select the best observation viewpoint:

  1. Initial observation \(\rightarrow\) rough object detection
  2. Select the best viewpoint (maximum information gain)
  3. Move camera/arm to that viewpoint
  4. Obtain a more precise object model
  5. Execute the grasp

Tactile Exploration:

When visual information is insufficient (e.g., transparent or soft objects), the robot actively explores through touch:

\[a_t^{\text{touch}} = \arg\max_a \mathbb{E}\left[ \text{IG}(\text{shape} | \text{contact}_a) \right]\]

4. Sensorimotor Integration

4.1 Multimodal Fusion

Sensorimotor integration is the process of unifying sensory information from different modalities with motor commands:

\[z_t = \text{Fuse}(o_t^{\text{vision}}, o_t^{\text{tactile}}, o_t^{\text{proprio}}, o_t^{\text{audio}})\]

Fusion Strategies:

| Strategy | Description | Advantages | Disadvantages |
| --- | --- | --- | --- |
| Early fusion | Concatenate all modalities at the input layer | Simple | Cross-modal interference |
| Late fusion | Process each modality independently, then fuse decisions | Good modal independence | Loses cross-modal information |
| Attention fusion | Transformer cross-attention | Flexible, learnable weights | Computationally expensive |
| Hierarchical fusion | Fuse different modalities at different levels | Biologically plausible | Complex design |
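
The first two strategies can be sketched with plain Python lists standing in for feature vectors; the feature values and decision weights are arbitrary.

```python
def early_fusion(vision, tactile, proprio):
    # Early fusion: concatenate raw features before any processing
    return vision + tactile + proprio  # list concatenation

def late_fusion(decisions, weights):
    # Late fusion: each modality decides independently, then votes are blended
    return sum(w * d for w, d in zip(weights, decisions))

z = early_fusion([0.2, 0.4], [0.9], [0.1, 0.0, 0.3])
grasp_score = late_fusion(decisions=[0.8, 0.6], weights=[0.7, 0.3])
print(len(z), round(grasp_score, 2))
```

The trade-off in the table is visible even here: early fusion keeps all cross-modal detail but forces one model to handle everything; late fusion keeps modalities clean but only merges at the decision level.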

4.2 Proprioception

Proprioception is the agent's awareness of its own body state:

  • Joint angles/velocities: \(q_t, \dot{q}_t\)
  • End-effector pose: \(T_{\text{ee}} \in SE(3)\)
  • Joint torques: \(\tau_t\)
  • Base attitude: IMU data

The role of proprioception in the perception-action loop:

\[a_t = \pi(o_t^{\text{ext}}, o_t^{\text{proprio}})\]

Policies lacking proprioceptive information typically suffer significant performance degradation, because proprioception provides:

  • Current actuator state (essential for closed-loop control)
  • Contact detection (torque discontinuities indicate contact)
  • Motion state estimation
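
The contact-detection role, for example, can be sketched with a simple torque-jump test; the torque trace and threshold below are made up for illustration.

```python
def detect_contact(torques, threshold=0.5):
    # A sudden jump between consecutive torque samples suggests contact
    return [abs(b - a) > threshold for a, b in zip(torques, torques[1:])]

tau = [0.10, 0.12, 0.11, 0.95, 0.97]  # Nm; the discontinuity is at sample 3
events = detect_contact(tau)
print(events)  # contact flagged between samples 2 and 3
```

Real systems use filtered torque residuals rather than raw differences, but the principle is the same: proprioception reveals contact that vision may miss.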

4.3 Predictive Coding

The brain uses a predictive coding mechanism in sensorimotor integration:

\[\text{Prediction Error} = o_t - \hat{o}_t = o_t - g(\hat{s}_t, a_{t-1})\]

When the prediction error is large, the system needs to update its internal model; when the prediction error is near zero, the system is in a steady state.
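
A one-parameter version of this mechanism: an internal forward model with the wrong gain generates predictions, and the prediction error drives the model update. The dynamics and learning rate are illustrative assumptions.

```python
def forward_model(state, action, w):
    # Internal model: predicted observation o_hat = g(s, a) = s + w * a
    return state + w * action

true_gain, w, lr = 2.0, 0.5, 0.1  # the model starts with a wrong gain
state = 0.0
for t in range(100):
    action = 1.0
    o_pred = forward_model(state, action, w)  # predict the outcome
    state = state + true_gain * action        # actual sensory outcome o_t
    error = state - o_pred                    # prediction error
    w += lr * error * action                  # update to shrink future error

print(round(w, 3))  # the gain estimate approaches the true value 2.0
```

Once `w` matches the true gain, prediction error stays near zero and the update stops: the steady state described above.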


5. Open-Loop vs. Closed-Loop Control

5.1 Open-Loop Control

\[a_{1:T} = \pi(o_1)\]

The agent generates the entire action sequence at once based on the initial observation, without using feedback during execution.

Advantages:

  • Computationally efficient (only one inference)
  • Does not depend on real-time perception
  • Suitable for fast, short-duration actions

Disadvantages:

  • Cannot adapt to environmental changes
  • Errors accumulate
  • Sensitive to perturbations

Typical Applications: Throwing, striking, and other fast actions

5.2 Closed-Loop Control

\[a_t = \pi(o_t) \quad \text{or} \quad a_t = \pi(o_{1:t})\]

At each timestep, the latest observation is used to determine the action.

Advantages:

  • Can adapt to environmental changes
  • Can compensate for execution errors
  • Strong robustness

Disadvantages:

  • Requires real-time perception (latency constraints)
  • High computational burden
  • Sensitive to perception noise

Typical Applications: Precision assembly, dynamic manipulation, navigation
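
The trade-off can be demonstrated on a toy tracking task: both controllers share the same internal model, but only the closed-loop one sees feedback during execution. The dynamics, gains, and disturbance level are illustrative assumptions.

```python
import random
import statistics

def plan_open_loop(start, target, k, steps):
    # Open loop: roll out the internal model once, commit to the sequence
    pred, actions = start, []
    for _ in range(steps):
        a = k * (target - pred)
        actions.append(a)
        pred += a  # the internal model ignores disturbances
    return actions

def run(closed_loop, seed, target=1.0, k=0.5, steps=50, drift=0.03):
    rng = random.Random(seed)
    state = 0.0
    plan = plan_open_loop(state, target, k, steps)
    for t in range(steps):
        a = k * (target - state) if closed_loop else plan[t]
        state += a + rng.gauss(0.0, drift)  # unmodeled disturbance
    return abs(target - state)

err_open = statistics.mean(run(False, s) for s in range(50))
err_closed = statistics.mean(run(True, s) for s in range(50))
print(err_open > err_closed)  # feedback suppresses accumulated drift
```

The open-loop plan is optimal for the nominal model, yet its error grows with the accumulated disturbance, while the closed-loop policy keeps correcting it away.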

5.3 Hybrid Mode

Practical systems typically adopt a hybrid mode:

flowchart LR
    subgraph Outer Closed Loop
        A[Visual Observation] --> B[High-Level Decision<br/>10-50 Hz]
        B --> C[Target Trajectory]
    end
    subgraph Inner Closed Loop
        C --> D[Trajectory Tracking<br/>100-1000 Hz]
        D --> E[Joint Commands]
        F[Joint Sensors] --> D
    end
    E --> G[Actuators]
    G --> F

  • Outer Loop: Vision-based task-level closed loop (10-50 Hz)
  • Inner Loop: Proprioception-based motion-level closed loop (100-1000 Hz)
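
A dual-rate sketch of this structure, with a 20 Hz outer loop feeding targets to a 200 Hz PD inner loop; all rates, gains, and dynamics here are illustrative assumptions.

```python
def hybrid_control(goal=1.0, outer_hz=20, inner_hz=200, duration_s=2.0):
    # Slow outer loop replans the target; fast inner loop tracks it.
    dt = 1.0 / inner_hz
    ratio = inner_hz // outer_hz  # inner ticks per outer tick
    pos, vel, target = 0.0, 0.0, 0.0
    for tick in range(int(duration_s * inner_hz)):
        if tick % ratio == 0:
            # Outer loop (vision rate): step the target toward the goal
            target = target + 0.2 * (goal - target)
        # Inner loop (proprioception rate): PD tracking of the current target
        acc = 40.0 * (target - pos) - 10.0 * vel
        vel += acc * dt
        pos += vel * dt
    return pos

final_pos = hybrid_control()
print(round(final_pos, 2))  # settles at the goal position
```

The key design point is that the inner loop never waits for the outer loop: it always tracks the most recent target, so slow vision cannot stall fast stabilization.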

5.4 Reactive vs. Deliberative Control

| Dimension | Reactive | Deliberative |
| --- | --- | --- |
| Response speed | Fast (ms) | Slow (100 ms - s) |
| Perception requirements | Simple sensors | Complex multimodal |
| Planning depth | None / very shallow | Multi-step planning |
| Adaptability | Fixed behavior patterns | Flexibly adapts to new situations |
| Representative architecture | Subsumption | TAMP |

Brooks (1986) proposed the Subsumption Architecture, arguing:

Intelligence does not require internal representations; it can emerge from the layering of simple reactive behaviors.

The modern view holds that reactive and deliberative control should work cooperatively within the same system, similar to Kahneman's System 1 (fast intuition) and System 2 (slow reasoning).


6. Modern Implementations of the Perception-Action Loop

6.1 End-to-End Policies

Modern VLA models implement the perception-action loop as:

\[a_t = f_\theta(I_t, l, q_t)\]

where \(I_t\) is the image, \(l\) is the language instruction, and \(q_t\) is the joint state.

6.2 Frequency and Latency

Key constraints in practical systems:

\[\text{Control Frequency} > \frac{1}{\text{Task Characteristic Timescale}}\]

| Task Type | Minimum Control Frequency | Policy Inference Constraint |
| --- | --- | --- |
| Static grasping | 5-10 Hz | Inference time < 100 ms |
| Dynamic manipulation | 20-50 Hz | Inference time < 20 ms |
| Walking balance | 100-500 Hz | Inference time < 2 ms |
| Dexterous hand manipulation | 50-200 Hz | Inference time < 5 ms |

This places strict requirements on model inference efficiency, which is why low-level control typically uses lightweight controllers rather than large neural networks.
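
The budgets in the table follow directly from the period of each loop. The helper below is a back-of-the-envelope sketch; `duty` is a hypothetical parameter for reserving only part of the period for inference.

```python
def max_inference_ms(control_hz, duty=1.0):
    # The policy must return within (a fraction of) one control period
    return duty * 1000.0 / control_hz

budget_grasp = max_inference_ms(10)   # static grasping at 10 Hz -> 100.0 ms
budget_walk = max_inference_ms(500)   # walking balance at 500 Hz -> 2.0 ms
print(budget_grasp, budget_walk)
```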


7. Summary

Core insights of the perception-action loop:

  1. Intelligence is cyclical: It is not a unidirectional "perceive \(\rightarrow\) think \(\rightarrow\) act" process, but a continuous closed-loop interaction
  2. Perception serves action: The purpose of perception is to support action; active perception is superior to passive observation
  3. Affordances are the bridge: Environment and agent are linked through affordances
  4. Multi-scale parallelism: From millisecond-level reflexes to minute-level planning running simultaneously
  5. The body participates in cognition: Proprioception is an indispensable part of the perception-action loop

References

  • Gibson, J. J. (1979). The Ecological Approach to Visual Perception
  • Brooks, R. A. (1986). "A Robust Layered Control System for a Mobile Robot"
  • Ahn, M. et al. (2022). "Do As I Can, Not As I Say: Grounding Language in Robotic Affordances"
  • Bajcsy, R. (1988). "Active Perception"
  • Mo, K. et al. (2021). "Where2Act: From Pixels to Actions for Articulated 3D Objects"
