v0.2: Replacing mock skills with a truly autonomous Stretch

v0.1 proved that the ANIMA cognition stack could close the loop from language input. v0.2 replaces the mock skills underneath with genuinely autonomous execution: every run queries object poses from the MuJoCo scene defined by the MJCF, solves IK, and steps physics. Nothing is replayed from a script.

One pivot during implementation is worth recording.

Pivot: from ROS 2 + Gazebo to pure-Python MuJoCo

Original plan: Docker + ROS 2 Humble + Gazebo Harmonic + MoveIt 2 + Nav2 — the canonical "heavy" robot stack.

Three things surfaced during execution:

  1. hello-robot does not ship a stretch_moveit2 apt package or GitHub config, so we would have had to write the SRDF and kinematics config ourselves (3–5 hours of work).
  2. The stretch_mujoco Python API was already sufficient: sim.pull_camera_data() returns RGB frames; sim.move_to(Actuators.lift, pos) plus wait_until_at_setpoint gives joint position control; sim.set_base_velocity(v, ω) sets the base velocity. The Stretch arm has only 5 DoF (lift Z, arm extension, and wrist yaw/pitch/roll), so analytical IK fits in about 10 lines.
  3. Nav2 is clearly overkill for a closed scene consisting of a single room plus a hallway.
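
The claim that analytical IK fits in a handful of lines can be illustrated with a minimal sketch. It assumes the target is given in the robot base frame, that the arm axis passes through the base rotation centre, and that the wrist is left at a fixed orientation. The function name, the geometry offsets, and the joint ranges are all hypothetical placeholders, not values taken from stretch_mujoco.

```python
import math
from dataclasses import dataclass

# Placeholder geometry and limits: illustrative assumptions only,
# not read from stretch_mujoco or the Stretch model.
LIFT_RANGE = (0.0, 1.1)        # lift travel in metres (assumed)
ARM_RANGE = (0.0, 0.5)         # telescoping extension in metres (assumed)
ARM_AXIS_YAW = -math.pi / 2    # direction the arm extends in the base frame (assumed)
ARM_BASE_OFFSET = 0.3          # gripper distance from base centre when retracted (assumed)
LIFT_ZERO_HEIGHT = 0.2         # gripper height when lift = 0 (assumed)


@dataclass
class ArmSolution:
    base_turn: float   # radians to rotate the base so the arm axis faces the target
    lift: float        # lift setpoint in metres
    extension: float   # arm extension setpoint in metres
    reachable: bool    # False if a setpoint had to be clamped to its range


def _clamp(value: float, low: float, high: float) -> float:
    return max(low, min(high, value))


def solve_arm_ik(x: float, y: float, z: float) -> ArmSolution:
    """Closed-form IK for a target (x, y, z) expressed in the base frame."""
    # Rotate the base until the arm axis points at the target, wrapped to [-pi, pi].
    turn = math.atan2(y, x) - ARM_AXIS_YAW
    base_turn = math.atan2(math.sin(turn), math.cos(turn))
    # Horizontal distance is covered by the extension, height by the lift.
    raw_extension = math.hypot(x, y) - ARM_BASE_OFFSET
    raw_lift = z - LIFT_ZERO_HEIGHT
    extension = _clamp(raw_extension, *ARM_RANGE)
    lift = _clamp(raw_lift, *LIFT_RANGE)
    reachable = extension == raw_extension and lift == raw_lift
    return ArmSolution(base_turn, lift, extension, reachable)
```

In a skill, the resulting lift and extension values would be sent as joint setpoints through sim.move_to, and base_turn would be handled by the base controller. An unreachable result signals that the base needs to move closer first.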

New plan: pure-Python stretch_mujoco API + analytical IK + unicycle PID. ANIMA's L3 skills call stretch_mujoco directly, with no ROS nodes in between.

Side benefits that made this the right call:

  • MuJoCo runs natively on macOS, so the sim can run on the laptop without VS Code Remote SSH
  • Deployment collapses to three processes (FastAPI + Next.js + MuJoCo); no rosdep / colcon / DDS tuning
  • The real-hardware path isn't closed: v1.x can swap SimSkillBehaviour for Ros2SkillBehaviour while leaving the py_trees orchestration untouched

The ROS 2 Humble + stretch_ros2 workspace stays around as a reference for future hardware work.

Six L3 skill primitives wired up

A new SimSkillBehaviour base class plus six concrete skills, all sharing one blackboard dict:

  • locate — query the target body's pose in simulation
  • navigate — unicycle PID with align / approach hysteresis to prevent oscillation
  • grasp — analytical IK + close gripper
  • lift — raise to a safe height
  • deliver — drive to the bedside pose
  • release — open gripper
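
The align / approach hysteresis used by navigate can be sketched as a two-mode unicycle controller. This is a simplified illustration, not the project's controller: it uses proportional terms only (integral and derivative terms are omitted for brevity), and every gain, threshold, and class name below is an assumption. The key idea is that the threshold for leaving the align mode is tighter than the threshold for re-entering it, so the robot does not flip between modes when the heading error hovers near a single cutoff.

```python
import math

ALIGN_EXIT = math.radians(5)     # leave "align" once heading error drops below this
ALIGN_ENTER = math.radians(20)   # go back to "align" only if error grows past this
GOAL_TOL = 0.05                  # stop within this distance of the goal (metres)
K_LIN, K_ANG = 0.8, 2.0          # proportional gains (assumed)
V_MAX, W_MAX = 0.3, 1.0          # velocity limits in m/s and rad/s (assumed)


def _wrap(angle: float) -> float:
    """Wrap an angle to [-pi, pi]."""
    return math.atan2(math.sin(angle), math.cos(angle))


class UnicycleNavigator:
    def __init__(self) -> None:
        self.mode = "align"

    def step(self, x, y, theta, goal_x, goal_y):
        """Return (v, w, done) for the current pose and goal."""
        dist = math.hypot(goal_x - x, goal_y - y)
        if dist < GOAL_TOL:
            return 0.0, 0.0, True
        err = _wrap(math.atan2(goal_y - y, goal_x - x) - theta)
        # Hysteresis: the two thresholds differ, so small errors cannot toggle modes.
        if self.mode == "align" and abs(err) < ALIGN_EXIT:
            self.mode = "approach"
        elif self.mode == "approach" and abs(err) > ALIGN_ENTER:
            self.mode = "align"
        w = max(-W_MAX, min(W_MAX, K_ANG * err))
        # Rotate in place while aligning; drive forward only while approaching.
        v = 0.0 if self.mode == "align" else min(V_MAX, K_LIN * dist)
        return v, w, False


def simulate(goal, dt=0.02, max_steps=2000):
    """Integrate an ideal unicycle from the origin; returns (x, y, reached)."""
    nav = UnicycleNavigator()
    x = y = theta = 0.0
    for _ in range(max_steps):
        v, w, done = nav.step(x, y, theta, *goal)
        if done:
            return x, y, True
        x += v * math.cos(theta) * dt
        y += v * math.sin(theta) * dt
        theta += w * dt
    return x, y, False
```

In the skill itself, the (v, w) pair from step would be passed to sim.set_base_velocity on each control tick; simulate only stands in for the physics so the controller can be exercised without MuJoCo.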

The L2 Planner also gains a branch that falls back to the v0.1 mock skills when the simulation is unavailable, so a zero-config local setup still runs end to end.
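
The fallback branch and the shared blackboard can be sketched in plain Python. This is a stand-in for illustration only: the real orchestration uses py_trees and SimSkillBehaviour, whereas the factories, the "trace" key, and the function names here are hypothetical. The skill names and their order come from the list above.

```python
from typing import Any, Callable, Dict, List

Blackboard = Dict[str, Any]
Skill = Callable[[Blackboard], bool]

SKILL_ORDER = ["locate", "navigate", "grasp", "lift", "deliver", "release"]


def make_sim_skill(name: str) -> Skill:
    """Hypothetical stand-in for a simulation-backed skill."""
    def run(blackboard: Blackboard) -> bool:
        blackboard.setdefault("trace", []).append(f"sim:{name}")
        return True
    return run


def make_mock_skill(name: str) -> Skill:
    """Hypothetical stand-in for a v0.1-style mock skill."""
    def run(blackboard: Blackboard) -> bool:
        blackboard.setdefault("trace", []).append(f"mock:{name}")
        return True
    return run


def build_plan(sim_available: bool) -> List[Skill]:
    """Pick the skill implementation once, then build the six-step sequence."""
    factory = make_sim_skill if sim_available else make_mock_skill
    return [factory(name) for name in SKILL_ORDER]


def run_plan(plan: List[Skill], blackboard: Blackboard) -> bool:
    """Run skills in order on one shared dict; stop at the first failure."""
    return all(skill(blackboard) for skill in plan)
```

Because every skill reads and writes the same dict, a value stored by an early step (for example the pose found by locate) is available to later steps without any extra plumbing, whichever implementation was selected.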

Simulation view lands in the frontend

The backend's sim/manager.py holds the StretchMujocoSimulator and renders the third-person demo_view camera itself. The stretch_mujoco camera pipeline doesn't expose custom MJCF cameras, so this path bypasses it and uses mujoco.Renderer directly.

Three routes:

  • /api/sim/mjpeg — multipart MJPEG stream
  • /api/sim/reset — reset scene to initial state
  • /api/sim/status — health probe
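
For reference, the multipart MJPEG format behind /api/sim/mjpeg can be sketched as a generator that wraps each JPEG-encoded frame in its own part. The boundary token and function name are assumptions; the actual route may frame its stream differently.

```python
from typing import Iterable, Iterator

BOUNDARY = "frame"  # assumed boundary token
MEDIA_TYPE = f"multipart/x-mixed-replace; boundary={BOUNDARY}"


def mjpeg_chunks(jpeg_frames: Iterable[bytes]) -> Iterator[bytes]:
    """Yield one multipart chunk per already-encoded JPEG frame."""
    for jpeg in jpeg_frames:
        header = (
            f"--{BOUNDARY}\r\n"
            "Content-Type: image/jpeg\r\n"
            f"Content-Length: {len(jpeg)}\r\n"
            "\r\n"
        ).encode("ascii")
        # Each part is header + image bytes + CRLF; the browser swaps in the new image.
        yield header + jpeg + b"\r\n"
```

In FastAPI, a generator like this could be handed to a StreamingResponse with media_type set to MEDIA_TYPE, which is what lets a plain <img> tag display the stream.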

Frontend adds a SimulationView.tsx component (an <img> fed by the MJPEG stream plus a Reset button). Layout expands from three columns to four (nav / sim live / intent+BT / factors).

L5 starts being data-driven

In l5_assessment.py, p_skill changes from a fixed constant of 0.91 to a rolling success rate read from pea_log.jsonl. PEA finally starts feeding back into GOA instead of being a static number.
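
A rolling success rate over a JSONL log can be computed along these lines. This is a minimal sketch, not the code in l5_assessment.py: the window size, the function name, and the choice to fall back to the old 0.91 constant when no rows exist are assumptions. The only field relied on is outcome, with the value success as it appears in the PEA log.

```python
import json
from collections import deque
from pathlib import Path

DEFAULT_P_SKILL = 0.91  # former constant, reused here as the empty-log fallback (assumed)
WINDOW = 20             # number of most recent outcomes to average (assumed)


def rolling_p_skill(log_path, window: int = WINDOW,
                    default: float = DEFAULT_P_SKILL) -> float:
    """Return the success rate over the last `window` logged outcomes."""
    path = Path(log_path)
    if not path.exists():
        return default
    recent = deque(maxlen=window)  # older rows fall off automatically
    with path.open(encoding="utf-8") as fh:
        for line in fh:
            line = line.strip()
            if not line:
                continue
            try:
                row = json.loads(line)
            except json.JSONDecodeError:
                continue  # skip malformed rows rather than failing the assessment
            if isinstance(row, dict) and "outcome" in row:
                recent.append(row["outcome"] == "success")
    if not recent:
        return default
    return sum(recent) / len(recent)
```

Falling back to a prior when the log is empty keeps the first runs on a fresh install from reporting a success rate of zero.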

Deployment

Deployed on Hetzner (89.167.35.145) as three systemd units behind an nginx reverse proxy: / → Next.js prod, /api → FastAPI, /ws → FastAPI WebSocket upgrade. Nginx buffering is disabled for /api/sim/ so the MJPEG stream arrives in real time.

Acceptance (end-to-end script)

  • Open http://89.167.35.145, see the ward scene (bed, nightstand, water cup, Stretch at starting pose)
  • Type "I want water" → L0 through L5 light up sequentially → TaskSpec emits DRINK_WATER → all six behavior-tree nodes turn green
  • MJPEG shows Stretch autonomously navigating, reaching toward the cup, closing the gripper, lifting, and returning to the bedside; each run recomputes IK rather than replaying a script
  • Five factors update as the task progresses; PEA logs a new outcome=success row
  • The Reset Sim button returns the scene to initial state within 2 seconds

Deferred to later versions

  • Six-scene branching logic (CALL_HELP / ADJUST_TV / …) → v0.3
  • Failure-fallback narratives → v0.3
  • Videos for non–DRINK_WATER scenes → v0.4