V1.01 -- Teaching the Arm Where Things Are

V1.00 gave the arm the ability to move and see, but it still has no idea where anything is. A chess piece in the camera frame is just a patch of pixels -- the arm cannot answer the question "where is this piece on the physical table?" V1.01 exists to solve exactly that by building the complete mapping chain from pixels to the physical world.

This is not a tweak-a-few-parameters step. It involves three layers of calibration that depend on each other: first, characterize the camera's own optics, meaning its focal length, principal point, and lens distortion (intrinsic calibration); then establish the spatial relationship between the camera and the arm's base frame (eye-to-hand calibration); finally, map the chessboard's physical dimensions and workspace boundaries into a unified world coordinate system (workspace grounding). Each layer consumes the output of the one before it, so an error in any layer propagates into downstream piece localization.

Why Camera Parameters Must Be Locked First

Before any calibration can begin, there is a more fundamental prerequisite: the camera's exposure, white balance, and focus must be stable. If these parameters drift automatically every time the system starts, the same chess piece will look different in brightness, color, and sharpness across sessions, making calibration data unreliable.

The first step in V1.01 was adding camera control capability to the Windows-side TCP bridge sender. Exposure, white balance, and autofocus can now be locked to manual values via command-line arguments or preset profiles. The sender also supports launching the native Windows DirectShow settings dialog on startup for initial parameter tuning, then switching to a locked mode for cross-session consistency.
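As a rough illustration of how a preset profile and command-line overrides might combine, here is a minimal sketch using only the standard library. The flag names, the preset name, and all numeric values are hypothetical placeholders chosen for this example; they are not the sender's actual interface or tuned settings.

```python
import argparse

# Hypothetical preset profile; the values are placeholders, not tuned settings.
PRESETS = {
    "locked_default": {"exposure": -6, "white_balance": 4500, "autofocus": False},
}


def build_parser() -> argparse.ArgumentParser:
    """Build a parser with illustrative (assumed) camera-lock flags."""
    parser = argparse.ArgumentParser(description="Camera sender (illustrative flags)")
    parser.add_argument("--preset", choices=sorted(PRESETS), default=None)
    parser.add_argument("--exposure", type=int, default=None)
    parser.add_argument("--white-balance", type=int, default=None)
    parser.add_argument("--autofocus", action="store_true")
    return parser


def resolve_settings(argv):
    """Start from the preset (if any), then let explicit flags override it."""
    args = build_parser().parse_args(argv)
    settings = {"exposure": None, "white_balance": None, "autofocus": False}
    if args.preset is not None:
        settings.update(PRESETS[args.preset])
    if args.exposure is not None:
        settings["exposure"] = args.exposure
    if args.white_balance is not None:
        settings["white_balance"] = args.white_balance
    if args.autofocus:
        settings["autofocus"] = True
    return settings
```

Resolving everything into one settings dictionary before touching the camera keeps the "what was requested" step separate from the "apply it to the device" step, which makes the locked values easy to log for each session.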

On top of this, an experimental adaptive camera controller was implemented: it makes small adjustments based on brightness, overexposure ratio, contrast, and sharpness within a workspace ROI, with cooldown timers and hysteresis to prevent oscillation. This path is optional -- the default remains a manually locked baseline.
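The interaction between cooldown and hysteresis can be sketched as follows. This is a simplified stand-in, not the project's controller: it looks only at mean brightness (the real one also weighs overexposure ratio, contrast, and sharpness), and the class name, thresholds, and step size are assumptions made for the example.

```python
from dataclasses import dataclass


@dataclass
class BrightnessController:
    """Toy exposure controller: outer band triggers, inner band releases."""
    low: float = 90.0           # start raising exposure below this (assumed value)
    high: float = 160.0         # start lowering exposure above this (assumed value)
    release_low: float = 110.0  # stop raising once brightness is back above this
    release_high: float = 140.0 # stop lowering once brightness is back below this
    cooldown_s: float = 2.0     # minimum time between two adjustments
    _direction: int = 0         # -1 lowering, 0 idle, +1 raising
    _last_change: float = float("-inf")

    def update(self, mean_brightness: float, now: float) -> int:
        """Return the exposure step to apply now: -1, 0, or +1."""
        if self._direction == 0:
            # Idle: only the outer band can start a correction.
            if mean_brightness < self.low:
                self._direction = 1
            elif mean_brightness > self.high:
                self._direction = -1
        elif self._direction > 0 and mean_brightness >= self.release_low:
            self._direction = 0
        elif self._direction < 0 and mean_brightness <= self.release_high:
            self._direction = 0

        if self._direction == 0:
            return 0
        # Cooldown: let the previous adjustment settle before changing again.
        if now - self._last_change < self.cooldown_s:
            return 0
        self._last_change = now
        return self._direction
```

Because the release thresholds sit inside the trigger thresholds, a reading that hovers near a single boundary cannot flip the controller on and off every frame, and the cooldown caps how often the camera is touched even while a correction is active.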

The Calibration Toolchain: From Images to World Coordinates

The calibration toolchain is organized in three steps, each with a dedicated script and a fixed output path:

Camera intrinsic calibration uses a ChArUco calibration board. After collecting images from multiple angles, the script automatically detects corners, solves for the camera matrix and distortion coefficients, and writes a standardized intrinsics file. The target is a reprojection error below 1 pixel.
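To make the reprojection-error target concrete, the sketch below projects known 3D points through a pinhole model with two radial distortion terms and compares the result with the detected pixel positions. It uses the root-mean-square definition, which is one common convention; whether the project's threshold is measured the same way is an assumption, as are the function names and example numbers.

```python
import math


def project(point_cam, fx, fy, cx, cy, k1=0.0, k2=0.0):
    """Project a 3D point in the camera frame to pixels (pinhole + radial terms)."""
    X, Y, Z = point_cam
    x, y = X / Z, Y / Z                   # normalized image coordinates
    r2 = x * x + y * y
    d = 1.0 + k1 * r2 + k2 * r2 * r2      # radial distortion factor
    return fx * x * d + cx, fy * y * d + cy


def rms_reprojection_error(points_cam, observed_px, **intrinsics):
    """RMS pixel distance between projected and observed corner positions."""
    total_sq = 0.0
    for point, (u, v) in zip(points_cam, observed_px):
        pu, pv = project(point, **intrinsics)
        total_sq += (pu - u) ** 2 + (pv - v) ** 2
    return math.sqrt(total_sq / len(points_cam))


# Illustrative numbers only: two corners, one detected 1 px off.
intr = dict(fx=600.0, fy=600.0, cx=320.0, cy=240.0)
corners = [(0.0, 0.0, 1.0), (0.1, 0.0, 1.0)]
detected = [(320.0, 240.0), (381.0, 240.0)]
error = rms_reprojection_error(corners, detected, **intr)
```

A low error shows that the fitted model explains the detected corners well; it does not by itself guarantee good coverage of the image edges, which is why images from multiple angles matter.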

Eye-to-hand calibration establishes the transform between the camera frame and the arm's base frame. Using a reference image where the board is fully visible, combined with the intrinsics data, it solves for a homography matrix and a pose transform. Once complete, any pixel that images a point on the working plane can be projected into arm coordinates. The homography holds only for that plane, so points above it, such as the top of a piece, need a height correction or the full pose transform.
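Applying a homography to a pixel is a matrix multiply followed by a perspective divide. The sketch below shows that step in plain Python; the matrix is made up for illustration (0.5 mm per pixel, with the image center mapped to the plane origin and the y axis flipped) and is not a calibration result.

```python
def pixel_to_plane(H, u, v):
    """Map pixel (u, v) to plane coordinates using a 3x3 homography H."""
    x = H[0][0] * u + H[0][1] * v + H[0][2]
    y = H[1][0] * u + H[1][1] * v + H[1][2]
    w = H[2][0] * u + H[2][1] * v + H[2][2]
    if abs(w) < 1e-12:
        raise ValueError("pixel maps to a point at infinity on the plane")
    return x / w, y / w                   # perspective divide


# Made-up example matrix, not a real calibration.
H_EXAMPLE = [
    [0.5, 0.0, -160.0],
    [0.0, -0.5, 120.0],
    [0.0, 0.0, 1.0],
]

center_mm = pixel_to_plane(H_EXAMPLE, 320, 240)
corner_mm = pixel_to_plane(H_EXAMPLE, 640, 0)
```

If lens distortion is noticeable, the pixel should be undistorted with the intrinsics before this mapping, since a homography assumes an ideal pinhole image.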

Workspace reachability validation confirms that the arm can physically reach every critical position on the board. The script generates a set of test points (board corners, center, piece collection area) that must be verified on the real hardware. Results are written back to the configuration file, and unreachable points are flagged as no-go zones to prevent the perception system from issuing impossible commands.
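One possible shape for the generated test points and the no-go list is sketched below. The class, function names, point names, and coordinates are all hypothetical; the real script's output format may differ.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ReachPoint:
    name: str
    x_mm: float
    y_mm: float
    reachable: Optional[bool] = None      # None until checked on real hardware


def board_test_points(origin_xy: Tuple[float, float], square_mm: float,
                      collection_xy: Tuple[float, float],
                      squares: int = 8) -> List[ReachPoint]:
    """Four board corners, the board center, and the collection area."""
    ox, oy = origin_xy
    side = square_mm * squares
    return [
        ReachPoint("corner_min_min", ox, oy),
        ReachPoint("corner_max_min", ox + side, oy),
        ReachPoint("corner_min_max", ox, oy + side),
        ReachPoint("corner_max_max", ox + side, oy + side),
        ReachPoint("center", ox + side / 2, oy + side / 2),
        ReachPoint("collection_area", collection_xy[0], collection_xy[1]),
    ]


def no_go_zones(points: List[ReachPoint]) -> List[str]:
    """Names of points that were tested and found unreachable."""
    return [p.name for p in points if p.reachable is False]
```

Keeping "not yet tested" (None) distinct from "unreachable" (False) avoids treating an unverified point as safe, which matters while hardware validation is still pending.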

All calibration artifacts are stored under a unified config/calibration/ directory. Both perception and driver nodes read from the same location, eliminating the problem of multiple nodes maintaining separate coordinate system configurations.

The Perception Interface: /find_object

V1.01 also established the formal interface skeleton for the perception system. A ROS 2 service called /find_object accepts natural-language queries (such as "chessboard", "piece", or "collection box") and returns the target's pixel coordinates and world coordinates.

The current version uses a geometry-first approach: board queries run real chessboard corner detection, container queries use calibrated-ROI contour detection, and generic object queries use a foreground blob heuristic. These methods have limited precision, but they are sufficient to validate that the full pipeline from query to world coordinate is working end to end. Open-vocabulary vision models (Grounding DINO + SAM2) will be integrated later to improve localization accuracy.
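The geometry-first dispatch described above can be pictured as a small routing step that picks a detector from the query text. The keyword sets and detector labels below are assumptions for illustration and are not taken from the node's code.

```python
# Hypothetical keyword sets; the real node may match queries differently.
BOARD_QUERIES = {"chessboard", "board"}
CONTAINER_QUERIES = {"collection box", "box", "container"}


def route_query(query: str) -> str:
    """Return the label of the detector that should handle this query."""
    normalized = query.strip().lower()
    if normalized in BOARD_QUERIES:
        return "chessboard_corners"       # real corner detection
    if normalized in CONTAINER_QUERIES:
        return "roi_contour"              # contour search inside calibrated ROI
    return "foreground_blob"              # generic fallback heuristic
```

Falling through to the blob heuristic means any query still exercises the full path from request to world coordinate, which is the property this version is meant to validate.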

The perception node also provides a feedback channel: it can send image quality assessments (brightness, contrast, sharpness) back to the Windows camera sender as JSON messages over localhost, allowing the sender to adjust exposure and white balance in adaptive mode. This channel is more of an architectural placeholder for now, but it prepares the interface for a future "perception-driven imaging" closed loop.
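A feedback message of this kind might look like the following. The field names, the "type" tag, and the newline-delimited framing are assumptions for this sketch; the source only states that brightness, contrast, and sharpness are sent as JSON.

```python
import json

REQUIRED_FIELDS = {"brightness", "contrast", "sharpness"}


def encode_quality_feedback(brightness: float, contrast: float,
                            sharpness: float) -> bytes:
    """Serialize one assessment as a single newline-terminated JSON line."""
    message = {
        "type": "image_quality",          # assumed tag, not from the source
        "brightness": brightness,
        "contrast": contrast,
        "sharpness": sharpness,
    }
    return (json.dumps(message, sort_keys=True) + "\n").encode("utf-8")


def decode_quality_feedback(line: bytes) -> dict:
    """Parse one line and reject messages missing a required metric."""
    message = json.loads(line.decode("utf-8"))
    missing = REQUIRED_FIELDS - message.keys()
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    return message
```

Validating on the receiving side lets the sender ignore malformed feedback and stay on its locked baseline, in line with the adaptive path being optional.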

Current Status

The code scaffolding and toolchain for V1.01 are complete, but the core calibration tasks are awaiting real-hardware validation. Specifically:

Camera lock tools and launch entries are in place, awaiting validation on the real workstation
Calibration scripts are written and syntax-checked, awaiting real calibration images
The /find_object node launches successfully, awaiting formal calibration data to drive real queries
The adaptive camera controller has a first implementation, awaiting multi-condition lighting tests

What Comes Next

Once real-hardware calibration is validated, V1.01 will have achieved its core goal: the arm evolves from "can move and can see" to "knows where things are." This provides the spatial foundation needed for V1.02's chess piece manipulation work -- MoveIt2 motion primitives and ACT policy training.

Tech Stack

ROS 2 Humble / Python 3.10 / OpenCV / NumPy / ChArUco calibration / Homography solving / WSL2 + Windows TCP bridge / Waveshare RoArm-M2-S / Logitech C922 Pro