V1.00 -- Four Days from Zero to Teleoperation

The first version of SOMA Arm went from an unboxed robot arm and a blank WSL2 terminal to full gamepad teleoperation with live camera feedback in four days. This was not a follow-the-tutorial affair -- along the way I hit a missing ROS 2 driver, persistent camera frame tearing, and a WSL kernel that did not know what a gamepad was. Every step was earned through real-hardware debugging.

The goal of V1.00 was deliberately narrow: do not aim for intelligence, just connect the hands and eyes. If the arm cannot be reliably controlled, the camera cannot deliver clean frames, and the gamepad cannot drive joints directly, then every perception and cognition layer built on top is doomed. This version is about proving that the lowest-level control chain works.

Why WSL2 Instead of Native Linux

The first decision was the development environment. The laptop ships with an RTX 4090 Laptop GPU (16 GB VRAM), and downstream work -- ACT policy training and vision model inference -- needs every bit of that CUDA performance. Dual-booting native Ubuntu was the obvious choice, but on this particular hardware the GPU driver situation turned out to be unstable, with measurable performance loss.

WSL2 with CUDA passthrough gave a surprisingly clean result: nvidia-smi reported the full RTX 4090, CUDA throughput measured at 100% of native, and PyTorch behaved identically. The trade-off is that USB devices must be forwarded through usbipd-win and the GUI path through WSLg is sluggish -- but for a workflow centered on terminals and ROS 2 nodes, that is an acceptable cost.

The final stack: Windows 11 host, WSL2 Ubuntu 22.04, ROS 2 Humble, CUDA 12.4, Python 3.10 virtual environment. The entire environment was stood up and validated on day one.

Writing the Arm Driver from Scratch

The Waveshare RoArm-M2-S is a 4-DOF tabletop robot arm with an onboard ESP32 that accepts JSON commands over serial. The catch: there is no official ROS 2 driver. Waveshare ships a MoveIt2 configuration package, but it lives inside their own upstream workspace and does not directly expose the /joint_command and /joint_states interfaces I needed.

So I wrote a Python driver from scratch -- roughly 200 lines of code. It translates ROS 2 joint commands into the ESP32's JSON serial protocol and publishes real joint states back to ROS topics. On top of that, a MoveIt2 bridge node enables drag-to-control from RViz to the physical arm.
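
As a rough illustration of the translation step, the sketch below encodes four joint angles as one newline-terminated JSON line. The command code ("T": 102) and the field names (base, shoulder, elbow, hand, spd, acc) are assumptions modeled on Waveshare's published JSON examples, not taken from the driver itself, so check them against the firmware documentation before sending anything to real hardware.

```python
import json

# Assumed joint order and JSON keys; the real driver may use different ones.
JOINT_KEYS = ("base", "shoulder", "elbow", "hand")


def encode_joint_command(positions_rad, speed=0, accel=10):
    """Encode four joint angles (radians) as one JSON line for the ESP32."""
    if len(positions_rad) != len(JOINT_KEYS):
        raise ValueError(f"expected {len(JOINT_KEYS)} joint values")
    msg = {"T": 102}  # assumed code for "set all joints, in radians"
    for key, value in zip(JOINT_KEYS, positions_rad):
        msg[key] = round(float(value), 4)
    msg["spd"] = speed
    msg["acc"] = accel
    # Compact separators keep each line short on the serial link.
    return (json.dumps(msg, separators=(",", ":")) + "\n").encode("ascii")
```

In a ROS 2 node, a function like this would sit inside the /joint_command subscriber callback, with the returned bytes handed to the serial port.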

One important hardware constraint surfaced during driver development: the ESP32's serial port can only be held by one process at a time. This means the driver node, MoveIt2 bridge, and teleop node cannot all talk to the serial port simultaneously -- everything must route through the driver as a single gateway. This constraint became a foundational design principle for the entire control architecture.
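
The single-gateway idea can be sketched in-process as one object that owns the port and serializes every write. This is only an analogue of the ROS design, where other nodes reach the driver through topics rather than method calls; the class name SerialGateway and the "src" payloads are hypothetical, and an io.BytesIO stands in for the real port so the sketch runs without hardware.

```python
import io
import threading


class SerialGateway:
    """Serialize all writes to one port so only one owner ever holds it."""

    def __init__(self, port):
        # `port` is any object with write() and flush(), e.g. a pyserial port.
        self._port = port
        self._lock = threading.Lock()

    def send_line(self, payload: bytes) -> None:
        # The lock keeps two callers from interleaving partial JSON lines.
        with self._lock:
            self._port.write(payload)
            self._port.flush()


# Stand-in for the real port (hypothetical payloads, no hardware needed).
fake_port = io.BytesIO()
gateway = SerialGateway(fake_port)

threads = [
    threading.Thread(target=gateway.send_line, args=(b'{"src":"teleop"}\n',)),
    threading.Thread(target=gateway.send_line, args=(b'{"src":"moveit"}\n',)),
]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Each caller's line arrives whole, in whichever order the lock was taken.
lines = fake_port.getvalue().splitlines()
```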

There was also a subtler engineering challenge: making a Python virtual environment coexist peacefully with ROS 2's system-level rclpy bindings. The venv must be created with --system-site-packages so it can import both system ROS 2 packages and pip-installed ML dependencies. This sounds straightforward, but getting LeRobot, PyTorch, and rclpy to all resolve correctly in the same environment took real effort.
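
One way to verify the mixed environment is to ask Python where each package would be loaded from, without actually importing it. The helper below is a minimal sketch; the function name check_imports is hypothetical, and the expectation that rclpy resolves to the system ROS 2 install while torch and lerobot resolve to the venv is an assumption about a typical setup.

```python
import importlib.util


def check_imports(module_names):
    """Return {name: origin path or None} for each top-level module name."""
    report = {}
    for name in module_names:
        spec = importlib.util.find_spec(name)
        report[name] = getattr(spec, "origin", None) if spec else None
    return report


if __name__ == "__main__":
    # Run inside the activated venv after sourcing the ROS 2 setup script.
    for name, origin in check_imports(["rclpy", "torch", "lerobot"]).items():
        print(f"{name:10s} -> {origin or 'NOT FOUND'}")
```

A path under the ROS 2 install for rclpy and a path under the venv for the ML packages would suggest that both import sources are visible at once.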

Three Iterations of Gamepad Teleoperation

The gamepad teleop line went through three versions. Each failure pushed the next design decision.

Version 1: Windows TCP bridge. The first approach was to capture gamepad input on Windows using pygame and send it over TCP to a ROS node in WSL. It worked technically, but the ergonomics were painful -- the Windows window had to stay focused or the gamepad events would stop, requiring constant window switching during operation.

Version 2: USB passthrough. The natural next step was to forward the gamepad into WSL via usbipd-win and let Linux read the input natively. The device showed up in lsusb inside WSL -- but nothing appeared under /dev/input/. The reason: the default WSL2 kernel does not compile the XPAD and JOYDEV modules, so the kernel simply does not recognize gamepad-class input devices.

Version 3: Custom WSL kernel. I pulled the WSL2 kernel source, enabled CONFIG_INPUT_JOYSTICK, CONFIG_JOYSTICK_XPAD, and CONFIG_INPUT_JOYDEV, and rebuilt. The first build had a missing option that made XPAD appear configured but not actually functional. After fixing that and recompiling, the gamepad finally appeared at /dev/input/event*.
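
A quick check against the running kernel's configuration can catch a build where an option silently stayed off. The sketch below parses .config-style text and reports which of the three options named above are not enabled. It assumes the kernel exposes /proc/config.gz (which requires CONFIG_IKCONFIG_PROC), and the function names are hypothetical. It does not identify the specific option that was missing in the first build.

```python
import gzip

REQUIRED = ("CONFIG_INPUT_JOYSTICK", "CONFIG_JOYSTICK_XPAD", "CONFIG_INPUT_JOYDEV")


def parse_kernel_config(text):
    """Map option name -> value ('y', 'm', ...) from kernel .config text."""
    options = {}
    for line in text.splitlines():
        line = line.strip()
        # Disabled options appear as comments: "# CONFIG_FOO is not set".
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        options[key] = value
    return options


def missing_options(text, required=REQUIRED):
    """Return the required options that are neither built in nor modular."""
    options = parse_kernel_config(text)
    return [name for name in required if options.get(name) not in ("y", "m")]


def read_running_config(path="/proc/config.gz"):
    # Only present when the kernel was built with CONFIG_IKCONFIG_PROC.
    with gzip.open(path, "rt") as handle:
        return handle.read()
```

Calling missing_options(read_running_config()) inside WSL should return an empty list once the rebuilt kernel is actually the one that booted.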

The final teleop setup uses an evdev backend to read Linux input events directly. Four joystick axes map one-to-one to 4 joints; LB/RB control the gripper; Start returns to home position; Back triggers an emergency stop. The end-to-end latency from thumbstick to real joint motion is imperceptible.
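
The axis-to-joint mapping can be sketched as two small pure functions. This assumes rate control (stick deflection sets joint velocity), a signed 16-bit raw axis range, a 10% deadzone, and a 50 Hz loop (dt = 0.02 s); the source does not state which mapping the teleop node uses, so treat every constant here as illustrative.

```python
def normalize_axis(raw, lo=-32768, hi=32767, deadzone=0.1):
    """Scale a raw stick reading to [-1, 1] and zero out small values."""
    center = (lo + hi) / 2.0
    half_range = (hi - lo) / 2.0
    value = max(-1.0, min(1.0, (raw - center) / half_range))
    return 0.0 if abs(value) < deadzone else value


def step_joints(joints, axes, limits, max_speed=1.0, dt=0.02):
    """Advance each joint by axis * max_speed * dt, clamped to its limits."""
    updated = []
    for angle, axis, (low, high) in zip(joints, axes, limits):
        target = angle + axis * max_speed * dt
        updated.append(max(low, min(high, target)))
    return updated
```

Each loop tick would read the four axes, normalize them, call step_joints, and publish the result as the next joint command.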

After the core mapping was working, I polished the feel: the joystick's physical cross-axis bleed (pushing one axis slightly affects the other) was mitigated with a software hard-axis lock, and gripper commands were routed through a separate serial channel so that opening and closing the gripper would not resend the entire arm's joint state.
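
A hard-axis lock can be as simple as keeping whichever axis of a stick has the larger deflection and zeroing the other. The sketch below is one plausible reading of that idea, not necessarily the rule the teleop node applies; the function name is hypothetical.

```python
def hard_axis_lock(x, y):
    """Keep only the dominant axis of one stick; zero the weaker one."""
    if abs(x) >= abs(y):
        return x, 0.0
    return 0.0, y
```

With this rule a push of (0.9, 0.1) moves only the joint bound to the first axis, at the cost of ruling out deliberate diagonal motion on a single stick.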

The Camera Bridge: A Battle Against Frame Tearing

The camera pipeline was the most time-consuming part of V1.00.

The Logitech C922 Pro is a well-established USB webcam. The obvious approach was to forward it into WSL via usbipd, just like the gamepad. The forward itself worked -- OpenCV could open the device and read frames -- but every MJPEG frame showed visible horizontal tearing and corruption artifacts.

To isolate the problem, I tested four resolution/codec combinations: 720p60 MJPG, 720p30 MJPG, 1080p30 MJPG, and 720p30 YUYV. The first three all produced the same horizontal tearing; YUYV avoided tearing but was too slow to be usable. I also tried a GStreamer pipeline, which hung during pipeline initialization and never produced a frame.

The breakthrough came from a controlled comparison: I detached the C922 back to Windows and opened it in the native Windows Camera app. The image was sharp, smooth, high-framerate, and completely free of artifacts. This conclusively proved that the hardware was fine -- the problem lay in the usbipd + WSL + UVC/MJPG forwarding path. A search through the OpenCV and usbipd issue trackers confirmed that others had reported similar high-resolution camera issues.

Since Windows captured frames cleanly, the solution became: let Windows handle image capture and stream JPEG frames over TCP to a ROS node in WSL. I wrote a Windows-side JPEG sender and a WSL-side ROS receiver node, forming a lightweight TCP image bridge.
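
Because TCP is a byte stream with no message boundaries, a bridge like this needs some framing so the receiver knows where each JPEG ends. The sketch below uses a 4-byte big-endian length prefix, which is an assumption; the source does not describe the actual wire format, and the function names are hypothetical.

```python
import socket
import struct

HEADER = struct.Struct("!I")  # 4-byte big-endian payload length (assumed)


def send_frame(sock, jpeg_bytes):
    """Send one JPEG as <length><payload> so the receiver can find its end."""
    sock.sendall(HEADER.pack(len(jpeg_bytes)) + jpeg_bytes)


def recv_exact(sock, count):
    """Read exactly `count` bytes; TCP may deliver them in several chunks."""
    chunks = []
    while count > 0:
        chunk = sock.recv(count)
        if not chunk:
            raise ConnectionError("peer closed the connection mid-frame")
        chunks.append(chunk)
        count -= len(chunk)
    return b"".join(chunks)


def recv_frame(sock):
    """Receive one length-prefixed JPEG payload."""
    (length,) = HEADER.unpack(recv_exact(sock, HEADER.size))
    return recv_exact(sock, length)
```

On the WSL side, the bytes returned by recv_frame would be decoded (for example with OpenCV) and published as a ROS image message.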

The first version of the bridge ran and produced clean frames, but latency was three to four seconds -- clearly, buffers were accumulating somewhere. I implemented a latest-only strategy on both ends: the sender uses a background thread to continuously grab the newest frame and only transmits the current one; the receiver only decodes and publishes the most recently received frame, overwriting anything older. Both sides log latest_frame_age_ms for latency diagnosis.
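
The latest-only strategy amounts to a single-slot mailbox: a writer overwrites whatever is in the slot, and the reader takes the newest frame or nothing. The class below is a minimal sketch with hypothetical names. Its age value is measured within one process using a monotonic clock, which may differ from how latest_frame_age_ms is computed in the bridge.

```python
import threading
import time


class LatestFrame:
    """Single-slot mailbox: a new frame overwrites any unread older frame."""

    def __init__(self):
        self._lock = threading.Lock()
        self._frame = None
        self._stamp = 0.0

    def put(self, frame):
        # Called by the capture or receive thread for every incoming frame.
        with self._lock:
            self._frame = frame
            self._stamp = time.monotonic()

    def take(self):
        """Return (frame, age_ms) and clear the slot, or (None, None)."""
        with self._lock:
            if self._frame is None:
                return None, None
            frame, self._frame = self._frame, None
            age_ms = (time.monotonic() - self._stamp) * 1000.0
            return frame, age_ms
```

Because the slot never holds more than one frame, a slow consumer skips frames instead of falling further and further behind, which is what removes the multi-second backlog.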

After this round of optimization, the 960x540 preset brought latency down to a usable level, with clean, correctly angled frames. The frame rate settled at 4-5 FPS -- not cinematic, but sufficient to support the calibration and perception work ahead.

System Architecture

Hardware topology after V1.00:

PC (Windows 11 + WSL2 Ubuntu 22.04)
 |
 |-- USB (usbipd) --> RoArm-M2-S 4-DOF arm --> /dev/ttyUSB*
 |                    ESP32 JSON serial protocol
 |                    ~200-line Python driver
 |
 |-- USB (usbipd) --> PDP Xbox Controller --> /dev/input/event*
 |                    Custom WSL kernel (XPAD+JOYDEV)
 |                    evdev backend, direct read
 |
 |-- TCP bridge ----> Logitech C922 Pro (stays on Windows side)
                      Windows JPEG encode --> TCP --> WSL ROS node
                      960x540, 4-5 FPS, clean frames

Key Numbers

Metric                      Value
--------------------------  ----------------------
Zero to full teleop         4 days
Driver code                 ~200 lines Python
Arm DOF                     4 + gripper
Control loop rate           50 Hz
Camera bridge resolution    960 x 540
GPU performance loss        0% (CUDA passthrough)
Teleop design iterations    3

Real-Hardware Validation: Chess Piece Pick-and-Place

On the final day of V1.00, I assembled the full pipeline for an end-to-end hardware test: a real chessboard, standard chess pieces, and a collection box, all arranged in a layout close to the actual task scenario.

Using gamepad teleoperation, the arm successfully completed basic chess piece pick-and-place operations. Standard chess pieces performed better than expected, and better than the foam blocks used in earlier tests -- their wider bases made them stand more stably and grasp more predictably. This test answered a critical question: chess piece manipulation is physically feasible with the current arm's precision and gripper, so no hardware upgrade is needed.

The test also surfaced several practical engineering details: USB device bus IDs can change after every replug, so they must be re-checked each time rather than relying on old records; startup scripts need to handle the "device already attached" state gracefully instead of misreporting it as failure; and the full teleop startup sequence requires waiting for joint state publication and pressing Start to synchronize, not just connecting the devices. All of these were fixed before V1.00 was closed out.
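
The first two points can be sketched as a small parser over the text printed by `usbipd list`: look the device up by VID:PID rather than by a remembered bus ID, and treat an already-attached device as success. The column layout, the state strings, and the sample IDs below are assumptions for illustration; verify them against the output of the usbipd-win version in use.

```python
import re

# Assumed row layout: BUSID, VID:PID, device name, then state after 2+ spaces.
ROW = re.compile(
    r"^(?P<busid>\d+-\d+)\s+(?P<vidpid>[0-9a-fA-F]{4}:[0-9a-fA-F]{4})\s+"
    r"(?P<device>.+?)\s{2,}(?P<state>\S.*)$"
)


def find_device(list_output, vidpid):
    """Return (busid, state) for the first row matching vidpid, else None."""
    for line in list_output.splitlines():
        match = ROW.match(line.strip())
        if match and match["vidpid"].lower() == vidpid.lower():
            return match["busid"], match["state"]
    return None


def plan_attach(list_output, vidpid):
    """Decide what a startup script should do for one device."""
    found = find_device(list_output, vidpid)
    if found is None:
        return "missing", None
    busid, state = found
    if state.lower().startswith("attached"):
        return "already-attached", busid  # success, not an error
    return "attach", busid


# Hypothetical sample output; IDs and names are made up.
SAMPLE = """\
BUSID  VID:PID    DEVICE                      STATE
1-4    1234:abcd  Example USB Serial Device   Attached
2-1    5678:ef01  Example Gamepad             Shared
"""
```

A real script would feed plan_attach the live output of `usbipd list` and only run the attach command when the result is "attach".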

What Comes Next

V1.00 gave the arm hands (teleop control) and eyes (camera feed), but it still has no spatial understanding: it can see the image but has no idea where anything in that image sits in the physical world. V1.01 aims to fix that: camera intrinsic calibration and eye-to-hand calibration will establish the mapping from pixel coordinates to world coordinates, letting the arm understand the physical meaning of every position on the chessboard.

Tech Stack

ROS 2 Humble / Python 3.10 / PyTorch + CUDA 12.4 / MoveIt2 / OpenCV / WSL2 + usbipd-win / evdev / Waveshare RoArm-M2-S / Logitech C922 Pro