Visual-Inertial Odometry (VIO)
Overview
Visual-Inertial Odometry (VIO) fuses camera and IMU data for real-time pose estimation. Cameras provide rich environmental texture but run at limited frame rates, while IMUs provide high-frequency motion measurements but drift over time. The two complement each other, enabling robust 6-DOF pose tracking.
Why VIO Is Needed
| Sensor | Strengths | Limitations |
|---|---|---|
| Camera | Rich texture, no drift (loop closure) | Low frame rate (~30 Hz), motion blur, sensitive to lighting |
| IMU | High frequency (200-1000 Hz), unaffected by lighting | Drifts over time (integration accumulates noise and bias) |
| VIO Fusion | High-frequency + high-accuracy, robust | Increased system complexity |
Fusion Method Classification
```mermaid
graph TD
    A[VIO Fusion Methods] --> B[Loosely Coupled]
    A --> C[Tightly Coupled]
    B --> B1[Visual odometry runs independently<br>IMU runs independently<br>Backend fuses both results]
    C --> C1[Visual features and IMU data<br>in the same optimization framework<br>jointly optimized]
    A --> D[By Backend Type]
    D --> D1[Filter-Based<br>EKF / MSCKF]
    D --> D2[Optimization-Based<br>Bundle Adjustment + IMU]
```
Loosely vs. Tightly Coupled
| Feature | Loosely Coupled | Tightly Coupled |
|---|---|---|
| Fusion Level | Result-level fusion | Raw data-level fusion |
| Accuracy | Lower | Higher |
| Robustness | Degrades when single sensor fails | More robust (IMU fills in when features are sparse) |
| Implementation Complexity | Low | High |
| Representative | - | VINS-Mono, OKVIS, BASALT |
Tight Coupling Is Mainstream
Current mainstream VIO approaches in both academia and industry use tight coupling. Tight coupling better handles degraded scenarios (e.g., sparse textures, fast motion) and is generally the preferred design choice.
IMU Preintegration Theory
Motivation
In optimization-based VIO, each optimization iteration updates the state estimates, which would naively require re-integrating all IMU measurements between keyframes. IMU preintegration instead integrates the IMU measurements between two keyframes once, into relative motion constraints that are independent of the changing global states, avoiding repeated integration.
Preintegration Quantities
Preintegration quantities between keyframes \(i\) and \(j\):
Rotation preintegration:

\[
\Delta \mathbf{R}_{ij} = \prod_{k=i}^{j-1} \text{Exp}\left( (\tilde{\boldsymbol{\omega}}_k - \mathbf{b}_g) \Delta t \right)
\]

Velocity preintegration:

\[
\Delta \mathbf{v}_{ij} = \sum_{k=i}^{j-1} \Delta \mathbf{R}_{ik} (\tilde{\mathbf{a}}_k - \mathbf{b}_a) \Delta t
\]

Position preintegration:

\[
\Delta \mathbf{p}_{ij} = \sum_{k=i}^{j-1} \left[ \Delta \mathbf{v}_{ik} \Delta t + \tfrac{1}{2} \Delta \mathbf{R}_{ik} (\tilde{\mathbf{a}}_k - \mathbf{b}_a) \Delta t^2 \right]
\]
Where:
- \(\tilde{\boldsymbol{\omega}}_k\) is the gyroscope measurement
- \(\tilde{\mathbf{a}}_k\) is the accelerometer measurement
- \(\mathbf{b}_g\), \(\mathbf{b}_a\) are biases
- \(\text{Exp}(\cdot)\) is the exponential map from \(\mathfrak{so}(3) \to SO(3)\)
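To make the accumulation concrete, here is a minimal NumPy sketch of the three preintegration loops. The helper names (`skew`, `so3_exp`, `preintegrate`) and the fixed per-sample timestep are illustrative assumptions, not code from any particular VIO system:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def so3_exp(phi):
    """Exponential map so(3) -> SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(phi)
    if theta < 1e-8:
        return np.eye(3) + skew(phi)  # first-order approximation
    a = skew(phi / theta)
    return np.eye(3) + np.sin(theta) * a + (1 - np.cos(theta)) * (a @ a)

def preintegrate(gyro, accel, bg, ba, dt):
    """Accumulate Delta R, Delta v, Delta p between two keyframes.

    gyro, accel: (N, 3) raw IMU samples; bg, ba: bias estimates;
    dt: sample period in seconds (assumed constant here)."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        a_rot = dR @ (a - ba)                  # bias-corrected accel in frame i
        dp += dv * dt + 0.5 * a_rot * dt**2    # position preintegration
        dv += a_rot * dt                        # velocity preintegration
        dR = dR @ so3_exp((w - bg) * dt)       # rotation preintegration
    return dR, dv, dp
```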
Core Advantage of Preintegration
When the bias estimates \(\mathbf{b}_g\), \(\mathbf{b}_a\) change slightly during optimization, the preintegration quantities can be updated through a first-order approximation instead of re-integrating, e.g. for the rotation term:

\[
\Delta \mathbf{R}_{ij}(\mathbf{b}_g + \delta \mathbf{b}_g) \approx \Delta \mathbf{R}_{ij}(\mathbf{b}_g) \, \text{Exp}\!\left( \frac{\partial \Delta \mathbf{R}_{ij}}{\partial \mathbf{b}_g} \delta \mathbf{b}_g \right)
\]

with analogous Jacobian-based corrections for \(\Delta \mathbf{v}_{ij}\) and \(\Delta \mathbf{p}_{ij}\). This greatly reduces computation during optimization iterations.
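A sketch of what this update looks like for the rotation term, reusing `so3_exp` from the sketch above; the Jacobian `J_R_bg` is assumed to have been accumulated during preintegration (as in Forster et al.), and all names are illustrative:

```python
def correct_rotation_for_bias(dR, J_R_bg, delta_bg):
    """First-order update of the rotation preintegration when the gyro
    bias estimate changes by delta_bg, avoiding re-integration."""
    return dR @ so3_exp(J_R_bg @ delta_bg)

# The velocity/position terms are corrected additively, e.g.:
# dv_new = dv + J_v_bg @ delta_bg + J_v_ba @ delta_ba
```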
Preintegration Covariance Propagation
Uncertainty of the preintegration quantities is propagated recursively through the covariance matrix:

\[
\boldsymbol{\Sigma}_{k+1} = \mathbf{A}_k \boldsymbol{\Sigma}_k \mathbf{A}_k^T + \mathbf{B}_k \mathbf{Q} \mathbf{B}_k^T
\]

Where \(\mathbf{A}_k\) is the state transition matrix, \(\mathbf{B}_k\) is the noise Jacobian, and \(\mathbf{Q}\) is the IMU noise covariance.
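As a sketch, one propagation step is a single line of linear algebra; the matrix dimensions depend on whether biases are included in the preintegrated state, and the names here are assumptions:

```python
def propagate_covariance(Sigma, A_k, B_k, Q):
    """One step of preintegration covariance propagation:
    Sigma_{k+1} = A_k @ Sigma_k @ A_k.T + B_k @ Q @ B_k.T."""
    return A_k @ Sigma @ A_k.T + B_k @ Q @ B_k.T

# Typical setup: Sigma is 9x9 (or 15x15 with biases), Q comes from the
# IMU datasheet (gyro/accel noise densities), A_k/B_k are linearized
# once per IMU sample.
```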
Factor Graph Formulation
VIO can be naturally modeled as a factor graph optimization problem:
```mermaid
graph LR
    subgraph "State Variables"
        X0["X0<br>(p,R,v,b)"]
        X1["X1<br>(p,R,v,b)"]
        X2["X2<br>(p,R,v,b)"]
    end
    subgraph "Landmarks"
        L1["L1"]
        L2["L2"]
        L3["L3"]
    end
    X0 ---|"IMU<br>Preintegration"| X1
    X1 ---|"IMU<br>Preintegration"| X2
    X0 ---|"Visual<br>Observation"| L1
    X0 ---|"Visual<br>Observation"| L2
    X1 ---|"Visual<br>Observation"| L1
    X1 ---|"Visual<br>Observation"| L3
    X2 ---|"Visual<br>Observation"| L2
    X2 ---|"Visual<br>Observation"| L3
    P["Prior Factor"] --- X0
```
State Vector
The state at each keyframe \(k\) comprises position, orientation, velocity, and the IMU biases:

\[
\mathbf{x}_k = \left[ \mathbf{p}_k,\; \mathbf{R}_k,\; \mathbf{v}_k,\; \mathbf{b}_{g,k},\; \mathbf{b}_{a,k} \right]
\]
Optimization Objective
\[
\min_{\mathcal{X}} \; \|\mathbf{r}_p\|_{\Sigma_p}^2 + \sum_{k} \|\mathbf{r}_{\text{IMU}}^k\|_{\Sigma_{\text{IMU}}}^2 + \sum_{i,j} \|\mathbf{r}_{\text{vis}}^{ij}\|_{\Sigma_{\text{vis}}}^2
\]

Where:
- \(\mathbf{r}_p\) is the prior residual
- \(\mathbf{r}_{\text{IMU}}^k\) is the IMU preintegration residual
- \(\mathbf{r}_{\text{vis}}^{ij}\) is the visual reprojection residual
- \(\| \cdot \|_{\Sigma}^2 = (\cdot)^T \Sigma^{-1} (\cdot)\) is the Mahalanobis distance
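A minimal sketch of evaluating this objective, assuming hypothetical routines have already produced each factor's residual and covariance; real systems minimize it with Gauss-Newton or Levenberg-Marquardt (e.g., via Ceres) and typically wrap the visual terms in a robust loss:

```python
import numpy as np

def mahalanobis_sq(r, Sigma):
    """Squared Mahalanobis norm ||r||_Sigma^2 = r^T Sigma^{-1} r."""
    return r @ np.linalg.solve(Sigma, r)

def vio_cost(r_prior, Sigma_prior, imu_factors, vis_factors):
    """Total cost: prior + IMU preintegration + visual reprojection terms.

    imu_factors / vis_factors: lists of (residual, covariance) pairs
    produced by hypothetical factor-evaluation routines."""
    cost = mahalanobis_sq(r_prior, Sigma_prior)
    for r, S in imu_factors:
        cost += mahalanobis_sq(r, S)   # one term per consecutive keyframe pair
    for r, S in vis_factors:
        cost += mahalanobis_sq(r, S)   # one term per landmark observation
    return cost
```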
Classic VIO Systems
VINS-Mono
A classic tightly-coupled monocular VIO system open-sourced by HKUST, and a standard reference for learning VIO.
System Architecture:
```mermaid
graph TD
    A[IMU Data] --> B[Preintegration]
    C[Images] --> D[Feature Tracking<br>KLT Optical Flow]
    B --> E[Initialization]
    D --> E
    E --> F[Nonlinear Optimization<br>Sliding Window]
    F --> G[Marginalization]
    F --> H[Loop Closure Detection<br>DBoW2]
    H --> I[Global Pose Graph Optimization]
    F --> J[Output Pose]
    I --> J
```
Key Techniques:
| Module | Method |
|---|---|
| Frontend | KLT optical flow feature tracking |
| Initialization | Visual SfM + IMU alignment |
| Backend | Sliding window + marginalization |
| Loop Closure | DBoW2 bag-of-words detection + pose graph optimization |
| Optimizer | Ceres Solver |
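For illustration, a frontend in the spirit of VINS-Mono's KLT tracking can be sketched with OpenCV as below; the window size, pyramid levels, and corner parameters are assumptions, not VINS-Mono's actual settings:

```python
import cv2
import numpy as np

def track_features(prev_img, cur_img, prev_pts):
    """KLT pyramidal optical-flow tracking of sparse features between
    two grayscale frames. prev_pts: (N, 1, 2) float32 feature locations."""
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_img, cur_img, prev_pts, None,
        winSize=(21, 21), maxLevel=3)
    good = status.reshape(-1) == 1     # keep only successfully tracked points
    return prev_pts[good], cur_pts[good]

def detect_new_features(img, existing_pts, max_corners=150):
    """Top up the feature set with Shi-Tomasi corners, masking out
    neighborhoods of already-tracked points to keep features spread out."""
    mask = np.full(img.shape[:2], 255, dtype=np.uint8)
    for p in existing_pts.reshape(-1, 2):
        cv2.circle(mask, (int(p[0]), int(p[1])), 30, 0, -1)
    return cv2.goodFeaturesToTrack(img, max_corners, 0.01, 30, mask=mask)
```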
OKVIS
A keyframe-based, tightly-coupled VIO open-sourced by ETH Zurich, with stereo support.
MSCKF
An efficient EKF-based VIO that avoids maintaining feature-point states in the filter via the Multi-State Constraint: feature positions are never added to the state vector; instead, when a feature's track ends, all of its observations across the stored camera poses are combined into a single constraint among those poses, with the feature itself marginalized out, as sketched below.
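The core algebraic trick is projecting the stacked feature measurements onto the left null space of the feature Jacobian, which removes the feature position while retaining a constraint over the camera poses. A minimal sketch (function name and shapes are illustrative):

```python
import numpy as np
from scipy.linalg import null_space

def msckf_nullspace_projection(r, H_x, H_f):
    """Eliminate the feature from a stacked measurement model
    r = H_x @ dx + H_f @ df + n by projecting onto the left null
    space of H_f (the 'multi-state constraint').

    r: stacked residuals over all observations of one feature;
    H_x: Jacobian w.r.t. the camera-pose states; H_f: Jacobian
    w.r.t. the 3D feature position (typically shape (2m, 3))."""
    N = null_space(H_f.T)        # columns span the left null space of H_f
    return N.T @ r, N.T @ H_x    # reduced residual and Jacobian
```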
BASALT
A high-accuracy stereo + IMU VIO open-sourced by TUM, featuring an optical-flow frontend and efficient nonlinear optimization.
System Comparison
| System | Coupling | Backend | Camera Config | Highlight |
|---|---|---|---|---|
| VINS-Mono | Tight | Optimization | Monocular | Complete system, loop closure |
| VINS-Fusion | Tight | Optimization | Mono/Stereo/Multi | Multi-sensor |
| OKVIS | Tight | Optimization | Stereo | Keyframe management |
| MSCKF | Tight | EKF | Monocular | Computationally efficient |
| BASALT | Tight | Optimization | Stereo | High accuracy |
| ORB-SLAM3 | Tight | Optimization | Mono/Stereo/RGB-D | Map reuse |
Practical Deployment Considerations
Camera-IMU Calibration
The extrinsic \(\mathbf{T}_{CI}\) (relative pose) between camera and IMU must be precisely calibrated:
```bash
# Using Kalibr
rosrun kalibr kalibr_calibrate_imu_camera \
    --bag calibration.bag \
    --cam cam_config.yaml \
    --imu imu_config.yaml \
    --target april_grid.yaml
```
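Once calibrated, the extrinsic simply composes with the estimated IMU pose to yield the camera pose. A small sketch, assuming 4x4 homogeneous transforms and that the calibration reports \(\mathbf{T}_{CI}\) (conventions must be checked against the actual calibration output):

```python
import numpy as np

def invert_se3(T):
    """Invert a 4x4 homogeneous transform, e.g. T_IC = invert_se3(T_CI)."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def camera_pose_from_imu(T_WI, T_IC):
    """Compose the estimated IMU pose with the extrinsic: T_WC = T_WI @ T_IC."""
    return T_WI @ T_IC
```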
Time Synchronization
The time offset \(t_d\) between camera and IMU severely affects accuracy; an image stamped \(t_{\text{cam}}\) by the camera clock actually corresponds to IMU time

\[
t_{\text{IMU}} = t_{\text{cam}} + t_d
\]
VINS-Mono supports online estimation of \(t_d\), but hardware-level time synchronization (e.g., trigger signals) is more reliable.
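One common way the offset enters the estimator, in the spirit of VINS-Mono's online temporal calibration, is to shift each feature observation along its image-plane velocity. The sketch below is illustrative; the names and the sign convention are assumptions:

```python
def compensate_time_offset(uv, uv_vel, t_d):
    """Shift a feature observation along its image-plane velocity to
    compensate the estimated camera-IMU time offset t_d.

    uv: (u, v) observation in pixels; uv_vel: image-plane velocity
    in pixels/s; sign of t_d follows t_imu = t_cam + t_d."""
    u, v = uv
    vu, vv = uv_vel
    return (u + vu * t_d, v + vv * t_d)
```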
Common Failure Scenarios
| Scenario | Cause | Mitigation |
|---|---|---|
| Texture-poor areas | White walls, sky | Rely on IMU short-term prediction |
| Fast rotation | Motion blur | High frame rate camera, global shutter |
| Drastic lighting changes | Feature tracking failure | HDR camera, robust features |
| Long corridors | Degenerate motion | Multi-sensor fusion |
| Pure rotation | Monocular scale unobservable | Stereo or IMU constraints |
For more SLAM-related content, see the SLAM Topic.
References
- Qin et al., "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator", IEEE Trans. Robotics, 2018
- Forster et al., "On-Manifold Preintegration for Real-Time Visual-Inertial Odometry", IEEE Trans. Robotics, 2017
- Mourikis & Roumeliotis, "A Multi-State Constraint Kalman Filter for Vision-aided Inertial Navigation", ICRA 2007
- Solà, "Quaternion kinematics for the error-state Kalman filter", 2017
- Kalibr Calibration Tool: https://github.com/ethz-asl/kalibr