Visual-Inertial Odometry (VIO)
Overview
Visual-Inertial Odometry (VIO) fuses camera and IMU data for real-time pose estimation. Cameras provide rich environmental texture but run at limited frame rates, while IMUs provide high-frequency motion measurements but drift over time. The two complement each other, enabling robust 6-DOF pose tracking.
Why VIO Is Needed
| Sensor | Strengths | Limitations |
|---|---|---|
| Camera | Rich texture, no drift (loop closure) | Low frame rate (~30 Hz), motion blur, sensitive to lighting |
| IMU | High frequency (200-1000 Hz), unaffected by lighting | Drifts over time (integration accumulates noise and bias) |
| VIO Fusion | High-frequency + high-accuracy, robust | Increased system complexity |
Fusion Method Classification
```mermaid
graph TD
    A[VIO Fusion Methods] --> B[Loosely Coupled]
    A --> C[Tightly Coupled]
    B --> B1[Visual odometry runs independently<br>IMU runs independently<br>Backend fuses both results]
    C --> C1[Visual features and IMU data<br>in the same optimization framework<br>jointly optimized]
    A --> D[By Backend Type]
    D --> D1[Filter-Based<br>EKF / MSCKF]
    D --> D2[Optimization-Based<br>Bundle Adjustment + IMU]
```
Loosely vs. Tightly Coupled
| Feature | Loosely Coupled | Tightly Coupled |
|---|---|---|
| Fusion Level | Result-level fusion | Raw data-level fusion |
| Accuracy | Lower | Higher |
| Robustness | Degrades when single sensor fails | More robust (IMU fills in when features are sparse) |
| Implementation Complexity | Low | High |
| Representative | - | VINS-Mono, OKVIS, BASALT |
Tight Coupling Is Mainstream
Current mainstream VIO approaches in both academia and industry use tight coupling. Tight coupling better handles degraded scenarios (e.g., sparse textures, fast motion) and is generally the preferred design choice.
IMU Preintegration Theory
Motivation
In optimization-based VIO, each optimization iteration updates the state estimates, which would naively require re-integrating all IMU measurements between keyframes. IMU preintegration instead integrates the IMU measurements between two keyframes once, into relative motion constraints that are independent of the changing global states, avoiding repeated integration.
Preintegration Quantities
Preintegration quantities between keyframes \(i\) and \(j\):
Rotation preintegration:

\[
\Delta \mathbf{R}_{ij} = \prod_{k=i}^{j-1} \text{Exp}\left( (\tilde{\boldsymbol{\omega}}_k - \mathbf{b}_g) \Delta t \right)
\]

Velocity preintegration:

\[
\Delta \mathbf{v}_{ij} = \sum_{k=i}^{j-1} \Delta \mathbf{R}_{ik} (\tilde{\mathbf{a}}_k - \mathbf{b}_a) \Delta t
\]

Position preintegration:

\[
\Delta \mathbf{p}_{ij} = \sum_{k=i}^{j-1} \left[ \Delta \mathbf{v}_{ik} \Delta t + \tfrac{1}{2} \Delta \mathbf{R}_{ik} (\tilde{\mathbf{a}}_k - \mathbf{b}_a) \Delta t^2 \right]
\]
Where:
- \(\tilde{\boldsymbol{\omega}}_k\) is the gyroscope measurement
- \(\tilde{\mathbf{a}}_k\) is the accelerometer measurement
- \(\mathbf{b}_g\), \(\mathbf{b}_a\) are biases
- \(\text{Exp}(\cdot)\) is the exponential map from \(\mathfrak{so}(3) \to SO(3)\)
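To make the accumulation concrete, here is a minimal NumPy sketch of the three preintegration loops. The helper names (`skew`, `so3_exp`, `preintegrate`) and the fixed per-sample timestep are illustrative assumptions, not code from any particular VIO system:

```python
import numpy as np

def skew(v):
    """Skew-symmetric matrix of a 3-vector."""
    return np.array([[0, -v[2], v[1]],
                     [v[2], 0, -v[0]],
                     [-v[1], v[0], 0]])

def so3_exp(phi):
    """Exponential map so(3) -> SO(3) via Rodrigues' formula."""
    theta = np.linalg.norm(phi)
    if theta < 1e-8:
        return np.eye(3) + skew(phi)  # first-order approximation
    a = skew(phi / theta)
    return np.eye(3) + np.sin(theta) * a + (1 - np.cos(theta)) * (a @ a)

def preintegrate(gyro, accel, bg, ba, dt):
    """Accumulate Delta R, Delta v, Delta p between two keyframes.

    gyro, accel: (N, 3) raw IMU samples; bg, ba: bias estimates;
    dt: sample period in seconds (assumed constant here)."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        a_rot = dR @ (a - ba)                  # bias-corrected accel in frame i
        dp += dv * dt + 0.5 * a_rot * dt**2    # position preintegration
        dv += a_rot * dt                        # velocity preintegration
        dR = dR @ so3_exp((w - bg) * dt)       # rotation preintegration
    return dR, dv, dp
```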
Core Advantage of Preintegration
When the bias estimates \(\mathbf{b}_g\), \(\mathbf{b}_a\) change slightly during optimization, the preintegration quantities can be updated through a first-order approximation instead of re-integrating, e.g. for the rotation term:

\[
\Delta \mathbf{R}_{ij}(\mathbf{b}_g + \delta \mathbf{b}_g) \approx \Delta \mathbf{R}_{ij}(\mathbf{b}_g) \, \text{Exp}\!\left( \frac{\partial \Delta \mathbf{R}_{ij}}{\partial \mathbf{b}_g} \delta \mathbf{b}_g \right)
\]

with analogous Jacobian-based corrections for \(\Delta \mathbf{v}_{ij}\) and \(\Delta \mathbf{p}_{ij}\). This greatly reduces computation during optimization iterations.
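A sketch of what this update looks like for the rotation term, reusing `so3_exp` from the sketch above; the Jacobian `J_R_bg` is assumed to have been accumulated during preintegration (as in Forster et al.), and all names are illustrative:

```python
def correct_rotation_for_bias(dR, J_R_bg, delta_bg):
    """First-order update of the rotation preintegration when the gyro
    bias estimate changes by delta_bg, avoiding re-integration."""
    return dR @ so3_exp(J_R_bg @ delta_bg)

# The velocity/position terms are corrected additively, e.g.:
# dv_new = dv + J_v_bg @ delta_bg + J_v_ba @ delta_ba
```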
Preintegration Covariance Propagation
Uncertainty of the preintegration quantities is propagated recursively through the covariance matrix:

\[
\boldsymbol{\Sigma}_{k+1} = \mathbf{A}_k \boldsymbol{\Sigma}_k \mathbf{A}_k^T + \mathbf{B}_k \mathbf{Q} \mathbf{B}_k^T
\]

Where \(\mathbf{A}_k\) is the state transition matrix, \(\mathbf{B}_k\) is the noise Jacobian, and \(\mathbf{Q}\) is the IMU noise covariance.
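As a sketch, one propagation step is a single line of linear algebra; the matrix dimensions depend on whether biases are included in the preintegrated state, and the names here are assumptions:

```python
def propagate_covariance(Sigma, A_k, B_k, Q):
    """One step of preintegration covariance propagation:
    Sigma_{k+1} = A_k @ Sigma_k @ A_k.T + B_k @ Q @ B_k.T."""
    return A_k @ Sigma @ A_k.T + B_k @ Q @ B_k.T

# Typical setup: Sigma is 9x9 (or 15x15 with biases), Q comes from the
# IMU datasheet (gyro/accel noise densities), A_k/B_k are linearized
# once per IMU sample.
```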
Factor Graph Formulation
VIO can be naturally modeled as a factor graph optimization problem:
```mermaid
graph LR
    subgraph "State Variables"
        X0["X0<br>(p,R,v,b)"]
        X1["X1<br>(p,R,v,b)"]
        X2["X2<br>(p,R,v,b)"]
    end
    subgraph "Landmarks"
        L1["L1"]
        L2["L2"]
        L3["L3"]
    end
    X0 ---|"IMU<br>Preintegration"| X1
    X1 ---|"IMU<br>Preintegration"| X2
    X0 ---|"Visual<br>Observation"| L1
    X0 ---|"Visual<br>Observation"| L2
    X1 ---|"Visual<br>Observation"| L1
    X1 ---|"Visual<br>Observation"| L3
    X2 ---|"Visual<br>Observation"| L2
    X2 ---|"Visual<br>Observation"| L3
    P["Prior Factor"] --- X0
```
State Vector
The state at each keyframe \(k\) comprises position, orientation, velocity, and the IMU biases:

\[
\mathbf{x}_k = \left[ \mathbf{p}_k,\; \mathbf{R}_k,\; \mathbf{v}_k,\; \mathbf{b}_{g,k},\; \mathbf{b}_{a,k} \right]
\]
Optimization Objective
\[
\min_{\mathcal{X}} \; \|\mathbf{r}_p\|_{\Sigma_p}^2 + \sum_{k} \|\mathbf{r}_{\text{IMU}}^k\|_{\Sigma_{\text{IMU}}}^2 + \sum_{i,j} \|\mathbf{r}_{\text{vis}}^{ij}\|_{\Sigma_{\text{vis}}}^2
\]

Where:
- \(\mathbf{r}_p\) is the prior residual
- \(\mathbf{r}_{\text{IMU}}^k\) is the IMU preintegration residual
- \(\mathbf{r}_{\text{vis}}^{ij}\) is the visual reprojection residual
- \(\| \cdot \|_{\Sigma}^2 = (\cdot)^T \Sigma^{-1} (\cdot)\) is the Mahalanobis distance
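A minimal sketch of evaluating this objective, assuming hypothetical routines have already produced each factor's residual and covariance; real systems minimize it with Gauss-Newton or Levenberg-Marquardt (e.g., via Ceres) and typically wrap the visual terms in a robust loss:

```python
import numpy as np

def mahalanobis_sq(r, Sigma):
    """Squared Mahalanobis norm ||r||_Sigma^2 = r^T Sigma^{-1} r."""
    return r @ np.linalg.solve(Sigma, r)

def vio_cost(r_prior, Sigma_prior, imu_factors, vis_factors):
    """Total cost: prior + IMU preintegration + visual reprojection terms.

    imu_factors / vis_factors: lists of (residual, covariance) pairs
    produced by hypothetical factor-evaluation routines."""
    cost = mahalanobis_sq(r_prior, Sigma_prior)
    for r, S in imu_factors:
        cost += mahalanobis_sq(r, S)   # one term per consecutive keyframe pair
    for r, S in vis_factors:
        cost += mahalanobis_sq(r, S)   # one term per landmark observation
    return cost
```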
Classic VIO Systems
VINS-Mono
A classic tightly-coupled monocular VIO system open-sourced by HKUST, and a standard reference for learning VIO.
System Architecture:
```mermaid
graph TD
    A[IMU Data] --> B[Preintegration]
    C[Images] --> D[Feature Tracking<br>KLT Optical Flow]
    B --> E[Initialization]
    D --> E
    E --> F[Nonlinear Optimization<br>Sliding Window]
    F --> G[Marginalization]
    F --> H[Loop Closure Detection<br>DBoW2]
    H --> I[Global Pose Graph Optimization]
    F --> J[Output Pose]
    I --> J
```
Key Techniques:
| Module | Method |
|---|---|
| Frontend | KLT optical flow feature tracking |
| Initialization | Visual SfM + IMU alignment |
| Backend | Sliding window + marginalization |
| Loop Closure | DBoW2 bag-of-words detection + pose graph optimization |
| Optimizer | Ceres Solver |
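For illustration, a frontend in the spirit of VINS-Mono's KLT tracking can be sketched with OpenCV as below; the window size, pyramid levels, and corner parameters are assumptions, not VINS-Mono's actual settings:

```python
import cv2
import numpy as np

def track_features(prev_img, cur_img, prev_pts):
    """KLT pyramidal optical-flow tracking of sparse features between
    two grayscale frames. prev_pts: (N, 1, 2) float32 feature locations."""
    cur_pts, status, _ = cv2.calcOpticalFlowPyrLK(
        prev_img, cur_img, prev_pts, None,
        winSize=(21, 21), maxLevel=3)
    good = status.reshape(-1) == 1     # keep only successfully tracked points
    return prev_pts[good], cur_pts[good]

def detect_new_features(img, existing_pts, max_corners=150):
    """Top up the feature set with Shi-Tomasi corners, masking out
    neighborhoods of already-tracked points to keep features spread out."""
    mask = np.full(img.shape[:2], 255, dtype=np.uint8)
    for p in existing_pts.reshape(-1, 2):
        cv2.circle(mask, (int(p[0]), int(p[1])), 30, 0, -1)
    return cv2.goodFeaturesToTrack(img, max_corners, 0.01, 30, mask=mask)
```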
OKVIS
A keyframe-based, tightly-coupled VIO open-sourced by ETH Zurich, with stereo support.
MSCKF
An efficient EKF-based VIO that avoids maintaining feature-point states in the filter via the Multi-State Constraint: feature positions are never added to the state vector; instead, when a feature's track ends, all of its observations across the stored camera poses are combined into a single constraint among those poses, with the feature itself marginalized out, as sketched below.
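The core algebraic trick is projecting the stacked feature measurements onto the left null space of the feature Jacobian, which removes the feature position while retaining a constraint over the camera poses. A minimal sketch (function name and shapes are illustrative):

```python
import numpy as np
from scipy.linalg import null_space

def msckf_nullspace_projection(r, H_x, H_f):
    """Eliminate the feature from a stacked measurement model
    r = H_x @ dx + H_f @ df + n by projecting onto the left null
    space of H_f (the 'multi-state constraint').

    r: stacked residuals over all observations of one feature;
    H_x: Jacobian w.r.t. the camera-pose states; H_f: Jacobian
    w.r.t. the 3D feature position (typically shape (2m, 3))."""
    N = null_space(H_f.T)        # columns span the left null space of H_f
    return N.T @ r, N.T @ H_x    # reduced residual and Jacobian
```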
BASALT
A high-accuracy stereo + IMU VIO open-sourced by TUM, featuring an optical-flow frontend and efficient nonlinear optimization.
System Comparison
| System | Coupling | Backend | Camera Config | Highlight |
|---|---|---|---|---|
| VINS-Mono | Tight | Optimization | Monocular | Complete system, loop closure |
| VINS-Fusion | Tight | Optimization | Mono/Stereo/Multi | Multi-sensor |
| OKVIS | Tight | Optimization | Stereo | Keyframe management |
| MSCKF | Tight | EKF | Monocular | Computationally efficient |
| BASALT | Tight | Optimization | Stereo | High accuracy |
| ORB-SLAM3 | Tight | Optimization | Mono/Stereo/RGB-D | Map reuse |
Practical Deployment Considerations
Camera-IMU Calibration
The extrinsic \(\mathbf{T}_{CI}\) (relative pose) between camera and IMU must be precisely calibrated:
```bash
# Using Kalibr
rosrun kalibr kalibr_calibrate_imu_camera \
    --bag calibration.bag \
    --cam cam_config.yaml \
    --imu imu_config.yaml \
    --target april_grid.yaml
```
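Once calibrated, the extrinsic simply composes with the estimated IMU pose to yield the camera pose. A small sketch, assuming 4x4 homogeneous transforms and that the calibration reports \(\mathbf{T}_{CI}\) (conventions must be checked against the actual calibration output):

```python
import numpy as np

def invert_se3(T):
    """Invert a 4x4 homogeneous transform, e.g. T_IC = invert_se3(T_CI)."""
    R, t = T[:3, :3], T[:3, 3]
    Ti = np.eye(4)
    Ti[:3, :3] = R.T
    Ti[:3, 3] = -R.T @ t
    return Ti

def camera_pose_from_imu(T_WI, T_IC):
    """Compose the estimated IMU pose with the extrinsic: T_WC = T_WI @ T_IC."""
    return T_WI @ T_IC
```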
Time Synchronization
The time offset \(t_d\) between camera and IMU severely affects accuracy; an image stamped \(t_{\text{cam}}\) by the camera clock actually corresponds to IMU time

\[
t_{\text{IMU}} = t_{\text{cam}} + t_d
\]
VINS-Mono supports online estimation of \(t_d\), but hardware-level time synchronization (e.g., trigger signals) is more reliable.
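One common way the offset enters the estimator, in the spirit of VINS-Mono's online temporal calibration, is to shift each feature observation along its image-plane velocity. The sketch below is illustrative; the names and the sign convention are assumptions:

```python
def compensate_time_offset(uv, uv_vel, t_d):
    """Shift a feature observation along its image-plane velocity to
    compensate the estimated camera-IMU time offset t_d.

    uv: (u, v) observation in pixels; uv_vel: image-plane velocity
    in pixels/s; sign of t_d follows t_imu = t_cam + t_d."""
    u, v = uv
    vu, vv = uv_vel
    return (u + vu * t_d, v + vv * t_d)
```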
Common Failure Scenarios
| Scenario | Cause | Mitigation |
|---|---|---|
| Texture-poor areas | White walls, sky | Rely on IMU short-term prediction |
| Fast rotation | Motion blur | High frame rate camera, global shutter |
| Drastic lighting changes | Feature tracking failure | HDR camera, robust features |
| Long corridors | Degenerate motion | Multi-sensor fusion |
| Pure rotation | Monocular scale unobservable | Stereo or IMU constraints |
For more SLAM-related content, see the SLAM Topic.
References
- Qin et al., "VINS-Mono: A Robust and Versatile Monocular Visual-Inertial State Estimator", IEEE Trans. Robotics, 2018
- Forster et al., "On-Manifold Preintegration for Real-Time Visual-Inertial Odometry", IEEE Trans. Robotics, 2017
- Mourikis & Roumeliotis, "A Multi-State Constraint Kalman Filter for Vision-aided Inertial Navigation", ICRA 2007
- Solà, "Quaternion kinematics for the error-state Kalman filter", 2017
- Kalibr Calibration Tool: https://github.com/ethz-asl/kalibr