SLAM: Simultaneous Localization and Mapping

Overview

SLAM (Simultaneous Localization and Mapping) is the problem of estimating a robot's pose while simultaneously building a map of an unknown environment. It is a core technology for autonomous navigation of mobile robots.

Probabilistic Formulation of the SLAM Problem:

\[ p(x_{1:t}, m \mid z_{1:t}, u_{1:t}) \]

where:

| Symbol | Meaning |
| --- | --- |
| \(x_{1:t}\) | Robot pose sequence from time 1 to \(t\) |
| \(m\) | Environment map |
| \(z_{1:t}\) | Sensor observation sequence |
| \(u_{1:t}\) | Control input (odometry) sequence |
```mermaid
graph LR
    A["Sensor Data<br/>Camera/LiDAR/IMU"] --> B["Front End<br/>Data Association/Feature Extraction"]
    B --> C["State Estimation<br/>Filtering/Optimization"]
    C --> D["Back End<br/>Loop Closure/Global Optimization"]
    D --> E["Output<br/>Pose + Map"]
    D -->|"Loop Closure Constraints"| C
    style B fill:#e8f4fd,stroke:#2196F3
    style C fill:#fff3e0,stroke:#FF9800
    style D fill:#f3e5f5,stroke:#9C27B0
```

1 Mathematical Foundations of SLAM

1.1 Motion Model

\[ x_t = f(x_{t-1}, u_t) + w_t, \quad w_t \sim \mathcal{N}(0, R_t) \]

1.2 Observation Model

\[ z_t = h(x_t, m) + v_t, \quad v_t \sim \mathcal{N}(0, Q_t) \]

1.3 Two Forms of SLAM

| Form | Estimated Quantity | Methods |
| --- | --- | --- |
| Online SLAM | \(p(x_t, m \mid z_{1:t}, u_{1:t})\) | EKF-SLAM, particle filter |
| Full SLAM | \(p(x_{1:t}, m \mid z_{1:t}, u_{1:t})\) | Graph optimization |

2 Filter-Based SLAM

2.1 EKF-SLAM

EKF-SLAM is the classical filtering approach to SLAM, maintaining a joint Gaussian distribution over the robot pose and all landmark positions.

State Vector:

\[ \hat{\mu}_t = \begin{bmatrix} x_t \\ m_1 \\ m_2 \\ \vdots \\ m_N \end{bmatrix}, \quad \Sigma_t = \begin{bmatrix} \Sigma_{xx} & \Sigma_{x m_1} & \cdots & \Sigma_{x m_N} \\ \Sigma_{m_1 x} & \Sigma_{m_1 m_1} & \cdots & \Sigma_{m_1 m_N} \\ \vdots & \vdots & \ddots & \vdots \\ \Sigma_{m_N x} & \Sigma_{m_N m_1} & \cdots & \Sigma_{m_N m_N} \end{bmatrix} \]

Prediction Step (Motion Update):

\[ \bar{\mu}_t = f(\hat{\mu}_{t-1}, u_t) \]
\[ \bar{\Sigma}_t = F_t \Sigma_{t-1} F_t^T + R_t \]

Update Step (Observation Update):

\[ K_t = \bar{\Sigma}_t H_t^T (H_t \bar{\Sigma}_t H_t^T + Q_t)^{-1} \]
\[ \hat{\mu}_t = \bar{\mu}_t + K_t (z_t - h(\bar{\mu}_t)) \]
\[ \Sigma_t = (I - K_t H_t) \bar{\Sigma}_t \]

where \(F_t = \frac{\partial f}{\partial x}\big|_{\hat{\mu}_{t-1}}\) and \(H_t = \frac{\partial h}{\partial x}\big|_{\bar{\mu}_t}\) are the Jacobian matrices.
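The two steps map directly onto a few lines of linear algebra. Below is a minimal numpy sketch of one EKF-SLAM iteration; the motion model `f`, observation model `h`, their Jacobians `F` and `H`, and the noise covariances are assumed to be supplied by the application.

```python
import numpy as np

def ekf_predict(mu, Sigma, u, f, F, R):
    """Prediction: propagate the mean through the motion model, covariance via the Jacobian."""
    mu_bar = f(mu, u)                     # \bar{mu}_t = f(mu_{t-1}, u_t)
    F_t = F(mu, u)                        # F_t = df/dx at mu_{t-1}
    Sigma_bar = F_t @ Sigma @ F_t.T + R   # \bar{Sigma}_t
    return mu_bar, Sigma_bar

def ekf_update(mu_bar, Sigma_bar, z, h, H, Q):
    """Correction: fuse the observation z via the Kalman gain."""
    H_t = H(mu_bar)                                   # H_t = dh/dx at \bar{mu}_t
    S = H_t @ Sigma_bar @ H_t.T + Q                   # innovation covariance
    K = Sigma_bar @ H_t.T @ np.linalg.inv(S)          # Kalman gain
    mu = mu_bar + K @ (z - h(mu_bar))                 # corrected mean
    Sigma = (np.eye(len(mu)) - K @ H_t) @ Sigma_bar   # corrected covariance
    return mu, Sigma
```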

Limitations of EKF-SLAM:

  • Computational complexity \(O(N^2)\) (\(N\) = number of landmarks), unsuitable for large-scale environments
  • Linearization errors lead to inconsistency
  • Data association errors are difficult to recover from

2.2 Particle Filter SLAM (FastSLAM)

FastSLAM (Montemerlo et al., 2002) uses Rao-Blackwellized particle filtering to decompose the SLAM problem:

\[ p(x_{1:t}, m \mid z_{1:t}, u_{1:t}) = p(x_{1:t} \mid z_{1:t}, u_{1:t}) \prod_{j=1}^{N} p(m_j \mid x_{1:t}, z_{1:t}) \]
  • Particles represent the path posterior \(p(x_{1:t} \mid \cdot)\)
  • Each particle maintains an independent map (EKF update for each landmark)
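The per-timestep structure can be sketched as follows, assuming hypothetical helpers `sample_motion`, `measurement_likelihood`, and `ekf_update_landmark` for the model-specific parts (the last must initialize a landmark when passed `None`):

```python
import numpy as np

def fastslam_step(particles, u, z, sample_motion, measurement_likelihood,
                  ekf_update_landmark):
    """One Rao-Blackwellized update; each particle is {"pose": ..., "map": {id: landmark}}."""
    weights = np.empty(len(particles))
    for i, p in enumerate(particles):
        p["pose"] = sample_motion(p["pose"], u)    # sample from p(x_t | x_{t-1}, u_t)
        weights[i] = measurement_likelihood(p, z)  # importance weight p(z_t | x_t, m)
        for j, obs in z.items():                   # independent small EKF per landmark
            p["map"][j] = ekf_update_landmark(p["map"].get(j), p["pose"], obs)
    weights /= weights.sum()
    # Systematic resampling concentrates particles in high-weight regions.
    idx = np.searchsorted(np.cumsum(weights),
                          (np.arange(len(particles)) + np.random.rand()) / len(particles))
    return [dict(pose=particles[k]["pose"], map=dict(particles[k]["map"])) for k in idx]
```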

Advantages:

  • Computational complexity \(O(M \log N)\) (\(M\) = number of particles, \(N\) = number of landmarks)
  • Naturally handles multimodal distributions
  • Data association can be handled independently per particle

3 Visual SLAM

3.1 Overview

Visual SLAM uses cameras as the primary sensor. Approaches fall into three categories:

  • Feature-based: Extract feature points, estimate pose based on geometric constraints
  • Direct: Directly use pixel intensity values to optimize pose
  • Semi-direct: Combines advantages of both (e.g., SVO)

3.2 ORB-SLAM3 System

ORB-SLAM3 (Campos et al., 2021) is one of the most complete open-source visual SLAM systems, supporting monocular, stereo, RGB-D, and visual-inertial modes.

System Pipeline:

```mermaid
graph TB
    A["Image Input"] --> B["ORB Feature Extraction"]
    B --> C["Feature Matching"]
    C --> D["Tracking Thread<br/>(Frame-to-Frame Pose Estimation)"]
    D --> E["Keyframe Decision"]
    E -->|"Is Keyframe"| F["Local Mapping Thread<br/>(Local BA)"]
    F --> G["Loop Closure Thread<br/>(DBoW2)"]
    G -->|"Loop Detected"| H["Pose Graph Optimization"]
    H --> I["Global BA"]
    D --> J["Pose Output"]
    F --> K["Map Point Cloud"]
    style D fill:#e8f4fd,stroke:#2196F3
    style F fill:#fff3e0,stroke:#FF9800
    style G fill:#f3e5f5,stroke:#9C27B0
```

Core Modules:

1. Feature Extraction

Uses ORB (Oriented FAST and Rotated BRIEF) features; a minimal OpenCV sketch follows the list:

  • FAST corner detection + orientation computation
  • rBRIEF descriptor (256-bit binary)
  • Image pyramid for scale invariance
  • Fast extraction; matching uses Hamming distance
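A minimal OpenCV sketch of ORB extraction and Hamming-distance matching (the image paths and parameter values are illustrative, not ORB-SLAM3's internal pipeline):

```python
import cv2

# Load two grayscale frames (paths are placeholders).
img1 = cv2.imread("frame1.png", cv2.IMREAD_GRAYSCALE)
img2 = cv2.imread("frame2.png", cv2.IMREAD_GRAYSCALE)

# ORB = oriented FAST keypoints + rBRIEF binary descriptors over an image pyramid.
orb = cv2.ORB_create(nfeatures=1000, scaleFactor=1.2, nlevels=8)
kp1, des1 = orb.detectAndCompute(img1, None)
kp2, des2 = orb.detectAndCompute(img2, None)

# Binary descriptors are compared by Hamming distance.
bf = cv2.BFMatcher(cv2.NORM_HAMMING, crossCheck=True)
matches = sorted(bf.match(des1, des2), key=lambda m: m.distance)
```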

2. Tracking

  • Constant velocity model for initial pose prediction
  • Feature matching + PnP solving
  • Local map tracking: match against local map points, optimize current pose
  • Keyframe selection strategy

3. Local Mapping

  • New keyframe insertion
  • Map point triangulation
  • Local Bundle Adjustment
\[ \min_{\{T_i\}, \{p_j\}} \sum_{i,j} \rho\left(\|z_{ij} - \pi(T_i, p_j)\|^2_{\Sigma_{ij}}\right) \]

where \(\pi\) is the projection function and \(\rho\) is a robust kernel function (e.g., Huber).
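As a concrete toy instance, the sketch below jointly optimizes two planar poses and three landmarks with `scipy.optimize.least_squares`, whose `loss="huber"` option plays the role of the robust kernel \(\rho\). The 2D "projection" `pi` and all measurement values are illustrative assumptions, not ORB-SLAM3's actual parametrization.

```python
import numpy as np
from scipy.optimize import least_squares

def pi(pose, point):
    """Toy projection: express a 2D landmark in the frame of pose (x, y, theta)."""
    x, y, th = pose
    c, s = np.cos(th), np.sin(th)
    d = point - np.array([x, y])
    return np.array([c * d[0] + s * d[1], -s * d[0] + c * d[1]])

def residuals(params, obs, n_poses):
    """Stack reprojection residuals z_ij - pi(T_i, p_j) over all observations."""
    poses = params[:3 * n_poses].reshape(-1, 3)
    points = params[3 * n_poses:].reshape(-1, 2)
    return np.concatenate([obs[i, j] - pi(poses[i], points[j])
                           for (i, j) in sorted(obs)])

# Two poses observing three landmarks (synthetic measurements).
obs = {(0, 0): np.array([1., 0.]), (0, 1): np.array([1., 1.]), (0, 2): np.array([2., 1.]),
       (1, 0): np.array([0., 0.]), (1, 1): np.array([0., 1.]), (1, 2): np.array([1., 1.])}
# loss="huber" applies the robust kernel to down-weight outlier residuals.
sol = least_squares(residuals, np.zeros(12), args=(obs, 2), loss="huber", f_scale=1.0)
```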

4. Loop Closure

  • Bag-of-words model (DBoW2) for image similarity retrieval
  • Geometric verification (RANSAC + Sim(3)/SE(3))
  • Pose graph optimization (correcting accumulated drift)
  • Optional global BA

3.3 Visual-Inertial SLAM (VIO)

Visual-inertial systems fuse IMU measurements via pre-integration between consecutive keyframes \(i\) and \(j\):

\[ \Delta R_{ij} = \prod_{k=i}^{j-1} \text{Exp}((\omega_k - b_g) \Delta t) \]
\[ \Delta v_{ij} = \sum_{k=i}^{j-1} \Delta R_{ik} (a_k - b_a) \Delta t \]
\[ \Delta p_{ij} = \sum_{k=i}^{j-1} \left[ \Delta v_{ik} \Delta t + \frac{1}{2} \Delta R_{ik} (a_k - b_a) \Delta t^2 \right] \]
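A minimal numpy sketch of this recursion follows (the bias Jacobians and noise covariance propagation that real VIO back ends also maintain are omitted):

```python
import numpy as np

def exp_so3(w):
    """Rodrigues' formula: map a rotation vector to a rotation matrix."""
    th = np.linalg.norm(w)
    if th < 1e-9:
        return np.eye(3)
    k = w / th
    K = np.array([[0, -k[2], k[1]], [k[2], 0, -k[0]], [-k[1], k[0], 0]])
    return np.eye(3) + np.sin(th) * K + (1 - np.cos(th)) * K @ K

def preintegrate(gyro, accel, dt, bg, ba):
    """Accumulate Delta R, Delta v, Delta p from IMU samples between keyframes i and j."""
    dR, dv, dp = np.eye(3), np.zeros(3), np.zeros(3)
    for w, a in zip(gyro, accel):
        dp = dp + dv * dt + 0.5 * dR @ (a - ba) * dt**2  # uses Delta R_ik, Delta v_ik
        dv = dv + dR @ (a - ba) * dt
        dR = dR @ exp_so3((w - bg) * dt)
    return dR, dv, dp
```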

VIO systems maintain robustness in texture-poor and fast-motion scenarios.


4 LiDAR SLAM

4.1 Overview

LiDAR SLAM uses laser scanners to obtain precise 3D point clouds (see Sensors), suitable for large-scale outdoor environments.

4.2 LOAM (Lidar Odometry and Mapping)

LOAM (Zhang & Singh, 2014) is one of the most influential LiDAR SLAM algorithms.

Core Idea: Decompose the problem into a high-frequency odometry module and a low-frequency mapping module.

Feature Extraction:

  • Edge points: Points with high curvature
  • Planar points: Points with low curvature

Point-to-Edge/Plane Distance Minimization:

Edge point residual:

\[ d_e = \frac{|(p_i - p_a) \times (p_i - p_b)|}{|p_a - p_b|} \]

Planar point residual:

\[ d_p = \frac{\left| (p_i - p_a) \cdot \left( (p_a - p_b) \times (p_a - p_c) \right) \right|}{|(p_a - p_b) \times (p_a - p_c)|} \]
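Both residuals are one-liners in numpy; the sketch below implements them directly from the formulas above.

```python
import numpy as np

def edge_residual(p_i, p_a, p_b):
    """Distance from p_i to the line through edge points p_a, p_b."""
    return np.linalg.norm(np.cross(p_i - p_a, p_i - p_b)) / np.linalg.norm(p_a - p_b)

def plane_residual(p_i, p_a, p_b, p_c):
    """Distance from p_i to the plane through p_a, p_b, p_c."""
    n = np.cross(p_a - p_b, p_a - p_c)            # plane normal
    return abs(np.dot(p_i - p_a, n)) / np.linalg.norm(n)

# Example: a point 1 m above the x-y plane has plane residual 1.0.
print(plane_residual(np.array([0., 0., 1.]), np.array([0., 0., 0.]),
                     np.array([1., 0., 0.]), np.array([0., 1., 0.])))
```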

4.3 Cartographer

Google Cartographer (Hess et al., 2016) is a widely used 2D/3D SLAM system.

Front End:

  • Submap construction: Uses scan matching to insert new scans into submaps
  • Local optimization: Ceres Solver for scan matching

Back End:

  • Loop closure detection: Branch-and-bound search
  • Pose graph optimization: Sparse Pose Adjustment (SPA)

4.4 Other LiDAR SLAM Systems

| System | Features |
| --- | --- |
| LeGO-LOAM | Ground segmentation + lightweight LOAM |
| LIO-SAM | Tightly-coupled LiDAR-IMU, factor graph optimization |
| FAST-LIO2 | Tightly-coupled, ikd-Tree acceleration, highly efficient |
| CT-ICP | Continuous-time ICP, handles motion distortion |

5 Graph-Based SLAM

5.1 Pose Graph Optimization

Modeling SLAM as a graph optimization problem:

  • Nodes: Robot poses \(x_i\)
  • Edges: Relative constraints between poses \(z_{ij}\) (odometry, loop closure)

Optimization Objective:

\[ x^* = \arg\min_x \sum_{(i,j) \in \mathcal{E}} \|e_{ij}(x_i, x_j)\|^2_{\Omega_{ij}} \]

where the error function:

\[ e_{ij} = z_{ij}^{-1} \oplus (x_i^{-1} \oplus x_j) \]

\(\Omega_{ij}\) is the information matrix (inverse of covariance).
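For SE(2) poses parameterized as \((x, y, \theta)\), the edge error is a short numpy sketch; `inv` and `oplus` implement pose inversion and composition:

```python
import numpy as np

def inv(p):
    """Inverse of an SE(2) pose (x, y, theta)."""
    x, y, th = p
    c, s = np.cos(th), np.sin(th)
    return np.array([-c * x - s * y, s * x - c * y, -th])

def oplus(a, b):
    """Compose two SE(2) poses: a (+) b."""
    xa, ya, tha = a
    c, s = np.cos(tha), np.sin(tha)
    return np.array([xa + c * b[0] - s * b[1], ya + s * b[0] + c * b[1], tha + b[2]])

def edge_error(x_i, x_j, z_ij):
    """e_ij = z_ij^{-1} (+) (x_i^{-1} (+) x_j); zero when the measurement is satisfied."""
    e = oplus(inv(z_ij), oplus(inv(x_i), x_j))
    e[2] = np.arctan2(np.sin(e[2]), np.cos(e[2]))  # wrap angle to (-pi, pi]
    return e
```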

5.2 Factor Graphs

Factor graphs are a more general graphical model representation:

  • Variable nodes: States to be estimated (poses, landmarks, sensor biases, etc.)
  • Factor nodes: Constraints (priors, odometry, observations, loop closures)
\[ p(X) \propto \prod_k f_k(X_k) \]

Common Factor Types:

| Factor | Connected Variables | Information Source |
| --- | --- | --- |
| Prior factor | \(x_0\) | Initial pose |
| Odometry factor | \(x_{t-1}, x_t\) | Wheel/visual/inertial odometry |
| Landmark observation factor | \(x_t, l_j\) | Camera/LiDAR observation |
| Loop closure factor | \(x_i, x_j\) | Loop closure detection |
| IMU pre-integration factor | \(x_i, x_j, v_i, v_j, b_i\) | IMU data |
| GPS factor | \(x_t\) | GPS localization |
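As an illustration, the sketch below builds a tiny pose graph with GTSAM's Python bindings: a prior anchoring the first pose, two odometry factors, and one loop closure, optimized with Levenberg-Marquardt. All numeric values are synthetic.

```python
import numpy as np
import gtsam

graph = gtsam.NonlinearFactorGraph()
prior_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.1, 0.1, 0.05]))
odom_noise = gtsam.noiseModel.Diagonal.Sigmas(np.array([0.2, 0.2, 0.1]))

# Prior factor fixes the gauge; between factors encode odometry and the loop closure.
graph.add(gtsam.PriorFactorPose2(0, gtsam.Pose2(0, 0, 0), prior_noise))
graph.add(gtsam.BetweenFactorPose2(0, 1, gtsam.Pose2(1, 0, 0), odom_noise))
graph.add(gtsam.BetweenFactorPose2(1, 2, gtsam.Pose2(1, 0, 0), odom_noise))
graph.add(gtsam.BetweenFactorPose2(2, 0, gtsam.Pose2(-2, 0, 0), odom_noise))  # loop closure

initial = gtsam.Values()
initial.insert(0, gtsam.Pose2(0.0, 0.0, 0.0))
initial.insert(1, gtsam.Pose2(1.1, 0.1, 0.05))    # deliberately perturbed guesses
initial.insert(2, gtsam.Pose2(2.1, -0.1, -0.05))
result = gtsam.LevenbergMarquardtOptimizer(graph, initial).optimize()
```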

5.3 Solution Methods

The graph optimization problem ultimately reduces to nonlinear least squares:

\[ \min_x \sum_k \|r_k(x)\|^2_{\Sigma_k} \]

Solved iteratively using Gauss-Newton or Levenberg-Marquardt:

\[ (J^T \Sigma^{-1} J) \Delta x = -J^T \Sigma^{-1} r \]

Sparsity

The information matrix \(H = J^T \Sigma^{-1} J\) in SLAM problems is sparse (because only adjacent/co-visible variables have constraints between them). Efficient solving is achieved through sparse Cholesky decomposition.
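A single Gauss-Newton step with sparse normal equations might look like the following sketch, where `jacobian` is a placeholder returning a whitened sparse \(J\) and residual \(r\):

```python
from scipy.sparse.linalg import spsolve

def gauss_newton_step(x, jacobian):
    """One GN step: form the sparse normal equations and solve for the update."""
    J, r = jacobian(x)         # J: scipy.sparse matrix, r: residual vector (whitened)
    H = (J.T @ J).tocsc()      # sparse information matrix J^T J stays sparse
    dx = spsolve(H, -J.T @ r)  # sparse direct solve; dedicated SLAM solvers use
    return x + dx              # sparse Cholesky (e.g., CHOLMOD) here
```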

Common Optimization Libraries:

  • g2o: General graph optimization framework
  • GTSAM: Factor graph-based, supports incremental optimization (iSAM2)
  • Ceres Solver: General nonlinear least squares solver

6 Learning-Enhanced SLAM

6.1 Depth Estimation

  • MonoDepth2: Self-supervised monocular depth estimation
  • DPT: Transformer-based depth prediction

6.2 Feature Extraction and Matching

  • SuperPoint: Self-supervised keypoint detection
  • SuperGlue: Graph Neural Network-based feature matching
  • LightGlue: Lightweight feature matching

6.3 End-to-End Learned SLAM

DROID-SLAM (Teed & Deng, 2021):

  • Dense matching based on RAFT optical flow network
  • Differentiable Bundle Adjustment layer
  • End-to-end training with strong generalization
\[ \min_{\{T_i\}, \{d_j\}} \sum_{(i,j)} \|p_{ij}^* - \Pi(T_i \cdot T_j^{-1}, d_j)\|^2_{\Sigma_{ij}} \]

where \(p_{ij}^*\) are network-predicted correspondences; gradient propagation is achieved through differentiable optimization by unrolling iterations.

6.4 Implicit Representations

  • iMAP: Uses NeRF as the map representation
  • NICE-SLAM: Hierarchical implicit representation
  • SplaTAM: 3D Gaussian Splatting-based SLAM

7 SLAM Evaluation Metrics

7.1 Trajectory Accuracy

Absolute Trajectory Error (ATE), computed after rigidly aligning the estimated trajectory to the ground truth:

\[ \text{ATE} = \sqrt{\frac{1}{N}\sum_{i=1}^{N} \|p_i^{\text{est}} - p_i^{\text{gt}}\|^2} \]

Relative Pose Error (RPE), which measures local drift over a fixed interval \(\Delta\):

\[ \text{RPE}_i = (T_i^{\text{gt}})^{-1} T_{i+\Delta}^{\text{gt}} \cdot ((T_i^{\text{est}})^{-1} T_{i+\Delta}^{\text{est}})^{-1} \]
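A minimal numpy sketch of ATE with Kabsch/Umeyama rigid alignment (translation-only error, no scale correction; `p_est` and `p_gt` are time-synchronized N×3 position arrays):

```python
import numpy as np

def ate_rmse(p_est, p_gt):
    """RMSE of translational error after rigid (no-scale) alignment."""
    mu_e, mu_g = p_est.mean(0), p_gt.mean(0)
    H = (p_est - mu_e).T @ (p_gt - mu_g)        # 3x3 cross-covariance
    U, _, Vt = np.linalg.svd(H)
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    R = Vt.T @ np.diag([1.0, 1.0, d]) @ U.T     # best-fit proper rotation (Kabsch)
    aligned = (R @ (p_est - mu_e).T).T + mu_g
    return np.sqrt(np.mean(np.sum((aligned - p_gt) ** 2, axis=1)))
```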

7.2 Standard Datasets

| Dataset | Sensors | Scenario |
| --- | --- | --- |
| KITTI | Stereo + LiDAR + GPS | Outdoor driving |
| EuRoC | Stereo + IMU | Indoor drone |
| TUM RGB-D | RGB-D | Indoor handheld |
| Newer College | Multi-LiDAR + camera | Outdoor walking |

7.3 Evaluation Tools

  • evo: Python trajectory evaluation toolkit
  • rpg_trajectory_evaluation: Evaluation tool from the Robotics and Perception Group (RPG) at the University of Zurich

8 System Integration

SLAM systems typically run under the ROS/ROS2 framework (see ROS2), coordinating with the navigation stack:

  • SLAM provides pose estimation and maps
  • The navigation stack performs path planning based on the map
  • Controllers execute motion commands

Sensor selection and calibration are crucial for SLAM performance (see Sensors).


References

  1. Thrun, S., Burgard, W. & Fox, D. (2005). Probabilistic Robotics. MIT Press.
  2. Cadena, C. et al. (2016). Past, present, and future of simultaneous localization and mapping. IEEE TRO.
  3. Campos, C. et al. (2021). ORB-SLAM3: An accurate open-source library for visual, visual-inertial and multi-map SLAM. IEEE TRO.
  4. Zhang, J. & Singh, S. (2014). LOAM: Lidar Odometry and Mapping in Real-time. RSS.
  5. Teed, Z. & Deng, J. (2021). DROID-SLAM: Deep Visual SLAM for Monocular, Stereo, and RGB-D Cameras. NeurIPS.
  6. Grisetti, G. et al. (2010). A tutorial on graph-based SLAM. IEEE ITSM.
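  7. Montemerlo, M. et al. (2002). FastSLAM: A factored solution to the simultaneous localization and mapping problem. AAAI.
  8. Hess, W. et al. (2016). Real-time loop closure in 2D LIDAR SLAM. IEEE ICRA.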
