State Space Models (SSM)
Overview
State Space Models originate from control theory, discretizing continuous-time dynamical systems for sequence modeling. From S4 to Mamba, SSMs provide a linear-complexity alternative to Transformers that excels at long-sequence modeling.
```mermaid
graph TD
A[State Space Models] --> B[Continuous SSM]
A --> C[Discretization]
A --> D[Structured SSM]
A --> E[Selective SSM]
B --> B1["dx/dt = Ax + Bu"]
C --> C1[ZOH / Bilinear]
D --> D1[S4 HiPPO]
D --> D2[S4D / S5]
E --> E1[Mamba]
E --> E2[Mamba-2 SSD]
style E1 fill:#f96,stroke:#333
```
1. Continuous State Space Model
1.1 Basic Definition
A continuous-time SSM is defined as:

\[
\begin{aligned}
x'(t) &= A\,x(t) + B\,u(t) \\
y(t) &= C\,x(t) + D\,u(t)
\end{aligned}
\]
where:
- \(u(t) \in \mathbb{R}\): Input signal
- \(x(t) \in \mathbb{R}^N\): Hidden state (\(N\)-dimensional)
- \(y(t) \in \mathbb{R}\): Output signal
- \(A \in \mathbb{R}^{N \times N}\): State matrix (system dynamics)
- \(B \in \mathbb{R}^{N \times 1}\): Input matrix
- \(C \in \mathbb{R}^{1 \times N}\): Output matrix
- \(D \in \mathbb{R}\): Skip connection (often omitted or set to 0)
1.2 Continuous to Discrete
For discrete sequences \((u_0, u_1, \ldots, u_L)\), the continuous SSM must be discretized.
Zero-Order Hold (ZOH):

\[
\bar{A} = \exp(\Delta A), \qquad \bar{B} = (\Delta A)^{-1}\big(\exp(\Delta A) - I\big)\,\Delta B
\]
Bilinear Transform:

\[
\bar{A} = \Big(I - \tfrac{\Delta}{2}A\Big)^{-1}\Big(I + \tfrac{\Delta}{2}A\Big), \qquad \bar{B} = \Big(I - \tfrac{\Delta}{2}A\Big)^{-1}\Delta B
\]
where \(\Delta\) is the discretization step size.
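As a concrete check, here is a minimal NumPy/SciPy sketch of both discretizations on a hypothetical 2-state system (the matrices and step size are illustrative, not taken from any paper):

```python
import numpy as np
from scipy.linalg import expm

# Hypothetical 2-state system for illustration
A = np.array([[-1.0, 0.5],
              [0.0, -2.0]])          # state matrix (stable)
B = np.array([[1.0], [1.0]])         # input matrix
dt = 0.1                             # step size Delta
I = np.eye(2)

# Zero-Order Hold: A_bar = exp(dt A), B_bar = (dt A)^{-1}(exp(dt A) - I) dt B
A_zoh = expm(dt * A)
B_zoh = np.linalg.inv(A) @ (A_zoh - I) @ B   # the dt factors cancel to A^{-1}(...)B

# Bilinear (Tustin): A_bar = (I - dt/2 A)^{-1}(I + dt/2 A)
inv = np.linalg.inv(I - (dt / 2) * A)
A_bil = inv @ (I + (dt / 2) * A)
B_bil = inv @ (dt * B)

# Both agree to first order: A_bar -> I + dt*A as dt -> 0
print(A_zoh)
print(A_bil)
```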
1.3 Discrete SSM Recurrence
\[
x_k = \bar{A}\,x_{k-1} + \bar{B}\,u_k, \qquad y_k = C\,x_k
\]

This is a linear recurrence that can be computed efficiently.
1.4 Convolutional View
Unrolling the recurrence (with \(x_{-1} = 0\)):

\[
y_k = \sum_{j=0}^{k} C\,\bar{A}^{\,j}\bar{B}\,u_{k-j}
\]

Define the convolution kernel \(\bar{K} = (C\bar{B},\, C\bar{A}\bar{B},\, \ldots,\, C\bar{A}^{L-1}\bar{B})\); then \(y = \bar{K} * u\) is a single causal convolution.
Dual Computation Modes:
| Mode | Complexity | Use Case |
|---|---|---|
| Recurrent mode | \(O(L)\) time, \(O(N)\) memory | Autoregressive inference |
| Convolutional mode | \(O(L \log L)\) time | Parallel training |
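The following sketch verifies the two modes agree numerically on a hypothetical random system (the names and sizes are illustrative). `np.convolve` is used for clarity and is \(O(L^2)\); FFT-based convolution achieves the \(O(L \log L)\) bound in practice:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 32, 4
# Hypothetical discretized system (any stable A_bar works for this demo)
A_bar = np.diag(rng.uniform(0.5, 0.9, N))
B_bar = rng.normal(size=(N, 1))
C = rng.normal(size=(1, N))
u = rng.normal(size=L)

# Recurrent mode: x_k = A_bar x_{k-1} + B_bar u_k, y_k = C x_k
x = np.zeros((N, 1))
y_rec = np.empty(L)
for k in range(L):
    x = A_bar @ x + B_bar * u[k]
    y_rec[k] = (C @ x).item()

# Convolutional mode: K_bar[j] = C A_bar^j B_bar
K = np.array([(C @ np.linalg.matrix_power(A_bar, j) @ B_bar).item()
              for j in range(L)])
y_conv = np.convolve(u, K)[:L]   # causal convolution, truncated to length L

assert np.allclose(y_rec, y_conv)  # both modes compute the same outputs
```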
2. S4: Structured State Spaces
2.1 HiPPO Matrix
S4 (Gu et al., 2022) introduces the HiPPO (High-order Polynomial Projection Operators) initialization for matrix \(A\).
HiPPO-LegS Matrix:

\[
A_{nk} = -\begin{cases}
\sqrt{(2n+1)(2k+1)} & n > k \\
n+1 & n = k \\
0 & n < k
\end{cases}
\]
Intuition: The HiPPO matrix enables the state \(x(t)\) to optimally compress the history of input signals, equivalent to approximating a sliding window of input using orthogonal polynomial basis functions.
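A direct construction of the HiPPO-LegS matrix from the formula above (a minimal sketch; the helper name is ours):

```python
import numpy as np

def hippo_legs(N):
    """HiPPO-LegS state matrix: A[n,k] = -sqrt((2n+1)(2k+1)) below the
    diagonal, -(n+1) on the diagonal, 0 above."""
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt((2 * n + 1) * (2 * k + 1))
            elif n == k:
                A[n, k] = -(n + 1)
    return A

print(hippo_legs(4))
```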
2.2 Diagonalization for Speed
Direct computation of \(\bar{A}^k\) is expensive. S4 leverages structural properties of \(A\):
- S4 parameterizes \(A\) in diagonal-plus-low-rank (DPLR) form; S4D diagonalizes fully, \(A = V \Lambda V^{-1}\), keeping only \(\Lambda\)
- Diagonal SSM: \(\bar{A}_{\text{diag}} = \exp(\Delta \Lambda)\), computed elementwise
- Per-step cost drops from \(O(N^2)\) to \(O(N)\)
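A minimal sketch of the diagonal shortcut: with \(A = \Lambda\) diagonal, discretization is elementwise and the kernel reduces to a Vandermonde-style product with no matrix powers (all values below are illustrative):

```python
import numpy as np

N, L, dt = 4, 16, 0.1
lam = -np.arange(1, N + 1, dtype=float)   # hypothetical diagonal of Lambda
B = np.ones(N)
C = np.ones(N)

lam_bar = np.exp(dt * lam)                # O(N) discretization: exp(dt * Lambda)
B_bar = (lam_bar - 1.0) / lam * B         # ZOH for diagonal A, elementwise

# Kernel via Vandermonde: K[j] = sum_n C_n B_bar_n lam_bar_n^j  -- O(N L) total
K = (C * B_bar) @ (lam_bar[:, None] ** np.arange(L))
print(K)
```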
2.3 Long-Range Modeling Capability
On the Long Range Arena benchmark, S4 dramatically outperforms Transformers:
| Task | Transformer | S4 |
|---|---|---|
| ListOps | 36.37 | 58.35 |
| Text | 64.27 | 86.82 |
| Retrieval | 57.46 | 87.09 |
| Image | 42.44 | 88.65 |
| PathFinder | 71.40 | 94.20 |
| Path-X (16K) | FAIL | 96.35 |
| Average | 53.66 | 86.09 |
3. Mamba: Selective State Spaces
3.1 Motivation
Standard SSM parameters \((\bar{A}, \bar{B}, C)\) are input-independent (Linear Time-Invariant, LTI):
- Pro: Can use convolution for efficient training
- Con: Cannot perform content-aware reasoning (e.g., selectively remembering based on input)
3.2 Selective SSM
Mamba (Gu & Dao, 2023) makes \(B, C, \Delta\) input-dependent:

\[
B_t = \mathrm{Linear}_N(u_t), \qquad C_t = \mathrm{Linear}_N(u_t), \qquad \Delta_t = \mathrm{softplus}\big(\mathrm{Linear}_1(u_t) + b_\Delta\big)
\]
Key Insights:
- \(\Delta_t\) controls the "forget gate": large \(\Delta\) → focus on current input; small \(\Delta\) → retain history
- \(B_t\) controls the "input gate": selectively writes information into state
- \(C_t\) controls the "output gate": selectively reads from state
Analogy to Gated RNNs: for \(N = 1\), \(A = -1\), \(B = 1\), ZOH discretization reduces the selective SSM to the gated recurrence \(x_t = (1 - g_t)\,x_{t-1} + g_t\,u_t\) with \(g_t = \sigma(\mathrm{Linear}(u_t))\), so \(\Delta_t\) plays exactly the role of a learned gate.
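A minimal single-channel sketch of the selection mechanism, assuming random weights in place of learned projections (the ZOH formulas match Section 1.2; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 16, 4
A = -np.arange(1, N + 1, dtype=float)    # fixed diagonal A, negative for stability
w_B = rng.normal(size=N)                 # hypothetical projection weights
w_C = rng.normal(size=N)
w_d, b_d = rng.normal(), 0.0
u = rng.normal(size=L)

def softplus(z):
    return np.log1p(np.exp(z))

x, ys = np.zeros(N), []
for t in range(L):
    # Input-dependent parameters (the selection mechanism)
    B_t = w_B * u[t]                      # "input gate": what to write
    C_t = w_C * u[t]                      # "output gate": what to read
    d_t = softplus(w_d * u[t] + b_d)      # "forget gate" via the step size
    # ZOH with diagonal A (elementwise), then one recurrence step
    A_bar = np.exp(d_t * A)               # large d_t -> A_bar small -> forget history
    B_bar = (A_bar - 1.0) / A * B_t
    x = A_bar * x + B_bar * u[t]
    ys.append(C_t @ x)
print(ys)
```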
3.3 Hardware-Aware Scan Algorithm
Because selective SSM parameters vary over time, the model is no longer LTI and the convolutional mode is unavailable. Mamba instead employs a hardware-aware parallel scan algorithm:
Parallel Scan (Prefix Sum): the recurrence \(x_k = a_k x_{k-1} + b_k\) can be parallelized with the binary operator

\[
(a_1, b_1) \bullet (a_2, b_2) = (a_1 a_2,\; a_2 b_1 + b_2)
\]
This binary operator is associative, enabling parallel prefix sum in \(O(\log L)\) depth.
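A small sketch of the combine operator, checking that folding the steps with `reduce` agrees with the sequential recurrence; the associativity of `combine` is what a parallel prefix scan exploits:

```python
from functools import reduce

# Applying step (a1, b1) then (a2, b2) to a state x gives
# x -> a2 * (a1 * x + b1) + b2 = (a1 * a2) * x + (a2 * b1 + b2)
def combine(p, q):
    a1, b1 = p
    a2, b2 = q
    return (a1 * a2, a2 * b1 + b2)

steps = [(0.9, 1.0), (0.5, 2.0), (0.8, -1.0)]
a, b = reduce(combine, steps)   # any grouping works: combine is associative

x = 0.0                          # sequential reference, x0 = 0
for ai, bi in steps:
    x = ai * x + bi
assert abs((a * 0.0 + b) - x) < 1e-12
```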
GPU Optimizations:
- Avoid storing full \((\bar{A}, \bar{B})\) sequences in HBM
- Perform discretization and scan in SRAM
- Kernel fusion strategy similar to FlashAttention
3.4 Mamba Architecture
Mamba Block (replaces Transformer Block):
```
Input x
├── Linear projection → expand dimensions
├── 1D depthwise convolution (kernel=4)
├── SiLU activation
├── Selective SSM
├── SiLU gating (separate branch)
└── Linear projection → output dimensions
```
Key Design Choices:
- No attention mechanism, no MLP block
- 1D causal convolution provides local context
- Expansion factor \(E=2\)
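A simplified PyTorch sketch of this block, assuming a naive sequential scan and a scalar \(\Delta\) projection (real implementations use per-channel \(\Delta\) and a fused CUDA scan kernel; the class name and defaults are ours):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MambaBlockSketch(nn.Module):
    """Simplified Mamba block sketch. Shapes: batch B, length L, model dim D."""
    def __init__(self, d_model, d_state=16, expand=2, d_conv=4):
        super().__init__()
        d_inner = expand * d_model
        self.d_state = d_state
        self.in_proj = nn.Linear(d_model, 2 * d_inner)      # main branch + gate branch
        self.conv = nn.Conv1d(d_inner, d_inner, d_conv,
                              groups=d_inner, padding=d_conv - 1)  # depthwise
        self.x_proj = nn.Linear(d_inner, 2 * d_state + 1)   # per-step B_t, C_t, dt_t
        self.A_log = nn.Parameter(
            torch.log(torch.arange(1, d_state + 1).float()).repeat(d_inner, 1))
        self.D = nn.Parameter(torch.ones(d_inner))          # skip connection
        self.out_proj = nn.Linear(d_inner, d_model)

    def forward(self, u):                                   # u: (B, L, d_model)
        Bsz, L, _ = u.shape
        x, z = self.in_proj(u).chunk(2, dim=-1)             # (B, L, d_inner) each
        x = self.conv(x.transpose(1, 2))[..., :L].transpose(1, 2)  # causal: trim tail
        x = F.silu(x)
        Bc, Cc, dt = torch.split(self.x_proj(x),
                                 [self.d_state, self.d_state, 1], dim=-1)
        dt = F.softplus(dt)                                 # (B, L, 1), step sizes > 0
        A = -torch.exp(self.A_log)                          # (d_inner, N), A < 0
        h = x.new_zeros(Bsz, x.size(-1), self.d_state)      # state: (B, d_inner, N)
        ys = []
        for t in range(L):                                  # naive sequential scan
            A_bar = torch.exp(dt[:, t, :, None] * A)        # ZOH: exp(dt * A)
            B_bar = dt[:, t, :, None] * Bc[:, t, None, :]   # simplified Euler for B
            h = A_bar * h + B_bar * x[:, t, :, None]
            ys.append((h * Cc[:, t, None, :]).sum(-1) + self.D * x[:, t])
        y = torch.stack(ys, dim=1) * F.silu(z)              # SiLU gating branch
        return self.out_proj(y)
```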
3.5 Mamba vs Transformer
| Feature | Transformer | Mamba |
|---|---|---|
| Training complexity | \(O(L^2 d)\) | \(O(L d N)\) |
| Inference complexity | \(O(L)\) per token | \(O(1)\) per token |
| KV cache | Grows linearly with sequence | Fixed size \(O(N d)\) |
| Long sequence capability | Limited by \(L^2\) | Linear scaling |
| In-context learning | Strong | Weaker |
| Retrieval capability | Strong | Weaker |
4. Mamba-2: Structured State Space Duality
4.1 SSD Framework
Mamba-2 (Dao & Gu, 2024) reveals the duality between SSMs and attention:
State Space Duality (SSD): with a per-step scalar \(a_t\), the selective recurrence \(x_t = a_t x_{t-1} + B_t u_t\), \(y_t = C_t^\top x_t\) unrolls to

\[
y_i = \sum_{j \le i} \Big( \prod_{k=j+1}^{i} a_k \Big) \big(C_i^\top B_j\big)\, u_j
\]

When \(A\) is scalar, this can be written in matrix form:

\[
y = \big(L \circ C B^\top\big)\, u, \qquad L_{ij} = \prod_{k=j+1}^{i} a_k \quad (i \ge j)
\]

where \(L\) is a lower-triangular mask matrix formed by cumulative products of \(A\), and \(B, C\) stack the per-step \(B_t, C_t\) as rows.
This is a form of structured causal attention!
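A minimal numerical check of the duality, assuming random per-step \(a_t, B_t, C_t\): the masked "attention" matrix \(M = L \circ CB^\top\) reproduces the recurrence exactly:

```python
import numpy as np

rng = np.random.default_rng(0)
L, N = 8, 4
a = rng.uniform(0.5, 0.95, L)        # scalar a_t per step (input-dependent)
B = rng.normal(size=(L, N))          # rows are B_t
C = rng.normal(size=(L, N))          # rows are C_t
u = rng.normal(size=L)

# Lower-triangular mask: L_mask[i, j] = a_i * a_{i-1} * ... * a_{j+1}
L_mask = np.zeros((L, L))
for i in range(L):
    for j in range(i + 1):
        L_mask[i, j] = np.prod(a[j + 1: i + 1])

M = L_mask * (C @ B.T)               # structured causal "attention" matrix
y_mat = M @ u

# Same result from the recurrence x_t = a_t x_{t-1} + B_t u_t, y_t = C_t^T x_t
x, y_rec = np.zeros(N), np.empty(L)
for t in range(L):
    x = a[t] * x + B[t] * u[t]
    y_rec[t] = C[t] @ x
assert np.allclose(y_mat, y_rec)
```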
4.2 Chunked Algorithm
Mamba-2 uses a chunking strategy:
- Divide the sequence into blocks of size \(c\)
- Within chunks: Use attention form (quadratic, but chunks are small)
- Between chunks: Use SSM recurrence (linear)
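A scalar-recurrence sketch of the chunking idea (the chunk size and helper name are ours): quadratic work inside each chunk, a single carried state between chunks:

```python
import numpy as np

def chunked_scan(a, b, c=4):
    """Chunked evaluation of x_k = a_k x_{k-1} + b_k (scalar sketch).
    Within each chunk: the quadratic, attention-like lower-triangular form.
    Between chunks: hand off one state, i.e. the linear SSM recurrence."""
    L = len(a)
    xs, state = np.empty(L), 0.0
    for s in range(0, L, c):
        ac, bc = a[s:s + c], b[s:s + c]
        m = len(ac)
        T = np.zeros((m, m))             # intra-chunk causal mask (O(c^2) work)
        for i in range(m):
            for j in range(i + 1):
                T[i, j] = np.prod(ac[j + 1:i + 1])
        contrib = T @ bc                 # effect of inputs inside the chunk
        carry = np.cumprod(ac) * state   # effect of the incoming chunk state
        xs[s:s + c] = contrib + carry
        state = xs[s + m - 1]            # single state passed to the next chunk
    return xs

# Check against the plain sequential recurrence
rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 0.9, 10), rng.normal(size=10)
ref, x = [], 0.0
for ai, bi in zip(a, b):
    x = ai * x + bi
    ref.append(x)
assert np.allclose(chunked_scan(a, b), ref)
```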
4.3 Performance Improvements
- 2-8x faster than Mamba-1
- Supports larger state dimensions
- Better hardware utilization (tensor core friendly)
5. SSM Applications
5.1 Language Modeling
| Model | Architecture | Parameters | Highlights |
|---|---|---|---|
| Mamba-1 | Pure SSM | 130M-2.8B | Matches same-size Transformers |
| Mamba-2 | SSD | - | Faster and better |
| Jamba | Attention+Mamba+MoE | 52B | Hybrid architecture |
| Zamba | Attention+Mamba | 7B | Shared attention layers |
5.2 Vision
- Vision Mamba (Vim): Mamba for image classification
- VMamba: 2D selective scan
- PlainMamba: Simplified vision Mamba
5.3 Other Domains
- Audio/speech processing
- Genomic sequence modeling
- Time series forecasting
- Point cloud processing
6. SSM Ecosystem Summary
```mermaid
graph LR
A[HiPPO 2020] --> B[S4 2022]
B --> C[S4D 2022]
B --> D[H3 2023]
D --> E[Mamba 2023]
E --> F[Mamba-2 2024]
B --> G[S5 2023]
E --> H[Jamba 2024]
E --> I[Vision Mamba 2024]
```
| Method | Year | Key Innovation |
|---|---|---|
| HiPPO | 2020 | Optimal state compression |
| S4 | 2022 | Structured SSM + convolutional mode |
| S4D | 2022 | Diagonal simplification |
| H3 | 2023 | SSM + gated attention |
| Mamba | 2023 | Selective SSM + hardware-aware scan |
| Mamba-2 | 2024 | SSD duality + chunked algorithm |
References
- Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces," ICLR 2022
- Gu et al., "On the Parameterization and Initialization of Diagonal State Space Models," NeurIPS 2022
- Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," arXiv:2312.00752, 2023
- Dao & Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality," ICML 2024