
State Space Models (SSM)

Overview

State Space Models originate in control theory: a continuous-time linear dynamical system is discretized and used as a sequence model. From S4 to Mamba, SSMs provide a linear-complexity alternative to Transformers that excels at long-sequence modeling.

graph TD
    A[State Space Models] --> B[Continuous SSM]
    A --> C[Discretization]
    A --> D[Structured SSM]
    A --> E[Selective SSM]

    B --> B1["dx/dt = Ax + Bu"]
    C --> C1[ZOH / Bilinear]
    D --> D1[S4 HiPPO]
    D --> D2[S4D / S5]
    E --> E1[Mamba]
    E --> E2[Mamba-2 SSD]

    style E1 fill:#f96,stroke:#333

1. Continuous State Space Model

1.1 Basic Definition

A continuous-time SSM is defined as:

\[ \dot{x}(t) = Ax(t) + Bu(t) \]
\[ y(t) = Cx(t) + Du(t) \]

where:

  • \(u(t) \in \mathbb{R}\): Input signal
  • \(x(t) \in \mathbb{R}^N\): Hidden state (\(N\)-dimensional)
  • \(y(t) \in \mathbb{R}\): Output signal
  • \(A \in \mathbb{R}^{N \times N}\): State matrix (system dynamics)
  • \(B \in \mathbb{R}^{N \times 1}\): Input matrix
  • \(C \in \mathbb{R}^{1 \times N}\): Output matrix
  • \(D \in \mathbb{R}\): Skip connection (often omitted or set to 0)

1.2 Continuous to Discrete

For discrete sequences \((u_0, u_1, \ldots, u_L)\), the continuous SSM must be discretized.

Zero-Order Hold (ZOH):

\[ \bar{A} = \exp(\Delta A) \]
\[ \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B \]

Bilinear Transform:

\[ \bar{A} = (I - \Delta A / 2)^{-1}(I + \Delta A / 2) \]
\[ \bar{B} = (I - \Delta A / 2)^{-1} \Delta B \]

where \(\Delta\) is the discretization step size.
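
The following NumPy sketch implements both discretization rules above; the small random system and the step size are illustrative assumptions, not values from any particular model.

```python
# Minimal sketch of ZOH and bilinear discretization (illustrative values).
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, dt):
    """Zero-order hold: A_bar = exp(dt A), B_bar = (dt A)^{-1}(exp(dt A) - I) dt B."""
    N = A.shape[0]
    A_bar = expm(dt * A)
    B_bar = np.linalg.solve(dt * A, A_bar - np.eye(N)) @ (dt * B)
    return A_bar, B_bar

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin): A_bar = (I - dt/2 A)^{-1}(I + dt/2 A), B_bar = (I - dt/2 A)^{-1} dt B."""
    N = A.shape[0]
    M = np.linalg.inv(np.eye(N) - dt / 2 * A)
    return M @ (np.eye(N) + dt / 2 * A), M @ (dt * B)

# Example with a small, roughly stable random system.
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 1))
A_zoh, B_zoh = discretize_zoh(A, B, dt=0.01)
A_bil, B_bil = discretize_bilinear(A, B, dt=0.01)
print(np.max(np.abs(A_zoh - A_bil)))  # the two rules agree closely for small dt
```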

1.3 Discrete SSM Recurrence

\[ x_k = \bar{A} x_{k-1} + \bar{B} u_k \]
\[ y_k = C x_k \]

This is a linear recurrence that can be computed efficiently.

1.4 Convolutional View

Unrolling the recurrence:

\[ y_k = C \bar{A}^k \bar{B} u_0 + C \bar{A}^{k-1} \bar{B} u_1 + \cdots + C \bar{B} u_k \]

Define the convolution kernel \(\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, \ldots, C\bar{A}^{L-1}\bar{B})\):

\[ y = \bar{K} * u \]

Dual Computation Modes:

| Mode | Complexity | Use Case |
| --- | --- | --- |
| Recurrent mode | \(O(L)\) time, \(O(N)\) memory | Autoregressive inference |
| Convolutional mode | \(O(L \log L)\) time | Parallel training |
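
As a sanity check of this duality, here is a small NumPy sketch (random illustrative parameters) computing the same output in both modes; a real implementation would use an FFT to reach \(O(L \log L)\) rather than `np.convolve`.

```python
# Check that the recurrent and convolutional views of an LTI SSM agree.
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, u):
    """x_k = A_bar x_{k-1} + B_bar u_k ; y_k = C x_k  (x_{-1} = 0)."""
    x = np.zeros((A_bar.shape[0], 1))
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append((C @ x).item())
    return np.array(ys)

def ssm_convolutional(A_bar, B_bar, C, u):
    """Materialize K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar) and convolve."""
    L = len(u)
    K, Ak_B = [], B_bar
    for _ in range(L):
        K.append((C @ Ak_B).item())
        Ak_B = A_bar @ Ak_B
    # causal convolution: y_k = sum_j K[j] * u[k-j]
    return np.convolve(u, np.array(K))[:L]

rng = np.random.default_rng(0)
N, L = 4, 32
A_bar = 0.9 * np.eye(N) + 0.01 * rng.standard_normal((N, N))
B_bar = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
u = rng.standard_normal(L)
print(np.allclose(ssm_recurrent(A_bar, B_bar, C, u),
                  ssm_convolutional(A_bar, B_bar, C, u)))  # True
```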

2. S4: Structured State Spaces

2.1 HiPPO Matrix

S4 (Gu et al., 2022) introduces the HiPPO (High-order Polynomial Projection Operators) initialization for matrix \(A\).

HiPPO-LegS Matrix:

\[ A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases} \]

Intuition: The HiPPO matrix lets the state \(x(t)\) optimally compress the history of the input signal: it is equivalent to projecting the input history onto an orthogonal (Legendre) polynomial basis and keeping the first \(N\) coefficients.
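
A short sketch of how the HiPPO-LegS matrix above can be constructed (0-indexed \(n, k\)):

```python
# Construct the HiPPO-LegS matrix defined above.
import numpy as np

def hippo_legs(N):
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = -(n + 1)
    return A

print(hippo_legs(4))
```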

2.2 Diagonalization for Speed

Direct computation of \(\bar{A}^k\) is expensive. S4 exploits structure in \(A\) (a normal-plus-low-rank decomposition), and S4D shows that a purely diagonal parameterization suffices:

  1. Diagonalize: \(A = V \Lambda V^{-1}\) (S4D keeps only the diagonal \(\Lambda\))
  2. Diagonal SSM: \(\bar{A}_{\text{diag}} = \exp(\Delta \Lambda)\)
  3. Complexity drops from \(O(N^2)\) to \(O(N)\) per step (see the sketch below)
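
A minimal sketch of the diagonal kernel computation: with a diagonal \(\Lambda\), \(\bar{K}\) reduces to elementwise powers instead of repeated matrix multiplication. The complex spectrum, the ZOH-style handling of \(B\), and the values below are illustrative assumptions.

```python
# Diagonal-SSM (S4D-style) kernel via a Vandermonde-like product.
import numpy as np

def diagonal_ssm_kernel(Lambda, B, C, dt, L):
    """Lambda, B, C: shape (N,) complex; returns the real kernel of length L."""
    A_bar = np.exp(dt * Lambda)                       # (N,) elementwise exp
    B_bar = (A_bar - 1.0) / Lambda * B                # ZOH for a diagonal A
    powers = A_bar[:, None] ** np.arange(L)[None, :]  # (N, L) Vandermonde matrix
    K = (C * B_bar) @ powers                          # K[k] = sum_n C_n B_bar_n A_bar_n^k
    return K.real

N, L = 8, 64
Lambda = -0.5 + 1j * np.arange(N)        # illustrative diagonal spectrum
B = np.ones(N, dtype=complex)
C = np.random.default_rng(0).standard_normal(N) + 0j
print(diagonal_ssm_kernel(Lambda, B, C, dt=0.05, L=L).shape)  # (64,)
```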

2.3 Long-Range Modeling Capability

On the Long Range Arena benchmark, S4 dramatically outperforms Transformers:

| Task | Transformer | S4 |
| --- | --- | --- |
| ListOps | 36.37 | 58.35 |
| Text | 64.27 | 86.82 |
| Retrieval | 57.46 | 87.09 |
| Image | 42.44 | 88.65 |
| PathFinder | 71.40 | 94.20 |
| Path-X (16K) | FAIL | 96.35 |
| Average | 53.66 | 86.09 |

3. Mamba: Selective State Spaces

3.1 Motivation

Standard SSM parameters \((\bar{A}, \bar{B}, C)\) are input-independent (Linear Time-Invariant, LTI):

  • Pro: Can use convolution for efficient training
  • Con: Cannot perform content-aware reasoning (e.g., selectively remembering based on input)

3.2 Selective SSM

Mamba (Gu & Dao, 2023) makes \(B, C, \Delta\) input-dependent:

\[ B_t = \text{Linear}_B(u_t), \quad C_t = \text{Linear}_C(u_t), \quad \Delta_t = \text{softplus}(\text{Linear}_\Delta(u_t)) \]

Key Insights:

  • \(\Delta_t\) controls the "forget gate": large \(\Delta\) → focus on current input; small \(\Delta\) → retain history
  • \(B_t\) controls the "input gate": selectively writes information into state
  • \(C_t\) controls the "output gate": selectively reads from state

Analogy to Gated RNNs:

\[ x_t = \bar{A}_t x_{t-1} + \bar{B}_t u_t \quad \longleftrightarrow \quad h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \]
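
Below is a naive NumPy sketch of a selective SSM computed as a sequential loop (no hardware-aware scan). The linear maps `W_B`, `W_C`, `w_dt`, the diagonal per-channel \(A\), and the Euler-style discretization of \(B\) are simplifying assumptions for illustration.

```python
# Naive selective SSM: B_t, C_t, Delta_t depend on the input u_t.
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(u, A, W_B, W_C, w_dt):
    """u: (L, D) input; A: (D, N) diagonal state matrix (one row per channel)."""
    L, D = u.shape
    N = A.shape[1]
    x = np.zeros((D, N))                      # hidden state per channel
    ys = np.zeros((L, D))
    for t in range(L):
        B_t = u[t] @ W_B                      # (N,)  input-dependent "write" vector
        C_t = u[t] @ W_C                      # (N,)  input-dependent "read" vector
        dt  = softplus(u[t] @ w_dt)           # (D,)  input-dependent step size
        A_bar = np.exp(dt[:, None] * A)       # (D, N) discretized decay
        B_bar = dt[:, None] * B_t[None, :]    # (D, N) simplified (Euler) discretization
        x = A_bar * x + B_bar * u[t][:, None] # selective state update
        ys[t] = x @ C_t                       # selective readout
    return ys

rng = np.random.default_rng(0)
L, D, N = 16, 8, 4
u = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))      # negative (stable) diagonal entries
out = selective_ssm(u, A, rng.standard_normal((D, N)),
                    rng.standard_normal((D, N)), rng.standard_normal((D, D)))
print(out.shape)  # (16, 8)
```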

3.3 Hardware-Aware Scan Algorithm

Because the selective SSM's parameters vary with the input at each time step, the convolutional mode no longer applies. Mamba instead employs a hardware-aware parallel scan algorithm:

Parallel Scan (Prefix Sum):

The recurrence \(x_k = a_k x_{k-1} + b_k\) can be parallelized as:

\[ (a_k, b_k) \bullet (a_{k-1}, b_{k-1}) = (a_k a_{k-1}, a_k b_{k-1} + b_k) \]

This binary operator is associative, enabling parallel prefix sum in \(O(\log L)\) depth.
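
A toy sketch of this operator: folding it left-to-right reproduces the recurrence, and because it is associative the same combines could be evaluated as a balanced tree in \(O(\log L)\) depth (the sketch below stays sequential for clarity).

```python
# The associative combine operator for the recurrence x_k = a_k x_{k-1} + b_k.
import numpy as np

def combine(right, left):
    """(a_r, b_r) . (a_l, b_l) = (a_r * a_l, a_r * b_l + b_r)."""
    a_r, b_r = right
    a_l, b_l = left
    return a_r * a_l, a_r * b_l + b_r

def scan_sequential(a, b):
    x, xs = 0.0, []
    for a_k, b_k in zip(a, b):
        x = a_k * x + b_k
        xs.append(x)
    return np.array(xs)

def prefix_via_operator(a, b):
    """Prefix 'sums' under combine; a parallel scan evaluates the same combines
    as a balanced tree in O(log L) depth instead of this sequential fold."""
    acc, xs = (1.0, 0.0), []                 # (1, 0) is the identity element
    for a_k, b_k in zip(a, b):
        acc = combine((a_k, b_k), acc)
        xs.append(acc[1])                    # the b-component is x_k (since x_0 = 0)
    return np.array(xs)

rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 1.0, 32), rng.standard_normal(32)
print(np.allclose(scan_sequential(a, b), prefix_via_operator(a, b)))  # True
```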

GPU Optimizations:

  • Avoid storing full \((\bar{A}, \bar{B})\) sequences in HBM
  • Perform discretization and scan in SRAM
  • Kernel fusion strategy similar to FlashAttention

3.4 Mamba Architecture

Mamba Block (replaces Transformer Block):

Input x
├── Linear projection → expand dimensions
├── 1D depthwise convolution (kernel=4)
├── SiLU activation
├── Selective SSM
├── SiLU gating (separate branch)
└── Linear projection → output dimensions

Key Design Choices:

  • No attention mechanism, no MLP block
  • 1D causal convolution provides local context
  • Expansion factor \(E=2\)
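
A heavily simplified NumPy sketch of this data flow (random, untrained weights; no residual connection or normalization; the real Mamba block relies on fused CUDA kernels):

```python
# One Mamba block, end to end, with illustrative shapes and weights.
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))

def mamba_block(x, d_model, d_state=4, expand=2, d_conv=4, rng=None):
    rng = rng or np.random.default_rng(0)
    L, _ = x.shape
    d_inner = expand * d_model
    W = lambda a, b: rng.standard_normal((a, b)) / np.sqrt(a)

    # 1) input projections: SSM branch and gating branch
    u = x @ W(d_model, d_inner)
    gate = x @ W(d_model, d_inner)

    # 2) causal depthwise 1D convolution (kernel=4) + SiLU
    k = rng.standard_normal((d_conv, d_inner)) / d_conv
    u_pad = np.concatenate([np.zeros((d_conv - 1, d_inner)), u])
    u = silu(np.stack([(u_pad[t:t + d_conv] * k).sum(0) for t in range(L)]))

    # 3) selective SSM with a diagonal state (input-dependent B, C, Delta)
    A = -np.exp(rng.standard_normal((d_inner, d_state)))
    W_B, W_C, W_dt = W(d_inner, d_state), W(d_inner, d_state), W(d_inner, d_inner)
    h, y = np.zeros((d_inner, d_state)), np.zeros((L, d_inner))
    for t in range(L):
        B_t, C_t, dt = u[t] @ W_B, u[t] @ W_C, softplus(u[t] @ W_dt)
        h = np.exp(dt[:, None] * A) * h + (dt[:, None] * B_t[None, :]) * u[t][:, None]
        y[t] = h @ C_t

    # 4) SiLU gating from the separate branch, then output projection
    return (y * silu(gate)) @ W(d_inner, d_model)

out = mamba_block(np.random.default_rng(1).standard_normal((16, 8)), d_model=8)
print(out.shape)  # (16, 8)
```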

3.5 Mamba vs Transformer

| Feature | Transformer | Mamba |
| --- | --- | --- |
| Training complexity | \(O(L^2 d)\) | \(O(L d N)\) |
| Inference complexity | \(O(L)\) per token | \(O(1)\) per token |
| KV cache | Grows linearly with sequence length | Fixed-size state \(O(N d)\) |
| Long-sequence capability | Limited by \(L^2\) cost | Linear scaling |
| In-context learning | Strong | Weaker |
| Retrieval capability | Strong | Weaker |

4. Mamba-2: Structured State Space Duality

4.1 SSD Framework

Mamba-2 (Dao & Gu, 2024) reveals the duality between SSMs and attention:

State Space Duality (SSD):

\[ y_t = \sum_{s=1}^{t} C_t^\top A_{t:s} B_s u_s \]

When \(A_t\) is a scalar (a scalar multiple of the identity), this can be written in matrix form:

\[ Y = (L \odot QK^\top) V \]

where \(Q, K, V\) play the roles of \(C, B, u\) respectively, and \(L\) is a lower-triangular mask whose entries are cumulative products of the \(A_t\).

This is a form of structured causal attention!
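
A small NumPy check of this duality in the scalar-\(A\) case, comparing the recurrence against the masked matrix form (random illustrative values):

```python
# SSD identity for scalar A_t: (L ⊙ C Bᵀ) u reproduces the SSM recurrence.
import numpy as np

rng = np.random.default_rng(0)
L_len, N = 16, 4
a = rng.uniform(0.7, 1.0, L_len)               # scalar A_t per step
B = rng.standard_normal((L_len, N))            # B_t   ("keys")
C = rng.standard_normal((L_len, N))            # C_t   ("queries")
u = rng.standard_normal(L_len)                 # input ("values")

# Recurrent form: h_t = a_t h_{t-1} + B_t u_t ; y_t = C_t^T h_t
h, y_rec = np.zeros(N), np.zeros(L_len)
for t in range(L_len):
    h = a[t] * h + B[t] * u[t]
    y_rec[t] = C[t] @ h

# Dual "attention" form: Y = (L ⊙ C Bᵀ) u with L[t, s] = prod_{k=s+1..t} a_k
logcum = np.concatenate([[0.0], np.cumsum(np.log(a))])   # prefix log-products
Lmask = np.tril(np.exp(logcum[1:, None] - logcum[None, 1:]))
y_dual = (Lmask * (C @ B.T)) @ u
print(np.allclose(y_rec, y_dual))  # True
```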

4.2 Chunked Algorithm

Mamba-2 uses a chunking strategy:

  • Divide the sequence into blocks of size \(c\)
  • Within chunks: Use attention form (quadratic, but chunks are small)
  • Between chunks: Use SSM recurrence (linear)
\[ \text{Total complexity} = O(L c + L \cdot N) \quad \text{where } c \text{ is the chunk size} \]
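
A toy sketch of the chunking idea for the scalar-\(A\) case: quadratic "attention" inside each chunk, a single carried state between chunks. This is a didactic check, not the blocked-matmul formulation Mamba-2 actually uses.

```python
# Chunked scalar-A SSD vs. the plain sequential recurrence.
import numpy as np

def ssd_chunked(a, B, C, u, chunk):
    L, N = B.shape
    h, y = np.zeros(N), np.zeros(L)
    for s0 in range(0, L, chunk):
        s1 = min(s0 + chunk, L)
        ac, Bc, Cc, uc = a[s0:s1], B[s0:s1], C[s0:s1], u[s0:s1]
        # decay[i, j] = prod_{k=j+1..i} a_k within the chunk (lower triangular)
        logcum = np.concatenate([[0.0], np.cumsum(np.log(ac))])
        decay = np.tril(np.exp(logcum[1:, None] - logcum[None, 1:]))
        # intra-chunk term: quadratic "attention" inside the chunk
        y[s0:s1] = (decay * (Cc @ Bc.T)) @ uc
        # inter-chunk term: contribution of the state carried from previous chunks
        y[s0:s1] += np.exp(logcum[1:]) * (Cc @ h)
        # update the carried state for the next chunk
        h = np.exp(logcum[-1]) * h + \
            (np.exp(logcum[-1] - logcum[1:])[:, None] * Bc * uc[:, None]).sum(0)
    return y

# Reference: the plain sequential recurrence.
rng = np.random.default_rng(0)
L, N = 32, 4
a = rng.uniform(0.7, 1.0, L)
B, C, u = rng.standard_normal((L, N)), rng.standard_normal((L, N)), rng.standard_normal(L)
h, y_ref = np.zeros(N), np.zeros(L)
for t in range(L):
    h = a[t] * h + B[t] * u[t]
    y_ref[t] = C[t] @ h
print(np.allclose(ssd_chunked(a, B, C, u, chunk=8), y_ref))  # True
```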

4.3 Performance Improvements

  • 2-8x faster than Mamba-1
  • Supports larger state dimensions
  • Better hardware utilization (tensor core friendly)

5. SSM Applications

5.1 Language Modeling

| Model | Architecture | Parameters | Highlights |
| --- | --- | --- | --- |
| Mamba-1 | Pure SSM | 130M-2.8B | Matches same-size Transformers |
| Mamba-2 | SSD | - | Faster and better |
| Jamba | Attention + Mamba + MoE | 52B | Hybrid architecture |
| Zamba | Attention + Mamba | 7B | Shared attention layers |

5.2 Vision

  • Vision Mamba (Vim): Mamba for image classification
  • VMamba: 2D selective scan
  • PlainMamba: Simplified vision Mamba

5.3 Other Domains

  • Audio/speech processing
  • Genomic sequence modeling
  • Time series forecasting
  • Point cloud processing

6. SSM Ecosystem Summary

graph LR
    A[HiPPO 2020] --> B[S4 2022]
    B --> C[S4D 2022]
    B --> D[H3 2023]
    D --> E[Mamba 2023]
    E --> F[Mamba-2 2024]

    B --> G[S5 2023]

    E --> H[Jamba 2024]
    E --> I[Vision Mamba 2024]

| Method | Year | Key Innovation |
| --- | --- | --- |
| HiPPO | 2020 | Optimal state compression |
| S4 | 2022 | Structured SSM + convolutional mode |
| S4D | 2022 | Diagonal simplification |
| H3 | 2023 | SSM + gated attention |
| Mamba | 2023 | Selective SSM + hardware-aware scan |
| Mamba-2 | 2024 | SSD duality + chunked algorithm |

References

  • Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces," ICLR 2022
  • Gu et al., "On the Parameterization and Initialization of Diagonal State Space Models," NeurIPS 2022
  • Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," 2023
  • Dao & Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality," ICML 2024
