
State Space Models (SSM)

Overview

State Space Models originate in control theory: a continuous-time linear dynamical system is discretized and used as a sequence model. From S4 to Mamba, SSMs provide a linear-complexity alternative to Transformers that excels at long-sequence modeling.

graph TD
    A[State Space Models] --> B[Continuous SSM]
    A --> C[Discretization]
    A --> D[Structured SSM]
    A --> E[Selective SSM]

    B --> B1["dx/dt = Ax + Bu"]
    C --> C1[ZOH / Bilinear]
    D --> D1[S4 HiPPO]
    D --> D2[S4D / S5]
    E --> E1[Mamba]
    E --> E2[Mamba-2 SSD]

    style E1 fill:#f96,stroke:#333

1. Continuous State Space Model

1.1 Basic Definition

A continuous-time SSM is defined as:

\[ \dot{x}(t) = Ax(t) + Bu(t) \]
\[ y(t) = Cx(t) + Du(t) \]

where:

  • \(u(t) \in \mathbb{R}\): Input signal
  • \(x(t) \in \mathbb{R}^N\): Hidden state (\(N\)-dimensional)
  • \(y(t) \in \mathbb{R}\): Output signal
  • \(A \in \mathbb{R}^{N \times N}\): State matrix (system dynamics)
  • \(B \in \mathbb{R}^{N \times 1}\): Input matrix
  • \(C \in \mathbb{R}^{1 \times N}\): Output matrix
  • \(D \in \mathbb{R}\): Skip connection (often omitted or set to 0)

1.2 Continuous to Discrete

For discrete sequences \((u_0, u_1, \ldots, u_L)\), the continuous SSM must be discretized.

Zero-Order Hold (ZOH):

\[ \bar{A} = \exp(\Delta A) \]
\[ \bar{B} = (\Delta A)^{-1}(\exp(\Delta A) - I) \cdot \Delta B \]

Bilinear Transform:

\[ \bar{A} = (I - \Delta A / 2)^{-1}(I + \Delta A / 2) \]
\[ \bar{B} = (I - \Delta A / 2)^{-1} \Delta B \]

where \(\Delta\) is the discretization step size.
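
The following NumPy sketch implements both discretization rules above; the small random system and the step size are illustrative assumptions, not values from any particular model.

```python
# Minimal sketch of ZOH and bilinear discretization (illustrative values).
import numpy as np
from scipy.linalg import expm

def discretize_zoh(A, B, dt):
    """Zero-order hold: A_bar = exp(dt A), B_bar = (dt A)^{-1}(exp(dt A) - I) dt B."""
    N = A.shape[0]
    A_bar = expm(dt * A)
    B_bar = np.linalg.solve(dt * A, A_bar - np.eye(N)) @ (dt * B)
    return A_bar, B_bar

def discretize_bilinear(A, B, dt):
    """Bilinear (Tustin): A_bar = (I - dt/2 A)^{-1}(I + dt/2 A), B_bar = (I - dt/2 A)^{-1} dt B."""
    N = A.shape[0]
    M = np.linalg.inv(np.eye(N) - dt / 2 * A)
    return M @ (np.eye(N) + dt / 2 * A), M @ (dt * B)

# Example with a small, roughly stable random system.
rng = np.random.default_rng(0)
A = -np.eye(4) + 0.1 * rng.standard_normal((4, 4))
B = rng.standard_normal((4, 1))
A_zoh, B_zoh = discretize_zoh(A, B, dt=0.01)
A_bil, B_bil = discretize_bilinear(A, B, dt=0.01)
print(np.max(np.abs(A_zoh - A_bil)))  # the two rules agree closely for small dt
```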

1.3 Discrete SSM Recurrence

\[ x_k = \bar{A} x_{k-1} + \bar{B} u_k \]
\[ y_k = C x_k \]

This is a linear recurrence that can be computed efficiently.

1.4 Convolutional View

Unrolling the recurrence:

\[ y_k = C \bar{A}^k \bar{B} u_0 + C \bar{A}^{k-1} \bar{B} u_1 + \cdots + C \bar{B} u_k \]

Define the convolution kernel \(\bar{K} = (C\bar{B}, C\bar{A}\bar{B}, \ldots, C\bar{A}^{L-1}\bar{B})\):

\[ y = \bar{K} * u \]

Dual Computation Modes:

| Mode | Complexity | Use Case |
| --- | --- | --- |
| Recurrent mode | \(O(L)\) time, \(O(N)\) memory | Autoregressive inference |
| Convolutional mode | \(O(L \log L)\) time | Parallel training |
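
As a sanity check of this duality, here is a small NumPy sketch (random illustrative parameters) computing the same output in both modes; a real implementation would use an FFT to reach \(O(L \log L)\) rather than `np.convolve`.

```python
# Check that the recurrent and convolutional views of an LTI SSM agree.
import numpy as np

def ssm_recurrent(A_bar, B_bar, C, u):
    """x_k = A_bar x_{k-1} + B_bar u_k ; y_k = C x_k  (x_{-1} = 0)."""
    x = np.zeros((A_bar.shape[0], 1))
    ys = []
    for u_k in u:
        x = A_bar @ x + B_bar * u_k
        ys.append((C @ x).item())
    return np.array(ys)

def ssm_convolutional(A_bar, B_bar, C, u):
    """Materialize K_bar = (C B_bar, C A_bar B_bar, ..., C A_bar^{L-1} B_bar) and convolve."""
    L = len(u)
    K, Ak_B = [], B_bar
    for _ in range(L):
        K.append((C @ Ak_B).item())
        Ak_B = A_bar @ Ak_B
    # causal convolution: y_k = sum_j K[j] * u[k-j]
    return np.convolve(u, np.array(K))[:L]

rng = np.random.default_rng(0)
N, L = 4, 32
A_bar = 0.9 * np.eye(N) + 0.01 * rng.standard_normal((N, N))
B_bar = rng.standard_normal((N, 1))
C = rng.standard_normal((1, N))
u = rng.standard_normal(L)
print(np.allclose(ssm_recurrent(A_bar, B_bar, C, u),
                  ssm_convolutional(A_bar, B_bar, C, u)))  # True
```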

2. S4: Structured State Spaces

2.1 HiPPO Matrix

S4 (Gu et al., 2022) introduces the HiPPO (High-order Polynomial Projection Operators) initialization for matrix \(A\).

HiPPO-LegS Matrix:

\[ A_{nk} = -\begin{cases} (2n+1)^{1/2}(2k+1)^{1/2} & \text{if } n > k \\ n+1 & \text{if } n = k \\ 0 & \text{if } n < k \end{cases} \]

Intuition: The HiPPO matrix lets the state \(x(t)\) optimally compress the history of the input signal: it is equivalent to projecting the input history onto an orthogonal (Legendre) polynomial basis and keeping the first \(N\) coefficients.
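
A short sketch of how the HiPPO-LegS matrix above can be constructed (0-indexed \(n, k\)):

```python
# Construct the HiPPO-LegS matrix defined above.
import numpy as np

def hippo_legs(N):
    A = np.zeros((N, N))
    for n in range(N):
        for k in range(N):
            if n > k:
                A[n, k] = -np.sqrt(2 * n + 1) * np.sqrt(2 * k + 1)
            elif n == k:
                A[n, k] = -(n + 1)
    return A

print(hippo_legs(4))
```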

2.2 Diagonalization for Speed

Direct computation of \(\bar{A}^k\) is expensive. S4 exploits structure in \(A\) (a normal-plus-low-rank decomposition), and S4D shows that a purely diagonal parameterization suffices:

  1. Diagonalize: \(A = V \Lambda V^{-1}\) (S4D keeps only the diagonal \(\Lambda\))
  2. Diagonal SSM: \(\bar{A}_{\text{diag}} = \exp(\Delta \Lambda)\)
  3. Complexity drops from \(O(N^2)\) to \(O(N)\) per step (see the sketch below)
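
A minimal sketch of the diagonal kernel computation: with a diagonal \(\Lambda\), \(\bar{K}\) reduces to elementwise powers instead of repeated matrix multiplication. The complex spectrum, the ZOH-style handling of \(B\), and the values below are illustrative assumptions.

```python
# Diagonal-SSM (S4D-style) kernel via a Vandermonde-like product.
import numpy as np

def diagonal_ssm_kernel(Lambda, B, C, dt, L):
    """Lambda, B, C: shape (N,) complex; returns the real kernel of length L."""
    A_bar = np.exp(dt * Lambda)                       # (N,) elementwise exp
    B_bar = (A_bar - 1.0) / Lambda * B                # ZOH for a diagonal A
    powers = A_bar[:, None] ** np.arange(L)[None, :]  # (N, L) Vandermonde matrix
    K = (C * B_bar) @ powers                          # K[k] = sum_n C_n B_bar_n A_bar_n^k
    return K.real

N, L = 8, 64
Lambda = -0.5 + 1j * np.arange(N)        # illustrative diagonal spectrum
B = np.ones(N, dtype=complex)
C = np.random.default_rng(0).standard_normal(N) + 0j
print(diagonal_ssm_kernel(Lambda, B, C, dt=0.05, L=L).shape)  # (64,)
```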

2.3 Long-Range Modeling Capability

On the Long Range Arena benchmark, S4 dramatically outperforms Transformers:

| Task | Transformer | S4 |
| --- | --- | --- |
| ListOps | 36.37 | 58.35 |
| Text | 64.27 | 86.82 |
| Retrieval | 57.46 | 87.09 |
| Image | 42.44 | 88.65 |
| PathFinder | 71.40 | 94.20 |
| Path-X (16K) | FAIL | 96.35 |
| Average | 53.66 | 86.09 |

3. Mamba: Selective State Spaces

3.1 Motivation

Standard SSM parameters \((\bar{A}, \bar{B}, C)\) are input-independent (Linear Time-Invariant, LTI):

  • Pro: Can use convolution for efficient training
  • Con: Cannot perform content-aware reasoning (e.g., selectively remembering based on input)

3.2 Selective SSM

Mamba (Gu & Dao, 2023) makes \(B, C, \Delta\) input-dependent:

\[ B_t = \text{Linear}_B(u_t), \quad C_t = \text{Linear}_C(u_t), \quad \Delta_t = \text{softplus}(\text{Linear}_\Delta(u_t)) \]

Key Insights:

  • \(\Delta_t\) controls the "forget gate": large \(\Delta\) → focus on current input; small \(\Delta\) → retain history
  • \(B_t\) controls the "input gate": selectively writes information into state
  • \(C_t\) controls the "output gate": selectively reads from state

Analogy to Gated RNNs:

\[ x_t = \bar{A}_t x_{t-1} + \bar{B}_t u_t \quad \longleftrightarrow \quad h_t = (1-z_t) \odot h_{t-1} + z_t \odot \tilde{h}_t \]
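
Below is a naive NumPy sketch of a selective SSM computed as a sequential loop (no hardware-aware scan). The linear maps `W_B`, `W_C`, `w_dt`, the diagonal per-channel \(A\), and the Euler-style discretization of \(B\) are simplifying assumptions for illustration.

```python
# Naive selective SSM: B_t, C_t, Delta_t depend on the input u_t.
import numpy as np

def softplus(z):
    return np.log1p(np.exp(z))

def selective_ssm(u, A, W_B, W_C, w_dt):
    """u: (L, D) input; A: (D, N) diagonal state matrix (one row per channel)."""
    L, D = u.shape
    N = A.shape[1]
    x = np.zeros((D, N))                      # hidden state per channel
    ys = np.zeros((L, D))
    for t in range(L):
        B_t = u[t] @ W_B                      # (N,)  input-dependent "write" vector
        C_t = u[t] @ W_C                      # (N,)  input-dependent "read" vector
        dt  = softplus(u[t] @ w_dt)           # (D,)  input-dependent step size
        A_bar = np.exp(dt[:, None] * A)       # (D, N) discretized decay
        B_bar = dt[:, None] * B_t[None, :]    # (D, N) simplified (Euler) discretization
        x = A_bar * x + B_bar * u[t][:, None] # selective state update
        ys[t] = x @ C_t                       # selective readout
    return ys

rng = np.random.default_rng(0)
L, D, N = 16, 8, 4
u = rng.standard_normal((L, D))
A = -np.exp(rng.standard_normal((D, N)))      # negative (stable) diagonal entries
out = selective_ssm(u, A, rng.standard_normal((D, N)),
                    rng.standard_normal((D, N)), rng.standard_normal((D, D)))
print(out.shape)  # (16, 8)
```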

3.3 Hardware-Aware Scan Algorithm

Because the selective SSM's parameters vary with the input at each time step, the convolutional mode no longer applies. Mamba instead employs a hardware-aware parallel scan algorithm:

Parallel Scan (Prefix Sum):

The recurrence \(x_k = a_k x_{k-1} + b_k\) can be parallelized as:

\[ (a_k, b_k) \bullet (a_{k-1}, b_{k-1}) = (a_k a_{k-1}, a_k b_{k-1} + b_k) \]

This binary operator is associative, enabling parallel prefix sum in \(O(\log L)\) depth.
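
A toy sketch of this operator: folding it left-to-right reproduces the recurrence, and because it is associative the same combines could be evaluated as a balanced tree in \(O(\log L)\) depth (the sketch below stays sequential for clarity).

```python
# The associative combine operator for the recurrence x_k = a_k x_{k-1} + b_k.
import numpy as np

def combine(right, left):
    """(a_r, b_r) . (a_l, b_l) = (a_r * a_l, a_r * b_l + b_r)."""
    a_r, b_r = right
    a_l, b_l = left
    return a_r * a_l, a_r * b_l + b_r

def scan_sequential(a, b):
    x, xs = 0.0, []
    for a_k, b_k in zip(a, b):
        x = a_k * x + b_k
        xs.append(x)
    return np.array(xs)

def prefix_via_operator(a, b):
    """Prefix 'sums' under combine; a parallel scan evaluates the same combines
    as a balanced tree in O(log L) depth instead of this sequential fold."""
    acc, xs = (1.0, 0.0), []                 # (1, 0) is the identity element
    for a_k, b_k in zip(a, b):
        acc = combine((a_k, b_k), acc)
        xs.append(acc[1])                    # the b-component is x_k (since x_0 = 0)
    return np.array(xs)

rng = np.random.default_rng(0)
a, b = rng.uniform(0.5, 1.0, 32), rng.standard_normal(32)
print(np.allclose(scan_sequential(a, b), prefix_via_operator(a, b)))  # True
```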

GPU Optimizations:

  • Avoid storing full \((\bar{A}, \bar{B})\) sequences in HBM
  • Perform discretization and scan in SRAM
  • Kernel fusion strategy similar to FlashAttention

3.4 Mamba Architecture

Mamba Block (replaces Transformer Block):

Input x
├── Linear projection → expand dimensions
├── 1D depthwise convolution (kernel=4)
├── SiLU activation
├── Selective SSM
├── SiLU gating (separate branch)
└── Linear projection → output dimensions

Key Design Choices:

  • No attention mechanism, no MLP block
  • 1D causal convolution provides local context
  • Expansion factor \(E=2\)
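
A heavily simplified NumPy sketch of this data flow (random, untrained weights; no residual connection or normalization; the real Mamba block relies on fused CUDA kernels):

```python
# One Mamba block, end to end, with illustrative shapes and weights.
import numpy as np

def silu(z):
    return z / (1.0 + np.exp(-z))

def softplus(z):
    return np.log1p(np.exp(z))

def mamba_block(x, d_model, d_state=4, expand=2, d_conv=4, rng=None):
    rng = rng or np.random.default_rng(0)
    L, _ = x.shape
    d_inner = expand * d_model
    W = lambda a, b: rng.standard_normal((a, b)) / np.sqrt(a)

    # 1) input projections: SSM branch and gating branch
    u = x @ W(d_model, d_inner)
    gate = x @ W(d_model, d_inner)

    # 2) causal depthwise 1D convolution (kernel=4) + SiLU
    k = rng.standard_normal((d_conv, d_inner)) / d_conv
    u_pad = np.concatenate([np.zeros((d_conv - 1, d_inner)), u])
    u = silu(np.stack([(u_pad[t:t + d_conv] * k).sum(0) for t in range(L)]))

    # 3) selective SSM with a diagonal state (input-dependent B, C, Delta)
    A = -np.exp(rng.standard_normal((d_inner, d_state)))
    W_B, W_C, W_dt = W(d_inner, d_state), W(d_inner, d_state), W(d_inner, d_inner)
    h, y = np.zeros((d_inner, d_state)), np.zeros((L, d_inner))
    for t in range(L):
        B_t, C_t, dt = u[t] @ W_B, u[t] @ W_C, softplus(u[t] @ W_dt)
        h = np.exp(dt[:, None] * A) * h + (dt[:, None] * B_t[None, :]) * u[t][:, None]
        y[t] = h @ C_t

    # 4) SiLU gating from the separate branch, then output projection
    return (y * silu(gate)) @ W(d_inner, d_model)

out = mamba_block(np.random.default_rng(1).standard_normal((16, 8)), d_model=8)
print(out.shape)  # (16, 8)
```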

3.5 Mamba vs Transformer

| Feature | Transformer | Mamba |
| --- | --- | --- |
| Training complexity | \(O(L^2 d)\) | \(O(L d N)\) |
| Inference complexity | \(O(L)\) per token | \(O(1)\) per token |
| KV cache | Grows linearly with sequence length | Fixed-size state \(O(N d)\) |
| Long-sequence capability | Limited by \(L^2\) cost | Linear scaling |
| In-context learning | Strong | Weaker |
| Retrieval capability | Strong | Weaker |

4. Mamba-2: Structured State Space Duality

4.1 SSD Framework

Mamba-2 (Dao & Gu, 2024) reveals the duality between SSMs and attention:

State Space Duality (SSD):

\[ y_t = \sum_{s=1}^{t} C_t^\top A_{t:s} B_s u_s \]

When \(A_t\) is a scalar (a scalar multiple of the identity), this can be written in matrix form:

\[ Y = (L \odot QK^\top) V \]

where \(Q, K, V\) play the roles of \(C, B, u\) respectively, and \(L\) is a lower-triangular mask whose entries are cumulative products of the \(A_t\).

This is a form of structured causal attention!
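
A small NumPy check of this duality in the scalar-\(A\) case, comparing the recurrence against the masked matrix form (random illustrative values):

```python
# SSD identity for scalar A_t: (L ⊙ C Bᵀ) u reproduces the SSM recurrence.
import numpy as np

rng = np.random.default_rng(0)
L_len, N = 16, 4
a = rng.uniform(0.7, 1.0, L_len)               # scalar A_t per step
B = rng.standard_normal((L_len, N))            # B_t   ("keys")
C = rng.standard_normal((L_len, N))            # C_t   ("queries")
u = rng.standard_normal(L_len)                 # input ("values")

# Recurrent form: h_t = a_t h_{t-1} + B_t u_t ; y_t = C_t^T h_t
h, y_rec = np.zeros(N), np.zeros(L_len)
for t in range(L_len):
    h = a[t] * h + B[t] * u[t]
    y_rec[t] = C[t] @ h

# Dual "attention" form: Y = (L ⊙ C Bᵀ) u with L[t, s] = prod_{k=s+1..t} a_k
logcum = np.concatenate([[0.0], np.cumsum(np.log(a))])   # prefix log-products
Lmask = np.tril(np.exp(logcum[1:, None] - logcum[None, 1:]))
y_dual = (Lmask * (C @ B.T)) @ u
print(np.allclose(y_rec, y_dual))  # True
```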

4.2 Chunked Algorithm

Mamba-2 uses a chunking strategy:

  • Divide the sequence into blocks of size \(c\)
  • Within chunks: Use attention form (quadratic, but chunks are small)
  • Between chunks: Use SSM recurrence (linear)
\[ \text{Total complexity} = O(L c + L \cdot N) \quad \text{where } c \text{ is the chunk size} \]
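
A toy sketch of the chunking idea for the scalar-\(A\) case: quadratic "attention" inside each chunk, a single carried state between chunks. This is a didactic check, not the blocked-matmul formulation Mamba-2 actually uses.

```python
# Chunked scalar-A SSD vs. the plain sequential recurrence.
import numpy as np

def ssd_chunked(a, B, C, u, chunk):
    L, N = B.shape
    h, y = np.zeros(N), np.zeros(L)
    for s0 in range(0, L, chunk):
        s1 = min(s0 + chunk, L)
        ac, Bc, Cc, uc = a[s0:s1], B[s0:s1], C[s0:s1], u[s0:s1]
        # decay[i, j] = prod_{k=j+1..i} a_k within the chunk (lower triangular)
        logcum = np.concatenate([[0.0], np.cumsum(np.log(ac))])
        decay = np.tril(np.exp(logcum[1:, None] - logcum[None, 1:]))
        # intra-chunk term: quadratic "attention" inside the chunk
        y[s0:s1] = (decay * (Cc @ Bc.T)) @ uc
        # inter-chunk term: contribution of the state carried from previous chunks
        y[s0:s1] += np.exp(logcum[1:]) * (Cc @ h)
        # update the carried state for the next chunk
        h = np.exp(logcum[-1]) * h + \
            (np.exp(logcum[-1] - logcum[1:])[:, None] * Bc * uc[:, None]).sum(0)
    return y

# Reference: the plain sequential recurrence.
rng = np.random.default_rng(0)
L, N = 32, 4
a = rng.uniform(0.7, 1.0, L)
B, C, u = rng.standard_normal((L, N)), rng.standard_normal((L, N)), rng.standard_normal(L)
h, y_ref = np.zeros(N), np.zeros(L)
for t in range(L):
    h = a[t] * h + B[t] * u[t]
    y_ref[t] = C[t] @ h
print(np.allclose(ssd_chunked(a, B, C, u, chunk=8), y_ref))  # True
```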

4.3 Performance Improvements

  • 2-8x faster than Mamba-1
  • Supports larger state dimensions
  • Better hardware utilization (tensor core friendly)

5. SSM Applications

5.1 Language Modeling

| Model | Architecture | Parameters | Highlights |
| --- | --- | --- | --- |
| Mamba-1 | Pure SSM | 130M-2.8B | Matches same-size Transformers |
| Mamba-2 | SSD | - | Faster and better |
| Jamba | Attention + Mamba + MoE | 52B | Hybrid architecture |
| Zamba | Attention + Mamba | 7B | Shared attention layers |

5.2 Vision

  • Vision Mamba (Vim): Mamba for image classification
  • VMamba: 2D selective scan
  • PlainMamba: Simplified vision Mamba

5.3 Other Domains

  • Audio/speech processing
  • Genomic sequence modeling
  • Time series forecasting
  • Point cloud processing

6. SSM Ecosystem Summary

graph LR
    A[HiPPO 2020] --> B[S4 2022]
    B --> C[S4D 2022]
    B --> D[H3 2023]
    D --> E[Mamba 2023]
    E --> F[Mamba-2 2024]

    B --> G[S5 2023]

    E --> H[Jamba 2024]
    E --> I[Vision Mamba 2024]

| Method | Year | Key Innovation |
| --- | --- | --- |
| HiPPO | 2020 | Optimal state compression |
| S4 | 2022 | Structured SSM + convolutional mode |
| S4D | 2022 | Diagonal simplification |
| H3 | 2023 | SSM + gated attention |
| Mamba | 2023 | Selective SSM + hardware-aware scan |
| Mamba-2 | 2024 | SSD duality + chunked algorithm |

References

  • Gu et al., "Efficiently Modeling Long Sequences with Structured State Spaces," ICLR 2022
  • Gu et al., "On the Parameterization and Initialization of Diagonal State Space Models," NeurIPS 2022
  • Gu & Dao, "Mamba: Linear-Time Sequence Modeling with Selective State Spaces," 2023
  • Dao & Gu, "Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality," ICML 2024
