
Long Sequence Modeling

Overview

Processing ultra-long sequences (tens of thousands to millions of tokens) is a core challenge for sequence models. This chapter covers long-sequence modeling methods beyond Transformers, including xLSTM, RWKV, Hyena, and their trade-offs with Transformers and SSMs.


1. xLSTM: Extended LSTM

1.1 Motivation

Classic LSTMs suffer from three limitations: fixed (scalar) storage capacity per cell, strictly sequential computation that cannot be parallelized across time steps, and consequent difficulty scaling to large models and long sequences.

xLSTM (Beck et al., 2024) addresses these limitations with exponential gating and two new cell designs: sLSTM and mLSTM.

1.2 sLSTM (Scalar LSTM)

Introduces exponential gating:

\[ f_t = \exp(\tilde{f}_t), \quad i_t = \exp(\tilde{i}_t) \]

Uses a normalizer for numerical stability:

\[ c_t = f_t c_{t-1} + i_t z_t, \quad n_t = f_t n_{t-1} + i_t \]
\[ h_t = o_t \odot \frac{c_t}{n_t} \]

where \(z_t\) is the cell input. In practice an additional stabilizer state \(m_t\) keeps the exponential gates within a safe numeric range.
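
As a concrete illustration, here is a minimal NumPy sketch of one sLSTM step. It assumes the gate pre-activations and the cell input come from learned linear projections (omitted here) and implements only the naive recurrence above, without the stabilizer state:

```python
import numpy as np

def slstm_step(c, n, f_pre, i_pre, z, o):
    """One sLSTM recurrence step (naive, unstabilized sketch).

    c, n         : previous cell state and normalizer
    f_pre, i_pre : forget/input gate pre-activations (exponential gating)
    z            : cell input (e.g. tanh of a linear projection)
    o            : output gate activation in (0, 1)
    """
    f = np.exp(f_pre)      # exponential forget gate
    i = np.exp(i_pre)      # exponential input gate
    c = f * c + i * z      # c_t = f_t c_{t-1} + i_t z_t
    n = f * n + i          # n_t = f_t n_{t-1} + i_t
    h = o * (c / n)        # h_t = o_t * c_t / n_t
    return c, n, h
```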

1.3 mLSTM (Matrix LSTM)

Extends the scalar cell state to a matrix:

\[ C_t = f_t C_{t-1} + i_t v_t k_t^\top \]
\[ h_t = o_t \odot \frac{C_t q_t}{\max(|n_t^\top q_t|, 1)} \]

where \(n_t = f_t n_{t-1} + i_t k_t\).

Key Insight: The matrix cell state dramatically increases storage capacity, similar to linear attention.
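
Under the same assumptions (learned projections omitted, no stabilizer state), one mLSTM step looks like this:

```python
import numpy as np

def mlstm_step(C, n, f_pre, i_pre, q, k, v, o):
    """One mLSTM recurrence step (naive, unstabilized sketch).

    C : (d, d) matrix cell state; n : (d,) normalizer state
    q, k, v : query/key/value vectors of dimension d
    f_pre, i_pre : scalar gate pre-activations; o : output gate
    """
    f = np.exp(f_pre)
    i = np.exp(i_pre)
    C = f * C + i * np.outer(v, k)            # C_t = f_t C_{t-1} + i_t v_t k_t^T
    n = f * n + i * k                         # n_t = f_t n_{t-1} + i_t k_t
    h_tilde = (C @ q) / max(abs(n @ q), 1.0)  # C_t q_t / max(|n_t^T q_t|, 1)
    return C, n, o * h_tilde
```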

1.4 xLSTM Architecture

Alternates between sLSTM and mLSTM blocks:

  • sLSTM block: sLSTM + gated residual + LayerNorm
  • mLSTM block: mLSTM + feedforward + LayerNorm
  • mLSTM can be fully parallelized for training

2. RWKV: Linear Attention RNN

2.1 Core Idea

RWKV (Peng et al., 2023) combines Transformer's training parallelism with RNN's inference efficiency.

2.2 WKV Mechanism

The core of RWKV is the WKV (Weighted Key-Value) operation:

\[ \text{wkv}_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} v_i + e^{u+k_t} v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u+k_t}} \]

where \(w\) is the positional decay weight and \(u\) is the current token bonus.

Recurrent Form:

\[ a_t = e^{-w} a_{t-1} + e^{k_t} v_t \]
\[ b_t = e^{-w} b_{t-1} + e^{k_t} \]
\[ \text{wkv}_t = \frac{a_{t-1} + e^{u+k_t} v_t}{b_{t-1} + e^{u+k_t}} \]
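
This recurrence translates directly into code. The sketch below processes a single channel with scalar per-token keys and values; production RWKV kernels additionally carry a running maximum so the exponentials cannot overflow, which the naive version omits:

```python
import numpy as np

def wkv_recurrent(ks, vs, w, u):
    """Naive recurrent WKV for one channel.

    ks, vs : (T,) per-token keys and values
    w      : positional decay (> 0); u : current-token bonus
    """
    a, b = 0.0, 0.0                        # running numerator / denominator
    out = np.zeros(len(ks))
    for t in range(len(ks)):
        e_k = np.exp(ks[t])
        bonus = np.exp(u) * e_k            # e^{u + k_t}
        out[t] = (a + bonus * vs[t]) / (b + bonus)
        a = np.exp(-w) * a + e_k * vs[t]   # a_t = e^{-w} a_{t-1} + e^{k_t} v_t
        b = np.exp(-w) * b + e_k           # b_t = e^{-w} b_{t-1} + e^{k_t}
    return out
```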

2.3 RWKV Architecture

Time-Mixing (analogous to attention):

\[ r_t = W_r \cdot (\mu_r x_t + (1-\mu_r) x_{t-1}) \]

Similarly define \(k_t, v_t\), then:

\[ o_t = \sigma(r_t) \odot \text{wkv}_t \]

Channel-Mixing (analogous to FFN):

\[ o_t = \sigma(r_t) \odot (W_v \cdot \max(k_t, 0)^2) \]
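
A per-token sketch of both mixing steps, assuming the token-shift coefficients \(\mu\) and projection matrices \(W\) are learned parameters supplied by the caller; time-mixing consumes a wkv value produced by the recurrence above:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def token_shift(x, x_prev, mu):
    """Interpolate the current and previous token embeddings."""
    return mu * x + (1.0 - mu) * x_prev

def time_mixing_out(x, x_prev, mu_r, W_r, wkv):
    """Receptance-gated WKV output: o_t = sigma(r_t) * wkv_t."""
    r = W_r @ token_shift(x, x_prev, mu_r)
    return sigmoid(r) * wkv

def channel_mixing(x, x_prev, mu_r, mu_k, W_r, W_k, W_v):
    """FFN analogue: squared-ReLU key projection, gated by receptance."""
    r = W_r @ token_shift(x, x_prev, mu_r)
    k = W_k @ token_shift(x, x_prev, mu_k)
    return sigmoid(r) * (W_v @ np.maximum(k, 0.0) ** 2)
```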

2.4 RWKV Version Evolution

| Version | Key Improvements |
| --- | --- |
| RWKV-4 | Base WKV mechanism |
| RWKV-5 (Eagle) | Matrix-valued state |
| RWKV-6 (Finch) | Data-dependent decay |
| RWKV-7 (Goose) | Dynamic state evolution (generalized delta rule) |

3. Hyena: Implicit Long Convolutions

3.1 Core Idea

Hyena (Poli et al., 2023) replaces attention with implicitly parameterized long convolutions.

3.2 Hyena Operator

Hyena achieves high-order interactions through alternating element-wise multiplication and long convolution:

\[ y = x_1 \odot (h_1 * (x_2 \odot (h_2 * \cdots))) \]

where \(h_i\) are implicitly parameterized convolution kernels (generated by MLP):

\[ h(t) = \text{Window}(t) \cdot \text{MLP}_\phi(\text{PositionalEncoding}(t)) \]
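
To make this concrete, here is a toy NumPy sketch: a causal long convolution computed via FFT (the \(O(L \log L)\) step discussed in the next subsection) and an implicitly parameterized kernel, with a fixed sinusoidal positional encoding and random MLP weights standing in for the learned components of the real Hyena filter:

```python
import numpy as np

def fft_causal_conv(x, h):
    """Causal long convolution y_t = sum_{s<=t} h_s x_{t-s} in O(L log L)."""
    L = len(x)
    n = 2 * L  # zero-pad so the circular FFT convolution becomes linear
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
    return y[:L]

def implicit_kernel(L, rng, width=16):
    """Toy implicit filter: Window(t) * MLP_phi(PositionalEncoding(t))."""
    t = np.linspace(0.0, 1.0, L)[:, None]
    feats = np.concatenate([np.sin(2**j * np.pi * t) for j in range(4)], axis=1)
    W1 = rng.normal(size=(4, width))
    W2 = rng.normal(size=(width,))
    h = np.tanh(feats @ W1) @ W2          # MLP over positional features
    return np.exp(-5.0 * t[:, 0]) * h     # exponentially decaying window

# Order-2 Hyena recursion: y = x1 * (h1 conv (x2 * (h2 conv x3))).
# In real Hyena, x1..x3 are learned linear projections of the input.
rng = np.random.default_rng(0)
L = 1024
x1, x2, x3 = rng.normal(size=(3, L))
h1, h2 = implicit_kernel(L, rng), implicit_kernel(L, rng)
y = x1 * fft_causal_conv(x2 * fft_causal_conv(x3, h2), h1)
```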

3.3 Computational Complexity

  • Long convolution via FFT: \(O(L \log L)\)
  • No quadratic complexity
  • Total: \(O(NL \log L)\), where \(N\) is the Hyena order

4. Linear Attention

4.1 From Standard to Linear Attention

Standard attention:

\[ \text{Attn}(Q, K, V) = \text{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V \]

Linear attention (Katharopoulos et al., 2020):

\[ \text{LinAttn}(Q, K, V) = \phi(Q)(\phi(K)^\top V) \]

Computing \(\phi(K)^\top V\) first produces a \(d \times d\) matrix, reducing the cost from \(O(L^2 d)\) to \(O(L d^2)\), i.e. linear in sequence length.
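
A sketch of the (non-causal) parallel form, with a shifted ReLU standing in for the \(\phi = \mathrm{elu} + 1\) feature map of Katharopoulos et al., and with the usual sum-of-keys denominator in place of softmax normalization:

```python
import numpy as np

def linear_attention(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Non-causal linear attention: phi(Q) (phi(K)^T V), O(L d^2).

    Q, K, V : (L, d) arrays; phi is an elementwise positive feature map.
    """
    KV = phi(K).T @ V           # (d, d) summary, independent of L
    Z = phi(K).sum(axis=0)      # (d,) normalizer (sum of feature-mapped keys)
    return (phi(Q) @ KV) / (phi(Q) @ Z)[:, None]
```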

4.2 Recurrent Form of Linear Attention

\[ S_t = S_{t-1} + \phi(k_t) v_t^\top \]
\[ y_t = \phi(q_t)^\top S_t \]

This is a linear RNN! \(S_t\) is a \(d \times d\) state matrix.
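
The same computation in its causal, streaming form: a \(d \times d\) state and a \(d\)-dimensional normalizer are carried across tokens, so each step costs \(O(d^2)\) regardless of \(t\). A minimal sketch:

```python
import numpy as np

def linear_attention_recurrent(Q, K, V, phi=lambda x: np.maximum(x, 0.0) + 1e-6):
    """Causal linear attention as a linear RNN over a (d, d) state."""
    L, d = Q.shape
    S = np.zeros((d, d))                 # running sum of phi(k_i) v_i^T
    z = np.zeros(d)                      # running sum of phi(k_i)
    Y = np.zeros((L, d))
    for t in range(L):
        S += np.outer(phi(K[t]), V[t])   # S_t = S_{t-1} + phi(k_t) v_t^T
        z += phi(K[t])
        Y[t] = (phi(Q[t]) @ S) / (phi(Q[t]) @ z)   # y_t = phi(q_t)^T S_t, normalized
    return Y
```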

4.3 Limitations of Linear Attention

  • No softmax normalization → underperforms standard attention
  • Choice of feature map \(\phi\) has significant impact
  • RetNet, GLA, and others improve upon this foundation

5. Other Methods

5.1 RetNet

RetNet (Sun et al., 2023): Retentive Network, supporting three computation modes.

  • Parallel mode: Attention-like (training)
  • Recurrent mode: RNN-like (inference)
  • Chunk mode: Hybrid (long-sequence training)

Core: Attention with decay:

\[ \text{Retention}(Q, K, V) = (QK^\top \odot D) V \]

where \(D_{ij} = \gamma^{i-j}\) for \(i \geq j\) and \(D_{ij} = 0\) otherwise; \(\gamma \in (0, 1)\) is the decay factor.
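
A sketch of the parallel mode (the paper assigns a different \(\gamma\) per head; a single example value is used here):

```python
import numpy as np

def retention_parallel(Q, K, V, gamma=0.96875):
    """RetNet parallel form: (Q K^T ⊙ D) V with a causal decay mask."""
    L = Q.shape[0]
    i, j = np.indices((L, L))
    D = np.where(i >= j, float(gamma) ** (i - j), 0.0)  # D_ij = gamma^(i-j), i >= j
    return (Q @ K.T * D) @ V
```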

5.2 GLA (Gated Linear Attention)

\[ S_t = G_t \odot S_{t-1} + k_t v_t^\top \]

Adds data-dependent gating \(G_t\) for improved selectivity.
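
The state update itself is one line; in GLA proper, \(G_t\) is produced from the input (via a low-rank parameterization), which the sketch below abstracts into a caller-supplied gate:

```python
import numpy as np

def gla_step(S, k, v, G):
    """One GLA state update: S_t = G_t ⊙ S_{t-1} + k_t v_t^T.

    S : (d, d) state; G : (d, d) data-dependent gate with entries in (0, 1).
    """
    return G * S + np.outer(k, v)
```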


6. Long Range Arena Benchmark

6.1 Task Descriptions

| Task | Sequence Length | Tests |
| --- | --- | --- |
| ListOps | 2K | Hierarchical reasoning |
| Text | 4K | Text classification |
| Retrieval | 8K | Document matching |
| Image | 1K | Image classification (flattened to a sequence) |
| Pathfinder | 1K | Spatial reasoning |
| Path-X | 16K | Ultra-long spatial reasoning |

6.2 Method Comparison

| Method | LRA Average | Training Mode | Inference Cost |
| --- | --- | --- | --- |
| Transformer | 53.66 | Parallel | \(O(L)\)/token |
| S4 | 86.09 | Parallel (convolution) | \(O(1)\)/token |
| Mamba | ~87 | Parallel (scan) | \(O(1)\)/token |
| Hyena | ~85 | Parallel (FFT) | \(O(\log L)\)/token |
| RWKV | ~83 | Parallel | \(O(1)\)/token |

7. SSM vs Transformer Trade-offs

| Dimension | Transformer | SSM / Linear Models |
| --- | --- | --- |
| Long sequences | \(O(L^2)\) bottleneck | \(O(L)\) or \(O(L \log L)\) |
| In-context learning | Strong | Weaker (fixed-size state bottleneck) |
| Exact retrieval | Strong (global attention) | Weak (information is compressed) |
| Training parallelism | Good | Good (convolution / scan) |
| Inference efficiency | KV cache grows with length | Fixed-size state |
| Hardware friendliness | Dense matrix multiplies | Needs specialized kernels |
| Ecosystem maturity | High | Developing |

Consensus trend: hybrid architectures (e.g., Jamba, Zamba) that interleave attention and SSM layers combine the strengths of both.


8. Summary

Key Takeaways:

  1. xLSTM revives LSTM with matrix states and exponential gating
  2. RWKV combines Transformer-parallel training with RNN-style inference, approaching Transformer performance
  3. Hyena replaces attention with implicit long convolutions
  4. Linear attention bridges SSMs and attention
  5. Hybrid architectures are the current trend

References

  • Beck et al., "xLSTM: Extended Long Short-Term Memory," 2024
  • Peng et al., "RWKV: Reinventing RNNs for the Transformer Era," EMNLP 2023
  • Poli et al., "Hyena Hierarchy: Towards Larger Convolutional Language Models," ICML 2023
  • Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention," ICML 2020
  • Sun et al., "Retentive Network: A Successor to Transformer for Large Language Models," 2023
