Long Sequence Modeling
Overview
Processing ultra-long sequences (tens of thousands to millions of tokens) is a core challenge for sequence models. This chapter surveys long-sequence modeling methods beyond the standard Transformer (xLSTM, RWKV, Hyena, and linear attention) and examines their trade-offs relative to Transformers and SSMs.
1. xLSTM: Extended LSTM
1.1 Motivation
The classic LSTM has well-known limitations: limited storage capacity (a scalar cell state), a strictly sequential recurrence that cannot be parallelized during training, and, as a consequence, difficulty scaling to modern model and context sizes.
xLSTM (Beck et al., 2024) addresses these limitations through two changes: exponential gating and modified memory structures, realized as the sLSTM and mLSTM cells below.
1.2 sLSTM (Scalar LSTM)
sLSTM replaces the sigmoid input gate with an exponential activation, which lets new inputs overwrite old cell contents more aggressively:
\[ i_t = \exp(\tilde{i}_t), \qquad f_t = \sigma(\tilde{f}_t) \ \text{or} \ \exp(\tilde{f}_t), \qquad c_t = f_t\, c_{t-1} + i_t\, z_t \]
where \(z_t\) is the cell input and \(\tilde{i}_t, \tilde{f}_t\) are the gate pre-activations.
Because exponential gates are unbounded, a normalizer state rescales the output for numerical stability:
\[ n_t = f_t\, n_{t-1} + i_t, \qquad h_t = o_t \odot \frac{c_t}{n_t} \]
An additional stabilizer state \(m_t = \max(\log f_t + m_{t-1}, \log i_t)\) keeps the exponentials in a numerically safe range.
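To make the recurrence concrete, here is a minimal NumPy sketch of a single sLSTM step under the equations above; the stacked weight layout and the name `slstm_step` are illustrative choices, not the reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slstm_step(x, h_prev, c_prev, n_prev, m_prev, W, R, b):
    """One sLSTM step: exponential input gate, normalizer and stabilizer states.

    W: (4d, input_dim), R: (4d, d), b: (4d,) hold the stacked pre-activations
    for the cell input z and the gates i, f, o.
    """
    pre = W @ x + R @ h_prev + b
    z_pre, i_pre, f_pre, o_pre = np.split(pre, 4)

    z = np.tanh(z_pre)                     # cell input z_t
    log_i = i_pre                          # exponential input gate, kept in log space
    log_f = np.log(sigmoid(f_pre))         # sigmoid forget gate, moved to log space

    m = np.maximum(log_f + m_prev, log_i)  # stabilizer state m_t
    i_gate = np.exp(log_i - m)             # stabilized input gate
    f_gate = np.exp(log_f + m_prev - m)    # stabilized forget gate

    c = f_gate * c_prev + i_gate * z       # c_t = f_t c_{t-1} + i_t z_t
    n = f_gate * n_prev + i_gate           # n_t = f_t n_{t-1} + i_t
    h = sigmoid(o_pre) * (c / n)           # h_t = o_t * (c_t / n_t)
    return h, c, n, m
```

The log-space stabilization changes nothing mathematically; it only prevents the exponential gates from overflowing over long sequences.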
1.3 mLSTM (Matrix LSTM)
mLSTM extends the scalar cell state to a \(d \times d\) matrix, updated with an outer-product (key-value) rule and read out with a query:
\[ C_t = f_t\, C_{t-1} + i_t\, v_t k_t^\top, \qquad h_t = o_t \odot \frac{C_t q_t}{\max\!\left(\lvert n_t^\top q_t \rvert,\, 1\right)} \]
where \(n_t = f_t n_{t-1} + i_t k_t\) is the normalizer state.
Key Insight: The matrix cell state dramatically increases storage capacity, similar to linear attention.
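A corresponding single-step sketch of the mLSTM update, assuming per-head vectors \(q_t, k_t, v_t\) and scalar input/forget gates; the log-space stabilizer used in the paper is omitted here for brevity.

```python
import numpy as np

def mlstm_step(q, k, v, i_pre, f_pre, o_pre, C_prev, n_prev):
    """One mLSTM step with a d x d matrix cell state (stabilizer omitted).

    q, k, v: (d,) query/key/value; i_pre, f_pre: scalar gate pre-activations;
    o_pre: (d,) output-gate pre-activation; C_prev: (d, d); n_prev: (d,).
    """
    d = q.shape[0]
    i_t = np.exp(i_pre)                       # exponential input gate
    f_t = 1.0 / (1.0 + np.exp(-f_pre))        # sigmoid forget gate
    o_t = 1.0 / (1.0 + np.exp(-o_pre))        # sigmoid output gate

    k = k / np.sqrt(d)                        # scaled key, as in attention
    C = f_t * C_prev + i_t * np.outer(v, k)   # C_t = f_t C_{t-1} + i_t v_t k_t^T
    n = f_t * n_prev + i_t * k                # n_t = f_t n_{t-1} + i_t k_t
    h_tilde = (C @ q) / max(abs(n @ q), 1.0)  # query readout, normalized
    return o_t * h_tilde, C, n
```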
1.4 xLSTM Architecture
Alternates between sLSTM and mLSTM blocks:
- sLSTM block: sLSTM + gated residual + LayerNorm
- mLSTM block: mLSTM + feedforward + LayerNorm
- mLSTM drops the hidden-to-hidden recurrence, so it can be fully parallelized for training (sLSTM blocks still process tokens sequentially)
2. RWKV: Linear Attention RNN
2.1 Core Idea
RWKV (Peng et al., 2023) combines Transformer's training parallelism with RNN's inference efficiency.
2.2 WKV Mechanism
The core of RWKV is the WKV (Weighted Key-Value) operation, a decay-weighted average of past values plus a bonus for the current token:
\[ wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}} \]
where \(w \ge 0\) is a learned per-channel positional decay and \(u\) is a per-channel bonus for the current token; all operations are element-wise per channel.
Recurrent Form: the numerator and denominator can be carried as two running states,
\[ a_t = e^{-w}\, a_{t-1} + e^{k_t}\, v_t, \qquad b_t = e^{-w}\, b_{t-1} + e^{k_t}, \qquad wkv_t = \frac{a_{t-1} + e^{u + k_t}\, v_t}{b_{t-1} + e^{u + k_t}} \]
so inference needs only constant state per channel.
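The recurrence translates directly into a short loop. The sketch below (the name `wkv_recurrent` is illustrative) assumes per-channel \(w, u\) and omits the max-based numerical stabilization used in actual RWKV implementations.

```python
import numpy as np

def wkv_recurrent(k, v, w, u):
    """Naive per-channel WKV recurrence (no numerical stabilization).

    k, v: (T, d) key and value sequences; w, u: (d,) decay and bonus.
    """
    T, d = k.shape
    a = np.zeros(d)            # running decayed sum of e^{k_i} * v_i
    b = np.zeros(d)            # running decayed sum of e^{k_i}
    out = np.zeros((T, d))
    for t in range(T):
        e_kt = np.exp(k[t])
        # the current token is weighted with the bonus u instead of the decayed history weight
        out[t] = (a + np.exp(u + k[t]) * v[t]) / (b + np.exp(u + k[t]))
        a = np.exp(-w) * a + e_kt * v[t]
        b = np.exp(-w) * b + e_kt
    return out
```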
2.3 RWKV Architecture
Time-Mixing (analogous to attention): each projection first interpolates the current token with the previous one ("token shift"):
\[ r_t = W_r\left(\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1}\right) \]
Similarly define \(k_t, v_t\) with their own mixing coefficients; the block output gates the WKV result with the receptance:
\[ o_t = W_o\left(\sigma(r_t) \odot wkv_t\right) \]
Channel-Mixing (analogous to FFN): the same token shift feeds a squared-ReLU key nonlinearity, again gated by a receptance:
\[ o_t = \sigma(r_t) \odot \left(W_v\, \max(k_t, 0)^2\right), \qquad k_t = W_k\left(\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1}\right) \]
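As a concrete illustration of token shift and the receptance gate, here is a hedged sketch of the channel-mixing sub-block; parameter names such as `mu_r` and `W_v` follow the formulas above but are not tied to any particular RWKV codebase.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_mix(x, x_prev, mu_r, mu_k, W_r, W_k, W_v):
    """RWKV channel-mixing: token shift, squared-ReLU key, receptance gate."""
    xr = mu_r * x + (1.0 - mu_r) * x_prev   # token shift for the receptance path
    xk = mu_k * x + (1.0 - mu_k) * x_prev   # token shift for the key path
    r = sigmoid(W_r @ xr)                   # receptance gate in (0, 1)
    k = np.maximum(W_k @ xk, 0.0) ** 2      # squared ReLU
    return r * (W_v @ k)
```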
2.4 RWKV Version Evolution
| Version | Key Improvements |
|---|---|
| RWKV-4 | Base WKV |
| RWKV-5 (Eagle) | Matrix-valued state |
| RWKV-6 (Finch) | Data-dependent decay |
| RWKV-7 (Goose) | More expressive dynamic state evolution |
3. Hyena: Implicit Long Convolutions
3.1 Core Idea
Hyena (Poli et al., 2023) replaces attention with implicitly parameterized long convolutions.
3.2 Hyena Operator
Hyena achieves high-order, data-controlled interactions by alternating element-wise gating with long convolutions. Given projections \(v, x^1, \dots, x^N\) of the input:
\[ z^1 = v, \qquad z^{n+1} = x^n \odot (h^n * z^n), \qquad y = z^{N+1} \]
where the \(h^n\) are implicitly parameterized convolution kernels: each filter value \(h^n(t)\) is produced by a small FFN applied to a positional encoding of \(t\) (modulated by a decaying window), so the filter length can match the sequence length without a per-position parameter cost.
3.3 Computational Complexity
- Long convolution via FFT: \(O(L \log L)\)
- No quadratic complexity
- Total: \(O(NL \log L)\), where \(N\) is the Hyena order
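The following sketch shows the two ingredients together: an FFT-based causal long convolution and the order-\(N\) gating recurrence. The names `causal_long_conv` and `hyena_operator` are illustrative, and the filters `hs` are passed in as plain arrays; in Hyena proper they would come from the implicit FFN parameterization described above.

```python
import numpy as np

def causal_long_conv(x, h):
    """Causal 1-D convolution of length-L signals via FFT, O(L log L)."""
    L = x.shape[0]
    n = 2 * L  # zero-pad so circular convolution matches linear convolution
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
    return y[:L]

def hyena_operator(v, xs, hs):
    """Order-N Hyena recurrence: alternate long convolution and element-wise gating.

    v:  (L, d) value projection of the input
    xs: list of N gating projections, each (L, d)
    hs: list of N long filters, each (L, d), here given explicitly
    """
    z = v
    for x_n, h_n in zip(xs, hs):
        conv = np.stack([causal_long_conv(z[:, c], h_n[:, c])
                         for c in range(z.shape[1])], axis=1)
        z = x_n * conv  # z^{n+1} = x^n ⊙ (h^n * z^n)
    return z
```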
4. Linear Attention
4.1 From Standard to Linear Attention
Standard attention compares every query against every key, which costs \(O(L^2 d)\):
\[ \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V \]
Linear attention (Katharopoulos et al., 2020) replaces the exponential similarity with a kernel feature map \(\phi\):
\[ o_i = \frac{\phi(q_i)^\top \sum_{j} \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_{j} \phi(k_j)} \]
Computing \(\phi(K)^\top V\) first (a \(d \times d\) matrix) and reusing it for every query reduces the cost to \(O(L d^2)\).
4.2 Recurrent Form of Linear Attention
In the causal (autoregressive) setting, the sums become running states:
\[ S_t = S_{t-1} + \phi(k_t)\, v_t^\top, \qquad z_t = z_{t-1} + \phi(k_t), \qquad o_t = \frac{S_t^\top \phi(q_t)}{z_t^\top \phi(q_t)} \]
This is a linear RNN: \(S_t\) is a \(d \times d\) state matrix and \(z_t\) a \(d\)-dimensional normalizer.
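A small NumPy sketch of the recurrent form, checked against the explicit causal computation; the feature map is the \(\mathrm{elu}(x) + 1\) choice from Katharopoulos et al., and everything else (shapes, seed, function names) is illustrative.

```python
import numpy as np

def phi(x):
    """Positive feature map elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention as a linear RNN with a d x d_v state matrix S_t."""
    L, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of phi(k_j) v_j^T
    z = np.zeros(d)                 # running sum of phi(k_j)
    out = np.zeros_like(V)
    for t in range(L):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S = S + np.outer(k, v)      # S_t = S_{t-1} + phi(k_t) v_t^T
        z = z + k                   # z_t = z_{t-1} + phi(k_t)
        out[t] = (S.T @ q) / (z @ q)
    return out

# Sanity check against the explicit causal form with similarity phi(q)^T phi(k).
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 3))
A = np.tril(phi(Q) @ phi(K).T)                 # causal, unnormalized similarities
ref = (A @ V) / A.sum(axis=1, keepdims=True)
assert np.allclose(linear_attention_recurrent(Q, K, V), ref)
```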
4.3 Limitations of Linear Attention
- Without the softmax's exponential sharpening, attention weights are more diffuse, so quality typically lags standard attention
- Choice of feature map \(\phi\) has significant impact
- RetNet, GLA, and others improve upon this foundation
5. Other Methods
5.1 RetNet
RetNet (Sun et al., 2023): Retentive Network, supporting three computation modes.
- Parallel mode: Attention-like (training)
- Recurrent mode: RNN-like (inference)
- Chunk mode: Hybrid (long-sequence training)
Core: retention is attention with softmax replaced by an exponential positional decay:
\[ \mathrm{Retention}(X) = \left(Q K^\top \odot D\right) V \]
where \(D_{ij} = \gamma^{\,i-j}\) for \(i \geq j\) and \(0\) otherwise, and \(\gamma \in (0, 1)\) is the decay factor.
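Parallel and recurrent retention compute the same result, which the sketch below verifies numerically; the single-head form without group normalization shown here is a simplification of the full RetNet block.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel (training) form: (Q K^T ⊙ D) V with decay matrix D."""
    L = Q.shape[0]
    idx = np.arange(L)
    D = np.where(idx[:, None] >= idx[None, :],
                 gamma ** (idx[:, None] - idx[None, :]), 0.0)  # D_ij = gamma^(i-j), i >= j
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent (inference) form: S_t = gamma S_{t-1} + k_t v_t^T, o_t = q_t S_t."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = np.zeros((Q.shape[0], V.shape[1]))
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
assert np.allclose(retention_parallel(Q, K, V, 0.9), retention_recurrent(Q, K, V, 0.9))
```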
5.2 GLA (Gated Linear Attention)
GLA replaces the fixed decay \(\gamma\) with a data-dependent gate \(G_t\), so the state update takes a form like \(S_t = G_t \odot S_{t-1} + k_t v_t^\top\), improving selectivity over what the state retains.
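A minimal sketch of a gated state update in the spirit of GLA, assuming a per-step gate applied along the key dimension; the exact gate placement and parameterization in GLA differ in detail.

```python
import numpy as np

def gla_recurrent(Q, K, V, G):
    """Gated linear-attention recurrence with a data-dependent decay per step.

    G: (L, d) gates in (0, 1); the state decays as S_t = g_t ⊙ S_{t-1} + k_t v_t^T.
    """
    L, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.zeros((L, V.shape[1]))
    for t in range(L):
        S = G[t][:, None] * S + np.outer(K[t], V[t])  # data-dependent, per-channel decay
        out[t] = Q[t] @ S
    return out
```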
6. Long Range Arena Benchmark
6.1 Task Descriptions
| Task | Sequence Length | Tests |
|---|---|---|
| ListOps | 2K | Hierarchical reasoning |
| Text | 4K | Text classification |
| Retrieval | 8K | Document matching |
| Image | 1K | Image classification (flattened to sequence) |
| PathFinder | 1K | Spatial reasoning |
| Path-X | 16K | Ultra-long spatial reasoning |
6.2 Method Comparison
| Method | LRA Avg. Accuracy (%) | Training Mode | Inference |
|---|---|---|---|
| Transformer | 53.66 | Parallel | \(O(L)\)/token |
| S4 | 86.09 | Parallel (conv) | \(O(1)\)/token |
| Mamba | ~87 | Parallel (scan) | \(O(1)\)/token |
| Hyena | ~85 | Parallel (FFT) | \(O(\log L)\)/token |
| RWKV | ~83 | Parallel | \(O(1)\)/token |
7. SSM vs Transformer Trade-offs
| Dimension | Transformer | SSM/Linear Models |
|---|---|---|
| Long sequences | \(O(L^2)\) limitation | \(O(L)\) or \(O(L\log L)\) |
| In-context learning | Strong | Weaker (fixed state bottleneck) |
| Exact retrieval | Strong (global attention) | Weak (information compressed) |
| Training parallelism | Good | Good (conv/scan) |
| Inference efficiency | KV cache grows | Fixed state |
| Hardware friendliness | Matrix multiply | Needs specialized implementations |
| Ecosystem maturity | High | Developing |
Consensus Trend: Hybrid architectures (e.g., Jamba, Zamba) mix attention and SSM/linear layers to combine the strengths of both.
8. Summary
Key Takeaways:
- xLSTM revives LSTM with matrix states and exponential gating
- RWKV keeps Transformer-style training parallelism while retaining RNN inference efficiency, approaching Transformer performance
- Hyena replaces attention with implicit long convolutions
- Linear attention bridges SSMs and attention
- Hybrid architectures are the current trend
References
- Beck et al., "xLSTM: Extended Long Short-Term Memory," 2024
- Peng et al., "RWKV: Reinventing RNNs for the Transformer Era," EMNLP 2023
- Poli et al., "Hyena Hierarchy: Towards Larger Convolutional Language Models," ICML 2023
- Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention," ICML 2020
- Sun et al., "Retentive Network: A Successor to Transformer for Large Language Models," 2023