Long Sequence Modeling
Overview
Processing ultra-long sequences (tens of thousands to millions of tokens) is a core challenge for sequence models. This chapter surveys long-sequence modeling methods beyond the standard Transformer (xLSTM, RWKV, Hyena, and linear attention) and examines their trade-offs relative to Transformers and SSMs.
1. xLSTM: Extended LSTM
1.1 Motivation
The classic LSTM has well-known limitations: limited storage capacity (a scalar cell state), a strictly sequential recurrence that cannot be parallelized during training, and, as a consequence, difficulty scaling to modern model and context sizes.
xLSTM (Beck et al., 2024) addresses these limitations through two changes: exponential gating and modified memory structures, realized as the sLSTM and mLSTM cells below.
1.2 sLSTM (Scalar LSTM)
sLSTM replaces the sigmoid input gate with an exponential activation, which lets new inputs overwrite old cell contents more aggressively:
\[ i_t = \exp(\tilde{i}_t), \qquad f_t = \sigma(\tilde{f}_t) \ \text{or} \ \exp(\tilde{f}_t), \qquad c_t = f_t\, c_{t-1} + i_t\, z_t \]
where \(z_t\) is the cell input and \(\tilde{i}_t, \tilde{f}_t\) are the gate pre-activations.
Because exponential gates are unbounded, a normalizer state rescales the output for numerical stability:
\[ n_t = f_t\, n_{t-1} + i_t, \qquad h_t = o_t \odot \frac{c_t}{n_t} \]
An additional stabilizer state \(m_t = \max(\log f_t + m_{t-1}, \log i_t)\) keeps the exponentials in a numerically safe range.
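To make the recurrence concrete, here is a minimal NumPy sketch of a single sLSTM step under the equations above; the stacked weight layout and the name `slstm_step` are illustrative choices, not the reference implementation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def slstm_step(x, h_prev, c_prev, n_prev, m_prev, W, R, b):
    """One sLSTM step: exponential input gate, normalizer and stabilizer states.

    W: (4d, input_dim), R: (4d, d), b: (4d,) hold the stacked pre-activations
    for the cell input z and the gates i, f, o.
    """
    pre = W @ x + R @ h_prev + b
    z_pre, i_pre, f_pre, o_pre = np.split(pre, 4)

    z = np.tanh(z_pre)                     # cell input z_t
    log_i = i_pre                          # exponential input gate, kept in log space
    log_f = np.log(sigmoid(f_pre))         # sigmoid forget gate, moved to log space

    m = np.maximum(log_f + m_prev, log_i)  # stabilizer state m_t
    i_gate = np.exp(log_i - m)             # stabilized input gate
    f_gate = np.exp(log_f + m_prev - m)    # stabilized forget gate

    c = f_gate * c_prev + i_gate * z       # c_t = f_t c_{t-1} + i_t z_t
    n = f_gate * n_prev + i_gate           # n_t = f_t n_{t-1} + i_t
    h = sigmoid(o_pre) * (c / n)           # h_t = o_t * (c_t / n_t)
    return h, c, n, m
```

The log-space stabilization changes nothing mathematically; it only prevents the exponential gates from overflowing over long sequences.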
1.3 mLSTM (Matrix LSTM)
mLSTM extends the scalar cell state to a \(d \times d\) matrix, updated with an outer-product (key-value) rule and read out with a query:
\[ C_t = f_t\, C_{t-1} + i_t\, v_t k_t^\top, \qquad h_t = o_t \odot \frac{C_t q_t}{\max\!\left(\lvert n_t^\top q_t \rvert,\, 1\right)} \]
where \(n_t = f_t n_{t-1} + i_t k_t\) is the normalizer state.
Key Insight: The matrix cell state dramatically increases storage capacity, similar to linear attention.
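A corresponding single-step sketch of the mLSTM update, assuming per-head vectors \(q_t, k_t, v_t\) and scalar input/forget gates; the log-space stabilizer used in the paper is omitted here for brevity.

```python
import numpy as np

def mlstm_step(q, k, v, i_pre, f_pre, o_pre, C_prev, n_prev):
    """One mLSTM step with a d x d matrix cell state (stabilizer omitted).

    q, k, v: (d,) query/key/value; i_pre, f_pre: scalar gate pre-activations;
    o_pre: (d,) output-gate pre-activation; C_prev: (d, d); n_prev: (d,).
    """
    d = q.shape[0]
    i_t = np.exp(i_pre)                       # exponential input gate
    f_t = 1.0 / (1.0 + np.exp(-f_pre))        # sigmoid forget gate
    o_t = 1.0 / (1.0 + np.exp(-o_pre))        # sigmoid output gate

    k = k / np.sqrt(d)                        # scaled key, as in attention
    C = f_t * C_prev + i_t * np.outer(v, k)   # C_t = f_t C_{t-1} + i_t v_t k_t^T
    n = f_t * n_prev + i_t * k                # n_t = f_t n_{t-1} + i_t k_t
    h_tilde = (C @ q) / max(abs(n @ q), 1.0)  # query readout, normalized
    return o_t * h_tilde, C, n
```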
1.4 xLSTM Architecture
Alternates between sLSTM and mLSTM blocks:
- sLSTM block: sLSTM + gated residual + LayerNorm
- mLSTM block: mLSTM + feedforward + LayerNorm
- mLSTM drops the hidden-to-hidden recurrence, so it can be fully parallelized for training (sLSTM blocks still process tokens sequentially)
2. RWKV: Linear Attention RNN
2.1 Core Idea
RWKV (Peng et al., 2023) combines Transformer's training parallelism with RNN's inference efficiency.
2.2 WKV Mechanism
The core of RWKV is the WKV (Weighted Key-Value) operation, a decay-weighted average of past values plus a bonus for the current token:
\[ wkv_t = \frac{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i}\, v_i + e^{u + k_t}\, v_t}{\sum_{i=1}^{t-1} e^{-(t-1-i)w + k_i} + e^{u + k_t}} \]
where \(w \ge 0\) is a learned per-channel positional decay and \(u\) is a per-channel bonus for the current token; all operations are element-wise per channel.
Recurrent Form: the numerator and denominator can be carried as two running states,
\[ a_t = e^{-w}\, a_{t-1} + e^{k_t}\, v_t, \qquad b_t = e^{-w}\, b_{t-1} + e^{k_t}, \qquad wkv_t = \frac{a_{t-1} + e^{u + k_t}\, v_t}{b_{t-1} + e^{u + k_t}} \]
so inference needs only constant state per channel.
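The recurrence translates directly into a short loop. The sketch below (the name `wkv_recurrent` is illustrative) assumes per-channel \(w, u\) and omits the max-based numerical stabilization used in actual RWKV implementations.

```python
import numpy as np

def wkv_recurrent(k, v, w, u):
    """Naive per-channel WKV recurrence (no numerical stabilization).

    k, v: (T, d) key and value sequences; w, u: (d,) decay and bonus.
    """
    T, d = k.shape
    a = np.zeros(d)            # running decayed sum of e^{k_i} * v_i
    b = np.zeros(d)            # running decayed sum of e^{k_i}
    out = np.zeros((T, d))
    for t in range(T):
        e_kt = np.exp(k[t])
        # the current token is weighted with the bonus u instead of the decayed history weight
        out[t] = (a + np.exp(u + k[t]) * v[t]) / (b + np.exp(u + k[t]))
        a = np.exp(-w) * a + e_kt * v[t]
        b = np.exp(-w) * b + e_kt
    return out
```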
2.3 RWKV Architecture
Time-Mixing (analogous to attention): each projection first interpolates the current token with the previous one ("token shift"):
\[ r_t = W_r\left(\mu_r \odot x_t + (1-\mu_r) \odot x_{t-1}\right) \]
Similarly define \(k_t, v_t\) with their own mixing coefficients; the block output gates the WKV result with the receptance:
\[ o_t = W_o\left(\sigma(r_t) \odot wkv_t\right) \]
Channel-Mixing (analogous to FFN): the same token shift feeds a squared-ReLU key nonlinearity, again gated by a receptance:
\[ o_t = \sigma(r_t) \odot \left(W_v\, \max(k_t, 0)^2\right), \qquad k_t = W_k\left(\mu_k \odot x_t + (1-\mu_k) \odot x_{t-1}\right) \]
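As a concrete illustration of token shift and the receptance gate, here is a hedged sketch of the channel-mixing sub-block; parameter names such as `mu_r` and `W_v` follow the formulas above but are not tied to any particular RWKV codebase.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_mix(x, x_prev, mu_r, mu_k, W_r, W_k, W_v):
    """RWKV channel-mixing: token shift, squared-ReLU key, receptance gate."""
    xr = mu_r * x + (1.0 - mu_r) * x_prev   # token shift for the receptance path
    xk = mu_k * x + (1.0 - mu_k) * x_prev   # token shift for the key path
    r = sigmoid(W_r @ xr)                   # receptance gate in (0, 1)
    k = np.maximum(W_k @ xk, 0.0) ** 2      # squared ReLU
    return r * (W_v @ k)
```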
2.4 RWKV Version Evolution
| Version | Key Improvements |
|---|---|
| RWKV-4 | Base WKV |
| RWKV-5 (Eagle) | Matrix-valued state |
| RWKV-6 (Finch) | Data-dependent decay |
| RWKV-7 (Goose) | More expressive dynamic state evolution |
3. Hyena: Implicit Long Convolutions
3.1 Core Idea
Hyena (Poli et al., 2023) replaces attention with implicitly parameterized long convolutions.
3.2 Hyena Operator
Hyena achieves high-order, data-controlled interactions by alternating element-wise gating with long convolutions. Given projections \(v, x^1, \dots, x^N\) of the input:
\[ z^1 = v, \qquad z^{n+1} = x^n \odot (h^n * z^n), \qquad y = z^{N+1} \]
where the \(h^n\) are implicitly parameterized convolution kernels: each filter value \(h^n(t)\) is produced by a small FFN applied to a positional encoding of \(t\) (modulated by a decaying window), so the filter length can match the sequence length without a per-position parameter cost.
3.3 Computational Complexity
- Long convolution via FFT: \(O(L \log L)\)
- No quadratic complexity
- Total: \(O(NL \log L)\), where \(N\) is the Hyena order
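The following sketch shows the two ingredients together: an FFT-based causal long convolution and the order-\(N\) gating recurrence. The names `causal_long_conv` and `hyena_operator` are illustrative, and the filters `hs` are passed in as plain arrays; in Hyena proper they would come from the implicit FFN parameterization described above.

```python
import numpy as np

def causal_long_conv(x, h):
    """Causal 1-D convolution of length-L signals via FFT, O(L log L)."""
    L = x.shape[0]
    n = 2 * L  # zero-pad so circular convolution matches linear convolution
    y = np.fft.irfft(np.fft.rfft(x, n) * np.fft.rfft(h, n), n)
    return y[:L]

def hyena_operator(v, xs, hs):
    """Order-N Hyena recurrence: alternate long convolution and element-wise gating.

    v:  (L, d) value projection of the input
    xs: list of N gating projections, each (L, d)
    hs: list of N long filters, each (L, d), here given explicitly
    """
    z = v
    for x_n, h_n in zip(xs, hs):
        conv = np.stack([causal_long_conv(z[:, c], h_n[:, c])
                         for c in range(z.shape[1])], axis=1)
        z = x_n * conv  # z^{n+1} = x^n ⊙ (h^n * z^n)
    return z
```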
4. Linear Attention
4.1 From Standard to Linear Attention
Standard attention compares every query against every key, which costs \(O(L^2 d)\):
\[ \mathrm{Attn}(Q, K, V) = \mathrm{softmax}\!\left(\frac{Q K^\top}{\sqrt{d}}\right) V \]
Linear attention (Katharopoulos et al., 2020) replaces the exponential similarity with a kernel feature map \(\phi\):
\[ o_i = \frac{\phi(q_i)^\top \sum_{j} \phi(k_j)\, v_j^\top}{\phi(q_i)^\top \sum_{j} \phi(k_j)} \]
Computing \(\phi(K)^\top V\) first (a \(d \times d\) matrix) and reusing it for every query reduces the cost to \(O(L d^2)\).
4.2 Recurrent Form of Linear Attention
In the causal (autoregressive) setting, the sums become running states:
\[ S_t = S_{t-1} + \phi(k_t)\, v_t^\top, \qquad z_t = z_{t-1} + \phi(k_t), \qquad o_t = \frac{S_t^\top \phi(q_t)}{z_t^\top \phi(q_t)} \]
This is a linear RNN: \(S_t\) is a \(d \times d\) state matrix and \(z_t\) a \(d\)-dimensional normalizer.
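A small NumPy sketch of the recurrent form, checked against the explicit causal computation; the feature map is the \(\mathrm{elu}(x) + 1\) choice from Katharopoulos et al., and everything else (shapes, seed, function names) is illustrative.

```python
import numpy as np

def phi(x):
    """Positive feature map elu(x) + 1."""
    return np.where(x > 0, x + 1.0, np.exp(x))

def linear_attention_recurrent(Q, K, V):
    """Causal linear attention as a linear RNN with a d x d_v state matrix S_t."""
    L, d = Q.shape
    S = np.zeros((d, V.shape[1]))   # running sum of phi(k_j) v_j^T
    z = np.zeros(d)                 # running sum of phi(k_j)
    out = np.zeros_like(V)
    for t in range(L):
        q, k, v = phi(Q[t]), phi(K[t]), V[t]
        S = S + np.outer(k, v)      # S_t = S_{t-1} + phi(k_t) v_t^T
        z = z + k                   # z_t = z_{t-1} + phi(k_t)
        out[t] = (S.T @ q) / (z @ q)
    return out

# Sanity check against the explicit causal form with similarity phi(q)^T phi(k).
rng = np.random.default_rng(0)
Q, K, V = rng.normal(size=(5, 4)), rng.normal(size=(5, 4)), rng.normal(size=(5, 3))
A = np.tril(phi(Q) @ phi(K).T)                 # causal, unnormalized similarities
ref = (A @ V) / A.sum(axis=1, keepdims=True)
assert np.allclose(linear_attention_recurrent(Q, K, V), ref)
```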
4.3 Limitations of Linear Attention
- Without the softmax's exponential sharpening, attention weights are more diffuse, so quality typically lags standard attention
- Choice of feature map \(\phi\) has significant impact
- RetNet, GLA, and others improve upon this foundation
5. Other Methods
5.1 RetNet
RetNet (Sun et al., 2023): Retentive Network, supporting three computation modes.
- Parallel mode: Attention-like (training)
- Recurrent mode: RNN-like (inference)
- Chunk mode: Hybrid (long-sequence training)
Core: retention is attention with softmax replaced by an exponential positional decay:
\[ \mathrm{Retention}(X) = \left(Q K^\top \odot D\right) V \]
where \(D_{ij} = \gamma^{\,i-j}\) for \(i \geq j\) and \(0\) otherwise, and \(\gamma \in (0, 1)\) is the decay factor.
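Parallel and recurrent retention compute the same result, which the sketch below verifies numerically; the single-head form without group normalization shown here is a simplification of the full RetNet block.

```python
import numpy as np

def retention_parallel(Q, K, V, gamma):
    """Parallel (training) form: (Q K^T ⊙ D) V with decay matrix D."""
    L = Q.shape[0]
    idx = np.arange(L)
    D = np.where(idx[:, None] >= idx[None, :],
                 gamma ** (idx[:, None] - idx[None, :]), 0.0)  # D_ij = gamma^(i-j), i >= j
    return (Q @ K.T * D) @ V

def retention_recurrent(Q, K, V, gamma):
    """Recurrent (inference) form: S_t = gamma S_{t-1} + k_t v_t^T, o_t = q_t S_t."""
    S = np.zeros((Q.shape[1], V.shape[1]))
    out = np.zeros((Q.shape[0], V.shape[1]))
    for t in range(Q.shape[0]):
        S = gamma * S + np.outer(K[t], V[t])
        out[t] = Q[t] @ S
    return out

rng = np.random.default_rng(1)
Q, K, V = rng.normal(size=(6, 4)), rng.normal(size=(6, 4)), rng.normal(size=(6, 4))
assert np.allclose(retention_parallel(Q, K, V, 0.9), retention_recurrent(Q, K, V, 0.9))
```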
5.2 GLA (Gated Linear Attention)
GLA replaces the fixed decay \(\gamma\) with a data-dependent gate \(G_t\), so the state update takes a form like \(S_t = G_t \odot S_{t-1} + k_t v_t^\top\), improving selectivity over what the state retains.
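A minimal sketch of a gated state update in the spirit of GLA, assuming a per-step gate applied along the key dimension; the exact gate placement and parameterization in GLA differ in detail.

```python
import numpy as np

def gla_recurrent(Q, K, V, G):
    """Gated linear-attention recurrence with a data-dependent decay per step.

    G: (L, d) gates in (0, 1); the state decays as S_t = g_t ⊙ S_{t-1} + k_t v_t^T.
    """
    L, d = Q.shape
    S = np.zeros((d, V.shape[1]))
    out = np.zeros((L, V.shape[1]))
    for t in range(L):
        S = G[t][:, None] * S + np.outer(K[t], V[t])  # data-dependent, per-channel decay
        out[t] = Q[t] @ S
    return out
```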
6. Long Range Arena Benchmark
6.1 Task Descriptions
| Task | Sequence Length | Tests |
|---|---|---|
| ListOps | 2K | Hierarchical reasoning |
| Text | 4K | Text classification |
| Retrieval | 8K | Document matching |
| Image | 1K | Image classification (flattened to sequence) |
| PathFinder | 1K | Spatial reasoning |
| Path-X | 16K | Ultra-long spatial reasoning |
6.2 Method Comparison
| Method | LRA Avg. Accuracy (%) | Training Mode | Inference |
|---|---|---|---|
| Transformer | 53.66 | Parallel | \(O(L)\)/token |
| S4 | 86.09 | Parallel (conv) | \(O(1)\)/token |
| Mamba | ~87 | Parallel (scan) | \(O(1)\)/token |
| Hyena | ~85 | Parallel (FFT) | \(O(\log L)\)/token |
| RWKV | ~83 | Parallel | \(O(1)\)/token |
7. SSM vs Transformer Trade-offs
| Dimension | Transformer | SSM/Linear Models |
|---|---|---|
| Long sequences | \(O(L^2)\) limitation | \(O(L)\) or \(O(L\log L)\) |
| In-context learning | Strong | Weaker (fixed state bottleneck) |
| Exact retrieval | Strong (global attention) | Weak (information compressed) |
| Training parallelism | Good | Good (conv/scan) |
| Inference efficiency | KV cache grows | Fixed state |
| Hardware friendliness | Matrix multiply | Needs specialized implementations |
| Ecosystem maturity | High | Developing |
Consensus Trend: Hybrid architectures (e.g., Jamba, Zamba) mix attention and SSM/linear layers to combine the strengths of both.
8. Summary
Key Takeaways:
- xLSTM revives LSTM with matrix states and exponential gating
- RWKV keeps Transformer-style training parallelism while retaining RNN inference efficiency, approaching Transformer performance
- Hyena replaces attention with implicit long convolutions
- Linear attention bridges SSMs and attention
- Hybrid architectures are the current trend
References
- Beck et al., "xLSTM: Extended Long Short-Term Memory," 2024
- Peng et al., "RWKV: Reinventing RNNs for the Transformer Era," EMNLP 2023
- Poli et al., "Hyena Hierarchy: Towards Larger Convolutional Language Models," ICML 2023
- Katharopoulos et al., "Transformers are RNNs: Fast Autoregressive Transformers with Linear Attention," ICML 2020
- Sun et al., "Retentive Network: A Successor to Transformer for Large Language Models," 2023