NDT Series and Transformer
The NDT (Neural Data Transformer) series, released by Joel Ye and colleagues between 2021 and 2024, brought the Transformer architecture to neural signal decoding. NDT3 (NeurIPS 2024) marks the critical leap: a general-purpose BCI foundation model pretrained on 200+ datasets and 500+ hours of data.
1. NDT1: BERT for Neurons
Ye & Pandarinath (2021, NeurIPS) introduced NDT1:
Setup
- Input: a spike-rate time series (T, N) — T time steps, N neurons
- Architecture: a standard Transformer encoder
- Task: masked modeling, i.e., randomly mask time steps and reconstruct them (a minimal sketch follows this list)
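A minimal PyTorch sketch of this masked-modeling setup, assuming illustrative layer sizes, a 25% mask ratio, and a Poisson reconstruction loss; none of these are NDT1's exact configuration:

```python
import torch
import torch.nn as nn

class MaskedSpikeTransformer(nn.Module):
    """NDT1-style encoder sketch: mask random time bins, reconstruct firing rates."""
    def __init__(self, n_neurons: int, d_model: int = 128, n_layers: int = 4,
                 n_heads: int = 4, max_len: int = 512):
        super().__init__()
        self.embed = nn.Linear(n_neurons, d_model)                  # one token per time bin
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))   # learned positions
        self.mask_token = nn.Parameter(torch.zeros(d_model))
        layer = nn.TransformerEncoderLayer(d_model, n_heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)
        self.readout = nn.Linear(d_model, n_neurons)                # predicted rate per neuron

    def forward(self, spikes, mask_ratio: float = 0.25):
        # spikes: (batch, T, N) binned spike counts
        x = self.embed(spikes)
        mask = torch.rand(x.shape[:2], device=x.device) < mask_ratio   # (batch, T)
        x = torch.where(mask.unsqueeze(-1), self.mask_token, x)        # blank masked bins
        x = x + self.pos[:, :x.size(1)]
        rates = self.readout(self.encoder(x)).exp()                    # keep rates positive
        return rates, mask

model = MaskedSpikeTransformer(n_neurons=96)
spikes = torch.poisson(torch.full((8, 50, 96), 2.0))    # fake 500 ms of data at 10 ms bins
rates, mask = model(spikes)
# Poisson negative log-likelihood evaluated only on the masked bins
loss = nn.PoissonNLLLoss(log_input=False)(rates[mask], spikes[mask])
loss.backward()
```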
Comparison with LFADS
| | LFADS | NDT1 |
|---|---|---|
| Structure | VAE + GRU | Transformer |
| Training | Reconstruction ELBO | Masked reconstruction |
| Inference | Limited latent dimension | Whole sequence |
| Performance (NLB) | Good | Better |
NDT1 surpasses LFADS on the Neural Latents Benchmark, an early sign that Transformers are better suited to neural population data than RNN-based VAEs.
2. NDT2: Cross-Session Alignment
With NDT2, Ye et al. (2023, bioRxiv → ICLR 2024) tackle the cross-session channel-misalignment problem:
The Problem
"Neuron #42" from the same Utah array across different sessions may not be the same neuron (electrode drift, neuron loss). A naive model treats each session as a different dataset.
The Solution
- One embedding per channel: similar to BERT's token embedding
- Session embedding: each session gets a learnable vector
- Context learning: a small amount of data from a new session suffices for fast alignment (see the embedding sketch after this list)
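A minimal sketch of how per-channel and per-session learnable embeddings might be attached to the token stream; the shapes and the choice to combine them by addition are illustrative assumptions, not NDT2's exact implementation:

```python
import torch
import torch.nn as nn

class SessionAwareEmbedding(nn.Module):
    """Sketch of NDT2-style context embeddings: each channel and each recorded
    session gets its own learnable vector, added to the spike-count embedding."""
    def __init__(self, n_channels: int, n_sessions: int, d_model: int = 128):
        super().__init__()
        self.count_proj = nn.Linear(1, d_model)               # embed a scalar spike count
        self.channel_emb = nn.Embedding(n_channels, d_model)  # identity of "channel #42"
        self.session_emb = nn.Embedding(n_sessions, d_model)  # compensates session drift

    def forward(self, counts, session_id):
        # counts: (batch, T, C) binned counts; session_id: (batch,) integer ids
        B, T, C = counts.shape
        x = self.count_proj(counts.unsqueeze(-1))                   # (B, T, C, d)
        x = x + self.channel_emb.weight.view(1, 1, C, -1)           # per-channel identity
        x = x + self.session_emb(session_id).view(B, 1, 1, -1)      # per-session context
        return x.flatten(1, 2)                                      # (B, T*C, d) token stream

emb = SessionAwareEmbedding(n_channels=96, n_sessions=12)
tokens = emb(torch.poisson(torch.full((2, 20, 96), 1.0)), torch.tensor([0, 7]))
print(tokens.shape)  # torch.Size([2, 1920, 128])
```

Under this setup, adapting to a new session mainly means fitting a fresh session vector, which is one reason a small calibration set can suffice.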
Significance
NDT2 shows that one model can work across sessions and subjects — the key prerequisite for a neural foundation model.
3. NDT3: Neural Foundation Model
Azabou, Ye et al. (2024, NeurIPS) present NDT3, a landmark:
Scale
- 200+ datasets (BrainGate, Pitt, Shenoy lab, etc.)
- 500+ hours of electrophysiology data
- 100+ subjects (monkeys, humans)
- 1B+ parameters in a Transformer
Architectural Innovations
- PerceiverIO-style cross-attention into a fixed-size latent array (sketched after this list)
- Unit tokenization: one query per unit, naturally supporting variable-length inputs
- Rotary position embedding for the time axis
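A minimal sketch of the Perceiver-style readin with illustrative sizes; the point is that the latent array stays fixed while the number of unit tokens can vary from session to session:

```python
import torch
import torch.nn as nn

class LatentCrossAttention(nn.Module):
    """Sketch of PerceiverIO-style readin: a fixed set of learned latent queries
    cross-attends over a variable-length stream of unit tokens, so compute scales
    with the latent count rather than with the number of recorded units."""
    def __init__(self, d_model: int = 256, n_latents: int = 64, n_heads: int = 8):
        super().__init__()
        self.latents = nn.Parameter(torch.randn(n_latents, d_model) * 0.02)
        self.cross_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, unit_tokens):
        # unit_tokens: (batch, n_units_in_this_session, d_model); n_units may vary
        B = unit_tokens.size(0)
        queries = self.latents.unsqueeze(0).expand(B, -1, -1)
        latents, _ = self.cross_attn(queries, unit_tokens, unit_tokens)
        return latents            # (batch, n_latents, d_model), always fixed size

readin = LatentCrossAttention()
for n_units in (64, 137, 412):                      # different arrays / sessions
    x = torch.randn(2, n_units, 256)
    print(readin(x).shape)                          # always torch.Size([2, 64, 256])
```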
Pretrain + Fine-tune
- Pretraining: masked autoencoding
- Fine-tuning: small amounts of labeled data adapt the model to specific tasks (gesture, speech, cursor); see the sketch after this list
- Zero-shot: even a new subject can be decoded reasonably without fine-tuning
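A minimal sketch of the fine-tuning recipe under strong assumptions: the backbone below is a stand-in module rather than an actual NDT3 checkpoint, and the readout and optimizer choices are illustrative:

```python
import torch
import torch.nn as nn

# Stand-in for a pretrained encoder; in practice this would be loaded from a
# released checkpoint rather than defined inline.
backbone = nn.Sequential(nn.Linear(96, 256), nn.GELU(), nn.Linear(256, 256))
for p in backbone.parameters():
    p.requires_grad = False             # freeze pretrained weights

readout = nn.Linear(256, 2)             # e.g. 2-D cursor velocity
optimizer = torch.optim.AdamW(readout.parameters(), lr=1e-3)

# Tiny labeled calibration set: (trials, T, neurons) spikes -> (trials, T, 2) velocity
spikes = torch.poisson(torch.full((32, 50, 96), 1.5))
velocity = torch.randn(32, 50, 2)

for step in range(100):
    features = backbone(spikes)                         # frozen features
    loss = nn.functional.mse_loss(readout(features), velocity)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```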
Performance
- 10 minutes of data from a new subject matches performance that traditional methods need hours of data to reach
- Zero-shot across tasks: a gesture model transfers directly to cursor tasks
- SOTA on Neural Latents and FALCON benchmarks
NDT3 is BCI's GPT-3 moment.
4. Why Transformers Excel for BCI
Why are Transformers well-suited to neural data?
- Variable-length sequences: both spike times and neuron counts can vary
- Long-range dependencies: decision and preparation periods can span 500+ ms
- Multi-modal fusion: spikes + LFP + behavioral variables can all be tokenized into one stream (sketched after this list)
- Cross-task reuse: pretrained features generalize across many tasks
- Scaling effects: more data → better performance, following the scaling laws
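A minimal sketch of the multi-modal tokenization idea, with made-up channel counts and a simple learned modality embedding; real models differ in detail:

```python
import torch
import torch.nn as nn

class MultimodalTokenizer(nn.Module):
    """Sketch: project spikes, LFP, and behavior into one d_model token space and
    tag each token with a learned modality embedding, so a single Transformer can
    attend across all of them. Dimensions here are illustrative."""
    def __init__(self, d_model: int = 128):
        super().__init__()
        self.spike_proj = nn.Linear(96, d_model)      # 96 spiking channels per bin
        self.lfp_proj = nn.Linear(32, d_model)        # 32 LFP channels per bin
        self.behav_proj = nn.Linear(2, d_model)       # 2-D cursor velocity per bin
        self.modality_emb = nn.Embedding(3, d_model)  # 0=spikes, 1=LFP, 2=behavior

    def forward(self, spikes, lfp, behavior):
        streams = [
            self.spike_proj(spikes) + self.modality_emb.weight[0],
            self.lfp_proj(lfp) + self.modality_emb.weight[1],
            self.behav_proj(behavior) + self.modality_emb.weight[2],
        ]
        return torch.cat(streams, dim=1)              # (batch, 3*T, d_model)

tok = MultimodalTokenizer()
tokens = tok(torch.randn(4, 50, 96), torch.randn(4, 50, 32), torch.randn(4, 50, 2))
print(tokens.shape)   # torch.Size([4, 150, 128])
```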
5. Other Neural Transformers
BrainBERT (Wang 2023)
A BERT for intracranial recordings: masked spectrogram prediction, learning representations that transfer across tasks.
Neuroformer (Antoniades 2023)
A multi-modal (vision + neural) autoregressive Transformer predicting neural activity + animal behavior.
POYO (Azabou 2023 NeurIPS)
A unified "cross-dataset + cross-subject" architecture — see Neural Foundation Model POYO.
MAE for EEG
EEGPT (Pu 2024): Masked-Autoencoder-style EEG pretraining.
6. Neural Transformers vs. Language Transformers
| Aspect | Language | Neural |
|---|---|---|
| Token | word/subword | spike / bin / channel |
| Vocabulary | Fixed (50K) | Infinite (continuous) |
| Context length | 1K–1M tokens | 0.5–10 s of activity |
| Data scale | TB | GB (growing) |
| Cross-dataset transfer | Straightforward | Requires channel alignment |
The biggest difference: language has labels (the next token is the label), while neural data requires behavior/task labels or a self-supervised objective.
7. Key Architectural Design Choices
Design lessons from NDT3 and successors:
Tokenization
- Per-unit: one token per neuron (variable length)
- Per-bin: one token per time bin (fixed window)
- Hybrid: mixing both dimensions
Per-unit is more flexible but increases sequence length; most modern models use a hybrid, as the toy comparison below illustrates.
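A toy comparison of the two tokenizations' sequence lengths (numbers are illustrative):

```python
import torch

# 1 s of data at 20 ms bins from a 96-channel array (illustrative numbers).
T, N = 50, 96
spikes = torch.poisson(torch.full((T, N), 1.0))

# Per-bin: one token per time bin, each token carries all N channels.
per_bin_tokens = spikes                      # shape (T, N)  -> sequence length T = 50

# Per-unit: one token per (time bin, unit) pair -> flexible but much longer.
per_unit_tokens = spikes.reshape(T * N, 1)   # sequence length T*N = 4800

print(per_bin_tokens.shape, per_unit_tokens.shape)
```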
Positional Encoding
- Absolute: suits fixed tasks
- RoPE (Rotary): suits variable-length, cross-task use (minimal sketch below)
- Learnable per-session: compensates for session differences
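A minimal rotary-embedding sketch (the "rotate-half" formulation) applied to the time axis; the base frequency and sizes are illustrative:

```python
import torch

def rotary_embed(x, base: float = 10000.0):
    """Apply rotary position embedding along the time axis.
    x: (batch, T, d) with even d; each feature pair is rotated by a
    position-dependent angle, so attention scores depend on relative offsets."""
    B, T, d = x.shape
    half = d // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)        # (half,)
    angles = torch.arange(T, dtype=torch.float32)[:, None] * freqs[None, :]  # (T, half)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

q = rotary_embed(torch.randn(2, 50, 64))
k = rotary_embed(torch.randn(2, 50, 64))
scores = q @ k.transpose(-1, -2)          # relative-position-aware attention scores
```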
Attention
- Standard: quadratic complexity, suited to short sequences
- Linear attention / Flash: for long sequences (see the example after this list)
- Cross-attention (Perceiver): fixed latent size, good scalability
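For long windows, one practical option is PyTorch's fused scaled_dot_product_attention, which dispatches to a flash or memory-efficient kernel when hardware and dtypes permit; the sizes below are illustrative:

```python
import torch
import torch.nn.functional as F

# Long-window attention: the fused kernel avoids materializing the full T x T matrix.
B, H, T, d = 4, 8, 2000, 64          # e.g. a 20 s window at 10 ms bins
q = torch.randn(B, H, T, d)
k = torch.randn(B, H, T, d)
v = torch.randn(B, H, T, d)

out = F.scaled_dot_product_attention(q, k, v, is_causal=True)  # (B, H, T, d)
print(out.shape)
```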
8. Online Deployment
Transformer latency challenges:
- Self-attention is \(O(T^2)\)
- A 500 ms window at 10 ms bins gives T = 50, which is easily manageable
- Long windows (2 s+) require flash-attention / KV cache
NDT3-style online inference uses streaming with a rolling window; roughly 10 ms latency is achievable (sketched below).
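A minimal sketch of rolling-window streaming inference with a stand-in decoder; the window length, bin width, and model are illustrative assumptions, not NDT3's deployment code:

```python
import time
import collections
import torch
import torch.nn as nn

# Keep the most recent 50 bins (500 ms at 10 ms per bin) in a deque, re-decode on
# every new bin, and time the per-step latency. The decoder is a stand-in module;
# a real deployment would load a trained model.
decoder = nn.Sequential(nn.Linear(96, 128), nn.GELU(), nn.Linear(128, 2)).eval()
window = collections.deque(maxlen=50)

with torch.no_grad():
    for step in range(200):                                  # simulated bin stream
        new_bin = torch.poisson(torch.full((96,), 1.0))      # latest 10 ms of spikes
        window.append(new_bin)
        t0 = time.perf_counter()
        x = torch.stack(list(window)).unsqueeze(0)           # (1, <=50, 96)
        velocity = decoder(x)[:, -1]                         # decode from the latest bin
        latency_ms = (time.perf_counter() - t0) * 1000       # per-step compute latency
```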
9. Commercial Implications of the NDT Series
Neural foundation models let BCI follow a "pretrain + fine-tune" pathway like NLP:
- Neuralink, Synchron: pretrain in-house or build on open-source foundation models
- Small/mid BCI companies: no need to train from scratch — download from HuggingFace-style "BrainHubs"
- Researchers: a few hundred examples suffice to build a useful BCI
This shifts BCI from "every lab develops on its own" to a "collaborative community ecosystem."
10. Logical Chain
- NDT1 demonstrates Transformer > RNN on neural data, displacing LFADS's VAE-GRU.
- NDT2 solves cross-session channel alignment, breaking the limits of single-session models.
- NDT3 pretrains on 500+ hours of data, opening the BCI foundation-model era.
- Transformers' variable-length, long-dependency, and scalable nature are particularly well-matched to neural data.
- NDT3 = BCI's GPT-3 moment — collaborative community, cross-subject transfer, zero-shot adaptation.
References
- Ye & Pandarinath (2021). Representation learning for neural population activity with Neural Data Transformers. NeurIPS.
- Ye et al. (2024). A unified framework for neural decoding with pretrained transformers (NDT2). ICLR.
- Azabou, Ye et al. (2024). NDT3: A foundation model for neural data. NeurIPS. https://arxiv.org/abs/2407.14668
- Wang et al. (2023). BrainBERT: self-supervised representation learning for intracranial recordings. ICLR. https://openreview.net/forum?id=xmcYx_reUn6
- Antoniades et al. (2024). Neuroformer: multimodal and multitask generative pretraining for brain data. ICLR.