LLM Post-processing Fusion
LLM post-processing / rescoring is a core component of modern speech and handwriting BCIs. The BCI outputs phoneme or character probabilities directly, and the LLM re-ranks the "plausible candidates" into the "most reasonable sentence." This is the key to Willett 2023's WER drop from 23% to 9.1%, and the most concrete form of the BCI × LLM convergence.
1. Why BCI Needs LLMs
Inherent limits of BCI signals
- Spike noise: real firing rates fluctuate within every 20 ms bin
- ECoG spatial blur: crosstalk between adjacent electrodes
- EEG: heavy spatial low-pass filtering by the skull and scalp
The accuracy ceiling of direct decoding is around 70–80%. Reaching a "natural conversation" level (>95%) requires an external prior.
LLM as a strong prior
An LLM knows:
- Vocabulary legality: "xqzr" is not a word
- Grammar: subject-verb-object order
- Common sense: coffee is hot vs iced
- Context: what the previous sentence was about
These priors let the LLM pick the most likely correct option from multiple candidates.
2. Three Levels of Rescoring
Level 1: n-gram LM
The simplest: 3-gram counts + Kneser-Ney smoothing.
- Willett 2023 uses a 3-gram for the first pass
- Speed: extremely fast, <10 ms
- Effect: 30% reduction in word-level errors
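As a concrete illustration, here is a minimal sketch of first-pass n-gram rescoring using NLTK's Kneser-Ney model; the toy corpus and candidate sentences are invented for the example and are not from any real BCI system:

```python
# Rank candidate sentences with a Kneser-Ney-smoothed 3-gram LM (toy data).
from nltk.lm import KneserNeyInterpolated
from nltk.lm.preprocessing import padded_everygram_pipeline

corpus = [
    "the weather is nice today".split(),
    "the weather is cold today".split(),
    "the water is cold".split(),
]
train, vocab = padded_everygram_pipeline(3, corpus)
lm = KneserNeyInterpolated(3)
lm.fit(train, vocab)

def logprob(words, order=3):
    # Sum log P(w_i | w_{i-2}, w_{i-1}) over the padded sentence.
    padded = ["<s>"] * (order - 1) + words + ["</s>"]
    return sum(
        lm.logscore(padded[i], padded[i - order + 1 : i])
        for i in range(order - 1, len(padded))
    )

candidates = ["the weather is nice today", "the water is nice today"]
print(max(candidates, key=lambda s: logprob(s.split())))
# -> "the weather is nice today" (seen trigrams score higher)
```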
Level 2: neural LM (GPT-2 class)
GPT-2 scores each candidate with log-likelihood:
- Willett 2023 uses GPT-2 for a second-pass rescoring
- Latency: ~100 ms (small model)
- Effect: WER 23% → 9.1%
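A sketch of what second-pass rescoring looks like with HuggingFace's GPT-2; the fusion weight and the candidate scores are illustrative placeholders (Willett 2023's actual pipeline differs in implementation details):

```python
# Score Top-K candidates by GPT-2 log-likelihood, fused with the decoder score.
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tok = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2").eval()

@torch.no_grad()
def lm_logprob(text: str) -> float:
    ids = tok(text, return_tensors="pt").input_ids
    # loss is the mean NLL over the len-1 predicted tokens; undo the mean.
    loss = model(ids, labels=ids).loss
    return -loss.item() * (ids.shape[1] - 1)

# (candidate, log P_neural) pairs from the first-pass beam -- placeholders.
candidates = [("the weather is nice today", -12.3),
              ("the weatherice night today", -11.9)]
lam = 0.6  # LM weight, tuned on validation data
best = max(candidates, key=lambda c: c[1] + lam * lm_logprob(c[0]))
print(best[0])  # the garbled candidate gets a much lower LM score
```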
Level 3: large LLM (GPT-4 / Claude)
Send Top-K candidates to the LLM and let it choose:
Prompt: "Based on the context '...previous sentence...',
which of these is most likely?
A. 'the weather is nice today'
B. 'the weatherice night today'
C. 'he whether is night today'
Select A, B, or C."
- Effect: near 0% WER for common phrases
- Latency: 500 ms+ (API call)
- Poor real-time performance — suitable for offline or latency-tolerant scenarios
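For concreteness, a sketch of the Level-3 call with the OpenAI Python client; the model name, prompt wording, and helper function are placeholders, and any chat-completion API would work the same way:

```python
# Ask a large LLM to pick the most plausible candidate given context.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def pick_candidate(context: str, candidates: list[str]) -> str:
    options = "\n".join(f"{chr(65 + i)}. {c}" for i, c in enumerate(candidates))
    resp = client.chat.completions.create(
        model="gpt-4o",  # placeholder; any strong chat model works
        messages=[{
            "role": "user",
            "content": (f"Based on the context '{context}', which of these "
                        f"decoded sentences is most likely?\n{options}\n"
                        "Answer with a single letter."),
        }],
    )
    letter = resp.choices[0].message.content.strip()[0]
    return candidates[ord(letter) - 65]
```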
3. Classic Beam Search + LM Formula
Modern BCIs combine a neural acoustic model with an LM via beam search, ranking hypotheses by the fused score
\[
\text{score}(w) = \lambda_1 \log P_{\text{neural}}(w \mid x) + \lambda_2 \log P_{\text{LM}}(w) + \lambda_3 |w|
\]
where
- \(P_{\text{neural}}(w \mid x)\): output probability of the BCI decoder given neural data \(x\)
- \(P_{\text{LM}}(w)\): LM probability of the hypothesis \(w\)
- \(|w|\): length term (to prevent overly short outputs)
- \(\lambda_{1,2,3}\): weights (typically tuned on a validation set)
Beam size is typically 50–500 — larger is more accurate but slower.
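A compact sketch of this fused beam search; the per-step word distributions and the LM callback are placeholder inputs (real systems decode at the phoneme level with CTC):

```python
def beam_search(step_dists, lm_step, beam_size=50, lam=(1.0, 0.8, 0.5)):
    """step_dists: per-step dicts {word: log P_neural(word | x_t)}.
    lm_step(context, word) -> log P_LM(word | context). Both placeholders."""
    l1, l2, l3 = lam
    beams = [([], 0.0, 0.0)]  # (words, cum. neural logp, cum. LM logp)
    for dist in step_dists:
        expanded = []
        for words, n_lp, lm_lp in beams:
            for w, lp in dist.items():
                expanded.append(
                    (words + [w], n_lp + lp, lm_lp + lm_step(tuple(words), w))
                )
        # Fused score: l1*log P_neural + l2*log P_LM + l3*|w|
        expanded.sort(
            key=lambda b: l1 * b[1] + l2 * b[2] + l3 * len(b[0]), reverse=True
        )
        beams = expanded[:beam_size]
    return beams[0][0]  # highest-scoring word sequence
```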
4. Two-Stage Decoding Architecture
The design shared by Willett 2023 and Card 2024:
Stage 1: RNN + CTC → phoneme probabilities
Stage 2a: CTC beam search + word LM → Top-K sentence candidates
Stage 2b: GPT-2 rescoring → pick the best
Key advantages of this design:
- Stage 1 is fast (real-time)
- Stage 2 is accurate (but the user has to wait a bit)
- A combination of streaming + refinement (sketched below)
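A minimal orchestration sketch of the two stages, with the RNN, CTC beam search, and rescorer passed in as placeholder callables:

```python
def decode_utterance(neural_window, rnn, ctc_beam_search, rescore):
    # Stage 1 (streaming): RNN emits phoneme probabilities in real time.
    phoneme_probs = rnn(neural_window)            # shape (T, n_phonemes)
    # Stage 2a: CTC beam search + word LM -> Top-K sentence candidates.
    topk = ctc_beam_search(phoneme_probs, k=100)  # [(sentence, logp), ...]
    # Stage 2b: neural-LM rescoring -> final output (the user waits here).
    return max(topk, key=lambda cand: rescore(cand[0], cand[1]))
```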
5. LLM Dialogue Layer
In the BCI + LLM + robot stack (see Hierarchical Planning — BCI, LLM, Robot), the LLM is more than a rescorer — it also plays the role of dialogue manager:
Clarification
User says "get thing." LLM:
"What do you want? A. Water B. Phone C. Remote"
User selects A via BCI.
Completion
User says "Give me..." → LLM completes: "...a cup of water? Yes/No"
User nods (BCI-detected) → execute.
Memory
The LLM remembers user preferences and daily patterns:
- "You usually want coffee in the morning. Shall I get one?"
- Reduces repeated intent inputs
6. Real-Time Optimization
Solutions to the latency of large-LLM rescoring:
1. Predictive decoding
The LLM keeps processing partial output without waiting for the user to finish (see the sketch after this list):
- User says "the w..."
- LLM is already predicting "weather," "water," "wolf"
- Upon receiving the next phoneme "e," it rules out "wolf"
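A toy sketch of this prefix pruning; the phoneme lexicon below is an illustrative stand-in for a real pronunciation dictionary:

```python
# Prune word hypotheses as phonemes stream in (lexicon entries are illustrative).
LEXICON = {
    "weather": ("W", "EH", "DH", "ER"),
    "water":   ("W", "AO", "T", "ER"),
    "wolf":    ("W", "UH", "L", "F"),
}

def prune(decoded_phonemes):
    # Keep only words whose pronunciation starts with the decoded prefix.
    n = len(decoded_phonemes)
    return [w for w, ph in LEXICON.items() if ph[:n] == tuple(decoded_phonemes)]

print(prune(["W"]))        # ['weather', 'water', 'wolf']
print(prune(["W", "EH"]))  # ['weather'] -- 'water' and 'wolf' ruled out
```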
2. Speculative decoding
Small model generates quickly, large model verifies and corrects (sketched below):
- Small model: 2 tokens/ms
- Large model verifies every 4 tokens
- 3–5× speedup
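A simplified sketch of the idea using a greedy-acceptance variant; in Leviathan et al.'s formulation, verification is probabilistic and the large model checks all draft tokens in one batched forward pass rather than one call per token:

```python
def speculative_decode(draft_next, target_next, prompt, k=4, max_len=32):
    """draft_next / target_next: tokens -> next token (greedy) for the
    small and large model. Both are placeholder callables."""
    out = list(prompt)
    while len(out) < max_len:
        # Small model drafts k tokens cheaply.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # Large model verifies: keep the longest agreeing prefix, then
        # take one corrected token from the large model on a mismatch.
        for i, t in enumerate(draft):
            expected = target_next(out + draft[:i])
            if expected != t:
                out += draft[:i] + [expected]
                break
        else:
            out += draft  # all k draft tokens accepted
    return out[:max_len]
```

By construction, the output matches what the large model alone would generate greedily; the speedup comes from the small model being right most of the time.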
3. Edge LLM
- Llama-3 8B / Phi-3 3.8B run locally
- Latency 50–100 ms
- Cloud GPT-4 as fallback
4. Caching
BCI → text mappings for frequent phrases can be cached — most daily conversation is templated.
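A minimal caching sketch with functools.lru_cache; llm_rescore here is a placeholder for any of the slow rescoring calls above:

```python
from functools import lru_cache

def llm_rescore(candidates):
    # Placeholder for a slow LLM call (e.g., the GPT-2 rescorer above).
    return max(candidates, key=len)

@lru_cache(maxsize=4096)
def rescore_cached(candidates: tuple[str, ...]) -> str:
    # lru_cache needs hashable arguments, hence a tuple of candidates.
    return llm_rescore(candidates)

rescore_cached(("yes", "no"))  # slow path: calls the LLM
rescore_cached(("yes", "no"))  # fast path: served from the cache
```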
7. LLM-Assisted Learning
LLMs help the BCI not only at inference time but also during training:
Automatic data augmentation
The LLM generates similar training sentences, expanding BCI training data.
Active learning
The LLM flags samples the BCI decoded with uncertainty, asks the user to confirm, and feeds back for reinforcement.
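One simple realization of this flagging, sketched as an entropy threshold over per-word posteriors; the 1.5-nat threshold is an arbitrary illustrative choice:

```python
import math

def flag_uncertain(word_posteriors, threshold=1.5):
    """word_posteriors: [(word, {candidate: prob, ...}), ...].
    Returns words whose posterior entropy (nats) exceeds the threshold."""
    flagged = []
    for word, dist in word_posteriors:
        entropy = -sum(p * math.log(p) for p in dist.values() if p > 0)
        if entropy > threshold:
            flagged.append(word)
    return flagged
```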
User modeling
The LLM learns the user's speech style (formal/casual, vocabulary preferences) for personalized rescoring.
8. Risks of LLM Fusion
1. Over-correction
The LLM may speak for the user — the user wanted to say "iced water" but the LLM, biased toward the common phrase, produces "hot coffee."
2. Intent drift
The LLM may "misunderstand" the user and shift intent.
3. Privacy
Calling an LLM API means user thoughts are processed in the cloud — see Neurorights.
4. Alignment
LLMs may refuse certain outputs (sensitive topics) — does a paralyzed patient still have full freedom of speech?
9. Open-Source and Closed-Source Options
| Option | Access | Pros and cons |
|---|---|---|
| GPT-4 API | Closed | Strongest, but privacy and latency concerns |
| Claude API | Closed | Strong, good alignment |
| Llama-3 local | Open | Better privacy, slightly lower performance |
| Phi-3 3.8B | Open | Feasible on edge devices |
| Custom SLM | Self-trained | BCI-specific, requires data |
BCI companies' choices:
- Neuralink: currently uses an open-source edge LLM + optional cloud
- Synchron: partners with OpenAI
- Precision: not publicly disclosed
10. From Rescoring to Generation
Next step: let the LLM not merely rescore, but directly generate the output.
Neural embedding as soft prompt
- BCI → neural embedding (CEBRA, NDT3)
- Embedding fed into the LLM as input
- LLM directly generates text
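A sketch of the soft-prompt interface using HuggingFace transformers; the 64-dimensional latent, the linear projection, and the GPT-2 backbone are illustrative assumptions (a real system would train the projection on paired neural and text data):

```python
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")
llm = AutoModelForCausalLM.from_pretrained("gpt2")

# Project a neural latent (e.g., from CEBRA/NDT3) into the LLM embedding space.
proj = nn.Linear(64, llm.config.n_embd)  # 64-dim latent: an assumption
neural_latent = torch.randn(1, 10, 64)   # 10 time steps of BCI features (random here)
soft_prompt = proj(neural_latent)        # (1, 10, 768)

# Prepend the soft prompt to ordinary text embeddings and generate.
prompt_ids = tok("Decoded intent:", return_tensors="pt").input_ids
text_emb = llm.get_input_embeddings()(prompt_ids)
inputs = torch.cat([soft_prompt, text_emb], dim=1)
out = llm.generate(inputs_embeds=inputs, max_new_tokens=20)
print(tok.decode(out[0]))  # with inputs_embeds, only new tokens are returned
```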
End-to-end training
- Paired neural + text data
- Joint training of the neural encoder + LLM
- NeuroLM (2024) is an early attempt in this direction
This turns BCI from discrete "neural → text" decoding into continuous generation — analogous to moving from ASR toward diffusion-based speech.
11. Logical Chain
- BCI direct-decoding accuracy is capped at ~80% and needs LLMs to close the gap.
- n-gram → GPT-2 → GPT-4 is a three-stage rescoring hierarchy with progressive gains.
- Beam search + LM weighting is the current standard for BCI decoding.
- LLMs at the dialogue layer extend BCI functions — clarification, completion, memory.
- Real-time optimizations (speculative decoding, edge LLMs) make large LLMs usable.
- LLM fusion introduces new risks — over-correction, privacy, free speech.
- The future direction: neural embeddings as LLM input in end-to-end fusion.
References
- Willett et al. (2023). A high-performance speech neuroprosthesis. Nature. — GPT-2 rescoring
- Bourlard & Morgan (1994). Connectionist speech recognition: a hybrid approach. Springer. — the classic ASR + LM treatment
- Leviathan et al. (2022). Fast inference from transformers via speculative decoding. arXiv. — speculative decoding
- Card et al. (2024). An accurate and rapidly calibrating speech neuroprosthesis. NEJM. — follow-up on BCI + LM integration
- Metzger et al. (2023). A high-performance neuroprosthesis for speech decoding and avatar control. Nature. — multi-stream LLM fusion