Your context window is one million tokens. The model bills you per token in and per token out, and the in-token bill grows linearly with the prompt – but the underlying compute grows quadratically. At a million tokens, the attention step is doing roughly a trillion pairwise calculations. Someone is paying for that. It’s you.
A handful of new architectures claim they can do the same job at linear cost. Some of them can. Some of them can’t. None of them have replaced transformers yet, but at least one of them is going to.
In To LLMs… and Beyond! we mentioned state-space models – specifically Mamba – as the leading post-transformer candidate. That’s accurate but underspecified. There’s a whole research front trying to do better than the transformer at sequence modelling, and the candidates differ in what they’re trying to fix. This post walks through the field.
The point isn’t that transformers are about to be replaced. They aren’t. The point is that the assumption “transformer = the only way” is already broken, and the alternatives are interesting enough to know about before they show up in production.
What’s wrong with transformers
The transformer’s superpower is its attention mechanism: every token can attend to every other token. That’s how it captures long-range dependencies, and it’s why it dominates language modelling.
The cost is also right there in the design. If your sequence has n tokens, the attention step does roughly n² pairwise comparisons. Double the input, quadruple the compute and the memory.
For short sequences this doesn’t matter. For long ones it dominates. A 2,000-token prompt is fine. A 200,000-token prompt is expensive. A 2,000,000-token prompt is, on a vanilla transformer, infeasible.
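To make that concrete, here is a toy single-head attention in plain numpy. The (n, n) score matrix is where the quadratic cost lives; the sizes below are illustrative, not any particular model’s.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention; the (n, n) score matrix is the quadratic part."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n, n): n^2 entries
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (n, d)

n, d = 1_000, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = naive_attention(Q, K, V)          # fine at n=1,000: the score matrix is tiny

for n in (2_000, 200_000, 2_000_000):
    print(f"n={n:>9,}: score matrix holds {n * n:,} values "
          f"(~{n * n * 4 / 1e9:,.2f} GB at fp32)")
```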
The industry has worked around this with engineering – FlashAttention, sliding-window attention, ring attention, KV-cache compression – and the workable context window has stretched from 2k tokens (GPT-3) to 1M+ tokens (Claude, Gemini) over a few years. But the underlying complexity is still quadratic. The workarounds are clever, not free.
Nearly all of the post-transformer architectures share one design goal: sub-quadratic scaling in sequence length. (The diffusion models further down are the exception – their pitch is parallel generation, not cheaper attention.) Beyond that they diverge sharply.
State-space models: Mamba
The most-discussed post-transformer architecture is the state-space model (SSM), and the leading example is Mamba (Gu and Dao, 2023).
The intuition is the one we used in the entry post: instead of every token attending to every other token (the “re-read the book each time” approach), the model maintains a compressed hidden state that gets updated as each token comes in (the “running notes” approach). The cost of updating is constant per token, so the total cost is linear in sequence length, not quadratic.
The catch is that the hidden state is lossy. It’s a fixed-size summary of everything that came before. If a transformer wants to recall the seventh sentence of a hundred-page document, it has the full attention budget to do so. If Mamba wants to recall it, it has to have written something useful about it into the hidden state at the time – and the hidden state has finite capacity.
The Mamba innovation that mattered was making the state update selective – the model learns, token by token, what to write into the hidden state and what to skim past, rather than treating every token equally. This narrowed the gap with transformers significantly, particularly on language modelling benchmarks.
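A minimal sketch of the idea in code. This is not Mamba’s actual parameterisation – just the shape of it: a fixed-size state, one cheap update per token, and an input-dependent gate deciding how much of each token gets written in.

```python
import numpy as np

def selective_scan(xs, W_gate, W_in):
    """Toy selective recurrence: O(n) total cost, fixed-size state, lossy by design.
    Illustrates the idea only; Mamba's real update rule is different."""
    state_size = W_in.shape[0]
    h = np.zeros(state_size)
    outputs = []
    for x in xs:                                   # one constant-cost update per token
        gate = 1 / (1 + np.exp(-(W_gate @ x)))     # input-dependent: how much to write
        h = (1 - gate) * h + gate * (W_in @ x)     # overwrite part of the running notes
        outputs.append(h.copy())
    return np.stack(outputs)

rng = np.random.default_rng(0)
n, d, state_size = 1_000, 16, 32
xs = rng.normal(size=(n, d))
W_gate = rng.normal(size=(state_size, d)) * 0.1
W_in = rng.normal(size=(state_size, d)) * 0.1
print(selective_scan(xs, W_gate, W_in).shape)      # (1000, 32): state never grew past 32
```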
As of 2026, Mamba and Mamba-2 are competitive with transformers of similar size on many language tasks, sometimes superior on tasks involving very long sequences (DNA, audio, ultra-long documents), and sometimes weaker on tasks requiring precise long-range recall (associative memory). The honest summary: Mamba is real, it works, and it hasn’t beaten transformers across the board.
The hybrid approach: Striped Hyena, Jamba
Most serious research on post-transformer architectures has converged on a pragmatic answer: don’t pick one, mix them.
Hyena (Stanford, 2023) replaces attention entirely with sub-quadratic long-convolution blocks; its successor Striped Hyena interleaves those blocks with attention blocks – letting the cheap Hyena blocks do most of the work and the expensive attention blocks handle the parts that genuinely need cross-token comparison.
Jamba (AI21 Labs, 2024) does the same thing but with Mamba blocks: a transformer-Mamba hybrid that uses Mamba layers for efficiency and transformer layers for the kinds of pattern matching transformers are still better at.
The hybrid pattern is now the default assumption for “what comes after the pure transformer.” It’s not “Mamba replaces attention,” it’s “Mamba is a cheap layer that lets you spend your attention budget more carefully.”
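In code terms the hybrid idea is almost boring: a layer stack that is mostly cheap blocks with the occasional attention block. The ratio below is only illustrative – Jamba, for example, uses far more Mamba layers than attention layers.

```python
def hybrid_stack(n_layers: int, attention_every: int = 8) -> list[str]:
    """Illustrative layer layout for a hybrid model: cheap recurrent blocks most of
    the time, full attention occasionally. Real models tune the ratio and placement."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]

print(hybrid_stack(16))
# ['mamba', 'mamba', ..., 'attention', 'mamba', ..., 'attention']  (2 of 16 are attention)
```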
RWKV and RetNet: the RNN comeback
Two other notable lines try to revive the recurrent neural network – the architecture transformers replaced – with modern training tricks.
RWKV (Receptance Weighted Key Value, BlinkDL, 2023+) is an RNN that can be trained like a transformer. Standard RNNs are notoriously slow to train because they’re inherently sequential – token t+1 depends on token t. RWKV reformulates the recurrence in a way that allows parallel training (like a transformer) but sequential inference at constant cost per token (like an RNN). At inference time, an RWKV model uses constant memory regardless of sequence length – the dream that transformers can’t achieve.
RetNet (Retentive Network, Microsoft, 2023) takes a similar approach with a different mechanism. It claims the “impossible triangle”: parallel training, recurrent inference, and strong performance.
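Both lines boil down to a recurrence that can be unrolled in parallel at training time and run step by step at inference. A toy decaying linear recurrence in that spirit – not RWKV’s or RetNet’s actual equations – shows the constant-memory property:

```python
import numpy as np

def recurrent_step(state, x, W_k, W_v, decay=0.95):
    """One inference step of a toy decaying linear recurrence (RWKV/RetNet spirit,
    not their real formulas). All memory lives in the fixed-size `state`."""
    k, v = W_k @ x, W_v @ x
    state = decay * state + np.outer(k, v)          # (d, d) matrix, never grows
    return state, state.T @ k                       # read out against the current key

rng = np.random.default_rng(1)
d = 32
W_k, W_v = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
state = np.zeros((d, d))
for _ in range(10_000):                             # 10k tokens; memory stays (32, 32)
    state, y = recurrent_step(state, rng.normal(size=d), W_k, W_v)
print(state.shape, y.shape)                         # (32, 32) (32,)
```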
Neither has displaced transformers. Both are competitive in their weight classes and both are interesting if you care about deployment cost more than peak quality – a constant-memory inference path is genuinely useful when you’re running models on phones or in tight latency budgets.
Liquid neural networks
Liquid AI (an MIT spin-out) builds on a different research lineage: continuous-time neural networks where the hidden state evolves according to differential equations rather than discrete update steps. The promise is dramatically smaller models (often orders of magnitude smaller) that match the performance of much larger transformers on specific tasks.
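The “continuous-time” part is easier to see in code than in prose: instead of a discrete update h_t = f(h_{t-1}, x_t), the state follows a differential equation that gets integrated numerically. The sketch below is a generic continuous-time RNN step, purely illustrative – not Liquid AI’s actual architecture.

```python
import numpy as np

def continuous_time_step(h, x, W_h, W_x, tau=1.0, dt=0.1, substeps=10):
    """Evolve the hidden state under dh/dt = -h/tau + tanh(W_h h + W_x x) with
    simple Euler integration. A generic continuous-time RNN, for illustration only."""
    for _ in range(substeps):
        dh = -h / tau + np.tanh(W_h @ h + W_x @ x)
        h = h + dt * dh
    return h

rng = np.random.default_rng(2)
d_in, d_h = 8, 16
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_h, d_in)) * 0.1
h = np.zeros(d_h)
for x in rng.normal(size=(50, d_in)):               # feed a short input sequence
    h = continuous_time_step(h, x)
print(h.shape)                                      # (16,): state size is fixed
```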
It’s early. Their language models are interesting and small (Liquid’s LFM-3B punches above its weight), but the wider research community hasn’t replicated the results across the spectrum of language tasks. Worth knowing it exists. Probably not worth deploying yet unless you have a specific reason.
Diffusion for text
Image generation switched from autoregressive to diffusion years ago (DALL-E 1 was autoregressive; DALL-E 2 onwards is diffusion). The natural question: why not the same for text?
The answer for a long time was “because text is discrete and diffusion is continuous.” Recent work has found ways around this: discrete diffusion (operating directly on token distributions rather than continuous latents), masked diffusion (a generalisation of BERT’s masking objective), and absorbing-state diffusion (gradually replacing tokens with a special mask token, then learning to reverse the masking).
Models in this space include SEDD (Score Entropy Discrete Diffusion), Plaid, and LLaDA (Large Language Diffusion Model, 2024-2025). The pitch is interesting: instead of generating left-to-right one token at a time, the model generates the whole output simultaneously and refines it over multiple denoising steps. This gives you parallel generation (faster wall-clock for long outputs) and the ability to edit or fill in any part of the output (not just append to the end).
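A toy decoding loop makes the “refine over multiple steps” part concrete. `predict_tokens` below is a hypothetical stand-in for the trained model; the real systems (SEDD, LLaDA and friends) differ in the details of what gets unmasked when.

```python
import numpy as np

MASK = -1  # sentinel id for the absorbing "mask" token

def diffusion_decode(predict_tokens, length, steps=8, rng=None):
    """Toy absorbing-state decoding: start fully masked, reveal a fraction of
    positions each step, most-confident first. Illustrative, not any real model."""
    rng = rng or np.random.default_rng()
    seq = np.full(length, MASK)
    for step in range(steps):
        still_masked = np.where(seq == MASK)[0]
        if len(still_masked) == 0:
            break
        token_ids, confidence = predict_tokens(seq)   # model sees the partial sequence
        n_reveal = max(1, len(still_masked) // (steps - step))
        order = still_masked[np.argsort(-confidence[still_masked])]
        seq[order[:n_reveal]] = token_ids[order[:n_reveal]]
    return seq

def dummy_model(seq, vocab=1000, rng=np.random.default_rng(3)):
    """Stand-in for a trained denoiser: random tokens with random confidence."""
    return rng.integers(0, vocab, size=len(seq)), rng.random(len(seq))

print(diffusion_decode(dummy_model, length=16, steps=4))
```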
As of 2026, diffusion language models are competitive with similarly-sized autoregressive transformers on some benchmarks but lag on others. They’re a genuine alternative paradigm, not just a tweak. Whether they end up dominant or niche is one of the more open questions in the field.
A comparison
| Architecture | Sequence cost | Inference memory | Strengths | Weaknesses |
|---|---|---|---|---|
| Transformer | O(n²) | Grows with context | General performance, ecosystem maturity | Cost at long context |
| Mamba (SSM) | O(n) | Constant (fixed-size state) | Long-sequence efficiency | Lossy hidden state, weaker associative recall |
| Striped Hyena / Jamba (hybrid) | Sub-quadratic | Mostly constant + some attention KV | Pragmatic mix, often best of both | More complex to train |
| RWKV / RetNet (RNN-like) | O(n) training, O(1) per token at inference | Constant | Cheapest inference, edge-friendly | Smaller ecosystem, training quirks |
| Liquid (continuous-time) | O(n) typical | Constant or near-constant | Very small models punching up | Early, narrower benchmark coverage |
| Diffusion (discrete) | Per-step backbone cost × number of denoising steps | Holds full sequence | Parallel generation, in-place editing | Multiple denoising steps per output, less mature for text |
What’s actually in production
In 2026, transformers still dominate every major API and almost every open-weight release. The frontier models – Claude, GPT, Gemini – are transformers. The leading open-weight models – Llama, Mistral, Qwen – are transformers.
The cracks where alternatives have started shipping:
- Long-sequence applications (DNA, audio, ultra-long-document analysis) increasingly use Mamba or hybrid architectures because the quadratic cost is the binding constraint.
- Edge deployment (phones, embedded devices) is where RWKV and RetNet have the most traction – constant-memory inference matters more than peak benchmark scores when you have 4GB of RAM.
- Hybrid models like Jamba are starting to appear in commercial offerings, mostly behind the scenes.
- Diffusion language models are research today, productisation tomorrow – the parallel generation property is too useful to ignore long-term.
What this means for you
Probably nothing immediate. If you’re building on Claude or GPT or a Llama derivative, you’re using a transformer, and you’ll keep using a transformer for the foreseeable future. The point of knowing the alternatives isn’t to switch away from transformers tomorrow.
The point is to recognise the shape of the next disruption when it lands. The story of “X dominated Y until something better came along” is the story of every architecture in the history of machine learning. Convolutional networks dominated vision for a decade until Vision Transformers came for them. RNNs dominated sequence modelling until transformers came for them. Transformers will eventually be replaced by something, and the candidates above are the live ones in 2026.
If you maintain AI infrastructure, the bet that pays off is keeping the interfaces clean – treating “the language model” as a swappable component rather than baking transformer-specific assumptions into your stack. The day a hybrid architecture starts winning at half the cost, you want to be able to swap.
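One way to keep that interface clean, sketched in code. The names and the `client.complete` call are illustrative, not any particular SDK; the point is that application code depends on a small protocol, never on the architecture behind it.

```python
from typing import Protocol

class TextModel(Protocol):
    """The only thing the rest of the stack is allowed to assume about 'the model'."""
    def generate(self, prompt: str, max_tokens: int = 512) -> str: ...

class TransformerBackend:
    def __init__(self, client):                    # any vendor SDK, injected
        self.client = client
    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        return self.client.complete(prompt, max_tokens)   # hypothetical client call

class HybridBackend:
    """Tomorrow's cheaper architecture slots in behind the same interface."""
    def __init__(self, client):
        self.client = client
    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        return self.client.complete(prompt, max_tokens)   # hypothetical client call

def summarise(model: TextModel, document: str) -> str:
    # Application code talks to TextModel, never to the architecture behind it.
    return model.generate(f"Summarise:\n{document}", max_tokens=256)
```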
What’s worth remembering
- Transformers’ weakness is quadratic attention cost at long sequences.
- State-space models (Mamba) trade memory for cost – linear scaling, but a lossy hidden state.
- Hybrid architectures (Jamba, Striped Hyena) are the pragmatic answer: cheap layers most of the time, expensive attention where needed.
- RWKV and RetNet revive the RNN with modern training tricks. Constant-memory inference is the killer feature.
- Liquid neural networks are smaller and continuous-time. Promising, early.
- Diffusion language models generate in parallel with iterative refinement – a different paradigm, still maturing.
- Transformers haven’t been replaced yet. They will be. The architecture that wins probably already exists in a paper today.
The next post in this series leaves the post-transformer future and goes the other way – backwards, to the language models that ran the world before BERT and the architectures that ran the world before that.
The next chapter, Before the Transformer, publishes around 16 May.