Your context window is one million tokens. The model bills you per token in and per token out, and the in-token bill grows linearly with the prompt – but the underlying compute grows quadratically. At a million tokens, the attention step is doing roughly a trillion pairwise calculations. Someone is paying for that. It’s you.
A handful of new architectures claim they can do the same job at linear cost. Some of them can. Some of them can’t. None of them have replaced transformers yet, but at least one of them is going to.
In To LLMs… and Beyond! we mentioned state-space models – specifically Mamba – as the leading post-transformer candidate. That’s accurate but underspecified. There’s a whole research front trying to do better than the transformer at sequence modelling, and the candidates differ in what they’re trying to fix. This post walks through the field.
The point isn’t that transformers are about to be replaced. They aren’t. The point is that the assumption “transformer = the only way” is already broken, and the alternatives are interesting enough to know about before they show up in production.
What’s wrong with transformers
The transformer’s superpower is its attention mechanism: every token can attend to every other token. That’s how it captures long-range dependencies, and it’s why it dominates language modelling.
The cost is also right there in the design. If your sequence has n tokens, the attention step does roughly n² pairwise comparisons. Double the input, quadruple the compute and the memory.
For short sequences this doesn’t matter. For long ones it dominates. A 2,000-token prompt is fine. A 200,000-token prompt is expensive. A 2,000,000-token prompt is, on a vanilla transformer, infeasible.
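To make that concrete, here is a toy single-head attention in plain numpy. The (n, n) score matrix is where the quadratic cost lives; the sizes below are illustrative, not any particular model’s.

```python
import numpy as np

def naive_attention(Q, K, V):
    """Single-head attention; the (n, n) score matrix is the quadratic part."""
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)                      # (n, n): n^2 entries
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V                                  # (n, d)

n, d = 1_000, 64
Q, K, V = (np.random.randn(n, d) for _ in range(3))
out = naive_attention(Q, K, V)          # fine at n=1,000: the score matrix is tiny

for n in (2_000, 200_000, 2_000_000):
    print(f"n={n:>9,}: score matrix holds {n * n:,} values "
          f"(~{n * n * 4 / 1e9:,.2f} GB at fp32)")
```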
The industry has worked around this with engineering – FlashAttention, sliding-window attention, ring attention, KV-cache compression – and the workable context window has stretched from 2k tokens (GPT-3) to 1M+ tokens (Claude, Gemini) over a few years. But the underlying complexity is still quadratic. The workarounds are clever, not free.
Nearly all of the post-transformer architectures share one design goal: sub-quadratic scaling in sequence length. (The diffusion models further down are the exception – their pitch is parallel generation, not cheaper attention.) Beyond that they diverge sharply.
State-space models: Mamba
The most-discussed post-transformer architecture is the state-space model (SSM), and the leading example is Mamba (Gu and Dao, 2023).
The intuition is the one we used in the entry post: instead of every token attending to every other token (the “re-read the book each time” approach), the model maintains a compressed hidden state that gets updated as each token comes in (the “running notes” approach). The cost of updating is constant per token, so the total cost is linear in sequence length, not quadratic.
The catch is that the hidden state is lossy. It’s a fixed-size summary of everything that came before. If a transformer wants to recall the seventh sentence of a hundred-page document, it has the full attention budget to do so. If Mamba wants to recall it, it has to have written something useful about it into the hidden state at the time – and the hidden state has finite capacity.
The Mamba innovation that mattered was making the state update selective – the model learns, token by token, what to write into the hidden state and what to skim past, rather than treating every token equally. This narrowed the gap with transformers significantly, particularly on language modelling benchmarks.
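A minimal sketch of the idea in code. This is not Mamba’s actual parameterisation – just the shape of it: a fixed-size state, one cheap update per token, and an input-dependent gate deciding how much of each token gets written in.

```python
import numpy as np

def selective_scan(xs, W_gate, W_in):
    """Toy selective recurrence: O(n) total cost, fixed-size state, lossy by design.
    Illustrates the idea only; Mamba's real update rule is different."""
    state_size = W_in.shape[0]
    h = np.zeros(state_size)
    outputs = []
    for x in xs:                                   # one constant-cost update per token
        gate = 1 / (1 + np.exp(-(W_gate @ x)))     # input-dependent: how much to write
        h = (1 - gate) * h + gate * (W_in @ x)     # overwrite part of the running notes
        outputs.append(h.copy())
    return np.stack(outputs)

rng = np.random.default_rng(0)
n, d, state_size = 1_000, 16, 32
xs = rng.normal(size=(n, d))
W_gate = rng.normal(size=(state_size, d)) * 0.1
W_in = rng.normal(size=(state_size, d)) * 0.1
print(selective_scan(xs, W_gate, W_in).shape)      # (1000, 32): state never grew past 32
```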
As of 2026, Mamba and Mamba-2 are competitive with transformers of similar size on many language tasks, sometimes superior on tasks involving very long sequences (DNA, audio, ultra-long documents), and sometimes weaker on tasks requiring precise long-range recall (associative memory). The honest summary: Mamba is real, it works, and it hasn’t beaten transformers across the board.
The hybrid approach: Striped Hyena, Jamba
Most serious research on post-transformer architectures has converged on a pragmatic answer: don’t pick one, mix them.
Hyena (Stanford, 2023) replaces attention entirely with sub-quadratic long-convolution blocks; its successor Striped Hyena interleaves those blocks with attention blocks – letting the cheap Hyena blocks do most of the work and the expensive attention blocks handle the parts that genuinely need cross-token comparison.
Jamba (AI21 Labs, 2024) does the same thing but with Mamba blocks: a transformer-Mamba hybrid that uses Mamba layers for efficiency and transformer layers for the kinds of pattern matching transformers are still better at.
The hybrid pattern is now the default assumption for “what comes after the pure transformer.” It’s not “Mamba replaces attention,” it’s “Mamba is a cheap layer that lets you spend your attention budget more carefully.”
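In code terms the hybrid idea is almost boring: a layer stack that is mostly cheap blocks with the occasional attention block. The ratio below is only illustrative – Jamba, for example, uses far more Mamba layers than attention layers.

```python
def hybrid_stack(n_layers: int, attention_every: int = 8) -> list[str]:
    """Illustrative layer layout for a hybrid model: cheap recurrent blocks most of
    the time, full attention occasionally. Real models tune the ratio and placement."""
    return [
        "attention" if (i + 1) % attention_every == 0 else "mamba"
        for i in range(n_layers)
    ]

print(hybrid_stack(16))
# ['mamba', 'mamba', ..., 'attention', 'mamba', ..., 'attention']  (2 of 16 are attention)
```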
RWKV and RetNet: the RNN comeback
Two other notable lines try to revive the recurrent neural network – the architecture transformers replaced – with modern training tricks.
RWKV (Receptance Weighted Key Value, BlinkDL, 2023+) is an RNN that can be trained like a transformer. Standard RNNs are notoriously slow to train because they’re inherently sequential – token t+1 depends on token t. RWKV reformulates the recurrence in a way that allows parallel training (like a transformer) but sequential inference at constant cost per token (like an RNN). At inference time, an RWKV model uses constant memory regardless of sequence length – the dream that transformers can’t achieve.
RetNet (Retentive Network, Microsoft, 2023) takes a similar approach with a different mechanism. It claims the “impossible triangle”: parallel training, recurrent inference, and strong performance.
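Both lines boil down to a recurrence that can be unrolled in parallel at training time and run step by step at inference. A toy decaying linear recurrence in that spirit – not RWKV’s or RetNet’s actual equations – shows the constant-memory property:

```python
import numpy as np

def recurrent_step(state, x, W_k, W_v, decay=0.95):
    """One inference step of a toy decaying linear recurrence (RWKV/RetNet spirit,
    not their real formulas). All memory lives in the fixed-size `state`."""
    k, v = W_k @ x, W_v @ x
    state = decay * state + np.outer(k, v)          # (d, d) matrix, never grows
    return state, state.T @ k                       # read out against the current key

rng = np.random.default_rng(1)
d = 32
W_k, W_v = rng.normal(size=(d, d)) * 0.1, rng.normal(size=(d, d)) * 0.1
state = np.zeros((d, d))
for _ in range(10_000):                             # 10k tokens; memory stays (32, 32)
    state, y = recurrent_step(state, rng.normal(size=d), W_k, W_v)
print(state.shape, y.shape)                         # (32, 32) (32,)
```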
Neither has displaced transformers. Both are competitive in their weight classes and both are interesting if you care about deployment cost more than peak quality – a constant-memory inference path is genuinely useful when you’re running models on phones or in tight latency budgets.
Liquid neural networks
Liquid AI (an MIT spin-out) builds on a different research lineage: continuous-time neural networks where the hidden state evolves according to differential equations rather than discrete update steps. The promise is dramatically smaller models (often orders of magnitude smaller) that match the performance of much larger transformers on specific tasks.
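The “continuous-time” part is easier to see in code than in prose: instead of a discrete update h_t = f(h_{t-1}, x_t), the state follows a differential equation that gets integrated numerically. The sketch below is a generic continuous-time RNN step, purely illustrative – not Liquid AI’s actual architecture.

```python
import numpy as np

def continuous_time_step(h, x, W_h, W_x, tau=1.0, dt=0.1, substeps=10):
    """Evolve the hidden state under dh/dt = -h/tau + tanh(W_h h + W_x x) with
    simple Euler integration. A generic continuous-time RNN, for illustration only."""
    for _ in range(substeps):
        dh = -h / tau + np.tanh(W_h @ h + W_x @ x)
        h = h + dt * dh
    return h

rng = np.random.default_rng(2)
d_in, d_h = 8, 16
W_h = rng.normal(size=(d_h, d_h)) * 0.1
W_x = rng.normal(size=(d_h, d_in)) * 0.1
h = np.zeros(d_h)
for x in rng.normal(size=(50, d_in)):               # feed a short input sequence
    h = continuous_time_step(h, x)
print(h.shape)                                      # (16,): state size is fixed
```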
It’s early. Their language models are interesting and small (Liquid’s LFM-3B punches above its weight), but the wider research community hasn’t replicated the results across the spectrum of language tasks. Worth knowing it exists. Probably not worth deploying yet unless you have a specific reason.
Diffusion for text
Image generation switched from autoregressive to diffusion years ago (DALL-E 1 was autoregressive; DALL-E 2 onwards is diffusion). The natural question: why not the same for text?
The answer for a long time was “because text is discrete and diffusion is continuous.” Recent work has found ways around this: discrete diffusion (operating directly on token distributions rather than continuous latents), masked diffusion (a generalisation of BERT’s masking objective), and absorbing-state diffusion (gradually replacing tokens with a special mask token, then learning to reverse the masking).
Models in this space include SEDD (Score Entropy Discrete Diffusion), Plaid, and LLaDA (Large Language Diffusion Model, 2024-2025). The pitch is interesting: instead of generating left-to-right one token at a time, the model generates the whole output simultaneously and refines it over multiple denoising steps. This gives you parallel generation (faster wall-clock for long outputs) and the ability to edit or fill in any part of the output (not just append to the end).
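A toy decoding loop makes the “refine over multiple steps” part concrete. `predict_tokens` below is a hypothetical stand-in for the trained model; the real systems (SEDD, LLaDA and friends) differ in the details of what gets unmasked when.

```python
import numpy as np

MASK = -1  # sentinel id for the absorbing "mask" token

def diffusion_decode(predict_tokens, length, steps=8, rng=None):
    """Toy absorbing-state decoding: start fully masked, reveal a fraction of
    positions each step, most-confident first. Illustrative, not any real model."""
    rng = rng or np.random.default_rng()
    seq = np.full(length, MASK)
    for step in range(steps):
        still_masked = np.where(seq == MASK)[0]
        if len(still_masked) == 0:
            break
        token_ids, confidence = predict_tokens(seq)   # model sees the partial sequence
        n_reveal = max(1, len(still_masked) // (steps - step))
        order = still_masked[np.argsort(-confidence[still_masked])]
        seq[order[:n_reveal]] = token_ids[order[:n_reveal]]
    return seq

def dummy_model(seq, vocab=1000, rng=np.random.default_rng(3)):
    """Stand-in for a trained denoiser: random tokens with random confidence."""
    return rng.integers(0, vocab, size=len(seq)), rng.random(len(seq))

print(diffusion_decode(dummy_model, length=16, steps=4))
```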
As of 2026, diffusion language models are competitive with similarly-sized autoregressive transformers on some benchmarks but lag on others. They’re a genuine alternative paradigm, not just a tweak. Whether they end up dominant or niche is one of the more open questions in the field.
A comparison
| Architecture | Sequence cost | Inference memory | Strengths | Weaknesses |
|---|---|---|---|---|
| Transformer | O(n²) | Grows with context | General performance, ecosystem maturity | Cost at long context |
| Mamba (SSM) | O(n) | Constant (fixed-size state) | Long-sequence efficiency | Lossy hidden state, weaker associative recall |
| Striped Hyena / Jamba (hybrid) | Sub-quadratic | Mostly constant + some attention KV | Pragmatic mix, often best of both | More complex to train |
| RWKV / RetNet (RNN-like) | O(n) training, O(1) per token at inference | Constant | Cheapest inference, edge-friendly | Smaller ecosystem, training quirks |
| Liquid (continuous-time) | O(n) typical | Constant or near-constant | Very small models punching up | Early, narrower benchmark coverage |
| Diffusion (discrete) | Per-step backbone cost × number of denoising steps | Holds full sequence | Parallel generation, in-place editing | Multiple denoising steps per output, less mature for text |
What’s actually in production
In 2026, transformers still dominate every major API and almost every open-weight release. The frontier models – Claude, GPT, Gemini – are transformers. The leading open-weight models – Llama, Mistral, Qwen – are transformers.
The cracks where alternatives have started shipping:
- Long-sequence applications (DNA, audio, ultra-long-document analysis) increasingly use Mamba or hybrid architectures because the quadratic cost is the binding constraint.
- Edge deployment (phones, embedded devices) is where RWKV and RetNet have the most traction – constant-memory inference matters more than peak benchmark scores when you have 4GB of RAM.
- Hybrid models like Jamba are starting to appear in commercial offerings, mostly behind the scenes.
- Diffusion language models are research today, productisation tomorrow – the parallel generation property is too useful to ignore long-term.
What this means for you
Probably nothing immediate. If you’re building on Claude or GPT or a Llama derivative, you’re using a transformer, and you’ll keep using a transformer for the foreseeable future. The point of knowing the alternatives isn’t to switch away from transformers tomorrow.
The point is to recognise the shape of the next disruption when it lands. The story of “X dominated Y until something better came along” is the story of every architecture in the history of machine learning. Convolutional networks dominated vision for a decade until Vision Transformers came for them. RNNs dominated sequence modelling until transformers came for them. Transformers will eventually be replaced by something, and the candidates above are the live ones in 2026.
If you maintain AI infrastructure, the bet that pays off is keeping the interfaces clean – treating “the language model” as a swappable component rather than baking transformer-specific assumptions into your stack. The day a hybrid architecture starts winning at half the cost, you want to be able to swap.
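One way to keep that interface clean, sketched in code. The names and the `client.complete` call are illustrative, not any particular SDK; the point is that application code depends on a small protocol, never on the architecture behind it.

```python
from typing import Protocol

class TextModel(Protocol):
    """The only thing the rest of the stack is allowed to assume about 'the model'."""
    def generate(self, prompt: str, max_tokens: int = 512) -> str: ...

class TransformerBackend:
    def __init__(self, client):                    # any vendor SDK, injected
        self.client = client
    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        return self.client.complete(prompt, max_tokens)   # hypothetical client call

class HybridBackend:
    """Tomorrow's cheaper architecture slots in behind the same interface."""
    def __init__(self, client):
        self.client = client
    def generate(self, prompt: str, max_tokens: int = 512) -> str:
        return self.client.complete(prompt, max_tokens)   # hypothetical client call

def summarise(model: TextModel, document: str) -> str:
    # Application code talks to TextModel, never to the architecture behind it.
    return model.generate(f"Summarise:\n{document}", max_tokens=256)
```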
What’s worth remembering
- Transformers’ weakness is quadratic attention cost at long sequences.
- State-space models (Mamba) trade memory for cost – linear scaling, but a lossy hidden state.
- Hybrid architectures (Jamba, Striped Hyena) are the pragmatic answer: cheap layers most of the time, expensive attention where needed.
- RWKV and RetNet revive the RNN with modern training tricks. Constant-memory inference is the killer feature.
- Liquid neural networks are smaller and continuous-time. Promising, early.
- Diffusion language models generate in parallel with iterative refinement – a different paradigm, still maturing.
- Transformers haven’t been replaced yet. They will be. The architecture that wins probably already exists in a paper today.
The next post in this series leaves the post-transformer future and goes the other way – backwards, to the language models that ran the world before BERT and the architectures that ran the world before that.
The next chapter, Before the Transformer, publishes around 16 May.