You have a backlog of 80,000 support tickets and you need to tag each one with one of fourteen categories. Someone suggests using an LLM. You write the prompt, you wire up the API, you run the numbers – and the bill comes back at $1,400 just for the categorisation. You haven’t even started doing anything with the categories yet.
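That $1,400 is easy to sanity-check with back-of-envelope arithmetic. The per-call token count and the price below are illustrative assumptions (roughly a frontier model’s input pricing, and a prompt that carries instructions, fourteen category definitions, and the ticket itself), not figures from a real invoice:

```python
# Back-of-envelope cost of classifying 80,000 tickets with a hosted LLM.
# Both the token count and the price are assumptions for illustration.
tickets = 80_000
tokens_per_call = 3_500          # instructions + 14 category definitions + the ticket
price_per_million_tokens = 5.00  # USD, assumed frontier-model input rate

total_tokens = tickets * tokens_per_call
total_cost = total_tokens / 1_000_000 * price_per_million_tokens
print(f"{total_tokens:,} tokens -> ${total_cost:,.0f}")
# 280,000,000 tokens -> $1,400
```

Change any assumption and the total moves, but the shape of the problem doesn’t: you pay per token, on every ticket, for a model that reads the whole prompt just to emit one word.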
There’s a better tool for this. It’s also a transformer. It’s just not the one everyone talks about.
In To LLMs… and Beyond! we treated “transformer” as one thing – the engine behind Claude, GPT, Llama. That was useful for a tour of the field, but it elided a real distinction. The transformer architecture comes in three structural shapes, and only one of them is the autoregressive text-generator that the AI conversation has fixated on.
The other two are still in production at every serious AI shop. They’re cheaper, faster, and often more accurate for the jobs they were designed to do. This post is about when to reach for them instead.
Three shapes from one paper
The 2017 paper Attention Is All You Need introduced the transformer with a specific job in mind: machine translation. English in, French out. The architecture had two halves – an encoder that read the English sentence and produced an internal representation of its meaning, and a decoder that consumed that representation and produced French one token at a time.
Almost immediately, researchers noticed you could use the halves separately.
- Encoder-only models keep just the encoder. They take text in and produce a representation – a vector, a label, a span. They never generate text. BERT (2018) is the headline example.
- Decoder-only models keep just the decoder. They take text in and produce more text, one token at a time. GPT, Claude, and Llama are all this shape.
- Encoder-decoder models keep both halves. They take text in, encode it, and decode something different out. T5 and BART are the headline examples.
The shape determines what the model is good at. And it determines what it costs.
Encoder-only: BERT and friends
BERT stands for Bidirectional Encoder Representations from Transformers. The “bidirectional” is the part that matters. A decoder-only model like GPT processes text left-to-right, one token at a time – when it’s predicting the next token, it can only see what came before. An encoder-only model processes the entire sequence at once, and every token can attend to every other token in both directions.
This makes encoder-only models worse at generating fluent text – in fact, they don’t generate text at all in the usual sense – but better at understanding it. When BERT looks at the word “bank” in “I sat by the bank of the river,” it can see “river” three tokens later, and that informs its representation of “bank.” A left-to-right model has to commit to a meaning before it has all the evidence.
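The structural difference between the two is nothing more than the attention mask. A toy sketch (pure Python, four tokens): position `i` may attend to position `j` wherever the mask holds a 1.

```python
n = 4  # toy sequence length

# Decoder-style causal mask: token i sees only positions j <= i.
causal = [[1 if j <= i else 0 for j in range(n)] for i in range(n)]

# Encoder-style bidirectional mask: every token sees every position.
bidirectional = [[1] * n for _ in range(n)]

for row in causal:
    print(row)
# [1, 0, 0, 0]
# [1, 1, 0, 0]
# [1, 1, 1, 0]
# [1, 1, 1, 1]

# In "I sat by the bank of the river", a causal model's representation of
# "bank" cannot use "river"; a bidirectional model's representation can.
```

Everything else – the attention maths, the feed-forward layers – is shared between the shapes; the mask is what makes one a reader and the other a writer.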
What encoder-only models actually output is a sequence of vectors – one per input token. You can use those vectors directly (as embeddings for similarity search) or you can stick a tiny classification head on top (a single linear layer that maps a vector to a label) and get a classifier.
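That classification head really is tiny. A minimal sketch in pure Python, with a made-up 4-dimensional vector standing in for the encoder’s output (BERT-base’s is really 768-dimensional) and hand-written weights – the head is one matrix multiply plus a softmax:

```python
import math

def classify(vector, weights, biases):
    """Linear layer + softmax: map an encoder vector to label probabilities."""
    logits = [sum(w * x for w, x in zip(row, vector)) + b
              for row, b in zip(weights, biases)]
    exps = [math.exp(z - max(logits)) for z in logits]  # numerically stable softmax
    total = sum(exps)
    return [e / total for e in exps]

# Toy numbers: a 4-d "sentence vector" and a 3-label head.
vector = [0.2, -1.3, 0.8, 0.5]
weights = [[0.1, 0.4, -0.2, 0.3],    # label 0: "billing"
           [-0.5, 0.2, 0.9, 0.1],    # label 1: "shipping"
           [0.3, -0.1, 0.2, -0.4]]   # label 2: "other"
biases = [0.0, 0.1, -0.2]

probs = classify(vector, weights, biases)
label = probs.index(max(probs))  # picks exactly one of the N labels
```

In training, only those weights and biases (plus, usually, the encoder underneath) get updated against your labelled examples – there is no text generation anywhere in the loop.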
The big BERT-family models you’ll encounter:
| Model | Made by | Notable for |
|---|---|---|
| BERT | Google, 2018 | The original. Set state of the art on a dozen benchmarks overnight. |
| RoBERTa | Meta, 2019 | BERT trained better – more data, longer, with the masking strategy fixed. Usually beats BERT. |
| DeBERTa | Microsoft, 2020-2021 | Disentangled attention. Strong on classification benchmarks, often the default for new projects. |
| DistilBERT | Hugging Face, 2019 | A 40%-smaller BERT that's 60% faster and keeps 97% of the accuracy. The pragmatic choice. |
| ModernBERT | Answer.AI, 2024 | BERT with the last six years of architectural improvements bolted on. Long context, fast inference. |
These are all small. BERT-base has 110 million parameters, DistilBERT has 66 million, ModernBERT-large has 395 million. Compare that to a frontier LLM at hundreds of billions. They run on a CPU. They run on your laptop. They run on a Raspberry Pi if you don’t mind waiting.
What encoder-only models are good at
Anything where the answer is shorter than the input. Specifically:
- Classification. Sentiment, intent, topic, language detection, content moderation, spam, urgency triage. One label out per input.
- Multi-label classification. Tagging a document with several categories at once.
- Named entity recognition (NER). Picking out people, places, organisations, dates from text. One label per token.
- Span extraction. “Find the answer to this question inside this document.” The model points at the start and end positions of the span. SQuAD-style question answering.
- Sentence embeddings. Producing a fixed-size vector that represents the meaning of a piece of text. The foundation of semantic search and RAG.
- Pairwise classification. “Are these two sentences saying the same thing?” “Does sentence A entail sentence B?”
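For the embedding use case, “similar meaning” is operationalised as cosine similarity between vectors. A stdlib sketch, with made-up 4-dimensional vectors standing in for real sentence embeddings (which are typically 384 to 1,024 dimensions):

```python
import math

def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, near 0 for unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Toy embeddings; in practice these come out of an encoder model.
refund_ticket  = [0.9, 0.1, 0.8, 0.0]
billing_ticket = [0.8, 0.2, 0.7, 0.1]
bug_report     = [0.0, 0.9, 0.1, 0.8]

# The two money-related tickets sit closer together than either does to the bug.
assert cosine(refund_ticket, billing_ticket) > cosine(refund_ticket, bug_report)
```

The encoder’s whole job in this setting is to make that inequality hold for texts that mean similar things, regardless of wording.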
For all of these, an LLM will also work. It will just cost roughly a hundred times more, take roughly ten times longer, and – in many cases – be less accurate.
Why an LLM is often worse, not just more expensive
Counterintuitive but real: a fine-tuned BERT often outperforms a frontier LLM at classification tasks the BERT was specifically trained for.
The reason is task alignment. An LLM is trained to predict the next token across the entirety of internet text. A fine-tuned classifier is trained on labelled examples of exactly the task you care about – ten thousand support tickets with their correct categories, say. The LLM has read the universe and has a vague sense of what “billing” means; the classifier has stared at your specific definition of “billing” for a thousand epochs.
The LLM also has to speak its answer, which introduces failure modes the classifier doesn’t have. Will it return “billing” or “Billing” or “billing/payments” or a polite refusal because the ticket mentions a credit card? The classifier returns one of fourteen integers. Always.
There’s an obvious counter: what if you don’t have ten thousand labelled examples? That’s a genuine constraint, and it’s where LLMs shine – zero-shot or few-shot classification with a prompt is a real superpower when you’re starting from nothing. But the moment you’ve labelled enough data to fine-tune a small encoder, the cost-quality curve usually flips.
Encoder-decoder: T5, BART, FLAN
The encoder-decoder shape is for jobs where the output is structured but isn’t a free-form essay – a transformation of the input rather than a continuation of it.
The flagship example is Google’s T5 (Text-to-Text Transfer Transformer, 2019), which framed every NLP task as text-in, text-out:
- Translation: input “translate English to German: That is good.” → output “Das ist gut.”
- Summarisation: input “summarize: <article>” → output “<summary>”
- Classification: input “cola sentence: The course is jumping well.” → output “not acceptable”
- Question answering: input “question: What is the capital of France? context: …” → output “Paris”
The shape is well-suited to anything that has a deterministic-ish target – a translation, a summary, a structured output, a SQL query generated from a natural-language question. The encoder reads the whole input once, builds a rich representation, and the decoder produces the (usually short) output guided by that representation.
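T5’s trick is purely conventional: every task becomes a single string with a task prefix, and the same weights handle all of them. The prefixes below are the ones from the T5 paper; the helper function is a hypothetical illustration of the framing, not part of any library:

```python
def t5_input(task_prefix, text):
    """Frame any task as text-to-text, T5-style: one string in, one string out."""
    return f"{task_prefix}: {text}"

print(t5_input("translate English to German", "That is good."))
# translate English to German: That is good.

print(t5_input("summarize", "<article text>"))
# summarize: <article text>

# The model doesn't change per task -- only the prefix does, so one set of
# weights serves translation, summarisation, and classification alike.
```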
The other notable encoder-decoder family is BART (Meta, 2019), which was trained on a denoising objective – corrupt the input, recover the original – and is particularly strong at summarisation.
The later variants – FLAN-T5 (instruction-tuned T5), the larger T5-XXL, and BART-large-CNN (BART fine-tuned for news summarisation) – are still common backbones for production summarisation and translation pipelines, especially when you want to fine-tune on your own data.
What encoder-decoder models are good at
- Translation. The original use case, still strong.
- Summarisation. Extractive (copy spans) or abstractive (rewrite). BART-large-CNN was the production default for years.
- Structured generation. Text-to-SQL, text-to-JSON, text-to-API-call. The encoder grounds the output in the input.
- Grammar correction. Input: messy sentence. Output: clean sentence.
- Question answering with generation. Where the answer isn’t necessarily a span in the document and needs to be paraphrased.
The boundary with decoder-only LLMs has blurred. Modern LLMs do all of the above competently, often better than older T5 models, and the simplicity of “one model for everything” has pulled a lot of work toward the decoder-only side. But for pipelines where you need something small, fast, deterministic, and fine-tuneable, T5-family models still pull their weight.
A decision table
| If your task is… | Reach for… | Why not an LLM? |
|---|---|---|
| Tag each item with one of N categories | DeBERTa or DistilBERT, fine-tuned | 100x cheaper, often more accurate, no parsing of free-text output |
| Find people, places, dates in text | A BERT-family NER model (e.g. spaCy's transformer) | Token-level precision, no hallucinated entities |
| Embed sentences for semantic search | A sentence-transformers model (BGE, E5, GTE) | LLMs don't natively produce sentence embeddings; encoder models do this as their primary job |
| Translate between languages at scale | A T5- or NLLB-family model, fine-tuned if needed | Per-token cost matters at translation volumes; specialised models still lead |
| Convert natural language to SQL or JSON | A code-fine-tuned T5, or an LLM if accuracy matters more than cost | Mixed – LLMs win on hard cases, encoder-decoders win on cost at scale |
| Decide if a comment is toxic | A fine-tuned encoder classifier (e.g. Detoxify) | Real-time moderation needs millisecond latency, not 800ms API round-trips |
| Have a free-form conversation | An LLM | Encoder models cannot generate fluent multi-turn text |
| Reason through a multi-step problem | An LLM, ideally a reasoning model | Encoder models have no chain-of-thought; they produce one answer in one pass |
The pragmatic stack
In production AI systems, you’ll often see encoder, encoder-decoder, and decoder-only models working together rather than competing.
A typical retrieval-augmented chat application:
- Bi-encoder (BERT-family) embeds the user’s query and finds the top 100 candidate documents from the vector database. Cheap, parallel, fast.
- Cross-encoder (BERT-family) re-ranks those 100 down to the top 5 by reading each query-document pair carefully. We’ll cover this in the next post.
- Decoder-only LLM consumes the top 5 documents alongside the query and writes a fluent answer.
Each stage uses the right tool for its job. The encoder does the cheap, high-throughput retrieval and ranking. The LLM does the expensive, low-throughput generation, but only after the encoder has narrowed the search space by three orders of magnitude.
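The whole stack can be sketched end to end. Everything below is a toy stand-in – hand-written 3-dimensional vectors instead of a real bi-encoder, a word-overlap score instead of a cross-encoder, a placeholder string instead of the LLM call – but the shape of the pipeline is the point:

```python
import math

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# Stage 0: a "vector database" of pre-embedded documents (toy 3-d vectors).
docs = {
    "refund policy":   [0.9, 0.1, 0.2],
    "shipping times":  [0.1, 0.9, 0.3],
    "api rate limits": [0.2, 0.3, 0.9],
}

def bi_encoder_embed(query):
    # Stand-in for a BERT-family bi-encoder; a real one returns ~384-d vectors.
    return [0.8, 0.2, 0.1] if "refund" in query else [0.1, 0.2, 0.8]

def cross_encoder_score(query, doc_name):
    # Stand-in for a cross-encoder reading the (query, document) pair jointly.
    return sum(1 for word in query.split() if word in doc_name)

query = "how do refund requests work"

# Stage 1: cheap retrieval - rank every doc by embedding similarity, keep top 2.
qvec = bi_encoder_embed(query)
candidates = sorted(docs, key=lambda d: cosine(qvec, docs[d]), reverse=True)[:2]

# Stage 2: careful re-ranking of the shortlist only.
best = max(candidates, key=lambda d: cross_encoder_score(query, d))

# Stage 3: only now pay for generation (placeholder for the LLM call).
answer = f"[LLM answers '{query}' grounded in: {best}]"
```

The asymmetry to notice: stages 1 and 2 run cheap encoder passes over many items, while stage 3 runs one expensive generative pass over almost nothing.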
This is the pattern that matters. It’s not “LLM vs BERT.” It’s “use BERT to make the LLM step efficient enough to be worth doing.”
Where to find them
- Hugging Face is the de facto registry.
- `bert-base-uncased`, `roberta-large`, `microsoft/deberta-v3-large`, `distilbert-base-uncased`, `answerdotai/ModernBERT-large`, `t5-base`, `facebook/bart-large-cnn`, `google/flan-t5-xl` – all available, all free to download.
- `sentence-transformers` is the library for using BERT-family models as embedding models. `all-MiniLM-L6-v2` is the gateway drug – 22 million parameters, runs on a phone, and is the correct starting point for 80% of semantic-search projects.
- spaCy wraps fine-tuned encoder models for NER, POS tagging, and similar pipelines, with an API designed for production use rather than research.
- Cohere, OpenAI, Voyage sell hosted embedding APIs if you want the model without the operations.
The word “transformer” hides three quite different machines under one name. The decoder-only shape is what everyone means when they say LLM, and it’s the one that has to speak its answer aloud, one token at a time. That mouth is what makes it generative, and it’s also what makes the bill arrive. The encoder-only shape never opens its mouth: it reads, it understands, it points at a label or a span or hands back a vector. The encoder-decoder shape sits in between, reading once and producing a short, structured response.
If your job has a stable target – one of fourteen categories, a span in a document, an embedding for retrieval, a SQL query – there’s almost always a smaller, older, cheaper model that does it better than a frontier LLM, especially once you have labelled data to fine-tune on. The serious AI shops know this. Their production stacks don’t pick between transformer shapes; they chain them. The encoder narrows the search space by three orders of magnitude so the decoder’s expensive generation step is worth paying for. “Should I use an LLM?” is the wrong framing; the useful framing is where in the pipeline an LLM actually earns its cost.