When Not to Use an LLM

June 24, 2026 · 11 min read

A new project lands on your desk. A specification, a deadline, a budget. The default assumption (yours, your team’s, your stakeholders’) is “we’ll use an LLM.” That assumption is correct perhaps half the time. The other half, it costs ten times as much as it should, works less well than it should, and locks you into a vendor for no good reason. The discipline that pays off is asking “is this actually an LLM problem?” before you write the first prompt.

This is the closing post for the series. Nine posts ago, in To LLMs… and Beyond!, we mapped the model landscape. Then we worked through the parts the entry post skipped: encoder-only and encoder-decoder transformers, post-transformer architectures, classical NLP, statistical baselines, hand-written rules, search and planning, logic and constraints, probabilistic reasoning. Each post showed where its tool wins.

This post puts those tools in one place. It’s the field-guide flowchart for “what should I actually use?”

The framing axis: generative vs. discriminative

The single most useful question to ask up front: does the task require generating new text, or only making decisions about existing text (and other data)?

Generative tasks produce new content: writing, summarising, translating, conversing, coding from scratch. The output is open-ended.
Discriminative tasks make decisions about given input: classifying, extracting, searching, ranking, scoring, deciding. The output is bounded.

LLMs are generative models. They can do discriminative tasks (you can prompt one to classify a document) but you’re using a Boeing 747 to deliver a pizza. The correct tool for a discriminative task is usually a discriminative model, and the discriminative models we covered in this series are dramatically cheaper.

The first decision is which side you’re on. Most of the wasteful uses of LLMs in industry are people using LLMs for discriminative tasks because the LLM is the tool everyone knows.

Symptoms that say “this is not an LLM problem”

Run through these. If any of them describe your situation, an alternative is probably the better answer.

The output is one of N labels

You’re tagging support tickets, classifying sentiment, routing emails, detecting spam, picking categories. The output is from a small set you defined.

Reach for: a fine-tuned encoder model (DeBERTa, DistilBERT, ModernBERT) or, for simpler problems, TF-IDF + logistic regression.

Why: hundredfold cheaper, often more accurate on your specific labels, no parsing of free-text output. See The Other Transformers and The Boring Baseline That Wins.

You’re extracting entities from text

People, places, organisations, dates, product codes, drug names, gene names.

Reach for: a CRF (especially in specialised domains) or a fine-tuned BERT-family NER model (in general domains).

Why: token-level precision, no hallucinated entities, runs on a CPU at thousands of sentences per second. See Before the Transformer.

The pattern is exact

Email addresses, phone numbers, postcodes, account numbers, log lines, URLs, structured codes.

Reach for: a regular expression. Maybe a finite-state transducer if the structure is richer.

Why: deterministic, microsecond latency, free, auditable. See Rules, Grammars, and Regex.
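A minimal sketch of what that looks like, using Python’s standard re module. The two patterns are deliberately simplified illustrations and are assumptions of this sketch, not production validators: the email pattern ignores much of what the RFCs allow, and the postcode pattern covers only the common UK shape.

```python
import re

# Simplified, illustrative patterns; real formats have more edge cases.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")
UK_POSTCODE = re.compile(r"\b[A-Z]{1,2}\d[A-Z\d]?\s?\d[A-Z]{2}\b")


def extract_patterns(text: str) -> dict[str, list[str]]:
    """Return every email address and UK-style postcode found in the text."""
    return {
        "emails": EMAIL.findall(text),
        "postcodes": UK_POSTCODE.findall(text),
    }


if __name__ == "__main__":
    print(extract_patterns("Contact jane.doe@example.com or visit SW1A 1AA."))
```

The same input always gives the same output, and the pattern itself is the documentation of what gets matched, which is what makes it auditable.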

A regulator or auditor will ask you to explain decisions

Loan approval, claims adjudication, benefits eligibility, compliance flagging, content moderation in regulated industries.

Reach for: a production rule engine (Drools, IBM ODM) or a hand-written decision tree.

Why: every decision traces to a specific rule. Domain experts maintain the rules. See Rules, Grammars, and Regex and Knowledge, Logic, and Constraints.
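To make “every decision traces to a specific rule” concrete, here is a small sketch in plain Python. It is a stand-in for the idea, not the Drools or IBM ODM API, and the rules, thresholds, and field names (age, income, debt) are hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Rule:
    rule_id: str
    description: str
    applies: Callable[[dict], bool]
    outcome: str


# Hypothetical eligibility rules, evaluated in order; the first match wins.
RULES = [
    Rule("R1", "Applicants under 18 are ineligible",
         lambda a: a["age"] < 18, "reject"),
    Rule("R2", "Debt above 45% of income needs manual review",
         lambda a: a["debt"] > 0.45 * a["income"], "review"),
    Rule("R3", "Everything else is approved",
         lambda a: True, "approve"),
]


def decide(applicant: dict) -> tuple[str, str]:
    """Return (outcome, rule_id) so each decision can be traced to one rule."""
    for rule in RULES:
        if rule.applies(applicant):
            return rule.outcome, rule.rule_id
    raise ValueError("no rule matched")


if __name__ == "__main__":
    print(decide({"age": 30, "income": 40_000, "debt": 20_000}))
```

The returned rule ID is the audit trail: when someone asks why an application went to review, the answer is a rule a domain expert wrote and can change.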

You need provable correctness

Hardware verification, security analysis, type checking, scheduling that must satisfy constraints, configuration that must be valid.

Reach for: a SAT solver or SMT solver (Z3, CVC5).

Why: an LLM produces plausible answers; an SMT solver produces correct ones. See Knowledge, Logic, and Constraints.
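The difference between plausible and correct shows up even on a toy problem. The sketch below is not a SAT or SMT solver. It is brute-force enumeration over three boolean settings, which gives the same kind of exhaustive guarantee for a handful of variables and nothing more; a real solver such as Z3 is what makes this feasible at scale. The configuration variables and constraints are hypothetical.

```python
from itertools import product
from typing import Callable, Optional

VARIABLES = ("tls", "http2", "legacy_clients")


def constraints(cfg: dict) -> bool:
    """Hypothetical validity rules for a server configuration."""
    http2_requires_tls = (not cfg["http2"]) or cfg["tls"]
    legacy_forbids_tls = (not cfg["legacy_clients"]) or (not cfg["tls"])
    return http2_requires_tls and legacy_forbids_tls


def find_counterexample(prop: Callable[[dict], bool]) -> Optional[dict]:
    """Check every assignment; return a valid config that breaks prop, or None.

    None means the property holds for ALL valid configurations,
    because every one of the 2**n assignments was examined.
    """
    for values in product([False, True], repeat=len(VARIABLES)):
        cfg = dict(zip(VARIABLES, values))
        if constraints(cfg) and not prop(cfg):
            return cfg
    return None


if __name__ == "__main__":
    # Claim: HTTP/2 and legacy clients can never both be enabled.
    print(find_counterexample(lambda c: not (c["http2"] and c["legacy_clients"])))
```

A result of None is a proof over the whole space, not a sample of it. The cost is exponential in the number of variables, which is exactly the problem SAT and SMT solvers exist to manage.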

You’re searching a state space

Pathfinding, scheduling, route optimisation, build dependency resolution, game AI, robot planning.

Reach for: A* (with a heuristic), Dijkstra (without), alpha-beta or MCTS for adversarial games, a CSP solver (OR-Tools), a planner.

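Why: search algorithms explore the state space systematically, and several of them (Dijkstra, A* with an admissible heuristic) guarantee optimal solutions where LLMs guess plausible ones. See Search and Planning.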
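For a sense of scale, a complete A* search fits in about thirty lines. This sketch assumes a small grid given as a list of strings, with '#' marking walls, four-way movement, and a unit cost per step; those choices are illustrative. It uses Manhattan distance as the heuristic, which never overestimates on such a grid, so the returned path length is the shortest one.

```python
import heapq
from typing import Optional

Cell = tuple[int, int]


def astar(grid: list[str], start: Cell, goal: Cell) -> Optional[int]:
    """Return the length of the shortest 4-connected path, or None if blocked."""
    rows, cols = len(grid), len(grid[0])

    def heuristic(cell: Cell) -> int:
        # Manhattan distance: admissible for unit-cost 4-way movement.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    frontier = [(heuristic(start), 0, start)]  # (estimated total, cost so far, cell)
    best_cost = {start: 0}

    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cost > best_cost[cell]:
            continue  # stale queue entry; a cheaper route was found later
        if cell == goal:
            return cost
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] != "#":
                new_cost = cost + 1
                if new_cost < best_cost.get((nr, nc), float("inf")):
                    best_cost[(nr, nc)] = new_cost
                    heapq.heappush(
                        frontier, (new_cost + heuristic((nr, nc)), new_cost, (nr, nc))
                    )
    return None


if __name__ == "__main__":
    print(astar(["..#.", "..#.", "...."], (0, 0), (0, 3)))
```

Setting the heuristic to always return zero turns this into Dijkstra’s algorithm, which is the “without a heuristic” case.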

Your data is noisy and you need calibrated uncertainty

Sensor fusion, state estimation, A/B testing with adaptive allocation, scientific modelling, risk analysis.

Reach for: Kalman filter, particle filter, Bayesian network, multi-armed bandit, probabilistic programming language.

Why: probabilistic methods give you posterior distributions, not point estimates. See Bayesian Reasoning.
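A one-dimensional Kalman filter shows what “a distribution, not a point estimate” means in practice. This sketch assumes a scalar state that stays roughly constant between readings and Gaussian noise throughout; the function and parameter names are illustrative, and a real system would use a library and a proper motion model.

```python
def kalman_1d(
    measurements: list[float],
    meas_var: float,
    process_var: float,
    init_mean: float = 0.0,
    init_var: float = 1e6,
) -> list[tuple[float, float]]:
    """Return the (mean, variance) of the state estimate after each measurement."""
    mean, var = init_mean, init_var
    estimates = []
    for z in measurements:
        # Predict: the state is assumed unchanged, but uncertainty grows.
        var += process_var
        # Update: blend prediction and measurement, weighted by their variances.
        gain = var / (var + meas_var)
        mean += gain * (z - mean)
        var *= 1.0 - gain
        estimates.append((mean, var))
    return estimates


if __name__ == "__main__":
    readings = [10.2, 9.7, 10.1, 9.9, 10.3]
    for mean, var in kalman_1d(readings, meas_var=4.0, process_var=0.01):
        print(f"estimate {mean:.2f}, variance {var:.2f}")
```

The variance is the useful part: it tells downstream code how much to trust the estimate, which a bare number from any model cannot.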

Latency budget is in microseconds

Real-time bidding, mobile autocomplete, network packet inspection, anything in a hot path.

Reach for: regex, an n-gram model, a small linear classifier, or a CRF, whatever fits the task at sub-millisecond latency.

Why: a transformer can’t run that fast.

You need semantic search but not generation

User searches a corpus and you return the relevant documents, full stop. No summary, no answer, just the documents.

Reach for: a sentence-embedding model (BGE, E5, Cohere Embed) plus a vector database. Add a cross-encoder reranker if relevance matters. Add BM25 hybrid retrieval if exact matches matter.

Why: the LLM only earns its cost when it’s generating something, and “find me the documents” doesn’t need generation. See The Reranker You Didn’t Know You Needed.
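The core of embedding search is nearest-neighbour lookup by cosine similarity. The sketch below uses tiny hand-written three-dimensional vectors as hypothetical stand-ins for real embeddings and scans every document; in practice the vectors would come from an embedding model and a vector database would do approximate search over millions of them.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(query_vec: list[float], doc_vecs: dict[str, list[float]], k: int = 2) -> list[str]:
    """Return the ids of the k documents most similar to the query."""
    ranked = sorted(
        doc_vecs.items(),
        key=lambda item: cosine(query_vec, item[1]),
        reverse=True,
    )
    return [doc_id for doc_id, _ in ranked[:k]]


# Hypothetical toy "embeddings"; real ones have hundreds of dimensions.
DOCS = {
    "refund-policy": [0.9, 0.1, 0.0],
    "shipping-times": [0.1, 0.9, 0.1],
    "api-reference": [0.0, 0.1, 0.9],
}

if __name__ == "__main__":
    print(top_k([0.8, 0.3, 0.0], DOCS))
```

Nothing here generates text. The system returns document ids, and the user reads the documents.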

Symptoms that say “this is an LLM problem”

The other half. If any of these describe your situation, you probably do want an LLM.

The output is free-form text, and the input is too varied to enumerate. Writing, summarising, paraphrasing, translating, conversing.
You’re integrating signals across modalities (text + image, text + audio) and producing text.
Few-shot or zero-shot is required, because you don’t have labelled training data and won’t have any soon.
The user is interacting through natural language, including ambiguous, idiomatic, multi-turn dialogue.
The task requires general world knowledge: common sense, broad context, things people just know.
The reasoning is multi-step and involves chaining facts the model is likely to know, often by writing out intermediate working.
You need code generation in any non-trivial sense: writing functions, refactoring, explaining.

The honest summary: LLMs are excellent natural-language interfaces and excellent flexible generators. They’re middling discriminative classifiers and expensive search algorithms. Use them where their generative-and-flexible nature is the value.

A combined decision flowchart

When a project arrives, run it through this rough order:

Is the output one of N labels? → encoder classifier.
Is the output structured (entities, spans, JSON)? → encoder NER model, encoder-decoder, or CRF.
Is the input pattern exact? → regex / FST.
Is the task an explicit logic problem (constraints, eligibility, verification)? → rule engine, CSP solver, or SAT/SMT.
Is the task search through a state space? → A*, planner, MCTS.
Is the data noisy, and do you need calibrated uncertainty? → Bayesian methods.
Is it semantic search of a corpus? → embeddings + reranker, possibly with hybrid retrieval.
Otherwise, is it a generative or natural-language task? → an LLM.

This skips dozens of edge cases. The point isn’t to follow it slavishly; it’s to make the question “have I considered the alternative?” reflexive rather than skipped.
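The ordering above can be written down as a lookup, if only to make the order explicit. This is a sketch, and the Task fields are hypothetical names for the questions in the list; answering them honestly for a real project is the hard part, and no code does that for you.

```python
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical yes/no answers to the flowchart questions."""
    fixed_label_set: bool = False
    structured_output: bool = False
    exact_pattern: bool = False
    explicit_logic: bool = False
    state_space_search: bool = False
    needs_calibrated_uncertainty: bool = False
    corpus_search_only: bool = False


# Checked top to bottom, in the same order as the flowchart.
CHECKS = [
    ("fixed_label_set", "encoder classifier"),
    ("structured_output", "encoder NER model, encoder-decoder, or CRF"),
    ("exact_pattern", "regex / FST"),
    ("explicit_logic", "rule engine, CSP solver, or SAT/SMT"),
    ("state_space_search", "A*, planner, MCTS"),
    ("needs_calibrated_uncertainty", "Bayesian methods"),
    ("corpus_search_only", "embeddings + reranker"),
]


def recommend(task: Task) -> str:
    """Return the first tool whose question is answered yes, else an LLM."""
    for field_name, tool in CHECKS:
        if getattr(task, field_name):
            return tool
    return "an LLM"


if __name__ == "__main__":
    print(recommend(Task(exact_pattern=True)))
    print(recommend(Task()))
```

Note where the LLM sits: it is the fall-through case, reached only after the cheaper options have been ruled out.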

The hybrid pattern

Most production systems built well don’t pick one of these. They use several, in a stack.

A common shape:

Edges: rules and regex pre-process input, validate, route, redact.
Retrieval: embeddings + reranker find relevant documents from a corpus.
Reasoning: a logic system or knowledge graph answers parts that need provable correctness.
Generation: an LLM produces the human-facing output, grounded in the retrieved and reasoned-over results.
Edges again: rules validate the LLM’s output, apply business policy, log for audit.

The LLM is one component, not the whole system. It does the part it’s good at, producing fluent, contextual, natural-language output, and the rest of the stack does the parts it isn’t.

This is the shape that beats both “LLM for everything” (too expensive, too unreliable) and “no LLM” (too rigid). Each tool earns its cost on the part it’s correct for.
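A skeletal version of that stack, to show how thin the LLM layer can be. Everything here is a simplified stand-in: retrieval is word overlap rather than embeddings plus a reranker, the reasoning layer is left out, and generate is any callable you pass in, so no particular model API is assumed. All function names are hypothetical.

```python
import re
from typing import Callable

EMAIL = re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+")
FALLBACK = "Sorry, I can't answer that."


def redact(text: str) -> str:
    """Input edge: strip email addresses before anything else sees them."""
    return EMAIL.sub("[EMAIL]", text)


def retrieve(query: str, corpus: list[str], k: int = 1) -> list[str]:
    """Retrieval stand-in: rank documents by shared lowercase words."""
    query_words = set(query.lower().split())
    ranked = sorted(
        corpus,
        key=lambda doc: len(query_words & set(doc.lower().split())),
        reverse=True,
    )
    return ranked[:k]


def validate(answer: str, max_chars: int = 500) -> bool:
    """Output edge: rule-based checks on whatever the model produced."""
    return bool(answer.strip()) and len(answer) <= max_chars and not EMAIL.search(answer)


def answer_query(query: str, corpus: list[str], generate: Callable[[str], str]) -> str:
    clean = redact(query)
    context = retrieve(clean, corpus)
    prompt = "Context:\n" + "\n".join(context) + "\n\nQuestion: " + clean
    draft = generate(prompt)  # the only step that needs an LLM
    return draft if validate(draft) else FALLBACK
```

Swapping the model means changing one argument. The redaction, retrieval, and validation around it stay deterministic and testable without calling a model at all.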

A condensed table

Putting the whole series in one table:

If the task is...	The correct tool is usually...
Classify into N labels	Fine-tuned encoder (DeBERTa) or TF-IDF + logreg
Tag entities in text	CRF or BERT-family NER
Match exact patterns	Regex / FST
Translate or summarise at scale	Encoder-decoder (T5/BART) or LLM
Embed text for search	Sentence-transformer model
Rerank candidate documents	Cross-encoder reranker
Apply thousands of business rules	Drools / IBM ODM
Recursive query over relations	Datalog
Verify a property holds	SAT / SMT solver (Z3)
Find shortest path	A* / Dijkstra
Schedule under constraints	CSP solver (OR-Tools)
Plan a sequence of actions	PDDL planner / GOAP
Play a perfect-information game	Alpha-beta / MCTS
Diagnose from symptoms	Bayesian network
Estimate state from noisy sensors	Kalman / particle filter
Allocate online experiments	Multi-armed bandit
Fit a custom probabilistic model	Stan / PyMC / Pyro
Generate fluent natural text	An LLM
Reason in multi-turn natural-language dialogue	An LLM
Generate or refactor code	An LLM, especially a reasoning model
Provide a natural-language interface to anything above	An LLM in front of the correct tool

Why this matters now

For a few years it’s been possible to ship “AI” without a thought, by routing every problem to GPT-4 or Claude. The default works often enough that teams don’t notice when it’s the wrong default. The cost of that complacency, at scale, is large: in money, in latency, in reliability, in vendor lock-in, in carbon.

The shift that’s coming, slowly, is the recognition that “use an LLM” is one answer among many, and a good engineer reaches for the correct tool. The correct tool is sometimes a transformer with billions of parameters. It’s sometimes a regex. It’s sometimes a Datalog query, a Kalman filter, a CSP solver, an A* search. Knowing the choices is the difference between an engineer and a person with one hammer.

The hope of this series, and the entry post that started it, is to make those choices visible. To put encoder models, n-grams, CRFs, A*, Drools, Z3, Stan, and the rest into the same mental kit as Claude and GPT. Not to make you use them. To make you choose, consciously, by what fits the problem, not by what’s at the top of the API reference.

The generative-versus-discriminative split pre-decides most of the choice. Discriminative tasks rarely need an LLM, and most of the wasteful AI in industry is people running discriminative work through generative models because the generative model is the one their team already has a key for. Exact patterns are regex’s job. Regulated decisions belong in a rule engine because auditability beats fluency. Provable correctness lives in a SAT or SMT solver because plausibility isn’t proof. Search problems belong to search algorithms: A*, planners, CSP solvers. Noisy data with uncertainty that has to be calibrated belongs to Bayesian methods. The production answer is almost always a hybrid stack with each tool doing the part it’s correct for and an LLM handling the natural-language interface that ties everything together.

This is the last post in the series. The previous nine each map a specific corner of the field; this one shows how to pick between them. If a future post in The AI Field Guide shows up, it’ll be on a specific corner that earns its own deep-dive: diffusion at scale, knowledge-graph reasoning in production, neurosymbolic systems that have actually shipped. None of those is on the schedule yet. For now, the field guide is what it is.

The aim, end-to-end, was to make the answer to “what should I use?” something better than “the model in the API I already have a key for.” If the choice is a little more deliberate now than it was nine posts ago, the series did its job.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.