The Boring Baseline That Wins

May 23, 2026 · 11 min read

You have 4,000 customer reviews. Half are positive, half are negative, more or less. You want a sentiment classifier. The team’s first instinct is to call the LLM API once per review and parse the response. The bill is real, the latency is real, and the accuracy on your specific data is unproven.

An afternoon’s work in scikit-learn produces a model that hits 92% accuracy, runs at 50,000 predictions per second on a CPU, and costs nothing per call. The afternoon includes lunch.

This shouldn’t be an unusual outcome, but increasingly it is.

There’s a recurring pattern in machine learning projects: someone reaches for the most sophisticated tool first, struggles with it, and only later discovers that a “boring” classical baseline, TF-IDF features fed into a logistic regression, would have solved the problem in an hour. The previous post covered the classical NLP that still ships in production. This post covers the classical machine learning that should be the default starting point for most text-classification, clustering, and topic-modelling projects.

Not because neural models are bad. Because for problems below a certain size and complexity, the boring tools are simply the correct answer.

TF-IDF: the trick that won’t die

TF-IDF, Term Frequency / Inverse Document Frequency, is a way of turning a piece of text into a vector of numbers based on which words appear in it and how distinctive those words are.

The intuition is simple. For each word in your vocabulary, multiply two numbers:

TF: how often the word appears in this document. Words the document uses often score high.
IDF: a penalty for words that appear in many documents. Words that are common everywhere (like “the” or “and”) score low. Words that appear in only a few documents score high.

The result is a feature vector where words that are distinctive to a document score highly and words that are common across the corpus score low. “Refund” in a customer-service ticket scores high; “the” scores near zero.
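The multiplication above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming whitespace tokenisation, raw counts for TF, and log(N / document frequency) for IDF. The three example documents and the `tfidf` helper are invented for illustration.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {word: weight} dict per document, using raw TF x log IDF."""
    tokenised = [doc.lower().split() for doc in docs]
    n_docs = len(tokenised)
    # Document frequency: in how many documents does each word appear?
    df = Counter(word for tokens in tokenised for word in set(tokens))
    vectors = []
    for tokens in tokenised:
        tf = Counter(tokens)  # term frequency within this document
        vectors.append({
            word: count * math.log(n_docs / df[word])  # TF x IDF
            for word, count in tf.items()
        })
    return vectors

# Toy corpus, invented for illustration.
docs = [
    "the refund never arrived",
    "the parcel arrived late",
    "the driver was rude",
]
weights = tfidf(docs)
# "the" appears in every document, so its IDF is log(3/3) = 0.
# "refund" appears in one document only, so it outweighs "arrived" (two documents).
```

scikit-learn’s TfidfVectorizer uses a smoothed IDF and length-normalises each vector by default, so its numbers will differ from this sketch even though the ranking idea is the same.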

That’s it. There’s no neural network, no training in the modern sense. You count words, you weight them, you have a feature vector. The whole pipeline is a hundred lines of Python or a single call to sklearn.feature_extraction.text.TfidfVectorizer.

And it works. Astonishingly well, for a fifty-year-old idea.

Logistic regression on TF-IDF features

Once you have TF-IDF vectors, you can feed them into any classifier. The most-used and least-glamorous choice is logistic regression: a linear model that learns a weight for each feature and predicts the probability of each class as a logistic function of the weighted sum.

For text classification with reasonable amounts of data (a few thousand to a few hundred thousand labelled examples), TF-IDF + logistic regression is often within a few percentage points of the best deep-learning model, and orders of magnitude cheaper to train, deploy, and explain.

Real numbers from real projects:

Sentiment analysis on movie reviews (50k examples, IMDB-style): TF-IDF + logistic regression hits ~89% accuracy. A fine-tuned BERT hits ~94%. A frontier LLM with a prompt hits ~92%. The first one trains in 30 seconds and runs at 50,000 predictions per second on a CPU.
Spam detection (millions of emails): TF-IDF + logistic regression or naive Bayes is still the production standard at most large mail providers. A fine-tuned transformer would be more accurate by a percentage point and cost a thousand times more to run at scale.
Topic classification of news articles (20-30 classes, 100k articles): TF-IDF + logistic regression matches BERT to within a couple of points and runs in milliseconds.

The pattern holds: when the task is “find a stable mapping from word patterns to a fixed set of labels,” and you have a few thousand examples, the linear model on lexical features is the sensible baseline.

When the linear model isn’t enough

The boring baseline has known weaknesses, and they’re the cases where you actually want a transformer.

Paraphrase and synonymy. “I’m furious” and “I’m absolutely livid” are obviously sentiment-equivalent to a human. TF-IDF treats “furious” and “livid” as unrelated features. Word2vec helps a bit; transformers largely solve it.
Long-range context. “The hotel was lovely, except for the bedbugs and the manager who threatened me.” A bag-of-words model averages “lovely” and “threatened” and, if it gets the answer right, does so by accident. A transformer reads it as a sentence and weights the second clause appropriately.
Negation and irony. “Best customer service ever, if you enjoy waiting four hours and being lied to.” TF-IDF sees “best” + “customer service” + “ever” and predicts positive. The transformer sees the structure.
Low-resource targets. If you only have 50 labelled examples, the linear model will almost certainly overfit; an LLM with zero-shot prompting may genuinely do better.

The rule of thumb is: if the task can be solved by paying attention to the correct keywords, the boring baseline works. If it requires understanding sentence structure or context, you need a transformer.
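The paraphrase weakness is easy to see directly. The sketch below vectorises the two sentences from the list above with TfidfVectorizer’s default settings and measures their cosine similarity. With the default tokeniser, “I’m” is dropped entirely because it splits into single-character tokens.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

pair = ["I'm furious", "I'm absolutely livid"]
vectorizer = TfidfVectorizer()
vectors = vectorizer.fit_transform(pair)

# The default token pattern keeps only tokens of two or more characters.
vocabulary = sorted(vectorizer.vocabulary_)

# The two sentences share no features, so their vectors are orthogonal.
similarity = cosine_similarity(vectors[0:1], vectors[1:2])[0, 0]
```

Here `vocabulary` is `['absolutely', 'furious', 'livid']` and `similarity` is 0.0: as far as the lexical features are concerned, the two sentences have nothing in common.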

Naive Bayes: the even more boring baseline

Naive Bayes is, in a real sense, more primitive than logistic regression. It assumes every feature is independent of every other feature given the class, a “naive” assumption that’s almost always false. And yet it often works fine, particularly for spam classification, document categorisation, and short-text problems.

The reason to reach for it is computational. Naive Bayes is blazing fast to train (it just counts word occurrences per class) and equally fast at inference. For applications where you need to retrain frequently (incoming email streams, news feeds, anything with model drift) it’s hard to beat. Multinomial naive Bayes specifically remains the correct default for short text classification with limited data.
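One way to handle the frequent-retraining case is incremental updating rather than refitting from scratch. The sketch below pairs MultinomialNB’s `partial_fit` with a HashingVectorizer, which has no vocabulary to rebuild when new words appear. That pairing is one reasonable choice among several, and the two batches are toy data invented for illustration.

```python
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.naive_bayes import MultinomialNB

# Stateless vectoriser: no vocabulary to refit when new words show up.
# alternate_sign=False keeps counts non-negative, which MultinomialNB requires.
vectorizer = HashingVectorizer(n_features=2**12, alternate_sign=False, norm=None)
model = MultinomialNB()

# Toy batches, invented for illustration (1 = spam, 0 = not spam).
batch_1 = (["win a free prize now", "meeting moved to friday"], [1, 0])
batch_2 = (["free prize claim now", "lunch on friday"], [1, 0])

for texts, labels in (batch_1, batch_2):
    # classes is required on the first call; passing it every time is harmless.
    model.partial_fit(vectorizer.transform(texts), labels, classes=[0, 1])

prediction = model.predict(vectorizer.transform(["claim your free prize"]))
```

Each `partial_fit` call only updates per-class counts, so folding in a new batch costs roughly as much as counting its words.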

Clustering: k-means and the friends you don’t think about

Sometimes the task isn’t “classify this into one of N labels” but “find natural groupings in this data.” That’s clustering, and the boring baseline is k-means.

K-means takes a set of points (your TF-IDF vectors, your image embeddings, whatever) and a number k, and finds k clusters such that each point is closer to its own cluster’s centre than to any other. It’s the algorithm taught in the first week of a machine learning course, and it’s still the correct tool for most clustering problems.

When you’d actually use it:

Customer segmentation based on behaviour vectors.
Document clustering for exploratory analysis (“what topics exist in this corpus?”).
Image quantisation, reducing a photograph to a palette of k colours.
Vector quantisation for compression and indexing in vector databases.

K-means has limitations: it assumes roughly spherical clusters, requires you to pick k, and can get stuck in bad local minima. But for “I have a pile of vectors and I want to know what’s in there,” it’s still the first tool to reach for.
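A minimal sketch of k-means in scikit-learn, using six two-dimensional points invented for illustration. In practice the input would be TF-IDF vectors or embeddings, and k would be chosen by inspecting the results rather than known in advance.

```python
import numpy as np
from sklearn.cluster import KMeans

# Two well-separated toy blobs, invented for illustration.
points = np.array([
    [1.0, 1.1], [0.9, 1.0], [1.2, 0.8],
    [8.0, 8.2], [7.9, 8.1], [8.3, 7.8],
])

# n_init restarts from several initialisations and keeps the best run, which
# reduces the risk of a bad local minimum; random_state makes it repeatable.
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
labels = kmeans.fit_predict(points)  # cluster ids are arbitrary (0 or 1)
centres = kmeans.cluster_centers_
inertia = kmeans.inertia_  # sum of squared distances to the nearest centre
```

Comparing `inertia` across a few values of k is a common, if rough, way to pick k: look for the point where adding clusters stops reducing it by much.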

For when k-means isn’t enough, there’s a small family of alternatives that are themselves still classical: DBSCAN for density-based clustering, hierarchical clustering when you want a dendrogram, Gaussian Mixture Models when you want soft assignments and uncertainty.

Topic modelling: LDA and NMF

A specific kind of unsupervised text analysis: what topics are present in this corpus, and which documents touch on which topics?

The classical answer is Latent Dirichlet Allocation (LDA, Blei et al., 2003). LDA models each document as a mixture of topics, and each topic as a distribution over words. The result, when applied to a corpus of news articles, might give you topics that look like “sports basketball game team player,” “politics election vote senator democrat,” “weather storm rain temperature forecast.” Each document is described as some percentage of each topic.

LDA is interpretable, deterministic-ish, and runs on modest hardware. It produces output a human can read (a topic is a list of weighted words) rather than a 768-dimensional vector. For exploratory analysis, journalism, and humanities research, it’s still extremely common.

Non-negative Matrix Factorisation (NMF) does a similar thing through different mathematics and often produces sharper, more separable topics. It is worth trying alongside LDA when topic modelling is what you actually want.

The neural alternatives (topic models built on top of contextual embeddings, such as BERTopic) produce subtler topics but are harder to interpret and slower to run. If your goal is “give me a readable list of what’s in this corpus,” LDA is still hard to beat.

A starter kit, in code

Eighty per cent of the practical problems in this post can be solved with a combination of:

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.cluster import KMeans
from sklearn.decomposition import LatentDirichletAllocation

The total surface area is maybe 30 functions. The mental model is small. The deployment cost is whatever it costs to run a Python process on a CPU. You can train, deploy, and serve all of these from a single laptop, and you can scale them out to billions of documents on commodity hardware without surprise.

That’s not nothing. That’s most of the practical value of machine learning, available without buying a GPU or calling an API.

A decision table

If your task is...	The boring baseline is...	Reach for a transformer when...
Sentiment / topic / intent classification with thousands of labelled examples	TF-IDF + logistic regression	You need to handle paraphrase, irony, or long context
Spam / phishing / abuse detection	Multinomial naive Bayes or logistic regression	Adversaries are actively rewording to evade keywords
Document categorisation across many classes	TF-IDF + linear SVM or logistic regression	Class definitions are subtle and require context
Customer segmentation	K-means on engineered features	You need clusters defined by complex relationships
"What topics exist in this corpus?"	LDA or NMF	You need topics defined by semantic meaning rather than co-occurring words
Initial baseline for any new text-classification problem	TF-IDF + logistic regression, even if you eventually replace it	Always start here. Knowing how the boring baseline scores tells you whether the fancy model is worth the cost.

Why teams skip this step

Three usual reasons.

First, the gradient of professional incentives points away from boring. Saying “I shipped a TF-IDF + logistic regression model” sounds like 2008. Saying “I fine-tuned a transformer” sounds like 2026. The actual customer doesn’t care.

Second, the tooling for fancy models is now better than the tooling for boring ones. Hugging Face, Replicate, and the LLM APIs have made it easier to call a transformer than to set up a scikit-learn pipeline, particularly for someone new to the field. The friction has inverted.

Third, “good enough” is hard to defend when the alternative is “best.” Nobody got fired for picking the state-of-the-art model. If you pick the linear baseline and it’s 92% accurate, someone will eventually ask why you didn’t use the 94% transformer. The answer is “because it costs a thousand times more and is two percentage points better and we don’t need those two points”, but that’s an explicit trade-off discussion most teams don’t want to have.

The fix is to make the boring baseline the explicit comparison point. If you can’t beat the linear model by a meaningful margin, the linear model wins. If you can, you’ve justified the upgrade with a number.

That discipline pays off on every project. TF-IDF and logistic regression remain the right place to start a text-classification problem with thousands of labelled examples. Multinomial naive Bayes still beats most things for very short text at very high throughput. K-means is still the first thing to reach for when you want to know what groups exist in a pile of vectors, and LDA or NMF are still the tools to use when “give me a readable list of topics” is the actual brief. None of these is the consolation prize. They are the score the fancier model has to beat by a margin large enough to justify its cost.

Most production ML in industry is still classical. The headlines belong to LLMs and the backend belongs to logistic regression. A 92% model that runs at fifty thousand predictions per second on a CPU usually beats a 94% model that costs a thousandth of a cent per call, once you multiply by the volume you’re actually serving. Always know the boring number before you commit to something fancier.
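The “multiply by the volume” step is worth doing explicitly. The sketch below uses the per-call price and throughput quoted above; the volume of one billion calls per month is an assumption invented for illustration, so substitute your own.

```python
# Assumed monthly volume, invented for illustration; substitute your own.
calls_per_month = 1_000_000_000

# "A thousandth of a cent per call", expressed in dollars.
cost_per_call_usd = 0.001 / 100

monthly_api_cost_usd = calls_per_month * cost_per_call_usd

# At 50,000 predictions per second, how long does one CPU need
# to serve the same volume?
predictions_per_second = 50_000
cpu_hours = calls_per_month / predictions_per_second / 3600
```

Under that assumed volume, the per-call model costs about $10,000 a month, while the linear model needs under six CPU-hours to serve the same traffic.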

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.