Bayesian Reasoning

June 17, 2026 · 13 min read

A self-driving car has a GPS that’s accurate to about three metres, a wheel-speed sensor that drifts, a camera that loses lane markings in heavy rain, and a LiDAR that can’t see through fog. Each sensor is wrong some of the time, in different ways. The car needs to know where it is, right now, with enough confidence to turn left.

The mathematics that combines these noisy signals into a single best estimate, with calibrated uncertainty about how good that estimate is, is older than the car. It’s older than computers. And it’s still the correct tool for the job.

In the previous post we covered logical reasoning: deduction over crisp facts and rules. This post covers what to do when the facts aren’t crisp. Most real-world signals are noisy; most real-world situations are uncertain. Probabilistic reasoning is the AI tradition that treats uncertainty as a first-class citizen.

The probabilistic side of classical AI is doing genuinely useful work in places transformers can’t. Knowing which problems call for which tool is the job of this post.

Bayes’s rule, the engine

The whole field rests on one piece of mathematics, four numbers, and the relationship between them. It’s worth building from scratch, because it takes a minute and everything else in this post leans on it.

Start with a counting exercise, no algebra required. A clinic screens 1,000 people for a disease that 10 of them actually have. The test is decent but not perfect: it catches 9 of the 10 who are ill, and it wrongly flags 99 of the 990 who are healthy. A patient’s test comes back positive. How worried should they be?

Count the positives. There are 9 + 99 = 108 of them, and only 9 are genuinely ill. So the probability of disease given a positive test is 9 out of 108, about 8%.

The gut says 90%, because the test “catches 9 of the 10 who are ill” and 9 out of 10 is 90%. But that 90% answers a different question than the patient is asking. It’s the test’s hit rate: of the people who are ill, how many does the test flag? What the patient wants to know is the reverse: of the people the test flags, how many are actually ill? Those two questions have wildly different answers here, and swapping one for the other is the single most common mistake people make with this kind of problem.

The reason they diverge is the count of healthy people. The test’s 90% hit rate gets applied to just 10 ill people and produces 9 true positives. Its 10% false-alarm rate (99 of the 990 healthy) gets applied to a population almost a hundred times larger, and produces 99 false positives. So the false positives swamp the true ones: out of 108 positive results, 99 are healthy people the test got wrong. A positive moves the patient from a 1% baseline (10 in 1,000) up to about 8%, worth a follow-up but nothing like the 90% the gut blurts out. Famously, most people (doctors included) guess far too high when asked this cold, because the hit rate is the number in front of them and the size of the healthy population is the number they forget.

Everything we just did was one fraction: the people who are ill and tested positive, divided by everyone who tested positive. The top of the fraction is “how often illness produces a positive test” (9 in 10) scaled by “how common illness is” (10 in 1,000). The bottom is “how common a positive test is overall” (108 in 1,000). In words: the probability of a claim, given some evidence, equals how often the claim produces that evidence, times how common the claim is, divided by how common the evidence is. In symbols:

P(claim | evidence) = P(evidence | claim) × P(claim) / P(evidence)

That’s Bayes’s rule. It isn’t a formula you have to take on trust; it’s the counting we just did, written down once so we never have to draw the 1,000 people again.
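To check that the formula and the counting really are the same thing, here is a minimal sketch that computes the clinic answer both ways. The function name `posterior` and its parameter names are made up for this illustration; the numbers are the ones from the clinic example.

```python
def posterior(prior, hit_rate, false_alarm_rate):
    """P(claim | evidence) via Bayes's rule for a yes/no claim."""
    # P(evidence): positives produced by true claims plus positives produced by false ones.
    p_evidence = hit_rate * prior + false_alarm_rate * (1 - prior)
    return hit_rate * prior / p_evidence

# The clinic numbers from the text.
ill, healthy = 10, 990
true_pos, false_pos = 9, 99

# Route 1: count heads, as in the text.
by_counting = true_pos / (true_pos + false_pos)

# Route 2: plug rates into Bayes's rule.
by_formula = posterior(
    prior=ill / (ill + healthy),          # 1% of people are ill
    hit_rate=true_pos / ill,              # the test flags 90% of the ill
    false_alarm_rate=false_pos / healthy, # and 10% of the healthy
)

print(f"counting: {by_counting:.4f}  formula: {by_formula:.4f}")
```

Both routes give 9/108, about 8%. The denominator inside `posterior` is the "how common the evidence is" term, built from the two ways a positive test can happen.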

The same arithmetic, stated for the situation you usually face: suppose you want to know P(disease | symptom), the probability the patient has a disease given that they have a symptom. The numbers you usually have are different: P(symptom | disease) (how often the disease causes that symptom), P(disease) (how common the disease is in general), and P(symptom) (how common the symptom is in general). Bayes’s rule says:

P(disease | symptom) = P(symptom | disease) × P(disease) / P(symptom)

That’s it. From “what we’d expect to see if X were true” plus “how common X is” plus “how common the evidence is,” you compute “how probable X is given the evidence.” This is the mechanism for updating beliefs in the face of evidence, and it’s the foundation of every probabilistic AI method.

The reason it’s such a big deal: the probabilities you can usually estimate are not the ones you usually want. Bayes’s rule is the bridge.

Bayesian networks

A Bayesian network is a graph where nodes are random variables and edges represent causal or statistical dependencies. Each node has a conditional probability table that says how its value depends on its parents. Together the network compactly represents a joint probability distribution over all the variables.

The classic example is medical diagnosis. Variables: Smoker, LungCancer, Bronchitis, XRayPositive, ShortnessOfBreath. The graph encodes which causes which. Given evidence (the X-ray was positive, the patient has shortness of breath), inference algorithms compute the posterior probability of each unobserved variable (does the patient have lung cancer? bronchitis?).
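As a rough sketch of what "inference" means here, the five-variable network can be queried by brute force: multiply the conditional probability tables to get the probability of each possible world, then sum over the worlds that match the evidence. The edges and every number in the tables below are assumptions invented for illustration, not clinical data, and real tools use much smarter algorithms than enumeration.

```python
from itertools import product

# Hypothetical conditional probability tables (illustrative numbers only).
P_SMOKER = 0.3
P_CANCER = {True: 0.10, False: 0.01}   # P(LungCancer | Smoker)
P_BRONCH = {True: 0.30, False: 0.05}   # P(Bronchitis | Smoker)
P_XRAY = {True: 0.90, False: 0.05}     # P(XRayPositive | LungCancer)
P_BREATH = {                           # P(ShortnessOfBreath | LungCancer, Bronchitis)
    (True, True): 0.90, (True, False): 0.70,
    (False, True): 0.60, (False, False): 0.10,
}

NAMES = ("Smoker", "LungCancer", "Bronchitis", "XRayPositive", "ShortnessOfBreath")

def bern(p_true, value):
    """Probability of a True/False value given P(True)."""
    return p_true if value else 1.0 - p_true

def joint(s, c, b, x, d):
    """Joint probability of one complete world: product of each node given its parents."""
    return (bern(P_SMOKER, s) * bern(P_CANCER[s], c) * bern(P_BRONCH[s], b)
            * bern(P_XRAY[c], x) * bern(P_BREATH[(c, b)], d))

def query(target, evidence):
    """P(target=True | evidence) by summing the joint over all consistent worlds."""
    num = den = 0.0
    for values in product([True, False], repeat=len(NAMES)):
        world = dict(zip(NAMES, values))
        if any(world[name] != value for name, value in evidence.items()):
            continue  # this world contradicts what we observed
        p = joint(*values)
        den += p
        if world[target]:
            num += p
    return num / den

evidence = {"XRayPositive": True, "ShortnessOfBreath": True}
print("P(LungCancer | evidence):", round(query("LungCancer", evidence), 3))
print("P(Bronchitis | evidence):", round(query("Bronchitis", evidence), 3))
```

Enumeration visits every combination of values, so it doubles in cost with each added variable. That is why practical Bayesian-network engines exploit the graph structure instead.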

Bayesian networks (Pearl, 1988) became the dominant approach to AI for diagnosis in the 1990s and 2000s. They run:

Medical diagnostic systems, including genuine production tools at hospitals.
Equipment fault diagnosis in aerospace, manufacturing, and energy.
Forensic analysis, combining DNA evidence, witness statements, and circumstantial evidence into a probability of guilt.
Decision-support systems in agriculture, environmental management, and risk assessment.

Commercial tools like Hugin and Netica give you graphical interfaces and inference engines; the open-source libraries pgmpy and bnlearn do the same job in code. For domains where you can hand-craft the dependency structure and elicit conditional probabilities from experts, Bayesian networks are a workhorse.

Hidden Markov Models, again

We met HMMs in Before the Transformer as a sequence-modelling tool. They’re equally a probabilistic-reasoning tool: an HMM is a Bayesian network with a particular structure (a Markov chain of hidden states emitting observations).

Production uses include:

Speech recognition acoustic modelling (replaced by deep learning in the 2010s, but still in some pipelines).
Bioinformatics, particularly profile HMMs for protein-family analysis.
Activity recognition from sensor data.
Financial regime detection, whether the market is in a “high-volatility” or “low-volatility” hidden state.

The core algorithms (Forward-Backward, Viterbi, Baum-Welch) are still the correct tool when your problem is “infer hidden states from a sequence of noisy observations.”
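For a feel of what "infer hidden states from noisy observations" looks like, here is a minimal Viterbi sketch for the market-regime example. The two states, the two observation symbols ("small" and "large" daily moves), and all the probabilities are assumptions chosen for illustration.

```python
import math

# Hypothetical two-regime HMM (all numbers are illustrative).
STATES = ("low_vol", "high_vol")
START = {"low_vol": 0.8, "high_vol": 0.2}
TRANS = {
    "low_vol": {"low_vol": 0.9, "high_vol": 0.1},
    "high_vol": {"low_vol": 0.2, "high_vol": 0.8},
}
EMIT = {
    "low_vol": {"small": 0.9, "large": 0.1},
    "high_vol": {"small": 0.3, "large": 0.7},
}

def viterbi(observations):
    """Most probable hidden-state path, computed in log space to avoid underflow."""
    # best[s] = log-probability of the best path so far that ends in state s.
    best = {s: math.log(START[s]) + math.log(EMIT[s][observations[0]]) for s in STATES}
    backpointers = []
    for obs in observations[1:]:
        new_best, pointers = {}, {}
        for s in STATES:
            # Which previous state gives the best path into s?
            prev = max(STATES, key=lambda r: best[r] + math.log(TRANS[r][s]))
            new_best[s] = best[prev] + math.log(TRANS[prev][s]) + math.log(EMIT[s][obs])
            pointers[s] = prev
        best = new_best
        backpointers.append(pointers)
    # Trace back from the best final state.
    state = max(STATES, key=lambda s: best[s])
    path = [state]
    for pointers in reversed(backpointers):
        state = pointers[state]
        path.append(state)
    return path[::-1]

moves = ["small", "small", "large", "large", "large", "small", "small"]
print(viterbi(moves))
```

With these numbers the decoded path switches to `high_vol` for the run of large moves and back again afterwards. A single large move in a long run of small ones would usually not be enough to flip the regime, because the transition probabilities make switching expensive.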

Kalman filters: state estimation under noise

A Kalman filter (Rudolf Kalman, 1960) is the correct tool when:

You have a system whose state evolves over time according to known dynamics (with some noise).
You have noisy measurements of that state.
You want the best estimate of the current state, plus the uncertainty about that estimate.

The Kalman filter combines the prediction from the dynamics (“given where I was last time and what I did, where should I be now?”) with the new measurement (“what does the sensor say?”) to produce a fused estimate. It does this optimally for linear systems with Gaussian noise.
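The predict-then-update cycle is short enough to write out for a single number. This is a minimal sketch for a one-dimensional position, assuming the simplest possible dynamics (new position = old position + commanded movement); the function name, the noise variances, and the measurements are all made up for illustration. Real systems track vectors and use matrices, but the logic is the same.

```python
def kalman_step(mean, var, control, measurement, process_var, measurement_var):
    """One predict-then-update cycle of a scalar Kalman filter."""
    # Predict: apply the assumed dynamics (x_next = x + control); uncertainty grows.
    pred_mean = mean + control
    pred_var = var + process_var
    # Update: blend prediction and measurement, weighted by how much we trust each.
    gain = pred_var / (pred_var + measurement_var)
    new_mean = pred_mean + gain * (measurement - pred_mean)
    new_var = (1.0 - gain) * pred_var  # fused estimate is more certain than either input
    return new_mean, new_var

mean, var = 0.0, 100.0  # vague initial belief about position, in metres
# Each step: we command a 1 m move, then read a noisy position sensor.
for control, measurement in [(1.0, 1.2), (1.0, 1.9), (1.0, 3.1), (1.0, 4.0)]:
    mean, var = kalman_step(mean, var, control, measurement,
                            process_var=0.1, measurement_var=9.0)
    print(f"estimate={mean:.2f}  variance={var:.2f}")
```

The gain is the whole trick: when the prediction is uncertain relative to the sensor, the gain is near 1 and the filter believes the sensor; when the sensor is the noisier of the two, the gain is near 0 and the filter believes the dynamics.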

Kalman filters run:

GPS-INS fusion in every commercial aircraft, ship, and military platform.
Self-driving cars, fusing GPS, LiDAR, IMU, wheel odometry, and camera data.
Spacecraft navigation. Apollo used a Kalman filter; so does every modern probe.
Tracking radar. Aircraft, missiles, weather phenomena.
Financial time series modelling. State-space models for term structures and yields.

When the system is non-linear, you use an Extended Kalman Filter (linearise around the current estimate) or an Unscented Kalman Filter (push a small set of deterministically chosen sample points through the non-linear model instead of linearising it). For more general non-linear or non-Gaussian problems, you escalate to particle filters.

Kalman filtering is one of those techniques that quietly underpins half the world’s working software and rarely gets credited.

Particle filters: when Gaussian assumptions break

A particle filter (or Sequential Monte Carlo) replaces the parametric Gaussian distribution of a Kalman filter with a sample-based representation. Instead of “the state is a Gaussian centred at x with covariance Σ,” it’s “the state is represented by 10,000 sampled positions, weighted by how well each one explains the data.”

This is more general than Kalman filtering: it handles non-linear dynamics, non-Gaussian noise, and multimodal posteriors (the state could be one of several plausible places, and we’re not sure which). The trade-off is computational: particle filters are slower and less elegant than Kalman filters.
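A minimal sketch of the idea, for a robot moving along a one-dimensional corridor: move every particle, weight it by how well it explains the sensor reading, then resample. This is the basic "bootstrap" variant under assumed Gaussian motion and sensor noise; the function name, noise levels, and the simulated robot are all invented for illustration.

```python
import math
import random

def particle_filter_step(particles, control, measurement, motion_std, sensor_std, rng):
    """One bootstrap particle filter step: move, weight, resample."""
    # Move every particle through the assumed noisy motion model.
    moved = [p + control + rng.gauss(0.0, motion_std) for p in particles]
    # Weight each particle by how well it explains the measurement.
    weights = [math.exp(-0.5 * ((measurement - p) / sensor_std) ** 2) for p in moved]
    if sum(weights) == 0.0:  # every particle is implausible; fall back to uniform weights
        weights = [1.0] * len(moved)
    # Resample in proportion to weight: good guesses multiply, bad ones die out.
    return rng.choices(moved, weights=weights, k=len(moved))

rng = random.Random(0)
true_position = 2.0
particles = [rng.uniform(0.0, 20.0) for _ in range(2000)]  # no idea where we start

for _ in range(10):
    true_position += 1.0                                # the robot moves 1 m
    measurement = true_position + rng.gauss(0.0, 0.5)   # simulated noisy sensor
    particles = particle_filter_step(particles, 1.0, measurement,
                                     motion_std=0.2, sensor_std=0.5, rng=rng)

estimate = sum(particles) / len(particles)
print(f"true={true_position:.2f}  estimate={estimate:.2f}")
```

Nothing in the step requires the motion or sensor model to be linear or Gaussian; swapping in a different weighting function is all it takes, which is where the generality comes from. Taking the mean of the particles as the estimate only makes sense when the cloud has a single cluster.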

Production uses:

Robot localisation. Where is the robot in the building, given a map and noisy sensor data? Particle filters are the standard answer.
Object tracking in video under heavy occlusion.
Time-series state estimation when the model is non-linear.
Wildlife population modelling with noisy mark-recapture data.

If your problem is “I have noisy observations and a process model, and Kalman filtering’s assumptions don’t hold,” reach for a particle filter.

Markov decision processes and reinforcement learning

A Markov Decision Process (MDP) is the formalism behind classical reinforcement learning. The world is a set of states; from each state you can take actions; each action probabilistically takes you to a next state and gives you a reward. The goal is to find a policy, a mapping from states to actions, that maximises long-run reward.

MDPs and their variants run:

Industrial control systems for process optimisation (chemical plants, HVAC scheduling).
Robotics control (trajectory planning, manipulation).
Game-playing systems. AlphaZero is solving MDPs at scale.
Operations research problems: inventory management, queue control, network routing.

Algorithms include value iteration, policy iteration, Q-learning, SARSA. Modern deep RL (DQN, PPO, SAC) inherits the MDP framing but learns the value or policy function with neural networks.

For problems with a small state space and known dynamics, classical MDP algorithms still beat deep RL on tractability and sample efficiency.
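Here is a minimal value-iteration sketch on a toy problem small enough to check by hand. The two-state machine-maintenance MDP, its action names, rewards, and discount factor are all hypothetical, chosen only to show the shape of the algorithm.

```python
GAMMA = 0.9  # discount factor: how much future reward counts relative to reward now

# Hypothetical maintenance problem. TRANSITIONS[state][action] is a list of
# (probability, next_state, reward) triples.
TRANSITIONS = {
    "working": {
        "run":     [(0.9, "working", 10.0), (0.1, "broken", 10.0)],
        "service": [(1.0, "working", 6.0)],
    },
    "broken": {
        "wait":   [(1.0, "broken", 0.0)],
        "repair": [(1.0, "working", -5.0)],
    },
}

def q_value(values, state, action):
    """Expected reward now plus discounted value of wherever we land."""
    return sum(p * (r + GAMMA * values[nxt]) for p, nxt, r in TRANSITIONS[state][action])

def value_iteration(tolerance=1e-9):
    """Repeat the Bellman update until the values stop changing."""
    values = {s: 0.0 for s in TRANSITIONS}
    while True:
        new_values = {s: max(q_value(values, s, a) for a in TRANSITIONS[s])
                      for s in TRANSITIONS}
        converged = max(abs(new_values[s] - values[s]) for s in TRANSITIONS) < tolerance
        values = new_values
        if converged:
            break
    # The policy picks, in each state, the action with the best Q-value.
    policy = {s: max(TRANSITIONS[s], key=lambda a: q_value(values, s, a))
              for s in TRANSITIONS}
    return values, policy

values, policy = value_iteration()
print(policy)
print({s: round(v, 2) for s, v in values.items()})
```

Because the dynamics are written down explicitly, the solver never has to interact with the world or collect samples. Deep RL gives up exactly that advantage in exchange for handling state spaces too large to enumerate.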

Multi-armed bandits

A specific kind of MDP gets its own name: the multi-armed bandit problem. You have several “arms” (options) you can pull, each with an unknown reward distribution. You want to maximise total reward over time, balancing exploration (trying arms to learn their rewards) and exploitation (pulling the arm that seems best).

Bandit algorithms (epsilon-greedy, UCB, Thompson sampling) are the production answer for:

A/B testing where you want to dynamically allocate traffic toward better-performing variants (often called “multi-armed bandit testing” or “adaptive experimentation”).
Ad serving. Which ad to show this user given your uncertainty about how they’ll respond.
News and product recommendation with cold-start items.
Clinical trial design. Adaptive trials that allocate more patients to treatments that look better.
Hyperparameter optimisation. Bayesian optimisation for ML model tuning is bandit-flavoured.

Experimentation platforms like Optimizely and most ad-tech platforms use bandit algorithms in production. They’re cheaper, faster, and often more ethical than rigid A/B tests.
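Thompson sampling is compact enough to sketch in full. The version below assumes yes/no rewards (a click or a conversion) and a uniform prior on each arm; the function name, the three "true" conversion rates, and the simulated users are invented for illustration.

```python
import random

def thompson_sampling(true_rates, rounds, rng):
    """Beta-Bernoulli Thompson sampling; returns how often each arm was pulled."""
    successes = [0] * len(true_rates)
    failures = [0] * len(true_rates)
    for _ in range(rounds):
        # Draw one plausible conversion rate per arm from its Beta posterior...
        draws = [rng.betavariate(s + 1, f + 1) for s, f in zip(successes, failures)]
        # ...and play the arm whose draw is highest.
        arm = draws.index(max(draws))
        # Simulated user response (in production this is the real outcome).
        if rng.random() < true_rates[arm]:
            successes[arm] += 1
        else:
            failures[arm] += 1
    return [s + f for s, f in zip(successes, failures)]

# Three hypothetical variants with conversion rates unknown to the algorithm.
pulls = thompson_sampling([0.04, 0.05, 0.12], rounds=5000, rng=random.Random(1))
print(pulls)
```

Exploration falls out of the sampling step: an arm with little data has a wide posterior, so it occasionally produces a high draw and gets tried. As evidence accumulates the posteriors narrow and traffic concentrates on the best arm.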

Decision theory

The most general framework: combine probabilities with utilities (how much you care about each outcome) to choose actions that maximise expected utility.

This is the foundation of:

Insurance pricing: expected loss given probability of claims.
Medical decision-making: treat or wait, given probability of disease and utility of various outcomes.
Engineering safety analysis: failure mode and effects analysis.
Climate policy: decisions under deep uncertainty about future states of the world.

Decision theory is less an algorithm than a framework, but it’s the unifying structure for problems where you need to act under uncertainty with explicit trade-offs.
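The treat-or-wait case makes a compact worked example. The sketch below weights each outcome's utility by its probability and picks the action with the highest total. The utility numbers are hypothetical, on an arbitrary scale where higher is better; in practice, eliciting them is the hard part.

```python
def best_action(p_disease, utilities):
    """Return the action with the highest expected utility, plus all the expectations."""
    expected = {
        action: p_disease * u["disease"] + (1 - p_disease) * u["healthy"]
        for action, u in utilities.items()
    }
    return max(expected, key=expected.get), expected

# Hypothetical utilities (illustrative only).
UTILITIES = {
    "treat": {"disease": -10.0, "healthy": -5.0},   # treatment has costs and side effects
    "wait":  {"disease": -100.0, "healthy": 0.0},   # untreated disease is very costly
}

# 1% is the clinic's baseline rate; 8% is the posterior after a positive test.
for p in (0.01, 0.08, 0.50):
    action, expected = best_action(p, UTILITIES)
    print(f"P(disease)={p:.2f} -> {action}  {expected}")
```

With these particular utilities the decision flips at a probability of a little over 5%, so the same positive test that only raises the probability to about 8% is still enough to change what you should do.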

Probabilistic programming

A modern development worth knowing about: probabilistic programming languages (Stan, PyMC, Pyro, NumPyro, Edward, Turing.jl). These let you specify a probabilistic model in code (prior distributions, likelihood, observed data), and the language handles inference, typically by Markov Chain Monte Carlo or variational methods.

The promise: any model you can write down, you can fit to data, with proper uncertainty quantification. The reality: probabilistic programming has become the standard tool for Bayesian statistical modelling in science and increasingly in industry.
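To see what these languages automate, here is a hand-rolled sketch of the simplest MCMC method (random-walk Metropolis) estimating a conversion rate from 30 conversions in 200 visitors. This is plain Python, not the API of any of the libraries above, and the data, function names, and tuning values are assumptions for illustration. Real probabilistic programming languages use far more efficient samplers and add convergence diagnostics.

```python
import math
import random

def log_posterior(p, conversions, visitors):
    """Unnormalised log posterior: uniform prior on p times a binomial likelihood."""
    if not 0.0 < p < 1.0:
        return -math.inf  # outside the prior's support
    return conversions * math.log(p) + (visitors - conversions) * math.log(1.0 - p)

def metropolis(conversions, visitors, steps, rng, step_size=0.05):
    """Random-walk Metropolis sampler for the conversion rate p."""
    samples, p = [], 0.5
    current = log_posterior(p, conversions, visitors)
    for _ in range(steps):
        proposal = p + rng.gauss(0.0, step_size)
        candidate = log_posterior(proposal, conversions, visitors)
        # Accept with probability min(1, posterior ratio); otherwise stay put.
        if rng.random() < math.exp(min(0.0, candidate - current)):
            p, current = proposal, candidate
        samples.append(p)
    return samples[steps // 5:]  # discard the first fifth as burn-in

samples = sorted(metropolis(conversions=30, visitors=200, steps=20000,
                            rng=random.Random(0)))
mean = sum(samples) / len(samples)
low = samples[int(0.025 * len(samples))]
high = samples[int(0.975 * len(samples))]
print(f"posterior mean={mean:.3f}  95% credible interval=({low:.3f}, {high:.3f})")
```

This particular model has a known closed-form answer (a Beta(31, 171) posterior with mean 31/202), which makes it a handy sanity check. The point of a probabilistic programming language is that the same workflow still works when no closed form exists.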

Production uses include:

Bayesian A/B testing with proper credible intervals rather than ad-hoc p-values.
Clinical trial analysis in pharma.
Marketing mix modelling, attributing sales to advertising channels.
Forecasting at companies that care about uncertainty intervals, not just point estimates.

If your problem is “fit a custom probabilistic model to data and extract calibrated uncertainty,” reach for Stan or PyMC.

A decision table

If your task is...	Reach for...
Diagnose a problem from symptoms with hand-crafted dependencies	A Bayesian network
Estimate a continuously-evolving state from noisy sensors	Kalman filter (linear-Gaussian) or particle filter (non-linear)
Localise a robot in a known map	Particle filter (Monte Carlo localisation)
Allocate users to test variants and adapt to results	Multi-armed bandit (Thompson sampling, UCB)
Tune hyperparameters of an ML model	Bayesian optimisation (a kind of bandit)
Make sequential decisions under uncertainty in a known environment	An MDP solver: value iteration if the state space is small, RL if not
Fit a custom probabilistic model to data with uncertainty estimates	A probabilistic programming language (Stan, PyMC, Pyro)
Decide whether the cost of an action is worth its expected benefit	Decision theory: model probabilities and utilities explicitly
Generate a fluent paragraph from a prompt	An LLM; nothing in the classical probabilistic toolkit generates fluent text

Why probabilistic reasoning is having a quieter renaissance

While LLMs got the headlines, probabilistic methods have been quietly improving:

Bayesian deep learning: neural networks that produce calibrated uncertainty.
Conformal prediction: distribution-free uncertainty quantification for any base model, including LLMs.
Probabilistic programming on GPUs: Pyro and NumPyro make MCMC and variational inference tractable at scale.
Causal inference at scale: structural causal models for counterfactual reasoning, increasingly used in tech-company experimentation platforms.

These aren’t competing with LLMs. They’re complementing them, particularly in domains where uncertainty matters more than fluency: safety-critical systems, scientific inference, regulated industries.

Probability is the third leg of classical AI and the one that quietly underpins more critical infrastructure than the other two combined. Bayes’s rule does the actual work in every method on this page, updating beliefs in light of evidence, in the right direction, with the right weights. Bayesian networks encode causal structure for diagnosis. Kalman filters fuse noisy sensors into state estimates in every aircraft, ship, and Mars rover. Particle filters extend the same story to non-linear, multimodal cases and are how robots know where they are inside a building. MDPs and reinforcement learning model sequential decisions when the world has dynamics. Multi-armed bandits run adaptive A/B testing and ad allocation. Probabilistic programming languages let scientists fit custom models with calibrated uncertainty rather than ad-hoc point estimates.

None of these compete with LLMs. They complement them, particularly anywhere uncertainty matters more than fluency: safety-critical systems, scientific inference, regulated industries. The shape that’s emerging is deep learning for perception, probabilistic methods for reasoning, and both producing calibrated outputs the next layer of the system can use.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.