You type a question. A few seconds later, coherent, fluent text appears on your screen – text that seems to understand what you asked, that follows instructions, that writes code and poetry and legal briefs. It’s natural to wonder: what is actually happening in there?
The answer is simultaneously simpler and stranger than most people expect. Large language models (LLMs) don’t “understand” language in the way humans do. They don’t have beliefs, memories, or intentions. What they do – and they do it extraordinarily well – is predict the next token in a sequence. That’s it. One token at a time, over and over, until the response is complete.
But “just predicting the next token” turns out to be a surprisingly rich activity. To predict well, the model needs to capture something about syntax, semantics, logic, world knowledge, coding conventions, social norms, and the structure of arguments. Not because anyone told it to. Because all of those things are reflected in the patterns of text that humans produce, and the model learned those patterns by reading a significant fraction of the internet.
This post is about how that works. Not the marketing version – the actual mechanisms. Tokens, embeddings, attention, transformers, training, and the gap between what these systems can do and what they “understand.”
Tokens: the atoms of text
LLMs don’t read characters. They don’t read words, either. They read tokens – chunks of text that are somewhere between characters and words in size.
The word “understanding” might be a single token. The word “tokenisation” might be split into “token” + “isation”. A common word like “the” is almost certainly a single token in any major tokeniser. An uncommon word like “antidisestablishmentarianism” would be split into several. Numbers are tokenised digit by digit or in small groups. Code tokens include things like def, return, (), and \n.
Why tokens instead of characters or words? Characters are too granular – a model working character by character would need enormous context windows to see meaningful patterns. Words are too coarse – there are hundreds of thousands of distinct words in English alone, and the model would need a separate entry for every inflection, tense, and compound. Tokens hit a practical sweet spot.
The process of breaking text into tokens is called tokenisation, and the dominant method is Byte Pair Encoding (BPE), originally described by Philip Gage in 1994 as a data compression algorithm and later adapted for neural language models by Sennrich, Haddow, and Birch in 2016.
BPE works by starting with individual bytes (or characters) and iteratively merging the most frequent pair. Here’s a simplified example:
Suppose your training text contains the sequence low lower lowest repeatedly. BPE starts with individual characters: l, o, w, e, r, s, t, and so on. It counts every adjacent pair. If l + o appears most frequently, it merges them into a new token lo. Now it counts again. If lo + w is the most frequent pair, it merges them into low. Then low + e might merge into lowe, and so on. The process continues for a fixed number of merge operations (typically 30,000 to 100,000), producing a vocabulary of that many tokens.
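The merge loop above can be sketched in a few lines of Python. This is a toy illustration of the training step only (no byte-level handling, no tie-breaking rules from any real tokeniser):

```python
from collections import Counter

def bpe_merges(words, num_merges):
    """Learn BPE merge rules from a list of words (toy illustration)."""
    # Represent each word as a tuple of symbols, starting with characters.
    corpus = Counter(tuple(w) for w in words)
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus.
        pairs = Counter()
        for word, freq in corpus.items():
            for a, b in zip(word, word[1:]):
                pairs[(a, b)] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)   # most frequent pair wins
        merges.append(best)
        # Apply the merge everywhere it occurs.
        new_corpus = Counter()
        for word, freq in corpus.items():
            out, i = [], 0
            while i < len(word):
                if i + 1 < len(word) and (word[i], word[i + 1]) == best:
                    out.append(word[i] + word[i + 1])
                    i += 2
                else:
                    out.append(word[i])
                    i += 1
            new_corpus[tuple(out)] += freq
        corpus = new_corpus
    return merges

print(bpe_merges(["low", "low", "lower", "lowest"], 3))
# [('l', 'o'), ('lo', 'w'), ('low', 'e')]
```

Running it on the example corpus reproduces exactly the merge sequence described above: lo, then low, then lowe.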
The result is a vocabulary where common words are single tokens, common subwords are single tokens, and rare or novel words get split into known pieces. This is crucial for handling words the model has never seen before – it can still process them, just broken into familiar subword units.
Most modern LLMs use vocabularies of 30,000 to 100,000 tokens. GPT-4 uses around 100,000. Claude uses a similar order of magnitude. The exact vocabulary depends on the training data and the number of BPE merges performed.
A practical consequence: LLMs “see” text differently from humans. The sentence “I saw a dog” might be four tokens. The sentence “I saw a Labradoodle” might be five or six, because “Labradoodle” gets split into subwords. The model doesn’t see characters – it sees a sequence of integer IDs, each mapping to a token in its vocabulary. Token 1547 might be “the”. Token 28903 might be “ function” (with a leading space – spaces are part of tokens in most schemes). Token 85 might be a newline character.
This tokenisation step is entirely mechanical. It happens before the model sees anything. The model never operates on raw text – only on sequences of token IDs.
Embeddings: giving tokens meaning
A token ID is just a number. The model needs something richer – a representation that captures the meaning of each token and its relationship to other tokens.
This is where embeddings come in. Each token in the vocabulary is assigned a high-dimensional vector – a list of numbers, typically 4,096 to 12,288 of them in modern LLMs. These vectors are learned during training, not hand-crafted. At the start of training, they’re initialised randomly. By the end, tokens with similar meanings have vectors that point in similar directions in this high-dimensional space.
The classic example, from Mikolov et al.’s 2013 word2vec paper, is that the vector for “king” minus the vector for “man” plus the vector for “woman” gives a vector very close to “queen”. This isn’t a trick – it falls out naturally from training on large amounts of text, because the contexts in which these words appear encode their relationships.
In an LLM, the embedding layer is the first thing that happens. The input sequence of token IDs gets converted into a sequence of embedding vectors. If your input is 500 tokens and each token maps to a vector of 8,192 dimensions, you now have a 500 x 8,192 matrix of floating-point numbers. This matrix is what flows into the rest of the model.
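Mechanically, the embedding layer is just a table lookup: a matrix with one row per vocabulary entry, indexed by token ID. A minimal sketch with toy dimensions (random values stand in for the learned weights, and the token IDs are hypothetical):

```python
import numpy as np

vocab_size, d_model = 1000, 8   # toy sizes; real models are ~100,000 x ~8,192
rng = np.random.default_rng(0)
embedding_table = rng.normal(size=(vocab_size, d_model))  # learned during training

token_ids = [17, 42, 17, 905]        # hypothetical IDs for a 4-token input
x = embedding_table[token_ids]       # lookup: (sequence_length, d_model)
print(x.shape)                       # (4, 8)
```

Note that token 17 appears twice and gets the identical vector both times – which is exactly the problem positional encoding exists to solve.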
But there’s a problem: the embedding for a token is the same regardless of where it appears in the sequence. The word “bank” has one embedding, whether it means a river bank, a financial bank, or a bank shot in pool. The model needs to know not just what each token is, but where it sits in the sequence.
Positional encoding solves this. The original transformer paper (Vaswani et al., 2017) used sinusoidal functions to generate position-dependent vectors that are added to the token embeddings. More recent models use Rotary Position Embeddings (RoPE, Su et al., 2021), which encode relative positions by rotating the embedding vectors. The details vary, but the purpose is the same: after positional encoding, the model can distinguish between “The dog bit the man” and “The man bit the dog”.
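The original sinusoidal scheme is simple enough to write out directly. A sketch of the encoding from Vaswani et al. (2017), with even dimensions using sine and odd dimensions cosine:

```python
import numpy as np

def sinusoidal_positions(seq_len, d_model):
    """Sinusoidal positional encodings (Vaswani et al., 2017)."""
    pos = np.arange(seq_len)[:, None]          # (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # (1, d_model/2) frequency index
    angles = pos / (10000 ** (2 * i / d_model))
    enc = np.zeros((seq_len, d_model))
    enc[:, 0::2] = np.sin(angles)              # even dimensions: sine
    enc[:, 1::2] = np.cos(angles)              # odd dimensions: cosine
    return enc

# These vectors are added to the token embeddings, so the same token
# at different positions produces different inputs to the first block.
pe = sinusoidal_positions(seq_len=50, d_model=8)
print(pe.shape)   # (50, 8)
```

Each position gets a unique vector, and the varying frequencies mean nearby positions get similar vectors – which is what lets the model reason about relative distance.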
The transformer: the architecture underneath
Every major LLM – GPT, Claude, Llama, Gemini – is built on the transformer architecture, introduced in a 2017 paper by researchers at Google with the quietly confident title “Attention Is All You Need”. Before transformers, language models used recurrent neural networks (RNNs) that processed text one word at a time, left to right, like reading a sentence with a finger. This worked, but it was slow and struggled with long-range dependencies – by the time the model reached the end of a paragraph, it had largely forgotten the beginning.
Transformers threw that away. Instead of processing text sequentially, a transformer looks at the entire input at once and figures out which parts relate to which other parts. It’s the difference between reading a sentence word by word and seeing the whole sentence on a page. This parallelism made transformers dramatically faster to train, and the ability to attend to any part of the input regardless of distance made them dramatically better at capturing meaning.
The transformer is built from a stack of identical blocks, each containing two key components: an attention mechanism (which figures out what to pay attention to) and a feed-forward network (which processes the result). We’ll look at both, starting with attention – the mechanism that made the whole thing work.
Attention: the mechanism that changed everything
The core innovation is the attention mechanism. It’s what allows the model to relate different parts of the input to each other, regardless of distance.
Here’s the intuition. Consider the sentence: “The cat sat on the mat because it was tired.” What does “it” refer to? The cat, obviously. But how does the model figure that out? It needs to look back at every previous token and determine which ones are relevant to interpreting “it” in this context.
Attention lets the model do exactly this. For each token in the sequence, the model computes three things from its embedding:
- A query vector: “What am I looking for?”
- A key vector: “What do I contain?”
- A value vector: “What information should I provide if I’m relevant?”
These are computed by multiplying the token’s embedding by three learned weight matrices (Q, K, and V). Then, for each token, the model computes the dot product of its query with every other token’s key. This produces a set of attention scores – numbers indicating how relevant each other token is to the current one.
These scores are passed through a softmax function (which converts them into probabilities that sum to 1), and then used to compute a weighted average of the value vectors. The result is a new representation of the current token that incorporates information from every other token in the sequence, weighted by relevance.
In the “it was tired” example, the attention mechanism would assign a high score to the pairing of “it” (query) with “cat” (key), because the model has learned from training data that pronouns attend to their antecedents.
The mathematical formulation, from the original transformer paper, is:
Attention(Q, K, V) = softmax(QK^T / sqrt(d_k)) V
The sqrt(d_k) term is a scaling factor (d_k is the dimension of the key vectors) that prevents the dot products from becoming too large, which would push the softmax into regions where the gradients are tiny and learning stalls.
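The whole computation fits in a few lines of NumPy. A sketch of scaled dot-product attention with random Q, K, V standing in for the projected embeddings (single head, no causal mask):

```python
import numpy as np

def attention(Q, K, V):
    """Scaled dot-product attention: softmax(QK^T / sqrt(d_k)) V."""
    d_k = K.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (seq, seq) relevance scores
    # Numerically stable softmax: each row becomes a probability distribution.
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ V, weights          # weighted average of value vectors

rng = np.random.default_rng(0)
seq_len, d_k = 5, 16
Q, K, V = (rng.normal(size=(seq_len, d_k)) for _ in range(3))
out, w = attention(Q, K, V)
print(out.shape)          # (5, 16): one updated representation per token
print(w.sum(axis=-1))     # each row of attention weights sums to 1
```

Each output row is a blend of all the value vectors, weighted by how strongly that token's query matched each key.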
Multi-head attention: parallel perspectives
A single attention computation captures one kind of relationship between tokens. But language is rich – a single token might simultaneously need to attend to its syntactic subject, the verb it modifies, the topic of the paragraph, and the format of the document.
Multi-head attention runs multiple attention computations in parallel, each with its own Q, K, and V weight matrices. A model with 32 attention heads computes 32 different sets of attention patterns simultaneously. The results are concatenated and projected back to the model’s dimension through another learned weight matrix.
Different heads learn to capture different kinds of relationships. Research by Clark et al. (2019) and others has found that in trained models, some attention heads specialise in syntactic dependencies (subject-verb agreement), some in positional relationships (attending to the previous token), some in semantic relationships, and some in patterns that are difficult for humans to interpret.
The key insight is that nobody tells the heads what to specialise in. The specialisation emerges from training. The model discovers that attending to different kinds of information in parallel produces better predictions.
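The split-compute-concatenate pattern can be sketched as follows (random weight matrices stand in for learned ones, and the per-head attention is the same scaled dot-product described above):

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def multi_head_attention(x, Wq, Wk, Wv, Wo, n_heads):
    """Split d_model across n_heads, attend per head, concatenate, project."""
    seq_len, d_model = x.shape
    d_head = d_model // n_heads
    Q, K, V = x @ Wq, x @ Wk, x @ Wv
    # Reshape to (n_heads, seq_len, d_head) so each head attends independently.
    def split(t):
        return t.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    Qh, Kh, Vh = split(Q), split(K), split(V)
    scores = Qh @ Kh.transpose(0, 2, 1) / np.sqrt(d_head)
    out = softmax(scores) @ Vh                       # (n_heads, seq, d_head)
    # Concatenate the heads and apply the output projection.
    concat = out.transpose(1, 0, 2).reshape(seq_len, d_model)
    return concat @ Wo

rng = np.random.default_rng(0)
seq_len, d_model, n_heads = 6, 32, 4
x = rng.normal(size=(seq_len, d_model))
W = [rng.normal(size=(d_model, d_model)) for _ in range(4)]  # Wq, Wk, Wv, Wo
y = multi_head_attention(x, *W, n_heads=n_heads)
print(y.shape)   # (6, 32)
```

Because each head gets its own slice of the model dimension and its own weight matrices, each is free to learn a different attention pattern.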
The transformer block
An attention layer is part of a larger unit called a transformer block (or transformer layer). Each block consists of:
- Multi-head self-attention: the attention mechanism described above
- Layer normalisation: normalising each token’s activation vector to zero mean and unit variance, which stabilises training
- Feed-forward network: two linear transformations with a non-linear activation function (typically GeLU or SwiGLU) in between
- Residual connections: adding the input of each sub-layer to its output, so information can flow through the network without being forced through every transformation
The feed-forward network is where much of the model’s “knowledge” is believed to be stored. While attention handles the relationships between tokens, the feed-forward layers act as a kind of lookup table – a massive, compressed, approximate memory of facts and patterns learned during training. Research by Geva et al. (2021) characterised feed-forward layers as “key-value memories” where the first linear transformation acts as keys and the second acts as values.
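Wiring the four components together is straightforward. A minimal sketch, using the pre-norm ordering common in modern models; an identity function stands in for the attention sub-layer, and layer normalisation's learnable scale and shift parameters are omitted:

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalise each token's vector to zero mean, unit variance."""
    mu = x.mean(axis=-1, keepdims=True)
    sigma = x.std(axis=-1, keepdims=True)
    return (x - mu) / (sigma + eps)

def gelu(x):
    """GeLU activation (tanh approximation)."""
    return 0.5 * x * (1 + np.tanh(np.sqrt(2 / np.pi) * (x + 0.044715 * x**3)))

def transformer_block(x, attn_fn, W1, b1, W2, b2):
    """One pre-norm block: attention and feed-forward,
    each wrapped in a residual connection."""
    x = x + attn_fn(layer_norm(x))        # residual around attention
    h = gelu(layer_norm(x) @ W1 + b1)     # feed-forward: expand + non-linearity
    x = x + h @ W2 + b2                   # project back down, residual again
    return x

rng = np.random.default_rng(0)
seq, d_model, d_ff = 4, 8, 32             # toy sizes; d_ff is usually 4x d_model
x = rng.normal(size=(seq, d_model))
W1, b1 = rng.normal(size=(d_model, d_ff)), np.zeros(d_ff)
W2, b2 = rng.normal(size=(d_ff, d_model)), np.zeros(d_model)
y = transformer_block(x, lambda z: z, W1, b1, W2, b2)  # identity stands in for attention
print(y.shape)   # (4, 8): same shape in as out, so blocks stack
```

The output has the same shape as the input, which is what allows dozens of these blocks to be stacked end to end.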
A modern LLM stacks many transformer blocks on top of each other. GPT-4 is believed to have around 120 layers. Claude’s architecture isn’t public, but models of this class typically have 80 to 120 layers. The input embeddings flow through every block, being progressively refined. Early layers tend to capture surface-level patterns (syntax, local word relationships). Middle layers capture more abstract features (semantic roles, entity relationships). Late layers produce the representations that directly inform the prediction of the next token.
Context windows: how much the model can see
The context window is the maximum number of tokens the model can process in a single forward pass. It’s a hard limit – the model literally cannot see tokens outside this window.
Early transformer models had modest context windows: GPT-2 (2019) had 1,024 tokens, roughly 750 words. GPT-3 (2020) had 2,048 tokens. As of 2025, context windows have expanded dramatically – Claude’s context window is 1,000,000 tokens – roughly 750,000 words, or about ten novels.
The expansion is non-trivial because the standard attention mechanism has a computational cost that scales quadratically with sequence length. If you double the context window, the attention computation costs four times as much. For a 200,000-token context window with naive attention, the cost would be staggering.
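The quadratic growth is easy to see with back-of-the-envelope numbers – the attention score matrix has one entry per pair of tokens:

```python
# Pairwise attention scores per layer per head: seq_len squared.
for seq_len in (2_048, 8_192, 200_000):
    print(f"{seq_len:>9,} tokens -> {seq_len**2:>17,} pairwise scores")
# 200,000 tokens means 40 billion scores per head per layer.
```

Multiply that by dozens of heads and a hundred-odd layers and the motivation for the efficiency techniques below becomes obvious.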
Modern models address this through various efficiency techniques. FlashAttention (Dao et al., 2022) restructures the attention computation to be more cache-efficient without changing the mathematical result. Grouped-query attention (GQA) shares key and value projections across multiple query heads, reducing memory requirements. Some models use sparse attention patterns that allow each token to attend to only a subset of other tokens.
The context window matters because everything the model “knows” about your specific conversation comes from the context window. The model has no persistent memory between conversations. If you had a conversation yesterday, the model doesn’t remember it. If you mentioned your name 50,000 tokens ago, the model can (in principle) still attend to that information, but the practical effectiveness of attention over very long ranges depends on the model and the training.
Generating text: one token at a time
Here’s where things get concrete. When you send a prompt to an LLM, the model processes the entire input through all its layers and produces, at the final layer, a probability distribution over the entire vocabulary for the next token.
Not the next sentence. Not the next word. The next token.
The model might assign a 15% probability to “the”, 8% to “a”, 4% to “\n”, 3% to “this”, and so on across all 100,000 tokens in its vocabulary. These probabilities sum to 1.
Then the model selects one token from this distribution, appends it to the sequence, and runs the whole process again to predict the token after that. This is called autoregressive generation – each output becomes part of the input for the next prediction.
A 500-token response requires 500 forward passes through the entire model. This is why generation is slower than processing the input – each new token requires a full pass through all layers (though in practice, the computation is optimised using a KV cache that stores the key and value vectors from previous tokens so they don’t need to be recomputed).
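The loop itself is simple – all the complexity lives inside the model. A sketch with a stand-in `model` (any function from a token sequence to next-token probabilities; the three-token vocabulary and the always-certain dummy are purely illustrative):

```python
import random

def generate(model, prompt_ids, max_new_tokens, eos_id=None):
    """Autoregressive generation: each sampled token is appended
    to the sequence and fed back in for the next prediction."""
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        probs = model(ids)   # full forward pass (KV-cached in practice)
        token = random.choices(range(len(probs)), weights=probs)[0]
        ids.append(token)
        if token == eos_id:  # stop early on an end-of-sequence token
            break
    return ids

# Stand-in "model": always predicts token 0 with certainty.
dummy = lambda ids: [1.0, 0.0, 0.0]
print(generate(dummy, [2, 1], max_new_tokens=3))   # [2, 1, 0, 0, 0]
```

Note that `model` is called once per generated token – which is exactly why a 500-token response costs 500 forward passes.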
Temperature and top-p: controlling randomness
How does the model choose which token to select from the probability distribution? This is where temperature and top-p (nucleus sampling) come in.
Temperature scales the logits (the raw, pre-softmax scores) before converting them to probabilities. A temperature of 1.0 uses the distribution as-is. A temperature below 1.0 (say, 0.3) makes the distribution “sharper” – the most likely tokens become even more likely, and unlikely tokens become even less likely. A temperature of 0 is deterministic: always pick the highest-probability token. A temperature above 1.0 “flattens” the distribution, making unlikely tokens more likely to be selected.
Low temperature produces more predictable, focused text. High temperature produces more varied, creative (and sometimes nonsensical) text.
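Temperature is a one-line change to the softmax. A sketch with made-up logits:

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature before the softmax."""
    scaled = [l / temperature for l in logits]
    m = max(scaled)                          # subtract max for stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.1]   # raw pre-softmax scores for three tokens
for t in (0.3, 1.0, 2.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
# Low temperature sharpens the distribution toward the top token;
# high temperature flattens it.
```

At t = 0.3 the top token dominates almost completely; at t = 2.0 the three probabilities are much closer together.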
Top-p (nucleus sampling, introduced by Holtzman et al., 2020) takes a different approach: instead of scaling all probabilities, it considers only the smallest set of tokens whose cumulative probability exceeds a threshold p. If p = 0.9, the model considers only the top tokens that together account for 90% of the probability mass, and samples from among those. Everything else is excluded.
Top-p is adaptive – when the model is confident (one token dominates the distribution), the nucleus is small. When the model is uncertain (many tokens are roughly equally likely), the nucleus is large. This tends to produce better results than temperature alone, because it naturally adjusts the diversity of outputs to the model’s confidence.
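The adaptive behaviour falls directly out of the definition. A sketch of the nucleus filter (made-up probability distributions, renormalising over the kept tokens):

```python
def top_p_filter(probs, p):
    """Keep the smallest set of tokens whose cumulative probability
    reaches p, then renormalise over that nucleus."""
    indexed = sorted(enumerate(probs), key=lambda kv: kv[1], reverse=True)
    nucleus, total = [], 0.0
    for idx, prob in indexed:
        nucleus.append(idx)
        total += prob
        if total >= p:
            break
    kept = sum(probs[i] for i in nucleus)
    return {i: probs[i] / kept for i in nucleus}

# Confident distribution: the nucleus is tiny (2 tokens survive).
print(top_p_filter([0.85, 0.10, 0.03, 0.02], p=0.9))
# Uncertain distribution: the nucleus is large (all 4 survive).
print(top_p_filter([0.30, 0.25, 0.25, 0.20], p=0.9))
```

Sampling then happens only among the surviving tokens, so low-probability junk is never selected, however high the temperature.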
In practice, APIs expose both parameters, and they interact. Most production uses keep temperature relatively low (0.0 to 0.7) for factual tasks and higher (0.7 to 1.0) for creative tasks.
The training pipeline
How does a model learn to predict the next token? The training process has three major phases, each building on the last.
Phase 1: Pretraining
Pretraining is where the model learns language. The training data is a massive corpus of text – web pages, books, code repositories, academic papers, forums, documentation. For frontier models – the term the industry uses for the most capable models from the leading labs, like Claude, GPT-4, and Gemini – this corpus is measured in trillions of tokens. The exact composition is typically proprietary, but it includes a broad cross-section of human-written text.
The training objective is straightforward: given a sequence of tokens, predict the next one. The model processes the training data in batches, makes predictions, computes how wrong it was (using cross-entropy loss, which measures the difference between the predicted probability distribution and the actual next token), and adjusts its weights to be slightly less wrong next time.
This adjustment happens through backpropagation and gradient descent – the same optimisation procedure used in virtually all deep learning. The loss function tells you how wrong the model was. Backpropagation computes how each weight in the model contributed to that error. Gradient descent adjusts each weight by a small amount in the direction that reduces the error. Repeat this billions of times, across trillions of tokens, and the weights gradually converge on values that produce good predictions.
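For a single position, cross-entropy loss reduces to the negative log of the probability the model assigned to the token that actually came next. A minimal sketch (the token strings and probabilities are invented for illustration):

```python
import math

def cross_entropy(predicted_probs, actual_next_token):
    """Negative log probability assigned to the true next token.
    A low loss means the model thought the right token was likely."""
    return -math.log(predicted_probs[actual_next_token])

# Model assigns 80% to the correct token: small loss.
print(round(cross_entropy({"mat": 0.8, "rug": 0.15, "dog": 0.05}, "mat"), 3))  # 0.223
# Model assigns only 5% to the correct token: large loss.
print(round(cross_entropy({"mat": 0.8, "rug": 0.15, "dog": 0.05}, "dog"), 3))  # 2.996
```

Training averages this quantity over every position in every batch, and gradient descent nudges the weights to push it down.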
Modern pretraining uses the Adam optimiser (Kingma and Ba, 2015) or variants of it, with learning rate schedules that warm up the learning rate gradually and then decay it. The training runs on thousands of GPUs (or TPUs) for weeks or months. The compute cost for frontier models is measured in tens of millions of dollars.
The remarkable thing about pretraining is how much emerges from such a simple objective. The model isn’t told about grammar, logic, programming languages, history, or mathematics. It just learns to predict the next token. But to predict well across such a diverse corpus, it must implicitly capture an enormous amount about the structure of language and the world it describes.
Phase 2: Fine-tuning (supervised)
A pretrained model is good at predicting text, but it’s not yet useful as an assistant. If you prompt it with “What is the capital of Australia?”, a purely pretrained model might continue with “The answer is Canberra” – but it might also continue with “This question appears on the geography quiz for Year 7 students” or “A. Canberra B. Sydney C. Melbourne D. Brisbane”. It’s predicting what text is likely to follow, and there are many plausible continuations.
Supervised fine-tuning (SFT) narrows the model’s behaviour by training it on examples of the desired interaction pattern. Human annotators write thousands of example prompt-response pairs demonstrating the kind of helpful, accurate, structured responses the model should produce. The model is fine-tuned on these examples using the same next-token prediction objective, but with a much smaller, curated dataset.
SFT teaches the model the format of being an assistant – that it should answer questions directly, structure its responses clearly, acknowledge uncertainty, and follow instructions.
Phase 3: RLHF (Reinforcement Learning from Human Feedback)
SFT gets the model most of the way there, but human preferences are subtle. Is it better to give a concise answer or a thorough one? How should the model handle ambiguous instructions? When should it refuse a request?
Reinforcement Learning from Human Feedback (RLHF, described by Ouyang et al., 2022 for the InstructGPT work) addresses this by training the model to optimise for human preferences.
The process has two steps:
1. Train a reward model. Generate multiple responses to the same prompt. Human annotators rank them from best to worst. Train a separate neural network (the reward model) to predict which response a human would prefer. This reward model learns to score outputs on quality, helpfulness, safety, and adherence to instructions.

2. Optimise the language model against the reward model. Using a reinforcement learning algorithm (typically PPO – Proximal Policy Optimisation, Schulman et al., 2017), adjust the language model’s weights to produce outputs that the reward model scores highly. The key constraint is that the model shouldn’t deviate too far from the fine-tuned model – you don’t want optimising for the reward model to destroy the model’s general capabilities. (A more recent alternative, DPO – Direct Preference Optimisation – skips the explicit reward model and optimises directly on the preference rankings.)
RLHF is what makes the difference between a model that can predict text and a model that is genuinely useful to interact with. It’s also what makes models more cautious, more structured in their responses, and more inclined to refuse harmful requests.
Some newer approaches, including Constitutional AI (Bai et al., 2022), use AI feedback in addition to (or instead of) human feedback in parts of the process, but the core idea remains: optimise the model’s outputs to align with human preferences.
What “predicting the next token” actually means
There’s a common dismissal of LLMs: “It’s just predicting the next token.” This is technically accurate and deeply misleading.
Consider what it takes to predict the next token well. If the context is a legal contract, the model must “know” contract structure, legal terminology, and the conventions of contract drafting. If the context is Python code, it must track variable scopes, function signatures, indentation, and the semantics of the language. If the context is a conversation about quantum physics, it must produce text that’s consistent with quantum mechanics.
The model doesn’t “know” these things in the way a human expert does. It has no experiences, no intuitions, no understanding of why quantum mechanics is the way it is. But it has captured statistical patterns in text that are rich enough to produce outputs that look like they come from someone who does understand.
This is genuinely remarkable, and it’s also the source of the most important failure modes. The model is optimising for “what would plausible-sounding text look like here?” – not for “what is true?” These are usually the same thing, because plausible text about well-covered topics tends to be accurate. But they diverge in exactly the cases where accuracy matters most: obscure facts, recent events, precise numerical claims, and reasoning chains that require strict logical validity.
Why they hallucinate
Hallucination – the generation of confident, fluent, entirely fabricated information – is not a bug that can be fixed with more training data. It’s a structural consequence of how LLMs work.
The model generates text by choosing high-probability tokens one at a time. It has no mechanism for checking whether its output is factually correct. It has no database of facts it can look up. It has no way to distinguish between “this is a pattern I learned from reliable sources” and “this is a plausible-sounding continuation that happens to be wrong.”
When the model encounters a question about an obscure topic, it faces a choice: produce fluent text that matches the expected pattern (which might be wrong), or signal uncertainty (which requires overriding the strong pattern of producing confident text that it learned during training). The training process – especially RLHF – has pushed models toward expressing uncertainty more often, but the fundamental tension remains.
Hallucination is especially likely when:
- The question asks for specific details (dates, numbers, names) about topics that appear infrequently in the training data
- The model is asked to cite sources (it has learned the pattern of citations but doesn’t have access to a citation database)
- The question requires reasoning that extends beyond the patterns in the training data
- The prompt is ambiguous and the model guesses at intent rather than asking for clarification
Retrieval-augmented generation (RAG) – where the model is given relevant documents to reference – helps significantly, because it replaces “generate from patterns” with “summarise from provided text.” But the underlying architecture hasn’t changed. The model is still predicting tokens, not verifying facts.
Why they’re good at code
LLMs are disproportionately good at writing code, and the reasons are illuminating.
First, code is heavily represented in training data. GitHub alone contains billions of files of source code, all publicly available. Stack Overflow has millions of answered questions with code examples. Documentation, tutorials, blog posts, textbooks – the volume of well-structured code in the training corpus is enormous.
Second, code is less ambiguous than natural language. A function either compiles or it doesn’t. A variable is either in scope or it isn’t. The syntax rules are strict and well-defined. This makes code easier for a statistical model to learn, because the patterns are more consistent. In natural language, “bank” can mean ten different things. In Python, def always means the same thing.
Third, code is highly repetitive. Most code follows standard patterns: import libraries, define functions, handle errors, return results. Design patterns recur across millions of repositories. The model doesn’t need to invent novel algorithms (though it sometimes can) – it needs to recognise which pattern applies and instantiate it correctly for the current context.
Fourth, code comes with its own error-checking mechanism. When you run LLM-generated code and it fails, the error message is itself a prompt you can feed back to the model. This feedback loop – generate, run, fix, repeat – is enormously productive, because the model is good at understanding error messages and making targeted corrections.
This is the pattern we saw throughout the GreenBox series, from the early sprint work where Tom first used an LLM to accelerate development, through to the ensemble programming sessions where the whole team collaborated with AI assistance, to the 2 AM production fix where an LLM diagnosed a timezone bug from structured logs and ADRs, wrote the fix, opened a PR, and had it deployed via canary before anyone woke up. The Value Is in Ideas, Not Code post captured the underlying shift: when code generation becomes cheap, the bottleneck moves to knowing what to ask for. The 2 AM fix shows the other side – when the infrastructure is good enough (ADRs, observability, canary deploys, automated rollback), the LLM can act on that knowledge autonomously.
The gap between capability and understanding
Here’s the thing that I think is most important to understand about LLMs, and it’s the thing that most commentary gets wrong.
LLMs are not “stochastic parrots” that merely recombine memorised text. Nor are they conscious beings that truly understand what they’re saying. They’re something new – something we don’t have a great word for yet.
They can follow complex instructions. They can write functional code for problems that don’t appear in their training data. They can reason through multi-step problems (imperfectly, but measurably). They can transfer knowledge between domains in ways that look a lot like understanding. They can generate creative solutions that surprise even their creators.
But they can also fail at basic arithmetic, get confused by negation, confidently assert falsehoods, struggle with spatial reasoning, and produce outputs that are syntactically perfect but semantically absurd. These failures are not random – they reflect the boundaries of what can be learned from the statistical patterns of text.
A useful analogy: an LLM is like someone who has read everything ever written but has never been outside. They can describe a sunset beautifully because they’ve read thousands of descriptions. They can explain the physics of light scattering. They can write a character who watches a sunset and feels moved. But they’ve never actually seen one. Their knowledge is real – it produces genuinely useful outputs – but it’s mediated entirely through text.
This gap matters practically. LLMs are extraordinary tools for generation, summarisation, translation, code writing, brainstorming, and pattern matching. They are poor tools for factual verification, mathematical proof, real-time information, and any task where correctness must be guaranteed rather than probable.
The Chinese Room and the question of understanding
The philosopher John Searle posed a thought experiment in 1980 that feels uncomfortably relevant now. Imagine you’re locked in a room. People slide Chinese characters under the door. You don’t speak Chinese, but you have an enormous book of rules: “When you see this pattern, write that pattern and slide it back.” You follow the rules perfectly. To the people outside, it looks like the room understands Chinese. But you – the person in the room – understand nothing. You’re just matching patterns.
Searle’s argument was that this is what computers do: symbol manipulation without comprehension. The room produces correct outputs without understanding the meaning of any of them. It’s a compelling argument, and LLMs seem to be the most sophisticated Chinese Room ever built. They manipulate tokens according to learned statistical patterns and produce outputs that look like understanding. But is there understanding inside? Or just very good pattern matching?
The honest answer is: we don’t know. And the reason we don’t know exposes a deeper problem – we can’t actually define what “understanding” means precisely enough to test for it.
Consider: when a child learns that fire is hot, do they “understand” heat? Or have they learned a pattern – touch fire, feel pain, don’t touch fire again? When a doctor diagnoses a rare disease from a cluster of symptoms, is that understanding or pattern matching against thousands of cases they’ve seen or read about? When you catch a ball, are you solving differential equations or running a learned motor pattern? The boundary between “genuine understanding” and “very sophisticated pattern matching” is far blurrier than Searle’s thought experiment suggests.
This matters because the question people really want answered – “is AI actually intelligent?” – runs into the same wall. We don’t have a rigorous definition of intelligence. Alan Turing sidestepped the problem entirely in 1950 with his famous test: don’t ask whether the machine thinks, ask whether you can tell the difference. That’s a pragmatic answer, not a philosophical one. The Turing Test tells you about your ability to detect the difference, not about what’s happening inside the machine.
The psychologist Howard Gardner proposed that intelligence isn’t one thing – it’s at least eight different things (linguistic, logical-mathematical, spatial, musical, bodily-kinaesthetic, interpersonal, intrapersonal, naturalistic). By some of those measures, LLMs are superhuman. By others, they’re non-functional. A system that can write better prose than most humans but can’t tell you whether a ball fits in a box is intelligent by one definition and not by another.
The practical takeaway for anyone using LLMs: stop asking “is it intelligent?” and start asking “is it useful for this specific task?” The Chinese Room might not understand Chinese, but if it answers your questions correctly, helps you write better code, and catches bugs you missed – does the philosophy matter? Searle would say yes. Your deploy pipeline doesn’t care.
What I find most interesting is that the debate reveals more about the limits of our definitions than about the limits of the technology. We built something that defies our existing categories. It’s not intelligent the way humans are. It’s not unintelligent the way a calculator is. It’s something else, and we’ll probably need new words before we can talk about it clearly.
The transformer architecture at a glance
Here’s a summary of how the pieces fit together, from input to output.
| Stage | What happens | Output |
|---|---|---|
| Tokenisation | Raw text is split into tokens using BPE | Sequence of token IDs |
| Embedding | Token IDs are mapped to high-dimensional vectors | Matrix of embedding vectors |
| Positional encoding | Position information is added to embeddings | Position-aware embeddings |
| Transformer blocks (x80-120) | Multi-head attention + feed-forward, repeated | Refined representations at each layer |
| Output projection | Final layer representations projected to vocabulary size | Logits (scores) for every token in vocabulary |
| Softmax + sampling | Logits converted to probabilities, one token selected | The next token |
Then the selected token is appended to the sequence, embedded, and the process repeats from there (with the KV cache avoiding redundant computation for earlier tokens).
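The loop in that last step can be sketched in a few lines. This is a toy illustration, not a real model: `model` stands in for the entire tokenise-embed-attend pipeline and is just any function from a token-ID sequence to vocabulary logits. The softmax-and-sample step, though, is essentially what production systems do.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    # Convert raw scores to probabilities; subtract the max for numerical stability.
    z = (logits - logits.max()) / temperature
    e = np.exp(z)
    return e / e.sum()

def generate(model, token_ids, max_new_tokens=50, temperature=1.0, rng=None):
    """Autoregressive loop: one forward pass per new token.

    `model` is any callable mapping a token-ID sequence to a vector of
    logits, one score per entry in the vocabulary."""
    if rng is None:
        rng = np.random.default_rng()
    for _ in range(max_new_tokens):
        logits = model(token_ids)                   # scores for every vocab entry
        probs = softmax(logits, temperature)        # logits -> probabilities
        next_id = rng.choice(len(probs), p=probs)   # sample one token
        token_ids = token_ids + [int(next_id)]      # append and repeat
    return token_ids
```

Lowering `temperature` sharpens the distribution towards the highest-scoring token; raising it flattens the distribution and makes output more varied. A real implementation would also check for an end-of-sequence token and reuse the KV cache rather than recomputing attention from scratch.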
Scale and emergent capabilities
One of the most striking findings in LLM research is that capabilities emerge at scale. Smaller models can complete simple text. Larger models can follow instructions. Even larger models can perform multi-step reasoning, write complex code, and engage with nuanced arguments.
These emergent abilities – capabilities that appear suddenly as models scale, rather than improving gradually – were characterised by Wei et al. (2022). A model with 1 billion parameters might be unable to do basic arithmetic. A model with 10 billion might do simple addition. A model with 100 billion might do multi-digit multiplication. The capability doesn’t improve linearly with scale – it appears relatively abruptly.
Whether “emergence” is truly a phase transition or an artefact of how we measure performance is debated (Schaeffer et al., 2023 argue it’s partly the latter), but the practical observation is clear: larger models are not just slightly better – they’re qualitatively different in what they can do.
The scaling laws described by Kaplan et al. (2020) and refined by Hoffmann et al. (2022) (the “Chinchilla” paper) established that model performance follows predictable power laws as a function of model size, dataset size, and compute. The Chinchilla paper’s key finding was that many models were trained on too little data relative to their size – a 70-billion-parameter model should be trained on roughly 1.4 trillion tokens, far more than was standard at the time.
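The Chinchilla finding reduces to a widely quoted rule of thumb: roughly 20 training tokens per parameter for compute-optimal training. A one-line sketch recovers the figure from the text:

```python
def chinchilla_optimal_tokens(n_params):
    # Hoffmann et al. (2022) rule of thumb: roughly 20 training tokens
    # per model parameter for compute-optimal training.
    return 20 * n_params

print(chinchilla_optimal_tokens(70e9) / 1e12)  # 1.4 (trillion tokens)
```

The "20 tokens per parameter" ratio is an approximation that falls out of the paper's fitted power laws, not a constant of nature; the optimal ratio shifts with the compute budget and with training details.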
The parameter count
When people talk about a “70B model” or a “400B model”, the B stands for billions of parameters – the learned weights in the model. These are the numbers that get adjusted during training. Every attention weight, every feed-forward weight, every embedding vector is a parameter.
A 70-billion-parameter model stored in 16-bit floating point requires roughly 140 GB of memory just for the weights. And that’s before accounting for the memory needed when the model actually runs – what the industry calls inference, meaning the process of feeding in a prompt and generating a response. During inference, the model needs additional memory for the KV cache (a store of previously computed attention keys and values so it doesn’t have to recompute them for every new token), activations, and overhead. This is why running large models requires multiple GPUs.
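The arithmetic above is worth making explicit. The weight figure is exact (two bytes per parameter in FP16); the KV-cache figure below uses illustrative dimensions loosely modelled on a 70B-class model (80 layers, 8 KV heads of dimension 128, as in grouped-query attention), so treat it as a sketch rather than a spec:

```python
def weight_memory_gb(n_params, bytes_per_param=2):
    # FP16/BF16 stores each parameter in 2 bytes.
    return n_params * bytes_per_param / 1e9

def kv_cache_gb(n_layers, n_kv_heads, head_dim, context_len,
                batch_size=1, bytes_per_value=2):
    # Each layer caches one key and one value vector per token:
    # 2 * n_kv_heads * head_dim values, each bytes_per_value bytes.
    per_token = 2 * n_kv_heads * head_dim * bytes_per_value
    return n_layers * per_token * context_len * batch_size / 1e9

print(weight_memory_gb(70e9))         # 140.0 GB of weights
print(kv_cache_gb(80, 8, 128, 4096))  # ~1.3 GB per 4,096-token sequence
```

The KV cache grows linearly with context length and batch size, which is why long contexts and high concurrency are expensive even after the weights are paid for.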
The cost of inference is substantial. Running a frontier model requires a cluster of high-end GPUs – typically NVIDIA A100s or H100s. A single H100 costs around US$30,000, and serving a 70B model typically takes several of them (more for larger models). A cluster capable of serving a model like Claude or GPT-4 to millions of users costs tens of millions of dollars in hardware alone, before electricity, cooling, networking, and the engineering team to keep it running.
This cost is what drives the per-token pricing you see from API providers. When Anthropic charges a fraction of a cent per token, that price reflects the amortised cost of the GPU cluster, the electricity to run it (a single H100 draws around 700 watts), the memory bandwidth consumed by the KV cache, and the engineering overhead. Input tokens are cheaper than output tokens because reading the prompt involves a single forward pass, while generating a response requires a separate forward pass for every token produced – each one computing attention across the full context. A long conversation with a frontier model might generate 2,000 output tokens. At each step, the model is attending to every previous token, which is why the cost scales with both the length of the input and the length of the output.
For perspective: generating a 2,000-word response from a frontier model via API typically costs between AU$0.05 and AU$0.50, depending on the model and the length of the input context. That sounds cheap – and compared to hiring a human to write 2,000 words, it is – but multiply it by millions of requests per day and the infrastructure bill is enormous. The economics of LLMs are fundamentally a story about GPU memory, electricity, and how many tokens you can push through a chip per second.
The parameters are where the model’s “knowledge” lives, encoded in the relationships between weights. A specific fact isn’t stored in a specific parameter – it’s distributed across millions of parameters in a way that makes it accessible when the right pattern of activation occurs. This distributed representation is what makes it possible to store so much information in a relatively compact set of numbers, and it’s also what makes hallucination so difficult to prevent – you can’t just look up “is this fact correct?” in the model’s weights.
Chain of thought and reasoning
A pure next-token predictor struggles with multi-step reasoning because each token is generated based on the full context but without any explicit “thinking” step. In 2022, Wei et al. showed that prompting models to “think step by step” – chain-of-thought prompting – dramatically improves performance on reasoning tasks.
This works because it gives the model more tokens in which to work through intermediate steps. Instead of jumping from question to answer in one step, the model generates its reasoning as text, and that text becomes part of the context for subsequent tokens. The model is essentially using its own output as a scratchpad.
This is less magical than it sounds. The model isn’t “thinking” in the way a human does. It’s producing text that follows the pattern of step-by-step reasoning, and each step constrains the next step in useful ways. But the practical effect is substantial – chain-of-thought prompting can improve accuracy on mathematical and logical reasoning tasks by 20-40 percentage points.
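In practice, invoking this behaviour can be as simple as changing the prompt. The question and phrasing below are illustrative, but the structure is the standard chain-of-thought pattern from Wei et al.: the trailing cue invites the model to emit intermediate reasoning tokens, which then sit in the context and constrain the final answer.

```python
question = "A train travels 60 km in 45 minutes. What is its speed in km/h?"

# Direct prompting: the model must jump straight to the answer.
direct_prompt = f"{question}\nAnswer:"

# Chain-of-thought prompting: the cue elicits a worked scratchpad first,
# e.g. "45 minutes is 0.75 hours; 60 / 0.75 = 80 km/h."
cot_prompt = f"{question}\nLet's think step by step:"
```

Newer models often produce the scratchpad unprompted, but the mechanism is the same: the reasoning is just more tokens, generated one at a time, feeding back into the context.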
More recent models have this behaviour built into their training. Claude, for instance, often works through problems step by step without being asked, because this pattern was reinforced during RLHF.
What about the future?
LLMs are improving fast. Context windows are expanding. Training data curation is becoming more sophisticated. New architectures (mixture-of-experts models, which activate only a subset of parameters for each token) are making larger models more efficient. Multimodal models that process text, images, and audio are becoming standard.
But the fundamental architecture – transformers predicting the next token – has been remarkably stable since 2017. The improvements have come from scale, data quality, training techniques, and engineering, not from a radical rethinking of the approach.
Whether this architecture has a ceiling – whether “predict the next token” can scale all the way to artificial general intelligence, or whether something fundamentally different is needed – is the most important open question in AI research. The optimists point to the steady improvement of scaling laws and the continued emergence of new capabilities. The sceptics point to the persistent failure modes (hallucination, poor arithmetic, brittleness to adversarial inputs) as evidence that statistical pattern matching has structural limits.
Both sides might be right. LLMs might continue to improve dramatically while retaining certain categories of failure. They might become better at everything we need them for while still not “understanding” anything in the way humans do.
For practical purposes, the answer to "how do LLMs work?" is: they read text as tokens, embed those tokens in high-dimensional space, use attention to relate tokens to each other across dozens of layers, and predict the next token from the resulting representation. The training process teaches them patterns that span syntax, semantics, logic, and world knowledge. The result is a system that can generate remarkably useful text while having no explicit model of truth, no persistent memory, and no understanding of why its outputs are correct when they are.
That’s not a criticism. It’s a description. And understanding the description makes you better at using the tool – knowing when to trust it, when to verify, and when to reach for something else entirely.