You’ve heard of ChatGPT. Someone at work mentioned “diffusion models” and you nodded. A blog post told you to use a “multimodal” something. Your cousin sent you an AI-generated image of a cat riding a submarine and you wondered, vaguely, how that works. You’ve been meaning to look into all of this but every explanation assumes you already know the bit you don’t.
This is the field guide you needed six months ago.
In the previous post in this series, we opened up a Large Language Model and looked at the machinery inside — tokens, embeddings, attention, transformer blocks, the training pipeline. That post answered one question: how does an LLM actually work?
This post answers the next one: what else is out there?
Because LLMs are just one species in a rapidly growing zoo. There are models that generate images, models that generate video, models that produce music, models that reason step by step for minutes before answering, and models that combine several of these capabilities at once. The terminology is a mess. The marketing is worse. And if you’re trying to figure out what tool you actually need for a specific job, the landscape can feel impenetrable.
Let’s fix that. We’ll start with a word that gets thrown around constantly and rarely defined.
Modality: types of information
In AI, a modality is a type of input or output — a channel of information. The word comes from philosophy and cognitive science, where it refers to the senses: sight, hearing, touch. In AI, it’s been stretched to cover any distinct form of data.
The main modalities you’ll encounter:
| Modality | What it is | Example models |
|---|---|---|
| Text | Natural language, prose, dialogue | Claude, GPT-4, Llama |
| Code | Programming languages — arguably text, but the rules are different enough to matter | Claude, Codex, Code Llama |
| Image | Photographs, illustrations, diagrams, sprites | DALL-E, Stable Diffusion, Midjourney |
| Audio | Speech, music, sound effects | Whisper (speech→text), Suno (text→music) |
| Video | Moving images, often with audio | Sora, Runway, Kling |
| 3D | Meshes, point clouds, scenes | Point-E, NeRFs (emerging) |
| Structured data | Tables, databases, graphs | Various specialised models |
| Embeddings | Numerical representations that capture meaning — the hidden modality that powers search | text-embedding-3, Cohere Embed |
A model can be single-modality — text in, text out. Or it can be multimodal — accepting and producing multiple types. When someone says “multimodal model,” they mean a model that crosses these boundaries. GPT-4o takes text and images as input and produces text, images, and audio as output. Claude takes text and images as input and produces text. Gemini handles text, images, audio, and video.
The direction matters. A model that takes text in and produces images out (DALL-E) is doing something fundamentally different from a model that takes images in and produces text out (image captioning). Both are “multimodal,” but the underlying machinery is very different.
This brings us to the machinery itself.
Architectures: the engine designs
An architecture is the fundamental design of the neural network — the blueprint for how data flows through the model and how it learns. Think of it like engine designs in cars: petrol, diesel, electric, hybrid. Different engineering, different trade-offs, different things they’re good at.
Transformers
If you read the previous post, you already know this one. The transformer architecture, introduced in “Attention Is All You Need” (Vaswani et al., 2017), is the engine behind virtually every major text-generating AI. Claude, GPT-4, Llama, Gemini, Mistral — all transformers.
The key innovation is the attention mechanism: instead of processing text sequentially (one word at a time, left to right), the transformer looks at the entire input at once and figures out which parts relate to which. This parallelism makes them fast to train and excellent at capturing long-range dependencies in text.
Transformers aren’t limited to text. Vision Transformers (ViT, Dosovitskiy et al., 2021) apply the same architecture to images by splitting an image into patches and treating each patch like a token. The attention mechanism then figures out which patches relate to which — exactly the same principle, different input.
The transformer has been remarkably dominant. But it has a known weakness: the attention mechanism scales quadratically with sequence length. Double the input, quadruple the compute. For very long inputs (millions of tokens), this becomes expensive. Which is part of why alternatives exist.
Diffusion models
Diffusion models are the engine behind most modern image generation: Stable Diffusion, DALL-E 3, Midjourney, and Flux.
The core idea is beautifully counterintuitive. During training, the model learns to reverse the process of adding noise to an image. You take a real image, gradually add random noise over many steps until it’s pure static, and train the model to predict what the image looked like one step earlier — slightly less noisy.
At generation time, you start with pure random noise and ask the model to denoise it, step by step. Each step removes a little noise and adds a little structure. After enough steps (typically 20-50), you have a coherent image.
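Stripped of the actual neural network, the generation loop looks like this. The `fake_denoiser` below is a stand-in that simply closes a fraction of the gap to a fixed target (nothing like a trained model, which predicts noise from data), but the control flow, start from static and refine repeatedly, is the real shape:

```python
import numpy as np

rng = np.random.default_rng(0)

def fake_denoiser(x, step, total_steps):
    """Stand-in for the trained network. A real model predicts the noise it
    sees; this toy just closes a fraction of the gap to a fixed 'clean image'
    so that the sampling loop's structure is visible."""
    clean = np.linspace(-1.0, 1.0, x.size)          # pretend target image
    return x + (clean - x) / (total_steps - step)   # remove one step's worth of noise

x = rng.normal(size=16)                  # start from pure random noise
steps = 30
for t in range(steps):
    x = fake_denoiser(x, t, steps)       # each pass: less noise, more structure

print(np.round(x[:4], 2))                # the static has resolved into the target
```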
The text prompt enters the picture through conditioning. The model doesn’t just denoise randomly — it denoises in a direction guided by a text description. The text “a cat riding a submarine in the style of Studio Ghibli” gets encoded into a numerical representation (usually by a text encoder like CLIP), and that representation steers every denoising step. The model has learned, from millions of image-caption pairs, which visual patterns correspond to which text descriptions.
This is fundamentally different from how LLMs work. An LLM generates output one token at a time, left to right. A diffusion model generates the entire image at once, refining it in passes from noise to clarity. There’s no concept of “next pixel” the way there’s a “next token.”
| | LLM (transformer) | Diffusion model |
|---|---|---|
| Generates | One token at a time | Entire output at once, refined iteratively |
| Training signal | "Predict the next token" | "Remove the noise" |
| Output type | Sequential (text, code) | Spatial (images, video frames) |
| Guided by | All previous tokens | Text embedding + previous denoising step |
| Speed | Fast per token, slow for long outputs | Fixed number of steps regardless of complexity |
The idea was first made practical by Ho et al. (2020). The breakthrough that made it work for high-resolution images was latent diffusion (Rombach et al., 2022) — instead of denoising the full image pixel by pixel (which is absurdly expensive at high resolution), you first compress the image into a much smaller representation, do the denoising there, and then decompress the result. It’s the difference between sculpting a full-size statue and sculpting a maquette that gets scaled up. This is the approach behind Stable Diffusion.
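A toy version of that compress-then-work idea, with average pooling standing in for the learned encoder and nearest-neighbour upsampling for the decoder (a real VAE is learned and far more capable, but the cost arithmetic is the same):

```python
import numpy as np

def encode(img, f=4):
    """Toy 'encoder': average-pool by factor f. A real latent-diffusion
    encoder is a trained VAE, not a pooling op."""
    h, w = img.shape
    return img.reshape(h // f, f, w // f, f).mean(axis=(1, 3))

def decode(lat, f=4):
    """Toy 'decoder': nearest-neighbour upsample back to full resolution."""
    return np.repeat(np.repeat(lat, f, axis=0), f, axis=1)

img = np.arange(256.0 * 256.0).reshape(256, 256)   # stand-in 256x256 image
lat = encode(img)                                   # 64x64 latent
print(lat.shape, img.size // lat.size)              # each denoising step is ~16x cheaper
```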
GANs (Generative Adversarial Networks)
Before diffusion models, GANs were the dominant approach to image generation. Introduced by Goodfellow et al. (2014), the idea is elegant: train two neural networks against each other.
The generator creates fake images. The discriminator tries to tell real images from fake ones. The generator gets better at fooling the discriminator. The discriminator gets better at detecting fakes. They push each other to improve, like a counterfeiter and a detective in an arms race.
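Here is the adversarial loop in miniature: a one-parameter "generator" that emits a single number, a two-parameter logistic "discriminator", and hand-derived cross-entropy gradients. Real GANs use deep networks on both sides, but the alternating update structure is exactly this:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

real = 3.0        # the "real data": a single value the generator must match
g = 0.0           # generator parameter: it simply outputs g
w, c = 0.0, 0.0   # discriminator D(x) = sigmoid(w*x + c): probability "real"
lr = 0.05

for _ in range(3000):
    # Discriminator step: push D(real) up and D(fake) down (cross-entropy).
    s_real, s_fake = sigmoid(w * real + c), sigmoid(w * g + c)
    w -= lr * (-(1 - s_real) * real + s_fake * g)
    c -= lr * (-(1 - s_real) + s_fake)
    # Generator step: move g so the discriminator scores it as "real".
    s_fake = sigmoid(w * g + c)
    g += lr * (1 - s_fake) * w

print(round(g, 1))   # the fake has drifted toward the real data
```

Even in this toy, the two updates tug against each other: crank up the learning rate and it oscillates rather than converges, which is a miniature of the instability GANs are infamous for.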
GANs produced stunning results — StyleGAN (Karras et al., 2019) generated photorealistic faces that were indistinguishable from real photographs. But they were notoriously difficult to train. The two networks can fall out of balance (the generator collapses to producing one image, or the discriminator becomes unbeatable), and the training process is unstable compared to diffusion models.
Diffusion models have largely replaced GANs for general-purpose image generation, but GANs remain useful in some niches — real-time applications where the single-pass generation is faster than iterative denoising, and super-resolution tasks where you’re enhancing an existing image rather than generating from scratch.
State-space models
Transformers aren’t the only game in town for text. State-space models (SSMs), most notably Mamba (Gu and Dao, 2023), are an alternative architecture that processes sequences without the quadratic attention cost.
Instead of letting every token attend to every other token, SSMs maintain a compressed hidden state that evolves as each token is processed. Think of it as the difference between re-reading an entire book every time you want to recall something (attention) versus keeping a running set of notes that you update as you read (state-space). The notes are lossy — you can’t recall every detail — but updating them is fast and the cost scales linearly with sequence length, not quadratically.
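A minimal sketch of that recurrence, assuming a fixed linear state-space (real SSMs like Mamba make the matrices input-dependent and compute the scan in parallel, which this deliberately omits):

```python
import numpy as np

rng = np.random.default_rng(0)
d_state = 4                              # size of the running "notes"
A = np.eye(d_state) * 0.9                # how the notes decay between tokens
B = rng.normal(size=(d_state, 1))        # how a new token updates the notes
C = rng.normal(size=(1, d_state))        # how an output is read off the notes

def ssm_scan(xs):
    """One linear pass over the sequence, carrying a fixed-size state."""
    h = np.zeros((d_state, 1))
    ys = []
    for x in xs:                         # cost: linear in sequence length
        h = A @ h + B * x                # fold the token into the state
        ys.append((C @ h).item())        # emit from the state alone
    return ys

ys = ssm_scan([1.0, 0.0, 0.0, 0.0, 0.0])   # an impulse, then silence
print(len(ys))                              # 5 outputs from 4 numbers of memory
```

Note what the decay in `A` implies: the impulse's influence fades with every step. That is the lossiness of the notes made literal.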
SSMs are still emerging. They show promising results on long sequences where the quadratic cost of attention is prohibitive, but transformers remain dominant for most tasks as of early 2026. The two approaches may converge — hybrid architectures that combine attention for local precision with state-space mechanisms for long-range efficiency are an active area of research.
Paradigms: patterns built on top
Architectures are the engine. Paradigms are how you drive. These are patterns and techniques that sit on top of the fundamental architectures, often combining them in clever ways.
Reasoning models
Standard LLMs generate text in a single pass — the model reads your prompt, then starts producing tokens immediately. Reasoning models add an explicit thinking phase before answering.
OpenAI’s o1 and o3 models, and DeepSeek-R1, are the most prominent examples. When you ask a reasoning model a hard question, it generates a long internal chain of thought — sometimes thousands of tokens of deliberation — before producing the visible response. The model might consider multiple approaches, check its own reasoning, backtrack from dead ends, and work through intermediate steps.
This isn’t just chain-of-thought prompting (which we covered in the LLM post). Chain-of-thought prompting asks a standard model to show its working. Reasoning models are specifically trained — often using reinforcement learning — to use that thinking time productively. The training process rewards not just correct answers but effective reasoning strategies.
The trade-off is straightforward: reasoning models are slower and more expensive, but substantially better at tasks that require genuine multi-step reasoning — mathematics, formal logic, complex code, and scientific analysis. For a simple question like “what’s the capital of France?”, a reasoning model is overkill. For “find the bug in this 500-line concurrent program,” the extra thinking time pays for itself.
Recursive Language Models (RLMs)
RLMs are a recent inference-time paradigm from MIT (Zhang, Kraska, and Khattab, 2026) that addresses one of the most stubborn limitations of LLMs: the context window.
The insight is simple and surprisingly effective. Instead of cramming a massive prompt directly into the model’s context window, an RLM loads the prompt as a variable in a Python REPL and lets the model write code to examine, decompose, and process it. The model can peek at snippets, chunk the input, search through it, and — crucially — call itself recursively on sub-sections.
This means a model with a 272K token context window can effectively process inputs of 10 million tokens or more. The model never sees the whole input at once. Instead, it writes a program that strategically examines the parts it needs, delegates sub-questions to copies of itself, and assembles the results.
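The recursive decomposition can be caricatured in a few lines. The `tiny_llm` below is a stand-in with an artificial 200-character "context window", and a real RLM lets the model write arbitrary code in a REPL rather than follow a fixed split-and-merge strategy, but the core move (recurse on pieces that fit, then combine the sub-answers) is this:

```python
def tiny_llm(prompt: str) -> str:
    """Stand-in for a model call with an artificial 200-character context
    window. Its only skill: list the capitalised names it sees."""
    assert len(prompt) <= 200, "context window exceeded"
    names = [word for word in prompt.split() if word.istitle()]
    return ", ".join(dict.fromkeys(names))   # de-duplicate, keep order

def recursive_answer(document: str, limit: int = 200) -> str:
    """RLM-style scaffold: call the model directly if the input fits,
    otherwise split, recurse on each half, and merge the sub-answers."""
    if len(document) <= limit:
        return tiny_llm(document)
    mid = len(document) // 2
    left = recursive_answer(document[:mid], limit)
    right = recursive_answer(document[mid:], limit)
    return tiny_llm((left + " " + right)[:limit])

doc = "Bob saw the river again. " * 40      # 1,000 chars, far over the limit
print(recursive_answer(doc))                # "Bob"
```

No single call ever sees more than its window, yet the scaffold as a whole answers a question about the full document.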
It’s not a new architecture — the underlying model is still a standard transformer. It’s a scaffold, a way of using an existing model more effectively. But the results are striking: RLMs outperformed both the base model and existing long-context approaches (summarisation agents, retrieval-augmented generation) by large margins on four diverse benchmarks, while maintaining comparable cost.
The pattern here is worth noting: some of the most impactful advances aren’t new architectures at all. They’re clever ways of using existing architectures differently.
Agents
An agent is an AI system that can take actions in the world — not just generate text, but use tools, browse the web, execute code, call APIs, and make decisions about what to do next.
The underlying model is typically an LLM, but instead of just producing a response, it produces a plan: “I need to search for X, then read the result, then calculate Y, then write a file.” Each step generates a new prompt that includes the results of previous steps. To understand agents, you need a few pieces of vocabulary.
Prompts are the instructions you give a model. You already know this — you type something, the model responds. But there’s a layer most people don’t see: the system prompt. Before your message ever reaches the model, the application wraps it with hidden instructions that shape behaviour. “You are a helpful assistant. Answer concisely. Do not produce harmful content.” That’s a system prompt. When ChatGPT refuses to help you build a bomb, that’s not some deep moral reasoning — it’s following instructions in a system prompt, reinforced by RLHF training. When Claude writes code in a particular style, that’s partly system prompt too. The system prompt is the invisible hand that makes the same underlying model behave differently in different products.
Tools are capabilities granted to an agent — things it can do beyond generating text. A bare LLM can only produce words. Give it tools and it can read files, search the web, execute code, query databases, send messages, or call external APIs. The model doesn’t inherently have these abilities. They’re defined by the developer who builds the agent, and the model learns to invoke them by generating structured requests (“I want to call the read_file tool with the path /src/main.py”). The set of tools available to an agent defines what it can accomplish — and its limits.
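The loop at the heart of every agent framework looks roughly like this. The names, message format, and `fake_model` are invented for illustration (real frameworks use richer schemas and a real LLM), but the dispatch cycle is universal: the model proposes a structured tool call, the runtime executes it, and the result goes back into the context.

```python
import json

# The tools the developer grants this agent: name -> callable.
TOOLS = {
    "add": lambda a, b: a + b,
    "read_file": lambda path: f"<contents of {path}>",   # stubbed out
}

def fake_model(history):
    """Stand-in for the LLM. First turn: emit a structured tool request.
    After seeing a tool result: emit the final answer."""
    tool_results = [m for m in history if m["role"] == "tool"]
    if not tool_results:
        return {"tool": "add", "args": {"a": 2, "b": 3}}
    return {"answer": f"The result is {tool_results[-1]['content']}"}

def agent_loop(user_msg):
    history = [{"role": "user", "content": user_msg}]
    while True:
        reply = fake_model(history)
        if "answer" in reply:                            # the model is done
            return reply["answer"]
        result = TOOLS[reply["tool"]](**reply["args"])   # execute the call
        history.append({"role": "tool", "content": json.dumps(result)})

print(agent_loop("What is 2 + 3?"))                      # "The result is 5"
```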
Sub-agents extend this further. A complex task might be too large or too varied for a single agent to handle efficiently. Instead, the agent can spawn sub-agents — smaller, focused agents that handle specific sub-tasks. An agent reviewing a large codebase might spawn one sub-agent to explore the directory structure, another to search for specific patterns, and a third to read and summarise relevant files — all working in parallel. Each sub-agent has its own context, its own tools, and returns its results to the parent. It’s delegation, the same way a manager breaks work into tasks for a team.
Skills are pre-packaged workflows — reusable recipes that an agent can invoke rather than figuring out from scratch. Instead of reasoning through the twelve steps of “create a git commit with the right message format,” an agent might invoke a commit skill that encapsulates that workflow. Skills trade flexibility for reliability: the agent doesn’t need to reinvent common procedures every time.
Agents blur the line between “AI as a tool” and “AI as a collaborator.” A tool responds to a single prompt. An agent pursues a goal across multiple steps, adapting its approach based on what it discovers along the way.
RAG (Retrieval-Augmented Generation)
RAG is a pattern that addresses a fundamental limitation: the model’s knowledge is frozen at training time. If you ask about something that happened after the training cutoff, or about your company’s internal documentation, the model can only hallucinate.
RAG works by retrieving relevant documents before generating a response. Your question gets converted into an embedding (a numerical representation), that embedding is compared against a database of document embeddings, the most relevant documents are pulled in, and those documents are included in the prompt alongside your question. The model then generates a response grounded in the retrieved text, rather than relying solely on what it learned during training.
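The whole pipeline fits in a screenful. The `embed` function below is a deliberately crude bag-of-words stand-in for a real embedding model, but everything else (embed the question, rank documents by similarity, paste the winners into the prompt) is the genuine shape of a RAG system:

```python
import numpy as np

DOCS = [
    "The holiday policy allows 25 days of annual leave per year",
    "Deploys to production happen every Tuesday afternoon",
    "Expense claims must be filed within 30 days of purchase",
]

def embed(text):
    """Crude bag-of-words 'embedding' over the corpus vocabulary. A real
    system uses a trained embedding model; only the pipeline shape matters."""
    vocab = sorted({w for doc in DOCS for w in doc.lower().split()})
    v = np.array([text.lower().split().count(w) for w in vocab], dtype=float)
    norm = np.linalg.norm(v)
    return v / norm if norm else v

def retrieve(question, k=1):
    q = embed(question)
    scores = [float(q @ embed(doc)) for doc in DOCS]   # cosine similarity
    ranked = np.argsort(scores)[::-1]
    return [DOCS[i] for i in ranked[:k]]

def build_prompt(question):
    context = "\n".join(retrieve(question))
    return f"Answer using only this context:\n{context}\n\nQ: {question}"

print(build_prompt("How many days of annual leave do I get"))
```

In production the documents are embedded once, up front, and stored in a vector database; only the question is embedded at query time.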
This is how most enterprise AI deployments work in practice. The model might be Claude or GPT-4, but the knowledge comes from your documentation, your codebase, your internal wiki. RAG lets you get domain-specific answers from a general-purpose model without fine-tuning it.
The models: what you can actually use
All of the above is theory. Here’s the practical bit: what models exist, who makes them, and what can you do with them?
GPT is not a generic term
Let’s start with the biggest source of confusion. GPT stands for Generative Pre-trained Transformer. It’s the name of OpenAI’s model family — GPT-3, GPT-4, GPT-4o, GPT-5. It is not a generic term for AI models, despite being used that way in roughly half of all conversations about AI.
Calling all AI models “GPTs” is like calling all vacuum cleaners “Hoovers” or all search engines “Google.” Understandable, but imprecise. When someone says “we should use a GPT for this,” they might mean “we should use an LLM” — or they might specifically mean OpenAI’s product. It’s worth asking.
The major LLM families
| Model family | Made by | Open / closed | Notable for |
|---|---|---|---|
| GPT (GPT-4o, o1, o3, GPT-5) | OpenAI | Closed | First mover, reasoning models (o-series), broad multimodal support |
| Claude (Haiku, Sonnet, Opus) | Anthropic | Closed | Long context (1M tokens), strong at code and structured reasoning, Constitutional AI safety approach |
| Gemini | Google DeepMind | Closed | Natively multimodal (text, image, audio, video), integrated with Google services |
| Llama (Llama 3, 4) | Meta | Open-weight | Largest open model ecosystem, strong community, commercially usable |
| Mistral / Mixtral | Mistral AI | Open-weight | European, efficient MoE architecture, strong multilingual |
| Qwen | Alibaba | Open-weight | Strong multilingual (especially CJK), good code models, range of sizes |
| DeepSeek | DeepSeek AI | Open-weight | Reasoning focus (DeepSeek-R1), competitive with frontier closed models at lower cost |
| Grok | xAI | Partially open | Integrated with X (Twitter) data, less filtered |
Open-weight vs closed: why it matters
This distinction is one of the most important practical decisions you’ll make.
Closed models (GPT, Claude, Gemini) are accessible only through an API. You send your prompt to someone else’s servers and get a response back. You can’t see the model’s weights, can’t run it on your own hardware, and can’t modify it. The provider controls the model’s behaviour, pricing, and availability.
Open-weight models (Llama, Mistral, Qwen, DeepSeek) publish their model weights. You can download them, run them on your own hardware, fine-tune them for your specific use case, and inspect them. “Open-weight” rather than “open-source” because many of these models have restrictive licences — you can use the weights but the training code, data, and full methodology are often proprietary.
When does this matter?
- Fine-tuning: If you want to train a model on your own data (say, a dataset of space game sprites), you need open weights. GPT-4’s and Claude’s weights aren’t published, so you can’t fully fine-tune them yourself. OpenAI and others offer limited fine-tuning APIs, but the level of customisation is constrained.
- Privacy: If your data can’t leave your infrastructure (medical, legal, financial), you need a model you can run locally.
- Cost at scale: API calls add up. If you’re making millions of inference calls, running your own model on your own GPUs can be cheaper — though the upfront hardware cost is significant.
- Control: Closed models can change behaviour between versions, add or remove capabilities, or adjust content policies in ways that break your workflow. Open-weight models are a snapshot — the version you downloaded today will behave the same way tomorrow.
For most individuals and small teams experimenting with AI, the closed model APIs are the pragmatic starting point. They’re the most capable, the easiest to use, and the per-query cost is manageable at small scale. Open-weight models become compelling when you need customisation, privacy, or cost control at volume.
Image generation models
| Model | Made by | Architecture | Open / closed | Notable for |
|---|---|---|---|---|
| DALL-E 3 | OpenAI | Diffusion | Closed | Integrated with ChatGPT, good prompt adherence |
| Midjourney | Midjourney | Diffusion (proprietary) | Closed | Aesthetically striking defaults, strong at artistic styles |
| Stable Diffusion / SDXL | Stability AI | Latent diffusion | Open-weight | Enormous community, fine-tunable, runs locally |
| Flux | Black Forest Labs | Flow matching | Open-weight | Founded by original Stable Diffusion researchers, strong prompt adherence, efficient |
| Imagen | Google DeepMind | Diffusion | Closed | Integrated with Google products |
The open-weight image models — particularly Stable Diffusion and Flux — have spawned an enormous ecosystem of community-trained variants, style adaptations, and fine-tuning techniques. This is where LoRA (Low-Rank Adaptation) and Dreambooth come in: techniques for teaching an existing model a new style or concept with relatively little data and compute. Want a model that generates pixel art sprites in a specific style? Fine-tune Stable Diffusion or Flux with LoRA on a few hundred examples. We’ll dig deeper into this in a future post.
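The trick behind LoRA is worth seeing in numbers. Instead of updating a big weight matrix W, you freeze it and train two small matrices A and B whose product is added on top. A sketch with toy sizes (the real technique applies this inside attention layers, with a scaling factor this omits):

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64    # width of one layer (toy size)
r = 4     # the LoRA rank: the "low" in low-rank

W = rng.normal(size=(d, d))           # frozen pretrained weights: untouched
A = rng.normal(size=(r, d)) * 0.01    # trainable down-projection
B = np.zeros((d, r))                  # trainable up-projection, zero-initialised

def forward(x):
    """Frozen path plus a low-rank trainable detour added on top."""
    return W @ x + B @ (A @ x)

x = rng.normal(size=d)
assert np.allclose(forward(x), W @ x)   # zero-init B: behaviour unchanged at start
print(W.size, A.size + B.size)          # 4096 frozen vs 512 trainable parameters
```

The ratio is the whole point: here only an eighth as many parameters are trainable, and at realistic model sizes the saving is far more dramatic.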
Video, audio, and beyond
The landscape for non-text, non-image modalities is moving fast but less mature:
- Video generation: Sora (OpenAI), Runway Gen-3, Kling (Kuaishou), Veo (Google). These typically extend diffusion models to generate sequences of frames. Quality has improved dramatically but consistency across long videos (characters changing appearance, physics breaking) remains challenging.
- Music and audio: Suno and Udio generate full songs from text descriptions. Whisper (OpenAI) is the standard for speech-to-text. Text-to-speech models (ElevenLabs, XTTS) produce increasingly natural-sounding voices.
- 3D generation: Still early. Point-E (OpenAI), various NeRF-based approaches. Generating 3D assets from text or images is an active research area but not yet reliable enough for production use in most cases.
Mixture of Experts: an architecture trick worth knowing
You’ll encounter the term Mixture of Experts (MoE) and it’s worth understanding because it explains how some models can be very large without being very expensive to run.
A standard transformer activates all of its parameters for every token. A 70-billion-parameter model does 70 billion parameters’ worth of computation for every single token it processes.
A Mixture of Experts model has many more total parameters, but only activates a subset of them for each token. The model contains multiple “expert” sub-networks, and a learned routing mechanism decides which experts to use for each token. Mixtral 8x7B, for example, has 8 expert networks per layer, nominally 7 billion parameters each; the total comes to about 47 billion rather than 8 × 7 = 56 billion because the experts share the attention layers. Only 2 experts activate per token, so the effective compute per token is closer to that of a 13-billion-parameter model, while the model draws on a much larger pool of learned knowledge.
This is how some models can be “bigger” without being proportionally slower or more expensive. The total parameter count (which gets the headlines) is much larger than the active parameter count per token (which determines the actual cost).
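A sketch of top-2 routing, with plain matrices standing in for the expert networks and the learned router (the sizes and the softmax-over-selected-gates scheme are illustrative; a real MoE layer is more involved):

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_experts, top_k = 8, 8, 2

experts = [rng.normal(size=(d, d)) for _ in range(n_experts)]   # stand-in FFNs
router = rng.normal(size=(n_experts, d))                        # learned routing weights

def moe_layer(x):
    """Send the token to its top-2 experts; the other 6 do no work at all."""
    logits = router @ x
    chosen = np.argsort(logits)[-top_k:]              # indices of the top 2 experts
    gates = np.exp(logits[chosen])
    gates /= gates.sum()                              # softmax over the chosen two
    y = sum(g * (experts[i] @ x) for g, i in zip(gates, chosen))
    return y, chosen

x = rng.normal(size=d)
y, used = moe_layer(x)
print(len(used), "of", n_experts, "experts ran")      # 2 of 8
```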
Embeddings: the hidden infrastructure
Embeddings deserve special mention because they’re everywhere and rarely explained.
An embedding is a vector that represents the meaning of a piece of text (or an image, or an audio clip) in a high-dimensional space that captures semantic similarity. Two texts that mean similar things will have similar embeddings, even if they use completely different words.
“The cat sat on the mat” and “A feline rested on the rug” would have very similar embeddings. “The stock market crashed” would have a very different one.
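With made-up 4-dimensional vectors (real embeddings have hundreds or thousands of dimensions, produced by a trained model), the comparison is just cosine similarity:

```python
import numpy as np

def cosine(a, b):
    """Similarity of direction, ignoring length: near 1 means 'similar meaning'."""
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

# Invented embeddings for the three sentences above; values are illustrative.
cat_mat = np.array([0.9, 0.1, 0.0, 0.2])   # "The cat sat on the mat"
feline  = np.array([0.8, 0.2, 0.1, 0.3])   # "A feline rested on the rug"
stocks  = np.array([0.0, 0.9, 0.8, 0.1])   # "The stock market crashed"

print(round(cosine(cat_mat, feline), 2))   # high: same meaning, different words
print(round(cosine(cat_mat, stocks), 2))   # low: unrelated concepts
```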
This matters because embeddings are the glue behind:
- Semantic search: Instead of keyword matching (“does this document contain the word ‘cat’?”), you compare embeddings (“is this document about a similar concept?”).
- RAG: The retrieval step in retrieval-augmented generation uses embeddings to find relevant documents.
- Clustering and classification: Group similar items together without hand-written rules.
- Recommendation systems: “You liked X, here are similar things.”
Embedding models are typically smaller, faster, and cheaper than generative models. They don’t produce text — they produce vectors. OpenAI’s text-embedding-3, Cohere’s Embed, and various open-source options (e5, GTE, BGE) are the main choices.
Making sense of it all: a decision framework
If you’ve read this far, you have the vocabulary. Now let’s make it practical. You have a task. Which model type do you need?
| I want to… | You need | Start here |
|---|---|---|
| Write or edit text, summarise documents, answer questions | An LLM | Claude or GPT-4o via API |
| Solve hard maths, logic, or coding problems | A reasoning model | Claude (extended thinking), o3, DeepSeek-R1 |
| Generate images from text descriptions | A diffusion model | Midjourney (quality), Stable Diffusion / Flux (open, fine-tunable) |
| Generate images in a specific style | A fine-tuned diffusion model | Stable Diffusion or Flux + LoRA fine-tuning |
| Generate video | A video generation model | Sora, Runway, Kling |
| Transcribe speech to text | A speech recognition model | Whisper |
| Generate music | A music generation model | Suno, Udio |
| Search my own documents using meaning, not keywords | An embedding model + vector database | text-embedding-3 + Pinecone/Chroma/pgvector |
| Build an AI that uses tools, browses the web, writes code | An agent framework around an LLM | Claude Code, LangChain, or build your own |
| Answer questions using my company's internal knowledge | RAG (embedding model + LLM) | Embed your docs, retrieve relevant ones, pass to Claude/GPT |
| Process inputs far beyond any model's context window | An RLM scaffold or chunking strategy | RLM framework, or manual chunking with an LLM |
| Run AI locally, on my own hardware, with full privacy | An open-weight model | Llama or Mistral via Ollama |
The pace of change
One thing this post can’t give you is a stable picture. There isn’t one.
The landscape described here is accurate as of mid-2026. Six months ago, some of these models didn’t exist. Six months from now, some of them will have been superseded. The pace is genuinely unprecedented in software engineering — not just incremental improvements, but new categories of capability appearing every few months.
What will last is the framework. Modalities, architectures, paradigms, and models. New things will appear, but they’ll slot into this structure. A new model will operate on specific modalities, use a specific architecture (or a hybrid), employ specific paradigms, and be open or closed. If you understand the categories, you can evaluate new developments without starting from scratch every time.
Where to from here?
This post gave you the map. Future posts in this series will zoom into specific squares on it — picking a real problem, choosing the right model type, and walking through the process end to end, including what it actually costs.
Because the real test of understanding a landscape isn’t being able to name everything in it. It’s being able to pick the right path through it for where you’re trying to go.