Every developer has heard the promise: “build once, run anywhere.” And every developer has lived the reality: “works on my machine, crashes on the server.” Containers didn’t invent the idea of portable software, but they made it actually work. Not by virtualising hardware, not by simulating an operating system, but by drawing boundaries around a regular Linux process and convincing it that it’s alone.
The problem: environments are fragile
Software doesn’t run in a vacuum. It runs in an environment: a specific operating system version, specific libraries, specific configurations, specific file paths. When you develop on macOS with Python 3.11 and deploy to Ubuntu with Python 3.9 and the wrong version of libssl, things break. When your application assumes /tmp is writable and the production server’s /tmp is mounted read-only, things break. When two applications on the same server need different versions of the same shared library, things break.
Before containers, we dealt with this in various ways, none of them great. We wrote installation scripts that pulled dependencies and hoped they worked. We used configuration management tools like Puppet and Chef to converge servers toward a known state. We created virtual machines, complete operating system installations, for isolation. Each approach traded one problem for another: installation scripts were fragile, configuration management was complex, and virtual machines were heavy.
The container approach is different: package the application and its environment together, and run it in a way that isolates it from everything else on the host. The application thinks it’s alone. The host thinks it’s running a process. Everyone is happy.
The ancestors: chroot, jails, and zones
The idea of isolating a process from the rest of the system is older than most people realise.
chroot appeared in Unix Version 7 in 1979, nearly half a century ago. The chroot system call changes the apparent root directory for a process, so that / points to some subdirectory of the actual filesystem. A process running inside a chroot can’t see or access files outside its designated root. It was originally designed for building and testing software in a clean environment, and it’s still used for that purpose today.
But chroot is not security isolation. It only restricts filesystem access. A chrooted process still shares the same network interfaces, the same process table, the same users, and the same kernel. A root process inside a chroot can trivially escape it (create a new chroot, change directory to the real root, break free). Chroot is a convenience, not a boundary.
FreeBSD jails (2000) took the concept further. Introduced by Poul-Henning Kamp, jails provided filesystem isolation plus process isolation plus network isolation. A jailed process had its own root filesystem, its own process ID space (it couldn’t see processes outside the jail), and its own IP address. Jails were a genuine security boundary: breaking out of a properly configured jail was hard enough that FreeBSD used them in production hosting environments. The web hosting company that served your PHP website in 2005 was probably running jails.
Solaris Zones (2005) extended the idea even further with resource controls. A zone was a virtualised operating system instance running on a shared Solaris kernel, with strict limits on CPU, memory, and I/O. Zones could run different versions of the Solaris userland, had their own network stack, and were managed as first-class administrative units. They were elegant, well-designed, and confined to a platform (Solaris) that was already losing ground to Linux.
Each of these was a step along the same path: give a process (or a group of processes) the illusion of having the machine to itself, without the overhead of virtualising the hardware. The ideas worked. The problem was that each implementation was tied to a specific operating system. What Linux needed was its own version of these concepts, built into the kernel.
Namespaces: the illusion of isolation
The Linux kernel’s answer to process isolation is namespaces. A namespace wraps a global system resource in an abstraction that makes it appear to processes within the namespace as though they have their own isolated instance of that resource.
Linux has eight namespace types, added incrementally between 2002 and 2020:
| Namespace | Isolates | Since |
|---|---|---|
| Mount (mnt) | Filesystem mount points | 2002 (Linux 2.4.19) |
| UTS | Hostname and domain name | 2006 (Linux 2.6.19) |
| IPC | Inter-process communication | 2006 (Linux 2.6.19) |
| PID | Process IDs | 2008 (Linux 2.6.24) |
| Network (net) | Network devices, ports, routing | 2009 (Linux 2.6.29) |
| User | User and group IDs | 2013 (Linux 3.8) |
| Cgroup | Cgroup root directory | 2016 (Linux 4.6) |
| Time | System clocks | 2020 (Linux 5.6) |
The PID namespace is the easiest to understand. In your container, ps aux shows your application as PID 1, the init process, the first thing running. But from the host’s perspective, that same process has a completely different PID, sitting alongside hundreds of other processes. The container process genuinely believes it’s PID 1. It isn’t. The kernel is maintaining two views of the same process table.
The network namespace gives each container its own network stack: its own interfaces, its own IP addresses, its own routing table, its own port space. Two containers can both listen on port 80 without conflict, because they’re in different network namespaces. The container runtime sets up virtual ethernet pairs (veth) to connect the container’s network namespace to the host, typically through a bridge device.
The mount namespace gives each container its own view of the filesystem. The container sees its own root filesystem (the image), its own /proc, its own /sys. It doesn’t see the host filesystem unless you explicitly mount volumes into it.
The user namespace (one of the most complex) allows a process to have root privileges inside its namespace while being an unprivileged user on the host. This is what enables rootless containers: containers that run without any host-level root access at all.
Here’s what matters: a container is just a Linux process that’s been placed into a set of namespaces. There’s no container hypervisor, no container kernel, no special container execution mode in the CPU. It’s the same kernel, the same scheduler, the same syscall interface. The namespaces just restrict what the process can see and interact with.
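One way to see this for yourself: on Linux, every process exposes its namespace memberships as symbolic links under /proc/<pid>/ns, and two processes whose links carry the same inode number are in the same namespace. The sketch below is a minimal illustration, assuming a Linux host with /proc mounted; the helper names parse_ns_link and namespaces_of are made up for this example.

```python
import os
import re

# Namespace links look like "pid:[4026531836]": a type and an inode number.
_NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str) -> tuple[str, int]:
    """Split a /proc/<pid>/ns symlink target into (type, inode)."""
    match = _NS_LINK.match(target)
    if match is None:
        raise ValueError(f"not a namespace link: {target!r}")
    return match.group(1), int(match.group(2))


def namespaces_of(pid: str = "self") -> dict[str, int]:
    """Map namespace name -> inode for a process; empty dict off Linux."""
    ns_dir = f"/proc/{pid}/ns"
    result = {}
    if not os.path.isdir(ns_dir):
        return result
    for name in sorted(os.listdir(ns_dir)):
        try:
            _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        except (OSError, ValueError):
            continue  # unreadable entry; skip it
        result[name] = inode
    return result


if __name__ == "__main__":
    for name, inode in namespaces_of().items():
        print(f"{name:20} {inode}")
```

Run it once on the host and once inside a container: the inode numbers for pid, net, and mnt should differ, while the kernel underneath is the same.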
Cgroups: the resource police
Namespaces handle what a process can see. Control groups (cgroups) handle how much of the system’s resources it can use.
Cgroups were developed by Google engineers Paul Menage and Rohit Seth, and merged into the Linux kernel in 2008 (version 2.6.24). Google had been using an internal version for years to manage resource allocation across their vast fleet of servers. The problem was straightforward: on a machine running hundreds of processes, how do you prevent one runaway process from consuming all the CPU, memory, or I/O bandwidth and starving everything else?
Cgroups let you set hard limits:
- CPU: limit a process to a specific number of CPU cores or a percentage of available CPU time
- Memory: set a maximum memory usage; if the process exceeds it, the OOM (out of memory) killer terminates it
- I/O: limit read/write bandwidth to disk
- PIDs: limit the number of processes (prevents fork bombs)
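- Network: tag a group’s traffic (net_cls and net_prio in cgroup v1) so that traffic-control tools can prioritise or shape it; cgroups don’t cap bandwidth on their own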
When you run docker run --memory=512m --cpus=1.5 myapp, Docker is creating cgroup entries that cap the container at 512 MB of RAM and 1.5 CPU cores. The process can use up to these limits and no more. If it tries to allocate more memory than its cgroup allows, the kernel kills it.
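To make the translation concrete, here is a rough sketch of the arithmetic behind those two flags. It assumes a cgroup v2 host, where the memory cap lives in a file called memory.max and the CPU cap in cpu.max (written as a quota and a period in microseconds), and it assumes the common 100 ms scheduling period. The function names are invented for this example; Docker’s own code is not shown here.

```python
def parse_size(text: str) -> int:
    """Convert a size such as '512m' to bytes (binary units: k, m, g)."""
    units = {"b": 1, "k": 1024, "m": 1024**2, "g": 1024**3}
    text = text.strip().lower()
    if text[-1] in units:
        return int(text[:-1]) * units[text[-1]]
    return int(text)


def cpu_max(cpus: float, period_us: int = 100_000) -> str:
    """Render a CPU limit in the cgroup v2 cpu.max format: '<quota> <period>'."""
    quota_us = int(cpus * period_us)
    return f"{quota_us} {period_us}"


print(parse_size("512m"))  # 536870912
print(cpu_max(1.5))        # 150000 100000
```

Read the second line as “150 ms of CPU time in every 100 ms window”, which is how 1.5 cores is expressed. Exceeding the CPU quota gets the process throttled rather than killed; only the memory limit is fatal.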
This is important to understand because it explains a common production problem: your application reports that the system has 64 GB of RAM (because it can see the host’s /proc/meminfo by default), allocates memory accordingly, and gets OOM-killed because the cgroup limit is 512 MB. Many languages and runtimes now detect cgroup limits (the JVM has done this since Java 10), but it’s worth checking that yours does.
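If your runtime doesn’t do this for you, the check is small enough to write yourself. The following is a sketch, not a definitive implementation: it assumes the conventional cgroup mount points (/sys/fs/cgroup/memory.max for v2, /sys/fs/cgroup/memory/memory.limit_in_bytes for v1), which can differ on some hosts, and the function names are made up for illustration.

```python
from pathlib import Path
from typing import Optional

# Conventional locations; the actual mount point can differ per host.
CGROUP_V2_FILE = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1_FILE = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")


def parse_limit(raw: str) -> Optional[int]:
    """Return the limit in bytes, or None when the file says 'max' (no limit)."""
    raw = raw.strip()
    return None if raw == "max" else int(raw)


def cgroup_memory_limit() -> Optional[int]:
    """Read the first cgroup memory limit file that exists, if any."""
    for path in (CGROUP_V2_FILE, CGROUP_V1_FILE):
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue
    return None


def usable_memory(host_total: int, limit: Optional[int]) -> int:
    """The budget an application should plan for: the smaller of the two."""
    return host_total if limit is None else min(host_total, limit)


GIB = 1024**3
print(usable_memory(64 * GIB, 512 * 1024**2))  # 536870912
```

Taking the minimum also handles cgroup v1, where “unlimited” shows up as a very large number rather than the word max.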
What Docker actually did
Here’s a question that confuses a lot of people: if namespaces and cgroups existed since the late 2000s, and Linux Containers (LXC) provided a userspace interface to them since 2008, why didn’t containers take off until Docker appeared in 2013?
The answer is developer experience.
LXC gave you the tools to build containers, but using them required understanding namespaces, cgroups, filesystem setup, network configuration, and a dozen other things. It was powerful but complex, a tool for systems engineers, not application developers.
Docker’s genius was making containers accessible. Solomon Hykes and his team at dotCloud (a PaaS company) built a layer on top of LXC (later replaced with their own runtime, libcontainer, which became runc) that introduced several key innovations:
The Dockerfile: a simple, declarative text file that describes how to build a container image. Each line is an instruction – FROM ubuntu:22.04, RUN apt-get install -y python3, COPY app.py /app/, CMD ["python3", "/app/app.py"]. Anyone who could read a shell script could read a Dockerfile. This was revolutionary compared to the existing alternatives.
Image layers: each instruction in a Dockerfile that changes the filesystem (RUN, COPY, ADD) creates a new layer. Layers are cached and shared. If you change your application code but not your system dependencies, Docker only rebuilds the changed layers. This made builds fast and images space-efficient.
The registry: Docker Hub provided a place to publish and share images. docker pull nginx gives you a working nginx installation in seconds. The network effect was powerful: once people started publishing images, everyone benefited.
docker run: a single command that pulls an image, creates a container, sets up namespaces and cgroups, configures networking, and starts the process. What previously required pages of configuration became one line in a terminal.
Docker didn’t invent containerisation. It made it usable. That’s a different kind of innovation, but no less significant.
Image layers and union filesystems
A container image is not a single file. It’s a stack of layers, each representing a set of filesystem changes. You need to understand this to build efficient images.
When Docker processes a Dockerfile, each instruction creates a layer:
# Layer 1: base Ubuntu filesystem
FROM ubuntu:22.04
# Layer 2: updated package lists
RUN apt-get update
# Layer 3: nginx and its dependencies
RUN apt-get install -y nginx
# Layer 4: your custom file
COPY index.html /var/www/
Each layer stores only the differences from the layer below it. Layer 2 contains only the files that changed when apt-get update ran: the new package list files. Layer 3 contains only the files added or modified by installing nginx.
This is made possible by a union filesystem (also called an overlay filesystem). The most common implementation on modern Linux is OverlayFS (merged into the kernel in version 3.18, 2014). OverlayFS takes multiple directory trees and presents them as a single merged view. Lower layers are read-only. The top layer is read-write.
When a container starts, all the image layers are stacked as read-only, and a thin read-write layer is added on top. This is the container’s writable layer. Any changes the container makes to the filesystem (creating files, modifying files, deleting files) happen in this layer. The underlying image layers are never modified. This is copy-on-write: when a container modifies a file from a lower layer, the file is first copied to the writable layer, and the modification happens there.
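The lookup rule is easier to hold in your head with a toy model. The sketch below uses Python dictionaries as stand-ins for layers and collections.ChainMap as the merged view. It is only a model of the behaviour described above: real OverlayFS works on directories, copies a file up on first write, and records deletions with special whiteout entries. The file names and contents are invented.

```python
from collections import ChainMap

WHITEOUT = object()  # marker meaning "deleted in this layer"

# Read-only image layers, listed bottom to top (contents are made up).
base = {"/etc/os-release": "Ubuntu 22.04", "/etc/motd": "welcome"}
app = {"/app/app.py": "print('v1')"}

# The container's writable layer starts empty and sits on top.
writable = {}

# ChainMap searches left to right, so the topmost layer goes first.
merged = ChainMap(writable, app, base)


def read(path):
    value = merged.get(path, WHITEOUT)
    if value is WHITEOUT:
        raise FileNotFoundError(path)
    return value


def write(path, data):
    writable[path] = data      # changes land only in the top layer


def delete(path):
    writable[path] = WHITEOUT  # hide the lower copy; do not remove it


write("/etc/motd", "patched")
delete("/app/app.py")

print(read("/etc/motd"))  # patched
print(base["/etc/motd"])  # welcome (lower layer untouched)
```

Throw away the writable dictionary and the image is exactly as it was, which is what happens when a container is removed.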
This design has several consequences:
Sharing is efficient. If ten containers are running from the same image, they share all the read-only layers. Only the writable layers are unique. A 200 MB image running ten times doesn’t use 2 GB of disk. It uses 200 MB plus ten small writable layers.
Layer order matters for build caching. Docker caches each layer and reuses it if the instruction and its inputs haven’t changed. If you put COPY . /app/ before RUN pip install -r requirements.txt, changing any source file invalidates the cache for the pip install layer, even if requirements.txt hasn’t changed. Putting the rarely-changing dependency installation before the frequently-changing code copy means your builds only repeat the expensive steps when they actually need to.
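A small simulation shows why the order matters. This is a simplified model, not Docker’s actual cache implementation: it assumes each layer’s cache key is a hash of the previous key, the instruction text, and a fingerprint of the instruction’s inputs. The base image name and the fingerprints are placeholders.

```python
import hashlib


def layer_keys(steps):
    """Return one cache key per step.

    Each step is (instruction, input_fingerprint). A key depends on the
    previous key, so a change invalidates every later step as well.
    """
    keys, parent = [], ""
    for instruction, inputs in steps:
        digest = hashlib.sha256(f"{parent}|{instruction}|{inputs}".encode())
        parent = digest.hexdigest()
        keys.append(parent)
    return keys


def rebuilt(before, after):
    """Indices of steps whose cache key changed between two builds."""
    pairs = zip(layer_keys(before), layer_keys(after))
    return [i for i, (old, new) in enumerate(pairs) if old != new]


def build(order, source_version):
    """Return the step list for one of two hypothetical Dockerfiles."""
    base = ("FROM python:3.11", "")
    copy_all = ("COPY . /app/", f"src-{source_version}")
    copy_req = ("COPY requirements.txt /app/", "reqs-v1")
    pip = ("RUN pip install -r /app/requirements.txt", "")
    if order == "code-first":
        return [base, copy_all, pip]
    return [base, copy_req, pip, copy_all]


# Only a source file changes between the two builds.
print(rebuilt(build("code-first", 1), build("code-first", 2)))  # [1, 2]
print(rebuilt(build("deps-first", 1), build("deps-first", 2)))  # [3]
```

In the code-first ordering the pip install step (index 2) is rebuilt even though requirements.txt never changed. In the deps-first ordering only the final copy is redone.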
Container filesystems are ephemeral. When a container is removed, its writable layer is deleted. Any data written to the container’s filesystem is gone. This is why persistent data (database files, uploaded files, logs you want to keep) must be stored on volumes, which are directories on the host filesystem mounted into the container, bypassing the union filesystem entirely.
Containers vs virtual machines
The distinction between containers and virtual machines is fundamental, and getting it wrong leads to bad architectural decisions.
A virtual machine runs a complete operating system (the guest) on virtualised hardware provided by a hypervisor. The guest has its own kernel, its own device drivers, its own system services. The hypervisor (KVM, Xen, VMware ESXi, Hyper-V) mediates between the guest and the physical hardware. The guest doesn’t know it’s virtualised (or doesn’t care).
A container runs as a process on the host’s kernel, isolated by namespaces and resource-limited by cgroups. There’s no guest kernel, no virtualised hardware, no hypervisor overhead.
| Property | Containers | Virtual machines |
|---|---|---|
| Kernel | Shared with host | Own kernel |
| Startup time | Milliseconds | Seconds to minutes |
| Memory overhead | Minimal (just the process) | Significant (guest OS + kernel) |
| Image size | Megabytes (typically 50-500 MB) | Gigabytes |
| Density | Hundreds per host | Tens per host |
| Isolation | Process-level (kernel shared) | Hardware-level (kernel separate) |
| Security boundary | Weaker (shared kernel attack surface) | Stronger (hypervisor boundary) |
The performance difference is real and significant. A container starts in milliseconds because it’s just spawning a process. A VM takes seconds to minutes because it’s booting an entire operating system. A container uses only the memory its process needs. A VM reserves memory for the guest kernel, the init system, the system services, and the application.
But the isolation difference is equally real. Containers share a kernel with the host and with each other. A vulnerability in the Linux kernel (a privilege escalation bug, an escape from a namespace) can compromise all containers on that host and the host itself. VMs, by contrast, have a much smaller attack surface: the hypervisor, which is far simpler than a full kernel.
This isn’t theoretical. Container escape vulnerabilities have been found and exploited:
- CVE-2019-5736: a vulnerability in runc (the OCI container runtime) that allowed a malicious container to overwrite the host’s runc binary and gain root access on the host
- CVE-2020-15257: a vulnerability in containerd that allowed containers with host network access to escalate to host root
- CVE-2022-0185: a Linux kernel heap overflow in the filesystem context code that allowed container escape
The practical takeaway: containers are excellent for isolation between your own workloads. They’re not sufficient for isolating untrusted workloads: running code you don’t trust requires either VMs or specialised sandboxes (like gVisor or Firecracker) that provide a stronger boundary.
Container orchestration: why you need it
Running a single container on a single machine is straightforward. Running hundreds of containers across dozens of machines, keeping them healthy, routing traffic to them, scaling them up and down, updating them without downtime, is a different problem entirely. This is the domain of container orchestration.
An orchestrator handles:
- Scheduling: deciding which machine runs which container, based on available resources, constraints, and placement rules
- Health checking: detecting when a container is unhealthy and replacing it
- Scaling: running more or fewer copies of a container based on demand
- Networking: connecting containers to each other and to the outside world, load balancing traffic across replicas
- Service discovery: letting containers find each other by name rather than by IP address (which changes every time a container restarts)
- Rolling updates: deploying new versions gradually, rolling back if something goes wrong
- Secret management: distributing sensitive configuration (database passwords, API keys) to containers securely
You could do all of this manually. People did, in the early days of Docker. It was a nightmare. The moment you have more than a handful of containers, you need automation.
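To give a feel for the simplest of these jobs, here is a toy scheduler. It is a first-fit sketch under obvious simplifying assumptions: only CPU and memory are considered, there are no placement rules, and nothing is ever rescheduled. Real orchestrators use far more elaborate scoring. All names are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    name: str
    cpu_free: float  # vCPUs still unallocated
    mem_free: int    # MiB still unallocated
    tasks: list = field(default_factory=list)


def schedule(task_name, cpu, mem, nodes):
    """Place a task on the first node with enough room; None if none fits."""
    for node in nodes:
        if node.cpu_free >= cpu and node.mem_free >= mem:
            node.cpu_free -= cpu
            node.mem_free -= mem
            node.tasks.append(task_name)
            return node.name
    return None


nodes = [Node("node-a", 2.0, 4096), Node("node-b", 4.0, 8192)]
print(schedule("web-1", 1.5, 2048, nodes))  # node-a
print(schedule("web-2", 1.5, 2048, nodes))  # node-b
print(schedule("batch", 8.0, 1024, nodes))  # None
```

The last call is the interesting one: when nothing fits, something has to decide whether to wait, add capacity, or fail. That decision, repeated across every item in the list above, is what an orchestrator automates.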
ECS and Fargate: AWS’s approach
Amazon Elastic Container Service (ECS) is AWS’s container orchestrator. It’s opinionated, tightly integrated with the AWS ecosystem, and significantly simpler than Kubernetes.
In ECS, you define a task definition (what to run: the container image, CPU and memory requirements, environment variables, port mappings) and a service (how to run it: how many copies, which load balancer, what deployment strategy). ECS handles scheduling, health checking, and replacement.
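As a rough picture of what those two objects contain, the sketch below builds them as Python dictionaries and prints the task definition as JSON. The field names follow the ECS API as I understand it, so check them against the current AWS documentation before relying on them. Every value (names, image, counts) is a placeholder.

```python
import json

# All names and values below are illustrative placeholders.
task_definition = {
    "family": "example-web",
    "requiresCompatibilities": ["FARGATE"],
    "networkMode": "awsvpc",
    "cpu": "256",     # task-level CPU units (1024 = 1 vCPU)
    "memory": "512",  # task-level memory in MiB
    "containerDefinitions": [
        {
            "name": "web",
            "image": "example.invalid/web:1.0",
            "essential": True,
            "portMappings": [{"containerPort": 80, "protocol": "tcp"}],
            "environment": [{"name": "APP_ENV", "value": "production"}],
        }
    ],
}

# The "service" side: how many copies to keep running, and where.
service = {
    "serviceName": "example-web-svc",
    "taskDefinition": "example-web",
    "desiredCount": 3,
    "launchType": "FARGATE",
}

print(json.dumps(task_definition, indent=2))
```

The split is the useful part: the task definition says what one copy looks like, and the service says how many copies should exist at any moment.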
ECS offers two launch types that determine where your containers actually run:
EC2 launch type: your containers run on EC2 instances that you manage. You provision the instances, keep them patched, handle capacity planning, and pay for the instances whether they’re fully used or not. You get more control (you can choose instance types, configure the AMI, attach EBS volumes directly), but you also get more responsibility.
Fargate launch type: AWS manages the infrastructure. You specify the CPU and memory for each task, and Fargate runs it on infrastructure you never see and never manage. No EC2 instances to patch. No capacity planning. You pay per vCPU-second and per GB-second of memory, for the time your tasks are actually running.
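The billing model is simple enough to estimate by hand. The sketch below shows the arithmetic only: the two rates are made-up placeholders, not AWS prices, which vary by region and change over time.

```python
def fargate_cost(vcpu, memory_gb, seconds, vcpu_rate_hour, gb_rate_hour):
    """Cost of one task: resources x duration x hourly rate, billed per second."""
    hours = seconds / 3600
    return vcpu * hours * vcpu_rate_hour + memory_gb * hours * gb_rate_hour


# Placeholder rates for illustration only; real prices vary by region.
VCPU_RATE = 0.04  # currency units per vCPU-hour (made up)
GB_RATE = 0.004   # currency units per GB-hour (made up)

# A 0.5 vCPU / 1 GB task that runs for 2 hours.
cost = fargate_cost(0.5, 1.0, 2 * 3600, VCPU_RATE, GB_RATE)
print(round(cost, 4))  # 0.048
```

The point to notice is that the cost scales with what you request and how long the task runs, not with how busy the task actually is.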
Fargate is the correct choice for most teams. You trade some control (you can’t SSH into the host, you can’t choose the instance type, you have less control over networking) for a dramatic reduction in operational burden. If you’re spending time patching ECS container instances, troubleshooting capacity issues, or managing Auto Scaling Groups just to have somewhere to run containers, Fargate eliminates all of that.
The EC2 launch type makes sense when you need GPU instances, specific instance types, or very high density (running many small containers on large instances can be cheaper than Fargate at scale). It also makes sense when you need host-level access, custom AMIs, specific kernel parameters, or direct hardware access.
Kubernetes: the open standard
Kubernetes (often abbreviated K8s) was open-sourced by Google in 2014, based on their internal cluster management system, Borg. It has become the de facto standard for container orchestration, supported by every major cloud provider and most infrastructure vendors.
Kubernetes provides everything ECS does and more: service mesh integration, custom resource definitions, a rich plugin ecosystem, multi-cloud portability, and a vast community. Its API is a standard that tools and platforms build upon.
It is also significantly more complex.
A minimal Kubernetes deployment involves an API server, etcd (a distributed key-value store for cluster state), a scheduler, a controller manager, kubelets on each node, a container runtime, a networking plugin (CNI), and often an ingress controller, a service mesh, a monitoring stack, and a secrets management solution. The learning curve is steep. The operational overhead is real. The YAML configuration files are… plentiful.
For most teams (teams running a handful of services, teams without dedicated platform engineers, teams where the product is the business, not the infrastructure), Kubernetes is more complexity than it’s worth. ECS with Fargate, or similar managed offerings, provides roughly 80% of the capability at 20% of the operational cost.
Kubernetes makes sense when you need multi-cloud portability, when you have a platform team to operate it, when you need the ecosystem of tools built on the Kubernetes API, or when your scale genuinely demands the flexibility. For many organisations, though, running Kubernetes to deploy a web application and a database is like hiring a crane to hang a picture frame. It’ll work, but there are simpler options.
The security model: shared kernels, shared risk
This is the point that gets lost in the enthusiasm for containers: containers share a kernel with the host.
When a container makes a system call (reading a file, opening a network connection, allocating memory), that call goes directly to the host kernel. The kernel enforces the namespace and cgroup boundaries, but the syscall interface is the same one exposed to host processes. The attack surface is the entire Linux kernel syscall interface, which includes hundreds of system calls with decades of code behind them.
A kernel exploit that allows privilege escalation from an unprivileged process to root doesn’t just affect one container. It affects every container on that host and the host itself. This is the fundamental security tradeoff of containers: they’re lightweight and fast because they share a kernel, and they’re less isolated because they share a kernel.
Mitigations exist:
- Seccomp profiles restrict which system calls a container can make. Docker’s default seccomp profile blocks about 44 of the 300+ syscalls, including dangerous ones like reboot, mount, and clock_settime.
- AppArmor and SELinux provide mandatory access control, restricting what files and resources a container can access beyond what namespaces provide.
- User namespaces (rootless containers) ensure that even if a process is root inside the container, it’s an unprivileged user on the host.
- Read-only root filesystems prevent containers from modifying their own filesystem, limiting the impact of a compromise.
- gVisor (from Google) implements a user-space kernel that intercepts syscalls from the container and handles them in a sandboxed process, dramatically reducing the host kernel’s attack surface.
- Firecracker (from AWS, used by Lambda and Fargate) runs each container or function in a lightweight microVM, providing VM-level isolation with near-container startup times.
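To make the first of these mitigations less abstract, here is a sketch of how a seccomp profile can be read. The dictionary mirrors the general JSON shape of the profiles Docker accepts through --security-opt seccomp=<file>, as far as I know it; real profiles also carry architecture lists and argument filters, which are omitted. The handful of syscalls is illustrative and far too small to run a real program, and action_for is a made-up helper that models the lookup, not the kernel’s filter.

```python
import json

# A tiny allow-list profile. The syscall selection is illustrative only.
profile = {
    "defaultAction": "SCMP_ACT_ERRNO",
    "syscalls": [
        {
            "names": ["read", "write", "openat", "close", "exit_group"],
            "action": "SCMP_ACT_ALLOW",
        },
    ],
}


def action_for(profile, syscall):
    """Return the action a profile assigns to a syscall name."""
    for rule in profile["syscalls"]:
        if syscall in rule["names"]:
            return rule["action"]
    return profile["defaultAction"]


print(action_for(profile, "read"))    # SCMP_ACT_ALLOW
print(action_for(profile, "reboot"))  # SCMP_ACT_ERRNO
print(json.dumps(profile, indent=2))
```

The design choice worth noting is the default: anything not explicitly allowed is refused, so a syscall added to the kernel next year is blocked until someone decides otherwise.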
The trend is toward stronger isolation without giving up the developer experience of containers. Fargate running on Firecracker gives you the docker run interface with a hardware-level isolation boundary underneath. That’s the best of both worlds, but it’s important to understand that you’re getting VM-like isolation, not container-like isolation, which is exactly the point.
The OCI standard: beyond Docker
Docker defined the container era, but it doesn’t own it anymore.
The Open Container Initiative (OCI), founded in 2015 by Docker, CoreOS, Google, and others under the Linux Foundation, defines open standards for container formats and runtimes. The two key specifications are:
- OCI Image Specification: defines the format for container images (layers, manifests, configuration)
- OCI Runtime Specification: defines the interface for container runtimes (how to create, start, stop, and delete containers)
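The image specification is built around content addressing: every blob (the configuration and each layer) is referenced by its SHA-256 digest and size. The sketch below assembles a manifest of that shape from stand-in bytes. The media type strings are the ones I understand the OCI image spec to define; the blobs themselves are fake, where a real image would have a full JSON config and gzipped tar archives.

```python
import hashlib
import json


def descriptor(media_type, blob):
    """Build an OCI content descriptor: type, sha256 digest, and size."""
    return {
        "mediaType": media_type,
        "digest": "sha256:" + hashlib.sha256(blob).hexdigest(),
        "size": len(blob),
    }


# Stand-in blobs; real ones are a JSON config and gzipped tar archives.
config_blob = json.dumps({"architecture": "amd64", "os": "linux"}).encode()
layer_blobs = [b"layer-1-bytes", b"layer-2-bytes"]

manifest = {
    "schemaVersion": 2,
    "mediaType": "application/vnd.oci.image.manifest.v1+json",
    "config": descriptor("application/vnd.oci.image.config.v1+json", config_blob),
    "layers": [
        descriptor("application/vnd.oci.image.layer.v1.tar+gzip", blob)
        for blob in layer_blobs
    ],
}

print(json.dumps(manifest, indent=2))
```

Because a layer is named by the hash of its contents, two images that share a layer reference the same digest, and a registry or host only needs to store it once.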
runc, originally extracted from Docker, is the reference implementation of the OCI runtime spec. But it’s not the only one:
- containerd is a container runtime (used by Docker and Kubernetes) that manages container lifecycle and image management, calling runc to create containers
- Podman (from Red Hat) is a Docker-compatible CLI that runs containers without a daemon: no background process needed, no root access needed by default
- CRI-O is a lightweight container runtime designed specifically for Kubernetes, implementing the Container Runtime Interface (CRI)
This means you can build images with Docker, run them with Podman, orchestrate them with Kubernetes using CRI-O, and everything works because they all speak the OCI standard. The image format is the same. The runtime contract is the same. The tooling is interchangeable.
Container networking: connecting the pieces
Container networking is one of those topics that seems simple until you actually try to debug it.
When a container starts, it gets its own network namespace, an isolated network stack with its own interfaces, routing table, and port space. By default, Docker creates a bridge network on the host (called docker0), and each container gets a virtual ethernet pair: one end inside the container’s namespace (typically eth0), the other end attached to the bridge on the host.
The bridge acts like a virtual switch. Containers on the same bridge can communicate with each other using their IP addresses. The host uses iptables NAT rules to forward traffic from the outside world to containers (this is what -p 8080:80 does: it adds a NAT rule mapping host port 8080 to container port 80).
This works fine on a single host. But when containers span multiple hosts, which they do in any production orchestration setup, things get more interesting. The container on host A needs to reach the container on host B, and both are in private network namespaces that don’t exist outside their respective hosts.
Orchestrators solve this with overlay networks. An overlay network creates a virtual network that spans multiple hosts, using encapsulation (typically VXLAN) to tunnel container traffic through the host network. Each container gets an IP address on the overlay network and can communicate with any other container on the same overlay, regardless of which host it’s running on. The encapsulation and routing are handled transparently.
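Encapsulation is not free, and the cost shows up as a smaller MTU inside the overlay. The worked arithmetic below assumes VXLAN over an IPv4 underlay with no VLAN tags: each container frame is wrapped in an outer IPv4 header, a UDP header, and a VXLAN header, and the container’s own Ethernet header travels inside the tunnel.

```python
# Bytes added around each container packet by VXLAN over IPv4.
OUTER_IPV4 = 20
OUTER_UDP = 8
VXLAN_HEADER = 8
INNER_ETHERNET = 14  # the container's own Ethernet header rides inside


def overlay_mtu(underlay_mtu: int) -> int:
    """Largest IP packet a container can send without fragmentation."""
    overhead = OUTER_IPV4 + OUTER_UDP + VXLAN_HEADER + INNER_ETHERNET
    return underlay_mtu - overhead


print(overlay_mtu(1500))  # 1450
print(overlay_mtu(9001))  # 8951
```

A mismatch here is a classic source of “small requests work, large responses hang” bugs, so it is worth knowing which MTU your overlay interfaces are actually using.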
Service discovery is the complement to networking. In a dynamic environment where containers start, stop, and move between hosts, you can’t hard-code IP addresses. Instead, you refer to services by name, and the orchestrator resolves names to the current set of healthy container IPs. ECS integrates with AWS Cloud Map and Route 53 for service discovery. Kubernetes has a built-in DNS service that resolves service names to cluster IPs.
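Stripped of DNS and cloud APIs, the idea reduces to a table that maps a name to the currently healthy addresses. The toy registry below is a sketch of that idea only; it is not how Cloud Map or the Kubernetes DNS service is implemented, and the class, service name, and IP addresses are all invented.

```python
class Registry:
    """Toy name -> endpoints table, kept current by the orchestrator."""

    def __init__(self):
        self._endpoints = {}  # service name -> {ip: healthy?}

    def register(self, service, ip, healthy=True):
        self._endpoints.setdefault(service, {})[ip] = healthy

    def deregister(self, service, ip):
        self._endpoints.get(service, {}).pop(ip, None)

    def resolve(self, service):
        """Return the healthy IPs for a service, sorted for stable output."""
        entries = self._endpoints.get(service, {})
        return sorted(ip for ip, ok in entries.items() if ok)


registry = Registry()
registry.register("orders", "10.0.1.12")
registry.register("orders", "10.0.2.7")
registry.register("orders", "10.0.3.4", healthy=False)
print(registry.resolve("orders"))  # ['10.0.1.12', '10.0.2.7']

registry.deregister("orders", "10.0.1.12")  # container stopped
registry.register("orders", "10.0.4.9")     # replacement, new IP
print(registry.resolve("orders"))  # ['10.0.2.7', '10.0.4.9']
```

Callers only ever ask for “orders”. The addresses behind the name changed between the two lookups, and nothing on the calling side had to know.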
The networking model is where containers diverge most from VMs. A VM gets a virtual NIC that looks and behaves like a physical NIC. A container gets a namespace with a virtual ethernet pair and software-defined routing. It’s more flexible, but it also means more layers to debug when something goes wrong.
What this means for you
If you’re building software today, containers are almost certainly part of your workflow, even if you’re not thinking about namespaces and cgroups. But understanding what’s underneath changes how you use them:
Your container is a process, not a VM. Don’t run multiple services in a single container. Don’t install SSH. Don’t think of it as a little server. It’s a process with a filesystem and a network interface. One process, one container, one concern.
Layers matter for build speed. Order your Dockerfile instructions from least-frequently changed (base image, system dependencies) to most-frequently changed (application code). Your CI/CD pipeline will thank you.
The kernel is shared. Don’t run untrusted code in containers that share a host with trusted code. Use Fargate, gVisor, or dedicated hosts when the threat model requires it.
Persistent state needs volumes. Anything written to the container filesystem is ephemeral. Databases, uploads, anything you can’t afford to lose: put it on a volume or in a managed service.
Use Fargate unless you have a specific reason not to. The operational savings of not managing container hosts outweigh the cost premium for most teams.
Multi-stage builds save you from bloated images. Your build environment (compilers, development libraries, test frameworks) doesn’t belong in your production image. Use a multi-stage Dockerfile: one stage to build your application, a second stage that copies only the compiled output into a minimal base image. A Go application built this way can produce a final image under 20 MB, compared to hundreds of megabytes if you include the build tools.
Health checks matter. Define a health check in your task definition or Dockerfile. The orchestrator uses it to determine whether your container is ready to receive traffic and whether it needs to be replaced. Without a health check, the orchestrator can only tell if your process is running, not if it’s actually working. A web server that’s running but returning 500 errors to every request is worse than one that’s been killed and replaced.
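A health check can be as small as a script that requests one URL and reports success or failure. The sketch below is one possible shape, using only the standard library. The /health path and port are assumptions about your application, and treating any 2xx or 3xx status as healthy is a choice you may want to tighten.

```python
import http.client
import urllib.request


def is_healthy(status: int) -> bool:
    """Treat any 2xx or 3xx response as healthy."""
    return 200 <= status < 400


def probe(url: str, timeout: float = 2.0) -> bool:
    """Return True only if the endpoint answers with a healthy status in time."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return is_healthy(response.status)
    except (OSError, http.client.HTTPException):
        # Covers refused connections, timeouts, and HTTP error statuses
        # (urllib raises HTTPError, an OSError subclass, for 4xx and 5xx).
        return False


if __name__ == "__main__":
    # Hypothetical endpoint; a real check would exit non-zero on failure.
    print(probe("http://127.0.0.1:8080/health"))
```

The timeout matters as much as the status code: a server that accepts the connection and never answers is exactly the “running but not working” case the health check exists to catch.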
Containers solved a real problem (environment consistency and dependency isolation) by using kernel features that had been developing for decades. Docker made those features accessible. OCI made them standard. Orchestrators made them operational. Understanding the stack from cgroups to Fargate doesn’t just make you a better operator. It helps you make better architectural decisions about what to run, how to run it, and when the lightweight isolation of a container is enough, and when it isn’t.