How CI/CD Pipelines Work

August 18, 2026 · 24 min read

Part of Under the Hood — deep dives into the technology we use every day.

The first time most developers deploy code, they copy files to a server. It works. It’s terrifying. It scales to exactly one person. Everything that follows – every CI/CD pipeline, every deployment strategy, every infrastructure-as-code template – is an attempt to make that process reliable, repeatable, and safe enough that you can do it on a Friday afternoon without breaking into a cold sweat.

A short history of building and deploying software

Before continuous integration, there was make.

Stuart Feldman wrote make at Bell Labs in 1976. It read a file called Makefile that described the dependencies between source files and the commands needed to compile them. If you changed parser.c, make knew to recompile parser.o and then re-link the binary, but not to recompile lexer.o because it hadn’t changed. It was a build tool – it automated the process of turning source code into executables.

make didn’t know anything about testing, deploying, or integration. It just built things. But it established a principle that persists to this day: the build process should be automated and reproducible. If it takes a human to remember the right sequence of commands, it will eventually be done wrong.
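make's core rule fits in a few lines of Python: rebuild a target only when it is missing or older than one of its dependencies. This is an illustrative sketch of the idea, not make itself:

```python
import os

def needs_rebuild(target: str, dependencies: list[str]) -> bool:
    """The heart of make: rebuild a target when it is missing, or when
    any dependency has been modified more recently than the target
    (compared by file modification time)."""
    if not os.path.exists(target):
        return True
    target_mtime = os.path.getmtime(target)
    return any(os.path.getmtime(dep) > target_mtime for dep in dependencies)
```

If parser.c is newer than parser.o, the target is rebuilt; lexer.o, untouched, is left alone. Everything else make does is bookkeeping around this one comparison.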

Through the 1980s and 1990s, build automation evolved. make begat ant (Java, 2000), which begat Maven (Java, 2004), which begat Gradle (Java, 2012). Each generation added capabilities – dependency management, plugin ecosystems, multi-module builds – but the core idea remained: describe your build in a file, and let the tool execute it.

Deployment, meanwhile, remained largely manual. A developer or sysadmin would build the software on their machine, copy it to the server (via FTP, SCP, or a shared filesystem), stop the old version, start the new version, and hope. The deployment process lived in people’s heads, in wiki pages, in READMEs that were perpetually out of date.

This is exactly where the GreenBox team started. In the first week, Tom deployed by “SSH-ing into the server from his laptop and running a script he wrote on the first day. It takes about twelve minutes. Nobody else has the credentials or knows the steps.” It worked because GreenBox had one developer deploying to one server. It was already a problem by the time Priya joined.

Continuous integration: the idea that changed everything

In 1991, Grady Booch used the phrase “continuous integration” in his book on object-oriented design, but the practice as we know it was codified by Kent Beck in Extreme Programming Explained (1999) and Martin Fowler in his influential article “Continuous Integration” (first published in 2000, substantially revised in 2006).

The core principle is simple: developers should integrate their work frequently – at least daily – and each integration should be verified by an automated build and test suite.

Before CI, teams would work independently on separate branches for weeks or months, then attempt to merge everything together in a painful “integration phase.” The longer the branches diverged, the harder the merge. A two-week branch might take two days to merge. A three-month branch might take three weeks. The integration phase was where projects went to die.

CI eliminates this by making integration continuous. Instead of a big-bang merge at the end, developers merge to the main branch every day (or multiple times a day). Each merge triggers an automated build and test run. If the tests fail, the integration is broken, and the team fixes it immediately – while the change is small and the context is fresh.

The benefits are substantial:

  • Merge conflicts are small because branches are short-lived
  • Bugs are found early because the test suite runs on every integration
  • The main branch is always in a deployable state (or close to it)
  • Developer confidence increases because they know the tests are catching problems

The first generation of CI tools emerged in the early 2000s. CruiseControl (2001) was one of the first open-source CI servers. It monitored a version control repository, automatically built the project when changes were detected, and reported the results via a web dashboard and email notifications. The concept of the “build radiator” – a big monitor on the wall showing the build status – dates from this era.

Hudson (2005), later forked as Jenkins (2011), became the dominant CI server. Jenkins was (and remains) extraordinarily flexible – a plugin architecture that can automate almost anything, configured through a web UI or (later) through a Jenkinsfile that describes the pipeline as code. Its flexibility was also its curse: Jenkins installations tend to accumulate plugins, custom configurations, and undocumented modifications until they become their own maintenance burden.

The modern generation – GitHub Actions (2019), GitLab CI (2015), CircleCI (2011), Buildkite (2014) – shifted toward configuration-as-code from the start. Your pipeline is defined in a YAML file (.github/workflows/ci.yml for GitHub Actions, .gitlab-ci.yml for GitLab CI), checked into the repository alongside the code. The pipeline definition is versioned, reviewable, and reproducible.

Continuous delivery vs continuous deployment

These terms are often confused. They’re related but distinct.

Continuous delivery means that every change that passes the automated tests is ready to be deployed to production. The deployment itself is a manual decision – someone (a product owner, a release manager, the developer) pushes a button. But the process of getting from code change to deployable artefact is fully automated.

Continuous deployment goes further: every change that passes the tests is automatically deployed to production. No manual step. No button. The pipeline builds, tests, and deploys without human intervention.

| Practice | Build | Test | Deploy to staging | Deploy to production |
| --- | --- | --- | --- | --- |
| Continuous integration | Automated | Automated | Manual | Manual |
| Continuous delivery | Automated | Automated | Automated | Manual (push-button) |
| Continuous deployment | Automated | Automated | Automated | Automated |

Most organisations practice continuous delivery rather than continuous deployment. The manual deployment step provides a checkpoint for things that are difficult to automate: regulatory reviews, coordinated releases across multiple teams, marketing launches that need to be timed, or simply the human judgement that “this is a good time to deploy.”

What a pipeline actually does

A typical CI/CD pipeline has five stages, each building on the one before it.

1. Checkout

The pipeline starts by fetching the source code from the repository. On GitHub Actions, this is the actions/checkout step. On Jenkins, it’s the SCM checkout. The pipeline gets a clean copy of the code at the exact commit that triggered the build.

This sounds trivial, but it’s load-bearing. The pipeline must build from the same code every time. No local modifications. No uncommitted files. No “it works on my machine.” The checkout step guarantees that what the pipeline builds is what’s in the repository.

2. Build

The pipeline compiles the code, installs dependencies, and produces a build artefact – the thing that will be deployed.

For a compiled language (Go, Rust, Java), the artefact is a binary or JAR file. For an interpreted language (Python, Ruby, JavaScript), the artefact might be a Docker image, a zip file of the application code, or a bundled JavaScript application. For infrastructure, the artefact might be a Terraform plan or a CloudFormation template.

The build step typically includes:

  • Installing dependencies (npm install, bundle install, pip install)
  • Compiling source code
  • Running code linters and formatters (checking style, catching common errors)
  • Building Docker images (if using containers)

Build reproducibility is crucial. Given the same source code, the build should produce the same artefact. This is harder than it sounds – dependency versions can change, build tools can behave differently on different platforms, and non-deterministic elements (timestamps, random values) can sneak into artefacts. Tools like lock files (package-lock.json, Gemfile.lock, poetry.lock) pin dependency versions to ensure reproducibility.
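The cheapest reproducibility check is to hash the artefact: if two CI runs from the same commit produce different digests, something non-deterministic has crept into the build. A minimal sketch:

```python
import hashlib

def artefact_digest(path: str) -> str:
    """SHA-256 of an artefact's bytes. A reproducible build yields the
    same bytes, and therefore the same digest, on every run."""
    h = hashlib.sha256()
    with open(path, "rb") as f:
        # Read in chunks so large artefacts don't need to fit in memory.
        for chunk in iter(lambda: f.read(65536), b""):
            h.update(chunk)
    return h.hexdigest()
```

Comparing digests across two independent builds of the same commit is a practical smoke test for the non-determinism (timestamps, unpinned dependencies) described above.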

3. Test

The pipeline runs the automated test suite. This typically includes multiple levels:

Unit tests: fast, isolated tests of individual functions or classes. They run in milliseconds, don’t depend on external services, and catch logic errors. A healthy codebase might have thousands of unit tests that run in under a minute.

Integration tests: tests that verify the interactions between components – the API talks to the database correctly, the message queue processes events in the right order. These are slower (seconds to minutes per test) and may require test databases, mock services, or Docker containers.

End-to-end tests: tests that exercise the entire system from the user’s perspective – a browser test that loads the home page, clicks through a workflow, and verifies the result. These are the slowest – often minutes per test – and the most brittle (they break when the UI changes), but they catch integration failures that other tests miss.

The test stage is where most pipeline time is spent. A mature project might have a 30-minute test suite – fast enough to run on every push, slow enough to be annoying. Optimising test execution time (parallel test runners, test splitting across multiple machines, caching between runs) is an ongoing engineering challenge.
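Test splitting usually works from historical timings: assign each test, slowest first, to whichever worker currently has the least total time. A sketch of that greedy partition (a simplified version of what CI test splitters do, not any particular tool's algorithm):

```python
import heapq

def split_tests(durations: dict[str, float], workers: int) -> list[list[str]]:
    """Greedy longest-processing-time split: place each test (slowest
    first) on the worker with the smallest running total, so all
    workers finish at roughly the same time."""
    heap = [(0.0, i) for i in range(workers)]  # (total seconds, worker index)
    heapq.heapify(heap)
    buckets: list[list[str]] = [[] for _ in range(workers)]
    for test, duration in sorted(durations.items(), key=lambda kv: -kv[1]):
        total, i = heapq.heappop(heap)
        buckets[i].append(test)
        heapq.heappush(heap, (total + duration, i))
    return buckets
```

With timings {a: 10, b: 6, c: 5, d: 4} and two workers, the split is {a, d} (14s) and {b, c} (11s) – not perfect, but close enough that the suite's wall-clock time roughly divides by the worker count.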

This is exactly the problem the GreenBox team hit at scale. When the test suite reached 90 minutes across eight city environments, the pipeline itself became the bottleneck. Developers queued their PRs, waited hours for a slot, and lost trust in the build when flaky tests produced random failures. The fix – parallelising the pipeline, stabilising the flaky tests, and hiring an SRE – was a turning point.

4. Deploy to staging

If the tests pass, the pipeline deploys the artefact to a staging environment – a replica of production that’s used for final validation before the code reaches real users.

Staging should be as close to production as possible: the same infrastructure, the same configuration, the same data shapes (though not real customer data – use anonymised or synthetic data). The purpose of staging is to catch problems that the test suite can’t: configuration issues, infrastructure interactions, performance under realistic conditions.

Some pipelines include automated tests against staging: smoke tests (basic health checks), performance tests, security scans. Others rely on manual QA in staging. Many organisations do both.

5. Deploy to production

The final step: the artefact is deployed to the production environment and begins serving real users. How this happens depends on the deployment strategy.

Artefacts: what gets deployed

An artefact is the immutable output of the build process. The same artefact is deployed to every environment – staging, production, and whatever’s in between. You don’t rebuild the code for each environment. You build it once and deploy the same thing everywhere, varying only the configuration (database URLs, API keys, feature flags).

This is a critical principle. If you rebuild for production, you’re deploying something that wasn’t tested in staging. Small differences in build environments, dependency resolution, or timing can produce subtly different artefacts. The artefact that passed all the tests is the artefact that should reach production.

Common artefact types:

  • Docker images: the dominant artefact type in modern deployments. The image contains the application, its dependencies, and its runtime environment. It runs identically everywhere.
  • JAR/WAR files: Java applications packaged for deployment to application servers.
  • Static site bundles: HTML, CSS, and JavaScript files produced by a frontend build step, deployed to a CDN or static hosting.
  • Infrastructure definitions: Terraform plans, CloudFormation stacks, or Kubernetes manifests that describe the desired state of the infrastructure.

Artefacts are stored in registries (Docker Hub, AWS ECR, GitHub Container Registry) or artefact repositories (Nexus, Artifactory, S3 buckets). Each artefact is tagged with a version – typically the git commit SHA – so you can trace any running deployment back to the exact code that produced it.

Environments: dev, staging, production

Most organisations maintain at least three environments:

Development (dev): an environment where developers can deploy and test their work before merging. Some organisations have one shared dev environment; others give each developer (or each feature branch) its own ephemeral environment.

Staging (also called pre-production, UAT, or QA): a production-like environment used for final validation. Staging typically has the same infrastructure as production but may use smaller instance sizes or fewer replicas to save cost.

Production (prod): the real thing. Real users, real data, real consequences.

The principle of environment parity – keeping staging as similar to production as possible – is one of the twelve-factor app principles. Divergence between staging and production is a source of “works in staging, breaks in production” surprises.

Configuration that varies between environments (database connections, API endpoints, feature flags, log levels) should be externalised – injected via environment variables, configuration files, or a configuration service – rather than baked into the artefact. The artefact is the same everywhere; the environment provides the context.
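In code, externalised configuration usually means reading the environment at startup. A minimal sketch (the variable names here are illustrative, not a standard):

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Config:
    database_url: str
    log_level: str
    new_checkout_enabled: bool

def load_config() -> Config:
    """Same artefact in every environment; the environment variables
    supply the context that differs between them."""
    return Config(
        database_url=os.environ["DATABASE_URL"],        # required: fail fast if missing
        log_level=os.environ.get("LOG_LEVEL", "INFO"),  # optional, with a safe default
        new_checkout_enabled=os.environ.get("FEATURE_NEW_CHECKOUT", "false") == "true",
    )
```

Failing fast on a missing required variable (the `KeyError` from `os.environ[...]`) is deliberate: a crash at startup in staging is far cheaper than a misconfigured service quietly running in production.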

Feature flags: decoupling deployment from release

One of the most powerful ideas in modern software delivery is the separation of deployment (putting code on a server) from release (making a feature available to users).

Feature flags (also called feature toggles) are conditional statements in the code that control whether a feature is active:

if feature_enabled("new-checkout-flow", user):
    show_new_checkout()
else:
    show_old_checkout()

With feature flags, you can:

  • Deploy code to production with the feature disabled. The code is there, but nobody sees it. This eliminates the risk of the deployment itself – if the code is broken, it doesn’t matter because it’s not active.
  • Enable the feature for a subset of users. Roll it out to 1% of users, monitor for errors, then 10%, then 50%, then 100%. This is percentage-based rollout.
  • Enable the feature for specific users. Internal staff first, then beta testers, then everyone. This is useful for gathering feedback before a wide release.
  • Disable the feature instantly if something goes wrong. No rollback, no redeployment. Just flip the flag.
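A percentage-based rollout needs each user to land in a stable bucket, so the same user sees the same result on every request. The common trick is hashing the flag name and user ID together – a sketch of the idea, not any particular flag service's implementation:

```python
import hashlib

def feature_enabled(flag: str, user_id: str, rollout_percent: int) -> bool:
    """Deterministic percentage rollout: hash flag + user into a stable
    bucket in [0, 100) and enable the flag for buckets below the
    rollout percentage. Raising the percentage only adds users; nobody
    who already has the feature loses it."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    bucket = int(digest, 16) % 100
    return bucket < rollout_percent
```

Including the flag name in the hash matters: it decorrelates rollouts, so the 10% of users who see one experimental feature aren't the same 10% who see every other one.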

Feature flag services like LaunchDarkly, Split, and open-source alternatives like Unleash or Flipper provide the infrastructure for managing flags at scale.

The GreenBox team’s journey illustrates this evolution. In the early days, deploying meant enabling – the feature was available the moment the code reached the server. By the time they were operating across multiple cities, feature flags were essential for city-specific rollouts, A/B testing, and managing the complexity of a multi-tenant platform.

Deployment strategies

How you get the new version of your software onto production servers matters enormously. The strategy you choose determines how much risk you accept, how quickly you can roll back, and how much infrastructure you need.

Rolling deployment

The simplest strategy: update servers one at a time (or a few at a time). Start with one server, verify it’s healthy, move to the next. If a problem is detected, stop the rollout and roll back the affected servers.

AWS Auto Scaling Groups, Kubernetes Deployments, and ECS services all support rolling deployments natively. The key parameters are: how many instances to update simultaneously, how long to wait between batches, and what health checks must pass before proceeding.

The risk is that during the rollout, some servers are running the old version and some the new. Users might hit either version. If the two versions are incompatible (different database schemas, different API contracts), this causes problems. Database migrations must be backwards-compatible with the old code version.
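The rolling loop itself is simple. This sketch uses hypothetical `deploy` and `healthy` callables standing in for whatever your orchestrator actually does – the point is the batch-and-check structure:

```python
def rolling_deploy(servers, version, deploy, healthy, batch_size=2):
    """Update servers in batches, stopping at the first unhealthy batch.
    Returns (servers updated so far, whether the rollout completed)."""
    updated = []
    for i in range(0, len(servers), batch_size):
        batch = servers[i:i + batch_size]
        for server in batch:
            deploy(server, version)
        if not all(healthy(server) for server in batch):
            # Halt the rollout: `updated` servers run the new version,
            # and the failing batch needs rolling back before continuing.
            return updated, False
        updated.extend(batch)
    return updated, True
```

The parameters named in the text map directly onto this loop: `batch_size` is how many instances update at once, and `healthy` is the health check that gates each step.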

Blue-green deployment

Two identical production environments: blue (the current live environment) and green (the new version). You deploy the new version to the green environment, test it thoroughly, then switch all traffic from blue to green in one atomic operation (typically by updating a load balancer or DNS record).

The advantage is zero-downtime deployment and instant rollback: if the new version has problems, switch traffic back to blue. The disadvantage is cost – you need double the infrastructure during the deployment.

Canary deployment

Named after the canary in the coal mine. You deploy the new version to a small subset of production instances (the “canary”) and route a small percentage of traffic to it. You monitor the canary closely – error rates, latency, user behaviour. If everything looks good, gradually increase the canary’s traffic share until it’s handling 100%.

If the canary shows problems (higher error rate, increased latency, user complaints), you route traffic back to the old version. The blast radius is limited to the small percentage of users who were hitting the canary.
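The promotion decision is, at its simplest, a comparison of error rates between the canary and the rest of the fleet. A sketch of such a check (real canary analysis tools use statistical tests over many metrics; this single-metric threshold is a deliberate simplification):

```python
def canary_healthy(canary_errors: int, canary_requests: int,
                   baseline_errors: int, baseline_requests: int,
                   tolerance: float = 0.01) -> bool:
    """Promote the canary only while its error rate stays within
    `tolerance` (one percentage point here) of the baseline fleet's."""
    canary_rate = canary_errors / max(canary_requests, 1)
    baseline_rate = baseline_errors / max(baseline_requests, 1)
    return canary_rate <= baseline_rate + tolerance
```

A pipeline would run this check repeatedly as the canary's traffic share grows, routing traffic away from the canary the moment it returns False.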

| Strategy | Downtime | Rollback speed | Blast radius | Extra infrastructure |
| --- | --- | --- | --- | --- |
| Rolling | None (if health checks pass) | Minutes (roll back batch by batch) | Increases as rollout progresses | None |
| Blue-green | None (traffic switch is atomic) | Seconds (switch back) | All users (after switch) | 2x production |
| Canary | None | Seconds (route away from canary) | Small (canary traffic only) | Canary instances |

In practice, many organisations combine strategies: a canary deployment that, once validated, triggers a rolling deployment to the rest of the fleet.

Rollback: the safety net

When a deployment goes wrong – and they do go wrong, no matter how good your tests are – the ability to roll back to the previous version is your safety net.

The simplest rollback is redeploying the previous artefact. If your artefact is a Docker image tagged with the git commit SHA, rolling back means deploying the image tagged with the previous commit’s SHA. Blue-green deployments make this trivial: switch traffic back to the blue environment. Canary deployments make it equally simple: stop routing traffic to the canary.

Rollback becomes complicated when the deployment includes database migrations. If the new version added a column to the database, the old version might not know about it (which is usually fine) or might be confused by it (which is a problem). If the new version renamed a column or changed a data format, the old version will break.

The solution is to make migrations backwards-compatible: the old code must be able to run against the new database schema. In practice, this means:

  1. Add new columns before deploying code that uses them
  2. Never rename columns – add a new column, migrate data, update code, then remove the old column in a later release
  3. Never delete columns that old code still references
  4. Use expand-contract migrations: first expand the schema (add new things), then update the code, then contract the schema (remove old things) in a separate deployment

This discipline adds complexity to each individual change, but it means any deployment can be safely rolled back.

The GreenBox evolution

The GreenBox team’s deployment story is a microcosm of the industry’s evolution.

Week 1: Tom deploys by SSH-ing into the server and running a bash script. He’s the only person who can deploy. The bus factor is one.

Sprint 2: Tom writes a README and walks Priya through the deploy script. The bus factor goes to two. It’s still a manual process, but at least two people know it.

Series 4 (architecture): The team introduces bounded contexts and starts thinking about deployment as an architectural concern. The ADR work captures decisions about deployment processes alongside technical architecture.

Two squads: When the team splits into two squads, deployment coordination becomes necessary. Both squads deploy to the same infrastructure, and they need to avoid stepping on each other. A shared deployment queue and basic CI (running tests automatically) emerge.

The platform crisis: At eight cities, the deployment pipeline becomes the primary bottleneck. Six-hour sequential deployments across eight environments. Flaky tests. A deploy queue with five PRs waiting since Thursday. Jess, the SRE, rebuilds the pipeline: parallel deploys, per-city staging, automated canary checks, and monitoring that covers every city.

Post-acquisition: The mature pipeline supports multiple squads across multiple cities, deploying independently to their own bounded contexts, with automated quality gates and rollback capabilities.

Each step was a response to a specific pain point. Nobody woke up one morning and said “let’s build a CI/CD pipeline.” They built the simplest thing that worked, and then improved it when it stopped working.

Infrastructure as code

The pipeline doesn’t just deploy application code. Increasingly, it also manages the infrastructure the application runs on.

Infrastructure as Code (IaC) means defining your infrastructure – servers, databases, load balancers, DNS records, networking – in configuration files that are versioned in the same repository as your application code.

Terraform (HashiCorp, 2014) is the most widely used IaC tool. It uses a declarative configuration language (HCL) to describe the desired state of your infrastructure:

resource "aws_instance" "web" {
  ami           = "ami-0c55b159cbfafe1f0"
  instance_type = "t3.medium"

  tags = {
    Name = "greenbox-web-perth"
  }
}

You run terraform plan to see what changes Terraform would make, and terraform apply to make them. The state of your infrastructure is tracked in a state file, and Terraform computes the diff between the current state and the desired state.

AWS CloudFormation, Pulumi (which uses general-purpose programming languages instead of a DSL), and AWS CDK (which generates CloudFormation from TypeScript, Python, or other languages) are alternatives. The principle is the same: infrastructure is described in code, reviewed in pull requests, tested in CI, and deployed through the pipeline.

IaC provides the same benefits for infrastructure that version control provides for application code: history, auditability, reproducibility, and the ability to roll back to a known good state. When Jess rebuilt GreenBox’s deployment pipeline, the infrastructure changes were reviewed in pull requests alongside the pipeline configuration. The architecture decision records captured why each infrastructure choice was made.

GitOps: the pipeline is the truth

GitOps, popularised by Weaveworks in 2017, takes IaC to its logical conclusion: the Git repository is the single source of truth for both application state and infrastructure state. Changes are made by committing to Git. Deployment happens by syncing the running system with the state described in Git.

In a GitOps workflow:

  1. A developer commits a change (application code or infrastructure definition) to Git
  2. The CI pipeline builds and tests the change
  3. If tests pass, the change is merged to the main branch
  4. A GitOps operator (like ArgoCD or Flux) detects the change in the repository and applies it to the running system
  5. If the desired state in Git doesn’t match the running state, the operator reconciles them
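At its core, the operator's reconciliation step is a diff between two descriptions of state. Reduced to a sketch over plain dicts (real operators like ArgoCD diff Kubernetes resources, but the shape of the logic is the same):

```python
def reconcile(desired: dict, running: dict) -> dict:
    """Diff desired state (from Git) against running state (from the
    cluster) and return the actions that would close the gap."""
    return {
        "create": sorted(k for k in desired if k not in running),
        "update": sorted(k for k in desired
                         if k in running and desired[k] != running[k]),
        "delete": sorted(k for k in running if k not in desired),
    }
```

Running this loop continuously is what makes GitOps self-healing: an out-of-band change to the cluster shows up as drift on the next pass and gets reverted to what Git says.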

The key principle is that you never make changes directly to the running system. No SSH-ing into servers to modify configuration. No clicking through the AWS console to change a setting. Everything goes through Git, which means everything is versioned, auditable, and reversible.

This is, in some ways, the opposite of where Tom started: one person, one laptop, one SSH session. GitOps is the mature end state – infrastructure and application state are both code, both versioned, both deployed through automated pipelines, both auditable.

What a modern pipeline looks like

A GitHub Actions workflow for a typical web application might look something like this (simplified):

name: CI/CD Pipeline
on:
  push:
    branches: [main]
  pull_request:
    branches: [main]

jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
        with:
          node-version: 20
      - run: npm ci
      - run: npm run lint
      - run: npm test

  build:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: docker build -t registry/app:${{ github.sha }} .
      - run: docker push registry/app:${{ github.sha }}

  deploy-staging:
    needs: build
    if: github.ref == 'refs/heads/main'
    runs-on: ubuntu-latest
    steps:
      - run: deploy --env staging --image registry/app:${{ github.sha }}
      - run: smoke-test --env staging

  deploy-production:
    needs: deploy-staging
    if: github.ref == 'refs/heads/main'
    environment: production
    runs-on: ubuntu-latest
    steps:
      - run: deploy --env production --image registry/app:${{ github.sha }}
      - run: smoke-test --env production

The environment: production line on the production deploy job references a GitHub environment; when that environment is configured with required reviewers, it creates a manual approval gate – someone must approve the deployment before it proceeds. This is continuous delivery: the pipeline automates everything up to the point of production deployment, then a human makes the final call.

The principles that matter

CI/CD pipelines come in infinite varieties, but the principles underneath are consistent:

Automate everything. If a human has to remember a step, that step will eventually be forgotten. The pipeline should handle checkout, build, test, deploy, and verification without manual intervention.

Build once, deploy everywhere. The same artefact goes to every environment. Configuration varies; the artefact doesn’t.

Make the pipeline fast. A 90-minute pipeline kills feedback loops and encourages bad habits (batching changes, skipping tests, deploying without waiting for CI). Invest in parallelism, caching, and test suite performance.

Make deployments boring. The goal is not zero-downtime deployments or canary releases for their own sake. The goal is to deploy so often, with so little risk per deployment, that deploying becomes unremarkable. The best deployment is the one nobody notices.

Treat the pipeline as code. Version it. Review it. Test it. The pipeline is production infrastructure – if it breaks, nothing ships.

Rollback should be easier than fixing forward. When something goes wrong in production, the fastest path to recovery is usually reverting to the previous version, not debugging under pressure. Make rollback trivial.

Tom’s SSH script wasn’t wrong for week one. It was the simplest thing that could work, and it worked. The pipeline Jess built wasn’t overengineered for a 30,000-subscriber operation across eight cities. It was the simplest thing that could work at that scale. The art of CI/CD is matching the complexity of your pipeline to the complexity of your problem – and growing it incrementally as the problem grows.

That’s the lesson the GreenBox team learned over nineteen posts: every practice exists because someone hit a wall without it. CI/CD pipelines exist because someone deployed the wrong code to production at 4pm on a Friday, and the only person who knew how to fix it had already left for the weekend.

The pipeline makes sure that doesn’t happen. And when it does happen anyway – because it will – the pipeline makes sure you can undo it in minutes, not hours.