How Compression Works

There’s a moment most developers have experienced without noticing. You open a text file full of repeated patterns and save it. The file on disk is a fraction of the size you’d expect. You open a photograph from your phone and notice the file is 2MB, even though the image has millions of pixels in millions of colours. You wonder, briefly, how that’s possible. Then you move on with your day.

The answer is compression, a set of mathematical and engineering ideas that have quietly become the foundation of almost every digital experience. This post is about how compression works, why it works, and why it can’t always work.

Why compression is possible at all

Start with a question. If a file is a sequence of bytes, and each byte can hold one of 256 values, and you need every byte to represent your data, how could the file possibly be smaller?

The answer is: most real-world data isn’t random. It has patterns. It has structure. It has repetition. And those patterns let you describe the data more efficiently than listing every byte in order.

Consider the sentence “the quick brown fox jumps over the lazy dog.” It’s 44 characters long. Each character takes one byte in ASCII, so the sentence takes 44 bytes. But the word “the” appears twice. What if you could say “the” once and then just refer back to it the second time? You’d save a couple of bytes. Not much for a single sentence, but scale this up to a book, a codebase, a web page, and the savings become enormous.

That’s the core insight. Compression works because real data contains redundancy, and redundancy is describable. If you can describe the same information with fewer bits, you can store it in fewer bytes.

The field of information theory, founded by Claude Shannon in his 1948 paper “A Mathematical Theory of Communication”, made this idea precise. Shannon defined a quantity called entropy that measures the minimum number of bits needed to represent a message. If your data has low entropy, meaning it’s predictable and repetitive, you can compress it a lot. If it has high entropy, meaning every bit looks random, you can’t compress it at all.

This has a consequence that surprises people: you cannot compress random data. A truly random file has maximum entropy, and no algorithm can make it smaller without losing information. When compression tools seem to shrink any file, it’s because real files are almost never random. They have headers, patterns, repetition, and structure that compression can exploit.

Lossless vs lossy compression

The first big distinction in compression is between lossless and lossy algorithms.

Lossless compression preserves every bit of the original data. When you decompress, you get back exactly what you started with. This is what you want for text, code, databases, executable files, anything where a single changed bit would break the file. Examples include ZIP, gzip, bzip2, LZ4, Brotli, and Zstandard.

Lossy compression throws away some information in exchange for a much smaller file. The trick is throwing away information that humans won’t notice. Your eye doesn’t perceive the tiniest variations in colour. Your ear doesn’t hear the quietest frequencies in the presence of louder ones. Lossy algorithms exploit these perceptual limitations to discard data that you couldn’t consciously experience anyway. Examples include JPEG for images, MP3 and AAC for audio, and H.264 and AV1 for video.

The choice between lossless and lossy is domain-specific. Your source code must be lossless, a single bit flip would break the compile. Your holiday photographs can be lossy, you’d never notice the difference between the original and a well-encoded JPEG. Your music library is typically lossy (MP3, AAC), though audiophiles insist on lossless (FLAC). Your movies are definitely lossy, storing a feature film losslessly would require hundreds of gigabytes.

Lossless compression: how the tricks work

There are two main families of ideas in lossless compression, and most practical algorithms combine them.

Dictionary compression works by finding repeated substrings and replacing them with shorter references. The foundational algorithm is LZ77, invented by Abraham Lempel and Jacob Ziv in 1977. LZ77 scans through the data looking for sequences that have already appeared, and when it finds one, it replaces the repetition with a short code meaning “copy N bytes from M positions ago.”

Here’s a worked example. Suppose the input is:

the quick brown fox jumps over the lazy dog

After “the quick brown fox jumps over “ has been written, the encoder sees “the “ again. Instead of writing the four characters t, h, e, ` ` a second time, it writes a reference: <back 32, length 4>. That reference takes maybe two or three bytes. The original “the “ takes four bytes. Small saving, but real.

Scale this up to a document with thousands of repeated words, or a codebase with hundreds of copies of the same function signature, and the savings become dramatic. LZ77 and its descendants (LZ78, LZW, LZMA) form the basis of most modern compression: gzip, ZIP, Brotli, Zstandard, and LZMA all use variants of this idea.

Entropy coding works at a lower level, assigning shorter codes to more frequent symbols. The classic algorithm is Huffman coding, invented by David Huffman in 1952 as a term project for an information theory class at MIT. Huffman realised you could build an optimal prefix code by building a binary tree bottom-up, starting with the least frequent symbols and working up.

Here’s the intuition. In English text, the letter e is much more common than z. If you give both letters an 8-bit code, you’re wasting bits on e. What if e got a 3-bit code and z got a 12-bit code? The total bits would go down, because you’d use the short code far more often than the long one. Huffman coding computes the optimal assignment of code lengths given the symbol frequencies.

Entropy coding usually runs after dictionary compression. Dictionary compression handles long-range repetition (“this phrase appeared earlier”). Entropy coding handles short-range frequency (“certain bytes are more common than others”). Together, they compound.

Most real-world lossless compressors, gzip, Brotli, Zstandard, are combinations of these two families. gzip uses LZ77 followed by Huffman coding. Brotli uses LZ77 with a static dictionary of common web content plus a more sophisticated entropy coder. Zstandard uses LZ77 plus Finite State Entropy coding, which is slightly more efficient than Huffman.

What about arithmetic coding and its successors

There’s a more theoretical competitor to Huffman coding called arithmetic coding, which can achieve compression ratios closer to the entropy limit. Huffman coding is only optimal when symbol frequencies happen to be powers of two. Arithmetic coding doesn’t have this limitation, it can encode symbols using fractional numbers of bits by treating the entire message as a single number in the interval [0, 1).

Arithmetic coding was patented for many years, which slowed its adoption. Its modern successor, asymmetric numeral systems (ANS), was invented in 2014 by Jarosław Duda and provides similar compression ratios at much higher speed. ANS is now used in Zstandard, LZFSE (Apple’s compression), and several other modern algorithms.

The practical upshot: if you’re using a modern compressor built in the last ten years, it’s probably using ANS-based entropy coding, and it’s probably faster and smaller than the gzip that dominated for two decades.

Lossy compression: throwing things away intelligently

Lossy compression is where the clever perceptual tricks come in.

JPEG, the standard format for photographs since 1992, works by transforming small blocks of the image into the frequency domain using the discrete cosine transform (DCT). The DCT decomposes each 8×8 block of pixels into a sum of sinusoidal patterns at different frequencies, very low frequencies (broad gradients) and very high frequencies (sharp edges and fine detail).

Here’s the key perceptual insight: the human eye is sensitive to broad gradients but less sensitive to high-frequency detail. So JPEG quantises the high-frequency coefficients more aggressively, effectively rounding them to zero. The low-frequency information is preserved faithfully. The high-frequency information is partly thrown away. The result is an image that looks nearly identical to the human eye but contains far fewer distinct values, which then compresses beautifully with entropy coding.

The “quality” slider in a JPEG encoder controls how aggressively the high-frequency coefficients are quantised. High quality keeps more of them. Low quality throws more away. At very low quality, you start to see the JPEG “blockiness” artefact, the 8×8 DCT blocks become visible because the quantisation has been so aggressive that entire blocks collapse to a single colour value.

MP3 and AAC use a similar trick for audio. They transform the signal into the frequency domain and then use a psychoacoustic model to decide which frequencies to keep. The model accounts for effects like auditory masking, the fact that a loud frequency makes nearby quieter frequencies inaudible. If a loud tone at 1kHz is playing, you can’t hear a quiet tone at 1.1kHz anyway, so the encoder can safely discard the quiet one without you noticing. MP3 at 128kbps discards about 90% of the original audio data, and most people can’t tell the difference from the original.

Video codecs (H.264, H.265, AV1) take this even further by exploiting temporal redundancy, the fact that consecutive frames in a video are usually very similar. Instead of storing each frame independently, they store a few keyframes in full, plus difference frames that describe what changed since the previous frame. Combined with motion estimation (tracking how blocks of pixels move between frames), this reduces the storage requirement dramatically. A two-hour 1080p film would be about 1.5 terabytes as raw pixel data. Encoded with H.264 at reasonable quality, it’s about 5 gigabytes, three hundred times smaller.

The compression-speed-ratio trade-off

Every compressor makes a choice along three axes: compression ratio (how small the output is), compression speed (how fast the encoder runs), and decompression speed (how fast the decoder runs). You cannot maximise all three. Better compression usually takes more time.

Different use cases prioritise different axes:

Web traffic (HTTP) needs fast decompression on the client side, because the browser does it on every page load. Compression speed matters less because servers can cache compressed responses. This is why Brotli and Zstandard are popular for HTTP, they decompress fast even at high ratios.
Backup storage prioritises maximum compression, because backups are written once and read rarely. LZMA and its variants (used in 7-Zip and xz) sacrifice speed for ratio.
Real-time data streams prioritise both compression and decompression speed over ratio. LZ4 achieves only modest compression but runs at hundreds of megabytes per second. It’s used in databases, log pipelines, and memory-to-memory compression.
Software distribution (app bundles, container images) wants maximum compression, because files are downloaded many times but decompressed only once per install. The asymmetry means spending hours at compression time is worth it if it shaves a few hundred MB off the download size.

Modern compressors often expose the trade-off as a “level” parameter. gzip -1 is fast and small-ish. gzip -9 is slow and smaller. Zstandard goes from level 1 to level 22, spanning a huge range of trade-offs. Choosing the correct level for your use case is a practical engineering question, benchmark with your own data.

Why “already compressed” data doesn’t compress further

One consequence of information theory is that once you’ve compressed data well, you can’t compress it further. The entropy of a well-compressed file is near the maximum, every bit carries nearly a full bit of information, and there’s no redundancy left to exploit.

This is why zipping a folder full of JPEGs barely reduces the total size. The JPEGs are already lossily compressed; the remaining data is near the entropy floor. Zipping it adds a few headers and a tiny amount of metadata compression, but the pictures themselves barely shrink.

It’s also why layered compression usually isn’t worth it. Running gzip on an already-gzipped file might save a few bytes from the container format, but the data inside is incompressible. Worse, sometimes it actually grows the file, because you’ve added a new header on top of something that was already as small as it’s going to get.

Compression is everywhere, and mostly invisible

Every web page you load is compressed. Brotli or gzip, negotiated between your browser and the server via the Accept-Encoding header. Every photograph on your phone is JPEG or HEIC. Every video you stream is H.264, H.265, or AV1. Your database is compressing its pages. Your filesystem is compressing cold files. Your network protocols compress their headers. Your messaging app compresses attachments before uploading them.

It happens invisibly because it has to. If compression had a visible cost, a user-facing decision to make, it would drag down every interaction. Instead, it’s a quiet layer beneath almost everything, silently shaving off the redundancy so that the bytes you care about can move faster, store cheaper, and reach further.

Shannon was the first to realise that information could be measured, that there was a mathematical floor on how small you could make a message without losing meaning. Seventy-eight years later, we’ve got algorithms that approach that floor closely, run fast enough to be invisible, and power more of the modern world than almost any other single idea in computing.

Compression is a superpower. It’s also a ceiling. Both facts come from the same mathematics, and together they explain why the data you work with is as small as it is, and why, sometimes, it can’t be any smaller.