Choosing Between Iceberg, Delta Lake, and Hudi

November 29, 2027 · 17 min read

Data Engineer · DEA-C01 · part of The Exam Room

The situation

We’ve been running a plain Parquet-on-S3 warehouse for two years. Glue crawls new partitions every hour, Athena queries the catalog, and the analysts mostly get what they need. Then four things happened in a month.

  • Finance found an upstream bug that had been silently inflating a revenue_cents column for three weeks. The fix was easy; the “what did the table look like before we noticed?” query was impossible. The data we’d need to compare against was overwritten when the daily job rewrote the partition.
  • Compliance asked for per-subscriber deletion. GDPR right-to-be-forgotten. Finding every partition that contained a given subscriber and rewriting it by hand took an engineer two days.
  • The streaming team landed a new pipeline that writes every thirty seconds. The Glue crawler can’t keep up, and Athena occasionally reads a half-written partition and returns truncated results.
  • An ML team started asking for “the view of this table as of 2027-10-15” to retrain a model on historical data without leaking future signal.

Three of those problems are one problem: our storage layer has files but not a transaction log. We know which files exist; we don’t know which files belonged to the table at a given point in time. The fourth, the streaming writes, is a different problem in the same neighbourhood: we need writers and readers to stop stepping on each other.

The open-source world has converged on three table formats that solve this. The question is which one fits.

What actually matters

Before reaching for a format, it’s worth asking what “table format” is actually buying us.

A table format sits between the query engine and the object store. It keeps a manifest, a list of files that make up the current version of the table, plus metadata (row counts, min/max values, null counts) that the query engine uses to prune. When a writer adds or removes data, it writes new files and a new manifest that points at the new set. Readers pick up whichever manifest is current when their query starts. The object store becomes append-only at the file level; the manifest becomes the source of truth about what’s actually in the table.
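To make the manifest idea concrete, here is a minimal in-memory sketch. ToyTable, Snapshot, and the file names are hypothetical stand-ins rather than any format's real metadata layout; the only point is that a commit appends a new immutable file list and readers pin whichever one was current when they started.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Snapshot:
    """One immutable version of the table: an ID plus the files it lists."""
    snapshot_id: int
    files: frozenset


@dataclass
class ToyTable:
    """Hypothetical in-memory stand-in for a table format's metadata."""
    snapshots: list = field(default_factory=list)

    def current(self):
        # Readers pin whichever snapshot is newest when their query starts.
        return self.snapshots[-1] if self.snapshots else Snapshot(0, frozenset())

    def commit(self, added=(), removed=()):
        # Data files are assumed to be fully written before this call;
        # appending the new snapshot is the single atomic step.
        base = self.current()
        files = (base.files | frozenset(added)) - frozenset(removed)
        snap = Snapshot(base.snapshot_id + 1, files)
        self.snapshots.append(snap)
        return snap

    def as_of(self, snapshot_id):
        # Time travel: old snapshots stay readable until they are expired.
        for snap in self.snapshots:
            if snap.snapshot_id == snapshot_id:
                return snap
        raise KeyError(snapshot_id)


table = ToyTable()
table.commit(added=["part-a.parquet"])
pinned = table.current()  # a reader starts its query here
table.commit(added=["part-b.parquet"], removed=["part-a.parquet"])

print(sorted(pinned.files))           # the running reader's view is unchanged
print(sorted(table.current().files))  # new readers see the rewritten table
print(sorted(table.as_of(1).files))   # version 1 is still addressable
```

Nothing is ever edited in place: the rewrite of part-a into part-b is visible only to readers who start after the second commit.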

That gives us three things for free.

Time travel. Because old manifests still exist and still point at files that haven’t been garbage-collected yet, a query can ask “what did this table look like at manifest version 42?” and get a consistent snapshot. The finance bug becomes a diff between two manifests; the ML team’s retraining becomes a query with a VERSION AS OF clause.

Atomic writes. The commit is the manifest swap, not the file copy. A streaming writer can stage new Parquet files for thirty seconds and then flip a pointer; readers either see the old manifest or the new one, never a half-written mix. The crawler race goes away because there’s no crawler: the manifest already knows.

Row-level operations. With a manifest listing every file, the format can record “rows matching predicate P in file F are deleted” without rewriting F immediately. GDPR deletion becomes a small delete-file plus a compaction job, not a two-day partition rewrite.

Those three capabilities are table stakes across all three formats. Where they diverge is in the mechanics: how the manifest is structured, how deletes are encoded, how concurrent writers are reconciled, and which AWS service reads each one without extra configuration. Those mechanics map onto workload shapes differently, which is why there are still three formats instead of one.

The question to hold, then: not “which format has time travel” (all three), but how each format encodes change, and which encoding matches the pipeline we’re actually running.

What we’ll filter on

Distilling that into filters we can score each format against:

  1. AWS-native read support: which AWS query engines read the format out of the box, without bolting on a connector?
  2. Write patterns supported: batch only, or streaming, or change-data-capture style upserts?
  3. Delete and update mechanics: how is a row-level delete encoded? Copy-on-write (rewrite the file) or merge-on-read (write a delete file)?
  4. Schema evolution: can columns be added, renamed, reordered, dropped without rewriting history?
  5. Concurrent writer story: what happens when two jobs commit at the same time?
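The fifth filter is easier to reason about with a sketch of optimistic concurrency in hand. VersionPointer and commit_with_retry below are hypothetical names, and the sketch is deliberately simplified: real formats also check whether the competing commit touched the same files before deciding a retry is safe.

```python
class CommitConflict(Exception):
    """Raised when another writer committed first."""


class VersionPointer:
    """Hypothetical catalog entry holding the current table version."""

    def __init__(self):
        self.version = 0

    def compare_and_swap(self, expected, new):
        # The pointer only advances if nobody else moved it in the meantime.
        if self.version != expected:
            raise CommitConflict(f"expected v{expected}, found v{self.version}")
        self.version = new


def commit_with_retry(pointer, max_attempts=3, on_attempt=None):
    """Optimistic commit: read the version, do the work, swap, retry on conflict."""
    for attempt in range(1, max_attempts + 1):
        base = pointer.version
        if on_attempt:
            on_attempt(attempt)  # stand-in for writing data files and metadata
        try:
            pointer.compare_and_swap(base, base + 1)
            return attempt
        except CommitConflict:
            continue  # re-read the pointer and try again
    raise CommitConflict("gave up after retries")


pointer = VersionPointer()


def rival_commits_once(attempt):
    # Simulate a second job landing a commit during our first attempt only.
    if attempt == 1:
        pointer.compare_and_swap(pointer.version, pointer.version + 1)


attempts = commit_with_retry(pointer, on_attempt=rival_commits_once)
print(attempts, pointer.version)  # 2 2
```

No lock is held while the data files are being written; the cost of a collision is a retry, which is cheap when commits are rare and expensive when two streaming writers collide every thirty seconds.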

The table format landscape

  1. Apache Iceberg. Originally from Netflix; now the format AWS has bet hardest on. Manifest-based: a top-level metadata file points at a list of manifest files, each of which points at data files. Snapshots are first-class: every commit produces a new snapshot, and queries can reference any snapshot by ID or timestamp. Schema evolution is fearless: columns have stable IDs rather than positions, so renames and reorders are pure metadata operations. Iceberg supports both copy-on-write and merge-on-read delete modes, configurable per-table.

  2. Delta Lake. Originally from Databricks; the open-source spec is stewarded by the Linux Foundation. The manifest is a transaction log: a directory of JSON files (_delta_log/) recording every add, remove, or metadata change, checkpointed periodically to Parquet for read performance. Deletes and updates default to copy-on-write; merge-on-read (“deletion vectors”) shipped in the 3.x line and is increasingly the default. Schema evolution is supported, but columns are matched by name unless column mapping is enabled, so column renames historically meant a rewrite.

  3. Apache Hudi. Originally from Uber; optimised for streaming-first workloads with heavy upserts. Two table types: Copy-on-Write (COW) rewrites entire file groups on update, same as Iceberg or Delta’s default; Merge-on-Read (MOR) writes small delta log files alongside each base file, with readers merging them at query time or a compaction job folding them in. Hudi’s indexing layer (Bloom, HBase, record-level) makes record-key upserts cheap, which is the headline feature if the pipeline is CDC-driven.
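The stable-column-ID idea from the Iceberg entry is worth seeing in miniature, because it explains why a rename costs nothing. This is a toy model with hypothetical column names, not Iceberg's actual metadata format: stored rows are keyed by column ID, and the schema is just an ID-to-name mapping that can change independently.

```python
# Schema maps a stable column ID to the current column name.
schema_v1 = {1: "order_id", 2: "revenue_cents"}

# Data files store values keyed by column ID, not by name or position.
data_file = [{1: "o-abc", 2: 1299}, {1: "o-xyz", 2: 4500}]


def rename_column(schema, old_name, new_name):
    """Return a new schema; no data file is touched."""
    return {cid: (new_name if name == old_name else name)
            for cid, name in schema.items()}


def read_rows(rows, schema):
    """Project stored rows through a schema, resolving IDs to names."""
    return [{schema[cid]: value for cid, value in row.items() if cid in schema}
            for row in rows]


schema_v2 = rename_column(schema_v1, "revenue_cents", "gross_revenue_cents")

print(read_rows(data_file, schema_v1)[0])  # old name, same bytes on disk
print(read_rows(data_file, schema_v2)[0])  # new name, same bytes on disk
```

A format that resolves columns by name or position has no equivalent of the ID layer, which is why a rename there tends to mean rewriting files or opting into a mapping mode.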

The AWS service landscape

Worth naming which AWS service reads which format, because this is the first thing to verify before committing:

  • Athena (engine v3): reads Iceberg and Delta Lake natively, reads Hudi (COW and MOR snapshot queries). Writes Iceberg natively via CREATE TABLE and INSERT/UPDATE/DELETE/MERGE statements; read-only for Delta and Hudi.
  • Glue ETL (3.0, 4.0, 5.0): reads and writes all three via built-in connectors. The native libraries ship in the job runtime from Glue 3.0 onward; --datalake-formats iceberg (or delta or hudi) on the job config enables them.
  • EMR: reads and writes all three, with the widest feature surface. EMR on EC2 and EMR Serverless both carry the connectors.
  • Redshift Spectrum: reads Iceberg and (more recently) Delta Lake; Hudi support is limited to Copy-on-Write tables, with no MOR snapshot reads.
  • Lake Formation: manages permissions on Iceberg and Hudi tables registered in Glue Data Catalog; Delta support is evolving.
  • S3 Tables: AWS’s managed Iceberg-only storage tier, launched 2024, with automatic compaction and snapshot expiration.

The pattern is clear: Iceberg has the broadest AWS service coverage, Delta Lake has strong Glue/EMR/Athena support and growing Redshift reach, Hudi is EMR-and-Glue-centric with Athena read support.

Side by side

Format | Native AWS reads | Write patterns | Delete mechanics | Schema evolution | Concurrent writers
Iceberg | Athena, Glue, EMR, Redshift, S3 Tables | Batch + streaming | COW or MOR per-table | Column IDs, rename-safe | Optimistic concurrency, snapshot isolation
Delta Lake | Athena, Glue, EMR, Redshift | Batch + streaming | COW default, MOR via deletion vectors | Name-based by default; rename is metadata-only with column mapping | Optimistic concurrency, log-based
Hudi | Athena (read), Glue, EMR | Streaming-first + batch | COW or MOR per-table | Positional, rename requires rewrite | Timeline-based, multi-writer with locks

Reading the table by workload rather than by format:

  • Hourly batch warehouse, mostly appends, occasional GDPR delete. Any of the three works. Iceberg wins on AWS service breadth; pick it unless there’s a reason not to.
  • Streaming CDC from RDS with millions of tiny upserts. Hudi’s MOR + record-level index is the shape that’s been optimised exactly for this. Delta with deletion vectors is competitive; Iceberg with MOR is catching up but historically was the weakest on this axis.
  • Analytics on top of Databricks-produced data. Delta Lake is the path of least resistance; the producers already speak Delta.
  • Broad multi-service reads (Athena + Redshift Spectrum + EMR + occasional QuickSight direct-query). Iceberg, because every AWS service reads it.

How the formats encode a delete

Iceberg (MOR): positional delete file + new snapshot
  • data-0.parquet (unchanged)
  • data-1.parquet (unchanged)
  • delete-0.parquet (positional): (data-1.parquet, pos=4712)
  • snapshot-v43: data files [data-0, data-1], delete files [delete-0]
  • Reader merges base + delete files
  • Compaction rewrites on schedule
  • Time travel: snapshot-v42 still references data-1 without the delete

Delta Lake: log entries + deletion vector (3.x)
  • part-0000.parquet (retained)
  • dv-0000.bin: deletion vector
  • _delta_log/00043.json: {"add": "part-0000.parquet", "deletionVector": "dv-0000"} {"commitInfo": {...}}
  • Vector = bitmap of deleted row IDs
  • Reader filters at scan time
  • COW fallback: rewrite part-0000 without the row, log a remove + add
  • Checkpoint every 10 commits to keep log reads fast

Hudi (MOR): delta log beside base file
  • base-fileId1.parquet: committed at instant t0
  • .fileId1_t1.log.1: delete block, record-key=k
  • .hoodie/timeline: t0.commit, t1.deltacommit, t2.compaction.requested
  • Snapshot query: merge base + log
  • Read-optimised: base only
  • Compaction at t2 writes new base, logs garbage-collected after

Same logical delete; three very different encodings. Iceberg writes a separate positional-delete file; Delta Lake writes a deletion vector bitmap; Hudi writes a delta log file beside the base.

The picks in depth

Iceberg for the general case. For the hourly-batch warehouse with occasional GDPR deletes and time-travel requests, Iceberg is the default. The manifest-and-snapshot model maps cleanly onto Athena’s query planner, the column-ID-based schema evolution means a rename doesn’t rewrite history, and every AWS analytics service reads it. S3 Tables packages Iceberg with automatic compaction and snapshot expiration, which removes the “you also need a maintenance job” footnote that comes with self-managed Iceberg.

Practical shape for our warehouse, in Spark SQL: CREATE TABLE orders (...) USING ICEBERG LOCATION 's3://warehouse/orders/' TBLPROPERTIES ('format-version' = '2', 'write.delete.mode' = 'merge-on-read'). (Athena’s DDL drops USING ICEBERG and sets 'table_type' = 'ICEBERG' in TBLPROPERTIES instead.) format-version = 2 is the one that adds delete files, which is what makes merge-on-read deletes and updates possible; format-version = 1 can only delete by rewriting whole data files, which is the two-day GDPR job all over again. write.delete.mode = merge-on-read keeps deletes cheap; a weekly compaction (OPTIMIZE orders REWRITE DATA USING BIN_PACK in Athena, or the rewrite_data_files procedure in Spark) folds the delete files back into rewritten data files so read performance doesn’t degrade.

Time travel is SELECT * FROM orders FOR TIMESTAMP AS OF TIMESTAMP '2027-10-15 00:00:00 UTC' in Athena engine v3, or FOR VERSION AS OF 42 by snapshot ID (real snapshot IDs are long integers; 42 is shorthand here). Snapshot retention is controlled by history.expire.min-snapshots-to-keep and history.expire.max-snapshot-age-ms; the defaults are one snapshot and five days, which is usually too tight for audit use cases. Bump the maximum age to 30 days for the finance-investigation window.
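To see what the retention window does to the time-travel horizon, here is a simplified model of snapshot expiry. The function name and the one-commit-per-day history are assumptions for illustration, and real expiry also has to respect branches, tags, and files still referenced by live snapshots; this only models "protect the newest N, expire anything else older than the cutoff".

```python
from datetime import datetime, timedelta, timezone


def snapshots_to_expire(snapshots, now, max_age, min_to_keep):
    """Return IDs of snapshots eligible for expiry.

    snapshots: list of (snapshot_id, committed_at) tuples, oldest first.
    Simplified rule: the newest `min_to_keep` snapshots are always kept;
    anything else strictly older than `max_age` is expired.
    """
    protected = {sid for sid, _ in snapshots[-min_to_keep:]} if min_to_keep else set()
    cutoff = now - max_age
    return [sid for sid, ts in snapshots
            if ts < cutoff and sid not in protected]


now = datetime(2027, 11, 29, tzinfo=timezone.utc)
# Hypothetical history: one commit per day, snapshot 40 committed today.
history = [(i, now - timedelta(days=40 - i)) for i in range(1, 41)]

five_days = snapshots_to_expire(history, now, timedelta(days=5), min_to_keep=1)
thirty_days = snapshots_to_expire(history, now, timedelta(days=30), min_to_keep=1)

print(len(five_days), len(thirty_days))  # 34 9
```

With a five-day window, 34 of the 40 daily snapshots are gone the next time expiry runs, including everything the finance investigation would need; at 30 days only the oldest nine go.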

Hudi for CDC-heavy streaming. If the pipeline shape is “Kafka or DMS feeding millions of UPSERT operations keyed by customer_id”, Hudi’s MOR table with a record-level index is the shape that’s been optimised exactly for that. Each upsert lands in a delta log file, Hudi looks up the record key to find the owning file group, and the delta log grows until a compaction rolls it into the next base file.

The index choice matters: BLOOM index is default and works for most cases; RECORD_INDEX is a Hudi 0.14+ feature that stores an HFile mapping of record key to file ID for fast upsert lookups on very large tables; HBASE index backs onto an external HBase cluster for the highest-volume cases. For a DMS-to-Hudi pipeline, RECORD_INDEX is usually where to land unless volume forces HBase.
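What a record-level index buys can be sketched with a dictionary. RecordIndex, upsert, and the file-group IDs below are hypothetical and ignore everything that makes the real thing hard (persistence, partitioning, concurrent writers); the sketch only shows the routing step, where a key lookup decides which file group's delta log receives the record.

```python
class RecordIndex:
    """Toy record-level index: record key -> file group ID."""

    def __init__(self):
        self._location = {}

    def route(self, record_key, new_file_group):
        """Return (file_group, is_update) for an incoming upsert."""
        if record_key in self._location:
            return self._location[record_key], True
        # Unseen key: assign it to the file group currently taking inserts.
        self._location[record_key] = new_file_group
        return new_file_group, False


def upsert(index, delta_logs, batch, new_file_group):
    """Append each record to the delta log of the file group that owns its key."""
    stats = {"inserts": 0, "updates": 0}
    for key, payload in batch:
        group, is_update = index.route(key, new_file_group)
        delta_logs.setdefault(group, []).append((key, payload))
        stats["updates" if is_update else "inserts"] += 1
    return stats


index, logs = RecordIndex(), {}
first = upsert(index, logs, [("c-1", {"tier": "free"}), ("c-2", {"tier": "pro"})], "fg-001")
second = upsert(index, logs, [("c-1", {"tier": "pro"}), ("c-3", {"tier": "free"})], "fg-002")

print(first, second)
print({group: len(entries) for group, entries in logs.items()})
```

The update to c-1 lands in fg-001's log even though the second batch was writing new records to fg-002. Without an index, finding that owner means probing files, which is the cost a Bloom filter reduces and an exact key-to-file mapping avoids.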

Athena reads Hudi MOR tables in two modes: _ro (read-optimised, base files only, fast but stale) and _rt (real-time, merges base + deltas, fresh but slower). Both are published as separate catalog entries by the same Hudi write, so the analyst picks whichever freshness they need per query.

Delta Lake when the producers already speak Delta. If upstream is Databricks, Spark on EMR already configured for Delta, or a team that’s internalised the Delta mental model, Delta Lake is the path of least resistance. Athena reads Delta tables natively; Glue 4.0+ writes them with --datalake-formats delta; Redshift Spectrum reads them through an external schema.

Deletion vectors (Delta 3.x) have closed most of the historical gap with Iceberg’s MOR deletes: a delete writes a small bitmap file instead of rewriting the affected Parquet. Enable with ALTER TABLE orders SET TBLPROPERTIES ('delta.enableDeletionVectors' = 'true'). The downside that keeps it from being our default-default: Delta’s schema evolution has historically matched columns by name, and column-mapping mode (which fixes this) has to be opted into per-table and isn’t universally supported across every tool that reads Delta.

A worked delete: three files, one row

Same scenario in all three formats. Our orders table has two data files, each with 10,000 rows. An engineer runs DELETE FROM orders WHERE order_id = 'o-xyz', which sits in file 1 at row 4,712.

Iceberg MOR. A new positional-delete file is written containing one record: (file_path='s3://warehouse/orders/data/data-0001.parquet', pos=4712). A new snapshot (v43) is written whose manifest references both original data files and the new delete file. data-0001.parquet is untouched. Total bytes written: roughly 1 KB for the delete file plus metadata. SELECT ... FOR VERSION AS OF 42 still returns 20,000 rows; SELECT ... FOR VERSION AS OF 43 returns 19,999.
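The Iceberg case can be modelled in a few lines to check the row counts. This is a toy reader over in-memory lists, not Iceberg's API; data-0000.parquet is a hypothetical name for the second file, and the snapshot dictionaries stand in for manifests.

```python
ROWS_PER_FILE = 10_000

# Two data files, each a list of order IDs; 'o-xyz' sits at position 4712.
data_files = {
    "data-0000.parquet": [f"o-a{i}" for i in range(ROWS_PER_FILE)],
    "data-0001.parquet": [f"o-b{i}" for i in range(ROWS_PER_FILE)],
}
data_files["data-0001.parquet"][4712] = "o-xyz"

# Each snapshot lists its data files plus positional deletes as (path, pos).
snapshots = {
    42: {"data": list(data_files), "deletes": []},
    43: {"data": list(data_files), "deletes": [("data-0001.parquet", 4712)]},
}


def scan(snapshot_id):
    """Read every data file in the snapshot, skipping deleted positions."""
    snap = snapshots[snapshot_id]
    deleted = set(snap["deletes"])
    return [row
            for path in snap["data"]
            for pos, row in enumerate(data_files[path])
            if (path, pos) not in deleted]


print(len(scan(42)), len(scan(43)))  # 20000 19999
print("o-xyz" in scan(42), "o-xyz" in scan(43))  # True False
```

The data files are identical in both snapshots; the only thing that changed between v42 and v43 is one tuple in the delete list.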

Delta Lake with deletion vectors. A new deletion-vector file is written (a tiny roaring bitmap recording row 4,712). A new log entry in _delta_log/ records an add action with the original file path and the deletion-vector reference. data-0001.parquet is untouched. Bytes written: similar scale to Iceberg. Time travel via VERSION AS OF 42 works the same way.
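A deletion vector is, at heart, a bitmap with one bit per row. The sketch below uses a plain Python integer as the bitmap rather than the roaring bitmap the format actually uses, so it shows the read-side filtering but says nothing about the on-disk encoding.

```python
def mark_deleted(bitmap, row):
    """Return a bitmap with the bit for `row` set."""
    return bitmap | (1 << row)


def is_deleted(bitmap, row):
    """True if the bit for `row` is set."""
    return ((bitmap >> row) & 1) == 1


# One data file of 10,000 rows; 'o-xyz' sits at row 4712.
rows = [f"o-{i}" for i in range(10_000)]
rows[4712] = "o-xyz"

deletion_vector = mark_deleted(0, 4712)

# The reader filters at scan time; the Parquet file itself never changes.
visible = [row for i, row in enumerate(rows) if not is_deleted(deletion_vector, i)]

print(len(rows), len(visible))              # 10000 9999
print(bin(deletion_vector).count("1"))      # 1 bit set
```

A second delete against the same file sets another bit in a new vector instead of producing a second delete file, which is one practical difference from the positional-delete approach.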

Hudi MOR. A delta log file .fileId1_t1.log.1 is written next to the base file fileId1.parquet, containing a delete block for the record key. The timeline records a new deltacommit instant. Snapshot queries see the deletion; read-optimised queries on orders_ro don’t, until the next compaction rewrites fileId1.parquet without the row. Bytes written: small, plus eventual compaction cost.
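The gap between the two Hudi query types is the part that surprises people, so here is a toy version. Dictionaries stand in for the base file and the delta log, and the function names are illustrative rather than Hudi's API.

```python
# Base file contents keyed by record key, as of instant t0.
base_file = {"o-abc": {"status": "paid"}, "o-xyz": {"status": "paid"}}

# Delta log written at t1: a single delete block for one record key.
delta_log = [("delete", "o-xyz", None)]


def read_optimised(base):
    """Base file only: fast, but blind to anything still in the log."""
    return dict(base)


def snapshot_query(base, log):
    """Merge the delta log over the base file at read time."""
    merged = dict(base)
    for op, key, payload in log:
        if op == "delete":
            merged.pop(key, None)
        else:
            merged[key] = payload
    return merged


def compact(base, log):
    """Fold the log into a new base file and start with an empty log."""
    return snapshot_query(base, log), []


print("o-xyz" in read_optimised(base_file))             # True: still visible
print("o-xyz" in snapshot_query(base_file, delta_log))  # False

new_base, new_log = compact(base_file, delta_log)
print("o-xyz" in read_optimised(new_base))              # False after compaction
```

For a GDPR delete this matters: until compaction runs, the read-optimised view still returns the subscriber's row, so either compaction has to be scheduled inside the compliance deadline or sensitive queries need to go through the snapshot view.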

Three formats, three encodings, same outcome. Scale the delete to “every row belonging to customer C, across 18 months of hourly partitions” and the differences compound: Iceberg and Delta MOR both stay cheap because they’re writing small delete files; Hudi MOR stays cheap for the same reason; any format in COW mode rewrites every file that touches the customer, which is where the performance gap opens up.
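A back-of-envelope calculation shows how wide that gap gets. Every number here is an assumption chosen for illustration: 30-day months, the customer appearing in exactly one file in every hourly partition, 128 MiB data files, and roughly 1 KiB per delete file (the figure from the single-row example above).

```python
# Assumptions, not measurements.
MONTHS = 18
DAYS_PER_MONTH = 30
HOURS_PER_DAY = 24
DATA_FILE_MIB = 128      # size of each data file that COW must rewrite
DELETE_FILE_KIB = 1      # size of each small delete file that MOR writes

# One affected file per hourly partition.
partitions = MONTHS * DAYS_PER_MONTH * HOURS_PER_DAY

cow_gib = partitions * DATA_FILE_MIB / 1024
mor_mib = partitions * DELETE_FILE_KIB / 1024

print(partitions)           # 12960 files touched
print(cow_gib)              # 1620.0 GiB rewritten under copy-on-write
print(round(mor_mib, 2))    # 12.66 MiB written under merge-on-read
```

Under these assumptions the same logical delete writes about 1.6 TiB in COW mode and under 13 MiB in MOR mode. The MOR figure defers cost rather than eliminating it, since compaction eventually rewrites the affected files, but it does so on a schedule we choose instead of inside the delete itself.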

What’s worth remembering

  1. Table formats solve three problems at once. Atomic writes, time travel, and row-level operations all fall out of moving the source-of-truth from “the files in S3” to “the manifest that lists the files”. Any of the three formats gets us there.
  2. Iceberg is the AWS default. Broadest service coverage (Athena write, Redshift Spectrum read, S3 Tables managed tier, Lake Formation permissions), rename-safe schema evolution via column IDs, MOR or COW per-table. Pick it unless something specific pushes you elsewhere.
  3. Hudi is the CDC specialist. Record-level indexing, streaming-first design, cheap upserts when the record key is the access path. MOR with _ro / _rt table variants lets analysts trade freshness for speed per query.
  4. Delta Lake is strong where Databricks lives. Deletion vectors closed the MOR-delete gap; Glue and EMR read and write it; Athena and Redshift Spectrum read it. Column mapping is the feature to enable for rename-safe schema evolution.
  5. format-version = 2 on Iceberg is table stakes. v1 is append-only-friendly but has no delete files, so row-level deletes and updates fall back to rewriting whole data files. Create new tables as v2 unless there’s a reason.
  6. Compaction isn’t optional. MOR deletes accumulate small delete files that degrade read performance over time. Iceberg’s OPTIMIZE ... REWRITE DATA USING BIN_PACK, Delta’s OPTIMIZE command, and Hudi’s compaction schedule all need to run on a cadence. S3 Tables runs this for you on Iceberg.
  7. Snapshot retention is a policy, not a default. Default retention (Iceberg expires snapshots older than 5 days and guarantees only the latest one; similar elsewhere) is usually too short for the “investigate a bug we noticed last week” use case. Set it to the investigation window plus a margin.
  8. Read support is the first thing to verify. “Is this format readable by Athena / Redshift Spectrum / QuickSight direct-query?” is the question that eliminates candidates fastest. All three read Iceberg; not all three read Hudi; Delta’s Redshift story is newer.

The format isn’t the warehouse; it’s the transaction log sitting between the query engine and S3. Each of the three has a shape of workload it was designed for, and each has closed the gap with the others over time. The work is matching the format to the pipeline: Iceberg for breadth, Hudi for upsert-heavy streaming, Delta where the rest of the stack already speaks Delta.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.