Choosing Athena, Glue, or EMR for Parquet on S3

November 08, 2027 · 15 min read

Data Engineer Associate · DEA-C01 · part of The Exam Room

The situation

A product analytics team at a consumer app company has 5 TB of clickstream events in S3. The raw JSON arrives hourly from the telemetry pipeline, and a separate data engineering team runs nightly AWS Glue jobs that convert the raw JSON into Parquet, partitioned by event_date and event_type, written back to a different S3 prefix. That upstream pipeline is already owned, funded, and paged-for; it is out of scope for what happens next.

The question is what sits on top of that Parquet. Analysts want to ask questions interactively: scan one to seven days at a time out of hundreds of partitions, run several queries a day for a week or two of focused work, then leave the dataset alone for weeks. They do not want to own a Spark cluster, size it, patch it, or remember to stop it on a Friday evening. The workload is bursty, exploratory, and mostly silent, and the infrastructure has to match that shape without anyone remembering to turn it off.

What actually matters

Before reaching for a service, it helps to name what the team actually owes the business on this workload, and what they explicitly do not.

The ownership story is the loudest one. The DE team already owns the ingest pipeline; the analytics team wants to own the questions, not another piece of infrastructure. Anything that bolts a cluster, a control plane, or a scheduler onto the analytics team’s plate moves ownership in the wrong direction. The test is: who gets paged when this breaks at 02:00 on a Saturday, during a sprint that ends on Monday?

The cost shape matters more than the cost level. Five terabytes sitting in S3 is cheap to store and cheap to query if the per-query cost tracks the actual scan. What the team cannot afford is a fixed hourly bill that runs during the weeks between sprints, when nobody is asking anything. That rules out any pricing model where idle capacity shows up on the invoice, even if the per-query number looks great when someone is actively using it.

The blast radius of an accidental expensive query is worth weighing. An analyst typing SELECT * without a date filter on a growing dataset is an operational reality, not an edge case. The chosen tool needs a natural place to put a ceiling on the damage, a per-query byte limit, a workgroup quota, a cluster bound, so a missed WHERE clause does not become a finance incident.

The failure mode under growth is the next thing to ask about. The dataset is 5 TB today and will keep growing. Tools that behave fine at 5 TB but start needing tuning at 20 TB set a time bomb the team has to defuse later, often under pressure. A model that scales invisibly, where doubling the data doubles the scan cost and nothing else, is cheaper over the life of the workload than a model that looks cheaper per hour but demands a re-architecture at the next magnitude.

And finally, the coupling to the existing pipeline. The DE team’s nightly Glue job already registers tables in the AWS Glue Data Catalog. Anything that reads those tables for free is a small architectural choice; anything that requires a second schema store, a second crawler, or a second copy of the data is a second surface to keep in sync, and a quiet failure mode the day those two surfaces disagree.

What we’ll filter on

Distilling that into filters the candidates can be scored against:

  1. Ad-hoc and interactive. The analyst writes SQL, waits seconds to a couple of minutes, decides what to ask next.
  2. No infrastructure to manage. No EC2, no Kubernetes, no bootstrap scripts, no Spark configuration.
  3. Pay-per-query, not pay-per-hour. Weeks of silence cost nothing beyond the S3 storage bill the dataset was already incurring.
  4. Scales to terabytes without re-architecting. 5 TB today, 20 TB next year, and the same tool still answers.
  5. Reuses the existing Glue-registered Parquet without a second ingest step.

The analytics-on-S3 landscape

AWS publishes several services for querying data in S3; they overlap at the edges, and a lot of the confusion in this space comes from not seeing which lane each one lives in.

Amazon Athena. Serverless interactive SQL on S3. No cluster; submit SQL, Athena runs it, pay per terabyte of data scanned. Uses the AWS Glue Data Catalog as its default schema store, so any table the nightly Glue job registers is immediately queryable. Engine version 3 is built on the open-source Trino project; supports ANSI SQL, window functions, complex types, CTEs. Tuned for ad-hoc interactive queries that complete in seconds or low minutes.

AWS Glue. Two different things share the name. The Glue Data Catalog is a Hive-compatible metastore, a central schema registry that Athena, EMR, Redshift Spectrum, and Glue jobs all consult. The Glue ETL service runs Apache Spark, Python Shell, or Ray jobs on managed workers, billed per DPU-hour. A DPU is 4 vCPU and 16 GB of memory; standard Spark jobs bill at $0.44 per DPU-hour, one-minute minimum, per-second afterwards. Glue ETL is shaped for scheduled ingest and transformation, read source, transform, write target, sleep. Not tuned for interactive SQL.

Amazon EMR. Managed Hadoop/Spark/Hive/Presto/Trino clusters in four deployment flavours: EMR on EC2 (classic provisioned clusters, 7-8 minute cold start, pay per EC2 instance-hour plus EMR uplift); EMR on EKS (Spark on your Kubernetes cluster); EMR Serverless (auto-scaling, per-second billing, 1-3 minute cold start); EMR on Outposts (on-premises). EMR’s sweet spot is long-running Spark, custom JARs, GPU acceleration, and workloads that share state across stages.

Redshift Spectrum. Lets a Redshift cluster read S3 data as external tables. Useful when the analytics team already lives inside Redshift. Not relevant here, no Redshift cluster, and the team wants no cluster at all.

Lake Formation. A permissions and governance layer on top of the Glue Data Catalog. It does not execute queries; it tells Athena, Redshift Spectrum, EMR, and Glue what each user is allowed to see.

Side by side

Option Ad-hoc / interactive No infra to manage Pay-per-query Scales to TB Reuses Glue-registered Parquet
Amazon Athena
AWS Glue ETL
EMR Serverless
EMR on EC2
Redshift Spectrum

Athena is the only row with all ticks. Glue ETL is the correct shape for scheduled transformation (which is already running) but wrong for interactive SQL. EMR Serverless removes the cluster-management burden but still bills for compute-seconds rather than bytes scanned, so weeks of silence still cost something when any capacity stays pre-initialised. EMR on EC2 and Redshift Spectrum both require a cluster the team has said they don’t want.

Matching the shape to the service

Five candidates, three gates, one survivor Athena serverless SQL on S3 Glue ETL scheduled Spark jobs EMR Serverless auto-scaling Spark EMR on EC2 managed Hadoop/Spark Redshift Spectrum S3 via Redshift Gate 1: ad-hoc interactive SQL? Glue ETL out: scheduled, not interactive Gate 2: no cluster to run? EMR on EC2 + Redshift Spectrum out: cluster required Gate 3: zero cost between queries? EMR Serverless out: per-second compute, not per-scan Amazon Athena
Three sequential gates, interactive, cluster-free, zero-cost-when-idle, eliminate the other four candidates in turn. Athena is the only service that passes every gate.

Athena, in depth

The query model. Athena accepts ANSI SQL, compiles it through its Trino-based engine (engine version 3), parallelises execution across a pool of workers AWS manages entirely behind the scenes, and returns results. There are no nodes to see, no task scheduler to tune. SQL in, rows out.

Schema via the Glue Data Catalog. Athena has no schema store of its own; it reads table definitions from the Glue Data Catalog by default. When the DE team’s nightly Glue job writes Parquet and registers the table’s partitions, Athena can query them immediately, no separate DDL, no crawler run, no second step.

Pricing: $5.00 per TB scanned (us-east-1 list price), rounded up to the nearest megabyte, with a 10 MB per-query minimum. Failed, cancelled, and DDL queries are free. No storage fee beyond the S3 bill. No charge when idle.

Two pricing facts do most of the work. Athena bills on bytes scanned, not bytes returned, a SELECT COUNT(*) over a 5 TB table scans 5 TB if the planner cannot prune partitions or columns. Columnar formats drop the scan dramatically. Parquet and ORC are columnar, so Athena reads only the referenced columns. AWS’s own pricing page walks through a 1 TB CSV query at $5 dropping to $1.25 in Parquet with ~75% fewer bytes scanned. The scenario’s data is already Parquet; half the battle is won.

Partition pruning. If a query filters on a partition column (WHERE event_date = '2026-05-19'), Athena’s planner consults the Glue Data Catalog, identifies which S3 prefixes match, and skips every other prefix entirely. For a 5 TB dataset partitioned by event_date and event_type across ~180 days and 10 event types, a WHERE event_date BETWEEN '2026-05-13' AND '2026-05-19' filter scans 7/180 of the data, ~195 GB instead of 5,000 GB. Adding AND event_type = 'click' drops that to ~19.5 GB. The bill goes from $25 to $0.10. Partition pruning is the single biggest cost lever.

Workgroups. Athena groups queries into workgroups, each with a per-query scan limit, per-workgroup cumulative scan limit, default output location, CloudWatch metrics stream, and encryption settings. A 100 GB per-query ceiling stops an analyst from accidentally scanning the whole dataset with a missing WHERE clause, the blast-radius control the team needs.

Provisioned capacity and federated queries. For predictable high-concurrency workloads, Athena offers provisioned capacity at $0.30 per DPU-hour, relevant for dashboards running thousands of concurrent queries, not for this pattern. Federated queries reach beyond S3 via Lambda-packaged connectors; out of scope here, worth knowing the pattern exists.

A worked cost trace

Take the workload’s shape: a single analyst running a query that scans 7 days of the 180-day dataset, filtered on one event type out of ten, picking maybe 5 columns out of 40.

Scanned volume: 5 TB x (7/180) x (1/10) ~= 19.5 GB. Parquet column projection cuts that further to ~2.4 GB read from disk. Athena cost: 2.4 GB x $5/TB = $0.012 per query. Ten queries a day for a two-week sprint: ~$1.70 total. Between sprints, zero. Time to result: single-digit seconds.

The same query on EMR on EC2 would launch a cluster of 3 m5.xlarge instances at ~$0.192/hr each plus EMR uplift (~$0.048/hr each), about $0.72/hr. Cold-start is 7-8 minutes; the query itself takes perhaps 30 seconds. Even run-and-tear-down, a single query costs roughly $0.72, 60x Athena, with an 8-minute wait before the first row. Leave the cluster running “just in case” and it’s ~$40/week for a workload Athena does for $1.70 across the whole sprint.

EMR Serverless narrows the gap (per-second billing, 1-3 minute cold start), but still pays for compute seconds rather than bytes scanned. Athena’s per-byte model wins for “scan 2 GB once” queries; EMR Serverless starts winning when the pattern shifts to long-running CPU-bound Spark.

ServiceTime to answerCost per queryCost between queries
Athena~5 seconds$0.012$0.00
Glue job~2 minutes$0.04$0.00
EMR Serverless~1-3 minutes~$0.10$0.00 (or pre-init fee)
EMR on EC2 (persistent)~30 seconds~$0.72/hr prorated$0.72/hr

Athena is cheapest, fastest to answer, and charges nothing between sprints.

When EMR or Glue ETL is the correct answer

EMR earns its keep when the workload stops fitting the “run a SQL query, read a result” shape: long-running Spark pipelines, custom Spark UDFs bundled with JARs, GPU-accelerated ML training, streaming pipelines via Spark Structured Streaming or Flink, and the broader non-SQL Hadoop ecosystem. None of that applies here.

Glue ETL is the scheduled-ETL engine, already running in this scenario doing the nightly raw-JSON to Parquet conversion. Glue ETL wins for scheduled ingest and transformation jobs, Glue Data Quality rules, lineage tracking via the Data Catalog, crawlers that auto-discover schema, and native connectors for Iceberg, Hudi, and Delta Lake. For the analytics team’s interactive exploration, Glue ETL is the wrong shape. PySpark at an analyst who wants SQL, paying per DPU-hour instead of per scan.

The Iceberg / Hudi / Delta angle

Raw Parquet in S3 is append-only and schema-on-read. Partitions can be added; rows cannot be UPDATEd in place or DELETEd efficiently. For clickstream this is fine, events are immutable. For datasets that need row-level mutations, plain Parquet is not enough.

Apache Iceberg, Apache Hudi, and Delta Lake layer an ACID transaction log on top of Parquet files in S3. They enable row-level inserts, updates, and deletes via MERGE; time travel to a past snapshot; schema evolution; and background compaction. Athena queries all three; Iceberg has the deepest integration.

For this workload, immutable clickstream, no row mutations, plain Parquet is fine. If the DE team later needs GDPR deletes, Iceberg becomes a meaningful conversation. Not today.

What’s worth remembering

  1. Athena is the serverless SQL-on-S3 engine; Glue ETL is scheduled ETL; EMR is managed Hadoop/Spark clusters. The three overlap but fit different shapes.
  2. Athena uses the Glue Data Catalog as its default schema store, so tables registered by Glue jobs are immediately queryable with no extra step.
  3. Athena pricing is $5/TB scanned with a 10 MB per-query minimum; failed and cancelled queries are free; no cost between queries. Columnar formats and partition pruning are the dominant cost levers.
  4. Athena engine version 3 is Trino-based, ANSI SQL with a rich function library; engine version is set per workgroup.
  5. Partition projection computes partition lists in-memory from table properties instead of calling GetPartitions on the Glue Data Catalog.
  6. EMR cold-start on EC2 is roughly 7-8 minutes; EMR Serverless is 1-3 minutes. Both charge for compute time regardless of bytes scanned.
  7. Glue Spark jobs bill at $0.44 per DPU-hour (one-minute minimum, per-second afterwards); a DPU is 4 vCPU + 16 GB RAM.
  8. Lake Formation is a permissions layer on top of the Glue Data Catalog, not a query engine. Athena, Redshift Spectrum, EMR, and Glue all respect Lake Formation grants.
  9. Workgroups are where blast-radius controls live, per-query data scan limits cancel runaway queries for free before they bill.
  10. Plain Parquet is fine for immutable append-only datasets; Iceberg/Hudi/Delta are for row-level mutation and time travel.

Run the exploratory queries in Amazon Athena, reading the Parquet directly from S3 via the table definitions the nightly Glue job already registers in the Glue Data Catalog. Use partition pruning via WHERE event_date and WHERE event_type filters to keep per-query scans in the single-gigabyte range. Configure an Athena workgroup with a sensible per-query data-scan limit so a missing filter can’t accidentally scan the whole 5 TB. Leave the existing Glue nightly job untouched. At a cent per query, a handful of queries a day during sprints, and nothing between sprints, the annual query bill lands in the tens of dollars.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.