The situation
A billing system dumps a day’s transactions to S3 every night as gzipped CSV, in the obvious folder layout a daily cron writes:
s3://acme-billing/transactions/2026-07-25-0001.csv.gz
s3://acme-billing/transactions/2026-07-25-0002.csv.gz
s3://acme-billing/transactions/2026-07-26-0001.csv.gz
The dataset has grown to 8 TB. A finance analyst writes a SELECT account_id, SUM(amount) FROM transactions WHERE transaction_date = DATE '2026-07-25' GROUP BY account_id. Athena returns the answer in about four minutes. The query history says 8 TB scanned, $40.00. There are ~30 GB of data for that single date inside the dataset; the query did exactly the same physical work as a SELECT * over everything would have done.
The team runs ~50 of these per day. The bill is $2,000 daily, $60,000 monthly, climbing as the dataset grows. The job: close the gap between the bytes the query intends to read and the bytes Athena actually reads, without the team writing a daily metadata-registration job, and without inflating the S3 bill.
What actually matters
Before reaching for a fix, it’s worth weighing what is actually wrong, because the Athena bill is behaving correctly. Eight terabytes at $5/TB is forty dollars; the pricing model is doing what it says on the tin. The defect is not the cost per byte; it is the bytes Athena is forced to touch for a query whose logical footprint is thirty gigabytes.
The ownership question sits underneath the technical one. Whatever the fix is, someone has to own it going forward. A solution that requires a nightly Glue Crawler or an MSCK REPAIR TABLE cron introduces a new dependency: the day that cron doesn’t run, the latest day’s queries silently return nothing, and the team discovers the failure from a finance analyst rather than from an alarm. A fix whose correctness is self-maintaining, data lands, and queries against it work without a metadata sync, is materially cheaper to live with than one that adds a new moving part.
The blast radius of a bad query also deserves explicit thought. Today the worst case is $40 and four minutes, and the team has tolerated that because the bill has not quite got their attention. After the fix, a typical query will cost cents; but the same typo that would cost $40 today could cost nothing, or could still cost $40 if the filter is on the wrong column. A ceiling on per-query spend, rather than a hopeful expectation of good behaviour, is the control that makes the fix robust.
The cost shape under growth matters because the dataset is growing. A fix that is 10x better today but still scales linearly with dataset size buys a few months; a fix that scales with the query’s logical footprint instead of the dataset’s total size is durable. Partitioning earns its keep precisely here: it decouples per-query cost from total dataset size.
The coupling to the data itself is worth weighing too. The CSV files are the billing system’s output; a fix that changes how that system writes would couple two teams that currently don’t talk. A fix that operates on top of the existing files, re-encoding them on a schedule, or layering a new partitioned copy, preserves the boundary and lets the finance team move without waiting on billing.
And finally, the failure mode when a filter is missing. The current behaviour is silent: a missing WHERE clause silently costs $40. After the fix, the default should be loud, a missing filter should fail fast, not silently bill. Whatever per-query spend ceiling the query engine exposes is the moment to turn on.
What we’ll filter on
- Per-query bytes scanned drops from 8 TB to one day’s slice. Athena’s planner must know which S3 objects belong to which date before it issues reads, not by inspecting object contents.
- No manual partition registration on every new day, no daily
MSCK REPAIR TABLEcron, no daily Glue Crawler run, no per-dayALTER TABLE ADD PARTITION. - Compression beyond raw partitioning, columnar encoding plus column projection moves the bytes-scanned counter further down after pruning.
- Sustainable cost, a typical query costs cents, not dollars, and stays there as the dataset grows.
- A hard ceiling on per-query spend, a missing filter fails loudly, not silently.
The Athena cost landscape
Athena’s pricing model has three rules that explain every cent of the $40.
Rule 1: $5.00 per terabyte of data scanned (us-east-1 list price). Bytes returned to the client are not billed; bytes Athena’s worker pool reads from S3 are.
Rule 2: rounding up to the nearest megabyte, with a 10 MB minimum per query. A query that touches 100 KB still bills as 10 MB.
Rule 3: failed and cancelled queries are free, as are DDL statements (CREATE TABLE, ALTER TABLE, DROP TABLE). CREATE TABLE AS SELECT is not DDL for billing, it scans the source data, so it bills the scan.
8 TB at $5/TB is $40. The bill is correct. The waste is upstream, in how Athena is forced to read the dataset. The separate provisioned pricing tier ($0.30/DPU-hour) reserves worker capacity for thousands-of-queries-per-hour dashboards; for 50 ad-hoc queries a day, the per-byte tier is correct and the levers worth pulling are partitioning, columnar encoding, and column projection.
Partitioning in depth
Athena inherits its partitioning model from Hive. A partitioned table declares one or more partition columns at CREATE TABLE time, and the values of those columns are encoded into the S3 prefix structure rather than into the data files themselves. The convention is key=value:
s3://acme-billing/transactions-parquet/dt=2026-07-25/file-0001.parquet
s3://acme-billing/transactions-parquet/dt=2026-07-26/file-0001.parquet
The partition column dt doesn’t appear inside the Parquet files. Its value is read off the S3 key by Athena’s planner. When a query has a WHERE dt = DATE '2026-07-25' predicate, the planner does partition pruning: it identifies the matching prefixes, and tells the worker pool to read only the objects under those prefixes.
For the planner to do this, it needs to know which partition values exist and where they live. There are three ways to tell it.
Option A: register every partition in the AWS Glue Data Catalog. Each partition becomes a row in Glue’s metadata pointing at its S3 location. Athena calls GetPartitions at query-plan time. Partitions are registered three ways: explicit ALTER TABLE ... ADD PARTITION per partition; MSCK REPAIR TABLE which scans the S3 prefix tree for key=value directories; or a Glue Crawler on a schedule. All three need to run after new data lands and before a query expects to see it, a missed cron means the latest day’s queries silently return nothing.
Option B: partition projection. Skip Glue partition registration entirely. Declare the shape of the partition values as table properties, and Athena computes the partition list in-memory at query time. No GetPartitions call. No metadata round-trip. No daily registration job. New days are queryable the moment their data lands.
Option C: an open table format (Iceberg, Hudi, Delta Lake). These add a transaction log of file-level metadata in S3 alongside the data files. The table tracks which files belong to which partition without Glue registration, and supports row-level updates, deletes, and time-travel. For an immutable daily append, the table-format machinery is more than this workload needs.
Side by side
| Option | Prunes to one day | No daily registration job | Columnar encoding benefit | Sustainable cost |
|---|---|---|---|---|
| Unpartitioned CSV (status quo) | ✗ | ✓ | ✗ | ✗ |
| Glue-registered partitions (CSV) | ✓ | ✗ | ✗ | — |
| Partition projection (CSV) | ✓ | ✓ | ✗ | — |
| Glue-registered partitions + Parquet | ✓ | ✗ | ✓ | — |
| Partition projection + Parquet | ✓ | ✓ | ✓ | ✓ |
| Open table format (Iceberg/Hudi/Delta) | ✓ | ✓ | ✓ | — |
The fix, in one picture
Partition projection + Parquet, in depth
Athena supports four projection types: date (continuous sequence of timestamps via a Java DateTimeFormatter pattern; range supports literal or NOW-relative endpoints); integer (continuous integer range); enum (fixed comma-separated list); and injected (values supplied by the query itself). For the daily-dump scenario, date is the obvious choice.
The empty-partition rule matters: if more than 50% of projected partitions would be empty, prefer registered partitions in Glue instead. Projection generates the candidate list whether or not the prefix exists; many empty candidates means many wasted LIST calls. For daily data with every day populated, projection wins outright.
Columnar encoding does the second multiplication. CSV is row-major: each row’s columns are stored together, so reading one column still reads every byte of every row. Parquet is column-major with data grouped first by row group (default ~128 MB), then within a row group by column. Each column is stored contiguously, with its own dictionary, encoding (RLE, delta, bit-packing), and compression. Column projection reads only the byte ranges of requested columns; row-group footers carry min/max per column for predicate pushdown.
AWS’s published example: 1 TB on CSV at $5 becomes $1.25 on Parquet, a 4x reduction from encoding alone, before partition pruning enters the picture. On this workload the multiplier is larger: the SELECT list is narrow (a handful of columns out of 40) and the column data is repetitive (account IDs, dates) which Parquet’s dictionary encoding crushes.
The conversion path is a one-off CTAS in Athena itself, re-writing the 8 TB once at ~$40 of spend; from then on the nightly billing job writes Parquet directly (or a small Glue job converts the day’s CSV). The one-off $40 pays for itself in the first few queries of the new layout.
The table, recreated with projection
CREATE EXTERNAL TABLE transactions_parquet (
transaction_id string,
transaction_date date,
account_id string,
amount decimal(18,2)
)
PARTITIONED BY (dt date)
STORED AS PARQUET
LOCATION 's3://acme-billing/transactions-parquet/'
TBLPROPERTIES (
'projection.enabled' = 'true',
'projection.dt.type' = 'date',
'projection.dt.format' = 'yyyy-MM-dd',
'projection.dt.range' = '2024-01-01,NOW',
'projection.dt.interval' = '1',
'projection.dt.interval.unit' = 'DAYS',
'storage.location.template' = 's3://acme-billing/transactions-parquet/dt=${dt}/'
);
The analyst’s query becomes WHERE dt = DATE '2026-07-25'. Athena’s planner reads the table properties, sees projection.enabled = true, generates the candidate partition list from the range and interval in-memory, intersects with the predicate, substitutes into storage.location.template, and lists only the resulting prefix. No call to Glue’s GetPartitions. No daily metadata sync.
A worked cost trace
Status quo: CSV, no partitioning. The analyst’s WHERE transaction_date = DATE '2026-07-25' is, to Athena, a row filter on column data, the planner cannot translate it into a prefix prune. So Athena lists every object under the prefix, reads every object end-to-end, decompresses the gzip, parses every row, discards 99.6% of the rows whose transaction_date doesn’t match, and aggregates what’s left. Bytes scanned: 8,000 GB. Cost: $40.00. Time: ~4 minutes.
After the fix: Parquet, partitioned by dt, projection enabled. The data is re-written into Parquet, partitioned by dt, and the table recreated with the TBLPROPERTIES above. Now the planner prunes to a single prefix, reads only the Parquet files under dt=2026-07-25/, pushes the predicate into row-group footers, and reads only the two columns the SELECT list and GROUP BY mention (account_id and amount) out of 40.
Bytes-scanned arithmetic: original day’s CSV gzipped is ~30 GB; same day in Parquet with Snappy is ~8 GB on disk; reading 2 columns of 40 is ~5% of on-disk bytes; final scan lands at ~400-600 MB. Cost: ~0.0006 TB x $5/TB ~= $0.003 per query. 50 queries/day x $0.003 = $0.15/day, ~$4.50/month. From $60,000 to $4.50 is roughly a 13,000x reduction. Even keeping just the partitioning win and skipping the format conversion (CSV partitioned by dt) the bill drops to about $300/month, a ~200x improvement on its own.
Workgroup guard rails
The final piece is making sure a future “oops” doesn’t re-create the original problem. Athena workgroups support a per-query data scan limit, a hard ceiling, in megabytes, that automatically cancels any query exceeding it. The minimum is 10 MB and the maximum is 7 EB. For this workload, a 10 GB ceiling is generous (a typical query reads under 1 GB) and would catch a missing WHERE clause before it becomes a $40 query.
Workgroups also support per-workgroup cumulative scan alerts, a daily or hourly aggregate ceiling that triggers SNS notification. For finance, 100 GB/day per workgroup as an alert threshold catches a runaway dashboard before the bill notices.
The cancellation behaviour matters: a query cancelled by a workgroup limit is free. Athena bills cancelled queries at zero. The guard rail costs nothing when it fires.
What’s worth remembering
- Athena bills $5/TB scanned, rounded up to the megabyte, with a 10 MB per-query minimum. Bytes scanned is what Athena’s worker pool reads from S3, not bytes returned, not bytes after filtering. Failed and cancelled queries are free.
- Partitioning embeds partition values into the S3 prefix as
key=value. AWHEREpredicate against a partition column lets the planner prune to matching prefixes before any reads happen. - Glue-registered partitions require maintenance –
ALTER TABLE ADD PARTITION,MSCK REPAIR TABLE, or a Glue Crawler must run after each new partition lands, or the data is invisible to queries. - Partition projection computes the partition list from
TBLPROPERTIESin-memory at query time, with noGetPartitionscall. AWS recommends switching back to registered partitions if more than 50% of projected candidates would be empty. - Columnar formats (Parquet, ORC) store data by column, so column projection reads only referenced columns; row-group footers carry min/max statistics for predicate pushdown.
- Workgroups enforce per-query data scan limits that cancel queries above a configurable byte ceiling (10 MB to 7 EB). Cancelled queries are free.
- Open table formats (Iceberg, Hudi, Delta Lake) layer ACID metadata on top of Parquet for row-level updates, deletes, and time travel.
- CTAS is not DDL for billing, a
CREATE TABLE AS SELECTscans the source data and bills the scan. - The two multipliers stack. Partition pruning cuts scanned bytes by the ratio of matching partitions; columnar encoding cuts the remainder by the ratio of referenced columns and compression.
Convert the CSV to Parquet partitioned by dt=YYYY-MM-DD (a one-off CTAS scans the existing 8 TB once at ~$40 of spend, then the nightly job writes Parquet from then on). Register the table with partition projection. No MSCK REPAIR TABLE, no Glue Crawler, no daily metadata cron. Add an Athena workgroup with a per-query data scan limit of around 10 GB so a missing WHERE clause cancels itself before it bills. Per-query cost drops from $40 to roughly $0.003, four orders of magnitude, and the team’s monthly Athena bill on this workload falls from ~$60,000 to under $5.