The situation
Our raw zone in S3 receives eight ingests a day from five upstream systems. A Glue job runs hourly, catalogs new partitions, transforms them into a curated Parquet dataset, and registers them in the warehouse. Analysts query the curated dataset through Athena and QuickSight.
Last Tuesday, one of the upstream systems sent a file where a column header had moved. The transform job didn’t know the new header was wrong; it mapped positionally, wrote Parquet, cataloged the partition, and reported success. The bad data sat in the warehouse for three days. On Friday an analyst noticed the revenue chart looked like someone had sold six million dollars’ worth of null. Two days of root-cause analysis, one angry customer call, one apologetic Slack thread.
The interesting thing is that the bad data had signatures we could have caught. The revenue_cents column was 40% null when it’s normally under 1%. The currency column had values that weren’t in our allowed list. The row count for the partition was half what it should have been. Every one of those checks is a single line of code we didn’t write because nobody had owned writing it.
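To make "a single line of code" concrete, here is a minimal sketch of those three checks in plain Python over a list of row dictionaries. The function names, the allow-list, and the sample rows are illustrative assumptions, not part of the pipeline described here; the thresholds mirror the figures in the paragraph above.

```python
ALLOWED_CURRENCIES = {"USD", "GBP", "EUR"}  # assumed allow-list, for illustration


def null_fraction(rows, column):
    """Fraction of rows where `column` is missing or None."""
    if not rows:
        return 1.0
    return sum(1 for r in rows if r.get(column) is None) / len(rows)


def unknown_values(rows, column, allowed):
    """Distinct non-null values of `column` that are not in `allowed`."""
    return {r[column] for r in rows
            if r.get(column) is not None and r[column] not in allowed}


def check_partition(rows, expected_min_rows):
    """Return a list of violation messages; an empty list means clean."""
    violations = []
    frac = null_fraction(rows, "revenue_cents")
    if frac > 0.01:
        violations.append(f"revenue_cents is {frac:.0%} null (limit 1%)")
    bad = unknown_values(rows, "currency", ALLOWED_CURRENCIES)
    if bad:
        violations.append(f"currency has unknown values: {sorted(bad)}")
    if len(rows) < expected_min_rows:
        violations.append(
            f"row count {len(rows)} below minimum {expected_min_rows}")
    return violations


# Hypothetical sample partitions.
good_rows = [{"revenue_cents": 1200, "currency": "USD"}] * 10
bad_rows = ([{"revenue_cents": None, "currency": "XXX"}] * 2
            + [{"revenue_cents": 500, "currency": "EUR"}] * 3)

for message in check_partition(bad_rows, expected_min_rows=10):
    print(message)
```

Each check is a line or two; the missing piece was ownership, not complexity.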
What we want is a set of assertions that run as part of the ingest pipeline, reject partitions that violate them, and make the rejection visible enough that someone is paged before the analyst notices.
What actually matters
Before reaching for a tool, it’s worth asking what “data quality” actually means in practice.
Data quality rules fall into a handful of families.
Completeness. Does the column have values where we expect values? IsComplete "revenue_cents" means “no nulls”; Completeness "revenue_cents" > 0.99 means “at least 99% non-null”. Completeness rules catch the “upstream stopped sending this field” bug.
Validity. Are the values within the domain we expect? ColumnValues "currency" in ["USD", "GBP", "EUR"] rejects rows with unknown currencies; ColumnValues "age" between 0 and 120 rejects impossible ages. Validity rules catch the “upstream sent a value we don’t know how to handle” bug.
Uniqueness. Are the values that should be unique actually unique? IsUnique "order_id" rejects duplicate orders. Uniqueness rules catch the “upstream replayed a batch” bug.
Cardinality and volume. Does the partition have roughly the rows we expect? RowCount between 10000 and 50000 rejects half-loaded partitions; ColumnCount = 42 rejects schema drift. Volume rules catch the “upstream file is truncated” bug.
Distribution. Does the data look like it usually does? Mean "order_total_cents" between 2000 and 8000 rejects partitions where the average order has collapsed or exploded. Distribution rules catch the subtle bugs where the data looks valid row-by-row but wrong in aggregate.
Cross-column and referential. Do the relationships between columns or tables hold? ReferentialIntegrity "shipping_country" "countries.country_code" = 1.0 rejects partitions that contain shipments to countries not in our allow-list. Referential rules catch the “upstream made up values that aren’t in our lookup” bug.
A useful quality suite has rules from several of these families. Complete coverage isn’t the goal; the ten rules that would have caught the last ten bugs are the goal. The suite grows by incident, not by guessing.
The second question is what happens when a rule fails. Three reasonable answers, each appropriate for different rules:
- Fail the job. The partition doesn’t get promoted. The on-call is paged. Use for rules whose violation would corrupt the warehouse: completeness on primary keys, uniqueness on IDs, referential integrity.
- Quarantine the partition. The partition is moved to a quarantine location, not promoted, and a ticket is filed. Use for rules where a human should review but the pipeline doesn’t need to stop.
- Warn and continue. The rule’s result is written to metrics; the partition is promoted anyway. Use for soft rules (distribution, volume bounds) where a threshold crossing is a signal, not a failure.
The tool has to support all three, because a single hard threshold for everything turns into either “alert fatigue” or “the pipeline blocks every time the weekend volume dips”.
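One way to picture the three answers is as a per-rule policy table, where the strictest action among the failed rules wins. The sketch below is an assumption about how such a policy could be coded, not a feature of any particular tool; the `Action` enum, the `RULE_ACTIONS` mapping, and the `decide` helper are all hypothetical names.

```python
from enum import Enum


class Action(Enum):
    WARN = 1        # record the signal, promote anyway
    QUARANTINE = 2  # hold the partition for human review
    FAIL = 3        # stop the job and page the on-call


# Hypothetical policy: rule text -> action to take if that rule fails.
RULE_ACTIONS = {
    'IsComplete "order_id"': Action.FAIL,
    'IsUnique "order_id"': Action.FAIL,
    'ColumnValues "currency" in ["USD", "GBP", "EUR"]': Action.QUARANTINE,
    'RowCount between 10000 and 50000': Action.WARN,
}


def decide(failed_rules, default=Action.QUARANTINE):
    """Return the strictest action among the failed rules, or None if none failed.

    Rules missing from the policy fall back to `default`, so a newly added
    rule is never silently ignored.
    """
    actions = [RULE_ACTIONS.get(rule, default) for rule in failed_rules]
    if not actions:
        return None
    return max(actions, key=lambda action: action.value)


print(decide(['RowCount between 10000 and 50000']))
print(decide(['RowCount between 10000 and 50000', 'IsUnique "order_id"']))
```

Defaulting unknown rules to quarantine is a design choice: it errs toward review rather than toward either paging or silence.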
What we’ll filter on
Distilling that exploration into filters we can score each option against:
- Rule language expressiveness: can we assert completeness, validity, uniqueness, distribution, and referential integrity?
- Integration with Glue jobs and the Data Catalog: does the check run inside the pipeline and know about the same tables the job does?
- Actions on failure: can we fail the job, quarantine, or warn, per rule?
- Metrics and history: are rule outcomes stored somewhere we can query to see drift over time?
- Anomaly detection: can the tool learn “normal” and flag deviations without us hard-coding thresholds?
The quality-check landscape
- AWS Glue Data Quality. A native Glue feature, built on Deequ (the Spark data-quality library AWS open-sourced in 2018). Rules are expressed in DQDL (Data Quality Definition Language), a compact SQL-like grammar. Rules run either as a standalone evaluation against a catalog table (scheduled via Glue triggers or EventBridge) or as an inline node in a Glue ETL job’s visual graph, where the evaluation becomes part of the pipeline and its results drive the job’s next step. Observations go to CloudWatch, results go to Glue’s data-quality result store, and optionally to S3 for audit. Supports static thresholds and anomaly detection learned from the history of statistical metrics.
- Deequ library on EMR or Glue. The underlying library, open-source, written in Scala for Spark. More expressive than DQDL (you can write arbitrary Spark checks), more work to integrate (you run it yourself inside a Spark job). The correct answer when DQDL’s grammar doesn’t cover the check you need; overkill for the ninety percent of checks it does cover.
- Great Expectations. Open-source Python library with a large ecosystem of expectations, documentation generation, and data-doc rendering. Runs anywhere Python runs: Lambda, Glue Python-shell jobs, EMR, external compute. Strong community, strong storytelling around “data contracts”. Integration with AWS services is user-assembled rather than native.
- Custom Glue ETL with PyDeequ. PyDeequ is the Python wrapper over Deequ; import it in a Glue ETL job and write the checks inline. It sits between Glue Data Quality (the managed layer above Deequ) and raw Deequ (just the library). Works, but reinvents what Glue Data Quality now does for you.
- Lake Formation tag-based data filters + DQ integration. Lake Formation can enforce row-level and cell-level security based on tags. Paired with Glue Data Quality’s results (which can tag partitions with their DQ score), permissions can be conditioned on quality, for example “analysts only see partitions with DQ score > 0.95”. Interesting layering once the DQ foundation is in place.
Side by side
| Option | Rule expressiveness | Glue / Catalog integration | Actions on failure | Metrics + history | Anomaly detection |
|---|---|---|---|---|---|
| Glue Data Quality | High (DQDL + custom SQL) | Native | Fail job / quarantine / warn | CloudWatch + result store | ✓ (statistics-based) |
| Deequ on EMR/Glue | Very high (arbitrary Spark) | Manual | User-coded | User-coded | ✓ (via library) |
| Great Expectations | Very high (Python) | Manual | User-coded | Rendered docs | ✗ (plugins only) |
| PyDeequ in Glue ETL | High (library) | Manual | User-coded | User-coded | ✓ (via library) |
Reading by use case:
- Start with Glue Data Quality. It’s the native, managed, catalog-integrated answer. DQDL covers the great majority of rules you’d want to write on a warehouse dataset. The escape hatch to custom SQL rules covers most of the rest.
- Reach for PyDeequ or Great Expectations when the rule grammar doesn’t fit. Rare. Keep the DQ suite in Glue Data Quality and run the exotic checks alongside it in the same job; don’t move the whole suite because one rule needed more expressiveness.
The DQDL rule language
Rules are declared in a Rules = [ ... ] block. A rule has a rule type, a column reference where the rule applies to a column (dataset-level rules such as RowCount take none), and usually a threshold expression:
    Rules = [
        RowCount between 10000 and 50000,
        IsComplete "order_id",
        IsUnique "order_id",
        Completeness "revenue_cents" > 0.99,
        ColumnValues "currency" in ["USD", "GBP", "EUR", "AUD"],
        ColumnValues "order_total_cents" between 0 and 1000000,
        Mean "order_total_cents" between 2000 and 8000,
        StandardDeviation "order_total_cents" between 500 and 4000,
        DistinctValuesCount "customer_id" > 100,
        ColumnDataType "placed_at" = "TIMESTAMP",
        Uniqueness "order_id" = 1.0,
        DatasetMatch "ref_orders" "order_id" >= 0.99,
        ReferentialIntegrity "currency" "dim_currency.code" = 1.0,
        CustomSql "SELECT COUNT(*) FROM primary WHERE revenue_cents < 0" = 0
    ]
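Because a ruleset is ultimately just a string, it can be kept in version control as a list of rules and rendered at deploy time. The helper below is a minimal sketch; `render_ruleset` and `ORDERS_RULES` are hypothetical names, and the rule text is copied from the block above rather than validated against the DQDL grammar.

```python
def render_ruleset(rules):
    """Join individual DQDL rule strings into a `Rules = [ ... ]` block."""
    if not rules:
        raise ValueError("a ruleset needs at least one rule")
    body = ",\n".join(f"    {rule.strip()}" for rule in rules)
    return f"Rules = [\n{body}\n]"


# Hypothetical subset of the rules for the orders dataset.
ORDERS_RULES = [
    'IsComplete "order_id"',
    'IsUnique "order_id"',
    'Completeness "revenue_cents" > 0.99',
    'ColumnValues "currency" in ["USD", "GBP", "EUR", "AUD"]',
]

ruleset = render_ruleset(ORDERS_RULES)
print(ruleset)

# Not executed here: registering the string against a catalog table would go
# through boto3's Glue client, e.g.
#   glue.create_data_quality_ruleset(Name=..., Ruleset=ruleset, TargetTable=...)
# with the name and target table being deployment-specific.
```

Keeping one rule per list entry makes diffs readable: an incident review that adds a rule shows up as a single added line.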
Families of functions worth knowing:
- Completeness family. IsComplete, Completeness. The first is a shortcut for Completeness = 1.0; the second takes an explicit threshold for partial-completeness rules.
- Uniqueness family. IsUnique, Uniqueness, DistinctValuesCount. IsUnique is the shortcut; Uniqueness allows thresholds (“at least 99% unique” for systems with known-tolerable duplicates).
- Value-domain family. ColumnValues ... in [...] for allow-lists, ColumnValues ... between X and Y for numeric ranges, ColumnValues ... matches "regex" for patterns like email or postcode.
- Statistical family. Mean, Sum, StandardDeviation, Entropy, Correlation. These are where anomaly detection shines: instead of hard-coding Mean ... between 2000 and 8000, you can write a dynamic rule that compares against recent runs, such as Mean "order_total_cents" > avg(last(10)) * 0.8, or a DetectAnomalies rule on the metric, and DQ compares against the recorded history.
- Schema family. ColumnCount, ColumnExists, ColumnDataType, RowCount. Cheap canaries for schema-drift bugs.
- Referential family. DatasetMatch compares two datasets; ReferentialIntegrity checks that values in one column exist as values in another dataset’s column.
- Custom SQL. CustomSql "SELECT ..." = N executes the query against the dataset under evaluation and compares its scalar result to the expected value. The escape hatch for rules DQDL doesn’t express directly.
A second block, Analyzers, collects metrics without asserting thresholds on them. It is useful for metrics in the “track this, we don’t know the normal yet” stage; once you do know, graduate them to Rules with thresholds.
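The idea of tracking a metric first and deriving a threshold later comes down to simple arithmetic. The sketch below is not Glue's internal algorithm; it only illustrates, under the assumption of a mean plus or minus k standard deviations band, how a recorded history of one metric could turn into a pass/fail decision. The function names and the sample history are hypothetical.

```python
from statistics import mean, stdev


def learned_band(history, k=3.0):
    """Return (low, high) = mean -/+ k sample standard deviations."""
    if len(history) < 2:
        raise ValueError("need at least two past observations")
    centre = mean(history)
    spread = stdev(history)
    return centre - k * spread, centre + k * spread


def is_anomalous(value, history, k=3.0):
    """True if `value` falls outside the band learned from `history`."""
    low, high = learned_band(history, k)
    return not (low <= value <= high)


# Hypothetical daily means of order_total_cents from earlier runs.
history = [4100.0, 3950.0, 4200.0, 4050.0, 3900.0, 4150.0, 4000.0]

low, high = learned_band(history)
print(f"band: {low:.0f} to {high:.0f}")
print(is_anomalous(4120.0, history))  # an ordinary day
print(is_anomalous(900.0, history))   # the average order has collapsed
```

The band is only as good as the history behind it, which is why a metric spends time as an analyzer before it becomes a rule.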
How the pipeline uses the result
The inline evaluation pattern
Glue Data Quality can run in two modes. The first is standalone evaluation: pick a catalog table, attach a ruleset, schedule a run with a Glue trigger or EventBridge. Useful for monitoring datasets that aren’t produced by a Glue job we control, such as a table populated by an external pipeline, where all we can do is check after the fact.
The second, and the one that matters for “fail loudly”, is inline in a Glue ETL job. The job’s visual graph gains an Evaluate Data Quality node. Connect the node’s input to the output of the transform step that reads the raw partition; connect its output to a Conditional Router node that branches on the DQ outcome. The pass branch connects to whatever would have been the next step (write curated, catalog, etc.); the fail branch connects to a different sink: a quarantine bucket, an SNS publish, or job termination by raising an exception.
The Evaluate Data Quality node has two action settings that matter: On data quality failure and Ignore data quality failures. Set the first to “Fail job” if any rule failure should stop the pipeline; leave it at “Pass job” when the conditional router takes the action for you. Set the second at the rule level via DQDL’s with_action modifier: IsComplete "order_id" with_action = WARN marks a specific rule as non-fatal while keeping the rest strict.
Inline evaluation gets results into the result store the same way standalone does, so history queries work regardless of which mode you used.
A worked quarantine: what the failure does
Same scenario as the opener: the upstream sends a file with a moved column header. The Glue job reads the partition. The Evaluate Data Quality node runs. Completeness "revenue_cents" > 0.99 evaluates to 0.60, because 40% of the column is null. The rule fails. The rule outcome is recorded in the DQ result store with the failing metric value.
The Conditional Router sees ruleOutcomes contains at least one Failed. It routes the DynamicFrame to the quarantine branch, which writes to s3://acme-quarantine/orders/dt=2027-11-29/ingest-2027-11-29T14-00/ with a sidecar dq-failures.json listing which rules failed and why.
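The quarantine path and the sidecar can be derived from the ingest time and the rule outcomes. The sketch below assumes each outcome is a dictionary with `Rule`, `Outcome`, and `FailureReason` keys, and that the sidecar is a small JSON document; the helper names and the JSON layout are illustrative assumptions, not a format the pipeline prescribes.

```python
import json
from datetime import datetime


def quarantine_prefix(bucket, dataset, ingest_time):
    """Build the S3 prefix for a quarantined ingest, keyed by day and ingest time."""
    day = ingest_time.strftime("%Y-%m-%d")
    stamp = ingest_time.strftime("%Y-%m-%dT%H-%M")
    return f"s3://{bucket}/{dataset}/dt={day}/ingest-{stamp}/"


def failure_sidecar(rule_outcomes):
    """Render the failed rules and their reasons as a JSON string."""
    failed = [
        {"rule": outcome["Rule"], "reason": outcome.get("FailureReason", "")}
        for outcome in rule_outcomes
        if outcome["Outcome"] == "Failed"
    ]
    return json.dumps({"failed_rules": failed}, indent=2)


# Hypothetical outcomes for the bad partition from the scenario above.
outcomes = [
    {"Rule": 'IsComplete "order_id"', "Outcome": "Passed"},
    {"Rule": 'Completeness "revenue_cents" > 0.99', "Outcome": "Failed",
     "FailureReason": "Value: 0.60 does not meet the constraint requirement"},
]

prefix = quarantine_prefix("acme-quarantine", "orders",
                           datetime(2027, 11, 29, 14, 0))
print(prefix + "dq-failures.json")
print(failure_sidecar(outcomes))
```

Writing the reasons next to the data means whoever opens the quarantine prefix later does not have to look up the run in the result store first.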
Meanwhile, Glue emits a Data Quality Evaluation Results Available event to the default EventBridge bus with detail.state: FAILED and a link to the result. An EventBridge rule matching that pattern fans out: a Lambda files a Jira ticket with the failure context, an SNS topic pages the on-call, an S3 target archives the event for audit.
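An EventBridge rule for this fan-out needs only a small event pattern. The sketch below expresses one as a Python dictionary, under the assumption that the events carry the source `aws.glue-dataquality`; the `matches` helper is a deliberately tiny, hypothetical stand-in for EventBridge's matcher (exact-value lists and nested objects only), included so the pattern can be exercised locally.

```python
import json

# Pattern for failed data-quality evaluations. The source value is an
# assumption to confirm against the events your account actually receives.
EVENT_PATTERN = {
    "source": ["aws.glue-dataquality"],
    "detail-type": ["Data Quality Evaluation Results Available"],
    "detail": {"state": ["FAILED"]},
}


def matches(pattern, event):
    """Local approximation of pattern matching: every pattern key must be
    present, lists mean "any of these values", dicts recurse."""
    for key, expected in pattern.items():
        if key not in event:
            return False
        actual = event[key]
        if isinstance(expected, dict):
            if not isinstance(actual, dict) or not matches(expected, actual):
                return False
        elif actual not in expected:
            return False
    return True


# Hypothetical event, trimmed to the fields the pattern looks at.
failed_event = {
    "source": "aws.glue-dataquality",
    "detail-type": "Data Quality Evaluation Results Available",
    "detail": {"state": "FAILED"},
}
passed_event = {**failed_event, "detail": {"state": "PASSED"}}

print(json.dumps(EVENT_PATTERN, indent=2))
print(matches(EVENT_PATTERN, failed_event), matches(EVENT_PATTERN, passed_event))
```

The JSON printed here is what would be supplied as the rule's event pattern; the ticketing Lambda, the SNS topic, and the archive then hang off that one rule as targets.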
The curated zone is untouched. The analyst’s Friday query still shows Thursday’s good data. The on-call is paged at 14:04, investigates at 14:10, confirms the upstream header change at 14:30, and deploys a mapping fix at 15:00. Compare against the original timeline: three days of bad data, two days of root-cause, one angry call.
The cost of this isn’t the DQ service (metered per-compute); it’s the discipline of writing rules that are worth writing. The ten rules that would have caught the last ten bugs are the rule-writing budget. Start there.
What’s worth remembering
- DQDL covers the common rules. Completeness, uniqueness, value-domain, row-count, statistical distribution, referential integrity, column-count/type/existence. Plus CustomSql as the escape hatch for anything DQDL doesn’t express.
- Inline evaluation in Glue ETL is the “fail loudly” path. The DQ node is part of the job graph; the Conditional Router branches on outcome; the pass branch promotes, the fail branch quarantines. Standalone evaluation is for monitoring datasets whose producer you don’t own.
- Per-rule actions matter. with_action = WARN lets soft rules (volume bounds, distribution checks) record signal without blocking. Hard rules stay strict. Don’t tune one global threshold; tune per rule.
- Anomaly detection is statistics, not ML magic. DQ tracks metric history and flags deviations against the learned mean/stddev. Useful when you can’t state a hard threshold; still requires enough history to learn from.
- Results write to the DQ result store and, when configured, to S3, where Athena can query them. Use that to build a drift dashboard: DQ score per dataset per day, so degradation shows up before a rule outright fails.
- Failures fan out on EventBridge. The Data Quality Evaluation Results Available event is matchable on detail.state = FAILED. Route to ticketing, paging, archival. Don’t build this as a separate notification system; hook into the event.
- Analyzers are rules without thresholds. Track a metric before asserting on it. Graduate to Rules when you know what “normal” looks like.
- Deequ, PyDeequ, Great Expectations still exist. Reach for them when DQDL’s grammar doesn’t fit, not as the default. The managed integration with Glue and the Catalog is the reason to start with Glue Data Quality.
The rules aren’t the goal; the rules are the contract. The bad partition should not arrive in the warehouse without someone saying “this is what we expect”. Glue Data Quality is how that expectation becomes runnable. The hard work is writing the ten rules that would have caught the last ten bugs, not the tool that evaluates them.