The situation
The sales team collected leads at a conference. Fifteen thousand rows, a CSV full of inconsistencies: phone numbers in eight different formats, email addresses with trailing whitespace, company names that are obviously the same company spelt three different ways, a signup_date column that’s occasionally a Unix timestamp and occasionally a British date string, and a few dozen rows where the row-terminators landed inside a free-text field and corrupted the shape.
The finance analyst running the lead-scoring model wants this cleaned by Monday. She can write SQL, she can use Excel for her actual analysis, and she’s roughly allergic to Python. The data engineering team is mid-migration and hasn’t got hands for a one-shot ETL. The analyst’s last three “one-off” clean-ups turned out to be ongoing; the CSV arrives quarterly from the conference circuit and monthly from the partner-recruitment form. So “one-off” is a label, not a fact.
The job itself is not technically hard. It’s a pipeline of small transformations: parse dates, normalise phones, lowercase and trim emails, dedupe on email, standardise company names with a lookup, drop obviously-broken rows, write out as Parquet. In Spark that’s forty lines. In Pandas it’s twenty. Neither is the point, the constraint is “who does the work, how is it repeatable, and where does the output land”.
What actually matters
Before reaching for a tool, it’s worth asking what “data prep” actually is on this kind of work.
A prep pipeline is a sequence of transformations applied to a dataset. Each transformation is small, “lowercase this column”, “split that column on comma into two columns”, “fill these nulls with the median”, and order matters, because later steps depend on the shape earlier steps produced. A useful prep tool makes each transformation discoverable (the analyst can find “the thing that parses dates” without knowing its name), previewable (the analyst can see what the step does to the data before committing to it), and replayable (the same sequence of steps runs again when new data arrives in the same shape).
That suggests three capabilities the tool has to provide.
A catalogue of transformations the analyst can reach for. The analyst isn’t a programmer; she’s a domain person with a lot of Excel experience. The prep tool has to meet her where she is. Not .str.replace(r'\s+', ' ', regex=True) but “Clean whitespace”. Not pd.to_datetime(..., errors='coerce') but “Parse as date, leave blanks for bad values”. A catalogue of common operations keyed by what they do rather than how they do it.
A preview at every step. Each transformation she applies should update a view of the data immediately. This is how she tells she got the step right, “I applied the date-parse step; is the column now a real date or did it make a mess?”. Without preview she’s writing code blind and hoping. With preview she’s doing exploratory data work.
A recipe that can be rerun. The sequence of steps she builds should be saveable, named, and applicable to a fresh dataset in the same shape. That’s what makes the “one-off” actually reusable without her having to redo the work next quarter. Bonus if the recipe can be versioned so a mistake in step 14 doesn’t throw away steps 1-13.
The fourth thing, which is more about fit than feature: the output has to land somewhere the rest of the stack can use. S3, Parquet, cataloged in Glue, queryable from Athena. The analyst’s work has to exit “a spreadsheet she emailed around” and enter “a dataset the warehouse knows about”.
What we’ll filter on
Distilling that exploration into filters we can score each option against:
- Visual recipe authoring, can a non-programmer assemble the pipeline without writing code?
- Built-in transformation library, does the tool ship with enough transforms that “I want to X” usually has an answer?
- Preview at each step, does applying a transform show its effect on the data immediately?
- Scheduled / reusable runs, can the recipe be saved and applied to new files that arrive later?
- Native AWS output integration, do recipe outputs land in S3 and register in the Glue catalog without extra work?
The data-prep landscape
-
AWS Glue DataBrew. A visual data-prep tool in the Glue family. The analyst opens a dataset (S3 file, JDBC source, Redshift, DynamoDB, or Glue catalog table), interacts with it in a spreadsheet-like grid, and builds a recipe by clicking transformations from a palette of 250+ built-in steps. Steps can be parameterised, reordered, disabled, and previewed against a sample. A project bundles a dataset and a recipe together for authoring. A job applies a recipe to a dataset at scale and writes results to S3 (CSV, JSON, Parquet, Avro, XML), Glue catalog, Redshift, or RDS. Jobs can run on demand or on a schedule.
-
AWS Glue Studio. A visual authoring tool for Glue ETL jobs, visual graph of nodes representing source/transform/target in Spark. More powerful than DataBrew (any Spark operation is expressible), more developer-oriented (expects familiarity with DataFrames and transformations in a Spark sense). Right answer for production ETL; overkill for lead-list cleanup by an analyst.
-
Amazon SageMaker Data Wrangler. Visual data prep inside SageMaker Studio, aimed at ML feature engineering. Strong on feature-engineering transforms (bias detection, feature encoding, target-leak analysis); weaker on general data-cleaning ergonomics than DataBrew. Right answer if the output is a training dataset; overkill if the output is “a clean Parquet for the warehouse”.
-
QuickSight with prepared datasets. QuickSight’s data-prep features let analysts clean and join data while they build dashboards. Scopes to the BI tool, useful when the output is a chart, less useful when the output is a warehouse table.
-
Pandas / Polars in a Glue Python Shell job. Twenty lines of Pandas, scheduled as a Glue Python Shell job. Requires the engineer who’d write it; the analyst can’t. Right answer when the transformations are code-level unique; wrong shape when the work is well-known steps the analyst could find on a palette.
Side by side
| Option | Visual recipe | Transform library | Per-step preview | Reusable / scheduled | S3 + Catalog output |
|---|---|---|---|---|---|
| Glue DataBrew | ✓ | 250+ built-ins | ✓ | ✓ (jobs) | ✓ |
| Glue Studio (visual ETL) | ✓ (graph) | Spark ops | Partial | ✓ | ✓ |
| SageMaker Data Wrangler | ✓ | ML-oriented | ✓ | Via SageMaker Pipelines | ✓ |
| QuickSight prep | ✓ | Limited | ✓ | ✓ (dataset refresh) | ✗ (BI-scoped) |
| Pandas in Glue Python | ✗ (code) | Any | ✗ (print-based) | ✓ | ✓ |
Reading by use case:
- Analyst-led cleanup of a reasonably structured file. DataBrew. The palette covers the common cases; the analyst owns the recipe.
- Production Spark ETL with multi-source joins and business logic. Glue Studio or hand-coded Glue ETL. DataBrew is the wrong tool; use it for what it’s good at and move on.
- Preparing a feature store for a training job. Data Wrangler is designed for this, including leak detection and bias analytics.
- The transformations are genuinely unique and not-on-any-palette. Python or Scala, scheduled as a Glue job. The visual tool isn’t earning its keep when every step is a custom expression.
DataBrew in depth
Projects and sessions. A project bundles a dataset with a recipe-in-progress. Opening a project starts an interactive session, which provisions a small Spark environment to preview transforms against a sample of the dataset. Samples are configurable, first N rows, random N rows, last-modified N rows, stratified sample by column value. Authoring happens against the sample for speed; the final job runs against the full dataset. Sessions are billed per half-hour and auto-close after idle; close the session when you’re done to stop the meter.
Recipes. A recipe is an ordered list of steps. Each step is one of the built-in transforms, applied to one or more columns, with parameters. Steps can be previewed, reordered, disabled (useful for “skip step 7 and see what happens”), and commented on for the next person who opens the project. Recipes have versions, publishing a recipe creates a new version; old versions remain available. A job references a specific recipe version, so updating a recipe doesn’t silently change what a scheduled job does.
Transform families worth knowing.
- Structural: rename columns, split column (by delimiter, by positions, by regex), combine columns, duplicate column, change data type, reorder columns.
- Text cleaning: trim whitespace, change case, remove special characters, extract substring by pattern, replace by regex, remove HTML/URL encoding.
- Numeric: round, ceiling, floor, min/max cap, absolute value, log/exp, discretize (bin into ranges), scale (min-max, z-score).
- Date/time: parse (with format hint or auto-detect), format, extract component (year, month, weekday), add/subtract duration, date diff.
- Filter: by column value (range, list, pattern match, is-null, is-not-null), by duplicate, by outlier.
- Join and union: inner/outer/left/right join to another dataset, union along rows or columns.
- Aggregate and pivot: group by + aggregate, pivot, unpivot.
- ML-powered: find and replace fuzzy matches (for “Acme Corp” / “ACME Corporation”), detect and handle outliers using automated statistics, auto-discover data type patterns.
- Sensitive data: detect PII by column (email, phone, credit card, SSN, name), mask (redact, hash, encrypt, shuffle, null-out), tokenize.
Data profile. Running a profile job against a dataset produces column statistics: distinct values, null counts, value distributions, outliers, detected patterns, correlation between columns. The profile output is JSON written to S3 and also rendered in the console, a fast way to understand what’s actually in a file before writing a recipe against it.
Jobs. Two kinds: recipe jobs apply a recipe to a dataset and write output; profile jobs run the profile against a dataset without writing transformed output. Jobs configure:
- Dataset, the input.
- Recipe (for recipe jobs), the transforms to apply.
- Output, format (CSV/JSON/Parquet/Avro/XML), S3 location, compression, partitioning, encryption (KMS), catalog registration.
- Max nodes, the parallelism cap (1-149); DataBrew scales Spark executors up to this.
- Timeout, default 48 hours; lower for short jobs so a hung job fails fast.
- Max retries, automatic retry count on transient failures.
- Tags, for cost allocation.
- Role. IAM role the job runs as; needs S3 read on the input, S3 write on the output, optionally Glue catalog access for registration.
- Schedule, cron expression; rerun the same recipe on the same dataset (or a dataset defined by path pattern) periodically.
Dataset definitions. A dataset can be a single file or a dataset parameter pattern, an S3 path with placeholders like s3://leads/conference-{conference_name}/{year}/{month}/leads.csv. The pattern matches multiple files; the job processes any matching file when run. Useful for the “this keeps arriving in the same shape” case, the recipe runs whenever a new matching file appears, without reconfiguring the dataset.
The recipe, visualised
The sensitive-data story
One of DataBrew’s quieter features is the PII detection and masking pipeline. Data profiles can include a PII scan that flags columns matching patterns for email, phone, credit card, SSN, IP, URL, driver’s licence, passport, and names. The scan reports a confidence score per column; the analyst or steward confirms and can attach a masking step to the recipe: redact ([MASKED]), hash (deterministic, keyed via SSM for consistency across runs), shuffle (randomise rows within the column so aggregate statistics survive but per-row identity doesn’t), encrypt (KMS), or null-out. The masked columns can then be written to a less-sensitive tier; the original column stays in a locked-down tier; the two versions share the same primary key so they can be rejoined by users with the right permissions.
The upside is that the analyst doesn’t need to know cryptographic primitives or write deterministic-hash code. The downside is that DataBrew’s PII detection is pattern-based and confidence-scored, not perfect; for regulated data, the profile is the first pass and a manual schema-level review is the second.
A worked job: one CSV to catalogued Parquet
The analyst creates a DataBrew dataset pointing at s3://acme-leads/raw/conference-xyz/2027-11/leads.csv. She creates a project against it; DataBrew provisions a session in thirty seconds and shows her a grid of the first 500 rows.
She builds the eleven-step recipe by clicking through the palette. Each step previews immediately: step 4 (parse date) shows the column turning from mixed strings into timestamps and three rows going to null, she inspects those rows and confirms they’re the ones with corrupted terminators (step 8 will filter them). Step 6 (fuzzy company match) shows ACME Corporation, ACME Corp., Acme, Inc. collapsing to Acme Corporation at threshold 0.92; she bumps the threshold to 0.95 when she sees a false-positive match between two genuinely different companies.
She saves the recipe as leads-clean at version 1. She defines a dataset parameter pattern s3://acme-leads/raw/conference-{name}/{yyyy}/leads.csv. She creates a recipe job: Parquet + Snappy output to s3://acme-curated/leads/, partitioned by conference_name, registered in the Glue catalog as curated.leads_cleaned, max nodes 4, schedule disabled for now. She runs the job once against the current file: 90 seconds, output appears, Athena can query it.
She enables the weekly schedule. Next quarter’s conference CSV lands; the job picks it up via the pattern, applies recipe version 1, writes to a new partition. The analyst hasn’t touched it again; the engineer never touched it at all.
Three months later, she notices that some phone numbers from a new region are failing normalisation. She opens the project, disables step 9 temporarily, tests step 5 with an updated country lookup, re-enables step 9, saves recipe version 2. The next scheduled job run uses version 2; the previous runs’ outputs are unchanged. Versioning paid for itself.
What’s worth remembering
- DataBrew is a visual recipe tool over Spark. Analysts build recipes by clicking transforms; the job runs Spark under the hood. No Spark knowledge required to author; Spark benefits applied at scale.
- 250+ built-in transforms cover the usual. Structural, text, numeric, date, filter, join, aggregate, ML-powered, PII. The escape hatch for what’s not there is exporting the recipe to a Glue job, rare, but available.
- Sessions for authoring, jobs for running. Sessions are interactive, sampled, billed by the half-hour, close them when done. Jobs run against the full dataset and are the scheduled/production path.
- Dataset parameter patterns make recipes reusable.
conference-{name}/{yyyy}matches future files in the same shape; the job runs against whatever matches. Turns “one-off” into “this is now a pipeline”. - Recipe versions are published artefacts. Jobs target a specific version; updating the recipe doesn’t silently change what runs. Keeps production deterministic while authoring continues.
- PII detection and masking are first-class. Profiles flag PII columns; recipes can mask with redact/hash/shuffle/encrypt/null. The deterministic-hash pattern (salt from SSM) lets you rejoin masked data with the key later.
- Output lands in S3 and the catalog. Parquet/Avro/CSV/JSON/XML to S3; optional Glue catalog registration in the same job; Athena and Redshift Spectrum pick it up without extra work.
- Glue Studio, Data Wrangler, and Pandas-in-Glue are the adjacent options. Glue Studio for Spark ETL with logic that doesn’t fit recipes. Data Wrangler for ML feature prep with SageMaker Pipelines. Pandas in Glue Python Shell for truly custom transformations the palette can’t express.
DataBrew is what the analyst reaches for when the work is well-known steps and the output has to land in the warehouse. It isn’t a replacement for production ETL, it’s a tool that keeps the analyst’s work runnable and reusable without routing through the engineering team. The hard part isn’t the tool; it’s recognising which “one-off” is actually a recipe waiting to be saved.