Choosing Between Glue and SageMaker Processing for Feature Pipelines

March 29, 2028 · 17 min read

ML Engineer · MLA-C01 · part of The Exam Room

The situation

A payments-analytics team builds training features nightly. The pipeline:

  • Reads ~3TB of raw transaction events from an S3 “bronze” prefix partitioned by date.
  • Joins to a ~50GB customer dimension and a ~10GB merchant dimension, both kept in the Glue Data Catalog.
  • Computes 60-day and 90-day rolling aggregates per customer (transaction count, amount stats, merchant-category mix).
  • Writes a ~200GB parquet dataset to the “silver” layer, partitioned by date, registered in the Data Catalog.
  • Feeds into a nightly SageMaker training job that trains a fraud classifier.

Today the feature job runs on AWS Glue, triggered by EventBridge at 01:00 UTC; the training job kicks off via a Step Functions state machine once Glue reports success. The ML team has proposed moving the feature job into SageMaker Processing so that the whole flow (features, training, evaluation, model registration) can sit inside a SageMaker Pipeline.

The question isn’t which tool is better in the abstract. Both can do the work. The question is what each tool is for, which parts of this team’s workflow lean on which, and what gets gained or lost when the feature job changes hands.

What actually matters

The first thing worth naming is that Glue and SageMaker Processing are both managed job runners for data-processing code. Both take a script or a container, run it on managed compute, and write outputs to S3. Both scale horizontally. Both integrate with the Glue Data Catalog. Surface-level, they overlap heavily.

The differences are in the direction the tool’s gravity pulls.

Glue pulls towards the data platform. It’s built on Apache Spark (Glue’s Spark runtime has a couple of AWS-specific extensions around DynamicFrames and the Data Catalog, but it is Spark). The natural unit of work is a Spark job reading from the Catalog and writing back to the Catalog. It has a scheduler (Glue triggers, or EventBridge), a metadata layer (Data Catalog), a visual authoring tool (Glue Studio), and a development environment for iterating on notebooks (Glue interactive sessions), and it connects cleanly to the rest of AWS’s data stack: Athena, Redshift Spectrum, Lake Formation, QuickSight. The job’s outputs are discoverable: anyone with Athena access can query them. Glue treats data-processing as a first-class operational shape.

SageMaker Processing pulls towards the ML pipeline. It’s a primitive in the SageMaker training/inference ecosystem, designed to be a node in a SageMaker Pipelines DAG alongside training, evaluation, and registration steps. You give it a container image (sagemaker-scikit-learn:1.0-1, sagemaker-spark:3.3, or a BYOC image), a script, input and output S3 locations, and an instance type, and SageMaker runs the container on the instance. The job’s outputs are artefacts of a pipeline run; the feature data is usually traceable back to the pipeline execution that produced it via SageMaker Lineage.

Four questions to ask when picking:

Who else consumes the feature data. If analysts query it through Athena, if BI tools read it through Redshift Spectrum, or if it’s a shared data product used across teams, that is Glue’s natural habitat. The Data Catalog makes the data discoverable; the job’s outputs are first-class data-platform artefacts. SageMaker Processing can register data in the Catalog (by calling awswrangler or writing directly), but it’s not the service’s default stance.

Where the lineage lives. If the team cares about “what version of the code produced this dataset, and which training run used it, and which model was registered from that training run” as a single traceable chain, SageMaker Pipelines + Processing + Lineage is the story. Glue has its own lineage concept (ETL job history, Data Catalog versioning), but it doesn’t automatically thread into SageMaker’s Model Registry. A Processing step in a Pipelines DAG does.

What the compute model needs. Glue is Spark-on-managed-compute: DPUs (data processing units, a bundled CPU+memory+disk unit), G.1X/G.2X/G.4X/G.8X sizing, Spark-native parallelism, columnar predicate pushdown, adaptive query execution. Scales to very large joins and wide data via Spark’s shuffle-heavy approach. SageMaker Processing is more flexible in compute shape: Spark is an option (sagemaker-spark image), but so are a pandas-shaped single-instance job, a Dask cluster, a Ray cluster, or a bespoke container. For features that fit in memory on one box, the simpler single-node Processing job is often what the team actually wants.

Who maintains it. A feature pipeline owned by a data-platform team that also owns 40 other ETL jobs lives naturally in Glue: one tool, one operational model, one monitoring pane. A feature pipeline owned by an ML team that has no other ETL responsibility lives more naturally in SageMaker Processing, because it’s adjacent to everything else the team does. “Where does the on-call rotation for this job sit?” is a surprisingly useful filter.

What we’ll filter on

Five filters for the pick:

  1. Who consumes the output. Analysts and BI, or just the next pipeline step?
  2. Where the lineage lives. Data Catalog / ETL history, or SageMaker Pipelines / Lineage?
  3. Compute shape. Spark-native, or flexible (single-node pandas, Spark, Dask, Ray, BYOC)?
  4. Pipeline integration. Glue triggers / Step Functions, or SageMaker Pipelines?
  5. Team ownership. Data-platform team or ML team?
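
The five filters can be written down as a small checklist, which is handy when the same conversation comes up for the next pipeline. What follows is a minimal sketch, not a rule: the `PipelineContext` fields and the `lean` function are hypothetical names, and the unweighted vote is an assumption made for brevity (in practice the consumer filter tends to carry more weight than the others).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PipelineContext:
    """One boolean per filter; True leans towards SageMaker Processing
    for every field except analysts_query_output."""
    analysts_query_output: bool         # filter 1: analysts / BI read the features
    lineage_in_sagemaker: bool          # filter 2: lineage is read in SageMaker
    needs_non_spark_compute: bool       # filter 3: pandas, Dask, Ray, BYOC
    orchestrated_by_sm_pipelines: bool  # filter 4: flow lives in SageMaker Pipelines
    owned_by_ml_team: bool              # filter 5: ML team carries the on-call


def lean(ctx):
    """Return (pick, glue_votes, processing_votes) from an unweighted vote."""
    processing_votes = sum([
        not ctx.analysts_query_output,
        ctx.lineage_in_sagemaker,
        ctx.needs_non_spark_compute,
        ctx.orchestrated_by_sm_pipelines,
        ctx.owned_by_ml_team,
    ])
    glue_votes = 5 - processing_votes
    pick = "glue" if glue_votes > processing_votes else "processing"
    return pick, glue_votes, processing_votes


# The payments-analytics team as described in the scenario.
payments_team = PipelineContext(
    analysts_query_output=True,
    lineage_in_sagemaker=False,
    needs_non_spark_compute=False,
    orchestrated_by_sm_pipelines=False,
    owned_by_ml_team=False,
)
print(lean(payments_team))
```

The value is less in the vote count than in forcing each filter to be answered explicitly before anyone argues about tooling.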

The feature-pipeline landscape

1. AWS Glue (Spark). ETL-shaped managed Spark. Read from Data Catalog, write to Data Catalog. DPU-based pricing. Glue Studio for visual authoring; interactive sessions for notebook-style development. Features include job bookmarks (incremental reads that skip data already processed by earlier runs), DynamicFrames (semi-structured-friendly wrapper over DataFrames), and tight Lake Formation integration. The data-platform team’s default.

2. AWS Glue (Python shell). Single-node Python jobs for small-data ETL. Runs on 0.0625 or 1 DPU, no Spark. Useful for control-flow-heavy preprocessing that doesn’t need distributed compute. Often underused because teams reach for Spark by reflex.

3. SageMaker Processing (scikit-learn / pandas / custom). Single-instance script runner. sagemaker-scikit-learn image ships with pandas, scikit-learn, numpy; BYOC for everything else. Input channels mount at /opt/ml/processing/input/<channel>/, outputs at /opt/ml/processing/output/<channel>/. Well-suited to single-node pandas jobs on data that fits in memory, or to bespoke containers that wrap specific libraries.

4. SageMaker Processing (Spark). Distributed Spark on SageMaker. sagemaker-spark image; feels like Glue Spark on different infrastructure. Uses SageMaker-managed instances, priced per-second. Less tightly integrated with the Data Catalog than Glue, but Spark-the-runtime-is-Spark.

5. EMR / EMR Serverless. A broader, more flexible Spark / Hadoop / Presto platform. Better fit when the job is part of a heavier data engineering stack that uses multiple frameworks. Not the natural reach for a feature pipeline specifically, but worth naming.

6. SageMaker Data Wrangler. A visual feature-engineering tool in SageMaker Studio that produces a transformation recipe exportable to a SageMaker Pipelines Processing job. Lovely for exploration; less so for large-scale production pipelines with bespoke logic.

7. SageMaker Feature Store. Not a processing tool but a feature storage layer (online + offline stores, with time-travel). Sits downstream of either Glue or Processing. Worth naming because teams sometimes confuse “feature pipeline” with “feature store.”
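
To make option 3 concrete, here is a minimal sketch of what a single-instance Processing script tends to look like: it reads whatever SageMaker copied into the input channel directory and writes results into the output channel directory. The channel names (`transactions`, `features`), the CSV layout, and the column names are assumptions for illustration, and the standard-library `csv` module stands in for pandas so the sketch has no dependencies.

```python
import argparse
import csv
from collections import defaultdict
from pathlib import Path


def build_features(input_dir, output_dir):
    """Aggregate per-customer transaction count and mean amount from CSV files."""
    totals = defaultdict(lambda: [0, 0.0])  # customer_id -> [count, amount sum]
    for csv_path in sorted(input_dir.glob("*.csv")):
        with csv_path.open(newline="") as handle:
            for row in csv.DictReader(handle):
                entry = totals[row["customer_id"]]
                entry[0] += 1
                entry[1] += float(row["amount"])

    output_dir.mkdir(parents=True, exist_ok=True)
    out_path = output_dir / "features.csv"
    with out_path.open("w", newline="") as handle:
        writer = csv.writer(handle)
        writer.writerow(["customer_id", "txn_count", "amount_mean"])
        for customer_id, (count, amount_sum) in sorted(totals.items()):
            writer.writerow([customer_id, count, amount_sum / count])
    return out_path


def main(argv=None):
    parser = argparse.ArgumentParser()
    # Defaults follow the channel path convention; the channel names are hypothetical.
    parser.add_argument("--input-dir", default="/opt/ml/processing/input/transactions")
    parser.add_argument("--output-dir", default="/opt/ml/processing/output/features")
    args = parser.parse_args(argv)
    return build_features(Path(args.input_dir), Path(args.output_dir))
```

A real script would end with a call to `main()`. Keeping the paths as arguments with container defaults means the same file runs unchanged on a laptop against a local directory, which is most of the appeal of this shape for pandas-sized data.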

Side by side

Option | Consumers | Lineage | Compute shape | Pipeline integration | Team fit
--- | --- | --- | --- | --- | ---
Glue Spark | Analysts + ML | Data Catalog + Glue ETL history | Spark-native | Step Functions / EventBridge | Data platform
Glue Python shell | Analysts + ML | Data Catalog + Glue ETL history | Single-node Python | Step Functions / EventBridge | Data platform
SageMaker Processing (sklearn) | Just the ML pipeline | SageMaker Lineage | Single-node pandas | SageMaker Pipelines | ML
SageMaker Processing (Spark) | Just the ML pipeline | SageMaker Lineage | Spark-native | SageMaker Pipelines | ML
EMR / EMR Serverless | Multi-team | EMR job history | Spark/Hadoop/Presto | Step Functions | Data platform
Data Wrangler | ML (authored visually) | SageMaker Lineage (via Pipelines) | Single-node | SageMaker Pipelines | ML (exploratory)
Feature Store | Storage layer | Feature Store metadata | n/a | Either | Shared

Reading the table against the payments-analytics scenario:

  • The feature data is consumed by both ML (training job) and analysts (Athena queries on the silver prefix). Two consumers → Glue’s Catalog integration earns its keep.
  • The compute is Spark-heavy: 3TB of raw events, two joins, a 90-day window. Fits Glue Spark naturally.
  • The team has the rest of its ETL in Glue: the payments-analytics team runs ten other Glue jobs.
  • Moving the feature job to Processing would buy SageMaker Lineage integration (currently handled via Step Functions + Pipelines tagging) but cost the shared Data Catalog consumption.

Glue is the correct answer for this team. If the consumers were only the training job, or if the team had no other Glue jobs, Processing would be the correct answer instead. Neither is “better”; they fit different shapes.

Two pipelines, one job

[Figure: two architectures for the same feature job]

Glue-centred (current): features are a shared data product. S3 bronze (raw events, 3TB, partitioned by date) and Data Catalog tables (transactions, customers, merchants) → Glue Spark job (G.2X × 20 workers, nightly trigger) → S3 silver (features, 200GB, in Data Catalog) → Athena / BI (analysts query silver) and SageMaker training (reads silver via S3) → Model Registry (versioned model). Lineage: Glue ETL history + Data Catalog versions; Model Registry tags the source Glue job.

Processing-centred (proposed): features are an artefact of the ML pipeline. S3 bronze (raw events, 3TB) → Processing step (sagemaker-spark image, ml.m5.4xlarge × 10) → S3 silver (features, 200GB, Catalog optional) → Training step (reads silver, writes model.tar.gz) → Evaluation step (Processing, metrics.json) → Register step (Model Registry, conditional on metrics). Lineage: SageMaker Lineage (all steps linked).

Same feature job, different surrounding architecture. Glue makes features a catalog-level data product usable by analysts and ML alike; Processing makes them an internal artefact of the ML pipeline, with lineage flowing through to the Model Registry.
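
One detail in the diagram is easy to miss: the two cluster sizes are roughly the same amount of hardware. A quick sketch of the arithmetic, assuming the per-worker figures as I understand them from the AWS documentation (worth re-checking against the current docs) and a hypothetical 45-minute runtime:

```python
# Per-worker capacity as (DPUs, vCPUs, memory in GB). Assumed from AWS docs.
GLUE_WORKER_TYPES = {
    "G.1X": (1, 4, 16),
    "G.2X": (2, 8, 32),
    "G.4X": (4, 16, 64),
    "G.8X": (8, 32, 128),
}

# Per-instance capacity as (vCPUs, memory in GiB). Assumed from AWS docs.
SAGEMAKER_INSTANCE_TYPES = {
    "ml.m5.4xlarge": (16, 64),
}


def glue_capacity(worker_type, workers):
    """Total DPUs, vCPUs and memory for a Glue job of the given shape."""
    dpus, vcpus, memory_gb = GLUE_WORKER_TYPES[worker_type]
    return {
        "dpu": dpus * workers,
        "vcpu": vcpus * workers,
        "memory_gb": memory_gb * workers,
    }


def processing_capacity(instance_type, instances):
    """Total vCPUs and memory for a Processing cluster of the given shape."""
    vcpus, memory_gb = SAGEMAKER_INSTANCE_TYPES[instance_type]
    return {"vcpu": vcpus * instances, "memory_gb": memory_gb * instances}


def dpu_hours(total_dpus, runtime_minutes):
    """Billable DPU-hours; multiply by the regional DPU-hour rate for a cost."""
    return total_dpus * runtime_minutes / 60


glue = glue_capacity("G.2X", 20)
processing = processing_capacity("ml.m5.4xlarge", 10)
print(glue, processing, dpu_hours(glue["dpu"], runtime_minutes=45))
```

Under those assumptions both shapes come out at 160 vCPUs and 640 GB of memory, so the choice between them is about the surrounding architecture rather than raw capacity.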

The pick in depth

Glue is the fit for this team. The deciding factor is the Athena consumer: analysts run dozens of queries a day against the silver feature prefix for cohort analysis, operational reporting, and ad-hoc investigation. The Data Catalog registration is essential. Moving the job to Processing would either orphan the Catalog view (bad) or require the Processing step to also register output with the Catalog via awswrangler (possible but ungainly).
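
To put some shape on “possible but ungainly”: if the feature job did move to Processing, something would still have to tell the Data Catalog about each new date partition. The sketch below uses plain boto3 rather than awswrangler and copies the table’s storage descriptor onto the partition. The function names are hypothetical, and it assumes a Hive-style `date=YYYY-MM-DD/` layout under the table location.

```python
def partition_request(database, table, table_storage_descriptor, partition_date):
    """Build the kwargs for glue.batch_create_partition for one date partition.

    The partition reuses the table's storage descriptor (formats, SerDe, columns)
    and only overrides the S3 location.
    """
    descriptor = dict(table_storage_descriptor)  # shallow copy; input is not mutated
    base = table_storage_descriptor["Location"].rstrip("/")
    descriptor["Location"] = f"{base}/date={partition_date}/"
    return {
        "DatabaseName": database,
        "TableName": table,
        "PartitionInputList": [
            {"Values": [partition_date], "StorageDescriptor": descriptor}
        ],
    }


def register_partition(database, table, partition_date):
    """Look up the table and register the partition. Needs AWS credentials."""
    import boto3  # imported lazily so partition_request works without AWS access

    glue = boto3.client("glue")
    table_sd = glue.get_table(DatabaseName=database, Name=table)["Table"][
        "StorageDescriptor"
    ]
    request = partition_request(database, table, table_sd, partition_date)
    return glue.batch_create_partition(**request)
```

It is not much code, but it is one more thing for the Processing step to own, with its own IAM permissions and failure mode, and the Glue job does the equivalent as part of its normal shape.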

The Glue job shape is mature enough that the team already has it running:

# glue_feature_job.py
import sys
from datetime import date, timedelta

from awsglue.context import GlueContext
from awsglue.job import Job
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from pyspark.sql import functions as F
from pyspark.sql.window import Window

args = getResolvedOptions(sys.argv, ["JOB_NAME", "target_date"])
sc = SparkContext()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
job.init(args["JOB_NAME"], args)

# Overwrite only the partitions present in this run's output, not the whole dataset.
spark.conf.set("spark.sql.sources.partitionOverwriteMode", "dynamic")

# The 90-day window needs 90 days of history, so read the trailing partitions
# rather than the target date alone. Assumes ISO (YYYY-MM-DD) date partitions.
target_date = args["target_date"]
window_start = (date.fromisoformat(target_date) - timedelta(days=90)).isoformat()

transactions = glueContext.create_dynamic_frame.from_catalog(
    database="payments", table_name="transactions_bronze",
    push_down_predicate=f"date >= '{window_start}' AND date <= '{target_date}'",
).toDF()
customers = glueContext.create_dynamic_frame.from_catalog(
    database="payments", table_name="customers_dim",
).toDF()
merchants = glueContext.create_dynamic_frame.from_catalog(
    database="payments", table_name="merchants_dim",
).toDF()

# rangeBetween needs a numeric ordering column, so order by epoch seconds.
ts_seconds = F.col("timestamp").cast("long")
w60 = Window.partitionBy("customer_id").orderBy(ts_seconds).rangeBetween(-60 * 86400, 0)
w90 = Window.partitionBy("customer_id").orderBy(ts_seconds).rangeBetween(-90 * 86400, 0)

features = (
    transactions
    .join(customers, "customer_id", "left")
    .join(merchants, "merchant_id", "left")
    .withColumn("count_60d", F.count("*").over(w60))
    .withColumn("amount_mean_60d", F.avg("amount").over(w60))
    .withColumn("amount_std_60d", F.stddev("amount").over(w60))
    .withColumn("count_90d", F.count("*").over(w90))
    .withColumn("mcc_entropy_90d", ...)  # expression elided in the original
    # Earlier dates were read only as window history; keep the target date.
    .filter(F.col("date") == target_date)
)

features.write.mode("overwrite").partitionBy("date").parquet(
    "s3://bucket/silver/transaction_features/"
)

# Register the updated partition with the Data Catalog. The files are already
# written above, so this adds metadata only. Assumes the job runs with
# --enable-glue-datacatalog so Spark SQL resolves tables through the Catalog.
spark.sql(
    "ALTER TABLE payments.transaction_features_silver "
    f"ADD IF NOT EXISTS PARTITION (`date`='{target_date}')"
)

job.commit()
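
The `rangeBetween` windows in the job above are easy to misread, so here is the same idea in plain Python on a handful of rows. This is only an illustration of the semantics (a trailing range over epoch seconds, inclusive at both ends, per customer), not how Spark executes it; the function name and the sample events are made up.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict

DAY = 86_400  # seconds


def trailing_window_stats(events, days):
    """For each event, count and mean amount over the same customer's events
    with timestamp in [ts - days * DAY, ts].

    events: iterable of (customer_id, epoch_seconds, amount).
    Returns a list of (customer_id, epoch_seconds, count, mean_amount).
    """
    by_customer = defaultdict(list)
    for customer_id, ts, amount in events:
        by_customer[customer_id].append((ts, amount))

    results = []
    for customer_id, rows in by_customer.items():
        rows.sort()
        stamps = [ts for ts, _ in rows]
        for ts, _ in rows:
            lo = bisect_left(stamps, ts - days * DAY)  # first row inside the range
            hi = bisect_right(stamps, ts)              # includes rows sharing this ts
            amounts = [amount for _, amount in rows[lo:hi]]
            results.append((customer_id, ts, len(amounts), sum(amounts) / len(amounts)))
    return results


events = [
    ("c1", 0 * DAY, 10.0),
    ("c1", 30 * DAY, 20.0),
    ("c1", 70 * DAY, 30.0),
    ("c2", 0 * DAY, 5.0),
]
for row in trailing_window_stats(events, days=60):
    print(row)
```

The thing to notice is the third `c1` row: at day 70 the 60-day window no longer reaches back to day 0, while a 90-day window still does. That is also why a nightly run has to read the trailing history, not only the target date.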

Run on G.2X workers with auto scaling enabled (Glue’s --enable-auto-scaling job parameter); schedule via a Glue trigger or EventBridge. The training job downstream reads the resulting parquet via the SageMaker SDK’s TrainingInput and trains as normal.
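
For reference, this is roughly what the nightly invocation amounts to when expressed as a StartJobRun request. The job name is hypothetical, and the worker count of 20 is taken from the diagram; with auto scaling on, it acts as the upper bound rather than a fixed size.

```python
def nightly_run_request(job_name, target_date, max_workers=20):
    """Build the kwargs for glue.start_job_run for one nightly feature run."""
    return {
        "JobName": job_name,
        "WorkerType": "G.2X",
        "NumberOfWorkers": max_workers,  # ceiling when auto scaling is enabled
        "Arguments": {
            # Read in the job script via getResolvedOptions(..., ["target_date"]).
            "--target_date": target_date,
            "--enable-auto-scaling": "true",
        },
    }


def start_run(request):
    """Submit the run and return its JobRunId. Needs AWS credentials."""
    import boto3  # imported lazily so the request builder works without AWS access

    return boto3.client("glue").start_job_run(**request)["JobRunId"]


request = nightly_run_request("transaction-features-nightly", "2027-09-18")
print(request)
```

Whether this request is issued by a Glue trigger, an EventBridge rule, or a Step Functions task is an orchestration detail; the job itself only sees the arguments.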

When Processing would be the correct answer instead. A different team, same data shape, no analyst consumers: a fraud-detection ML team that owns the feature pipeline entirely, with no one else querying the output. In that case, putting the whole flow inside a SageMaker Pipeline (Processing → Training → Evaluation → Register) gives the team a single mental model (SageMaker), a single lineage story (SageMaker Lineage), a single orchestrator (SageMaker Pipelines), and a single IAM role surface. The Catalog isn’t where the weight lives; the Pipelines execution graph is.

The Processing shape:

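import sagemaker
from sagemaker.processing import ProcessingInput, ProcessingOutput
from sagemaker.spark.processing import PySparkProcessor

# Assumes an environment where the execution role can be resolved (e.g. SageMaker
# Studio); otherwise pass a role ARN explicitly.
role = sagemaker.get_execution_role()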

spark_processor = PySparkProcessor(
    framework_version="3.3",
    role=role,
    instance_type="ml.m5.4xlarge",
    instance_count=10,
    max_runtime_in_seconds=3600,
)

spark_processor.run(
    submit_app="feature_job.py",
    submit_jars=["s3://bucket/code/awsglue-patches.jar"],
    inputs=[
        ProcessingInput(source="s3://bucket/bronze/transactions/",
                        destination="/opt/ml/processing/input/transactions/"),
        ProcessingInput(source="s3://bucket/bronze/customers/",
                        destination="/opt/ml/processing/input/customers/"),
    ],
    outputs=[
        ProcessingOutput(source="/opt/ml/processing/output/features/",
                         destination="s3://bucket/silver/features/"),
    ],
    arguments=["--target-date", "2027-09-18"],
)

Wired into a Pipeline, the step’s outputs flow into the next step’s ProcessingInput (or TrainingInput) via property references, and the lineage writes itself.

A worked migration decision

How this team worked through the choice:

  • Question 1: Do analysts query the silver features via Athena? Yes, 20-30 queries/day. Glue has Catalog integration for free; Processing requires extra work to replicate.
  • Question 2: Does the team own other ETL jobs in Glue? Yes, 10 other Glue jobs, all authored by the same team. Switching this one job out would fragment the operational model.
  • Question 3: Is there a compelling reason to need tight SageMaker Lineage between features and model? Somewhat: auditors occasionally ask “which features trained which model.” Currently answered by Step Functions tagging; good enough.
  • Question 4: Does the feature pipeline need a SageMaker Pipelines capability that Glue can’t give? No.

Verdict: stay on Glue. Keep the feature job as-is; keep Step Functions orchestrating the nightly chain (Glue → SageMaker training → SageMaker registration). Revisit if the shape of the team or the consumers changes.

What’s worth remembering

  1. Glue and Processing can both run feature pipelines. The pick isn’t “which is better,” it’s “which fits the surrounding architecture.”
  2. Glue pulls towards the data platform. Catalog-aware, Athena-queryable outputs, shared-data-product framing, data-team-owned.
  3. Processing pulls towards the ML pipeline. SageMaker Lineage integration, fits as a step in a SageMaker Pipelines DAG, ML-team-owned.
  4. Spark is available in both. Glue Spark with DynamicFrames and Catalog; Processing via the sagemaker-spark image. Runtime-is-Spark either way.
  5. Single-node jobs belong on Processing or Glue Python shell. Pandas-sized data doesn’t need Spark; the Processing single-instance pattern or Glue Python shell is cheaper and simpler.
  6. The consumer list is the biggest tell. Analysts + ML → Glue. Just the training job → Processing. Both via Feature Store → either.
  7. Lineage stories are different. Glue ETL history + Data Catalog versions vs SageMaker Lineage + Pipelines executions. Pick the one the team already reads.
  8. Migrating isn’t free. Even when Processing would theoretically fit better, the cost of re-authoring a mature Glue pipeline, switching operational tooling, and retraining the team often outweighs the benefit.

The same work can live in either tool; the correct tool is the one whose gravity matches the team’s. The payments team has gravity towards the data platform; the feature job stays on Glue. A different team would move it; both would be correct.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.