Choosing the Right Glue Job Shape for the Workload

January 12, 2028 · 14 min read

Data Engineer · DEA-C01 · part of The Exam Room

The situation

Fifty Glue jobs running on a Glue 4.0 cluster. The bill is roughly $18,000 a month, growing 15% quarter on quarter. Inspection reveals three rough groupings.

  • Twenty-two are large-dataset Spark transforms. Reading tens to hundreds of gigabytes of Parquet, doing joins, aggregations, writing curated outputs. Runtimes in tens of minutes to hours. Spark is the correct tool; these jobs are correctly shaped.
  • Fifteen are small-dataset scripts. Python code running at small scale: an API call to a SaaS tenant and a write to S3, a file-format conversion on a file under a gigabyte, a metadata update in DynamoDB. Running on 2 Glue workers at a minimum because Spark has a minimum node count. Typical runtime: 3 to 8 minutes. Typical data volume: kilobytes to tens of megabytes.
  • Thirteen are stream processors. Reading from Kinesis or Kafka, doing light enrichment, writing to S3 or DynamoDB. Running as “micro-batch” in a Glue Spark job with a loop that reads a chunk and processes it. Running continuously.

The twenty-two big-transform jobs are doing the correct thing the correct way. The fifteen small-script jobs are paying a cluster’s minimum overhead for work that doesn’t need a cluster. The thirteen stream jobs are paying for continuously-running batch clusters for work whose natural billing shape is “continuous”.

The Glue job family has shapes for each of these workloads, with different cost profiles and different execution models. The opportunity is to match each workload to its correct shape, not to leave them all on the one that got built first.

What actually matters

Before reaching for a job type, it’s worth naming what each shape is optimised for.

A distributed-cluster batch job runs work across a cluster of workers, scaling to terabytes by adding nodes and paying per worker-hour. There’s a minimum cluster size, which means a minimum floor on cost per invocation; overhead per invocation includes cluster provisioning, session startup, and JVM warmup. For a job that does real distributed work on a real dataset, these overheads amortise. For a 3-minute script on 50 MB of data, they’re most of the job.

A single-process script job runs a single Python process on a small worker, no cluster involved. Fast to start, cheap per second, full PyPI ecosystem. No distributed-DataFrame API; if you outgrow a single process, you’re in the wrong shape. Right for scripts that were going to be Python anyway (API calls, file manipulations, small lookups, key-value updates, orchestration glue).

A continuous streaming job runs micro-batches against a stream source forever. Billed for as long as it runs, with no start/stop per record. Scales by worker count. Right for jobs that read continuously; wrong for jobs that poll a stream occasionally.

A distributed Python job (a newer shape) runs distributed Python with a task-graph model, for workloads that need Python parallelism but not the DataFrame surface of the cluster batch shape. It’s a narrower fit than the other three: worth knowing about, but not where most shape mismatches live.

What we’ll filter on

Distilling into filters we can score each job type against:

  1. Minimum overhead per invocation (how much cost and time is paid just to start the job?)
  2. Scalability ceiling (how big can the input get before this job type stops fitting?)
  3. Python library ecosystem (can it use Pandas, Polars, SQLAlchemy, requests, etc.?)
  4. Continuous-execution model (suited for long-running streams or batch-on-demand?)
  5. Cost shape (how does the bill behave for this kind of workload?)

The Glue-job landscape

1. Glue Spark (G-type workers). The default Glue ETL job. Spark 3.x under the hood (Glue 4.0 = Spark 3.3; Glue 5.0 = Spark 3.5). Workers: G.1X (4 vCPU / 16 GB / 64 GB disk, 1 DPU), G.2X (8/32/128, 2 DPU), G.4X (16/64/256, 4 DPU), G.8X (32/128/512, 8 DPU). Minimum 2 workers; scale to hundreds. Auto-scaling within a job is available (Glue 3.0+). DataFrames via PySpark, Scala, or Glue’s DynamicFrame API (schema-evolution-aware). Billed per second per DPU, 1-minute minimum.
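As a rough illustration of the settings above, here is a minimal sketch of the keyword arguments one might pass to boto3's `glue.create_job` for a Spark job with auto-scaling. The function name, the default of 10 workers, and the idea of building the dict separately are assumptions made for the example; `--enable-auto-scaling` is the job parameter that turns auto-scaling on, at which point the worker count acts as the upper bound.

```python
def spark_job_definition(name, role_arn, script_s3_uri,
                         worker_type="G.1X", max_workers=10):
    """Build keyword arguments for glue.create_job (Spark ETL shape).

    Hypothetical helper: the defaults here are illustrative, not recommendations.
    """
    if max_workers < 2:
        raise ValueError("Glue Spark jobs need at least 2 workers")
    return {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "glueetl",            # Spark ETL job type
            "ScriptLocation": script_s3_uri,
            "PythonVersion": "3",
        },
        "GlueVersion": "4.0",             # Spark 3.3
        "WorkerType": worker_type,
        # With auto-scaling enabled this is the maximum, not a fixed count.
        "NumberOfWorkers": max_workers,
        "DefaultArguments": {"--enable-auto-scaling": "true"},
    }


definition = spark_job_definition(
    "curated-orders",                      # hypothetical job name
    "arn:aws:iam::123456789012:role/glue-etl",
    "s3://acme-scripts/curated_orders.py",
)
# To create the job: boto3.client("glue").create_job(**definition)
```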

2. Glue Python shell. A single-process Python environment. Two sizes: 0.0625 DPU (small, ~1 vCPU / 1 GB RAM) or 1 DPU (medium, ~4 vCPU / 16 GB RAM). Billed per second, 1-minute minimum. Libraries: you can pre-install additional Python packages via the --additional-python-modules argument or bundle a wheel/zip. No Spark.

3. Glue Spark Streaming. Structured Streaming, reads from Kinesis Data Streams, Kafka, or MSK. Micro-batch (configurable batch interval) or continuous trigger. Runs until stopped. Billed for wall-clock DPU-hours.

4. Glue Ray. Ray-based Python distributed execution. For ML-adjacent or embarrassingly-parallel Python workloads that don’t fit Spark’s DataFrame model well. Narrower ecosystem, newer.

5. Lambda (adjacent). For very small, very short jobs, Lambda is often cheaper than Python shell. 15-minute max runtime, 10 GB max memory. No in-flight retries like Glue has; simpler to operate for truly stateless work.
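For the Lambda-shaped end of the spectrum, a minimal sketch of an S3-triggered handler might look like this. The function names and the returned summary are hypothetical; the event layout (`Records[].s3.bucket.name` and a URL-encoded `Records[].s3.object.key`) is the standard S3 notification shape.

```python
from urllib.parse import unquote_plus


def s3_objects_from_event(event):
    """Extract (bucket, key) pairs from an S3 notification event."""
    return [
        (record["s3"]["bucket"]["name"],
         unquote_plus(record["s3"]["object"]["key"]))  # keys arrive URL-encoded
        for record in event.get("Records", [])
    ]


def handler(event, context):
    """Lambda entry point: the small, stateless work would go where noted."""
    objects = s3_objects_from_event(event)
    for bucket, key in objects:
        pass  # e.g. convert the file, call an API, update a DynamoDB item
    return {"processed": len(objects)}
```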

Side by side

| Job type | Min overhead | Scale ceiling | Python libs | Continuous run | Cost shape |
|---|---|---|---|---|---|
| Glue Spark | 30-60s start + 2-worker minimum | TB-scale | PySpark + pip libs | Batch | Per-DPU-second, high floor |
| Glue Python shell | Few seconds start, 1-process | Single-node (GB-scale) | Full PyPI | Batch | Per-DPU-second, low floor |
| Glue Spark Streaming | Continuous | TB-scale / unbounded | PySpark | Continuous | Per-DPU-hour, continuous |
| Glue Ray | Few seconds start | Node-scale Python parallelism | Full PyPI + Ray | Batch | Per-DPU-second |
| Lambda (adjacent) | Ms start | Single-request (10 GB RAM, 15 min) | Full PyPI | Batch/event | Per-request + per-ms |

Reading by workload shape:

  • Large-dataset transform, joins, aggregations: Glue Spark. Right shape for the work.
  • Small-data Python script running on a schedule or trigger: Glue Python shell (or Lambda if it fits the model). Don’t pay for Spark if you’re not using Spark.
  • Continuous stream processing from Kinesis/Kafka: Glue Spark Streaming. The billing model matches continuous execution.
  • Python parallelism over many small units, not DataFrame-shaped: Glue Ray, if the workload fits.
  • Very small, very short, event-triggered: Lambda. Under 15 minutes, under 10 GB, no complex orchestration: Lambda is usually cheaper and simpler than Glue Python shell.

Python shell in depth

The job definition. In the Glue job console or via CloudFormation, set Command.Name: pythonshell. The script is a standard Python file uploaded to S3. --additional-python-modules takes a comma-separated list of pip requirements (pandas==2.1.0,requests>=2.31), which Glue pre-installs before running the script. For more complex environments, bundle a wheel file or a .zip and reference it in --extra-py-files.
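The same definition can be expressed through the SDK. The following is a sketch under assumptions: the helper name is made up, and the Python version string `3.9` is assumed to be the runtime you want; the `Command.Name`, `MaxCapacity`, and `--additional-python-modules` fields are the ones described above.

```python
def python_shell_job_definition(name, role_arn, script_s3_uri,
                                modules=(), max_capacity=0.0625):
    """Build keyword arguments for glue.create_job (Python shell shape)."""
    if max_capacity not in (0.0625, 1.0):
        raise ValueError("Python shell supports 0.0625 or 1 DPU only")
    definition = {
        "Name": name,
        "Role": role_arn,
        "Command": {
            "Name": "pythonshell",
            "ScriptLocation": script_s3_uri,
            "PythonVersion": "3.9",        # assumed runtime
        },
        "MaxCapacity": max_capacity,       # no WorkerType for Python shell
        "DefaultArguments": {},
    }
    if modules:
        # Comma-separated pip requirements, installed before the script runs.
        definition["DefaultArguments"]["--additional-python-modules"] = ",".join(modules)
    return definition


shell_job = python_shell_job_definition(
    "saas-export",                          # hypothetical job name
    "arn:aws:iam::123456789012:role/glue-shell",
    "s3://acme-scripts/saas_export.py",
    modules=["pandas==2.1.0", "requests>=2.31"],
)
# To create the job: boto3.client("glue").create_job(**shell_job)
```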

What you lose versus Spark. No DynamicFrame, no Glue catalog-driven schema helpers, no distributed DataFrame. You read S3 via boto3 or a library like s3fs or pandas.read_parquet. You don’t get Glue’s pushdown optimisations against catalogued tables; Athena via SDK is the usual substitute for “query this Glue table and get results”.
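To make the Athena substitute concrete, here is a minimal sketch of running a query through the SDK from a Python shell script. It assumes an Athena client created elsewhere (for example `boto3.client("athena")`), reads only the first page of results, and uses a hypothetical function name; for `SELECT` queries the first returned row is the header.

```python
import time


def run_athena_query(athena, sql, database, output_s3_uri, poll_seconds=1.0):
    """Run a query via an Athena client; return first-page rows as lists of strings."""
    started = athena.start_query_execution(
        QueryString=sql,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_s3_uri},
    )
    query_id = started["QueryExecutionId"]

    # Poll until the query reaches a terminal state.
    while True:
        status = athena.get_query_execution(QueryExecutionId=query_id)
        state = status["QueryExecution"]["Status"]["State"]
        if state in ("SUCCEEDED", "FAILED", "CANCELLED"):
            break
        time.sleep(poll_seconds)

    if state != "SUCCEEDED":
        raise RuntimeError(f"Athena query {query_id} ended in state {state}")

    page = athena.get_query_results(QueryExecutionId=query_id)
    return [
        [col.get("VarCharValue") for col in row["Data"]]
        for row in page["ResultSet"]["Rows"]
    ]
```

Passing the client in keeps the function easy to exercise with a stub; larger result sets would need the `get_query_results` paginator rather than a single call.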

What you gain. Fast startup (seconds, not tens of seconds). Lower minimum cost (a billed minute at 0.0625 DPU is a small fraction of a cent; the 2-DPU minimum of a two-worker G.1X Spark job is 32 times that for the same minute). Full PyPI ecosystem. Familiar single-process debugging. No JVM to tune.
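The billing floor can be sketched as a one-line calculation. This assumes the per-second billing with a 1-minute minimum described earlier; the function name is hypothetical, and the result is in DPU-seconds so that no particular price has to be assumed.

```python
def billed_dpu_seconds(dpus, runtime_seconds, minimum_seconds=60):
    """DPU-seconds billed for one run, applying the 1-minute minimum."""
    return dpus * max(runtime_seconds, minimum_seconds)


# A 20-second script is billed as a full minute in either shape.
shell_floor = billed_dpu_seconds(0.0625, 20)   # small Python shell
spark_floor = billed_dpu_seconds(2 * 1, 20)    # 2 x G.1X workers, 1 DPU each

print(shell_floor, spark_floor)
```

Multiplying either figure by the prevailing per-DPU-hour rate divided by 3600 gives the cost of the run.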

Sizing. 0.0625 DPU for small scripts; 1 DPU for work that actually uses memory (larger Pandas DataFrames, compute-heavy lookups). If neither fits, you’re probably in Spark territory anyway. Python shell doesn’t scale beyond 1 DPU, by design.

Spark streaming in depth

Source and sink. A Glue streaming job reads from Kinesis Data Streams, Amazon MSK (including clusters that use IAM authentication), or self-managed Kafka. It writes to the usual Spark sinks: S3, JDBC (with upsert patterns), Kinesis, and DynamoDB via a foreach-batch handler.

Triggers. processingTime (micro-batch at a fixed interval, e.g. '10 seconds'), once (process available data and stop, useful for testing), continuous (experimental, low-latency), availableNow (process everything available right now and stop). processingTime is the default; the interval is the trade-off between latency (smaller = faster end-to-end) and throughput (larger = more efficient per-batch overhead).
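One way to keep the trigger choice explicit is a small mapping onto the keyword arguments that PySpark's `DataStreamWriter.trigger` accepts. This is a sketch: the mode labels and the helper name are made up for the example, while the keyword names (`processingTime`, `once`, `continuous`, `availableNow`) are the ones listed above.

```python
def trigger_kwargs(mode, interval="10 seconds"):
    """Map a descriptive mode label to DataStreamWriter.trigger keyword arguments."""
    if mode == "micro-batch":
        return {"processingTime": interval}   # fixed-interval micro-batches
    if mode == "once":
        return {"once": True}                 # one batch, then stop
    if mode == "available-now":
        return {"availableNow": True}         # drain what is available, then stop
    if mode == "continuous":
        return {"continuous": interval}       # experimental low-latency mode
    raise ValueError(f"unknown trigger mode: {mode}")


# Intended use inside a streaming job (not executed here):
#   query = (df.writeStream
#              .trigger(**trigger_kwargs("micro-batch", "30 seconds"))
#              .option("checkpointLocation", checkpoint_path)
#              .format("parquet")
#              .option("path", output_path)
#              .start())
```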

Checkpointing. Streaming jobs maintain checkpoint state in S3 to recover from failures without replaying from source or double-counting. The checkpoint location is configured per job; it’s important to use a stable path that survives job redeploys. Deleting checkpoints discards the saved offsets, which can force a replay of everything the source still retains.

Autoscaling. Glue 3.0+ supports auto-scaling within streaming jobs; workers scale up and down based on backlog. Configure min/max worker counts; the job monitors Kinesis/Kafka lag and adjusts.

Job bookmarks don’t apply to streaming. Bookmarks are a Glue ETL batch concept for “resume from where the last run left off”. Streaming uses checkpoints instead.

Right-sizing the fifty jobs

Picking the Glue job shape

  1. Continuous stream source? Yes: Glue Spark Streaming (Kinesis / Kafka / MSK source, checkpoint in S3, micro-batch trigger). No: next question.
  2. Dataset > ~1 GB or needs distributed compute? Yes: Glue Spark (G.1X+; DynamicFrame + PySpark; 2+ workers, scale with dataset). No: next question.
  3. Under 15 min and under 10 GB RAM? Yes: Lambda (often cheapest for small/short work; per-request billing, no Glue floor). No: Python shell (0.0625 or 1 DPU, any PyPI lib), or Glue Ray if the work is parallel but not DataFrame-shaped.
Three questions to choose among four shapes. Stream source → streaming. Big data → Spark. Small and short → Lambda. Small but not Lambda-shaped → Python shell. Ray as a side branch for parallel-but-not-DataFrame work.
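The decision path can also be written down as a small function, which is handy when classifying a spreadsheet of jobs. This is a sketch: the function and parameter names are hypothetical, and the thresholds (about 1 GB, 15 minutes, 10 GB of memory) are the rough ones from the diagram, not hard limits.

```python
def choose_shape(continuous_stream, data_gb, needs_distributed,
                 runtime_minutes, memory_gb, parallel_python=False):
    """Return the job shape suggested by the questions in the diagram."""
    if continuous_stream:
        return "Glue Spark Streaming"
    if data_gb > 1 or needs_distributed:
        return "Glue Spark"
    if runtime_minutes < 15 and memory_gb < 10:
        return "Lambda"
    # Small but not Lambda-shaped; Ray is the side branch for parallel Python.
    return "Glue Ray" if parallel_python else "Glue Python shell"


print(choose_shape(False, data_gb=0.05, needs_distributed=False,
                   runtime_minutes=40, memory_gb=2))
```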

The migration plan

Inventory the fifty jobs. Produce a one-line summary of each: what it reads, what it writes, how big the data is, how long it runs, how often. A spreadsheet is fine. This is the diagnostic; without it, “we should use Python shell more” is a vibe, not a plan.
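Part of that inventory can be seeded from the Glue API rather than typed by hand. The sketch below assumes job definitions in the shape returned under `Jobs` by boto3's `get_jobs` paginator; the helper names and the output columns are made up for the example. It only captures the configured shape, so data size, runtime, and frequency still have to come from run history or from the people who own the jobs.

```python
from collections import Counter


def summarise_jobs(jobs):
    """Turn Glue job definitions into flat inventory rows."""
    return [
        {
            "name": job["Name"],
            # glueetl, pythonshell, gluestreaming, or glueray
            "shape": job.get("Command", {}).get("Name", "unknown"),
            "worker_type": job.get("WorkerType"),
            "workers": job.get("NumberOfWorkers"),
            "max_capacity": job.get("MaxCapacity"),
        }
        for job in jobs
    ]


def count_by_shape(rows):
    """Count inventory rows per configured job shape."""
    return Counter(row["shape"] for row in rows)


# Fetching the definitions (not executed here):
#   paginator = boto3.client("glue").get_paginator("get_jobs")
#   jobs = [job for page in paginator.paginate() for job in page["Jobs"]]
```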

Categorise. For each job, answer the three questions from the diagram: stream source? Big data? Under 15 minutes and 10 GB? The answers land the job in one of the four shapes. Expect the split to look roughly like: 40% Spark batch (correctly shaped), 30% Python shell candidates, 20% Lambda candidates, 10% streaming.

Migrate the obvious wins first. The 15 small-script jobs become Python shell. Estimated savings: each Spark job’s 2-worker minimum and 30-second startup becomes a 0.0625-DPU script. 0.0625 DPU is about 3% of a 2-DPU minimum, so expect each run to cost a few percent of what it did, and less if the runtime also drops. Across 15 jobs running nightly, that’s real money compounded over a year.

Reconsider the streams. The 13 stream processors are running as Spark batch jobs with manual loops, inefficient compared to Glue Spark Streaming’s native integration. Migration isn’t zero-effort (refactor the logic to Structured Streaming), but the payoff is checkpointing, autoscaling, and not paying for a spun-up cluster between polls.

Leave the correctly-shaped 22 alone. They’re doing Spark work that needs Spark. Optimise them on their own merits (right-sizing worker count, enabling auto-scaling, tuning shuffle partitions) but don’t move them to a smaller shape.

A worked migration: file-format conversion

Current: a Glue Spark job (G.1X, 2 workers) that reads a CSV from S3, transforms it slightly, and writes Parquet. 50 MB input; 4-minute runtime; runs hourly.

Cost estimate: 2 workers × 1 DPU × 4 minutes × 24 runs/day × 30 days = ~96 DPU-hours/month at whatever the prevailing G.1X rate is. Plus per-job-startup overhead the cluster pays.
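The estimate above is simple enough to keep as a reusable calculation. This is a sketch with a hypothetical function name; it works in DPU-hours, so the prevailing rate can be applied afterwards.

```python
def monthly_dpu_hours(workers, dpu_per_worker, runtime_minutes,
                      runs_per_day, days=30):
    """DPU-hours consumed per month by a job that runs on a fixed schedule."""
    dpu_minutes = workers * dpu_per_worker * runtime_minutes * runs_per_day * days
    return dpu_minutes / 60


# The hourly CSV-to-Parquet job: 2 x G.1X workers, 4 minutes, 24 runs a day.
spark_monthly = monthly_dpu_hours(workers=2, dpu_per_worker=1,
                                  runtime_minutes=4, runs_per_day=24)
print(spark_monthly)  # 96.0
```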

Migration to Python shell:

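from datetime import datetime, timezone
from io import BytesIO

import boto3
import pandas as pd

# Assumption: the job processes today's UTC partition; pass the date in as a job argument otherwise.
date = datetime.now(timezone.utc).strftime("%Y-%m-%d")

s3 = boto3.client("s3")
obj = s3.get_object(Bucket="acme-raw", Key=f"daily/{date}/input.csv")
# Read amount as text so the thousands separators can be stripped below.
df = pd.read_csv(obj["Body"], dtype={"amount": str})
df["normalised_amount"] = df["amount"].str.replace(",", "").astype(float)
df = df[df["status"] != "cancelled"]

# Serialise to Parquet in memory, then upload in a single call.
out = BytesIO()
df.to_parquet(out, compression="snappy", index=False)
s3.put_object(Bucket="acme-curated", Key=f"daily/{date}/output.parquet", Body=out.getvalue())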

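Job configuration: Command.Name: pythonshell, MaxCapacity: 0.0625, Timeout: 10. --additional-python-modules: pandas==2.1.0,pyarrow==14.0.0. Runtime: ~90 seconds. Cost: 0.0625 DPU × 1.5 minutes × 720 runs is about 1.1 DPU-hours a month against 96 for the Spark version, so a little over 1% of the Spark cost at the same per-DPU-hour rate.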

Downsides: no Glue catalog awareness (the job writes to a known S3 path, and a separate Glue crawler or direct catalog update handles registration); no DynamicFrame schema-evolution handling. For this job, neither matters.

What’s worth remembering

  1. Glue offers four job shapes for different workload shapes. Spark for big-data DataFrame work; Python shell for small scripts; Spark streaming for continuous stream processing; Ray for parallel Python that isn’t DataFrame-shaped.
  2. Spark has a floor of 2 workers and tens of seconds of startup. Small jobs pay this overhead as most of their cost. Moving small jobs to Python shell removes the floor and drops startup to seconds.
  3. Python shell is a single-process Python environment. 0.0625 or 1 DPU. Full PyPI. No Spark. No catalog helpers. The correct shape for scripts that were going to be Python anyway.
  4. Streaming jobs run continuously by design. Checkpointing in S3, autoscaling on backlog, micro-batch triggers. Billed per wall-clock DPU-hour, not per invocation. Match “job runs continuously” to streaming and “job runs on schedule” to batch.
  5. Lambda is often cheaper than Python shell for the smallest jobs. Under 15 minutes, under 10 GB, event-triggered or on an EventBridge Scheduler cron. No per-job Glue minimum.
  6. Glue 4.0 → 5.0 moves Spark 3.3 → 3.5. Newer Spark features, Iceberg + Hudi + Delta Lake support baked in, updated Python versions. Upgrade when the release cadence fits your regression testing.
  7. Auto-scaling within a Spark job exists since Glue 3.0. Set min/max workers; the job scales as stages demand. Useful for jobs with very uneven stage profiles (small setup + large shuffle + small write).
  8. Don’t let the first-shape choice become the every-shape default. Review the job inventory periodically; move jobs to the shape that fits rather than accreting all in one type.

Glue Spark is a powerful tool; it’s also the most expensive shape per run when the job doesn’t need distributed compute. Python shell is Glue’s answer for the majority of small jobs that don’t need Spark. Streaming is its answer for continuous stream processing. Using all three for what each was designed for is the difference between a Glue bill that grows with workload and one that grows with waste.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.