Picking a Scheduler on AWS

September 13, 2027 · 17 min read

The situation

Acme’s platform team is auditing the roughly 200 “scheduled things” that run across their AWS estate. The list falls into three rough buckets.

Recurring, same-every-time jobs. A nightly report Lambda in each of three Regions. A weekly database vacuum. A five-minute health-check ping. Each of these needs “run this function at this cadence” and nothing else.
Sequenced, multi-step workflows. A month-end billing batch: fetch usage, aggregate per customer, generate invoices, email them, wait for payment, mark paid, handle failures. Fifteen steps, branches, retries, durable state.
One-shot schedules per entity. A reminder email for each new customer, 24 hours after signup. A quote expiry at 72 hours. A trial-end notification at day 14. One schedule per entity, fired once, usually with a payload.

Today, all three buckets share the same answer: an EC2 instance running cron, triggering SSH scripts and Lambda invocations via the AWS CLI. The instance is a single point of failure, has no retry semantics, and keeps logs that are “whatever cron wrote to /var/log.”

The audit’s question: which AWS-native scheduler fits which bucket, and what happens when we try to use one for all three?

What actually matters

Before reaching for a specific primitive, it’s worth asking what shape each job actually has, because three problems are hiding in “run something later” and they want different mechanisms.

How many schedules are there? A handful of recurring jobs is a very different scaling problem from “one schedule per customer” at millions of customers. A primitive sized for tens of rules at the account level is the wrong shape for per-entity schedules; the unit of scheduling has to scale with the unit of entity.

Does the work itself have multiple steps? If the work is a single invocation, the scheduler can call the target directly. If the work is a multi-step flow with branching, retries, and wait states, the scheduler’s job is to start the flow; the flow primitive owns everything after that. Putting a fifteen-step workflow inside a single function is how you end up hitting an execution-time limit at the worst possible moment.

Does the schedule itself depend on per-entity data? A reminder 24 hours after this specific customer’s signup is a per-entity schedule. Either the scheduler primitive supports a schedule-per-entity at scale, or the delay lives inside a durable workflow triggered by the signup event. Both shapes work; which fits depends on what else the workflow has to do once it wakes up.

What about timezones? “02:00 local” in three Regions is either one schedule per Region, each with its own timezone, or three schedules in UTC with offsets to keep in sync against daylight saving. The first shape is cheap when the primitive supports timezones natively; the second is a maintenance bill that comes due twice a year.
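
To make the twice-a-year bill concrete, here is a small sketch using Python's standard zoneinfo module. The three zone names and the two dates are illustrative picks, not Acme's actual Regions; the point is only that the UTC hour for "02:00 local" moves in some zones and not in others.

```python
from datetime import datetime, timezone
from zoneinfo import ZoneInfo


def utc_hour_for_local(zone: str, year: int, month: int, day: int, local_hour: int = 2) -> int:
    """Return the UTC hour at which local_hour:00 occurs in `zone` on the given date."""
    local = datetime(year, month, day, local_hour, 0, tzinfo=ZoneInfo(zone))
    return local.astimezone(timezone.utc).hour


# Illustrative zones; one winter date and one summer date.
for zone in ("Europe/Dublin", "America/New_York", "Asia/Singapore"):
    winter = utc_hour_for_local(zone, 2027, 1, 15)
    summer = utc_hour_for_local(zone, 2027, 7, 15)
    print(f"{zone}: 02:00 local is {winter:02d}:00 UTC in January, {summer:02d}:00 UTC in July")
```

A UTC-only cron would need editing whenever the winter and summer hours differ; a timezone-aware schedule absorbs the shift.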

What’s the failure story? When the target rejects the invocation (throttled, broken, IAM denied), the scheduling mechanism needs a retry policy and a dead-letter destination. Otherwise a missed invocation is a silent gap in the schedule, discovered when somebody notices their reminder didn’t fire.
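
The retry-then-dead-letter shape can be sketched as a toy, in-process model. This is not how any AWS service is implemented; the deliver function, the plain-list dead-letter store, and the flaky target are all hypothetical names made up for illustration.

```python
from typing import Callable


def deliver(invoke: Callable[[dict], None], event: dict, max_retries: int, dead_letters: list) -> bool:
    """Try the target once plus up to max_retries retries; park the event if all attempts fail."""
    last_error = None
    for _ in range(1 + max_retries):
        try:
            invoke(event)
            return True
        except Exception as exc:  # a real scheduler would separate retryable from permanent errors
            last_error = exc
    # Retries exhausted: record the event so the gap is visible rather than silent.
    dead_letters.append({"event": event, "error": repr(last_error), "attempts": 1 + max_retries})
    return False


def always_throttled(event: dict) -> None:
    """Hypothetical target that rejects every invocation."""
    raise RuntimeError("throttled")


dlq: list = []
ok = deliver(always_throttled, {"customerId": "c_0abc1234"}, max_retries=3, dead_letters=dlq)
print(ok, len(dlq))
```

Without the dead_letters list, the failed reminder simply vanishes, which is the silent gap described above.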

What we’ll filter on

Cron / rate / one-shot: does it support all three schedule types?
Timezone-native: local time vs UTC only?
Millions of schedules: does it scale to one-per-entity?
Multi-step orchestration: does it own sequencing, branching, and retries?
Durable state across retries: does a step’s state survive restart?
Retry + dead-letter: what happens on target failure?

The scheduling landscape

EventBridge Scheduler. One-shot (at(...)), rate (rate(5 minutes)), or cron (cron(0 2 * * ? *)) expressions with explicit timezone (Asia/Singapore). Templated targets for Lambda, SQS, SNS, Step Functions, ECS RunTask, and EventBridge PutEvents, plus universal targets that call the APIs of 270+ AWS services directly. Retry policy (MaximumRetryAttempts, MaximumEventAgeInSeconds), dead-letter queue, flexible time windows. The schedule quota is counted per account per Region rather than per group, defaults to around a million, and is adjustable; schedule creation is rate-limited separately, at up to 1,000 CreateSchedule requests per second depending on Region.
EventBridge Rules. Cron/rate only (no one-shot). UTC only. Default quota of 300 rules per event bus per Region (adjustable). Same target set as Scheduler's templated targets. Good for a small number of cron-driven Lambdas if you already have EventBridge rules for event patterns; not what AWS recommends for new scheduled-only work.
Step Functions (Standard). Durable workflow engine. Schedule the start via EventBridge Scheduler; the workflow then runs for up to a year. Choice for branches, Parallel / Map for fan-out (Map supports up to 10,000 concurrent child executions in Distributed mode), Retry with back-off per state, Wait for fixed or dynamic delays, Task for Lambda / Activity / service integration. State is durable; on a worker restart, the execution resumes from the last state.
Step Functions (Express). High-volume, short-lived variant. At-least-once (not exactly-once), 5-minute max duration, logs to CloudWatch Logs rather than keeping per-state history. Right for “this per-request work is multi-step but completes in seconds and happens thousands of times a minute.”
cron on EC2. Still works. Zero durability, logs wherever you configure them, operational overhead proportional to the number of hosts running cron. Worth keeping for “one host, one script, one cron line” as long as you’re honest that it’s a tactical choice.
AWS Batch / SQS delay queues. Adjacent. SQS delay queues (up to 15 minutes) handle short “run this after a delay” fan-out. AWS Batch handles the “run this compute job when resources are available” case. Neither is a general-purpose scheduler.

Side by side

Option	Cron	One-shot	Timezone	Millions of schedules	Multi-step	Durable state	Retry / DLQ
EventBridge Scheduler	✓	✓	✓	✓	✗	✗	✓
EventBridge Rules	✓	✗	✗ (UTC)	✗ (300/bus)	✗	✗	✓
Step Functions Standard	via trigger	via trigger	n/a	n/a	✓	✓	✓
Step Functions Express	via trigger	via trigger	n/a	n/a	✓	Partial	✓
cron on EC2	✓	✗	local host	✗	✗	✗	✗

Reading by job:

Nightly report in three Regions. EventBridge Scheduler, three schedules, one per Region with ScheduleExpressionTimezone set to that Region’s local zone. Target: the regional Lambda.
Billing batch, month-end. EventBridge Scheduler fires once a month; target is a Step Functions state machine that owns the fifteen steps, branches, retries, and the long wait-for-payment states.
Per-customer 24-hour reminder. Application creates an EventBridge Scheduler schedule with at(<timestamp>) on signup, target is the reminder Lambda, ActionAfterCompletion: DELETE so the schedule cleans itself up.

How the three shapes fit together

Recurring work goes straight from Scheduler to Lambda; multi-step work goes from Scheduler to Step Functions, which owns the flow; per-entity work creates one Scheduler schedule per entity with auto-delete.
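
The routing above can be written down as a small decision function. This is a sketch of the article's reasoning, not an AWS API; the Job fields and the returned labels are hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


@dataclass
class Job:
    name: str
    steps: int        # number of distinct steps in the work itself
    per_entity: bool  # True when there is one schedule per customer, quote, or trial


def pick_mechanism(job: Job) -> str:
    """Map a job's shape to a scheduling mechanism."""
    if job.steps > 1:
        # Multi-step work: Scheduler only starts the flow, Step Functions owns the rest.
        return "Scheduler -> Step Functions"
    if job.per_entity:
        # Single-step, one schedule per entity: one-shot at(...) with auto-delete.
        return "Scheduler one-shot with ActionAfterCompletion=DELETE"
    # Single-step, recurring: cron or rate expression straight to the target.
    return "Scheduler recurring -> Lambda"


for job in (
    Job("nightly-report", steps=1, per_entity=False),
    Job("month-end-billing", steps=15, per_entity=False),
    Job("welcome-reminder", steps=1, per_entity=True),
):
    print(f"{job.name}: {pick_mechanism(job)}")
```

The step count is checked first on purpose: a per-entity job that is also multi-step is better served by a durable workflow than by a bare one-shot schedule.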

EventBridge Scheduler in depth

A schedule resource carries four knobs of note:

aws scheduler create-schedule \
  --name nightly-report-eu \
  --group-name reports \
  --schedule-expression 'cron(0 2 * * ? *)' \
  --schedule-expression-timezone 'Europe/Dublin' \
  --flexible-time-window 'Mode=FLEXIBLE,MaximumWindowInMinutes=5' \
  --target '{
    "Arn": "arn:aws:lambda:eu-west-1:111122223333:function:nightly-report",
    "RoleArn": "arn:aws:iam::111122223333:role/SchedulerInvokeReport",
    "RetryPolicy": { "MaximumRetryAttempts": 3, "MaximumEventAgeInSeconds": 300 },
    "DeadLetterConfig": { "Arn": "arn:aws:sqs:eu-west-1:111122223333:scheduler-dlq" }
  }' \
  --state ENABLED

The knobs. ScheduleExpressionTimezone is the one EventBridge Rules doesn’t have; set it and the cron is interpreted in local time including DST transitions. FlexibleTimeWindow lets AWS spread invocations across a window (5 minutes here). When a million customer schedules all fire at the same timestamp, the flexible window is what stops them landing in the same 100ms. RetryPolicy governs what happens when the target fails (the target’s own retries still apply; these are scheduler-level retries on top). DeadLetterConfig captures events that exhaust retries.
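
A rough simulation shows what a flexible window buys. It assumes invocations are spread uniformly at random across the window, which is a modelling assumption for illustration and not documented AWS behaviour; peak_per_second is a hypothetical helper.

```python
import random
from collections import Counter


def peak_per_second(n_schedules: int, window_minutes: int, seed: int = 0) -> int:
    """Worst one-second invocation count when n_schedules share the same nominal fire time."""
    window_seconds = window_minutes * 60
    if window_seconds == 0:
        # No window: every schedule lands in the same second.
        return n_schedules
    rng = random.Random(seed)
    # ASSUMPTION: each invocation gets a uniform random offset inside the window.
    buckets = Counter(rng.randrange(window_seconds) for _ in range(n_schedules))
    return max(buckets.values())


for window in (0, 5, 15):
    print(f"window={window:>2} min -> peak {peak_per_second(100_000, window)} invocations/second")
```

Whatever the real distribution, the downstream target sees a rate bounded by schedules divided by window length rather than a single spike, which is what its concurrency limits need.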

Schedule groups are a resource-level grouping that carries its own tags and can be named in IAM policies; one group per team or per function keeps IAM and cost allocation tidy. ActionAfterCompletion: DELETE on one-shot schedules is the critical knob for per-entity use; without it, fired-and-useless schedules pile up by the million and eventually run into the account’s schedule quota.

IAM for the caller needs scheduler:CreateSchedule plus iam:PassRole on the target role, since the caller is handing that role to Scheduler; the RoleArn inside the target needs lambda:InvokeFunction on the specific function ARN plus a trust policy allowing scheduler.amazonaws.com to assume it.
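
As a sketch, the three policy documents involved might look like the following, built as Python dicts so they can be serialized for the IAM API or a template. The ARNs reuse the example account above; the schedule ARN pattern for the reports group and the iam:PassedToService condition are assumptions about how one would scope this, not the article's own policies. The caller also needs iam:PassRole on the role it names in the target.

```python
import json

FUNCTION_ARN = "arn:aws:lambda:eu-west-1:111122223333:function:nightly-report"
ROLE_ARN = "arn:aws:iam::111122223333:role/SchedulerInvokeReport"

# Trust policy on the target role: lets EventBridge Scheduler assume it.
trust_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Principal": {"Service": "scheduler.amazonaws.com"},
        "Action": "sts:AssumeRole",
    }],
}

# Permissions policy on the target role: invoke exactly one function.
execution_policy = {
    "Version": "2012-10-17",
    "Statement": [{
        "Effect": "Allow",
        "Action": "lambda:InvokeFunction",
        "Resource": FUNCTION_ARN,
    }],
}

# Policy for whoever calls CreateSchedule (assumed scoping to the "reports" group).
caller_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": "scheduler:CreateSchedule",
            "Resource": "arn:aws:scheduler:eu-west-1:111122223333:schedule/reports/*",
        },
        {
            "Effect": "Allow",
            "Action": "iam:PassRole",
            "Resource": ROLE_ARN,
            "Condition": {"StringEquals": {"iam:PassedToService": "scheduler.amazonaws.com"}},
        },
    ],
}

print(json.dumps(trust_policy, indent=2))
```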

Step Functions in depth

A state machine is JSON (or YAML) with a States map. The billing batch’s interesting states:

{
  "Comment": "Month-end billing",
  "StartAt": "FetchUsage",
  "States": {
    "FetchUsage": {
      "Type": "Task",
      "Resource": "arn:aws:states:::lambda:invoke",
      "Parameters": {
        "FunctionName": "fetch-usage",
        "Payload.$": "$"
      },
      "ResultPath": "$.usage",
      "Retry": [
        { "ErrorEquals": ["Lambda.ServiceException", "Lambda.TooManyRequestsException"],
          "IntervalSeconds": 2, "MaxAttempts": 6, "BackoffRate": 2.0 }
      ],
      "Next": "PerCustomerAggregate"
    },
    "PerCustomerAggregate": {
      "Type": "Map",
      "ItemsPath": "$.usage.Payload.customerIds",
      "MaxConcurrency": 40,
      "Iterator": { "StartAt": "Aggregate", "States": { "Aggregate": {
        "Type": "Task", "Resource": "arn:aws:states:::lambda:invoke",
        "Parameters": { "FunctionName": "aggregate", "Payload.$": "$$.Map.Item.Value" },
        "End": true
      }}},
      "ResultPath": "$.aggregates",
      "Next": "WaitForPaymentWindow"
    },
    "WaitForPaymentWindow": {
      "Type": "Wait",
      "TimestampPath": "$.paymentDueDate",
      "Next": "CheckPaid"
    },
    "CheckPaid": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.paymentStatus", "StringEquals": "paid", "Next": "MarkPaid" }
      ],
      "Default": "RetryDunning"
    }
  }
}

Three states earn their keep. Map with MaxConcurrency: 40 fans out per-customer aggregation 40-at-a-time; Distributed Map (not shown) scales that to 10,000 concurrent children for genuinely huge batches. Wait with TimestampPath pauses the execution for days, cheaply and durably; the execution’s state sits in the Step Functions service, and no worker is burning compute. Choice branches on the payment result without polling or handwritten code.

Retries are per-state: Retry on FetchUsage handles transient Lambda throttling with exponential back-off; Catch (not shown here) handles permanent failures by routing to a named RollbackInvoice state. The state machine’s execution history (every transition, every input, every output) is retained for up to 90 days after the execution ends and is queryable from the console, which is the audit artefact nobody writes by hand.
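
The back-off on FetchUsage is easy to work out by hand: each retry waits IntervalSeconds multiplied by BackoffRate raised to the number of retries already made. The helper below is a sketch of that arithmetic only; it ignores the optional MaxDelaySeconds and JitterStrategy fields, and retry_delays is a hypothetical name.

```python
def retry_delays(interval_seconds: float, max_attempts: int, backoff_rate: float) -> list[float]:
    """Delay in seconds before each retry, for a Retry block without jitter or a delay cap."""
    return [interval_seconds * backoff_rate ** n for n in range(max_attempts)]


# Values from the FetchUsage state: IntervalSeconds 2, MaxAttempts 6, BackoffRate 2.0.
delays = retry_delays(2, 6, 2.0)
print(delays)       # [2.0, 4.0, 8.0, 16.0, 32.0, 64.0]
print(sum(delays))  # 126.0
```

So a throttled FetchUsage gets roughly two minutes of patience before the error propagates to a Catch or fails the execution.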

When Step Functions is wrong

Express workflows cap at 5 minutes. If the month-end batch’s Wait state is 7 days long, Express is not an option; Standard is the only choice.

Conversely, for a workflow that runs a hundred times a second and takes 800ms end-to-end, paying per state transition (Standard’s model) gets expensive fast. Express bills per request plus duration and memory, and is the right pick. The two are different products that share a control plane.
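
A back-of-envelope comparison makes the gap visible. Every rate below is an assumed, illustrative figure in USD and so are the billing increments and the five transitions per execution; real prices vary by Region and change over time, so treat this as a sketch of the calculation and take actual numbers from the current pricing page.

```python
import math

# ASSUMED illustrative rates (USD); not authoritative.
STANDARD_PER_TRANSITION = 0.000025  # per state transition
EXPRESS_PER_REQUEST = 0.000001      # per execution started
EXPRESS_PER_GB_SECOND = 0.00001667  # per GB-second of execution time


def standard_cost(executions: int, transitions_per_execution: int) -> float:
    """Standard workflows: pay for every state transition."""
    return executions * transitions_per_execution * STANDARD_PER_TRANSITION


def express_cost(executions: int, duration_ms: int, memory_mb: int = 64) -> float:
    """Express workflows: pay per request plus duration times memory."""
    billed_seconds = math.ceil(duration_ms / 100) * 100 / 1000  # assumed 100 ms increments
    billed_gb = math.ceil(memory_mb / 64) * 64 / 1024           # assumed 64 MB increments
    return executions * (EXPRESS_PER_REQUEST + billed_seconds * billed_gb * EXPRESS_PER_GB_SECOND)


executions_per_day = 100 * 86_400  # a hundred executions a second for one day
print(f"Standard: ${standard_cost(executions_per_day, 5):,.2f} per day")
print(f"Express:  ${express_cost(executions_per_day, 800):,.2f} per day")
```

Under these assumptions the per-transition model costs far more at this volume, which is the reasoning behind picking Express for short, high-frequency flows.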

And if the workflow is one step (“call this API, that’s it”), putting it in Step Functions is over-engineering. EventBridge Scheduler with a direct-to-API target is the simpler answer.

A worked per-entity pattern

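Signup event arrives at the signup service. The service writes the customer row and, in the same request, creates a schedule. CreateSchedule is a separate API call and cannot join the database transaction, so the service has to handle the case where the row commits but the schedule call fails: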

import json
from datetime import datetime, timedelta, timezone

import boto3

scheduler = boto3.client('scheduler')
customer_id = 'c_0abc1234'

# at() takes a timestamp without an offset; it is read in ScheduleExpressionTimezone.
fire_at = (datetime.now(timezone.utc) + timedelta(hours=24)).strftime('%Y-%m-%dT%H:%M:%S')

scheduler.create_schedule(
    Name=f'welcome-reminder-{customer_id}',
    GroupName='welcome-reminders',  # the group must already exist
    ScheduleExpression=f'at({fire_at})',
    ScheduleExpressionTimezone='UTC',
    FlexibleTimeWindow={'Mode': 'FLEXIBLE', 'MaximumWindowInMinutes': 5},
    Target={
        'Arn': 'arn:aws:lambda:eu-west-1:111122223333:function:send-welcome-reminder',
        'RoleArn': 'arn:aws:iam::111122223333:role/SchedulerInvokeReminder',
        # json.dumps takes care of quoting and escaping the payload
        'Input': json.dumps({'customerId': customer_id}),
    },
    ActionAfterCompletion='DELETE',  # remove the schedule once it has fired
    State='ENABLED',
)

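About 24 hours later (within the five-minute flexible window), Scheduler invokes the Lambda with {"customerId": "c_0abc1234"}, the Lambda sends the email, and Scheduler deletes the schedule. No table of pending reminders and no sweeper job. Delivery is at-least-once, so the Lambda should still be idempotent on customerId rather than assume a duplicate can never happen. “Customer cancelled their signup” needs no extra bookkeeping either: delete the schedule when they cancel, using the same API.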

The fallback path for deletions: when a customer cancels within 24 hours, the signup service calls DeleteSchedule(Name=f'welcome-reminder-{customer_id}', GroupName='welcome-reminders'). The group name matters, because without it the call looks in the default group. The call is safe to repeat: if the schedule has already fired and auto-deleted, it returns ResourceNotFoundException, which the signup service ignores.
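
A minimal sketch of that cancel path, assuming the schedule naming and group used above. The function takes the client as an argument so it can be exercised without AWS credentials; cancel_reminder is a hypothetical helper, not part of boto3.

```python
def cancel_reminder(scheduler, customer_id: str, group: str = "welcome-reminders") -> bool:
    """Delete the customer's pending reminder.

    Returns True if a schedule was deleted, False if it was already gone
    (for example because it fired and auto-deleted).
    """
    try:
        scheduler.delete_schedule(
            Name=f"welcome-reminder-{customer_id}",
            GroupName=group,  # must match the group used at creation time
        )
        return True
    except scheduler.exceptions.ResourceNotFoundException:
        # Already fired or already cancelled: nothing left to do.
        return False


# Usage with a real client (needs boto3 and AWS credentials):
#   import boto3
#   cancel_reminder(boto3.client("scheduler"), "c_0abc1234")
```

Catching only ResourceNotFoundException keeps throttling and permission errors loud, so a cancel that genuinely failed is not mistaken for one that was unnecessary.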

What’s worth remembering

EventBridge Scheduler is the default. For any new scheduled workload, start here. Cron, rate, one-shot expressions; native timezones; flexible windows; retry policies; DLQs; schedule groups; a default quota of around a million schedules per account per Region.
EventBridge Rules is legacy for scheduling. Fine if you already have a handful; not where new work goes. UTC-only, a default quota of 300 rules per bus, no one-shot.
Step Functions owns multi-step workflows. Scheduler starts them; Step Functions runs them. Durable state, retries per state, Wait states for hours or days, Choice for branches, Map for fan-out.
Standard vs Express is a cost-and-duration call. Standard: up to a year, paid per transition, full audit. Express: up to 5 minutes, paid per request and duration, CloudWatch Logs only. The two share the state-language spec.
Per-entity schedules with ActionAfterCompletion: DELETE. One schedule per customer, with at(<timestamp>) and auto-delete on fire, replaces the whole “table of pending tasks + sweeper job” pattern for most cases.
Flexible time windows tame the thundering herd. A million schedules all set to fire at 02:00 will thunder; set MaximumWindowInMinutes: 15 and AWS spreads the load.
Retry + DLQ are per-schedule and per-state. Scheduler has scheduler-level retries on target invocation failure; Step Functions has per-state retries on task failure. Both are independent of the target’s own internal retry logic.
cron on EC2 isn’t wrong, it’s just not durable. Keep it for single-host single-script single-cron-line cases. Everything else earns a real scheduler.

The audit’s outcome at Acme: the 200 scheduled things decompose into about 90 recurring cron-style jobs (EventBridge Scheduler, one per Region where appropriate), 12 multi-step workflows (Scheduler + Step Functions, one per workflow), and roughly 100 per-entity reminders and expiries (Scheduler one-shots with auto-delete). The EC2 cron host retires. Three scheduling shapes, three tools, one clean separation.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.