The situation
A retailer’s order-fulfilment service is wired up simply: an SQS standard queue orders-to-fulfil on defaults (30-second visibility timeout, four-day retention, no redrive policy), a Lambda function fulfil-order subscribed via an SQS Event Source Mapping with batch size 1, a 300-second function timeout, and a handler that calls a third-party fulfilment partner’s HTTP API to book a courier and print a label. The partner returns in 30-60 seconds under normal conditions and up to 5 minutes when either the partner is degraded or the call is being retried on their side before it comes back to the caller.
The symptom: two to four duplicate shipments per week across roughly 20,000 orders. The same order_id is dispatched twice: two labels, two parcels, one unhappy customer. CloudWatch shows ApproximateNumberOfMessagesNotVisible spiking during partner-slowdown windows. The Lambda never errors. Durations cluster in two bands: a sharp peak around 40 seconds and a long tail out to 280 seconds. No DLQ is configured; a handful of messages have been sitting in the queue for days.
What actually matters
Before reaching for a fix, it’s worth understanding what SQS is actually promising. SQS’s delivery contract is at-least-once: a message will be delivered at least once, and may be delivered more than once if the consumer doesn’t signal completion in time. That’s the mechanism keeping the queue available when a consumer crashes: if the consumer doesn’t come back to say “done”, the queue assumes the worst and hands the message to someone else. The same mechanism fires, correctly from SQS’s point of view, when the consumer hasn’t crashed but is taking longer than the queue expected. The queue has no way to tell the difference.
Ownership sits with the fulfilment team, and the fulfilment team is two people. Any fix that needs continuous attention (a heartbeat loop extending visibility mid-handler, a bespoke monitoring dashboard) is an ops tax the team will pay forever. A static configuration change that the next person reading the runbook can understand at a glance beats a cleverer design.
Blast radius of the current setup is real: every duplicate shipment is a refund, a customer-service call, and a reputational ding. Blast radius of the fix is the queue and the Lambda; nothing else changes. The risk lives in mis-sizing: too small a visibility timeout reproduces the bug; too large, and a genuinely failed invocation takes too long to retry.
Cost shape is a minor concern. A longer visibility timeout doesn’t cost more. A DLQ costs nothing unless it fills up. The savings are implicit: every duplicate shipment that doesn’t happen is a refund that doesn’t happen either.
Failure modes are where the interesting design choices live. The current design has three failure modes the team has seen (slow partner, duplicate ship, messages sitting for days) and several it hasn’t yet (poison pill with no DLQ, in-flight cap exhaustion, partner API key rotation). The fix needs to cover slow-partner deliberately and poison-pill as a side effect.
Coupling between the three numbers (visibility timeout, function timeout, actual processing time) is the heart of the defect. Fix one in isolation and it can still contradict the other two. AWS’s published rule of thumb exists because the three numbers have to be designed together.
What we’ll filter on
Four filters for the fix:
- No duplicate processing. An order shipped exactly once per message, even when the partner takes four minutes instead of forty seconds.
- Tolerates 30-300 second processing times. The handler has to survive a slow partner without handing the same message to a second consumer.
- Handles genuinely failing messages without clogging the queue. A poisonous message (malformed, or rejected by the partner) must not retry forever; it has to end up somewhere inspectable.
- Low operational overhead. Two-person team. No bespoke state machines, no hand-rolled locking tables, no custom heartbeats unless they pay for themselves.
The SQS visibility-timeout landscape
The defaults. Every SQS queue has a visibility timeout with a 30-second default, a 0-second minimum (useful for terminating the timeout early on a message you’ve decided to release), and a 12-hour maximum from the moment the message is first received. Extending doesn’t reset the ceiling: you can keep a message in-flight for a long time, but not forever.
What a receive does. When a consumer calls ReceiveMessage and gets a message, SQS marks it in-flight for the length of the visibility timeout. One of three things then happens: the consumer calls DeleteMessage (gone); the consumer calls ChangeMessageVisibility (extend, or set to 0 to release early); the timeout expires without either (SQS makes the message visible again and anyone can receive it). That third outcome is the scenario’s bug.
Lambda Event Source Mapping and visibility. The ESM polls the queue, receives the batch, and invokes the function. From SQS’s point of view the message is now in-flight for the queue’s visibility timeout, 30 seconds by default. On successful return, ESM calls DeleteMessage; on error, ESM does nothing and the timeout expires naturally. ESM does not extend the visibility window while the function is running. If the function takes 41 seconds with a 30-second visibility timeout, the message became visible again at t=30, and a second invocation may already have booked a second courier by the time the first invocation’s DeleteMessage is attempted with a receipt handle that is no longer the current one.
The recommended ratio. AWS guidance is that the queue’s visibility timeout should be at least six times the Lambda function timeout. The 6× factor absorbs the function timeout itself, Lambda’s own internal retry behaviour, and enough slack that a slow-but-not-failing invocation doesn’t race the clock. For a 5-minute Lambda, 6× is 1,800 seconds (30 minutes), comfortably inside the 12-hour ceiling.
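The arithmetic behind the ratio is small enough to write down. The sketch below is illustrative only: the function name and the `batch_window_s` parameter are assumptions introduced here (AWS's guidance also adds the ESM's batching window, which is zero in this scenario), and the two ceilings are the ones quoted in this article.

```python
SQS_MAX_VISIBILITY_S = 12 * 60 * 60  # SQS ceiling: 43,200 seconds
LAMBDA_MAX_TIMEOUT_S = 900           # Lambda ceiling: 15 minutes


def recommended_visibility_timeout(function_timeout_s: int,
                                   batch_window_s: int = 0,
                                   factor: int = 6) -> int:
    """Return factor x function timeout plus the batching window, in seconds."""
    if not 1 <= function_timeout_s <= LAMBDA_MAX_TIMEOUT_S:
        raise ValueError("function timeout must be between 1 and 900 seconds")
    visibility_s = factor * function_timeout_s + batch_window_s
    if visibility_s > SQS_MAX_VISIBILITY_S:
        raise ValueError("result exceeds the 12-hour SQS visibility ceiling")
    return visibility_s


print(recommended_visibility_timeout(300))  # 1800, the fulfil-order case
print(recommended_visibility_timeout(900))  # 5400, a maxed-out Lambda
```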
ChangeMessageVisibility for application-aware extension. When a handler knows it’s running long, it can extend the window from inside the invocation. The typical shape is a background task that, every 60 s of handler runtime, sets the timeout to 120 s; the new value counts from the moment of the call rather than being added to whatever remained. Worth it when the average case is fast and the worst case is rare; skip it when a generous static timeout is cheaper in brainpower than the heartbeat code.
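For a sense of what the heartbeat costs in code, here is a minimal sketch. It assumes `sqs_client` is a boto3 SQS client (or anything exposing the same `change_message_visibility` call); the context-manager name and its parameters are hypothetical, not part of any library.

```python
import threading
from contextlib import contextmanager


@contextmanager
def visibility_heartbeat(sqs_client, queue_url, receipt_handle,
                         window_s=120, every_s=60):
    """Keep re-arming the visibility window while the with-body runs."""
    stop = threading.Event()

    def beat():
        # Event.wait returns False on timeout (keep beating) and True once
        # stop is set (the handler finished), which ends the loop.
        while not stop.wait(every_s):
            # VisibilityTimeout is measured from the time of this call,
            # so each beat sets the window to window_s from now.
            sqs_client.change_message_visibility(
                QueueUrl=queue_url,
                ReceiptHandle=receipt_handle,
                VisibilityTimeout=window_s,
            )

    worker = threading.Thread(target=beat, daemon=True)
    worker.start()
    try:
        yield
    finally:
        stop.set()       # stop extending as soon as the handler is done
        worker.join()
```

Inside a Lambda handler the receipt handle would come from the record (`record["receiptHandle"]`), and the partner call would sit in the `with` body. Note that the first beat fires only after `every_s` seconds, so the queue's own visibility timeout has to be longer than that interval; with the 30-second default and a 60-second interval the message would already be visible before the first extension.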
Redrive policy and DLQ. Configured on the source queue with two fields: the DLQ’s ARN and maxReceiveCount. Each receive-without-delete increments the message’s receive count; past the ceiling, SQS moves the message to the DLQ instead of making it visible again. AWS guidance avoids values as low as 1; a value of 5 is a pragmatic default. DLQ retention must be longer than source retention, because messages keep their original enqueue timestamp; source 4 days, DLQ 14 days (the SQS maximum) is the shape that works.
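As a configuration sketch, the source-queue and DLQ settings described above might be assembled like this. The helper names, the queue URLs and the ARN are placeholders; `sqs_client` is assumed to be a boto3 SQS client, whose `set_queue_attributes` call takes string-valued attributes with `RedrivePolicy` encoded as a JSON string.

```python
import json

FOURTEEN_DAYS_S = 14 * 24 * 60 * 60  # SQS maximum retention


def source_queue_attributes(dlq_arn: str, visibility_s: int = 1800,
                            max_receive_count: int = 5) -> dict:
    """Attributes for the source queue: long visibility plus a redrive policy."""
    return {
        "VisibilityTimeout": str(visibility_s),
        "RedrivePolicy": json.dumps({
            "deadLetterTargetArn": dlq_arn,
            "maxReceiveCount": str(max_receive_count),
        }),
    }


def dlq_attributes() -> dict:
    """Attributes for the DLQ: retention longer than the source queue's 4 days."""
    return {"MessageRetentionPeriod": str(FOURTEEN_DAYS_S)}


def apply_config(sqs_client, source_url: str, dlq_url: str, dlq_arn: str) -> None:
    """Push both attribute sets; configure the DLQ before pointing at it."""
    sqs_client.set_queue_attributes(QueueUrl=dlq_url,
                                    Attributes=dlq_attributes())
    sqs_client.set_queue_attributes(QueueUrl=source_url,
                                    Attributes=source_queue_attributes(dlq_arn))
```

Keeping the attribute builders separate from the call that applies them makes the values easy to review in a pull request, which suits the two-person-team constraint.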
FIFO queues and MessageGroupId. Strict ordering within a group; same visibility-timeout arithmetic still applies to redelivery. FIFO’s group lock stops other messages in the group from being delivered concurrently; it does not stop the same message from being redelivered after visibility expiry. Useful when ordering is a requirement; not a fix for the duplicate-processing bug.
In-flight cap. A standard queue holds approximately 120,000 in-flight messages at once; a FIFO queue’s cap depends on active message groups. Matters most when visibility is long, deletes are slow, and arrival rates are high.
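A rough way to see how those three factors interact is a Little's-law style estimate: in-flight count is approximately arrival rate times the time each message stays in flight. The sketch below is a back-of-envelope model, not an SQS formula. It assumes a successful message is in flight only until its delete, a failed one for the whole visibility timeout, and it ignores the extra receives that retries add. All names and the example numbers are hypothetical.

```python
def estimated_in_flight(arrivals_per_s: float, mean_processing_s: float,
                        failure_rate: float, visibility_s: float) -> float:
    """Estimate steady-state in-flight messages for one queue."""
    # Successful messages leave flight when DeleteMessage runs.
    success_hold_s = (1 - failure_rate) * mean_processing_s
    # Failed messages stay in flight until the visibility timeout lapses.
    failure_hold_s = failure_rate * visibility_s
    return arrivals_per_s * (success_hold_s + failure_hold_s)


# Hypothetical busy queue: 50 msg/s, 60 s mean processing, 2% failures.
print(round(estimated_in_flight(50, 60, 0.02, 1800)))  # 4740
```

Even at this made-up rate the estimate sits far below the roughly 120,000 cap, and the failing 2% account for a disproportionate share of it, which is the part a long visibility timeout inflates.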
Side by side
| Configuration | No duplicates | Tolerates 30-300 s | Handles poison msgs | Low ops |
|---|---|---|---|---|
| Default 30 s visibility, no DLQ | ✗ | ✗ | ✗ | ✓ |
| ChangeMessageVisibility heartbeat, no DLQ | ✓ | ✓ | ✗ | ✗ |
| FIFO queue with MessageGroupId | ✗ | ✗ | n/a | ✗ |
| 6× visibility timeout alone (1,800 s) | ✓ | ✓ | ✗ | ✓ |
| 6× visibility + DLQ (maxReceiveCount = 5) | ✓ | ✓ | ✓ | ✓ |
Matching the timings
The 6× rule, in depth
Why six, not three. A pessimistic factor-of-six exists to cover four things at once: the function timeout itself (300 s); Lambda’s own infrastructure-level retries for transient invocation failures (which happen inside the visibility window without external pickup); batch-level processing when more than one message is in play; and slack for a slow-but-not-failing invocation not to race the clock. Three-times feels like it should work; in practice it leaves little headroom once those four stack up.
Why static, not heartbeat. A handler can call ChangeMessageVisibility mid-flight to extend the window: the classic “extend by 120 every 60 seconds” pattern. That’s the correct answer when the average case is fast and the worst case is rare: the window stays short, so a crashed invocation is retried within a couple of minutes, whereas under a static 30-minute window a failed message holds its in-flight slot for the full 30 minutes and the 120,000 in-flight cap gets closer than it should. For the fulfilment workload the worst case happens often enough that a static 30-minute window is cheaper in brainpower than maintaining a heartbeat loop in every handler forever.
Why a DLQ regardless. Visibility-timeout sizing handles the slow-partner case. It does nothing for a genuinely poisonous message: a malformed payload, a schema mismatch, a permanent partner rejection. Without a DLQ the 4-day retention quietly ticks those messages away while every receive burns partner-API quota and Lambda time. A redrive policy with maxReceiveCount = 5 pointing at a DLQ captures the poison pills without drama, and leaves them somewhere a human can look at them.
A worked example: one slow-partner message
Before the fix, the message lives through five events. At t=0 the ESM polls, receives, invocation A starts, the handler calls the partner. At t=30 the visibility timeout expires and SQS makes the message visible again; invocation A is still waiting. At t=35 the ESM polls again, receives the same message, invocation B starts, the handler calls the partner, which happily books a second courier because there’s no duplicate-suppression layer on their side. At t=140 invocation A returns and ESM calls DeleteMessage with invocation A’s receipt handle; that handle stopped being the current one when the message was received again at t=35, so the call does not remove the message. At t=180 invocation B returns and ESM’s delete with invocation B’s receipt handle removes it, assuming no later receive has superseded that handle. (The timeline follows only the first redelivery; with a 30-second window invocation B’s receipt also lapses at t=65, so further redeliveries are possible before a delete lands.) The net in this trace is one message, two shipments, zero errors in the Lambda logs. The duplicate is entirely invisible to the Lambda runtime: it’s SQS doing what it promises, on a contract the design hadn’t priced in.
After the fix, with visibility set to 1,800 seconds, the same message lives through two events. At t=0 the ESM polls and invocation A starts. At t=140 invocation A returns; ESM’s delete runs against a still-valid receipt handle; the message is gone. One message, one shipment.
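The two timelines can be replayed with a toy model. This is a sketch under simplifying assumptions, not a description of how SQS or the ESM actually schedules work: every invocation takes the same time, the message is re-received a fixed `poll_delay_s` after it becomes visible, and nothing deletes it until the first invocation returns. The function and parameter names are illustrative.

```python
def receive_times(processing_s: int, visibility_s: int,
                  poll_delay_s: int = 5) -> list[int]:
    """Times at which the message is handed out before invocation A returns."""
    times = [0]                                   # first receive at t=0
    next_receive = visibility_s + poll_delay_s    # visible again, then polled
    while next_receive < processing_s:
        times.append(next_receive)
        next_receive += visibility_s + poll_delay_s
    return times


print(receive_times(140, 30))    # [0, 35, 70, 105]
print(receive_times(41, 30))     # [0, 35]
print(receive_times(140, 1800))  # [0]
```

Under these assumptions a 140-second partner call against a 30-second window is handed out four times before the first delete is even attempted, because each redelivered copy's window lapses too; the walkthrough above follows only the first of those redeliveries. With an 1,800-second window the list collapses to a single receive.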
The failure path is preserved. If invocation A hangs and the function itself times out at t=300, ESM sees the error and does nothing; the message remains in-flight for another 1,500 seconds. At t=1,800, if no one has deleted it, it becomes visible again and a retry begins, this time against the real failure, not a slow-but-succeeding call. Past the maxReceiveCount = 5 threshold, the message is moved to the DLQ and a CloudWatch alarm on DLQ depth pages whoever’s on call.
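One consequence of the failure path is worth putting a number on: how long a message that fails every time takes to reach the DLQ. The estimate below assumes every attempt fails, the message is re-received as soon as it becomes visible (plus an optional polling delay), and the move to the DLQ happens once the receive count has reached maxReceiveCount and the last window lapses. The helper name is hypothetical.

```python
FOUR_DAYS_S = 4 * 24 * 60 * 60  # source-queue retention in this scenario


def worst_case_seconds_to_dlq(visibility_s: int, max_receive_count: int,
                              poll_delay_s: int = 0) -> int:
    """Seconds from first receive until the message is eligible for the DLQ."""
    return max_receive_count * (visibility_s + poll_delay_s)


seconds = worst_case_seconds_to_dlq(1800, 5)
print(seconds, seconds / 3600)   # 9000 2.5
print(seconds < FOUR_DAYS_S)     # True
```

Roughly two and a half hours is the price of the generous window: a poison pill is inspectable the same day and well inside the 4-day retention, but not within minutes. If that delay matters more than simplicity, a smaller maxReceiveCount or the heartbeat approach shortens it.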
What’s worth remembering
- SQS’s delivery contract is at-least-once. Duplicates happen when visibility timeout expires before DeleteMessage runs. The consumer has to either size the window correctly or be idempotent.
- Visibility timeout: 30-second default, 0-second minimum, 12-hour maximum. Extending via ChangeMessageVisibility doesn’t reset the 12-hour ceiling.
- AWS’s rule of thumb for Lambda-over-SQS is visibility timeout ≥ 6× function timeout. Absorbs the function timeout, Lambda’s internal retry behaviour, and enough slack for a borderline case.
- Lambda’s function timeout maximum is 900 seconds (15 minutes). Under 6×, the longest SQS visibility timeout that corresponds to a maxed-out Lambda is 5,400 seconds, well inside SQS’s 12-hour ceiling.
- Lambda ESM does not automatically extend visibility during execution. On success it calls DeleteMessage; on error it does nothing. If the function runs longer than the visibility timeout, the message is visible again and may already have been redelivered.
- ChangeMessageVisibility is the application-aware lever. Useful when the average case is fast and the worst case is rare; otherwise, size the static timeout generously and move on.
- ReportBatchItemFailures lets partial batches succeed. Matters little at batch size 1; matters a lot at batch size 10 or 100.
- maxReceiveCount on a redrive policy sends chronically-failing messages to a DLQ. A value of 5 absorbs transient failures without letting a poison pill run forever. DLQ retention must be longer than source retention.
- FIFO’s message-group locking is an ordering primitive, not a concurrency one. Same visibility-timeout arithmetic applies; DLQs break FIFO ordering as a side effect.
- In-flight caps are real, around 120,000 on standard queues. Long visibility timeouts plus slow deletes plus high arrival rates drive up in-flight count and eventually throttle receives.
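For the partial-batch point in the list above, a handler that reports per-message failures might look like the following sketch. It assumes the event source mapping has ReportBatchItemFailures enabled in its function response types; `fulfil` is a hypothetical stand-in for the partner call, not the real fulfilment code.

```python
import json


def fulfil(order: dict) -> None:
    """Hypothetical stand-in for the partner call; raises on failure."""
    if "order_id" not in order:
        raise ValueError("missing order_id")


def handler(event, context):
    """Process each record; report only the failed ones back to the ESM."""
    failures = []
    for record in event["Records"]:
        try:
            fulfil(json.loads(record["body"]))
        except Exception:
            # Only this message returns to the queue; the rest are deleted.
            failures.append({"itemIdentifier": record["messageId"]})
    return {"batchItemFailures": failures}
```

At batch size 1 this changes little, as the list notes; it starts to matter if the batch size is ever raised, because one bad message then no longer drags its batch-mates back onto the queue.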