How to Shape API Gateway Traffic with Throttles, Quotas, and Usage Plans

June 07, 2027 · 15 min read

The situation

A payments platform runs a single REST API in front of a relational database and a handful of internal services. The team is about to open it up to third-party partners and the SRE partner has asked the question every SRE partner eventually asks: how do we keep the API up when one caller misbehaves?

A public endpoint (POST /transactions): anonymous in the sense that anyone with the public URL can call it. The downstream Lambda fans out to a database with a connection-pool ceiling around 2,000 concurrent calls. Today a misconfigured mobile client can pin every pool slot in seconds.
A partner endpoint (POST /partners/:id/transactions): behind an API key that identifies the partner. Each partner has a contract: the Free tier gets 1,000 calls per day, the Pro tier gets 100,000 per day, the Enterprise tier gets 10,000 per minute sustained. The contracts are real numbers the finance team prints on invoices.
An internal health-check endpoint (GET /status): called 100 times per second by monitoring. Should never be throttled by anything intended for user traffic.

Everything is on a single API Gateway REST API today, with no throttling configured beyond the account-wide default. The question isn’t whether throttling helps (it does); it’s which knob does which job, and how to wire them so the three endpoints get three different protection profiles.

What actually matters

Before reaching for the settings page, it’s worth asking what we’re actually trading.

The core job of an API throttle is pacing. The downstream can absorb so many requests per second; incoming traffic arrives in bursts; the throttle is what flattens the burst to something the downstream can handle. If the burst exceeds the rate the throttle allows, the excess gets a 429 Too Many Requests response and the caller backs off.

The first thing to ask is: what rate is the throttle protecting? A throttle in front of an endpoint that calls a Lambda is different from a throttle in front of an endpoint that calls a database. A throttle tuned to the Lambda’s reserved concurrency is different from a throttle tuned to an external vendor’s SLA. The rate has to match the bottleneck behind the gateway.

The second thing to ask is: who is the throttle counting against? Some throttles are global: this endpoint gets 5,000 RPS across the world. Others are per-caller: each API key gets 100 RPS. Others are per-account: this AWS account gets 10,000 RPS across every API Gateway it runs. These are different things and they interact.

The third is burst capacity. Most throttles are “token buckets”: a steady refill rate and a bucket that holds some number of tokens. A burst consumes tokens from the bucket faster than they refill; when the bucket empties, requests are throttled. The bucket size is a separate knob from the refill rate and it determines how much short-lived burst the throttle tolerates before kicking in.
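The refill rate and the bucket size are easier to tell apart in code. The following is a minimal sketch of a generic token bucket, not API Gateway's internal implementation; the class and method names are illustrative.

```python
from dataclasses import dataclass, field


@dataclass
class TokenBucket:
    """Token bucket: `rate` tokens refill per second, up to `burst` tokens."""
    rate: float   # steady refill rate, tokens per second
    burst: float  # bucket size: the most tokens that can accumulate
    tokens: float = field(init=False)
    last: float = field(init=False, default=0.0)

    def __post_init__(self) -> None:
        self.tokens = self.burst  # start with a full bucket

    def allow(self, now: float) -> bool:
        """Return True if a request arriving at time `now` (seconds) may pass."""
        elapsed = max(0.0, now - self.last)
        # Refill for the time that has passed, never beyond the bucket size.
        self.tokens = min(self.burst, self.tokens + elapsed * self.rate)
        self.last = now
        if self.tokens >= 1.0:
            self.tokens -= 1.0
            return True
        return False  # bucket empty: this is where a 429 would be returned


bucket = TokenBucket(rate=100, burst=200)
first_wave = sum(bucket.allow(0.0) for _ in range(300))   # all arrive at t=0
second_wave = sum(bucket.allow(1.0) for _ in range(300))  # all arrive at t=1s
print(first_wave, second_wave)
```

The first wave is bounded by the bucket size and the second by what refilled in one second, which is the distinction between the two knobs.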

The fourth is billing and contractual clarity. A contract says “1,000 calls per minute sustained, 2,000 per second burst”; the platform has to produce exactly that behaviour from whatever settings API Gateway actually exposes. A knob that says “rate: 1,000, burst: 2,000” is easy to map to a contract; a knob that says “throttle at 0.6 × account limit” is not.

The fifth is quota windowing. Some limits are rates (per second, per minute); others are quotas (per day, per month). Rates are instantaneous; quotas accumulate. They’re different knobs and they’re usually configured separately.
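To make the rate/quota distinction concrete, here is a rough sketch of a fixed-window quota counter. It is an illustration of the idea only; the class name and the epoch-aligned windows are assumptions, not a description of how API Gateway tracks quotas.

```python
from dataclasses import dataclass


@dataclass
class WindowQuota:
    """Fixed-window quota: at most `limit` requests per `period_s` seconds."""
    limit: int
    period_s: int = 86_400  # one day
    window: int = -1        # index of the window currently being counted
    used: int = 0

    def allow(self, now_s: int) -> bool:
        current = now_s // self.period_s
        if current != self.window:
            # A new window has started, so the count starts over.
            self.window, self.used = current, 0
        if self.used >= self.limit:
            return False  # quota spent until the window rolls over
        self.used += 1
        return True


quota = WindowQuota(limit=3)
same_day = [quota.allow(0) for _ in range(4)]
next_day = quota.allow(86_400)
print(same_day, next_day)
```

Unlike a token bucket, nothing refills during the window: once the count is spent, every request is refused until the window rolls over.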

And finally, a softer one: where in the request lifecycle the throttle lives. A throttle at the gateway fails cheaply: the backend never sees the request. A throttle in the Lambda itself consumes Lambda invocation cost. A throttle at the database consumes connection-pool slots. Earlier is cheaper.

What we’ll filter on

Distilling that exploration into filters we can score each knob against:

Scope: account-wide, stage/API-wide, per-method, or per-API-key?
Unit: rate per second, quota per day, or concurrency?
Burst: is there a separate bucket size?
Applies to authorised callers only: does it require identification?
Suitable for billing contracts: does the shape map to invoice language?

The throttling landscape in API Gateway

Account-level throttling. A Region-wide soft limit set by AWS: 10,000 RPS steady, 5,000 burst, across all API Gateway REST APIs in the account. It applies whether the team configures anything or not; it’s the backstop that catches runaway traffic before it takes down the Region. Request a quota increase if 10,000 RPS isn’t enough; otherwise there is nothing to configure. Applies to everything; no per-caller granularity; not contractual.
Stage-level throttling. Per-stage rate and burst on a REST API or HTTP API. Configured in the stage settings: “rate: 5,000 RPS, burst: 10,000” (a burst above the account-level 5,000 only takes effect once the account limit has been raised). Below the account level but above any per-method or per-key throttle. Useful as the global ceiling for an API; a single knob the platform team can turn as traffic grows. Doesn’t distinguish between callers.
Method-level throttling. Per-method override on a stage: “POST /transactions gets 2,000 RPS, burst 4,000; GET /status gets 500 RPS, burst 1,000.” Applies regardless of caller identity. The tool for protecting a specific endpoint whose downstream has different capacity from the rest of the API: the database-fanout endpoint gets a tighter bucket than the static-read endpoint. Configured in the REST API’s stage settings as a per-method override.
Usage-plan throttling. Per-API-key rate and burst, scoped by usage plan. A usage plan associates a set of API keys, a set of API-stages, a throttle (rate + burst), and a quota (requests per day/week/month). Requests that present a valid API key against an API-stage in the plan are rate-limited as the plan says: the Free plan gets 10 RPS with 1,000/day, the Pro plan gets 100 RPS with 100,000/day, the Enterprise plan gets 1,000 RPS with no quota. The contract-shaped knob.
WAF rate-based rules. Not strictly an API Gateway feature, but the common neighbour. A WAF web ACL attached to the API Gateway stage can rate-limit by source IP (“block any IP over 1,000 requests in 5 minutes”), which catches abusers at a level before API Gateway has to evaluate them. Useful for anonymous/public endpoints where there’s no API key to key off; useless for partner traffic because partners arrive through NAT and share IPs.
Lambda reserved concurrency. Not an API Gateway feature, but lives in the same conversation. A Lambda with reserved concurrency set to 500 can never run more than 500 concurrent invocations, which bounds what the API can do to the database regardless of whether the gateway throttle worked. The belt-and-braces safety net behind the gateway.
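A rate limit (requests per second) and a concurrency limit (requests in flight) can be related through Little's law: average concurrency is roughly arrival rate times average duration. The sketch below uses the 2,000 RPS and 500 concurrency figures from the list above; the 250 ms and 500 ms durations are illustrative assumptions, not measurements from this platform.

```python
def required_concurrency(rps: float, avg_duration_s: float) -> float:
    """Little's law: average in-flight requests = arrival rate x average duration."""
    return rps * avg_duration_s


def max_rps_within(concurrency: int, avg_duration_s: float) -> float:
    """Highest steady rate that keeps in-flight work at or under `concurrency`."""
    return concurrency / avg_duration_s


# Assumed durations, for illustration only.
print(max_rps_within(500, 0.25))          # rate that 500 slots sustain at 250 ms
print(required_concurrency(2_000, 0.5))   # slots needed at 2,000 RPS and 500 ms
```

If the function slows down, the same request rate needs more concurrency, so the two limits start to disagree; that is one reason to keep both gates rather than rely on either alone.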

Side by side

Knob	Scope	Unit	Burst	Needs API key	Contractual
Account-level	Region / account	RPS	Yes	✗	✗
Stage-level	Stage	RPS	Yes	✗	✗
Method-level	Method + stage	RPS	Yes	✗	Partially
Usage plan	API key + API-stage	RPS + daily/weekly/monthly quota	Yes	✓	✓
WAF rate-based	IP + ACL	Requests / 5 min	—	✗	✗
Lambda reserved concurrency	Function	Concurrent executions	—	✗	✗

Reading the table by endpoint rather than by knob:

Public POST /transactions: anonymous, and the downstream is sensitive. WAF IP rate limit at the edge, plus method-level throttle inside the gateway, plus Lambda reserved concurrency behind. Three independent gates, each with a different failure mode.
Partner POST /partners/:id/transactions: identified by API key, with contract-driven limits. Usage plans per tier, each with its own rate and quota. This is exactly the shape usage plans exist to solve.
Internal GET /status: should not be affected by either of the above. A method-level throttle is the cheapest protection: high enough to absorb monitoring, low enough that it won’t accidentally hide a problem if a misbehaving caller starts probing the endpoint.

Matching traffic shape to knob

Anonymous traffic passes three independent gates (WAF, method, Lambda concurrency); partner traffic is sorted into usage plans by API key; internal traffic gets one simple method throttle.

The picks in depth

Public POST /transactions → layered protection. The WAF rate-based rule at the edge is the cheapest line of defence: any source IP making more than 1,000 requests in 5 minutes gets a 403 from WAF before API Gateway sees it, which keeps abuse off the gateway’s bill. The method-level throttle inside the gateway is the per-endpoint ceiling: 2,000 RPS steady, 4,000 burst, matching the database’s pool ceiling. Callers over that rate get a 429. Finally, the Lambda’s reserved concurrency is the safety net: 500 concurrent executions, so even if every gate above is wrong, the database cannot be hit with more than 500 concurrent connections. The three gates are redundant and that’s the point: one misconfiguration doesn’t bring the database down.
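The order in which the three gates are evaluated can be sketched as a small decision function. This is a conceptual model only: the function name, its inputs, and the returned labels are hypothetical, and the real gates live in three separate AWS services rather than in one piece of code.

```python
from typing import Optional

WAF_LIMIT_PER_5_MIN = 1_000   # per source IP, from the WAF rule described above
RESERVED_CONCURRENCY = 500    # Lambda reserved concurrency described above


def first_rejecting_gate(ip_requests_last_5_min: int,
                         method_tokens_left: int,
                         lambda_in_flight: int) -> Optional[str]:
    """Return the name of the first gate that would refuse the request, or None."""
    if ip_requests_last_5_min > WAF_LIMIT_PER_5_MIN:
        return "waf"                    # refused at the edge
    if method_tokens_left < 1:
        return "method-throttle"        # refused inside the gateway
    if lambda_in_flight >= RESERVED_CONCURRENCY:
        return "reserved-concurrency"   # refused at the function
    return None                         # reaches the database


print(first_rejecting_gate(1_200, 50, 10))
print(first_rejecting_gate(10, 0, 10))
print(first_rejecting_gate(10, 50, 500))
print(first_rejecting_gate(10, 50, 10))
```

Each gate looks at a different signal (per-IP count, per-method tokens, in-flight executions), which is why a mistake in one does not disable the others.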

Partner POST /partners/:id/transactions → usage plans per tier. Three usage plans, named free, pro, enterprise, each with its own rate, burst, and quota. API keys are created per partner and associated with the plan that matches their contract. A new partner signs the Pro contract, gets a Pro API key, and immediately has 100 RPS with 100,000 requests per day available; no API code change is required. When a partner upgrades to Enterprise, the Ops team moves their API key from the Pro usage plan to the Enterprise usage plan; the new limits apply within seconds.
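The three tiers can be written down as data before anything is created in AWS. The sketch below only builds parameter dictionaries shaped like the arguments of boto3's `create_usage_plan`; it does not call AWS. The rates and quotas are the plan figures quoted in this post, while the burst values (2 × rate), the API ID, and the stage name are assumptions for illustration.

```python
from typing import Any, Dict, Optional, Tuple

# tier name -> (rate in RPS, burst, daily quota or None for no quota)
# Burst values are an assumption (2 x rate); rates and quotas come from the plans above.
TIERS: Dict[str, Tuple[int, int, Optional[int]]] = {
    "free": (10, 20, 1_000),
    "pro": (100, 200, 100_000),
    "enterprise": (1_000, 2_000, None),
}


def usage_plan_params(tier: str, api_id: str, stage: str) -> Dict[str, Any]:
    """Build keyword arguments for one usage plan (nothing is sent to AWS)."""
    rate, burst, daily = TIERS[tier]
    params: Dict[str, Any] = {
        "name": tier,
        "apiStages": [{"apiId": api_id, "stage": stage}],
        "throttle": {"rateLimit": float(rate), "burstLimit": burst},
    }
    if daily is not None:
        params["quota"] = {"limit": daily, "period": "DAY"}
    return params


# "abc123" and "prod" are hypothetical placeholders.
print(usage_plan_params("pro", "abc123", "prod"))
```

With boto3 the result would be passed along the lines of `client.create_usage_plan(**params)`; keeping the tiers in one table means the invoice language and the configuration are reviewed in the same place.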

Three operational habits make usage plans survivable. First, handle API keys like secrets even though they only identify the caller: give each partner two keys at a time so rotation is a matter of issuing the new key, letting them swap, and revoking the old one. Second, monitor quota usage. API Gateway reports per-key consumption through the usage plan’s usage data; publishing that to CloudWatch as a custom metric lets the Ops team alert on “partner reached 80% of daily quota” so the account rep can call the partner before they get throttled. Third, don’t shape throttle to burst = 10 × rate by default. The defaults (rate = X, burst = 2X or thereabouts) are conservative. A partner whose contract is 100 RPS sustained and 200 RPS for spikes wants rate=100, burst=200, not rate=100, burst=5000.

Internal GET /status → method-level throttle. A single method-level throttle at 500 RPS with 1,000 burst is sufficient. It’s high enough to absorb legitimate monitoring from multiple Regions simultaneously and low enough that a misbehaving test harness can’t accidentally flood the endpoint and make the Lambda’s cost interesting. No usage plan, no API key, no WAF rule: the endpoint’s threat model is much weaker because it’s internal.

A worked partner flow

A partner on the Pro plan has a daily budget of 100,000 requests and a rate of 100 RPS. At 9am their batch job kicks off and fires 150 RPS for a minute:

t=0s      batch starts; 100 requests pass, 50 are 429'd
t=1s      same
...
t=60s     ~6,000 passed, ~3,000 429'd; batch finishes
t=60s+    normal traffic ~20 RPS resumes

The 429s show up in the API’s 4XXError metric in CloudWatch and, where access logs record the API key, can be attributed to the partner’s usage plan; the partner’s ops team gets a notification; their retry logic kicks in and reprocesses the 3,000 throttled requests over the next few minutes, which fits comfortably inside the 100 RPS steady rate.
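The arithmetic of that minute can be checked with a coarse one-second simulation. This is a simplified model under stated assumptions: the bucket starts full, the burst of 200 is the Pro figure suggested elsewhere in this post, and arrivals are spread evenly at 150 per second.

```python
from typing import Tuple


def simulate(rate: int, burst: int, arrivals_per_s: int, seconds: int) -> Tuple[int, int]:
    """Return (passed, throttled) for a constant arrival rate over `seconds`."""
    tokens = burst  # assume the bucket starts full
    passed = throttled = 0
    for _ in range(seconds):
        ok = min(arrivals_per_s, tokens)      # requests served this second
        passed += ok
        throttled += arrivals_per_s - ok      # the rest receive 429
        tokens = min(burst, tokens - ok + rate)  # refill, capped at bucket size
    return passed, throttled


passed, throttled = simulate(rate=100, burst=200, arrivals_per_s=150, seconds=60)
print(passed, throttled)
```

Of the 9,000 requests sent, roughly 6,000 pass and roughly 3,000 are throttled; the full bucket absorbs the overage only for the first couple of seconds, after which the steady rate of 100 RPS is all that gets through.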

At midday the partner has consumed 60,000 of their daily 100,000. Cost control on the platform’s side is automatic: the quota will run out around 17:00 if the pattern continues, at which point every subsequent request gets 429 until midnight. The partner’s account rep sees the high-usage alert at 14:00 (80% threshold), calls the partner, and offers a mid-cycle quota bump for the rest of today: a soft upgrade that earns goodwill and converts into a tier upgrade next month.
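A projection like "the quota will run out around 17:00" is a straight-line estimate. The sketch below shows the calculation; the burn rate of 8,000 requests per hour is an assumed figure chosen for illustration, and real traffic is rarely this linear.

```python
def hours_until_exhausted(used: int, limit: int, burn_per_hour: float) -> float:
    """Hours until the quota is spent, assuming a constant burn rate."""
    remaining = max(0, limit - used)
    return remaining / burn_per_hour


def alert_due(used: int, limit: int, threshold_pct: int = 80) -> bool:
    """True once usage has reached `threshold_pct` percent of the quota."""
    return used * 100 >= limit * threshold_pct


# 60,000 of 100,000 used at midday; 8,000/hour is an assumption.
print(hours_until_exhausted(60_000, 100_000, 8_000))
print(alert_due(60_000, 100_000), alert_due(80_000, 100_000))
```

Five hours after midday lands at about 17:00 under that assumption; the alert check uses integer arithmetic so the 80% boundary is exact.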

What’s worth remembering

Throttling answers “what’s the bottleneck behind the gateway?” Tune the rate to the downstream’s capacity, not to an aspirational number.
Four throttle scopes in API Gateway: account, stage, method, usage plan. Each has a different reason to exist. They compose: a request must pass every scope that applies to it to reach the backend.
Usage plans are the contractual knob. Rate + burst + quota, scoped by API key. Shape it to match the invoice: “100 RPS, 100,000 / day” becomes throttle: { rateLimit: 100, burstLimit: 200 }, quota: { limit: 100000, period: DAY }.
API keys are identifiers by design, but handle them like secrets. Give each partner two at a time for rotation; track them in a secret-management system, not in email threads.
WAF rate-based rules protect anonymous traffic. IP-keyed, cheap at the edge, fails earlier than any API Gateway throttle.
Lambda reserved concurrency is the last line. A downstream-bound API with reserved concurrency survives even if the gateway throttles are misconfigured, at the cost of 429s instead of 5xx.
429 Too Many Requests is a protocol contract, not just a status code. Well-behaved clients honour it with back-off and jitter; the platform publishes that expectation in documentation so partner integrations actually do it.
The account-level limit still exists. 10,000 RPS Region-wide by default, raise by support ticket. Set a dashboard alarm at 70% so the team sees the ceiling coming before the ceiling hits them.
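For the back-off and jitter expectation on 429s, a client-side schedule might look like the following. This is one common scheme (capped exponential back-off with full jitter), offered as a sketch; the base delay, cap, and function name are assumptions rather than anything the platform mandates.

```python
import random
from typing import List, Optional


def backoff_delays(attempts: int,
                   base_s: float = 0.5,
                   cap_s: float = 30.0,
                   rng: Optional[random.Random] = None) -> List[float]:
    """Delays (seconds) to wait before each retry after a 429.

    Each delay is drawn uniformly from [0, min(cap, base * 2**n)], so
    clients that were throttled together do not retry together.
    """
    if rng is None:
        rng = random.Random()
    return [rng.uniform(0.0, min(cap_s, base_s * 2 ** n)) for n in range(attempts)]


# Seeded only so the example is repeatable.
for n, delay in enumerate(backoff_delays(5, rng=random.Random(0))):
    print(f"retry {n + 1}: wait up to {min(30.0, 0.5 * 2 ** n):.1f}s, chose {delay:.2f}s")
```

The cap keeps a long outage from producing multi-minute waits, and the randomness spreads retries out so the throttle is not hit by a synchronised second wave.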

WAF rate-based at the edge, method-level in the gateway, usage plans for contractual traffic, reserved concurrency behind the Lambda: three endpoints, four knobs, a clean division of labour. The work isn’t picking a favourite; it’s matching each endpoint’s traffic shape to the knob that shapes it.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.