How to Monitor Service Quotas Before They Bind

September 15, 2027 · 15 min read

The situation

Acme runs in six Regions across three accounts (prod, staging, sandbox). A post-mortem last quarter cataloged every quota-related incident in the preceding year. The list was uncomfortable: eight production-affecting events, six of them caused by the same class of limit.

EC2 On-Demand vCPU limit per Region (surfacing as failed RunInstances calls), hit twice in eu-west-1 during autoscaling events.
Lambda concurrent executions hit three times: once regionally, twice in the per-function reserved pool.
EBS provisioned-IOPS (io1) hit during a storage migration.
S3 PUT request rate briefly hit on a massive backfill (not a Service Quotas limit but per-prefix request-rate guidance that became painful).
Route 53 hosted zones approaching the 500-per-account soft limit.

Every incident followed the same shape. A metric crossed the limit, something started erroring, someone on-call filed a support case to raise the limit, and the increase arrived anywhere from 10 minutes to 3 days later. In every case, the capacity need was foreseeable: the autoscaler was growing on a predictable traffic pattern; the Lambda spike came from a scheduled batch everyone knew about; the storage migration was planned.

The post-mortem’s conclusion: Acme needs to watch the quotas, predict when they’ll bind, and raise them on a schedule, not on a pager.

What actually matters

A quota programme that stops binding at 3 AM has a small number of moving parts; the work is being precise about each one before reaching for a feature.

The first is what counts as the source of truth for “the current limit”. Many services have their own consoles and APIs that hint at limits; only one place will tell you, uniformly across services, “this account’s applied value right now and the AWS default that came before.” Without that single surface, every service is its own bespoke audit, and the answer to “what are we sitting at?” is twelve dashboards and a spreadsheet.
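As a minimal sketch of that single surface, the helper below reads one quota's applied value and its AWS default side by side. It assumes `sq_client` behaves like `boto3.client("service-quotas")`; the function name `quota_snapshot` and the shape of the returned dict are illustrative choices, not part of any AWS API.

```python
from typing import Any, Dict


def quota_snapshot(sq_client: Any, service_code: str, quota_code: str) -> Dict[str, Any]:
    """Return the applied value and the AWS default for one quota.

    sq_client is assumed to expose get_service_quota and
    get_aws_default_service_quota, as the boto3 service-quotas client does.
    """
    applied = sq_client.get_service_quota(
        ServiceCode=service_code, QuotaCode=quota_code
    )["Quota"]
    default = sq_client.get_aws_default_service_quota(
        ServiceCode=service_code, QuotaCode=quota_code
    )["Quota"]
    return {
        "serviceCode": service_code,
        "quotaCode": quota_code,
        "name": applied.get("QuotaName"),
        "applied": applied["Value"],
        "default": default["Value"],
        # True when someone has already raised this quota above the default.
        "raised": applied["Value"] > default["Value"],
    }
```

Called as `quota_snapshot(boto3.client("service-quotas"), "lambda", "L-B99A9384")` in each account and Region, this would give one row per quota for an inventory table.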

The second is what counts as the source of truth for “the current usage”. The limit is one number; how close we are is another. Some services publish usage as a metric; others require an API call that counts resources at the moment of inspection. The mechanism has to be able to express “usage as a fraction of limit” as a time series, otherwise alarms have nothing to fire on.

The third is whether the increase path is API-callable or human-mediated. A quota programme that scales is one where the increase request itself is code: a script the alarm calls, with a target value, a justification template, and a tracking record. A programme where every increase is a support ticket somebody types is a programme that breaks down at the volume the post-mortem describes.

The fourth is the unit of policy across accounts. Six Regions, three accounts, with hundreds of quotas each is hundreds-of-times-three-times-six bookkeeping. The mechanism has to let one team push baseline values to every account from one place; otherwise each account drifts to its own configuration and a new account starts at the defaults nobody intended.

The fifth is which quotas are worth tracking at all. Service Quotas knows about thousands; the post-mortem implicates a handful. The chosen tooling has to make it easy to focus on the quotas that actually bind, rather than alarming on the long tail and training the on-call to ignore the alarms.

And sixth, the request lifecycle is its own observability surface. A filed increase request has states (pending, approved, rejected) and a duration. If the programme can’t see the state of its own open requests, the team learns the request was rejected when the limit binds anyway. That’s right back to the 3 AM page the programme was meant to prevent.

What we’ll filter on

Quota observability: can we see current value and current usage programmatically?
Automated request path: can the increase be filed via API, not a support ticket?
Per-account vs org-wide: does the solution scale to every account?
Alarm before breach: does utilisation cross a threshold before the hard limit?
Request tracking: can we see status of open requests programmatically?
New-account prefill: do new accounts start with sensible limits, not AWS defaults?

The quota-management landscape

Manual console quota-increase requests. Open the Service Quotas console, click “Request quota increase,” enter a value, wait. Per-account, per-Region, per-human. Fine for one-off; doesn’t scale and doesn’t alarm.
Service Quotas API (RequestServiceQuotaIncrease). The same request as the console but callable programmatically. It takes a target value and needs no free-text justification for most adjustable quotas; AWS either applies the change automatically or leaves the request PENDING (or opens a support case) for human review. Integrates into automation.
Service Quotas Request Templates (organisation-level). From the organisation’s management account, define a set of (quota code, target value) pairs that every new account in the organisation should have applied on creation. Doesn’t touch existing accounts; it’s “day-zero prefill” only.
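Trusted Advisor service-limits checks. Track ~60 commonly-hit quotas and flag anything above 80% utilisation. The checks themselves are available on every support plan; pulling the results programmatically through the Support API needs Business Support or higher. Results refresh on Trusted Advisor’s own schedule rather than on an alarm period.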
CloudWatch alarms on AWS/Usage metrics. The programmatic observation layer. For services that publish usage metrics here (Lambda, API Gateway, Route 53, VPC, EC2, EBS, KMS, and about forty others), alarms on (ResourceCount / Quota) > 0.8 catch imminent breaches with arbitrary thresholds and integrate with the standard alarm tooling.
AWS Health Dashboard and Personal Health events. Events of category accountNotification include upcoming quota reductions or changes. Not a primary observability tool but worth aggregating.

Side by side

Option	Observable	Automated request	Org-wide	Alarm before breach	Status tracking	New-account prefill
Manual console request	Visual only	✗	✗	✗	In console	✗
Service Quotas API	✓	✓	Per account	✗	✓	✗
Request Templates	✓	✓	✓	✗	✓	✓
Trusted Advisor limits check	Semi	✗	Per account	Above 80%	✗	✗
CloudWatch Usage alarms	✓	✗ directly	Per account	✓	✗	✗
Health events	Reactive	✗	✓	✗	✗	✗

The working pattern combines three of them in four roles: Service Quotas API for values, CloudWatch Usage alarms for thresholds, Service Quotas API again for requests, Request Templates for new accounts. Each solves a different piece; the programme is the combination.

From observation to action

The observability side is CloudWatch metric math against Service Quotas values; the action side is an SNS-subscriber Lambda that files the increase and tracks it. New accounts get the same numbers applied before the first workload ships, via the organisation template.

The alarm in depth

The CloudWatch AWS/Usage namespace exposes metrics with four dimensions: Service, Type (usually Resource), Resource (the specific quota-contributing resource), and Class (often None). For Lambda concurrent executions the metric shape is:

{
  "Namespace": "AWS/Usage",
  "MetricName": "ResourceCount",
  "Dimensions": [
    { "Name": "Service", "Value": "Lambda" },
    { "Name": "Type", "Value": "Resource" },
    { "Name": "Resource", "Value": "ConcurrentExecutions" },
    { "Name": "Class", "Value": "None" }
  ]
}

The quota value lives separately in Service Quotas. The metric-math approach on a CloudWatch alarm joins them:

aws cloudwatch put-metric-alarm \
  --alarm-name lambda-concurrency-utilisation-eu-west-1 \
  --alarm-description 'Lambda concurrent executions over 70% of quota' \
  --metrics '[
    {
      "Id": "m1", "Label": "Usage",
      "MetricStat": {
        "Metric": {
          "Namespace": "AWS/Usage",
          "MetricName": "ResourceCount",
          "Dimensions": [
            {"Name":"Service","Value":"Lambda"},
            {"Name":"Type","Value":"Resource"},
            {"Name":"Resource","Value":"ConcurrentExecutions"},
            {"Name":"Class","Value":"None"}
          ]
        },
        "Period": 300, "Stat": "Maximum"
      },
      "ReturnData": false
    },
    {
      "Id": "m2", "Label": "Quota",
      "Expression": "SERVICE_QUOTA(m1)",
      "ReturnData": false
    },
    {
      "Id": "utilisation",
      "Expression": "(m1 / m2) * 100",
      "Label": "Percent of quota",
      "ReturnData": true
    }
  ]' \
  --comparison-operator GreaterThanOrEqualToThreshold \
  --threshold 70 \
  --evaluation-periods 2 \
  --datapoints-to-alarm 2 \
  --alarm-actions arn:aws:sns:eu-west-1:111122223333:quota-utilisation-warning

SERVICE_QUOTA(m1) is a built-in CloudWatch metric math function that returns the current applied quota value for the metric it’s called on. Divide usage by quota, alarm at 70%, and we have a signal that fires before the ceiling binds.
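With more than one quota to watch, the same alarm shape can be generated rather than hand-written. The sketch below builds the keyword arguments for CloudWatch's `put_metric_alarm`, mirroring the CLI call above. The helper names (`usage_dimensions`, `utilisation_alarm`) are assumptions for illustration; only the dictionary keys come from the CloudWatch API.

```python
from typing import Any, Dict, List


def usage_dimensions(service: str, resource: str,
                     type_: str = "Resource", class_: str = "None") -> List[Dict[str, str]]:
    """The four AWS/Usage dimensions for one quota-contributing resource."""
    return [
        {"Name": "Service", "Value": service},
        {"Name": "Type", "Value": type_},
        {"Name": "Resource", "Value": resource},
        {"Name": "Class", "Value": class_},
    ]


def utilisation_alarm(alarm_name: str, service: str, resource: str,
                      threshold_pct: float, topic_arn: str,
                      period: int = 300) -> Dict[str, Any]:
    """Keyword arguments for put_metric_alarm: usage as a percentage of quota."""
    return {
        "AlarmName": alarm_name,
        "AlarmDescription": f"{service} {resource} over {threshold_pct:g}% of quota",
        "Metrics": [
            {
                "Id": "m1", "Label": "Usage", "ReturnData": False,
                "MetricStat": {
                    "Metric": {
                        "Namespace": "AWS/Usage",
                        "MetricName": "ResourceCount",
                        "Dimensions": usage_dimensions(service, resource),
                    },
                    "Period": period, "Stat": "Maximum",
                },
            },
            {"Id": "m2", "Label": "Quota",
             "Expression": "SERVICE_QUOTA(m1)", "ReturnData": False},
            {"Id": "utilisation", "Label": "Percent of quota",
             "Expression": "(m1 / m2) * 100", "ReturnData": True},
        ],
        "ComparisonOperator": "GreaterThanOrEqualToThreshold",
        "Threshold": float(threshold_pct),
        "EvaluationPeriods": 2,
        "DatapointsToAlarm": 2,
        "AlarmActions": [topic_arn],
    }
```

The result would be passed as `boto3.client("cloudwatch").put_metric_alarm(**kwargs)`, once per quota and Region on the watch list.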

Not every quota has an AWS/Usage metric. For those that don’t, the fallback is a scheduled Lambda that calls GetServiceQuota for the limit plus the service’s own count or describe API for the usage (Route 53’s GetHostedZoneCount, for example), writes the ratio as a custom CloudWatch metric, and alarms on that. More code, same pattern.
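A hedged sketch of that fallback: read the limit, count the resources, publish the percentage as a custom metric. The clients are passed in and assumed to behave like the boto3 `service-quotas` and `cloudwatch` clients. The namespace `Acme/QuotaUtilisation`, the metric name, and the `count_fn` callback are all made up for this example.

```python
from typing import Any, Callable


def publish_utilisation(
    sq: Any,
    cloudwatch: Any,
    count_fn: Callable[[], float],
    service_code: str,
    quota_code: str,
    namespace: str = "Acme/QuotaUtilisation",  # hypothetical custom namespace
) -> float:
    """Publish usage as a percentage of the applied quota and return it."""
    limit = sq.get_service_quota(
        ServiceCode=service_code, QuotaCode=quota_code
    )["Quota"]["Value"]
    used = count_fn()  # service-specific: whatever counts the resource right now
    pct = 100.0 * used / limit
    cloudwatch.put_metric_data(
        Namespace=namespace,
        MetricData=[{
            "MetricName": "UtilisationPercent",
            "Dimensions": [
                {"Name": "ServiceCode", "Value": service_code},
                {"Name": "QuotaCode", "Value": quota_code},
            ],
            "Value": pct,
            "Unit": "Percent",
        }],
    )
    return pct
```

For Route 53 hosted zones, `count_fn` could be `lambda: route53.get_hosted_zone_count()["HostedZoneCount"]`, with the quota code taken from the Service Quotas console rather than guessed. An ordinary threshold alarm on `UtilisationPercent` then plays the role the metric-math alarm plays for AWS/Usage metrics.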

The request Lambda in depth

The SNS subscriber that receives the alarm and files the increase:

import json
import os
from decimal import Decimal

import boto3

sq  = boto3.client('service-quotas')
ddb = boto3.resource('dynamodb').Table(os.environ['TRACKING_TABLE'])

def handler(event, context):
    for record in event['Records']:
        alarm = json.loads(record['Sns']['Message'])
        # Only act on transitions into ALARM; ignore OK / INSUFFICIENT_DATA.
        if alarm.get('NewStateValue') != 'ALARM':
            continue

        # alarm name encodes service code and quota code (or a registry lookup)
        service_code = 'lambda'
        quota_code   = 'L-B99A9384'  # ConcurrentExecutions

        current = sq.get_service_quota(
            ServiceCode=service_code, QuotaCode=quota_code
        )['Quota']['Value']

        target  = current * 2  # bump 2x -- a policy, not a law

        try:
            req = sq.request_service_quota_increase(
                ServiceCode=service_code,
                QuotaCode=quota_code,
                DesiredValue=target
            )['RequestedQuota']
        except sq.exceptions.ResourceAlreadyExistsException:
            # An increase for this quota is already open; don't file a second one.
            continue

        # The DynamoDB resource API rejects Python floats, so store numbers as Decimal.
        ddb.put_item(Item={
            'requestId':  req['Id'],
            'serviceCode': service_code,
            'quotaCode':   quota_code,
            'currentValue': Decimal(str(current)),
            'targetValue':  Decimal(str(target)),
            'status':       req['Status'],  # PENDING | CASE_OPENED | APPROVED | DENIED | ...
            'createdAt':    req['Created'].isoformat()
        })
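The handler hard-codes the service and quota codes and leaves the lookup as a comment. One way that lookup might work is sketched below, assuming alarm names start with a known prefix. The `QUOTA_REGISTRY` table and the `resolve_quota` helper are hypothetical. The `AlarmName` and `NewStateValue` fields are the ones CloudWatch includes in its SNS alarm notifications.

```python
import json
from typing import Dict, Optional, Tuple

# Hypothetical registry: alarm-name prefix -> (service code, quota code).
QUOTA_REGISTRY: Dict[str, Tuple[str, str]] = {
    "lambda-concurrency-utilisation": ("lambda", "L-B99A9384"),
}


def resolve_quota(sns_message: str) -> Optional[Tuple[str, str]]:
    """Map a CloudWatch alarm notification to the quota it watches.

    Returns None for non-ALARM transitions and for alarms not in the registry,
    so the caller can skip them without filing anything.
    """
    alarm = json.loads(sns_message)
    if alarm.get("NewStateValue") != "ALARM":
        return None
    name = alarm.get("AlarmName", "")
    for prefix, target in QUOTA_REGISTRY.items():
        if name.startswith(prefix):
            return target
    return None
```

Keeping the Region as a suffix of the alarm name (as in `lambda-concurrency-utilisation-eu-west-1`) lets one registry entry cover every Region.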

IAM: the request Lambda’s role needs servicequotas:GetServiceQuota and servicequotas:RequestServiceQuotaIncrease, plus DynamoDB write on the tracking table. servicequotas:GetRequestedServiceQuotaChange is only needed by whatever polls request status afterwards.
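The GetRequestedServiceQuotaChange permission is what a status poller would use. A minimal sketch of one follows, assuming `sq` behaves like the boto3 `service-quotas` client and that open requests have already been read from the tracking table as dicts with `requestId` and `status` keys. The `sweep` function and its `on_change` callback are illustrative names.

```python
from typing import Any, Callable, Dict, Iterable, List

# Statuses that still need watching; anything else is treated as settled.
OPEN_STATES = {"PENDING", "CASE_OPENED"}


def sweep(
    sq: Any,
    tracked: Iterable[Dict[str, Any]],
    on_change: Callable[[Dict[str, Any], str], None],
) -> List[Dict[str, Any]]:
    """Poll each open request; report and return the ones whose status moved."""
    changed = []
    for item in tracked:
        if item["status"] not in OPEN_STATES:
            continue  # already settled, nothing to poll
        latest = sq.get_requested_service_quota_change(
            RequestId=item["requestId"]
        )["RequestedQuota"]["Status"]
        if latest != item["status"]:
            on_change(item, latest)  # e.g. update the table, emit an event
            changed.append({**item, "status": latest})
    return changed
```

Run on a schedule, this keeps the tracking record honest; `on_change` is where a DynamoDB update and a notification would go.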

The doubling rule (target = current * 2) is a policy choice. For small quotas (10 -> 20) it’s conservative. For large ones (10,000 -> 20,000) it’s aggressive and likely to require human review on AWS’s side. A better rule is “next standard tier up” encoded as a per-quota lookup table; AWS often applies small increases automatically and bigger ones after review. The lookup table lives in code, in DynamoDB, or in Parameter Store.
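As a sketch of the lookup-table idea, the function below picks the next tier above the current value and falls back to doubling when a quota has no table entry. The tier values in `TIERS` are invented for illustration and are not AWS-published tiers.

```python
from bisect import bisect_right
from typing import Dict, List

# Illustrative tiers only; real values would come from the team's own policy.
TIERS: Dict[str, List[float]] = {
    "L-B99A9384": [1000, 2000, 5000, 10000, 20000],  # Lambda concurrent executions
}


def next_target(quota_code: str, current: float, fallback_factor: float = 2.0) -> float:
    """Return the next tier strictly above `current`, else current * fallback_factor."""
    tiers = TIERS.get(quota_code)
    if tiers:
        i = bisect_right(tiers, current)  # index of the first tier greater than current
        if i < len(tiers):
            return float(tiers[i])
    return current * fallback_factor
```

In the request Lambda, `target = next_target(quota_code, current)` would replace `target = current * 2`.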

The organisation template

For every new Acme account, run a one-time increase at account-creation time. The Service Quotas Request Template associated with the Organization expresses this:

# Run from the organisation's management account.
aws service-quotas put-service-quota-increase-request-into-template \
  --aws-region eu-west-1 \
  --service-code lambda \
  --quota-code L-B99A9384 \
  --desired-value 5000

aws service-quotas associate-service-quota-template

Once associated, every new account created in the organisation gets the template’s requests applied at creation. Existing accounts don’t; the template is day-zero prefill, not backfill. For existing accounts, the alarm-driven path does the work.

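The template is scoped per Region per quota, and the number of entries it accepts is capped (10 at the time of writing; worth confirming in the Service Quotas documentation). Getting the list right is a small design exercise: pick the quotas that caused 80% of historical incidents and prefill those.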
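Because each entry is one (Region, quota) pair, a short list of quotas fans out quickly. The sketch below expands a quota list across Regions into the arguments that `put_service_quota_increase_request_into_template` takes. The helper names are assumptions, `sq` is assumed to behave like the boto3 `service-quotas` client in the management account, and `us-east-1` stands in for a second Region the post does not name.

```python
from typing import Any, Dict, Iterable, List, Tuple


def template_entries(
    quotas: Iterable[Tuple[str, str, float]],
    regions: Iterable[str],
) -> List[Dict[str, Any]]:
    """Expand (service code, quota code, desired value) across Regions."""
    region_list = list(regions)
    return [
        {"AwsRegion": region, "ServiceCode": service,
         "QuotaCode": quota, "DesiredValue": float(value)}
        for (service, quota, value) in quotas
        for region in region_list
    ]


def apply_template(sq: Any, entries: Iterable[Dict[str, Any]]) -> int:
    """Write each entry into the organisation template; return how many were sent."""
    sent = 0
    for entry in entries:
        sq.put_service_quota_increase_request_into_template(**entry)
        sent += 1
    return sent


# Example input: one quota, two Regions (the second Region is a placeholder).
entries = template_entries(
    [("lambda", "L-B99A9384", 5000)],
    ["eu-west-1", "us-east-1"],
)
```

Counting `len(entries)` before applying is a cheap way to check the fan-out against whatever entry limit the template enforces.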

A worked scenario

New quarter, new product launch, traffic forecast says concurrent Lambda executions will double over six weeks.

Week 1: usage sits at 420 concurrent executions against a quota of 1000 (42%). No alarm. No action.

Week 3: a traffic spike pushes usage to 720. (720 / 1000) * 100 = 72. The alarm crosses the 70% threshold at 14:02, stays above at 14:07 (two consecutive periods). SNS fires, Lambda invoked. The Lambda reads the current quota (1000), doubles it (2000), files the increase, records the request in DynamoDB.

AWS applies the increase automatically within a few minutes (common for Lambda concurrency up to 10,000). A second, scheduled Lambda (the sweeper), running every 15 minutes and polling GetRequestedServiceQuotaChange for each open request, sees the status change from PENDING to APPROVED, updates DynamoDB, and publishes an EventBridge event.

Downstream of that event: a Slack notification to #platform-ops, a Jira ticket auto-updated, and, importantly, the 70% alarm auto-adjusts. SERVICE_QUOTA(m1) now returns 2000, so the utilisation expression drops from 72% to 36%, and the alarm goes back to OK without any human touching the console.

Six weeks later, at 1450 concurrent executions against 2000 (72.5%), the alarm fires again. Same loop. Same auto-resolution. The team has never seen a pager for this quota.
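The arithmetic in this scenario can be replayed in a few lines. The sketch below is a simplified model of the alarm (the last two datapoints must both be at or above the threshold), not CloudWatch's actual evaluation logic; the function names are illustrative.

```python
from typing import Sequence


def utilisation_pct(usage: float, quota: float) -> float:
    """Usage as a percentage of quota, the same value the metric math produces."""
    return 100.0 * usage / quota


def in_alarm(pcts: Sequence[float], threshold: float = 70.0, datapoints: int = 2) -> bool:
    """True when the most recent `datapoints` values are all at or above threshold."""
    recent = pcts[-datapoints:]
    return len(recent) == datapoints and all(p >= threshold for p in recent)


# Two consecutive five-minute datapoints at each stage of the scenario.
week1 = [utilisation_pct(420, 1000)] * 2           # 42% of 1000
week3 = [utilisation_pct(720, 1000)] * 2           # 72% of 1000, alarm fires
after_increase = [utilisation_pct(720, 2000)] * 2  # 36% of 2000, back to OK
later = [utilisation_pct(1450, 2000)] * 2          # 72.5% of 2000, fires again

states = [in_alarm(s) for s in (week1, week3, after_increase, later)]
```

The quota is the only input that changes between `week3` and `after_increase`, which is why the alarm clears without anyone touching its threshold.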

What’s worth remembering

Quotas are observable as CloudWatch metrics joined to Service Quotas values. Metric math with SERVICE_QUOTA(m) is the cleanest alarm shape: usage divided by quota, alarm threshold in percent.
The Service Quotas API handles the request path. RequestServiceQuotaIncrease is automatable; most adjustable quotas apply in minutes if the target is reasonable.
Trusted Advisor tracks a curated 60-odd quotas. Fine as a dashboard; not how you’d build a systematic programme. Programmatic access to the results needs Business Support or higher.
Organisation request templates prefill new accounts. Pick the historically-incident-prone quotas that fit within the template’s entry limit and put them in the template. Doesn’t touch existing accounts.
AWS/Usage is the canonical usage namespace. Many services publish here; for the ones that don’t, a scheduled Lambda that publishes a custom metric closes the gap.
Track open requests in DynamoDB or similar. A scheduled sweeper polling GetRequestedServiceQuotaChange keeps tracking state honest and feeds status notifications.
Doubling is a policy, not a law. Some quotas are better stepped by a factor (2x, 5x), some by an absolute jump (add 1000), some have a vendor-recommended next tier. Encode the rule per quota.
The alarm auto-resolves after the increase. When SERVICE_QUOTA(m1) updates, utilisation drops and the alarm returns to OK. No tuning required.

A quota programme is three small pieces: observe with CloudWatch metric math, act with the Service Quotas API, and prefill new accounts with an organisation template. Most of last year’s eight production-affecting events would likely have been avoided with the programme in place, not because the limits never crept up, but because the response would have been automated and completed before the metric ever reached the ceiling. Quotas are soft limits when they’re treated that way; they become hard walls only when nobody is watching.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.