Tuning Redshift WLM for Mixed Workloads

January 17, 2028 · 16 min read

The situation

A ra3.4xlarge × 6 Redshift cluster. Three workload shapes share it.

Dashboards. 200 users across five QuickSight dashboards, mostly direct-query. Queries are sub-second to a few seconds each, high concurrency, predominantly reads against curated fact and dimension tables. Freshness matters; latency matters.
ETL. A nightly pipeline that runs from 01:00-06:00 UTC: loads yesterday’s events, builds the daily fact and aggregates, vacuum/analyze at the end. Long queries (minutes to an hour each), high memory per query, sequential inside the pipeline.
Ad-hoc analytics. A handful of analysts running heavy queries against the warehouse during business hours, joining large tables, aggregating over long windows. Unpredictable shape; moderate concurrency; queries take minutes.

The cluster is default-configured: one WLM queue, automatic memory allocation, no priority. The nightly ETL finishes on time, but between 07:00 and 09:00 every morning (while yesterday’s ETL has just finished and the business day’s dashboards are opening) there’s contention: vacuum still running, the first dashboard load is slow, an analyst kicks off a heavy query and blocks the queue further.

The default queue was fine when the cluster was small and the workload was homogeneous. With three different workload shapes competing, the default is no longer shaping things; it’s just serialising them badly.

What actually matters

Before reaching for a configuration change, it’s worth separating what WLM does from what people often hope it does.

WLM divides cluster memory and concurrency across queues. A queue has a fixed share of memory and a fixed number of query slots. Queries are routed to queues by rules (user group, query group, query type detection, priority). Within a queue, queries run up to the slot count concurrently; excess queries wait. Memory is split across the slots so each running query has a predictable allocation.

What WLM doesn’t do: WLM doesn’t preempt. A long-running query in a queue doesn’t get kicked off because a short one arrived; the short one waits. WLM doesn’t spin up more compute; concurrency is bounded by cluster capacity. WLM doesn’t understand business priority except through the rules you teach it.

The trade-off at the heart of WLM is concurrency vs memory per query. A queue with 10 slots and 50% of memory gives each query roughly 5% of cluster memory, and can run 10 concurrent queries. A queue with 2 slots and 50% of memory gives each query 25% of cluster memory but only runs 2 concurrent queries. More memory per query = bigger queries succeed without spilling to disk = faster for memory-hungry work. More slots = more concurrent users, each with less memory headroom = faster when queries are light and concurrency is the constraint.

A cluster with one queue can only make one choice for all queries. A cluster with three queues can make three choices, matching each workload shape.

The second concept is transient capacity for read bursts, where the cluster can lean on extra short-lived compute when the main queues are full. Lifts the “wait” problem at the cost of per-second charges for the borrowed capacity. Useful as a safety valve; not a substitute for queue design.

The third is the adaptive vs explicit trade-off. An adaptive approach lets the cluster decide concurrency and memory per query based on observed workload patterns. An explicit approach gives you fixed queues with fixed counts and fixed memory. Adaptive handles the simple case gracefully; explicit is the tool when you need guarantees the workload owner can point at.

What we’ll filter on

Distilling into filters we can score each approach against:

Predictability (does a query of a given shape get a reliably-similar slice of resources?)
Priority handling (can short queries skip past long ones?)
Memory per query (how much memory does each query get, and can it be tuned per workload?)
Concurrency ceiling (how many queries run at once, and what happens when the cap is hit?)
Complexity of configuration (how many knobs and how much operational tuning does this require?)

The WLM landscape

1. Automatic WLM (the default). Redshift decides concurrency and memory based on observed workload. One “queue” logically, but the service manages per-query allocation. Good for simple, homogeneous workloads and for teams that don’t want to tune. Less predictable for heterogeneous workloads where you want guaranteed memory for specific shapes.

2. Manual WLM. Define up to 8 user-defined queues (plus a superuser queue). Each queue has a name, a slot count (total concurrency 1-50 across all queues), a memory percent (sums to 100% across all queues), optional query monitoring rules, and routing rules. Queries are routed by matching user groups or query groups, or by short-query acceleration.

3. Query Monitoring Rules (QMR). Within any queue (Automatic or Manual), you can define rules that trigger on query behaviour: “if return row count > 1B, log it”, “if query time > 30 min, abort”, “if scan rows > 10B, move to lower-priority queue”. QMRs are how runaway queries get caught without requiring a DBA to be watching.

4. Short-Query Acceleration (SQA). Redshift can auto-detect short queries and route them to a priority lane with more slots, so a dashboard query doesn’t wait behind a long ETL. Enabled per-cluster; affects Manual and Automatic WLM both.

5. Concurrency scaling. For read queries, Redshift can spin up transient clusters when main-cluster queues are full, with linear-scaling pricing (free hour earned per day as part of the base pricing, then per-second after). For heavy concurrent reads, this is the pressure-valve.

6. Priority classes (Automatic WLM). Automatic WLM supports priority labels (highest, high, normal, low, lowest); queries with higher priority get preference for resources. Routes via user group or query group like Manual queues.

Side by side

Option	Predictability	Priority	Memory per query	Concurrency ceiling	Configuration effort
Automatic WLM	Adaptive	Via priority classes	ML-managed	Up to 50	Low (ML does it)
Manual WLM	Fixed	Via queue routing	Queue-level, explicit	Up to 50 (across queues)	Medium-high
Short-Query Acceleration	Fixed short-query lane	Automatic for short	ML-managed for SQA	Separate SQA slots	Low (on/off)
Concurrency Scaling	N/A (additive capacity)	Same as main queue	Per-query	Additional clusters	Low (on/off)
QMR	N/A (rule overlay)	N/A	N/A	N/A	Medium

Reading by situation:

Small cluster, homogeneous workload, no tuning appetite: Automatic WLM with SQA on. Let Redshift’s model figure it out.
Heterogeneous workloads with specific memory needs (ETL wants big chunks, dashboards want concurrency): Manual WLM with queues per workload. Pay the configuration cost for the predictability.
Occasional concurrent-read spikes breaking the main cluster: Concurrency scaling as the pressure-valve. Leaves the main cluster’s queue design intact; adds transient capacity when needed.
Runaway queries from analysts: QMR. Rules that log, abort, or demote based on observed behaviour.

Manual WLM in depth

Queue definition. Each queue has:

Name, for logging and monitoring.
Concurrency (slot count): how many queries run simultaneously in this queue.
Memory percent: share of cluster memory allocated to this queue; split evenly across its slots.
User groups: database user groups that route here by membership.
Query groups: labels set with SET query_group TO 'etl' in the session; queries with that label route here.
Timeout (optional): max query runtime before cancellation.

The sum of memory percent across all queues must total 100; the sum of concurrency across all queues is capped at 50.

Routing priority. When a query submits, WLM checks rules in order: first query-group match, then user-group match, then falls back to a default queue. First match wins.

Slot behaviour. A slot is an execution concurrency unit. A query takes one slot when it starts; holds it for the query’s duration. If the queue’s slots are full, queries queue (FIFO within the queue). A query can request multiple slots with SET wlm_query_slot_count TO 4, useful for COPY or memory-hungry queries that want more than one slot’s worth of memory temporarily. Drains to the queue’s depth, so a 4-slot ETL query blocks the other slots in its queue while running.

Memory-per-slot. Cluster memory × queue memory percent ÷ queue slot count = memory per query in that queue. On a 6-node ra3.4xlarge cluster with ~192 GB usable memory, a queue with 30% memory and 10 slots gives each query ~5.8 GB; the same 30% across 2 slots gives each query ~29 GB. Big difference for joins or aggregations that spill to disk when memory is tight.

A three-queue configuration

For the described workload, a manual-WLM configuration with three queues pays off:

Dashboards queue: high concurrency (20 slots), moderate memory (30%). Users in the dashboards group route here.
ETL queue: low concurrency (2 slots), high memory per query (40%). Users in the etl group route here. SET wlm_query_slot_count = 2 for the biggest loads; they take both slots and the queue’s whole memory share.
Ad-hoc queue: moderate concurrency (5 slots), moderate memory (25%). Users in the analysts group route here.
Superuser queue (implicit): 5% memory, 1 slot, for DBAs only.

Totals: 28 user slots (well under the 50 cap), 100% memory. Queries route by user group; a dashboard user’s query goes to the dashboard queue, an analyst’s goes to ad-hoc.

The result: dashboards and ETL don’t compete for slots (they’re in different queues), so the 07:00 contention goes away. An analyst’s heavy query runs in its own queue, so it doesn’t starve dashboards. The ETL still has its big memory share, so the nightly load doesn’t spill to disk.

The configuration and its trade-offs

Three queues match three workload shapes. Add-ons sit alongside: SQA for short queries, concurrency scaling for read bursts, QMR for runaway protection.

Query Monitoring Rules in depth

QMR is the safety net. A rule has a predicate (query-level metric threshold) and an action (log, hop to next queue, abort).

Useful predicates:

query_execution_time (seconds): runaway queries.
query_cpu_time (seconds): queries hammering CPU regardless of wall-clock.
rows_scanned, rows_returned, bytes_spilled_to_disk: queries missing predicates or doing full scans.
nested_loop_join_row_count: catching catastrophic plans.

Actions:

log: record in STL_WLM_RULE_ACTION without affecting the query.
hop: move the query to the next queue (useful for demoting runaways).
abort: cancel the query.

A useful starting ruleset:

On ad-hoc queue: abort if query_execution_time > 1800 (30 minutes). Ad-hoc analysts rarely need longer; runaways usually indicate a missing filter.
On any queue: log if bytes_spilled_to_disk > 10000 (10 GB). Spills indicate memory pressure; logging without action gives the DBA evidence without disrupting users.
On ETL queue: abort if query_execution_time > 7200 (2 hours). The pipeline’s longest job should complete within the SLA; queries exceeding this are probably stuck.

A worked reconfiguration

The DBA applies the three-queue config via the Redshift parameter group (WLM is controlled via wlm_json_configuration):

[
  {
    "user_group": ["dashboards"],
    "query_concurrency": 20,
    "memory_percent_to_use": 30,
    "name": "dashboards"
  },
  {
    "user_group": ["etl"],
    "query_concurrency": 2,
    "memory_percent_to_use": 40,
    "name": "etl",
    "rules": [
      { "rule_name": "etl_long_query", "predicate": [{"metric_name": "query_execution_time", "operator": ">", "value": 7200}], "action": "abort" }
    ]
  },
  {
    "user_group": ["analysts"],
    "query_concurrency": 5,
    "memory_percent_to_use": 25,
    "name": "adhoc",
    "rules": [
      { "rule_name": "adhoc_long_query", "predicate": [{"metric_name": "query_execution_time", "operator": ">", "value": 1800}], "action": "abort" },
      { "rule_name": "adhoc_big_spill", "predicate": [{"metric_name": "query_temp_blocks_to_disk", "operator": ">", "value": 100000}], "action": "hop" }
    ]
  },
  {
    "short_query_queue": true
  }
]

The configuration reboots the cluster (a brief downtime). After restart, queries route by user group; STL_WLM_QUERY and STV_WLM_QUERY_STATE show queue assignment and slot use. The DBA watches for a week, adjusts concurrency and memory based on observed queue depths (if the dashboards queue consistently queues, bump concurrency or move some memory), and turns on concurrency scaling when she sees the ad-hoc queue’s depth spike on month-end.

What’s worth remembering

WLM shapes a cluster’s resources per workload. Queues carve up memory and concurrency; routing rules put queries in the right queue. It’s how one cluster supports mixed workloads without them colliding.
Concurrency vs memory is the core trade. Many slots with little memory each = good for many small queries. Few slots with lots of memory each = good for heavy queries that spill to disk otherwise. Match the queue’s shape to the workload in it.
Manual WLM over Automatic for heterogeneous workloads. Automatic is great for homogeneous loads. The moment you have specific guarantees to make (ETL needs X memory, dashboards need Y concurrency), Manual lets you encode them.
SQA is usually on. Short-Query Acceleration routes short queries to a fast lane without user intervention. Saves the dashboard from waiting behind the analyst’s join.
Concurrency Scaling is the pressure valve. Read-only transient clusters when main queues fill. Small baseline cost (a free hour per day), pay-per-second above it. Leave on unless you have a specific reason not to.
Query Monitoring Rules protect against runaways. Abort queries that exceed time, scan, or spill limits. Start with log-only rules to see what your workload actually does, then promote to abort for the rules that should fire.
User groups route queries to queues. Map user groups to workloads early so routing is simple; don’t rely on query-group SET commands in application code for critical routing.
Measure before tuning. STL_WLM_QUERY, SVV_ALL_TABLES, SVV_TABLE_INFO, the Query Monitoring dashboard. The queue-depth and slot-wait metrics tell you where you’re bottlenecked; the tuning follows.

WLM isn’t magic; it’s a resource-allocation contract between you and the cluster. Tell it which queries matter how, and it will allocate. Default WLM is a single contract for all queries; a cluster with three workloads wants three contracts. The work is naming the workloads, sizing their queues, and watching the metrics to refine the sizes.

These posts are LLM-aided. Backbone, original writing, and structure by Craig. Research and editing by Craig + LLM. Proof-reading by Craig.