
Scaling

Goals

  • Keep latency within SLO while maintaining cost efficiency.
  • Default to horizontal scale (more replicas) and only scale vertically (bigger vCPU/RAM) when necessary.
  • Use data (response times, CPU/memory, instance counts, and cost) to tune decisions.

Default Policy

  1. Scalers: Two complementary ACA scalers run in parallel: HTTP concurrency and CPU utilization.

    • HTTP concurrency target: 20 concurrent requests per replica (tuned down from the initial 40 based on observed CPU-heavy request profiles for the public API).
    • CPU utilization target: 70%, which fires earlier than HTTP concurrency for CPU-bound sync bursts where per-replica CPU saturates before the concurrent request count climbs.
    • Either scaler can independently trigger a scale-out; both must drop below their thresholds before scale-in begins.
  2. Replica bounds:

    • minReplicas: 1 for public APIs (avoid cold start).
    • maxReplicas: set per observed peak traffic (public API: 10).
  3. Scale-down cooldown: 600 seconds for the public API. This prevents premature scale-in during intra-burst traffic dips (brief drops in the middle of an active sync workload).

  4. Preference:

    • Scale horizontally first for cost efficiency. Adding replicas only incurs cost while they run under load.
    • Scale vertically later if a single request needs more headroom; vertical sizing increases "fixed" cost because the larger container size is paid for throughout its active lifetime.
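
As a sketch, the two parallel scalers in the policy above could surface in the container app's scale block roughly like this. Rule names are illustrative and the actual wiring lives in the containerApp.bicep module; values match the public API settings described above:

```bicep
// Sketch of the resulting ACA scale block for the public API (illustrative).
scale: {
  minReplicas: 1
  maxReplicas: 10
  rules: [
    {
      name: 'http-concurrency'
      http: {
        metadata: {
          concurrentRequests: '20' // HTTP scaler: 20 concurrent requests per replica
        }
      }
    }
    {
      name: 'cpu-utilization'
      custom: {
        type: 'cpu' // KEDA CPU scaler
        metadata: {
          type: 'Utilization'
          value: '70' // percent; fires before HTTP concurrency on CPU-bound bursts
        }
      }
    }
  ]
}
```

Either rule exceeding its threshold scales out; scale-in waits until both are below threshold.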

Configuration (ACA)

Example (Bicep)

The containerApp.bicep module exposes the following scaling parameters:

```bicep
param minReplicas int = 0
param maxReplicas int = 1
param httpConcurrentRequests int = 0   // 0 = scaler disabled
param cpuUtilizationThreshold int = 0  // 0 = scaler disabled; percentage (e.g. 70)
param scaleDownCooldownSeconds int = 300
```

Public API configuration (from main.bicep):

```bicep
cpu: valueString(environment, { prd: '1', acc: '0.5', default: '0.25' })
memory: valueString(environment, { prd: '2Gi', acc: '1Gi', default: '0.5Gi' })
minReplicas: valueInt(environment, { prd: 1, default: 0 })
maxReplicas: valueInt(environment, { prd: 10, acc: 2, default: 1 })
httpConcurrentRequests: 20
cpuUtilizationThreshold: valueInt(environment, { prd: 70, acc: 80, default: 0 })
scaleDownCooldownSeconds: 600
```

Notes

  • Start with cpu: 0.5, memory: 1Gi for new services. Right-size after profiling (see "When to Scale Vertically").
  • If a service is truly bursty and can tolerate cold starts or is infrequently called, minReplicas: 0 is acceptable.
  • Omitting httpConcurrentRequests or cpuUtilizationThreshold (or passing 0) disables that scaler, which is useful for non-HTTP workloads or services where CPU scaling is not meaningful.
  • Raise scaleDownCooldownSeconds when traffic is bursty with brief dips mid-burst (see "Scale-Down Cooldown" below).
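
Tying these notes together, a hypothetical non-HTTP background worker that disables the HTTP scaler and tolerates cold starts might be wired up like this (module path, service name, and parameter values are illustrative; the module may require additional parameters such as cpu and memory):

```bicep
// Hypothetical bursty background worker: no HTTP scaler, scale-to-zero allowed.
module syncWorker 'containerApp.bicep' = {
  name: 'sync-worker'
  params: {
    minReplicas: 0               // infrequently called; cold starts are acceptable
    maxReplicas: 3
    httpConcurrentRequests: 0    // 0 disables the HTTP scaler (non-HTTP workload)
    cpuUtilizationThreshold: 70  // CPU scaler still drives scale-out
    scaleDownCooldownSeconds: 300
  }
}
```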

How We Tune concurrentRequests (20 → …)

  1. Collect a baseline for 3–7 days under representative traffic.
  2. If CPU < 50% and p95 well below SLO, consider raising concurrency (e.g., 20 → 30).
  3. If CPU ≥ 75% or p95 approaches SLO, lower concurrency (e.g., 20 → 15) or add replicas by raising maxReplicas.
  4. Re-assess until we hit the "sweet spot" (steady p95 and 60–75% CPU during normal load).

The public API was tuned from the initial default of 40 down to 20 based on observed data: sync workloads are CPU-heavy per request, meaning a single replica saturates CPU well before HTTP concurrency climbs to 40. The CPU scaler at 70% covers burst detection for those patterns.
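In configuration terms, a tuning cycle typically ends as a one-line parameter change in main.bicep. For instance, if a baseline for an I/O-bound service showed CPU under 50% with p95 well below SLO, the adjustment would be a raise (the value below is illustrative, not a recommendation):

```bicep
// Illustrative outcome of one tuning cycle on an I/O-bound service.
httpConcurrentRequests: 30  // was 20; raised after a 3-7 day baseline showed headroom
```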


Scale-Down Cooldown

Scale-in is automatic; no rules are required. KEDA polls all active scalers and begins counting down the cooldown period once all scalers drop below their thresholds.

| Parameter                | Default | Public API |
| ------------------------ | ------- | ---------- |
| scaleDownCooldownSeconds | 300     | 600        |

The public API cooldown is set to 600 seconds because sync workloads produce intra-burst traffic dips (brief quiet windows in the middle of an active sync) that would otherwise trigger premature scale-in. With a 10-minute cooldown, replicas stay warm through the full burst and are released within ~10 minutes after the workload ends.


Decision Tree: Scale Out vs. Optimize vs. Scale Up

  1. Is p95 latency above SLO?

    • No → Do nothing. Keep observing.
    • Yes → Go to 2.
  2. Is per-replica CPU ≥ 75% or memory ≥ 80%?

    • Yes → Try horizontal scale first (increase maxReplicas or decrease target concurrency slightly).
    • No → Go to 3.
  3. Where is time spent (from traces)?

    • Mostly external (DB, HTTP downstream) → Optimize queries, add caching, reduce payloads; scaling won't help much.
    • Mostly app CPU / GC → Consider vertical scale (more vCPU/RAM) or reduce allocations/compute; then re-test.
  4. Are we frequently at maxReplicas with p95 ~ SLO and costs rising?

    • Yes → Prioritize optimization (DB indexes, caching, batching) before adding more replicas.

When We Scale Horizontally (Default)

  • Workload is I/O-bound or embarrassingly parallel.
  • Increasing replicas reduces queueing and improves p95.
  • Cost is proportional to actual load (replicas scale down when idle).

Actions

  • Increase maxReplicas with the 60–75% CPU target in mind.
  • Keep concurrentRequests near the latency sweet spot.
  • Watch DB/redis/queue limits as you add replicas.

When We Scale Vertically (Exception, Not Default)

Scale vertically only when many requests are CPU/memory intensive and a single replica is the bottleneck:

  • p95/p99 dominated by CPU work inside the service even with low concurrency.
  • High GC% time, LOH pressure, or frequent OOM.
  • Thread pool starvation at modest concurrency.

Actions

  • Move the app to a larger vCPU/RAM size (workload profile).
  • Keep horizontal scaling enabled; vertical ≠ disable autoscale.
  • Re-check cost: bigger instances increase the "fixed" cost floor because we pay the higher rate for the entire time the container runs.
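
As a sketch, a vertical resize for the public API would change only the size values while leaving the autoscale bounds from main.bicep intact (the doubled prd sizes below are illustrative, not a recommendation):

```bicep
// Illustrative vertical resize: bigger prd container, unchanged autoscale bounds.
cpu: valueString(environment, { prd: '2', acc: '0.5', default: '0.25' })       // prd was '1'
memory: valueString(environment, { prd: '4Gi', acc: '1Gi', default: '0.5Gi' }) // prd was '2Gi'
minReplicas: valueInt(environment, { prd: 1, default: 0 })   // horizontal scaling stays enabled
maxReplicas: valueInt(environment, { prd: 10, acc: 2, default: 1 })
```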

FAQ

  • Why two scalers instead of one? HTTP concurrency alone misses CPU-bound bursts. When sync workloads arrive (CPU-heavy per request), a replica can hit 100% CPU while concurrent request count stays low. The CPU scaler at 70% catches this before requests start queuing.

  • Why 20 concurrent requests (not 40)? 20 is the tuned value for the public API after observing that CPU saturates at modest concurrency for sync-heavy traffic. For I/O-bound services with fast response times and low CPU, 40 remains a reasonable starting point.

  • Why horizontal before vertical? Horizontal scaling matches cost to demand; vertical sizing raises the baseline cost as long as the container is running.

  • What's the signal to go vertical? When p95/p99 are dominated by in-process CPU/memory work and a single request needs more headroom despite modest concurrency.

  • Why 600s cooldown for the public API? Sync workloads produce brief traffic dips mid-burst. A 300s default would start scaling in replicas that are needed again 60–90 seconds later. The 10-minute window covers the observed burst durations and avoids oscillation.