Beating Scaling Latency with KEDA Pre-warming

How LZStock defeats the 'Scaling Latency' of standard Kubernetes HPA by combining Proactive Pre-warming (KEDA) with Guaranteed QoS for the morning traffic tsunami.

TL;DR
  • Defeat Scaling Latency with KEDA: Replaced reactive Kubernetes HPA with KEDA's Cron-triggered ScaledObjects, proactively pre-warming the cluster 5 minutes before the 9:30 AM market open to completely eradicate cold-start bottlenecks during traffic tsunamis.
  • Eradicate CPU Throttling via Guaranteed QoS: Enforced strict requests == limits constraints on Pod configurations, assigning them to the Guaranteed QoS class to prevent aggressive Linux cgroup throttling from causing unpredictable latency spikes in Go's Garbage Collector.
  • Dual-Dimension Autoscaling: Engineered a hybrid scaling strategy that absorbs predictable daily volatility through time-based proactive bounds, while maintaining responsive CPU-metric triggers to handle unforeseen intraday market surges.

The Objective

Financial dashboards are notoriously spiky. At 9:30 AM ET, when the US market opens, thousands of active users open their apps simultaneously.

The fatal flaw of standard Kubernetes Horizontal Pod Autoscaling (HPA) is Scaling Latency. HPA is reactive; it waits for CPU metrics to spike before creating new Pods. Booting new Pods takes 30-60 seconds. During a 10x traffic spike, waiting 60 seconds means dropped WebSocket connections, cascading timeouts, and API Gateway crashes.
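For reference, the reactive baseline being replaced is roughly this standard autoscaling/v2 manifest (an illustrative sketch; the target name is assumed):

```yaml
# hpa.yaml -- the reactive baseline this article replaces (illustrative)
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: lzstock-api # assumed name, for illustration only
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: lzstock-api
  minReplicas: 3
  maxReplicas: 30
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
```

By the time this controller observes the 9:30 AM spike, creates Pods, and those Pods pass readiness checks, the surge is already a minute old.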

The objective is to combine Resource Constraints (Guaranteed QoS) with Proactive Pre-warming to ensure our infrastructure is fully scaled before the traffic arrives, while still retaining the ability to reactively scale during unpredictable intraday surges.

The Mental Model: Reactive vs. Proactive Scaling

To defeat Scaling Latency, we decouple our autoscaling into two dimensions:

  1. Proactive (Time-Based): Scaling up before the known event (The Opening Bell).
  2. Reactive (Metric-Based): Scaling up during unexpected market volatility.

Core Implementation

We discarded standard HorizontalPodAutoscaler manifests in favor of a KEDA ScaledObject, which natively supports multi-trigger autoscaling (combining Cron schedules with CPU metrics).

Vertical Constraints (Guaranteed QoS)

For high-frequency Go applications, we set requests strictly equal to limits. This assigns the Pod to the Guaranteed QoS class in Kubernetes, preventing the CPU throttling that causes sporadic latency spikes in Go's Garbage Collector.

# deployment.yaml (Snippet)
containers:
  - name: {{ .Chart.Name }}
    image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"

    # Vertical Scaling Bounds: Requests == Limits
    resources:
      requests:
        cpu: "1000m"    # 1 full core requested
        memory: "512Mi" # 512 MiB baseline
      limits:
        cpu: "1000m"    # Strictly capped at 1 core
        memory: "512Mi" # Exceeding this triggers OOMKilled

Proactive Pre-warming & Reactive Scaling (KEDA)

Instead of waiting for the CPU to hit 100% at 9:30 AM, KEDA forces the deployment to scale to a minimum of 15 replicas at 9:25 AM. This completely eliminates Cold Start latency. If an unexpected news event causes traffic to surge at 1:00 PM, the CPU trigger takes over.

# keda-scaledobject.yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: {{ include "lzstock.fullname" . }}-scaler
spec:
  scaleTargetRef:
    name: {{ include "lzstock.fullname" . }}
  minReplicaCount: 3
  maxReplicaCount: 30

  triggers:
    # 1. PROACTIVE: Pre-warm the system 5 minutes before the Opening Bell
    - type: cron
      metadata:
        timezone: America/New_York
        start: 25 9 * * 1-5   # 09:25 Mon-Fri (pre-market warm-up)
        end: 0 16 * * 1-5     # 16:00 Mon-Fri (market close)
        desiredReplicas: "15" # Hold at least 15 replicas during market hours

    # 2. REACTIVE: Handle unexpected intraday volatility
    - type: cpu
      metadata:
        type: Utilization
        value: "70"           # Trigger scale-out if CPU exceeds 70%

Edge Cases & Trade-offs

  • The Cost of Pre-warming vs. Availability: By forcing 15 replicas at 9:25 AM, we are paying AWS for 5 minutes of compute time before the traffic actually arrives. In massive clusters, this "idle buffer" costs real money. However, in financial technology, losing a user's WebSocket connection during the opening bell translates to lost trades and destroyed trust. The architectural decision is clear: We trade a slight increase in infrastructure cost for absolute availability during critical business windows.
  • CPU Throttling vs. Node Packing: If you set limits higher than requests (Burstable QoS), your Go application might burst and consume idle Node CPU. However, Linux cgroups will ruthlessly throttle your container if it exceeds its quota over time, leading to hidden, impossible-to-debug latency spikes in API responses. I enforce requests == limits to ensure predictable latency, explicitly trading node-packing efficiency for absolute performance stability.
  • Scale-Down and Graceful Termination: When the market closes at 4:00 PM, KEDA will drop the desiredReplicas back to the baseline of 3, terminating up to 27 Pods simultaneously. If the stateless service abruptly dies, active users will see 502 Bad Gateway errors. This is why the Graceful Shutdown implementation from our internal Liveness Module (deregistering from NATS and draining HTTP connections) is an absolute prerequisite for safely enabling aggressive K8s scale-in.
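That graceful-shutdown prerequisite can be sketched with the Go standard library alone. The NATS deregistration step appears only as a comment because the internal Liveness Module is not shown here; the handler path, drain timeout, and self-sent SIGTERM are illustrative assumptions for the demo:

```go
package main

import (
	"context"
	"fmt"
	"net/http"
	"os"
	"os/signal"
	"syscall"
	"time"
)

// run starts the HTTP server, waits for SIGTERM (which the kubelet sends
// at scale-in), then drains in-flight connections before returning.
func run() error {
	mux := http.NewServeMux()
	mux.HandleFunc("/quote", func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintln(w, "ok")
	})
	srv := &http.Server{Addr: "127.0.0.1:0", Handler: mux}

	stop := make(chan os.Signal, 1)
	signal.Notify(stop, syscall.SIGTERM, os.Interrupt)

	go func() { _ = srv.ListenAndServe() }()

	// Demo only: simulate Kubernetes sending SIGTERM shortly after startup.
	go func() {
		time.Sleep(200 * time.Millisecond)
		_ = syscall.Kill(os.Getpid(), syscall.SIGTERM)
	}()

	<-stop

	// Step 1: stop accepting new work (in the real service: deregister
	// from NATS, e.g. a conn.Drain(), so no new messages are routed in).

	// Step 2: drain in-flight HTTP/WebSocket connections within the
	// Pod's termination grace period.
	ctx, cancel := context.WithTimeout(context.Background(), 20*time.Second)
	defer cancel()
	return srv.Shutdown(ctx)
}

func main() {
	if err := run(); err != nil {
		fmt.Fprintln(os.Stderr, "forced exit:", err)
		os.Exit(1)
	}
	fmt.Println("drained; exiting cleanly")
}
```

For this to hold in the cluster, terminationGracePeriodSeconds on the Deployment must be at least as long as the drain timeout, or Kubernetes will SIGKILL the Pod mid-drain.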

The Outcome

By replacing naive reactive HPA with KEDA's Proactive Pre-warming schedules, LZStock entirely eliminated the "Scaling Latency" bottleneck. The API Gateway boots and initializes its caches exactly 5 minutes before the opening bell, absorbing 10x traffic spikes with a perfectly flat latency curve, and gracefully contracting at night to minimize AWS costs.