Pain points before the first KEDA scale event
KEDA answers how many gateway workers to run; it does not replace canary judgment on OpenClaw Macs that still tail logs and promote artifacts.
- Queue-only scaling: replicas climb while a hot canary slice still serves the wrong skill hash.
- Split metrics: HPA sees low CPU while gateway 5xx rates rise on the VIP your webhook never scores.
- Webhook storms: ScaledObject polling plus unbounded retries stampede every clustervps node behind the load balancer.
KEDA ScaledObject triggers and the build queue
Point the ScaledObject trigger at the same depth signal your OpenClaw build lane exports—Redis list length, NATS pending messages, or a Prometheus gauge from the coordinator on a clustervps gateway Mac. Cap maxReplicaCount during canary so scale-out cannot outrun probe budgets.
| Trigger | Starter value | Canary note |
|---|---|---|
| Queue depth | Scale at ≥ 8 pending jobs | Hold max replicas at +1 until the merged JSON passes. |
| Cooldown | 120s scale-down delay | Prevents thrash when webhooks retry during analysis. |
| minReplicaCount | 1 stable + 0 canary workers | Canary workers use a separate Deployment label. |
| Activation | 0 → idle gateways sleep | Wake only the AZ slice under test. |
Align promotion locks with Nomad build-lock patterns so rsync never runs while KEDA scales the canary Deployment.
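A minimal sketch of how a queue-depth trigger and a canary cap interact, assuming the "8 pending jobs" starter value is treated as a per-replica target. This is a simplified model of average-value scaling, not KEDA's own code; `desired_replicas` and `effective_max` are hypothetical helper names.

```python
import math


def effective_max(stable_max: int, canary_cap: int, canary_active: bool) -> int:
    """Use the tighter canary cap while a canary is running (assumed policy)."""
    return min(stable_max, canary_cap) if canary_active else stable_max


def desired_replicas(queue_depth: int, jobs_per_replica: int,
                     min_replicas: int, max_replicas: int) -> int:
    """Simplified target: ceil(depth / per-replica target), clamped to [min, max]."""
    if jobs_per_replica <= 0:
        raise ValueError("jobs_per_replica must be positive")
    raw = math.ceil(queue_depth / jobs_per_replica)
    return max(min_replicas, min(raw, max_replicas))


if __name__ == "__main__":
    # 40 pending jobs would ask for 5 workers, but the canary cap holds it at 2.
    cap = effective_max(stable_max=10, canary_cap=2, canary_active=True)
    print(desired_replicas(40, 8, 1, cap))
```

The point of the clamp is that queue pressure alone never widens the canary: the cap only lifts once `canary_active` is cleared.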
Multi-AZ gateway canary slices
Pin gateway_version and az labels per Mac. Route canary webhooks to a tagged hostname from multi-AZ gateway webhooks, not the stable pool operators use for SSH. Reuse traffic ratios from multi-AZ canary skills and per-node fragments from fragment merge workflows. Pick regions on home before adding a fourth gateway node.
Metric probes: latency and error-rate thresholds
Return one merged JSON document per ScaledObject polling interval—same discipline as Rollouts or Flagger, but scored by your gateway webhook before KEDA raises replicas.
| Signal | Starter threshold | Fail when |
|---|---|---|
| Canary 5xx rate | ≤ 0.5% over five minutes | Two consecutive windows above ceiling. |
| Gateway p99 latency | ≤ 220 ms on canary VIP | Regression > 15% vs stable baseline. |
| Queue depth | ≤ 12 pending jobs | Depth grows while canary weight increases. |
| degraded flag | HTTP 200 with explicit boolean | degraded: true fails closed. |
```json
{
  "status": "healthy",
  "keda": "openclaw-build-lane",
  "canary": { "5xx_rate": 0.003, "p99_ms": 158 },
  "gateway": { "disk_ok": true, "queue_depth": 5, "skill_hash": "c4e1…" },
  "degraded": false
}
```
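The thresholds table and the sample document above can be scored with a few lines of Python. This is a sketch under stated assumptions: `5xx_rate` is read as a fraction (so 0.5% is 0.005), missing fields fail closed, and the function names `window_failures` and `fails_5xx` are hypothetical.

```python
from typing import List, Optional

# Starter thresholds from the table; the fraction encoding is an assumption.
THRESHOLDS = {"5xx_rate": 0.005, "p99_ms": 220.0,
              "queue_depth": 12, "p99_regression": 0.15}


def window_failures(doc: dict, stable_p99_ms: Optional[float] = None) -> List[str]:
    """Return the names of every threshold breached in one merged probe document."""
    failures = []
    # Fail closed: anything other than an explicit `false` counts as degraded.
    if doc.get("degraded") is not False:
        failures.append("degraded")
    canary = doc.get("canary", {})
    gateway = doc.get("gateway", {})
    if canary.get("5xx_rate", 1.0) > THRESHOLDS["5xx_rate"]:
        failures.append("5xx_rate")
    p99 = canary.get("p99_ms", float("inf"))
    if p99 > THRESHOLDS["p99_ms"]:
        failures.append("p99_ms")
    # Regression check only runs when a stable baseline is supplied.
    if stable_p99_ms and p99 > stable_p99_ms * (1 + THRESHOLDS["p99_regression"]):
        failures.append("p99_regression")
    if gateway.get("queue_depth", float("inf")) > THRESHOLDS["queue_depth"]:
        failures.append("queue_depth")
    if gateway.get("disk_ok") is not True:
        failures.append("disk_ok")
    return failures


def fails_5xx(history: List[List[str]]) -> bool:
    """True when two consecutive windows breached the 5xx ceiling."""
    flags = ["5xx_rate" in window for window in history]
    return any(a and b for a, b in zip(flags, flags[1:]))
```

Keeping the per-window check separate from the two-window rule lets the webhook store only a short history of failure lists rather than raw probe documents.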
Webhook failure-summary broadcast
Mount dual bearer secrets with overlap for at least one full KEDA polling window. On non-success classifications, batch a digest to the notifier Mac using cluster logs and webhook digests—operators read one summary while the scaler retries.
- Primary token: ScaledObject custom metric / gateway webhook header.
- Overlap token: accepted seven days after rotation.
- Retry cap: three gateway attempts with jitter; polling interval ≥ 60s during maintenance.
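The token overlap and retry cap in the list above might look like the following sketch. It assumes tokens are plain strings, that "three gateway attempts" means two waits between attempts, and that jitter is applied to an exponential backoff. The base delay and all function names are illustrative, not taken from the source.

```python
import hmac
import random
from datetime import datetime, timedelta, timezone
from typing import Callable, List, Optional

OVERLAP = timedelta(days=7)   # overlap token accepted seven days after rotation
MAX_ATTEMPTS = 3              # retry cap from the list above


def token_accepted(presented: str, primary: str, previous: Optional[str],
                   rotated_at: datetime, now: datetime) -> bool:
    """Accept the primary token, or the previous one while the overlap is open."""
    if hmac.compare_digest(presented.encode(), primary.encode()):
        return True
    if previous is None or now - rotated_at > OVERLAP:
        return False
    return hmac.compare_digest(presented.encode(), previous.encode())


def retry_delays(base_s: float = 2.0, attempts: int = MAX_ATTEMPTS,
                 rng: Callable[[], float] = random.random) -> List[float]:
    """Waits between attempts: random jitter over an exponentially growing ceiling."""
    return [rng() * base_s * (2 ** i) for i in range(attempts - 1)]


if __name__ == "__main__":
    rotated = datetime(2025, 1, 1, tzinfo=timezone.utc)
    print(token_accepted("old", "new", "old", rotated, rotated + timedelta(days=3)))
    print(retry_delays())
```

Drawing each delay from the full range up to the ceiling spreads retries from many gateways apart, which is what keeps a retry burst from lining up with the next polling tick.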
Skill-pack version lock and rollback
Freeze skill-pack hashes while canary_active=true. On abort, revert hash, set KEDA maxReplicaCount to the stable lane value, and release rsync flock locks per the artifact rsync matrix. Doctor failures still fail the merged JSON—see Doctor deep checks before widening traffic.
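As a rough model of the lock and abort sequence, the sketch below treats lane state as an immutable record. The `LaneState` fields and both function names are hypothetical. A real implementation would patch the ScaledObject and release actual flock handles rather than flip booleans.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class LaneState:
    canary_active: bool
    skill_hash: str
    stable_skill_hash: str
    max_replicas: int
    stable_max_replicas: int
    rsync_locked: bool


def set_skill_hash(state: LaneState, new_hash: str) -> LaneState:
    """Refuse any skill-pack change while the canary is running."""
    if state.canary_active:
        raise RuntimeError("skill-pack hash is frozen while canary_active=true")
    return replace(state, skill_hash=new_hash)


def abort_canary(state: LaneState) -> LaneState:
    """Revert the hash, restore the stable replica cap, release the rsync lock."""
    return replace(
        state,
        canary_active=False,
        skill_hash=state.stable_skill_hash,
        max_replicas=state.stable_max_replicas,
        rsync_locked=False,
    )
```

Returning a new record instead of mutating in place makes it easy to log the before and after state of an abort in the failure digest.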
1TB / 2TB disk watermarks on gateway Macs
Include disk_ok in merged probes. Fail closed when APFS crosses yellow gates during scale-out.
| Tier | Yellow gate | Action |
|---|---|---|
| 1TB gateway | ≥ 82% used | Block KEDA scale-up; broadcast digest. |
| 2TB gateway | ≥ 78% used | Same; allow stable lane only. |
| Red gate | ≥ 90% either tier | Scale to min replicas; drain canary VIP. |
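The watermark table maps directly onto a small classifier. The sketch below assumes the tier is known per gateway and that `shutil.disk_usage` is an acceptable proxy for APFS usage. Shared APFS containers can make that figure differ from what other tools report, so treat the reading as approximate. The names `disk_gate`, `used_fraction`, and `disk_ok` as a function are illustrative.

```python
import shutil

YELLOW = {"1TB": 0.82, "2TB": 0.78}  # yellow gates per tier, from the table
RED = 0.90                           # red gate for either tier


def disk_gate(used: float, tier: str) -> str:
    """Classify a used-space fraction as green, yellow, or red for a tier."""
    if used >= RED:
        return "red"
    if used >= YELLOW[tier]:
        return "yellow"
    return "green"


def used_fraction(path: str = "/") -> float:
    """Fraction of the volume at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total


def disk_ok(tier: str, path: str = "/") -> bool:
    """Value for the merged probe: only green passes, so yellow fails closed."""
    return disk_gate(used_fraction(path), tier) == "green"


if __name__ == "__main__":
    print(disk_gate(0.80, "1TB"), disk_gate(0.80, "2TB"))
```

The same 80% reading is green on a 1TB gateway and yellow on a 2TB one, which is why the tier has to travel with the measurement.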
KEDA vs Flagger vs Argo Rollouts (what this guide adds)
Flagger and Rollouts shift traffic weights on a mostly fixed replica count. KEDA shifts capacity when queues or custom metrics demand it—ideal for burst builds on clustervps parallel Mac lanes.
- KEDA (this article): ScaledObject triggers, queue coupling, scale caps during canary.
- Flagger: Canary CRD + analysis webhooks; see our Flagger guide.
- Argo Rollouts: AnalysisRun on Rollouts objects—not Argo CD—see Rollouts probes.
- Flux: GitOps image automation—see Flux canary walkthrough.
Pick one upstream caller per gateway measurement URL. Never double-fire the same handler from KEDA scale events and a Rollouts AnalysisRun in the same minute.
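One way to enforce the single-caller rule on the gateway side is a small guard keyed by measurement URL. This is a sketch, assuming callers identify themselves (for example by token or header) and that a 60-second window matches "the same minute". `CallerGuard` is a hypothetical name.

```python
import time
from typing import Callable, Dict, Tuple


class CallerGuard:
    """Rejects a different caller hitting the same URL inside the window."""

    def __init__(self, window_s: float = 60.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.window_s = window_s
        self.clock = clock
        self._last: Dict[str, Tuple[str, float]] = {}  # url -> (caller, time)

    def allow(self, url: str, caller: str) -> bool:
        now = self.clock()
        prev = self._last.get(url)
        # Block only when another caller used this URL within the window.
        if prev and prev[0] != caller and now - prev[1] < self.window_s:
            return False
        self._last[url] = (caller, now)
        return True


if __name__ == "__main__":
    guard = CallerGuard()
    print(guard.allow("/keda/metrics", "keda"))      # first caller owns the URL
    print(guard.allow("/keda/metrics", "rollouts"))  # second caller is refused
```

The guard lets the owning caller keep polling while refusing a second system, so a misconfigured AnalysisRun shows up as rejected requests instead of doubled handler load.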
Minimal reproducible rollout (seven steps)
- Install KEDA and confirm ScaledObject targets your OpenClaw worker Deployment—not the gateway DaemonSet.
- Wire the trigger to build-queue depth with cooldown and a canary-specific max replica cap.
- Expose `/keda/metrics` on a canary-tagged gateway Mac with bearer auth.
- Return merged JSON with latency, error rate, queue depth, disk_ok, and skill_hash.
- Lock skill packs and pause rsync until probes pass or abort.
- Enable failure broadcast to your notifier path; rehearse scale-down on degraded true.
- Validate by curling the endpoint from a bastion while KEDA holds replicas at the canary cap.
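The final validation step can also be scripted from the bastion instead of using curl. The sketch below assumes the endpoint returns the merged JSON layout used earlier in this guide, with fields nested under `canary` and `gateway`. The hostname in the usage example and the helper names are placeholders.

```python
import json
import urllib.request
from typing import List

# Required fields, grouped as in the sample merged document (assumed layout).
REQUIRED = {
    "canary": ("5xx_rate", "p99_ms"),
    "gateway": ("disk_ok", "queue_depth", "skill_hash"),
}


def build_request(base_url: str, token: str) -> urllib.request.Request:
    """Build the bearer-authenticated GET for the metrics endpoint."""
    url = base_url.rstrip("/") + "/keda/metrics"
    return urllib.request.Request(url, headers={"Authorization": f"Bearer {token}"})


def missing_fields(doc: dict) -> List[str]:
    """List required fields absent from a merged probe document."""
    missing = [f"{section}.{key}"
               for section, keys in REQUIRED.items()
               for key in keys
               if key not in doc.get(section, {})]
    if not isinstance(doc.get("degraded"), bool):
        missing.append("degraded")
    return missing


def fetch_probe(base_url: str, token: str, timeout_s: float = 5.0) -> dict:
    """Fetch and decode the merged JSON; raises on HTTP or network errors."""
    with urllib.request.urlopen(build_request(base_url, token), timeout=timeout_s) as resp:
        return json.load(resp)
```

A call such as `missing_fields(fetch_probe("https://canary-gw.example.internal", token))` should return an empty list before the canary cap is lifted.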
Citable guardrails
- Measurement contract: one merged JSON schema versioned in Git per gateway fleet.
- Scale freeze: no KEDA scale-up while `degraded: true` for two polling windows.
- Promotion freeze: no delete-heavy rsync while canary_active is true.
Wire KEDA canaries on a multi-node OpenClaw fleet
Compare Flagger and Nomad build locks, then start from home or purchase to provision parallel Mac mini M4 gateways with SSH/VNC access.