Platform engineers running multi-AZ OpenClaw gateways on clustervps need a canary path that Flagger can score without drowning Mac nodes in webhook retries. This guide delivers a minimal repro: version slices and skill-pack locks, Flagger AnalysisRun thresholds with merged gateway probes, failure-summary broadcast with token rotation, rsync build-lock parameters, and a rollback FAQ—plus links to our Flux webhook canary and Argo Rollouts AnalysisRun walkthroughs when you compare controllers.

Pain points before the first Flagger promotion

Canary math on Kubernetes is only half the story. Gateway Macs still serve OpenClaw skills, tail logs, and promote artifacts while Flagger shifts traffic.

  • Drifting gateway slices: one AZ runs a newer skill pack while Flagger measures another—merged probes look healthy while users hit mismatched behavior.
  • Split probe surfaces: process health returns green while canary 5xx counters climb on the gateway VIP Flagger never calls.
  • Webhook storms: tight measurement intervals plus unbounded retries stampede every clustervps node behind the load balancer.

Multi-node gateway version slices and skill-pack locks

Pin an immutable gateway_version label per AZ and lock OpenClaw skill bundles with a content hash before Flagger raises weight. Reuse per-node config fragments from fragment merge workflows and traffic ratios from multi-AZ canary skills so every gateway in the measurement pool returns comparable JSON.

Route Flagger metric checks to a canary-tagged hostname documented in multi-AZ gateway webhooks, not the stable pool operators use for day-to-day SSH. Return to home for regional node maps when you add a fourth gateway Mac.

AnalysisRun metric thresholds and probe merge

Flagger evaluates one provider response per interval. Merge gateway process health, canary error rates, queue depth, and optional Doctor fields into a single document—same discipline as Rollouts in our Argo Rollouts probe guide, but wired through Flagger MetricTemplate webhooks instead.

SignalStarter thresholdFail when
Canary 5xx rate≤ 0.5% over five minutesTwo consecutive windows above ceiling.
Gateway p99 latency≤ 220 ms on canary VIPRegression > 15% vs stable baseline.
Queue depth≤ 12 pending jobsSustained growth while weight increases.
Disk yellow gateAPFS < 78% on 2TB nodesAny gateway crosses yellow during analysis.
degraded flagHTTP 200 with explicit booleandegraded: true fails closed even at 200.
{
  "status": "healthy",
  "flagger": "payments-gateway",
  "canary": { "5xx_rate": 0.003, "p99_ms": 164 },
  "gateway": { "disk_ok": true, "queue_depth": 4, "skill_hash": "9f2a…" },
  "degraded": false
}

Webhook failure-summary broadcast and token rotation

Mount dual bearer secrets with overlap for at least one full Flagger analysis window. Log verification failures with namespace, Canary name, and measurement index. On non-success classifications, batch a digest to the notifier Mac using the pattern in cluster logs and webhook digests—operators read one summary while Kubernetes retries measurements.

  • Primary token: Flagger MetricTemplate webhook header.
  • Overlap token: accepted for seven days after rotation.
  • Retry cap: three gateway attempts with jitter; Flagger interval ≥ 60s during Mac maintenance.

Artifact rsync and build-lock parameters during canary

Nothing confuses a canary faster than promoting binaries mid-analysis. Hold rsync behind flock and cap bandwidth per the artifact rsync matrix.

LOCK=/var/tmp/openclaw-promote-${SKILL_HASH}.lock
flock -n "$LOCK" ionice -c2 -n4 rsync -az --delete-delay \
  --bwlimit=28000 --timeout=300 \
  "${GOLDEN}:/artifacts/" "${LOCAL_ROOT}/"

Minimal reproducible rollout (six steps)

  1. Install Flagger and confirm the Canary CRD targets your gateway Service—not Argo CD sync hooks.
  2. Expose /flagger/metrics on a canary-tagged OpenClaw gateway Mac with mTLS or bearer auth.
  3. Register a MetricTemplate webhook pointing at that URL; return merged JSON with explicit thresholds.
  4. Lock skill packs and pause rsync promotions until analysis completes or aborts.
  5. Enable failure broadcast to your notifier path; rehearse rollback weights before production.
  6. Validate parity by curling the endpoint from a bastion while Flagger runs a dry-run canary at five percent weight.

Canary rollback FAQ

Flagger vs Flux vs Rollouts? Pick one upstream caller per measurement URL. Compare GitOps canaries in our Flux canary walkthrough and Rollouts AnalysisRun notes above—do not double-fire the same gateway handler.

When to abort? Two failed metric windows, degraded: true, or disk yellow on any gateway Mac. Revert Flagger weight, restore stable skill hash, and release rsync locks.

Doctor still failing while metrics pass? Treat Doctor as a secondary field inside merged JSON; see Doctor deep checks before widening traffic.

Citable guardrails

  • Measurement contract: one merged JSON schema versioned in Git per gateway fleet.
  • Token overlap: minimum one Flagger analysis interval plus five minutes.
  • Promotion freeze: no delete-heavy rsync while Canary status is Progressing.
Operational guidance only. Flagger and OpenClaw APIs evolve; validate CRD fields and webhook payloads against your installed versions before production.
Multi-node OpenClaw on clustervps

Provision gateway Macs for Flagger canaries

Read the Flux canary walkthrough or Argo Rollouts probe guide, then open purchase to add multi-AZ Mac mini M4 gateways with SSH/VNC access.

Get multi-node cluster capacity View cluster pricing