Webhook broadcast playbooks tell you who failed to deliver; they do not tell you which OpenClaw build or skill pack slice should own production traffic. The smallest reproducible cluster discipline is a ratio-driven canary across clustervps Mac gateways, with semver skill directories and one merged probe that already understands Doctor, queues, and digest context.

This HowTo complements—but does not repeat—the notifier-centric story in OpenClaw Multi-AZ Gateways: Probes, Webhooks & Token Rotation. There the control plane is digest broadcast and token overlap. Here the control plane is traffic percentage, immutable skill slices, and promotion gates you can rehearse on three clustervps hosts without guessing which Mac is authoritative.

OpenClaw multi-node publish flow (before any ratio moves)

Treat publish as a contract across gateways, not a single git pull hero moment. Every participating Mac must agree on four artifacts: the openclaw.lock row that pins daemon and CLI hashes, the tenant fragment set described in Tenant Splits, Doctor Merge & Webhook Digests, the skill pack directory that matches the lockfile semver, and the composite readiness route documented below. Promotion order is always: shadow on canary host → Doctor plus merged probe green → LB weight nudge → peer mirror → audit append. Skipping shadow placement is how teams accidentally ship two different skill trees while the load balancer still reports healthy.

  • Authoritative source: Git tag per promotion; refuse launchd reload if the working tree is dirty.
  • Parallel gateways: One writer Mac applies the tag; canary gateway pulls first; stable gateways pull only after ratio gates pass.
  • Workers and notifiers: May trail gateways by one revision, but never lead them—otherwise merged probes lie about skill compatibility.
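#!/usr/bin/env bash
set -euo pipefail
cd /usr/local/share/openclaw-infra
# Refuse to promote from a dirty working tree.
[ -z "$(/usr/bin/git status --porcelain)" ] || { echo "working tree is dirty; aborting" >&2; exit 1; }
/usr/bin/git fetch --tags origin
/usr/bin/git checkout "refs/tags/${PROMOTE_TAG}"
# Record the lockfile hash and the installed build for the audit trail.
/usr/bin/shasum -a 256 openclaw.lock | /usr/bin/tee /tmp/lock.sha
/usr/local/bin/openclaw version --json | /usr/bin/tee /tmp/version.json
# diff exits non-zero on any difference, so set -e stops the script here.
/usr/bin/diff -q openclaw.lock <(ssh canary-gw "cat /usr/local/share/openclaw-infra/openclaw.lock")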

When the diff is empty, you are cleared to touch load-balancer weights. If it is not empty, stop: you are about to split brains across AZs.
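
The rule that workers and notifiers may trail gateways by one revision but never lead them can be checked mechanically before any reload. The sketch below assumes revisions are monotonically increasing integers, such as a build counter. That numbering scheme and the function name worker_rev_ok are illustrative assumptions, not OpenClaw conventions.

```shell
#!/usr/bin/env bash
# Sketch: enforce "workers may trail gateways by one revision, never lead".
# Assumption: revisions are plain increasing integers (hypothetical scheme).

# worker_rev_ok GATEWAY_REV WORKER_REV
# Exits 0 when the worker is in step with the gateway or exactly one behind.
worker_rev_ok() {
  gateway_rev=$1
  worker_rev=$2
  skew=$((gateway_rev - worker_rev))
  # skew 0: in step; skew 1: trails by one; negative: worker leads (refused).
  [ "$skew" -ge 0 ] && [ "$skew" -le 1 ]
}
```

A reload wrapper could call worker_rev_ok with the two revisions and abort on a non-zero exit, so a worker that has jumped ahead never makes the merged probe report skill compatibility it does not have.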

Traffic ratio: canary gates on clustervps multi-AZ gateways

Canary promotion is a sequence of discrete ratios, not vibes. Start by sending roughly five percent of new sessions to the gateway that already carries the shadow skill directory (/var/db/openclaw/skills/next symlinked beside current). Hold that ratio through at least two composite probe intervals—typically ten minutes—while you watch gateway p95 latency, queue depth, and error budgets independent of webhook noise. Ramp with 10–20 point steps only when the merged JSON shows green Doctor output and the digest block is either clean or already classified as benign back-pressure.
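
The ramp rule above can be written down as a small gate function so the next ratio is computed rather than improvised. This is a minimal sketch: the labels green, clean, and benign are hypothetical stand-ins for whatever your merged probe JSON actually reports, and next_ratio is an illustrative name.

```shell
#!/usr/bin/env bash
# Sketch: compute the next canary percentage from the current one.
# The gate labels (green / clean / benign) are hypothetical placeholders.

# next_ratio CURRENT STEP DOCTOR DIGEST
# Prints CURRENT + STEP (capped at 100) when both gates pass,
# otherwise prints CURRENT unchanged so the ratio holds.
next_ratio() {
  current=$1
  step=$2
  doctor=$3
  digest=$4
  # Gate 1: Doctor output must be green.
  if [ "$doctor" != "green" ]; then
    echo "$current"
    return 0
  fi
  # Gate 2: digest must be clean or already classified as benign.
  case "$digest" in
    clean|benign) ;;
    *) echo "$current"; return 0 ;;
  esac
  next=$((current + step))
  if [ "$next" -gt 100 ]; then
    next=100
  fi
  echo "$next"
}
```

For example, next_ratio 5 15 green clean prints 20, while the same call with a yellow Doctor prints 5 and the canary stays where it is.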

Keep a spreadsheet row per step: timestamp, weight snapshot, skill semver, and the ticket. That row becomes your rollback compass when someone asks what changed between lunch and the incident.

Document the exact API or CLI your load balancer vendor uses to shift weights; clustervps operators often front Mac gateways with anycast or GeoDNS, so reproduce the vendor calls in a shell function you can paste into chat during an incident.
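
One possible shape for that paste-ready function is shown below. It only validates the percentage and prints the command it would run. The lbctl command, its flags, and the pool names canary and stable are placeholders invented for illustration, so substitute your vendor's real CLI or API call.

```shell
#!/usr/bin/env bash
# Sketch: dry-run wrapper for a load-balancer weight shift.
# "lbctl", its flags, and the pool names are hypothetical placeholders.

# lb_set_canary_weight PERCENT
# Prints the vendor call for PERCENT canary traffic; the stable pool gets the rest.
lb_set_canary_weight() {
  pct=$1
  # Accept only plain integers without leading zeros.
  case "$pct" in
    ''|*[!0-9]*|0?*)
      echo "weight must be an integer 0-100" >&2
      return 2
      ;;
  esac
  if [ "$pct" -gt 100 ]; then
    echo "weight must be an integer 0-100" >&2
    return 2
  fi
  stable=$((100 - pct))
  echo "lbctl set-weights --pool canary=${pct} --pool stable=${stable}"
}
```

Printing instead of executing keeps the function safe to paste into chat; once the output looks right, the printed line can be run by hand or piped to a shell.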

Per-tenant configuration fragments (skill slice boundaries)

Skill pack version slicing fails when tenant fragments disagree about include order. Mirror /etc/openclaw/tenants/<tenant>/skills.d/ on every gateway before you symlink next to a new semver folder. Each fragment should declare only the slice boundary (allowed tool namespaces, model profiles, and disk quotas), not secrets. Mount secrets from /var/db/openclaw/secrets/<tenant> with restrictive file ACLs (macOS sets these with chmod +a rather than POSIX setfacl) so your canary host can validate parsing without leaking tokens into CI logs.

# On each gateway after git tag checkout
sudo install -d -o root -g wheel /etc/openclaw/tenants/acme/skills.d
sudo /usr/local/bin/openclaw config lint --tenant acme
sudo /bin/ln -sfn "/var/db/openclaw/skills/1.4.2" /var/db/openclaw/skills/next
sudo launchctl kickstart -k system/com.openclaw.gateway

Canary tenants can point skills.d/10-canary.yaml at next while production tenants stay pinned to current until audit approval—this is the minimal per-tenant slice without forking entire clusters.
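
Because disagreement about include order is the failure mode named earlier in this section, it helps to print that order in a form that can be compared across gateways. The sketch assumes fragments are *.yaml files read in byte-wise filename order. That ordering rule and the name fragment_order are assumptions to verify against your build.

```shell
#!/usr/bin/env bash
# Sketch: list skills.d fragments in a deterministic order for comparison.
# Assumption: fragments are *.yaml files included in byte-wise name order.

# fragment_order DIR
# Prints one fragment basename per line, sorted independently of the host locale.
fragment_order() (
  LC_ALL=C
  export LC_ALL
  cd "$1" || exit 1
  for f in *.yaml; do
    # Skip the literal pattern when the directory holds no fragments.
    [ -e "$f" ] && printf '%s\n' "$f"
  done
  exit 0
)
```

Run it on each gateway against /etc/openclaw/tenants/<tenant>/skills.d and compare the outputs; any difference means the mirrors have drifted before the symlink moves.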

Merging health probes: Doctor, queues, digest, and skill semver

Load balancers still call one URL, but the handler must now answer whether the declared skill semver on disk matches openclaw.lock, whether Doctor is green for the tenants that received traffic in the last interval, and whether queue depth stayed inside SLO. Webhook digest rows remain useful, yet they are a field inside the merged payload—not the promotion trigger. That distinction keeps this article orthogonal to pure broadcast architectures.

#!/usr/bin/env bash
set -euo pipefail
TENANT_CANARY="${TENANT_CANARY:-acme}"
SKILL_PATH="/var/db/openclaw/skills/current"
/usr/bin/readlink "${SKILL_PATH}" | /usr/bin/tee /tmp/skill_path.txt
/usr/local/bin/openclaw doctor --tenant "${TENANT_CANARY}" --json >/tmp/doctor.json
/usr/bin/curl -fsS --max-time 3 "http://127.0.0.1:9099/v1/webhook-digest" -o /tmp/digest.json
/usr/bin/python3 - <<'PY'
import hashlib, json, pathlib
blob = pathlib.Path("/tmp/doctor.json").read_bytes() + pathlib.Path("/tmp/digest.json").read_bytes()
print(json.dumps({"probe_sha256": hashlib.sha256(blob).hexdigest(),"skill_resolved": pathlib.Path("/tmp/skill_path.txt").read_text().strip()}))
PY
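
The script above records the resolved skill path but does not compare it with the lockfile. A minimal sketch of that comparison follows. It assumes the semver is the last component of the resolved directory, as in /var/db/openclaw/skills/1.4.2, and takes the lockfile's semver as an argument because the openclaw.lock format is not shown here. Both function names are illustrative.

```shell
#!/usr/bin/env bash
# Sketch: check that the resolved skill directory matches the pinned semver.
# Assumption: the directory name is the semver; the lock value is passed in.

# skill_semver_from_path PATH
# Prints the last path component, ignoring one trailing slash.
skill_semver_from_path() {
  p=${1%/}
  printf '%s\n' "${p##*/}"
}

# semver_matches RESOLVED_PATH LOCK_SEMVER
# Exits 0 only when the directory name equals the lockfile semver exactly.
semver_matches() {
  [ "$(skill_semver_from_path "$1")" = "$2" ]
}
```

Feeding it the contents of /tmp/skill_path.txt and the semver read from your lockfile gives the probe a yes or no answer it can fold into the merged payload.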

Emit status: degraded when Doctor is yellow but semver and digest are acceptable, so traffic keeps moving while dashboards scream. Emit hard failure when semver mismatches the lockfile—no amount of clean webhooks should promote that build.
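
Those two rules can be collapsed into one decision function. In the sketch below only the degraded and semver-mismatch cases come from the text above. Treating every other combination as a failure is a conservative assumption, and the input labels (green, yellow, yes, no) are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: map probe inputs to a single status string.
# Input labels are hypothetical; unlisted combinations fail closed (assumption).

# probe_status DOCTOR SEMVER_MATCH DIGEST_OK
# DOCTOR: green|yellow|other   SEMVER_MATCH, DIGEST_OK: yes|no
probe_status() {
  doctor=$1
  semver_match=$2
  digest_ok=$3
  # A lockfile mismatch is a hard failure no matter how clean the rest looks.
  if [ "$semver_match" != "yes" ]; then
    echo "fail"
    return 0
  fi
  if [ "$digest_ok" = "yes" ] && [ "$doctor" = "green" ]; then
    echo "ok"
  elif [ "$digest_ok" = "yes" ] && [ "$doctor" = "yellow" ]; then
    # Traffic keeps moving while dashboards flag the yellow Doctor.
    echo "degraded"
  else
    echo "fail"
  fi
}
```

Keeping the semver check first means a clean digest can never mask a build that does not match the lockfile.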

Rollback: restore traffic, symlinks, and launchd in one breath

Rollback is two levers: LB weights return to the pre-canary snapshot, and /var/db/openclaw/skills/current returns to the previous semver directory. Do not partially roll back: moving weights without restoring the symlink leaves you running old traffic against new skills, or the opposite nightmare. After both levers move, kick launchd once and wait for merged probe JSON to match the archived good snapshot from your audit file.

#!/usr/bin/env bash
set -euo pipefail
/usr/bin/scp stable-gw:/var/db/openclaw/audit/last_good_weights.json /tmp/weights.json
./lb_restore_weights.sh /tmp/weights.json
sudo /bin/ln -sfn "/var/db/openclaw/skills/${ROLLBACK_SEMVER}" /var/db/openclaw/skills/current
sudo launchctl kickstart -k system/com.openclaw.gateway
/usr/bin/curl -fsS http://127.0.0.1:8088/readyz | /usr/bin/jq .
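
The rollback script prints the readiness payload once, while the procedure calls for waiting until the probe matches the archived snapshot. A bounded polling loop is one way to sketch that wait. The function name wait_for_match and the choice to compare a single string value are assumptions for illustration.

```shell
#!/usr/bin/env bash
# Sketch: poll a command until its output equals an expected value.
# Comparing one string (for example the resolved skill path) is an assumption.

# wait_for_match EXPECTED ATTEMPTS DELAY_SECONDS COMMAND [ARGS...]
# Exits 0 on the first match, 1 after ATTEMPTS tries without one.
wait_for_match() {
  expected=$1
  attempts=$2
  delay=$3
  shift 3
  i=0
  while [ "$i" -lt "$attempts" ]; do
    # A failing command counts as a non-match rather than aborting the loop.
    actual=$("$@" 2>/dev/null) || actual=""
    if [ "$actual" = "$expected" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$delay"
  done
  return 1
}
```

The bounded attempt count matters during an incident: a rollback that never converges should surface as a failure instead of hanging the terminal.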

Pair rollback drills with artifact hygiene from the cross-region artifact matrix so you are not racing rsync while the load balancer is still pointed at the wrong AZ.

Audit: append-only JSONL your security team can grep

Every ratio step should append one JSON line to /var/db/openclaw/audit/promotions.jsonl on the writer gateway, replicated to object storage nightly. Capture actor, ticket, previous and next skill semver, the LB weight map, and the probe_sha256 emitted by the merged probe script. Security reviewers care less about webhooks than they do about proof that humans approved binary motion.

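# printf fills each %s in order, so the JSON quotes survive the shell.
# Assumes the variables contain no double quotes or backslashes.
/usr/bin/printf '{"ts":"%s","actor":"%s","ticket":"%s","from":"1.4.1","to":"1.4.2","weights":"%s","probe_sha256":"%s"}\n' \
  "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "${USER}" "${TICKET}" "${WEIGHT_BLOB}" "${PROBE_SHA}" \
  | /usr/bin/tee -a /var/db/openclaw/audit/promotions.jsonl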

When auditors ask what changed, you answer with jq filters, not Slack scrollback.
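
For hosts where jq is not installed, the grep promised in the section title is enough for simple lookups. The sketch below assumes rows are written in the compact form used by the append command, with no whitespace between key and value. The function name audit_by_ticket is illustrative.

```shell
#!/usr/bin/env bash
# Sketch: find promotion rows for one ticket in an append-only JSONL file.
# Assumption: rows use compact JSON ("ticket":"ID" with no spaces).

# audit_by_ticket FILE TICKET
# Prints every row whose ticket field equals TICKET; exits 1 when none match.
audit_by_ticket() {
  # -F treats the pattern as a fixed string, so ticket IDs are never regexes.
  grep -F "\"ticket\":\"$2\"" "$1"
}
```

Including the closing quote in the pattern keeps a search for one ticket ID from also matching longer IDs that share its prefix.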

FAQ: canary versus webhook-first operations

Do I still need the notifier digest? Yes—surface it inside the merged probe so operators see systemic partner outages. Just do not use digest health as the sole reason to widen traffic; semver and Doctor must agree first.

What if only one tenant wants the new skills? Keep their fragments on the canary gateway and pin LB sticky cookies or headers for that tenant cohort to the canary pool while everyone else stays on stable weights.

How many Mac nodes minimum? Three clustervps hosts remain the smallest honest rehearsal: stable AZ-A, stable AZ-B, and a canary gateway that accepts ratio traffic before peers mirror.

Operational guidance only. OpenClaw deployment details vary by release; validate flags against your installed build. Load-balancer APIs differ by vendor; treat shell snippets as patterns, not drop-in secrets.
