npm -g order, matching peer trees, Doctor on every node, a single merged readiness JSON, failure broadcasts, and explicit rollback bookmarks before canary traffic widens.
Read alongside canary ratios, gateway webhooks, and tenant Doctor merge; here we focus on install order before traffic moves. After binaries match, use the rsync artifact matrix so golden trees stay consistent across regions.
Why rolling upgrades still fail on small Mac clusters
Incidents cluster around split semver across gateways, probe sprawl that hides stale Doctor data, and silent peers that never saw the same npm -g tree. Without rollback bookmarks between npm, launchd, and load balancer steps—and without a shared failure signal—rolling upgrades look healthy while ratios drift.
The fix is boring operations: freeze evidence, upgrade deliberately, prove Doctor on every host, collapse probes, broadcast reds, and only then touch canary percentages using the companion traffic guides linked above.
What recent 2026.4.x notes imply—without the hype
2026.4.x changelogs mention clearer plugin peer handling and registry link fixes when private scopes reorder tarballs, plus refreshed optional image generation defaults (temp paths, GPU hints). Treat both as review items: reinstall on every node and benchmark disk latency on your clustervps hosts instead of assuming mirror parity.
Decision matrix: big-bang night versus rolling canary
Match strategy to how many gateways you can drain and how long tenants can tolerate mixed builds.
| Dimension | Big-bang window | Rolling canary | Operator note |
|---|---|---|---|
| Peer consistency | Simple: everything moves together. | Harder: enforce lockfile parity per hop. | Rolling needs explicit stop rules. |
| Blast radius | All gateways bounce at once. | One node carries risk first. | Pair with LB ratio discipline. |
| Doctor signal | One maintenance banner for users. | Requires merged probes to avoid false green. | Never widen while probes disagree. |
Whatever you pick, store LB weight JSON, openclaw.lock hashes, and composite probe checksums beside each rollback bookmark. Incident reviews go faster when the spreadsheet row—not Slack memory—is the source of truth.
npm -g order that keeps peer graphs honest
- Freeze:
npm ls -g --depth=0,openclaw.lock, LB weights. - Core then plugins:
openclaw@2026.4.xfirst, then first-party plugin bundles; rerunnpm ls -g --depth=1and fail on duplicates. - Restart: bounce launchd only after local Doctor passes (checklist below).
# Example only—pin the exact version your lockfile allows sudo npm install -g openclaw@2026.4.3 sudo npm install -g @openclaw/plugin-bundle@2026.4.3 /usr/local/bin/openclaw doctor --json | /usr/bin/tee /tmp/doctor.post-upgrade.json
Rolling checklist with explicit rollback points
Rolling mode: one node at a time. After each step, confirm merged readiness matches archived good JSON.
- Drain: shift LB weight off the target. Rollback: restore weight snapshot.
- Upgrade: run the npm sequence. Rollback: reinstall prior semver from cache.
- Doctor + restart:
openclaw doctor --jsonper tenant slice; restart gateway; Doctor again warm. Rollback: prior build; restore skill symlink if used. - Probe merge: ship one readiness JSON (Doctor, queue, digest). Rollback: LB off if hash ≠ peers.
- Widen: nudge ratios only after peers acknowledge green. Rollback: weight + symlink together.
- Next node: repeat; avoid adjacent gateways on mismatched cores without a ratio.
Merge probes first, then trust traffic moves
Expose one readiness URL whose JSON merges Doctor, queue snapshots, and digest totals; if on-disk semver ≠ openclaw.lock, fail hard even when webhooks look fine.
#!/usr/bin/env bash
set -euo pipefail
/usr/local/bin/openclaw doctor --json >/tmp/doctor.json
/usr/bin/curl -fsS --max-time 3 http://127.0.0.1:9099/v1/queue-snapshot -o /tmp/queue.json
/usr/bin/curl -fsS --max-time 3 http://127.0.0.1:9099/v1/webhook-digest -o /tmp/digest.json
/usr/bin/python3 - <<'PY'
import hashlib, json, pathlib
parts = [pathlib.Path(p).read_bytes() for p in ("/tmp/doctor.json","/tmp/queue.json","/tmp/digest.json")]
print(json.dumps({"ready_probe_sha256": hashlib.sha256(b"".join(parts)).hexdigest()}))
PY
Hash mismatch versus peers should fire the same failure broadcast and optionally an lb freeze hook before automation widens ratios.
FAQ
Plugins before core? Only if release notes demand it; otherwise peer warnings may hide until runtime.
Image helpers after 2026.4.x? Expect different temp disk churn or GPU scheduling hints; watch APFS during the first canary hour and cross-check capacity with disk watermark guidance if batch jobs share the host.
Cluster size? Three clustervps Macs cover stable, canary, and witness roles.
Stage OpenClaw upgrades on dedicated Mac mini M4 nodes
Before rehearsing 2026.4.x, browse the homepage, help, and purchase flows. More cluster context: canary skills, split DNS, rsync matrix, Mosh vs SSH.