What breaks when slices, logs, and webhooks diverge
This HowTo targets two or three dedicated Mac gateways plus one notifier or aggregator role. Without a shared merge contract, each host quietly picks a different overlay order, log rotation script, and retry directory layout. Webhook providers then hammer every gateway in parallel while Slack fills with duplicate stack traces.
- Multi-path slices: Separate append-only roots for gateway runtime, webhook retry envelopes, and audit JSONL so one noisy path cannot starve the others.
- Quota reality: Treat bytes and inode pressure as first-class; APFS can look roomy while millions of tiny files stop writers cold.
- Digest broadcast: A single notifier batches five-minute failure windows with AZ labels so humans subscribe to summaries, not raw retries.
Layer this guide on top of the fragment merge and workflow isolation HowTo, the multi-AZ gateway and webhook playbook, and the canary skills and probe merge guide when you promote traffic across regions.
Multi-node merge, in this context, means every clustervps Mac renders the same effective OpenClaw surface before any log line or webhook retry is trusted. Operators often skip that step and jump straight to “more disk,” which hides the real bug: host A retained an experimental slice that rewrites log paths or digest endpoints. Treat merge verification as a gate the same way you would treat a schema migration—no traffic shift until hashes match, and no notifier promotion until both gateways point at the same upstream digest topic names.
Lab prerequisites on clustervps
| Ingredient | Why it matters | Pass criteria |
|---|---|---|
| Deterministic merge stack | Every Mac emits identical effective config after overlays. | SHA-256 of merged JSON matches across nodes pre-cutover. |
| Reserved log partition | Gateway spikes cannot evict user home data. | If logs share the OS volume, set softer quotas and faster rotation. |
| Notifier Mac or container | Digest jobs stay off latency-sensitive TLS workers. | Digest lag under one interval at p95 during failure drills. |
Pair capacity choices with the public Mac plan catalog so log retention math stays honest before you sign anything.
HowTo: minimal reproducible merge, quotas, and digest broadcast
- Freeze merge order. Inventory every `gateway.d`, environment overlay, and host suffix. Apply the same stack on each node, render merged output to a temp path, compare hashes, and block promotion on mismatch.
- Carve log slices. Create `var/log/openclaw/gateway`, `webhook`, and `audit` roots with one primary writer per tree. Symlink active files to dated rotations so tailers never chase renames mid-write.
- Wire quotas and rotation. Daily rollover at a fixed minute, compress archives older than seven days, and alert when total log bytes exceed roughly seventy-five percent of the reserved budget, or when inode usage crosses the same ratio on `df -i`.
- Stand up digest ingest. Gateways emit structured failure envelopes (status family, correlation ID, AZ, retry count) to a stream or spool readable by the notifier. The notifier deduplicates per upstream partner, caps body size, and emits one Slack or email card per five-minute window.
- Silent-failure guard. If the digest publisher cannot deliver, increment a counter file on disk and mirror the metric to your dashboard so “quiet” never means “healthy.”
- Validate end-to-end. Inject a synthetic webhook failure on canary weight, confirm exactly one digest with the right AZ label, restore success, and snapshot `du -sh` per slice plus `df -h` / `df -i` baselines for the runbook.
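The five-minute digest window from the ingest step can be illustrated with a short roll-up. This is a sketch under stated assumptions, not the notifier's actual logic: the envelope format (tab-separated epoch seconds, partner, AZ, status family) and the function name `digest_windows` are hypothetical.

```sh
# Hypothetical roll-up: envelopes arrive on stdin as "epoch<TAB>partner<TAB>az<TAB>status".
# Prints one line per (window start, partner, AZ) with the number of failures in that window.
digest_windows() {
  awk -F '\t' -v w=300 '
    {
      start = sprintf("%.0f", $1 - ($1 % w))   # floor the timestamp to a 5-minute boundary
      count[start "\t" $2 "\t" $3]++
    }
    END { for (k in count) print k "\t" count[k] }
  ' | LC_ALL=C sort
}
```

Feeding it the spool tail yields one summary row per partner and AZ for each window, which is the shape a single Slack or email card would need.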
When you wire quotas, prefer hard ceilings expressed in the same units your finance team understands—gigabytes per day per slice—then map those numbers to rotation frequency. A common mistake is compressing archives aggressively while leaving the active file unbounded; the hot file grows until OpenClaw blocks during a traffic spike. Pair byte ceilings with line-rate alarms: if a gateway emits more than N megabytes of JSONL per minute for longer than five minutes, page someone before the disk does.
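A byte ceiling per slice can be checked with plain `du`. The sketch below assumes budgets are expressed in KiB and uses the seventy-five percent alert ratio from the steps above as the default; the function names `slice_kib`, `over_budget`, and `check_slice` are hypothetical.

```sh
# Hypothetical budget check: alert when a slice reaches PCT percent of its byte budget.
slice_kib() { du -sk "$1" 2>/dev/null | awk '{print $1}'; }

over_budget() {   # usage: over_budget USED_KIB BUDGET_KIB [PCT], default PCT is 75
  [ $(( $1 * 100 )) -ge $(( $2 * ${3:-75} )) ]
}

check_slice() {   # usage: check_slice DIR BUDGET_KIB
  used=$(slice_kib "$1")
  [ -n "$used" ] || { echo "UNKNOWN $1"; return 2; }
  if over_budget "$used" "$2"; then
    echo "ALERT $1 ${used}KiB of ${2}KiB"
    return 1
  fi
  echo "OK $1 ${used}KiB of ${2}KiB"
}
```

Running `check_slice` once per slice from the same scheduler that drives rotation keeps the alarm and the rollover on one clock; the inode ratio would need a separate check against `df -i`.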
Digest broadcast should stay idempotent at the consumer. If the notifier restarts mid-window, it must rebuild state from the spool tail rather than guessing. That is why the silent-failure counter file matters: chat APIs lie, metrics sometimes lag, but an append-only counter on disk tells you the last successful outbound digest sequence. Replaying from that offset keeps operators from seeing duplicate cards after deploys.
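The replay-from-offset idea can be sketched with an append-only ledger file. This assumes each spool entry carries a monotonically increasing sequence number as its first tab-separated field; that format and the function names `record_delivered`, `last_delivered`, and `replay_pending` are hypothetical.

```sh
# Hypothetical append-only ledger of delivered digest sequence numbers.
record_delivered() {   # usage: record_delivered LEDGER SEQ
  printf '%s\n' "$2" >> "$1"
}

last_delivered() {     # prints 0 when the ledger is missing or empty
  if [ -s "$1" ]; then tail -n 1 "$1"; else echo 0; fi
}

# Print spool entries (assumed format "<seq><TAB><payload>") newer than the ledger tail.
replay_pending() {     # usage: replay_pending LEDGER SPOOL
  awk -F '\t' -v last="$(last_delivered "$1")" '$1 + 0 > last + 0' "$2"
}
```

After a restart, the notifier would send only what `replay_pending` prints and call `record_delivered` after each confirmed send, so a deploy in the middle of a window does not repeat cards that already went out.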
```sh
# Quick headroom snapshot (run on every gateway during drill day)
/usr/bin/du -sh /var/log/openclaw/* 2>/dev/null; /bin/df -h /var/log; /bin/df -i /var/log
```
Document the commands beside your help center links so interns rehearse the same checks after incidents. Capture outputs in your ticket template so postmortems compare apples to apples.
Troubleshooting checklist: rotation, disk, inode
| Symptom | Likely cause | First actions |
|---|---|---|
| Logs stop after midnight | Rotation script and OpenClaw disagree on file handles | Confirm postrotate sends USR1 or equivalent reload; verify new file perms and ownership. |
| `No space left on device` with “free” space | Snapshots, sidecar DBs, or webhook spool on same volume | Run `tmutil listlocalsnapshots /`; move spool; extend plan disk tier. |
| Writers fail, `df -h` looks fine | Inode-like metadata pressure from tiny retry files | Check `df -i`; batch retries; prune scratch trees safely. |
| Duplicate digest storms | Two notifier instances or clock skew | Ensure one elected consumer; keep NTP drift under two seconds. |
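For the duplicate-digest row, one simple way to keep a second notifier process from starting on the same host is a `mkdir`-based lock, since `mkdir` fails when the directory already exists. This is a minimal sketch with hypothetical function names and lock path; it does not elect a consumer across hosts, which needs a shared coordination point the source does not specify.

```sh
# Hypothetical single-host guard: only one notifier process may hold the lock directory.
acquire_notifier_lock() {   # usage: acquire_notifier_lock LOCK_DIR
  mkdir "$1" 2>/dev/null || return 1    # someone else already holds the lock
  echo "$$" > "$1/pid"                  # record the holder for later inspection
}

release_notifier_lock() {   # usage: release_notifier_lock LOCK_DIR
  rm -f "$1/pid"
  rmdir "$1"
}
```

A notifier wrapper would call `acquire_notifier_lock` at startup, exit immediately on failure, and release the lock in an exit trap.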
When webhook signing secrets rotate, stagger digest channel credentials so overlap windows do not look like false green. Log each rotation with ticket ID beside the gateway audit slice.
Rotation mechanics: On macOS gateways, prefer rename-based rollover where OpenClaw opens a new inode atomically while long-running tailers follow the symlink target you update after fsync. Avoid copytruncate patterns unless every reader explicitly supports them; otherwise you will see truncated JSON lines exactly when providers retry the hardest. After each rotation event, run a five-line smoke tail on every slice to confirm new entries append cleanly.
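The symlink side of a rename-based rollover might look like the sketch below. It assumes each slice keeps dated files next to a `current.jsonl` symlink; that layout and the function name `rotate_slice` are hypothetical, and how OpenClaw itself is told to reopen its log file is outside what this sketch covers.

```sh
# Hypothetical rollover: point SLICE/current.jsonl at SLICE/<stamp>.jsonl via a rename.
rotate_slice() {   # usage: rotate_slice SLICE_DIR [STAMP], STAMP defaults to today's date
  dir=$1
  stamp=${2:-$(date +%Y-%m-%d)}
  name="$stamp.jsonl"
  link="$dir/current.jsonl"
  # Nothing to do if the link already points at this rotation.
  [ "$(readlink "$link" 2>/dev/null)" = "$name" ] && return 0
  : >> "$dir/$name" || return 1       # create the dated file without truncating it
  tmp="$dir/.current.$$"
  ln -s "$name" "$tmp" || return 1    # relative target keeps the tree relocatable
  mv -f "$tmp" "$link"                # rename swaps the link in a single step
}
```

Because the old dated file is never truncated or copied, a reader that still holds it open sees complete lines, while anything that follows the link by name picks up the new file.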
Inode hygiene: Webhook retry queues that create one file per attempt explode metadata usage faster than raw bytes. Batch pending deliveries into chunked spools, cap retry depth, and vacuum scratch directories during maintenance windows. If you must keep per-attempt artifacts for compliance, move them to object storage or a dedicated volume with known inode limits rather than the gateway boot SSD.
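Folding one-file-per-attempt retries into a chunked spool can be sketched as follows. It assumes each pending file holds exactly one JSON document and ends in `.json`; the directory layout and the function name `compact_retries` are hypothetical.

```sh
# Hypothetical compaction: append each per-attempt file to one JSONL chunk, then delete it.
compact_retries() {   # usage: compact_retries PENDING_DIR CHUNK_FILE
  pending=$1
  chunk=$2
  for f in "$pending"/*.json; do
    [ -f "$f" ] || continue    # the glob matched nothing
    # Raw newlines in valid JSON are only whitespace, so stripping them gives one line per attempt.
    tr -d '\n' < "$f" >> "$chunk" && printf '\n' >> "$chunk" && rm -f "$f"
  done
}
```

Each source file is removed only after its line is appended, so a crash mid-run can leave one duplicate entry rather than a lost one; deduplicating on correlation ID at replay time would cover that case.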
Disk pressure that is not logs: Time Machine local snapshots, Xcode DerivedData on shared CI hosts, and SQLite sidecars beside OpenClaw can steal space from the partition where you parked JSONL. When alerts fire, walk the volume with du -d 1 before blaming application logging—half the incidents we see are adjacent tooling filling the same APFS container.
Sign-off checklist before production merge
- Merge hash parity: Effective config identical on every Mac in the cohort.
- Slice isolation: Gateway, webhook, and audit paths each have rotation and quota alarms.
- Digest SLO: Synthetic failure produces one summary within the batching window.
- Inode headroom: `df -i` snapshot stored in the incident binder.
- Operator paths: SSH jump hosts documented next to help articles and plans.
FAQ: logs and digests on clustervps
Can I merge slices automatically in CI? Yes—treat merged output as an artifact, gate deploys on hash equality, and keep hot-reload windows aligned with the fragment HowTo.
Should digest bodies include stack traces? Only for severities that block revenue; otherwise keep terse one-line summaries with correlation IDs for replay.
What if I only have two Macs? Co-locate digest on the lighter-traffic gateway temporarily, but keep log slices on disk separate so notifier restarts never truncate gateway JSONL.
When you are ready to grow the fleet, pick a Mac tier with enough SSD headroom for your retention math, then use the public purchase page to provision additional gateways or a dedicated notifier—no console login is required to compare regions, memory, or billing cadence.