Pain points that look like slow disks
Cross-region clusters rarely fail because Xcode lacks CPU. They fail when metadata latency spikes, cache directories fight APFS, or release exports starve SSH sessions. Treat JuiceFS as a latency budget, not infinite POSIX.
- Metadata fan-out: thousands of small writes per minute from package managers and test shards overload Redis or SQL primaries faster than object storage.
- Client cache collisions: every node needs its own --cache-dir on local NVMe; sharing one folder silently corrupts performance counters.
- Promotion storms: unbounded rsync competes with JuiceFS chunk uploads and leaves builders without headroom, a pattern we also cover for object filers in the SeaweedFS disk matrix.
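
A preflight check can catch a cache folder that was accidentally placed on the shared mount before it causes trouble. The sketch below is an illustration, not a JuiceFS feature: the helper name cache_dir_is_local and the example paths are hypothetical, and it only compares path strings, so it will not see through symlinks or detect network volumes.

```shell
#!/bin/sh
# Hypothetical preflight helper (not part of JuiceFS).
# usage: cache_dir_is_local CACHE_DIR MOUNT_POINT
# Succeeds only when CACHE_DIR is neither the mount point nor a path below it.
cache_dir_is_local() {
  _cache=${1%/}        # drop one trailing slash so comparisons are stable
  _mount=${2%/}
  case "$_cache/" in
    "$_mount"/*) return 1 ;;   # cache sits on the shared mount: reject
    *) return 0 ;;
  esac
}
```

A mount script could call cache_dir_is_local /Volumes/local/jfs-cache /Volumes/SharedCI and refuse to continue when it fails.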
Metadata engine matrix on S3
| Engine | Best when | Watch item |
|---|---|---|
| Redis | Two to eight nodes, predictable CI bursts. | Failover is fast only if you persist and back up the AOF. |
| TiKV | Horizontal metadata growth across regions. | Plan PD placement away from noisy builder subnets. |
| SQL (MySQL/Postgres) | Teams that already run managed databases. | Connection pools per Mac must stay bounded. |
| Embedded SQLite | Single-node labs only. | Avoid for parallel clusters. |
Pick Redis when your p99 metadata budget stays under two milliseconds on LAN and you can tolerate a single writer. Move to TiKV before horizontal sharding of builders exceeds one primary Redis footprint.
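
The selection rule above can be written down as a small decision helper so it lands in provisioning scripts instead of tribal memory. This is a sketch under assumptions: the function name pick_metadata_engine is hypothetical, the thresholds are copied from the matrix and the paragraph above, and the SQL row is left out because that choice depends on what a team already operates.

```shell
#!/bin/sh
# Hypothetical helper encoding the matrix above.
# usage: pick_metadata_engine NODE_COUNT P99_METADATA_MICROSECONDS
pick_metadata_engine() {
  _nodes=$1
  _p99_us=$2
  if [ "$_nodes" -le 1 ]; then
    echo sqlite                      # single-node labs only
  elif [ "$_nodes" -le 8 ] && [ "$_p99_us" -lt 2000 ]; then
    echo redis                       # two to eight nodes, p99 under two milliseconds
  else
    echo tikv                        # beyond one Redis primary footprint
  fi
}
```

For example, pick_metadata_engine 4 1500 prints redis, while the same latency on twelve nodes prints tikv.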
JuiceFS format and mount parameters
| Flag | Example | Why operators care |
|---|---|---|
| format --storage s3 | Region pinned endpoint URL | Keeps chunk latency stable across AZs. |
| --trash-days | 3 to 7 for CI caches | Prevents silent inode leakage after deletes. |
| --compress zstd | Level five | Balances CPU on M4 against WAN bytes. |
| mount --cache-size | 40GiB on 1TB nodes, 96GiB on 2TB | Reserves APFS slack for DerivedData and logs. |
| mount --max-uploads / --max-downloads | 50 / 100 per node | Stops chunk queues from starving interactive SSH. |
| mount --writeback | Off for release trees | Turn on only for disposable scratch volumes. |
| mount --read-cache | Dedicated NVMe path | Separates hot reads from write cache churn on shared builders. |
| mount --buffer-size | 300 MiB baseline | Smooths WAN when APFS slack exists. |
# Format once: S3 bucket for chunks, Redis for metadata
juicefs format --storage s3 --bucket https://s3.example.com/ci-meta \
  redis://meta-primary:6379/1 ci-shared
# Mount on each builder with node-local cache paths
juicefs mount --cache-dir /Volumes/local/jfs-cache --read-cache /Volumes/local/jfs-read \
  --cache-size 40960 --buffer-size 300 \
  --max-uploads 50 --max-downloads 100 --prefetch 1 \
  redis://meta-primary:6379/1 /Volumes/SharedCI
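
The cache sizes in the table are given in GiB, while the mount example passes 40960, which matches 40 GiB expressed in MiB. A short helper makes that conversion explicit per disk tier. It is a sketch: the function name jfs_cache_size_mib is hypothetical, and it assumes --cache-size is read as MiB, as the example above implies.

```shell
#!/bin/sh
# Hypothetical helper: map a builder's disk tier to a --cache-size value in MiB.
# usage: jfs_cache_size_mib DISK_TB
jfs_cache_size_mib() {
  case "$1" in
    1) echo $((40 * 1024)) ;;   # 40 GiB on 1TB nodes
    2) echo $((96 * 1024)) ;;   # 96 GiB on 2TB nodes
    *) echo "unsupported disk tier: $1" >&2; return 1 ;;
  esac
}
```

A mount wrapper could then pass --cache-size "$(jfs_cache_size_mib 2)" on 2TB builders.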
Artifact rsync throttles and concurrency
Mirror the artifact rsync matrix discipline: two streams per WAN path, explicit delete delay, and checksum gates before promotion.
rsync -az --partial --delete-delay --bwlimit=55000 \
  --info=stats2 ./artifacts/ mac-usw-03:/Volumes/JuiceMount/promote/
- Bandwidth: start near fifty-five megabytes per second per stream; raise only when JuiceFS writeback queues and SSH idle latency remain flat.
- Concurrency: cap parallel rsync jobs to two per geography while compile farms still hit object storage.
- Locks: wrap promotion in flock so canary and stable lanes never delete each other.
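
The lock bullet above could look roughly like the wrapper below. Treat it as a sketch: the function name with_promotion_lock and the lock path are hypothetical, and it assumes the flock utility is installed, which is not the case on a stock macOS. When flock is missing, the sketch swaps in a plain mkdir lock, which is a different technique with weaker cleanup guarantees if the process is killed.

```shell
#!/bin/sh
# Hypothetical wrapper: run one promotion command under an exclusive lock.
# usage: with_promotion_lock LOCK_PATH COMMAND [ARGS...]
with_promotion_lock() {
  _lock=$1
  shift
  if command -v flock >/dev/null 2>&1; then
    # -n fails immediately instead of queueing behind another lane
    flock -n "$_lock" "$@"
  else
    # Swapped-in fallback: mkdir is atomic, so the directory acts as the lock
    if mkdir "$_lock.d" 2>/dev/null; then
      "$@"
      _rc=$?
      rmdir "$_lock.d"
      return "$_rc"
    fi
    echo "promotion lock busy: $_lock" >&2
    return 1
  fi
}
```

Both lanes would call it with the same lock path, for example with_promotion_lock /var/tmp/promote.lock followed by the rsync command shown earlier. The wrapped command must be an external program, not a shell function, because flock executes it directly.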
Rollout steps before you add nodes
- Step 1: Baseline JuiceFS with juicefs stats: record p95 for metadata and block throughput during a full nightly build.
- Step 2: Size per-node cache directories on NVMe, separate from the mount point, and snapshot trash retention weekly.
- Step 3: Apply rsync ceilings and verify Git fetch plus SSH console latency during promotion.
- Step 4: Wire OpenClaw gateway probes to emit one digest per failed webhook window instead of raw log spam.
- Step 5: Re-run the disk checklist below after any metadata engine upgrade or cache path change.
- Step 6: Document rollback: unmount JuiceFS, drain rsync, restore metadata snapshot, then remount read-only until validation passes.
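
For the Step 1 baseline, a p95 can be computed from collected samples with standard tools. The sketch below assumes the latency or throughput samples have already been extracted to one number per line; it does not parse juicefs stats output itself, and the function name p95 is hypothetical. It uses the nearest-rank method.

```shell
#!/bin/sh
# Hypothetical helper: nearest-rank 95th percentile of numbers read from stdin.
# usage: printf '%s\n' 1.2 0.8 3.4 | p95
p95() {
  sort -n | awk '
    { v[NR] = $1 }
    END {
      if (NR == 0) exit 1                  # no samples, no baseline
      rank = int((NR * 95 + 99) / 100)     # ceil(0.95 * NR) in integer arithmetic
      print v[rank]
    }'
}
```

Storing the result per nightly build gives later steps a number to compare against after a metadata engine upgrade or cache path change.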
1TB and 2TB disk acceptance checklist
- 1TB nodes: cap JuiceFS cache near forty gigabytes and keep fifteen gigabytes headroom for logs plus crash reports.
- 2TB nodes: allow ninety-six gigabyte caches but require weekly inode counts because larger caches hide metadata drift.
- Shared acceptance: verify S3 lifecycle rules, metadata backup size growth under twenty percent week over week, and rsync wall clock under service level.
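
The week-over-week growth limit in the shared acceptance item is simple integer arithmetic. A minimal sketch, assuming the two backup sizes are already known in bytes; the function name backup_growth_ok is hypothetical, and how the sizes are measured is left to the backup tooling in use.

```shell
#!/bin/sh
# Hypothetical check: metadata backup grew by less than twenty percent.
# usage: backup_growth_ok PREVIOUS_BYTES CURRENT_BYTES
# returns 0 when growth is under 20 percent, 1 when it is not, 2 without a baseline
backup_growth_ok() {
  [ "$1" -gt 0 ] || return 2
  # current * 100 < previous * 120, kept in integers to avoid floating point
  [ $(( $2 * 100 )) -lt $(( $1 * 120 )) ]
}
```

Exactly twenty percent growth counts as a failure here because the checklist asks for growth under that mark.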
OpenClaw canary probes and webhook failure digests
Treat gateway Macs as observability endpoints. When Flux or Argo Rollouts posts a canary webhook, merge JSON probe output from each node, cap retries, and broadcast a single failure summary to your on-call channel.
Align gateway token rotation with the Flux webhook canary guide and the Argo Rollouts AnalysisRun pattern. Reuse the log hygiene ideas from cluster logs webhooks so operators see correlation ids instead of duplicate stack traces.
- Canary probe JSON: include node hostname, JuiceFS mount state, free APFS gigabytes, and last rsync exit code.
- Webhook failure digest: limit to four kilobytes, attach deep links to raw logs, expire after thirty minutes.
- Backoff: exponential retry with jitter prevents metadata storms when S3 throttles briefly.
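
The probe fields and the digest size limit listed above can be sketched in a few lines of shell. Everything named here is an assumption for illustration: the JSON keys, the functions canary_probe_json and cap_digest, and the idea that the digest is plain text. The probe function does no JSON escaping, so it expects hostnames and state strings without quotes or backslashes.

```shell
#!/bin/sh
# Hypothetical probe emitter with the four fields from the list above.
# usage: canary_probe_json HOSTNAME MOUNT_STATE FREE_APFS_GB LAST_RSYNC_EXIT
canary_probe_json() {
  printf '{"hostname":"%s","juicefs_mount":"%s","free_apfs_gb":%d,"last_rsync_exit":%d}\n' \
    "$1" "$2" "$3" "$4"
}

# Hypothetical digest cap: pass at most four kilobytes (4096 bytes) of stdin through.
cap_digest() {
  head -c 4096
}
```

A node could emit canary_probe_json "$(hostname)" mounted 412 0 after each promotion. Cutting a digest at a byte boundary can split a line, so a real digest would likely trim to the last complete line before sending.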
Citable guardrails
- Cache budget rule: keep JuiceFS client cache under four percent of total APFS capacity on 1TB builders and under five percent on 2TB builders unless trash days are zero.
- Metadata latency SLO: alert when Redis INFO and LATENCY DOCTOR output show sustained reads above one millisecond during compile peaks.
- Promotion SLO: require rsync dry-run checksum pass and OpenClaw green probe before any traffic shift documented in GitOps history.
- Backup pairing: match retention to the restic and rclone backup matrix so restores keep metadata snapshot order.