Failure modes that masquerade as slow disks
Cross-region clusters rarely starve for CPU on Apple Silicon. They degrade when parity rebuilds overlap with replication lag alarms, or when unbounded rsync saturates the same NIC that carries SSH and Git traffic.
- Parity rebuild pressure: drive loss or decommission triggers EC heal streams that can crowd out interactive sessions unless you cap heal concurrency per site.
- Replication skew: active-active buckets without explicit lag budgets let QA pull half-written trees; operators chase ghosts instead of compiler errors.
- Promotion overlap: rsync without
bwlimitcollides with MinIO scanner and lifecycle jobs, a pattern we also document beside filer stacks in the SeaweedFS rsync disk matrix.
Erasure coding versus cross-bucket replication
| Pattern | Best when | Trade-off |
|---|---|---|
| EC:4 | Single geography, three or more drives per erasure set, cost-sensitive cold archives. | Heals are CPU plus disk heavy; avoid mixing with latency-sensitive Xcode DerivedData roots. |
| EC:8 | Large pools where double-device tolerance matters for multi-month artifact retention. | Higher parity overhead; plan scanner windows off peak compile hours. |
| Active-active replication | Readers in two regions need identical keys within minutes. | Requires conflict rules, versioned deletes, and explicit replication lag SLO per bucket prefix. |
| One-way site replication | Promotion hub in one region, DR copies elsewhere. | Simplest mental model for CI promotion plus audit trails. |
Pick EC inside each region for durability economics, then layer site replication only on buckets that truly need mirrored reads. Keep scratch or ephemeral caches on local APFS, not on replicated prefixes, so heal storms never block compile queues.
Replication window and EC parameter checklist
| Parameter | Starter value | Operator intent |
|---|---|---|
| Replication lag alert | Five to fifteen minutes per hop on WAN | Page only when lag exceeds twice the nightly promotion window. |
| Delete marker sync | Enabled for versioned release buckets | Prevents resurrecting binaries after rollback. |
| Scanner duty cycle | Off peak UTC nights | Keeps M4 CPU for tests instead of bitrot checks. |
| Heal concurrency | Two workers per node on 1TB, four on 2TB | Protects APFS free space buffers listed later. |
| ILM transition day | Thirty to ninety days for CI tarballs | Drops storage cost before cross-region egress repeats. |
| Key layout | Prefix per product line plus immutable build id | Makes replication filters and partial restores deterministic. |
Document every rule in GitOps so new Mac nodes inherit the same bucket policy. Pair these knobs with the JuiceFS S3 cache matrix when POSIX mounts sit above the same object store.
Example rsync throttle line for artifact promotion: rsync -az --delete-delay --bwlimit=48000 --partial plus two concurrent jobs per geography. Raise bandwidth only after mc admin info shows stable drive latency and SSH remains under two hundred milliseconds.
Artifact rsync throttles beside MinIO
Follow the artifact rsync matrix: cap streams, serialize delete phases, and checksum before flipping traffic. Treat MinIO PUT storms and rsync as shared WAN tenants.
- Bandwidth: begin near forty-eight megabytes per second per stream on shared one-gigabit uplinks; validate with
--info=stats2logs. - Concurrency: two rsync jobs per region while compile and replication both run; add a third lane only after disk gates stay green for a week.
- Locks: wrap promotions with
flockso canary and stable prefixes never race deletes.
Rollout steps before scaling nodes
- Step 1: Baseline MinIO cluster health with
mc admin healsummaries and drive latency histograms during a full nightly build. - Step 2: Apply EC sets per rack, enable versioning on release buckets, and freeze scanner schedules around compile peaks.
- Step 3: Configure replication rules with explicit lag budgets; rehearse failover reads from the secondary region.
- Step 4: Layer rsync caps and verify Git fetch plus help center SSH guidance stays responsive during promotion.
- Step 5: Re-run the disk checklist after any heal policy change or lifecycle edit.
- Step 6: Document rollback: pause replication, drain rsync, mark the bucket read-only, restore from the restic and rclone backup matrix cadence, then reopen traffic.
1TB and 2TB APFS acceptance gates
- 1TB nodes: reserve eighteen gigabytes for logs and crash captures; cap local staging under twelve percent of total capacity.
- 2TB nodes: allow deeper staging but require weekly inode audits because large caches hide lifecycle drift.
- Shared rule: replication plus heal must never consume more than half of sustained write megabytes per second during business hours.
OpenClaw canary and health probes
Point gateway webhooks at lightweight checks: mc ready, head-object on a canary key, and free APFS gigabytes on each Mac. Flux or Argo Rollouts can gate traffic only when those probes stay green; failures should emit a single bounded digest instead of raw logs, mirroring the Flux webhook canary guide.
Citable guardrails
- Replication SLO: alert when measured lag exceeds twice the documented promotion window for any release prefix.
- Heal fairness: cap concurrent heal operations so ninety-fifth percentile compile time rises less than eight percent week over week.
- rsync contract: checksum every promotion and require OpenClaw green before GitOps records a traffic shift.
- Disk contract: treat eighty percent utilization as capacity planning trigger, not a surprise outage.
Scale MinIO-backed CI without starving APFS
Rent multi-region Mac mini M4 nodes with predictable networking, attach S3-compatible buckets, and keep replication plus rsync inside measured ceilings using public purchase and pricing pages.