
Build a multi-machine Apple Silicon cluster (May 2026)

Run frontier-class models (DeepSeek V3, Llama 4 Maverick) locally on a personally affordable Apple Silicon cluster. This guide is honest about what works (Thunderbolt 5 RDMA, Exo's pipeline parallelism) and what doesn't (NVIDIA-only frameworks, training workloads, multi-tenant serving).

By Fredoline Eruo · Last reviewed 2026-05-06 · ~13 min read

The stack
  1. Hardware · Compute nodes (M4 Pro recommended; M4 Max for the head node)
    apple-m4-max

    The M4 Pro Mac Mini is the cost-efficient cluster node — a 64GB unified memory option exists; Thunderbolt 5 RDMA is the prerequisite for the ~99% inter-device latency drop that makes this stack credible. The M4 Max as head node gives extra memory bandwidth for the routing layer.

  2. Tool · Cluster orchestrator
    exo

    Exo is what makes consumer-Mac clustering viable in 2026: auto-discovery of nearby nodes and pipeline-parallel sharding via MLX. Thunderbolt 5 RDMA + macOS 26.2 cut inter-device latency by ~99% — the breakthrough that turned this from research demo to credible serving option.

  3. Tool · Inference engine (per-node)
    mlx-lm

    MLX-LM runs on each cluster node as the per-device inference layer. Exo orchestrates; MLX executes. Long-context performance on M-series silicon is now stronger than llama.cpp's Metal backend — pick MLX over Ollama for cluster deployments specifically.

  4. Tool · Frontend (cluster-facing)
    openwebui

    Open WebUI on a separate Mac (a laptop is fine; it doesn't need to be in the cluster) talks to the cluster's serving endpoint. A single-user-comfortable UI sitting in front of a 4-8 node cluster — the simplest reliable frontend pattern.

Why this is now credible

Until early 2026, “run a 671B model on consumer hardware” was a stunt — possible technically, useless operationally because the inter-device latency dominated everything. The macOS 26.2 + Thunderbolt 5 RDMA combination flipped that. Inter-device latency dropped by ~99%; the per-token communication cost on pipeline parallel went from milliseconds to microseconds. DeepSeek V3 671B at 5.37 tok/s on 8x M4 Pro Mac Minis is the headline benchmark — it turns this from “can it work?” into “is the budget worth it?”

The honest framing this stack takes: this is the credible Apple-cluster path, not the right answer for most readers. Most workloads fit a single 4090 or M3 Max. The cluster path matters when (a) the model genuinely won't fit a single machine, (b) you have data-residency requirements that rule out cloud, and (c) you can absorb the operational complexity. If those don't all apply, look at the /stacks/apple-silicon-ai single-Mac stack first.

Networking assumptions

The stack's viability depends entirely on the interconnect. The honest hierarchy:

  • Thunderbolt 5 + macOS 26.2 RDMA: ~80 Gbps practical with sub-microsecond latency. The architecture this stack is built around. Requires M4 Pro or better and macOS 26.2+ on every node — one node on macOS 26.1 silently downgrades the whole cluster to the non-RDMA path.
  • Thunderbolt 4 (no RDMA): ~40 Gbps; the latency is acceptable but the throughput is the bottleneck. Not what this stack is designed for; pipeline parallel still works but you lose ~40-50% of the cluster advantage.
  • 10 GbE Ethernet: works as backup / management network. Don't use as the primary inter-device path; the RDMA latency advantage vanishes.

See /systems/distributed-inference for the latency math that determines whether the cluster pays for itself.
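
The first-order version of that math: in pipeline parallel, each generated token crosses every stage boundary once, so per-token time is roughly compute time plus (nodes - 1) × per-hop latency. A throwaway check with placeholder numbers; substitute your own measurements:

# First-order pipeline-parallel estimate. All three values are placeholders;
# replace them with your measured hop latency and per-token compute time.
NODES=8; HOP_MS=0.01; COMPUTE_MS=186
echo "($NODES - 1) * $HOP_MS" | bc -l                          # ms of communication per token
echo "1000 / ($COMPUTE_MS + ($NODES - 1) * $HOP_MS)" | bc -l   # resulting tokens per second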

Step-by-step setup

1. Verify RDMA on every node

# On every Mac in the cluster — verify Thunderbolt 5 RDMA available
# Requires macOS 26.2+ AND M4 Pro / M4 Max / M3 Ultra hardware
system_profiler SPThunderboltDataType | grep -i RDMA

# Expected output: "RDMA Support: Yes"
# If any node reports "RDMA Support: No" — fix that before continuing.
# One non-RDMA node downgrades the entire cluster.

The most common failure: one node still on macOS 26.1. Update everywhere; reboot; re-verify before installing the rest of the stack.
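
To avoid checking eight machines by hand, run the same check over SSH. A quick sketch, assuming Remote Login is enabled on every node and that hostnames node1 through node8 resolve (substitute your own):

# Check macOS version and RDMA support on every node in one pass.
# Hostnames are placeholders; Remote Login (SSH) must be enabled on each Mac.
for host in node1 node2 node3 node4 node5 node6 node7 node8; do
  echo "== $host =="
  ssh "$host" 'sw_vers -productVersion; system_profiler SPThunderboltDataType | grep -i RDMA'
done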

2. Install Exo on every node

# Native install via brew (preferred) on every Mac
brew install exo

# Or via pip if you prefer Python-managed
pip install exo-explore

# Verify version >= 1.0 — earlier versions don't support the
# Thunderbolt 5 RDMA path properly
exo --version

3. Start Exo on the head node

# On the Mac you'll use as the head node:
exo --node-role=head --discovery=auto

# Exo auto-discovers other 'exo' processes on the LAN/Thunderbolt
# bus. Each detected node logs a connection event in head's stdout.
# Wait until all expected nodes appear before sending workloads.

# On every other node:
exo --node-role=worker --head=192.168.10.1 --discovery=manual
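
If you want workers to survive logouts and reboots without babysitting, a LaunchAgent is the usual macOS answer. A minimal sketch, reusing the worker flags above and assuming a Homebrew install path (the com.example.exo-worker label is arbitrary):

# Optional: keep the worker process alive across logins and reboots.
cat > ~/Library/LaunchAgents/com.example.exo-worker.plist <<'EOF'
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE plist PUBLIC "-//Apple//DTD PLIST 1.0//EN" "http://www.apple.com/DTDs/PropertyList-1.0.dtd">
<plist version="1.0">
<dict>
  <key>Label</key>
  <string>com.example.exo-worker</string>
  <key>ProgramArguments</key>
  <array>
    <string>/opt/homebrew/bin/exo</string>
    <string>--node-role=worker</string>
    <string>--head=192.168.10.1</string>
    <string>--discovery=manual</string>
  </array>
  <key>RunAtLoad</key>
  <true/>
  <key>KeepAlive</key>
  <true/>
</dict>
</plist>
EOF
launchctl load ~/Library/LaunchAgents/com.example.exo-worker.plist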

4. Pull a model and run a query

# Through the head node's API:
curl -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-V3",
    "messages": [{"role":"user","content":"Hello, world."}],
    "max_tokens": 100
  }'

# First request: model loads + Metal kernels JIT-compile across the
# cluster. Expect 2-5 minutes of initial setup.
# Subsequent requests: ~5.37 tok/s on a well-tuned 8-node cluster.
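
To finish the stack, point Open WebUI (on the separate frontend Mac) at this same endpoint. A minimal Docker sketch, assuming the head node is reachable at 192.168.10.1 as in step 3; the dummy API key just satisfies the OpenAI-connection form:

# Open WebUI against the cluster's OpenAI-compatible endpoint.
docker run -d -p 3000:8080 \
  -e OPENAI_API_BASE_URL=http://192.168.10.1:52415/v1 \
  -e OPENAI_API_KEY=local-cluster \
  -v open-webui:/app/backend/data \
  --name open-webui \
  ghcr.io/open-webui/open-webui:main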

Power + thermal considerations

The big unsung advantage of this stack vs an NVIDIA equivalent: ~250-400W under load for the entire 8-node cluster. Compare to 1.5-3 kW for an equivalent multi-GPU NVIDIA setup. Three implications:

  • A residential power circuit is fine. 400W is well within the ~1,800W capacity of a standard 15A 120V circuit. No dedicated 240V wiring, and no need to coordinate with the rest of the house's draw.
  • Thermal output is manageable. Mac Minis run cool — sustained inference loads stay under 50°C at normal room temperatures. No active rack cooling required; a standard well-ventilated office shelf works.
  • Battery / UPS lasts longer. A 1500W UPS can run the entire cluster for ~30 minutes during an outage — vs ~3 minutes for an NVIDIA equivalent.
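
The runtime figure is just battery energy over load. A quick check, assuming roughly 200 Wh of usable battery in a typical consumer 1500W UPS (an assumption; check your model's spec sheet):

# Runtime (minutes) = usable battery energy (Wh) / load (W) * 60.
# 200 Wh is an assumed figure for a typical consumer 1500W UPS; check yours.
echo '200 / 400 * 60' | bc -l    # ~30 min at a 400W cluster load
echo '200 / 3000 * 60' | bc -l   # ~4 min at a 3 kW multi-GPU load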

Cost estimate

Honest hardware estimate for the 8-node configuration as of May 2026:

  • 8x M4 Pro Mac Mini, 64GB unified memory: ~$2,800/each = $22,400
  • Thunderbolt 5 cables + hub: ~$400
  • 10GbE switch (backup/management): ~$300
  • 1500W UPS: ~$300
  • Total: ~$23,400

Compare to: 8x H100 SXM = $200K+; 4x A100 80GB = ~$60-80K used. The cost ratio is real, and the price-per-token math is competitive with cloud rentals at sustained-use workloads. Where this loses: you can't train on it, you can't run NVIDIA-only frameworks, and the throughput per replica trails datacenter SKUs by 5-10x at single-user concurrency.
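
A quick way to test the sustained-use claim against your own numbers: divide the hardware cost by whatever hourly rate you would actually pay for comparable cloud capacity. The rate below is a placeholder, not a quote:

# Break-even = hardware cost / cloud rate. $4/hr is a placeholder; substitute
# the real rate for the cloud capacity you would otherwise rent.
HARDWARE=23400; CLOUD_PER_HR=4
echo "$HARDWARE / $CLOUD_PER_HR" | bc            # hours of sustained use to break even
echo "$HARDWARE / $CLOUD_PER_HR / 720" | bc -l   # months of 24/7 use (720 h per month)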

Failure modes you'll hit

  1. RDMA falls back silently to non-RDMA. One node on macOS 26.1, one cable that doesn't support Thunderbolt 5 — the cluster keeps working but ~40% of the advantage vanishes. Always verify with system_profiler SPThunderboltDataType | grep RDMA on every node after any change.
  2. Auto-discovery doesn't cross VLANs. Multicast DNS works on a flat LAN but routers can block it. Use --discovery=manual --peers=... with explicit IPs when auto-discovery fails.
  3. Metal kernel cold start. First inference takes 30-60 seconds longer than expected as Metal kernels JIT-compile across the cluster. Pre-warm at startup with a 10-token query (see the sketch after this list).
  4. One node OOM kills the cluster. If a single node runs out of memory, the pipeline stalls. With 64GB unified memory per M4 Pro Mac Mini this is rare; the usual cause is mixing in 32GB-tier nodes that lack headroom. Use 64GB consistently.
  5. Daisy-chain Thunderbolt 5 introduces asymmetric latency. Star topology with a Thunderbolt 5 hub avoids this; daisy-chain works but asymmetric pairs (node 0 ↔ node 7 via 6 hops) lose the latency advantage.
  6. Software updates require coordinated reboots. macOS update on one node leaves the cluster down until all nodes are updated. Coordinate update windows.
  7. Concurrent workloads share the cluster. Single-user cluster handles its workload well; second user doubles latency because pipeline-parallel doesn't multi-tenant cleanly. Use one cluster per concurrent user.
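
The pre-warm from item 3 is a throwaway request against the same endpoint as step 4:

# Pay the Metal JIT cost before real traffic arrives (run after every cluster start).
curl -s -X POST http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"deepseek-ai/DeepSeek-V3","messages":[{"role":"user","content":"warm-up"}],"max_tokens":10}' \
  > /dev/null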

Variations and alternatives

4-node minimum variation. 4x M4 Pro Mac Mini handles 70B-405B class models comfortably. Cheaper (~$11.5K hardware); doesn't fit DeepSeek V3 671B but covers most realistic local-frontier workloads.

M3 Ultra single-node alternative. Mac Studio with M3 Ultra + 192GB unified memory fits frontier models on a single node. ~$5K-7K. Slower than the 8-node M4 Pro cluster but vastly simpler operationally. Pick this if your model fits 192GB and you don't want cluster ops.

Linux/CUDA equivalent. /stacks/distributed-inference-homelab covers the NVIDIA path. Higher per-node throughput; vastly higher cost, power, and thermal envelope.

Petals fallback. If you can't afford even the 4-node variation, Petals shards over WAN volunteers. Slow but works. Acceptable for non-sensitive workloads only.

Who should avoid this stack

  • Anyone whose model fits on a single bigger machine. M3 Ultra Mac Studio with 192GB unified memory handles 70B-class models comfortably; M4 Max 128GB handles 32B-class effortlessly. The cluster path is for models that genuinely won't fit a single machine.
  • Anyone needing training or NVIDIA-only frameworks. MLX is inference-first; vLLM, TensorRT-LLM, training frameworks, CUDA kernels don't apply here. If your workload needs NVIDIA, this isn't your stack.
  • Anyone serving multi-tenant production. Pipeline parallelism doesn't multi-tenant cleanly. Concurrent users on the same cluster degrade rapidly. Per-user clusters are operationally painful.
  • Anyone uncomfortable with macOS-as-server. macOS isn't headless-server-grade — display sleep settings matter, automatic updates require coordination, kernel panics happen on prerelease OS versions. If your ops culture is “Linux server only,” this stack will fight you.
  • Anyone whose budget can't absorb the $20-25K upfront hardware cost. The cluster is a real capital investment. At sustained use, avoided cloud rental pays it back; sporadic use doesn't. Do the economics math carefully.

Going deeper