Can I distribute local LLM inference across multiple machines (P2P)?
The answer
One paragraph. No hedging beyond what the data actually warrants.
Yes — but the network is almost always the bottleneck. Four working paths in May 2026:
1. vLLM tensor parallelism (NVIDIA, same machine) — production-grade. Splits attention heads across multiple GPUs in the same box; bandwidth via NVLink (or PCIe). Works great for dual / quad 3090, dual A100, etc. Best kept inside one chassis: cross-machine tensor parallel is technically possible (vLLM can span nodes via Ray), but over Ethernet it is interconnect-starved — which is why the decision rule below says build one bigger box instead. A minimal launch sketch follows the list.
2. MLX-distributed (Apple) — Mac Studio cluster. Two or more M-series Macs connected via Thunderbolt 4 (40 Gbps) or 10GbE. Sharding model weights across nodes works; latency is the cost. A multi-node M3 Ultra cluster CAN host a 671B MoE like DeepSeek V3 — community reports describe it as "usable for solo workflows, not for serving." We don't have independent measurements; the specific tok/s depends heavily on node count, interconnect, and quant. A primitive-level sketch follows the list.
3. exo (cross-OS P2P) — community project that handles cross-platform clusters: mix Mac, Linux, even iPhone in a single pool. Splits the model layer-wise across nodes. Trades throughput for "use the hardware you already have." Don't expect production-grade speeds.
4. Petals / Petals 2 — the most ambitious: peer-to-peer inference across the public internet. Anyone can join the swarm; you contribute compute and consume inference. Real but slow; swarm latency makes interactive use frustrating. A client sketch follows the list.
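A minimal sketch of path 1, assuming a single box with four local GPUs. The checkpoint name is a placeholder, and tensor_parallel_size should match your GPU count:

```python
# Hedged sketch: vLLM tensor parallelism on one multi-GPU machine.
# The model name below is illustrative -- substitute your own checkpoint.
from vllm import LLM, SamplingParams

llm = LLM(
    model="meta-llama/Llama-3.1-70B-Instruct",  # placeholder checkpoint
    tensor_parallel_size=4,                      # one shard per local GPU
)

params = SamplingParams(temperature=0.7, max_tokens=128)
out = llm.generate(["Explain tensor parallelism in one sentence."], params)
print(out[0].outputs[0].text)
```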
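For path 2, the cleanest thing we can show without hardware-specific numbers is the MLX distributed primitive that the mlx-examples scripts build on. Assumes one process per Mac, started by MLX's launcher; the shapes and the collective shown here are illustrative, not the full inference loop:

```python
# Hedged sketch: the collective primitive behind MLX distributed inference.
# Each rank would hold its own shard of the weights; only activations
# cross the Thunderbolt / Ethernet link.
import mlx.core as mx

group = mx.distributed.init()        # rank/size come from the launcher
x = mx.ones((1, 8))                  # stand-in for a layer activation
summed = mx.distributed.all_sum(x)   # one collective hop per sharded layer
mx.eval(summed)
print(f"rank {group.rank()} of {group.size()}")
```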
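And for path 4, a client-side sketch of the Petals API from bigscience-workshop/petals. The model name is an example from their docs and may not be in the current swarm:

```python
# Hedged sketch: consuming inference from the public Petals swarm.
from transformers import AutoTokenizer
from petals import AutoDistributedModelForCausalLM

name = "petals-team/StableBeluga2"   # example; check what the swarm hosts
tokenizer = AutoTokenizer.from_pretrained(name)
model = AutoDistributedModelForCausalLM.from_pretrained(name)

ids = tokenizer("A peer-to-peer swarm is", return_tensors="pt")["input_ids"]
out = model.generate(ids, max_new_tokens=16)  # each step crosses the swarm
print(tokenizer.decode(out[0]))
```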
The honest math: for layer-wise distributed inference, you're bottlenecked by the slowest network hop in the chain. Thunderbolt 4 (40 Gbps) is fine for Mac clusters. 10GbE is acceptable for small NVIDIA clusters. 1 GbE is unusable for anything beyond toy demos.
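To make that rule concrete, here is the back-of-envelope arithmetic as a runnable sketch. All the constants are assumptions (8192-dim fp16 activations, four nodes chained layer-wise, 0.5 ms per hop), not measurements:

```python
# Hedged sketch: per-token network cost of a layer-wise split.
HIDDEN_DIM = 8192        # assumed hidden size (70B-class model)
BYTES_PER_ACT = 2        # fp16 activations
HOPS = 3                 # assumed: 4 nodes chained layer-wise
LATENCY_S = 0.0005       # assumed 0.5 ms one-way latency per hop

def net_seconds(tokens: int, link_gbps: float) -> float:
    """Time spent moving activations for `tokens` positions across all hops."""
    bits = tokens * HIDDEN_DIM * BYTES_PER_ACT * 8
    return HOPS * (bits / (link_gbps * 1e9) + LATENCY_S)

for name, gbps in [("Thunderbolt 4", 40.0), ("10GbE", 10.0), ("1GbE", 1.0)]:
    decode_ms = net_seconds(1, gbps) * 1e3   # one generated token
    prefill_s = net_seconds(2048, gbps)      # one 2048-token prompt
    print(f"{name:>13}: +{decode_ms:.2f} ms/token decode, "
          f"+{prefill_s:.2f} s prefill")
```

The pattern the numbers show: decode is latency-dominated (the links look similar per generated token), while prefill is bandwidth-dominated — and prefill is where 1 GbE falls off a cliff.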
Decision rule: if you have 2-4 Mac Studios already, MLX-distributed is the only path that fits the hardware constraints. If you have 2-4 NVIDIA workstations, vLLM tensor-parallel + NCCL over 10GbE works but you're better off building one bigger box. P2P over the public internet (Petals) is a research curiosity, not an operator solution.
Where we got the numbers
MLX-distributed: ml-explore/mlx-examples repo. exo: exo-explore/exo repo. Petals: bigscience-workshop/petals. Cross-machine bandwidth math from cluster-deployment community discussions.
Also see
Other questions in this thread
Other /q/ landings on the same topic — same editorial discipline.
Found this via a forum search? Bookmark the URL — we update these pages as new data lands. Have a question that should live here? Open a GitHub issue.