Build a Mac-native AI stack (May 2026)
A Mac-native local AI stack that takes full advantage of unified memory and (optionally) scales across multiple Macs via Thunderbolt 5: 32B-class models run comfortably on a single Mac, frontier-class models across a cluster.
- 01 · Hardware · Compute (Apple Silicon GPU + unified memory) · apple-m3-max
M3 Max 64GB is the single-Mac sweet spot in May 2026 — 36 GPU cores, 400 GB/s memory bandwidth, all 64GB addressable as VRAM. M4 Pro / M4 Max win on Thunderbolt 5 RDMA for clustering; for single-Mac use, M3 Max delivers 90% of M4 Max throughput at lower price.
- 02 · Tool · Inference engine (Apple-native) · mlx-lm
MLX-LM over llama.cpp on M-series silicon: matched throughput on short context, ~15-25% faster on long context (32K+), and the path that pairs with Exo for cluster scaling. Use llama.cpp when you need GGUF quants MLX hasn't picked up yet.
- 03 · Tool · Model-swap layer (ad-hoc experimentation) · ollama
Ollama on Mac wraps llama.cpp under the hood and runs alongside MLX-LM for the 'pull a new model right now' workflow. It fills a different role than MLX-LM (llama.cpp wrapper vs. the Apple-native path), and the two stay live on different ports.
- 04 · Tool · Distributed serving (multi-Mac cluster) · exo
Exo is what makes multi-Mac credible in 2026: auto-discovers nearby Apple Silicon devices on the LAN, shards models across them via pipeline parallel on top of MLX. Thunderbolt 5 + macOS 26.2 RDMA cuts inter-device latency by ~99%, turning consumer-Mac clusters into a real serving option.
- 05 · Model · Coding model (single-Mac primary) · qwen-2.5-coder-32b-instruct
Qwen 2.5 Coder 32B in MLX-4bit quant runs comfortably on a 64GB M3 Max with room for 32K context. Beats DeepSeek Coder V2 Lite on coding benchmarks at the same memory footprint.
- 06 · Tool · Chat frontend · openwebui
Open WebUI runs in Docker Desktop or directly via npm; talks to MLX-LM's OpenAI-compatible bridge. Same multi-user ergonomics as on Linux/Windows; native Apple Silicon container performance is now within 5% of bare metal.
- 07 · Tool · MCP host (agent workflows) · claude-desktop
Claude Desktop is the native macOS MCP host with the strictest spec implementation. Pairs with MCP servers (filesystem, git, search) to give agentic workflows a polished native UI. Use Claude Desktop alongside Open WebUI — different roles.
Why Apple Silicon is no longer second-class
Three things changed through 2025-2026 that make this stack a serious option for the first time:
MLX-LM caught up. Through 2024 the consensus was “llama.cpp Metal beats MLX on Apple Silicon for everything except long-context.” Through 2025 MLX closed the throughput gap on short context and extended its long-context lead. As of May 2026, MLX-LM matches or exceeds llama.cpp Metal across the workloads most users care about.
Thunderbolt 5 + macOS 26.2 RDMA shipped. On M4 Pro+ hardware running macOS 26.2, Thunderbolt 5 cables carry RDMA — Remote Direct Memory Access — between Macs at near-PCIe speeds. Inter-device latency for tensor parallel dropped by ~99% compared to the pre-RDMA path. That single change made consumer-Mac clusters credible for serving frontier-class models.
Exo matured. The auto-discovery LAN clustering tool that sits on top of MLX is now stable enough to rely on. As of the May 2026 release, DeepSeek V3 671B runs at 5.37 tok/s on 8x M4 Pro Mac Minis — slower than a datacenter cluster, but on hardware most serious developers can actually buy.
Step-by-step setup (single Mac)
1. Install MLX-LM as the inference engine
# Install MLX-LM (uv is fastest if you have it; plain pip works too)
uv tool install mlx-lm
# or: pip install mlx-lm
# Pull and serve a coding model in MLX 4-bit quant
mlx_lm.server \
  --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit \
  --port 8000 \
  --host 127.0.0.1

The MLX server exposes an OpenAI-compatible /v1 endpoint on the port you pick. First load downloads the model (~18GB for the 4-bit quant) and warms the Metal kernels — expect 30-60 seconds before first token on cold start.
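A quick smoke test against that endpoint, assuming the server above is running on port 8000 (the model field must match the name you passed to mlx_lm.server):

# Ask the MLX server for a completion via the OpenAI-compatible route
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Write hello world in Swift."}],
    "max_tokens": 64
  }'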
2. Add Ollama for ad-hoc model swaps
# Install Ollama natively (uses llama.cpp under the hood)
brew install ollama
# Start the Ollama server, then pull a smaller chat model to
# complement the MLX coding model
ollama serve &
ollama pull qwen3:14b
# Verify both runtimes alive on different ports
curl http://localhost:8000/v1/models # MLX-LM (Qwen Coder 32B)
curl http://localhost:11434/api/tags # Ollama (Qwen 3 14B)

Ollama and MLX-LM coexist on the same Mac: each process holds its own model weights, listens on its own port, and Metal schedules GPU work across both. Total memory under load: ~26GB unified; once Ollama unloads its idle model, that drops back to ~14GB. The 64GB M3 Max has comfortable headroom for both plus your normal workflow.
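To confirm Ollama actually generates rather than just listing models, hit its native /api/generate endpoint; with "stream": false it returns a single JSON object:

# One-shot generation through Ollama's native API
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:14b",
  "prompt": "Explain unified memory on Apple Silicon in one sentence.",
  "stream": false
}'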
3. Wire Open WebUI as the chat frontend
# Run Open WebUI in Docker Desktop on Apple Silicon (native ARM)
docker run -d --name open-webui \
  -p 3000:8080 \
  --restart unless-stopped \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
  -e OPENAI_API_KEYS="any-string" \
  -e ENABLE_OLLAMA_API=true \
  -e OLLAMA_BASE_URLS="http://host.docker.internal:11434" \
  ghcr.io/open-webui/open-webui:latest

Native ARM containers on Apple Silicon Docker Desktop now run within ~5% of bare-metal performance. Open WebUI sees both backends, and the model switcher works the same as on Linux/Windows.
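A minimal liveness check, assuming the container name and port mapping above (recent Open WebUI builds expose a /health route; if yours predates it, fetching / works too):

# Confirm the container is up and the UI answers
docker ps --filter name=open-webui
curl -s http://localhost:3000/health
# Backend connection errors surface in the container logs
docker logs open-webui --tail 50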
4. Add Claude Desktop with MCP for agentic workflows
# Install Claude Desktop (via Mac App Store or direct download)
# Then edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/you/projects"]
    },
    "git": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-git", "--repository", "/Users/you/projects/main-repo"]
    }
  }
}

Claude Desktop fills a different role from Open WebUI: agentic workflows that need filesystem and git access through MCP. Restart Claude Desktop after editing the config; the MCP servers launch on app startup.
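Before restarting Claude Desktop, you can launch one of the servers by hand to catch npx or path problems early. The server speaks MCP over stdio, so after its startup message it sits waiting for a client; Ctrl-C to exit:

# Dry-run the filesystem MCP server outside Claude Desktop
npx -y @modelcontextprotocol/server-filesystem /Users/you/projects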
Multi-Mac clustering with Exo
The single-Mac stack runs comfortably up to ~70B models in MLX-4bit. To go larger — DeepSeek V3, Llama 3.1 405B class — you need a cluster. Exo turns 2-8 Apple Silicon devices on the same LAN into one logical inference target.
# Install on every Mac in the cluster
brew install exo
# On the Mac you'll use as the entry point, just run:
exo
# Exo auto-discovers other Macs on the LAN running exo and
# shards model layers via pipeline parallel. With Thunderbolt 5
# RDMA enabled (macOS 26.2 + M4 Pro+), the inter-device latency
# is near-PCIe — 8x M4 Pro Mac Minis run DeepSeek V3 671B at
# 5.37 tok/s, which is genuinely usable.

See /systems/distributed-inference for the architectural depth on what's actually happening when Exo shards a model across machines, and the conditions under which Thunderbolt 5 RDMA pays for the extra hardware (it usually does for 70B+ models that don't fit a single Mac; it usually doesn't for smaller models that already fit).
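Once the cluster is up, it looks like any other OpenAI-compatible endpoint. A sketch, assuming Exo's default API port in recent releases (52415; adjust if your build differs) and an illustrative model identifier:

# Query the cluster from any machine on the LAN
# (52415 is assumed as Exo's default API port; model name is illustrative)
curl -s http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "ping"}]
  }'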
Failure modes you'll hit
- Metal kernel cold start. First inference after a fresh model load takes 30-60 seconds longer than expected. The Metal compiler is JIT-compiling kernels on first dispatch. Subsequent calls are fast. Pre-warm by sending a 10-token prompt at server startup (see the sketch after this list).
- Activity Monitor shows GPU at 100% but tok/s is low. Almost always thermal throttling on a chassis without active cooling (MacBook Pro sustained workload). Plug into power, lift the laptop off the desk for airflow, or move to a Mac Studio / Mac Mini for sustained inference.
- Exo doesn't auto-discover other Macs. Multicast DNS is blocked by some routers. Fix: point Exo at peer IPs explicitly with exo --discovery=manual --peers=192.168.1.5,192.168.1.6.
- Thunderbolt 5 RDMA falls back to non-RDMA. One node on macOS 26.1 silently downgrades the cluster. Verify every node shows RDMA enabled: system_profiler SPThunderboltDataType | grep RDMA.
- MLX-LM doesn't support a quant format you need. MLX has its own quant format (mlx-community/*-4bit); GGUF support is limited to specific architectures. If a model is only available as GGUF, run it via Ollama instead.
- Open WebUI Docker Desktop high CPU on idle. Apple Silicon Docker Desktop can pin a CPU core at 10-20% even with no containers running. Limit Docker Desktop to 4 CPU cores under Preferences → Resources → Advanced; the battery savings are noticeable.
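The pre-warm from the first failure mode above, as a minimal sketch using the single-Mac setup's model and port: poll until the server binds, then burn a 10-token prompt so the Metal JIT cost lands before real traffic.

# prewarm.sh: start the server, wait for it, send a throwaway prompt
mlx_lm.server --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit --port 8000 &
until curl -s -o /dev/null http://127.0.0.1:8000/v1/models; do sleep 2; done
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit",
       "messages": [{"role": "user", "content": "warmup"}],
       "max_tokens": 10}' > /dev/null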
Variations and alternatives
llama.cpp instead of MLX-LM. If you need GGUF compatibility (sharing models with Linux/Windows users) or a model MLX hasn't picked up, swap the inference engine. The rest of the stack (Ollama, Open WebUI, Claude Desktop, Exo) stays the same — Exo can drive llama.cpp via its OpenAI-compatible bridge.
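A sketch of that swap, with an illustrative GGUF path; llama.cpp's llama-server exposes the same OpenAI-compatible /v1 surface, so keeping the port the same means the rest of the stack doesn't change:

# llama.cpp server as a drop-in replacement (GGUF path is illustrative)
brew install llama.cpp
llama-server -m ~/models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  --port 8000 --host 127.0.0.1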
M4 Pro / M4 Max instead of M3 Max. Pick M4 Pro / M4 Max if you plan to cluster — Thunderbolt 5 RDMA only works on those generations. For single-Mac use, M3 Max delivers ~90% of the throughput at lower cost.
Cross-platform homelab variation. If your stack mixes Apple Silicon and a Linux GPU box, see the RTX 4090 workstation stack for the GPU side. Both stacks expose OpenAI-compatible endpoints; a single Open WebUI instance can show models from both as siblings.
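Open WebUI takes multiple OpenAI-compatible backends as a semicolon-separated list, which is how one instance fronts both machines. A sketch; the Linux box's address below is a placeholder:

# One frontend, two stacks (192.168.1.20 is a placeholder for the Linux box)
docker run -d --name open-webui -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1;http://192.168.1.20:8000/v1" \
  -e OPENAI_API_KEYS="any-string;any-string" \
  ghcr.io/open-webui/open-webui:latest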
Coding-agent specialisation. If your workload is mostly autonomous coding, see the dedicated local coding-agent stack — same MLX-LM engine but specialised around OpenHands + Mem0 + git-MCP rather than a generalist Open WebUI surface.
Going deeper
- Apple M3 Max catalog entry — unified-memory characteristics, GPU core scaling, thermal envelope under sustained load.
- MLX-LM catalog entry — the Apple-native inference path with quant format details and architecture coverage.
- Exo catalog entry — the multi-Mac clustering layer, including the Thunderbolt 5 RDMA prerequisite and how to verify it's active.
- /systems/distributed-inference — protocol-engineering depth on what happens when Exo shards a model across machines, and the latency math that determines whether the cluster pays for itself.
- Inference runtime ecosystem map — where MLX-LM and Ollama sit relative to vLLM / SGLang / llama.cpp and the broader landscape.