Build a Mac-native AI stack (May 2026)
A Mac-native local AI stack that takes full advantage of unified memory and (optionally) scales across multiple Macs via Thunderbolt 5: 32B-class models run comfortably on a single Mac, frontier-class models across a cluster.
- 01 · Hardware · Compute (Apple Silicon GPU + unified memory) · apple-m3-max
M3 Max 64GB is the single-Mac sweet spot in May 2026 — 36 GPU cores, 400 GB/s memory bandwidth, all 64GB addressable as VRAM. M4 Pro / M4 Max win on Thunderbolt 5 RDMA for clustering; for single-Mac use, M3 Max delivers 90% of M4 Max throughput at lower price.
- 02 · Tool · Inference engine (Apple-native) · mlx-lm
MLX-LM over llama.cpp on M-series silicon: matched throughput on short context, ~15-25% faster on long context (32K+), and the path that pairs with Exo for cluster scaling. Use llama.cpp when you need GGUF quants MLX hasn't picked up yet.
- 03 · Tool · Model-swap layer (ad-hoc experimentation) · ollama
Ollama on Mac wraps llama.cpp under the hood and runs alongside MLX-LM for the 'pull a new model right now' workflow. It fills a different role than MLX-LM (llama.cpp wrapper vs. the Apple-native path), and the two stay live on different ports.
- 04 · Tool · Distributed serving (multi-Mac cluster) · exo
Exo is what makes multi-Mac credible in 2026: auto-discovers nearby Apple Silicon devices on the LAN, shards models across them via pipeline parallel on top of MLX. Thunderbolt 5 + macOS 26.2 RDMA cuts inter-device latency by ~99%, turning consumer-Mac clusters into a real serving option.
- 05 · Model · Coding model (single-Mac primary) · qwen-2.5-coder-32b-instruct
Qwen 2.5 Coder 32B in MLX-4bit quant runs comfortably on a 64GB M3 Max with room for 32K context. Beats DeepSeek Coder V2 Lite on coding benchmarks at the same memory footprint.
- 06 · Tool · Chat frontend · openwebui
Open WebUI runs in Docker Desktop or directly via npm; talks to MLX-LM's OpenAI-compatible bridge. Same multi-user ergonomics as on Linux/Windows; native Apple Silicon container performance is now within 5% of bare metal.
- 07 · Tool · MCP host (agent workflows) · claude-desktop
Claude Desktop is the native macOS MCP host with the strictest spec implementation. Pairs with MCP servers (filesystem, git, search) to give agentic workflows a polished native UI. Use Claude Desktop alongside Open WebUI — different roles.
Why Apple Silicon is no longer second-class
Three things changed through 2025-2026 that make this stack a serious option for the first time:
MLX-LM caught up. Through 2024 the consensus was “llama.cpp Metal beats MLX on Apple Silicon for everything except long-context.” Through 2025 MLX closed the throughput gap on short context and extended its long-context lead. As of May 2026, MLX-LM matches or exceeds llama.cpp Metal across the workloads most users care about.
Thunderbolt 5 + macOS 26.2 RDMA shipped. On M4 Pro+ hardware running macOS 26.2, Thunderbolt 5 cables carry RDMA — Remote Direct Memory Access — between Macs at near-PCIe speeds. Inter-device latency for tensor parallel dropped by ~99% compared to the pre-RDMA path. That single change made consumer-Mac clusters credible for serving frontier-class models.
Exo matured. The auto-discovery LAN clustering tool that sits on top of MLX is now stable enough to rely on. As of the May 2026 release, DeepSeek V3 671B runs at 5.37 tok/s on 8x M4 Pro Mac Minis — slower than a datacenter cluster, but on hardware most serious developers can actually buy.
Step-by-step setup (single Mac)
1. Install MLX-LM as the inference engine
# Install MLX-LM (uv is fastest if you have it; plain pip works too)
uv tool install mlx-lm
# or: pip install mlx-lm
# Pull and serve a coding model in MLX 4-bit quant
mlx_lm.server \
  --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit \
  --port 8000 \
  --host 127.0.0.1

The MLX server exposes an OpenAI-compatible /v1 endpoint on the port you pick. First load downloads the model (~18GB for the 4-bit quant) and warms the Metal kernels — expect 30-60 seconds before first token on cold start.
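A quick smoke test against that endpoint, assuming the server above is running on port 8000 (the model field must match the name you passed to mlx_lm.server):

# Ask the MLX server for a completion via the OpenAI-compatible route
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit",
    "messages": [{"role": "user", "content": "Write hello world in Swift."}],
    "max_tokens": 64
  }'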
2. Add Ollama for ad-hoc model swaps
# Install Ollama natively (uses llama.cpp under the hood)
brew install ollama
# Start the Ollama server, then pull a smaller chat model to
# complement the MLX coding model
ollama serve &
ollama pull qwen3:14b
# Verify both runtimes alive on different ports
curl http://localhost:8000/v1/models # MLX-LM (Qwen Coder 32B)
curl http://localhost:11434/api/tags # Ollama (Qwen 3 14B)

Ollama and MLX-LM coexist on the same Mac: each process holds its own model weights, listens on its own port, and Metal schedules GPU work across both. Total memory under load: ~26GB unified; once Ollama unloads its idle model, that drops back to ~14GB. The 64GB M3 Max has comfortable headroom for both plus your normal workflow.
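To confirm Ollama actually generates rather than just listing models, hit its native /api/generate endpoint; with "stream": false it returns a single JSON object:

# One-shot generation through Ollama's native API
curl -s http://localhost:11434/api/generate -d '{
  "model": "qwen3:14b",
  "prompt": "Explain unified memory on Apple Silicon in one sentence.",
  "stream": false
}'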
3. Wire Open WebUI as the chat frontend
# Run Open WebUI in Docker Desktop on Apple Silicon (native ARM)
docker run -d --name open-webui \
  -p 3000:8080 \
  --restart unless-stopped \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
  -e OPENAI_API_KEYS="any-string" \
  -e ENABLE_OLLAMA_API=true \
  -e OLLAMA_BASE_URLS="http://host.docker.internal:11434" \
  ghcr.io/open-webui/open-webui:latest

Native ARM containers on Apple Silicon Docker Desktop now run within ~5% of bare-metal performance. Open WebUI sees both backends, and the model switcher works the same as on Linux/Windows.
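A minimal liveness check, assuming the container name and port mapping above (recent Open WebUI builds expose a /health route; if yours predates it, fetching / works too):

# Confirm the container is up and the UI answers
docker ps --filter name=open-webui
curl -s http://localhost:3000/health
# Backend connection errors surface in the container logs
docker logs open-webui --tail 50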
4. Add Claude Desktop with MCP for agentic workflows
# Install Claude Desktop (via Mac App Store or direct download)
# Then edit ~/Library/Application Support/Claude/claude_desktop_config.json:
{
  "mcpServers": {
    "filesystem": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-filesystem", "/Users/you/projects"]
    },
    "git": {
      "command": "npx",
      "args": ["-y", "@modelcontextprotocol/server-git", "--repository", "/Users/you/projects/main-repo"]
    }
  }
}

Claude Desktop fills a different role from Open WebUI: agentic workflows that need filesystem and git access through MCP. Restart Claude Desktop after editing the config; the MCP servers launch on app startup.
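Before restarting Claude Desktop, you can launch one of the servers by hand to catch npx or path problems early. The server speaks MCP over stdio, so after its startup message it sits waiting for a client; Ctrl-C to exit:

# Dry-run the filesystem MCP server outside Claude Desktop
npx -y @modelcontextprotocol/server-filesystem /Users/you/projects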
Multi-Mac clustering with Exo
The single-Mac stack runs comfortably up to ~70B models in MLX-4bit. To go larger — DeepSeek V3, Llama 3.1 405B class — you need a cluster. Exo turns 2-8 Apple Silicon devices on the same LAN into one logical inference target.
# Install on every Mac in the cluster
brew install exo
# On the Mac you'll use as the entry point, just run:
exo
# Exo auto-discovers other Macs on the LAN running exo and
# shards model layers via pipeline parallel. With Thunderbolt 5
# RDMA enabled (macOS 26.2 + M4 Pro+), the inter-device latency
# is near-PCIe — 8x M4 Pro Mac Minis run DeepSeek V3 671B at
# 5.37 tok/s, which is genuinely usable.

See /systems/distributed-inference for the architectural depth on what's actually happening when Exo shards a model across machines, and the conditions under which Thunderbolt 5 RDMA pays for the extra hardware (it usually does for 70B+ models that don't fit a single Mac; it usually doesn't for smaller models that already fit).
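Once the cluster is up, it looks like any other OpenAI-compatible endpoint. A sketch, assuming Exo's default API port in recent releases (52415; adjust if your build differs) and an illustrative model identifier:

# Query the cluster from any machine on the LAN
# (52415 is assumed as Exo's default API port; model name is illustrative)
curl -s http://localhost:52415/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-v3",
    "messages": [{"role": "user", "content": "ping"}]
  }'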
Failure modes you'll hit
- Metal kernel cold start. First inference after a fresh model load takes 30-60 seconds longer than expected. The Metal compiler is JIT-compiling kernels on first dispatch. Subsequent calls are fast. Pre-warm by sending a 10-token prompt at server startup (see the sketch after this list).
- Activity Monitor shows GPU at 100% but tok/s is low. Almost always thermal throttling on a chassis without active cooling (MacBook Pro sustained workload). Plug into power, lift the laptop off the desk for airflow, or move to a Mac Studio / Mac Mini for sustained inference.
- Exo doesn't auto-discover other Macs. Multicast DNS is blocked by some routers. Fix: point Exo at peer IPs explicitly with exo --discovery=manual --peers=192.168.1.5,192.168.1.6.
- Thunderbolt 5 RDMA falls back to non-RDMA. One node on macOS 26.1 silently downgrades the cluster. Verify every node shows RDMA enabled: system_profiler SPThunderboltDataType | grep RDMA.
- MLX-LM doesn't support a quant format you need. MLX has its own quant format (mlx-community/*-4bit); GGUF support is limited to specific architectures. If a model is only available as GGUF, run it via Ollama instead.
- Open WebUI Docker Desktop high CPU on idle. Apple Silicon Docker Desktop can pin a CPU core at 10-20% even with no containers running. Limit Docker Desktop to 4 CPU cores under Preferences → Resources → Advanced; the battery savings are noticeable.
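The pre-warm from the first failure mode above, as a minimal sketch using the single-Mac setup's model and port: poll until the server binds, then burn a 10-token prompt so the Metal JIT cost lands before real traffic.

# prewarm.sh: start the server, wait for it, send a throwaway prompt
mlx_lm.server --model mlx-community/Qwen2.5-Coder-32B-Instruct-4bit --port 8000 &
until curl -s -o /dev/null http://127.0.0.1:8000/v1/models; do sleep 2; done
curl -s http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "mlx-community/Qwen2.5-Coder-32B-Instruct-4bit",
       "messages": [{"role": "user", "content": "warmup"}],
       "max_tokens": 10}' > /dev/null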
Variations and alternatives
llama.cpp instead of MLX-LM. If you need GGUF compatibility (sharing models with Linux/Windows users) or a model MLX hasn't picked up, swap the inference engine. The rest of the stack (Ollama, Open WebUI, Claude Desktop, Exo) stays the same — Exo can drive llama.cpp via its OpenAI-compatible bridge.
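A sketch of that swap, with an illustrative GGUF path; llama.cpp's llama-server exposes the same OpenAI-compatible /v1 surface, so keeping the port the same means the rest of the stack doesn't change:

# llama.cpp server as a drop-in replacement (GGUF path is illustrative)
brew install llama.cpp
llama-server -m ~/models/qwen2.5-coder-32b-instruct-q4_k_m.gguf \
  --port 8000 --host 127.0.0.1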
M4 Pro / M4 Max instead of M3 Max. Pick M4 Pro / M4 Max if you plan to cluster — Thunderbolt 5 RDMA only works on those generations. For single-Mac use, M3 Max delivers ~90% of the throughput at lower cost.
Cross-platform homelab variation. If your stack mixes Apple Silicon and a Linux GPU box, see the RTX 4090 workstation stack for the GPU side. Both stacks expose OpenAI-compatible endpoints; a single Open WebUI instance can show models from both as siblings.
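Open WebUI takes multiple OpenAI-compatible backends as a semicolon-separated list, which is how one instance fronts both machines. A sketch; the Linux box's address below is a placeholder:

# One frontend, two stacks (192.168.1.20 is a placeholder for the Linux box)
docker run -d --name open-webui -p 3000:8080 \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1;http://192.168.1.20:8000/v1" \
  -e OPENAI_API_KEYS="any-string;any-string" \
  ghcr.io/open-webui/open-webui:latest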
Coding-agent specialisation. If your workload is mostly autonomous coding, see the dedicated local coding-agent stack — same MLX-LM engine but specialised around OpenHands + Mem0 + git-MCP rather than a generalist Open WebUI surface.
Going deeper
- Apple M3 Max catalog entry — unified-memory characteristics, GPU core scaling, thermal envelope under sustained load.
- MLX-LM catalog entry — the Apple-native inference path with quant format details and architecture coverage.
- Exo catalog entry — the multi-Mac clustering layer, including the Thunderbolt 5 RDMA prerequisite and how to verify it's active.
- /systems/distributed-inference — protocol-engineering depth on what happens when Exo shards a model across machines, and the latency math that determines whether the cluster pays for itself.
- Inference runtime ecosystem map — where MLX-LM and Ollama sit relative to vLLM / SGLang / llama.cpp and the broader landscape.