Local AI on macOS — Apple Silicon operator's guide (May 2026)
The honest macOS local-AI manual. Apple Silicon unified-memory advantage, MLX vs llama.cpp Metal vs Ollama Metal, model size by RAM tier (16/32/64/128/192 GB), Apple Neural Engine reality, MacBook vs Mac Studio thermal architecture.
Why Apple Silicon is the per-watt local-AI leader
The single architectural fact that makes Apple Silicon special for local AI is unified memory. On a discrete-GPU PC, your VRAM is the constraint and 32 GB or 48 GB is the high end. On a Mac Studio M3 Ultra you can spec 192 GB of unified memory, and most of it is available to the GPU. That single number — usable GPU memory — is why a Mac Studio can comfortably run a 200B-class MoE model in a desktop chassis, near-silent, at roughly 370 W, while a comparable PC build needs four datacenter-class GPUs.
The other architectural facts that matter:
- Memory bandwidth scales with the chip tier: M4 Pro at 273 GB/s, M4 Max at 546 GB/s, M3 Ultra at ~819 GB/s. LLM decode is bandwidth-bound, so the high-bandwidth chips are the ones that feel fast (a rough roofline sketch follows this list).
- Metal is well-supported by all the runtimes that matter: Ollama, llama.cpp, MLX, LM Studio. The Metal compute path on macOS is the production path; you don't end up on CPU-fallback unless something is misconfigured.
- The thermal envelope is generous on Mac Studio and Mac mini, tighter on MacBook Pro and tightest on MacBook Air. That governs sustained-load throughput more than peak.
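A back-of-envelope way to see why bandwidth dominates decode: each generated token streams roughly the full (active) weight footprint through memory, so tokens per second is capped by bandwidth divided by model bytes. A minimal sketch with illustrative placeholder numbers (546 GB/s and a ~40 GB Q4 70B-class model are examples, not measurements):

```sh
# Decode roofline: tok/s ceiling ~= memory bandwidth / bytes read per token
awk 'BEGIN { bw_gbs = 546; model_gb = 40;
             printf "~%.0f tok/s theoretical ceiling\n", bw_gbs / model_gb }'
```

Real throughput lands well below the ceiling, but the ratio is why the high-bandwidth chips feel different in interactive use.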
macOS architecture — unified memory, Metal, ANE
The three compute paths on Apple Silicon, with the operator-grade status:
- GPU via Metal: where every serious local-LLM runtime runs. MLX, llama.cpp, Ollama all use Metal compute under the hood. This is the workhorse.
- CPU: fallback. llama.cpp will run on the performance cores using ARM SIMD if Metal isn't set up. Throughput is meaningfully lower; you only land here if something's broken.
- Apple Neural Engine (ANE): 38 TOPS on the M4 family, roughly half that on M3. Apple uses it for system features and Apple Intelligence. For third-party LLMs, ANE access goes through Core ML, and full-model ANE residency for decoder-only LLMs is uncommon. The honest read: ANE is not where you run your LLM in 2026; the GPU is. See the dedicated section below.
The unified-memory bit is worth one more sentence: there's no PCIe transfer between “system memory” and “GPU memory” on Apple Silicon, because they're the same RAM. That eliminates a class of latency and copy-cost problems that discrete-GPU systems live with.
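A quick way to confirm you are actually on the Metal path rather than CPU fallback: sample the SoC while a generation is running. This is an observation trick, not a configuration step, and assumes nothing beyond the built-in powermetrics tool:

```sh
# Start a generation in another terminal, then sample for ~5 seconds.
# A busy GPU confirms the Metal path; a hot CPU with an idle GPU means
# the runtime fell back to CPU and something is misconfigured.
sudo powermetrics --samplers cpu_power,gpu_power -i 1000 -n 5
```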
Apple Silicon vs Intel Mac — the honest split
If you have an Intel Mac in 2026, your local-AI options are narrower than you might expect. AMD GPUs in older Mac Pros run llama.cpp via OpenCL; sustained throughput is poor compared to Apple Silicon at the same retail tier. Most of the modern macOS local-AI ecosystem (MLX especially) is Apple-Silicon-only. The honest answer for Intel Mac users: run Ollama or LM Studio with CPU-only inference for small models, or treat your machine as a thin client to a Linux home server. The Apple-Silicon migration is the prerequisite for serious macOS local-AI work.
MLX vs llama.cpp Metal vs Ollama Metal — runtime workflow
The three runtimes that matter on macOS, with the picker:
- MLX-LM: Apple's first-party array framework + LLM library. Best throughput on Apple Silicon for many configurations, very active maintenance, model checkpoints in MLX 4-bit / 8-bit format. Pick for: maximum performance, Apple-aligned development, shared iOS/iPadOS toolchain via MLX Swift.
- llama.cpp with Metal backend: GGUF-format models, the largest model zoo of any runtime, mature Metal compute path. Pick for: maximum model choice, cross-platform parity with your Linux deployment, you already know GGUF.
- Ollama: native macOS app + CLI, llama.cpp under the hood, OpenAI-compatible API. Pick for: lowest-friction setup, GUI users, scripting against a stable API surface that's identical across platforms.
A typical operator-grade workflow on a Mac Studio: Ollama for quick chat, llama.cpp directly when you need a model not in the Ollama registry, MLX-LM when you're benchmarking the absolute ceiling. LM Studio is the fourth option and is fine — it's a polished GUI on top of llama.cpp.
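Minimal launch commands for the three runtimes, assuming default installs. The model names and paths are illustrative examples, not recommendations:

```sh
# Ollama: pull and chat in one line (model tag is an example)
ollama run llama3.1:8b

# llama.cpp built from source: GGUF model, all layers offloaded to Metal
./build/bin/llama-cli -m ~/models/qwen2.5-32b-instruct-q4_k_m.gguf \
  -ngl 99 -n 256 -p "Explain unified memory in one paragraph."

# MLX-LM (pip install mlx-lm): 4-bit MLX checkpoint from the mlx-community org
mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
  --prompt "Explain unified memory in one paragraph."
```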
Model size by RAM tier — 16 / 32 / 64 / 128 / 192 GB
The honest sizing guide for Apple Silicon, single-stream Q4 / 4-bit unless noted. Note that macOS reserves a meaningful chunk of unified memory for the OS, so you should not plan on the entire RAM number being available for the model (a rough fit-check sketch follows the list below).
- 16 GB (MacBook Air M3 / M4, MacBook Pro M4 base): 7B-class Q4 (~4 GB weights + KV) is the comfortable ceiling. 13B Q4 loads but leaves little headroom for anything else.
- 32 GB (MacBook Pro M4 Pro): 13B-class Q4 comfortable; 30-32B Q4 (~17-19 GB) loads with patience and short-context discipline.
- 64 GB (MacBook Pro M4 Max, Mac Studio M4 Max): 32B Q4 comfortable, 70B Q4 (~40 GB) loads with the right quant choice and short context. The first tier where 70B is real.
- 128 GB (MacBook Pro M4 Max max-spec, Mac Studio M3 Ultra mid-spec): 70B Q4 comfortable; 100B+ MoE models (Mixtral 8x22B at Q4 ~70 GB) become viable; long context on 70B models becomes practical.
- 192 GB (Mac Studio M3 Ultra max-spec): 200B-class MoE models comfortable. The current ceiling for desktop local AI in a non-rack form factor. See mac-studio-m3-ultra-192gb combo for the operator-grade tradeoff analysis.
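The fit-check mentioned above, as a rough sketch. The 0.55 GB per billion parameters for Q4 weights and the 70% usable-RAM budget are rules of thumb under these assumptions, not guarantees:

```sh
# Will a Q4 model with PARAMS_B billion parameters plausibly fit on this Mac?
PARAMS_B=70
awk -v p="$PARAMS_B" -v ram="$(sysctl -n hw.memsize)" 'BEGIN {
  weights = p * 0.55           # ~0.5 bytes/param at Q4 plus quant overhead
  budget  = ram / 1e9 * 0.70   # leave ~30% for macOS, KV cache, other apps
  printf "weights ~%.0f GB, usable budget ~%.0f GB -> %s\n",
         weights, budget, (weights < budget ? "plausible fit" : "too big")
}'
```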
Decode throughput varies by model, quant, and chip tier. We don't publish single-figure tok/s estimates here; see the Apple Silicon path for the deeper buying decision and Apple Silicon AI stack for the deployment recipe.
Apple Neural Engine — current LLM status (limited)
Worth its own section because every Mac user asks. The honest state of ANE for third-party LLMs in 2026:
- ANE access goes through Core ML. You convert a model to a Core ML package; Core ML's scheduler decides which ops run on ANE, which on GPU, which on CPU.
- Decoder-only transformer LLMs are not ANE-friendly in the way convolutional vision models are. KV-cache attention patterns and dynamic shapes don't fit ANE's execution model cleanly.
- Apple Intelligence's on-device 3B model uses ANE because Apple has tooling and quantization paths the rest of us don't. Their numbers are not generally reproducible with third-party model conversions.
- Practical advice: don't plan a third-party LLM project around ANE in 2026. Plan around Metal-via-MLX or Metal-via-llama.cpp, and treat any ANE acceleration as a future bonus.
MacBook Pro vs Mac Studio vs Mac mini thermals
Throughput at peak is one number; throughput under sustained load is another. The thermal architecture across the Mac lineup governs that gap (the sketch after this list shows one way to measure it):
- Mac Studio: best sustained throughput. Active cooling with generous headroom; runs 70B+ inference for hours without throttling at audible-but-not-loud noise levels. The production-grade local-AI Mac.
- Mac mini M4 / M4 Pro: solid sustained throughput. Smaller chassis, smaller fan, but the chip tiers it ships with stay within a comfortable thermal envelope. Excellent price/performance for 7-32B-class workloads.
- MacBook Pro M4 Max: peak throughput is great, sustained throughput drops 20-35% after 5-15 minutes of full load on battery, less on AC. The chassis is thermally constrained relative to a Mac Studio. Fine for development; don't plan on it as a 24/7 inference server.
- MacBook Air: passive cooling. Short-burst inference is fine; sustained inference throttles aggressively. Treat as “will it run” territory, not deployment.
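One way to see the peak-versus-sustained gap on your own hardware: log decode throughput at intervals over half an hour or so of repeated load. This assumes Ollama is serving locally, jq is installed (brew install jq), and the model tag is a placeholder:

```sh
MODEL="llama3.1:8b"   # placeholder tag
for i in $(seq 1 12); do
  # stream=false makes Ollama return eval_count / eval_duration in one JSON blob
  curl -s http://localhost:11434/api/generate \
    -d "{\"model\":\"$MODEL\",\"prompt\":\"Write 300 words on memory bandwidth.\",\"stream\":false}" \
    | jq -r '"\(now | todate)  \((.eval_count / (.eval_duration / 1e9)) | floor) tok/s"'
  sleep 120
done
```

A Mac Studio should trace a flat line; a thermally constrained laptop under full load will usually show the drop described above.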
macOS vs Linux vs Windows — comparison for local AI
The honest cross-platform tier ranking:
- Linux: production-default. Every runtime supported; every deployment pattern documented. See /systems/linux-local-ai.
- macOS Apple Silicon: the per-watt and unified-memory leader. Best-in-class for very-large-model inference in a non-rack form factor. Less production server tooling than Linux.
- Windows + WSL2: viable development environment. See /systems/windows-local-ai.
- Windows native: hobbyist tier. Ollama and LM Studio are excellent; serious server-style deployment belongs on Linux or WSL2.
The macOS-specific argument: if you want to run a 100-200B-class MoE model on a desktop without building a multi-GPU rack, Mac Studio M3 Ultra at 192 GB is the cleanest path. No PC build at comparable money fits that envelope without four datacenter GPUs.
Common failure modes
- Wired memory pressure tanks throughput. macOS keeps non-AI processes resident; large models compete for unified memory. Quit memory-heavy apps (Chrome with 80 tabs, Docker Desktop, Photos library) before serious inference runs.
- Ollama or llama.cpp picks CPU instead of Metal. Build mismatch. Ollama's native macOS installer uses Metal automatically; if you built llama.cpp from source, ensure the Metal backend is enabled at build time (LLAMA_METAL=1 for older Makefile builds; recent CMake builds enable Metal by default on Apple Silicon).
- MLX model fails to load with a quantization mismatch. MLX 4-bit and llama.cpp Q4 are not interchangeable formats. Use the checkpoint that matches the runtime you're using.
- Sleep-wake cycles break long-running inference. macOS power management pauses Metal compute on sleep. For production-style runs, use caffeinate -i or disable sleep in System Settings → Energy Saver (see the sketch after this list).
- Spotlight indexing slows model load. Spotlight sometimes indexes large model files and can hammer the disk for hours. Add the model directory to Spotlight's Search Privacy list.
- Time Machine backs up the model cache. 192 GB of model files don't need to be in your Time Machine. Exclude the Hugging Face cache and your Ollama models directory.
- FileVault encryption drops cold-start performance. FileVault is fast, but loading a 70 GB model from cold cache on a FileVault volume is measurably slower than from an unencrypted external drive. For serious work, an external NVMe Thunderbolt drive without encryption is a real speedup.
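A minimal mitigation pass for the sleep, Spotlight, and Time Machine items above. The paths are the common defaults and may differ on your machine:

```sh
# Keep a long run alive through idle sleep (caffeinate ends when the command exits)
caffeinate -i ollama run llama3.1:70b        # model tag is an example

# Keep model caches out of Time Machine backups
tmutil addexclusion ~/.ollama/models
tmutil addexclusion ~/.cache/huggingface

# Spotlight: per-folder exclusions live in System Settings → Spotlight →
# Search Privacy; there is no reliable per-folder CLI, so set this in the GUI.
```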
Production-grade patterns on macOS
When a Mac Studio is your local-AI server (a real pattern in 2026):
- Run Ollama as a launchd service; the official installer ships this configuration.
- Set OLLAMA_HOST=0.0.0.0:11434 via the launchd environment so the server is reachable from your Tailscale or local network (see the sketch after this list).
- Use Tailscale or ZeroTier; never expose the Ollama port directly to the internet.
- Caffeinate the machine (System Settings → Energy Saver → never sleep when on AC).
- Pin the macOS version; major macOS releases occasionally break Metal compute paths. Verify in a non-critical window before upgrading.
- Store model weights on an external NVMe Thunderbolt drive — on current Mac Studios the internal SSD is non-replaceable, so the heavy write traffic of a churning model cache is best directed at a drive you can replace.
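A sketch of the remote-access pieces from the list above. The launchctl line follows Ollama's documented macOS approach; the hostname in the final check is a placeholder for your own tailnet name:

```sh
# Make the Ollama app bind to all interfaces, then restart the app
launchctl setenv OLLAMA_HOST "0.0.0.0:11434"

# Never sleep while on AC power (-c applies to the charger profile)
sudo pmset -c sleep 0

# From another machine on the tailnet (placeholder hostname)
curl http://mac-studio.your-tailnet.ts.net:11434/v1/models
```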
Going deeper
- Local AI on Linux — production-default OS guide.
- Local AI on Windows 11 — the Windows operator's manual.
- Apple Silicon path — buying decision walkthrough.
- Apple Silicon AI stack — deployment recipe.
- Mac Studio M3 Ultra 192 GB combo.
- Apple M4 Max, Apple M3 Ultra.
Next step on macOS
Lowest-friction starting point on Apple Silicon.