System guide · Operating system

Local AI on macOS — Apple Silicon operator's guide (May 2026)

The honest macOS local-AI manual. Apple Silicon unified-memory advantage, MLX vs llama.cpp Metal vs Ollama Metal, model size by RAM tier (16/32/64/128/192 GB), Apple Neural Engine reality, MacBook vs Mac Studio thermal architecture.

By Fredoline Eruo · Last reviewed 2026-05-08

Why Apple Silicon is the per-watt local-AI leader

The single architectural fact that makes Apple Silicon special for local AI is unified memory. On a discrete-GPU PC, your VRAM is the constraint and 32 GB or 48 GB is the high end. On a Mac Studio M3 Ultra you can spec 192 GB of unified memory, and most of it is available to the GPU. That single number — usable GPU memory — is why a Mac Studio can comfortably run a 200B-class MoE model in a desktop chassis at near-silent 370W, and a comparable PC build needs four datacenter-class GPUs.

The other architectural facts that matter:

  • Memory bandwidth scales with the chip tier: M4 Pro at 273 GB/s, M4 Max at 546 GB/s, M3 Ultra at ~819 GB/s. LLM decode is bandwidth-bound; the high-bandwidth chips are the ones that feel fast.
  • Metal is well-supported by all the runtimes that matter: Ollama, llama.cpp, MLX, LM Studio. The Metal compute path on macOS is the production path; you don't end up on CPU-fallback unless something is misconfigured.
  • The thermal envelope is generous on Mac Studio and Mac mini, tighter on MacBook Pro and tightest on MacBook Air. That governs sustained-load throughput more than peak.

macOS architecture — unified memory, Metal, ANE

The three compute paths on Apple Silicon, with the operator-grade status:

  • GPU via Metal: where every serious local-LLM runtime runs. MLX, llama.cpp, Ollama all use Metal compute under the hood. This is the workhorse.
  • CPU: fallback. llama.cpp will run on the performance cores using ARM SIMD if Metal isn't set up. Throughput is meaningfully lower; you only land here if something's broken (a quick way to check is sketched after this list).
  • Apple Neural Engine (ANE): 38 TOPS on the M4 family, lower on M3. Apple uses it for system features and Apple Intelligence. For third-party LLMs, ANE access goes through Core ML, and full-model ANE residency for decoder-only LLMs is uncommon. The honest read: ANE is not where you run your LLM in 2026; the GPU is. See the dedicated section below.
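
A minimal sketch of how to confirm you're on the Metal path rather than CPU fallback, assuming a from-source llama.cpp build; the model path is illustrative and the exact wording of the startup log is an assumption, but the flags and binary name match current llama.cpp releases:

    # Build llama.cpp from source; the Metal backend is enabled by default on Apple Silicon.
    git clone https://github.com/ggml-org/llama.cpp
    cd llama.cpp
    cmake -B build && cmake --build build --config Release -j

    # Run with all layers offloaded to the GPU (-ngl 99) and watch the startup log.
    # Metal initialization messages (lines mentioning ggml_metal_init on current builds)
    # mean you're on the GPU path; layers reported on CPU mean Metal isn't active.
    ./build/bin/llama-cli -m ~/models/model-q4_k_m.gguf -ngl 99 -p "Hello" -n 32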

The unified-memory bit is worth one more sentence: there's no PCIe transfer between “system memory” and “GPU memory” on Apple Silicon, because they're the same RAM. That eliminates a class of latency and copy-cost problems that discrete-GPU systems live with.

Apple Silicon vs Intel Mac — the honest split

If you have an Intel Mac in 2026, your local-AI options are narrower than you might expect. AMD GPUs in older Mac Pros run llama.cpp via OpenCL; sustained throughput is poor compared to Apple Silicon at the same retail tier. Most of the modern macOS local-AI ecosystem (MLX especially) is Apple-Silicon-only. The honest answer for Intel Mac users: run Ollama or LM Studio with CPU-only inference for small models, or treat your machine as a thin client to a Linux home server. The Apple-Silicon migration is the prerequisite for serious macOS local-AI work.

MLX vs llama.cpp Metal vs Ollama Metal — runtime workflow

The three runtimes that matter on macOS, with the picker:

  • MLX-LM: Apple's first-party array framework + LLM library. Best throughput on Apple Silicon for many configurations, very active maintenance, model checkpoints in MLX 4-bit / 8-bit format. Pick for: maximum performance, Apple-aligned development, shared iOS/iPadOS toolchain via MLX Swift.
  • llama.cpp with Metal backend: GGUF-format models, the largest model zoo of any runtime, mature Metal compute path. Pick for: maximum model choice, cross-platform parity with your Linux deployment, you already know GGUF.
  • Ollama: native macOS app + CLI, llama.cpp under the hood, OpenAI-compatible API. Pick for: lowest-friction setup, GUI users, scripting against a stable API surface that's identical across platforms.

A typical operator-grade workflow on a Mac Studio: Ollama for quick chat, llama.cpp directly when you need a model not in the Ollama registry, MLX-LM when you're benchmarking the absolute ceiling. LM Studio is the fourth option and is fine: a polished GUI on top of llama.cpp, with an MLX engine option as well.
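
For orientation, the lowest-friction invocation of each runtime looks roughly like this. Model names and file paths are illustrative, not recommendations, and the flags reflect the current CLIs; check each tool's help for your installed version.

    # Ollama: pulls the model on first use and answers in the terminal;
    # the background service exposes an OpenAI-compatible API on :11434.
    ollama run llama3.1:8b "Explain unified memory in two sentences."

    # llama.cpp: bring your own GGUF file, offload all layers to Metal with -ngl 99.
    ./build/bin/llama-cli -m ~/models/qwen2.5-32b-q4_k_m.gguf -ngl 99 \
      -p "Explain unified memory." -n 128

    # MLX-LM: pip install mlx-lm, then generate from an MLX-format checkpoint.
    mlx_lm.generate --model mlx-community/Meta-Llama-3.1-8B-Instruct-4bit \
      --prompt "Explain unified memory." --max-tokens 128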

Model size by RAM tier — 16 / 32 / 64 / 128 / 192 GB

The honest sizing table for Apple Silicon, single-stream Q4 / 4-bit unless noted. Note that macOS reserves a meaningful chunk of unified memory for the OS and you should not plan on the entire RAM number being available for the model.

  • 16 GB (MacBook Air M3 / M4, MacBook Pro M4 base): 7B-class Q4 (~4 GB weights + KV) is the comfortable ceiling. 13B Q4 loads but leaves little headroom for anything else.
  • 32 GB (MacBook Pro M4 Pro): 13B-class Q4 comfortable; 30-32B Q4 (~17-19 GB) loads with patience and short-context discipline.
  • 64 GB (MacBook Pro M4 Max, Mac Studio M4 Max): 32B Q4 comfortable, 70B Q4 (~40 GB) loads with the right quant choice and short context. The first tier where 70B is real.
  • 128 GB (MacBook Pro M4 Max max-spec, Mac Studio M3 Ultra mid-spec): 70B Q4 comfortable; 100B+ MoE models (Mixtral 8x22B at Q4 ~70 GB) become viable; long context on 70B models becomes practical.
  • 192 GB (Mac Studio M3 Ultra max-spec): 200B-class MoE models comfortable. The current ceiling for desktop local AI in a non-rack form factor. See the mac-studio-m3-ultra-192gb combo for the operator-grade tradeoff analysis.

Decode throughput varies by model, quant, and chip tier. We don't publish single-figure tok/s estimates here; see the Apple Silicon path for the deeper buying decision and Apple Silicon AI stack for the deployment recipe.
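
A back-of-envelope way to sanity-check whether a model fits your tier, under the rough assumptions that Q4-family quants cost about 0.55-0.6 bytes per parameter and that macOS plus your other apps claim 8-12 GB of unified memory for themselves. The numbers below are illustrative arithmetic, not measurements:

    # Approximate fit check for a dense 70B model at Q4 on a 64 GB machine.
    PARAMS_B=70                                   # parameters, in billions
    WEIGHTS_GB=$(echo "$PARAMS_B * 0.58" | bc)    # ~0.58 GB per billion params at Q4_K_M
    echo "weights  ≈ ${WEIGHTS_GB} GB"            # ≈ 40 GB, matching the tier table above
    echo "KV cache ≈ a few GB at moderate context; grows linearly with context length"
    echo "headroom ≈ subtract 8-12 GB for macOS and everything else you keep open"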

Apple Neural Engine — current LLM status (limited)

Worth its own section because every Mac user asks. The honest state of ANE for third-party LLMs in 2026:

  • ANE access goes through Core ML. You convert a model to a Core ML package; Core ML's scheduler decides which ops run on ANE, which on GPU, which on CPU.
  • Decoder-only transformer LLMs are not ANE-friendly in the way convolutional vision models are. KV-cache attention patterns and dynamic shapes don't fit ANE's execution model cleanly.
  • Apple Intelligence's on-device 3B model uses ANE because Apple has tooling and quantization paths the rest of us don't. Their numbers are not generally reproducible with third-party model conversions.
  • Practical advice: don't plan a third-party LLM project around ANE in 2026. Plan around Metal-via-MLX or Metal-via-llama.cpp, and treat any ANE acceleration as a future bonus.

MacBook Pro vs Mac Studio vs Mac mini thermals

Throughput at peak is one number; throughput under sustained load is another. The thermal architecture across the Mac lineup governs that gap (a simple way to measure it on your own machine is sketched after the list):

  • Mac Studio: best sustained throughput. Active cooling with generous headroom; runs 70B+ inference for hours without throttling at audible-but-not-loud noise levels. The production-grade local-AI Mac.
  • Mac mini M4 / M4 Pro: solid sustained throughput. Smaller chassis, smaller fan, but the chip tiers it ships with stay within a comfortable thermal envelope. Excellent price/performance for 7-32B-class workloads.
  • MacBook Pro M4 Max: peak throughput is great, sustained throughput drops 20-35% after 5-15 minutes of full load on battery, less on AC. The chassis is thermally constrained relative to a Mac Studio. Fine for development; don't plan on it as a 24/7 inference server.
  • MacBook Air: passive cooling. Short-burst inference is fine; sustained inference throttles aggressively. Treat as “will it run” territory, not deployment.
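
A crude way to see the peak-versus-sustained gap yourself, assuming Ollama is installed and the (illustrative) model tag below is already pulled: run the same prompt in a loop and watch whether the reported eval rate sags as the chassis warms up.

    # Repeat an identical generation ~20 times; --verbose prints timing stats,
    # including an "eval rate" line, after each run.
    MODEL="llama3.1:8b"   # illustrative; use a model that actually fits your RAM tier
    for i in $(seq 1 20); do
      ollama run "$MODEL" --verbose "Summarize the plot of Macbeth in 200 words." 2>&1 \
        | grep -i "eval rate"
    done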

macOS vs Linux vs Windows — comparison for local AI

The honest cross-platform tier ranking:

  • Linux: production-default. Every runtime supported; every deployment pattern documented. See /systems/linux-local-ai.
  • macOS Apple Silicon: the per-watt and unified-memory leader. Best-in-class for very-large-model inference in a non-rack form factor. Less production server tooling than Linux.
  • Windows + WSL2: viable development environment. See /systems/windows-local-ai.
  • Windows native: hobbyist tier. Ollama and LM Studio are excellent; serious server-style deployment belongs on Linux or WSL2.

The macOS-specific argument: if you want to run a 100-200B-class MoE model on a desktop without building a multi-GPU rack, Mac Studio M3 Ultra at 192 GB is the cleanest path. No PC build at comparable money fits that envelope without four datacenter GPUs.

Common failure modes

  1. Wired memory pressure tanks throughput. macOS keeps non-AI processes resident; large models compete for unified memory. Quit memory-heavy apps (Chrome with 80 tabs, Docker Desktop, Photos library) before serious inference runs.
  2. Ollama or llama.cpp picks CPU instead of Metal. Build mismatch. Ollama's native macOS installer auto-uses Metal; if you built llama.cpp from source, make sure the Metal backend was enabled at build time (it's the default in current CMake builds on Apple Silicon; older Makefile builds needed LLAMA_METAL=1).
  3. MLX model fails to load with quantization mismatch. MLX 4-bit and llama.cpp Q4 are not interchangeable. Use the corresponding model checkpoint for the runtime you're using.
  4. Sleep-wake cycles break long-running inference. macOS power management pauses Metal compute on sleep. For production-style runs, use caffeinate -i or System Settings → Energy Saver.
  5. Spotlight indexing slows model load. Spotlight sometimes indexes large model files and can hammer the disk for hours. Add the model directory to Spotlight Privacy.
  6. Time Machine backs up the model cache. 192 GB of model files don't need to be in your Time Machine. Exclude the Hugging Face cache and your Ollama models directory (commands for items 4-6 are collected after this list).
  7. FileVault encryption drops cold-start performance. FileVault is fast, but loading a 70 GB model from cold cache on a FileVault volume is measurably slower than from an unencrypted external drive. For serious work, an external NVMe Thunderbolt drive without encryption is a real speedup.
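
A minimal sketch of the housekeeping commands behind items 4-6, assuming the default cache locations (~/.ollama/models for Ollama, ~/.cache/huggingface for Hugging Face) and an external volume named ModelDrive; adjust the paths to your setup:

    # Item 4: keep the Mac awake for the duration of a long-running command.
    caffeinate -i ollama serve

    # Item 5: stop Spotlight indexing an external model volume.
    # (For a folder on the internal disk, use System Settings → Spotlight → Privacy instead.)
    sudo mdutil -i off /Volumes/ModelDrive

    # Item 6: keep model caches out of Time Machine backups.
    tmutil addexclusion ~/.ollama/models
    tmutil addexclusion ~/.cache/huggingface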

Production-grade patterns on macOS

When a Mac Studio is your local-AI server (a real pattern in 2026):

  • Run Ollama as a launchd service; the official installer ships this configuration.
  • Set OLLAMA_HOST=0.0.0.0:11434 via the launchd environment so the server is reachable from your Tailscale or local network (a minimal sketch follows this list).
  • Use Tailscale or ZeroTier; never expose the Ollama port directly to the internet.
  • Caffeinate the machine (System Settings → Energy Saver → never sleep when on AC).
  • Pin the macOS version; major macOS releases occasionally break Metal compute paths. Verify in a non-critical window before upgrading.
  • Store model weights on an external NVMe Thunderbolt drive — on current Mac Studios the internal SSD is non-replaceable, so burn-in cycles on the model cache are best directed at a replaceable drive.
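
A minimal sketch of the network-facing pieces, assuming the standard Ollama macOS app and Tailscale already running on both machines; the hostname is a placeholder:

    # Make the Ollama service listen on all interfaces instead of localhost only,
    # then restart the app so it picks up the environment change.
    launchctl setenv OLLAMA_HOST "0.0.0.0:11434"
    osascript -e 'quit app "Ollama"'   # stop the menu-bar app if it's running
    open -a Ollama                     # relaunch; it inherits the new environment

    # From a client on the same tailnet, confirm the API answers
    # (replace the hostname with your Mac Studio's Tailscale name).
    curl http://mac-studio.tailnet-example.ts.net:11434/api/tags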
