Metal allocation failed — Apple Silicon OOM under unified memory pressure
Cause
Apple Silicon shares one memory pool between CPU + GPU + Neural Engine. Under pressure (large model + browser + system), the GPU allocator fails even though some RAM is free elsewhere. Causes:
- Model size + KV cache exceed practical Mac VRAM ceiling (~70% of physical RAM)
- macOS swapping to disk under pressure (kills throughput)
- Other processes (Chrome, Xcode indexing, Time Machine) eating unified memory
- MLX or llama.cpp Metal not setting GPU memory limit explicitly
Solution
1. Check pressure with vm_stat or Activity Monitor's Memory Pressure graph. If you're in yellow/red, free RAM first.
2. Drop the model size or quant:
# 70B Q4 needs ~45 GB unified memory; not viable on 32 GB Macs
# Drop to 32B Q4 (~20 GB) or 14B Q4 (~9 GB)
3. Close memory-heavy apps. Chrome easily eats 4-8 GB. Xcode indexing 2-4 GB. Time Machine + Spotlight indexing 1-2 GB. Disable Spotlight indexing on /Volumes/ if relevant.
4. Set GPU memory limit explicitly in MLX:
import mlx.core as mx
mx.metal.set_memory_limit(0.85) # 85% of physical RAM ceiling
5. Use caffeinate for long jobs so macOS doesn't sleep mid-inference:
caffeinate -i mlx_lm.generate --model ... --prompt ...
6. Bigger picture: Apple Silicon is bandwidth-bound. For sustained workloads, Mac Studio M3 Ultra 192GB or M4 Ultra 256GB is the production-grade target. See Mac Studio M3 Ultra combo.
Related errors
Did this fix it?
If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.