Metal / Apple Silicon

Metal allocation failed — Apple Silicon OOM under unified memory pressure

Q: How do you fix "Metal allocation failed — Apple Silicon OOM under unified memory pressure"?

**1. Check pressure with vm_stat or Activity Monitor's Memory Pressure graph.** If you're in yellow/red, free RAM first. **2. Drop the model size or quant:** ```bash # 70B Q4 needs ~45 GB unified memory; not viable on 32 GB Macs # Drop to 32B Q4 (~20 GB) or 14B Q4 (~9 GB) ``` **3. Close memory-heavy apps.** Chrome easily eats 4-8 GB. Xcode indexing 2-4 GB. Time Machine + Spotlight indexing 1-2 GB. Disable Spotlight indexing on /Volumes/ if relevant. **4. Set GPU memory limit explicitly in MLX:** ```python import mlx.core as mx mx.metal.set_memory_limit(0.85) # 85% of physical RAM ceiling ``` **5. Use `caffeinate` for long jobs** so macOS doesn't sleep mid-inference: ```bash caffeinate -i mlx_lm.generate --model ... --prompt ... ``` **6. Bigger picture:** Apple Silicon is bandwidth-bound. For sustained workloads, Mac Studio M3 Ultra 192GB or M4 Ultra 256GB is the production-grade target. See [Mac Studio M3 Ultra combo](/hardware-combos/mac-studio-m3-ultra-192gb).

metal::MetalCommandQueue allocation failed or [MPS] OOM

By Fredoline Eruo · Last verified May 7, 2026

Cause

Apple Silicon shares one memory pool between CPU + GPU + Neural Engine. Under pressure (large model + browser + system), the GPU allocator fails even though some RAM is free elsewhere. Causes:

Model size + KV cache exceed practical Mac VRAM ceiling (~70% of physical RAM)
macOS swapping to disk under pressure (kills throughput)
Other processes (Chrome, Xcode indexing, Time Machine) eating unified memory
MLX or llama.cpp Metal not setting GPU memory limit explicitly

Solution

1. Check pressure with vm_stat or Activity Monitor's Memory Pressure graph. If you're in yellow/red, free RAM first.

2. Drop the model size or quant:

# 70B Q4 needs ~45 GB unified memory; not viable on 32 GB Macs
# Drop to 32B Q4 (~20 GB) or 14B Q4 (~9 GB)

3. Close memory-heavy apps. Chrome easily eats 4-8 GB. Xcode indexing 2-4 GB. Time Machine + Spotlight indexing 1-2 GB. Disable Spotlight indexing on /Volumes/ if relevant.

4. Set GPU memory limit explicitly in MLX:

import mlx.core as mx
mx.metal.set_memory_limit(0.85)  # 85% of physical RAM ceiling

5. Use caffeinate for long jobs so macOS doesn't sleep mid-inference:

caffeinate -i mlx_lm.generate --model ... --prompt ...

6. Bigger picture: Apple Silicon is bandwidth-bound. For sustained workloads, Mac Studio M3 Ultra 192GB or M4 Ultra 256GB is the production-grade target. See Mac Studio M3 Ultra combo.

Related errors

Did this fix it?

If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.