
MLX out of memory — manage Apple Silicon unified memory

MLX OOM on Apple Silicon traces to one of three causes: a model too large for unified memory, a wired-memory limit left at its default, or memory pressure from other apps. macOS reserves ~25-30% of unified memory for the system; the rest is your AI budget.

MLX · MLX-LM · Apple Silicon (M1-M4) · macOS
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Model size exceeds usable unified memory budget

Diagnose

Activity Monitor shows red memory pressure as the model loads. macOS reserves ~25-30% of unified memory for the system and other apps, so on a 64 GB Mac only ~45 GB is usable for AI.

Fix

Use a smaller model or a smaller quant. 70B Q4 (~40 GB) fits a 64 GB Mac with care; 70B Q8 (~75 GB) doesn't fit until 96+ GB. An MLX 4-bit quant gives the tightest memory footprint, as the sizing check below shows.
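
As a rough sizing check, a minimal sketch assuming the usual rule of thumb (params × bits/8 for weights plus ~20% headroom for KV cache and activations; both factors are assumptions, not MLX constants):

```python
def fits_in_budget(params_b: float, bits: int, unified_gb: int) -> bool:
    """Rough check: does a quantized model fit the usable unified-memory budget?"""
    weights_gb = params_b * bits / 8       # 70B at 4-bit ~= 35 GB of weights
    needed_gb = weights_gb * 1.2           # ~20% headroom for KV cache/activations (assumption)
    usable_gb = unified_gb * 0.70          # macOS keeps ~25-30% for the system
    return needed_gb <= usable_gb

print(fits_in_budget(70, 4, 64))   # True: 70B Q4 squeezes into a 64 GB Mac
print(fits_in_budget(70, 8, 64))   # False: 70B Q8 needs 96+ GB
```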

#2

Other GPU-using apps holding unified memory

Diagnose

Browsers with hardware acceleration, Photoshop, Final Cut, and even some Electron apps reserve significant unified memory. Check Activity Monitor → Memory tab, or run the sketch below.
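
To spot the hogs from a terminal instead of Activity Monitor, a minimal sketch using the third-party psutil package (`pip install psutil`):

```python
import psutil  # third-party: pip install psutil

# Top unified-memory consumers, largest resident set first
procs = [
    (p.info["memory_info"].rss, p.info["name"])
    for p in psutil.process_iter(["name", "memory_info"])
    if p.info["memory_info"] is not None
]
for rss, name in sorted(procs, reverse=True)[:10]:
    print(f"{rss / 2**30:6.2f} GB  {name}")
```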

Fix

Close memory-heavy apps before loading the model. On headless servers, make sure no idle GPU consumers are running. Disable browser hardware acceleration if you don't need it.

#3

Wired memory limit too low (sysctl iogpu.wired_limit_mb)

Diagnose

The default macOS wired-memory limit is conservative, and MLX needs to wire significant memory for model weights. For large models, the default limit is too low.

Fix

Bump the limit: `sudo sysctl iogpu.wired_limit_mb=49152` (48 GB on a 64 GB Mac). The setting lasts only until reboot; add it to /etc/sysctl.conf to make it permanent.
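
To script the check before loading a model, you can shell out to sysctl; a minimal sketch:

```python
import subprocess

def current_wired_limit_mb() -> int:
    """Read the current iogpu.wired_limit_mb ceiling via sysctl."""
    out = subprocess.run(
        ["sysctl", "-n", "iogpu.wired_limit_mb"],
        capture_output=True, text=True, check=True,
    )
    return int(out.stdout.strip())

print(f"current wired-memory ceiling: {current_wired_limit_mb()} MB")
```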

#4

Loading FP16 instead of quantized weights

Diagnose

A model loaded with `mlx_lm.load(model_path)` uses whatever precision the repo ships. Some MLX repos ship FP16 weights, which need roughly four times the memory of a 4-bit quant.

Fix

Download a pre-quantized MLX repo (search Hugging Face for 'mlx-community/<model>-4bit'), or convert once with `python -m mlx_lm.convert --hf-path <model> -q`, which writes quantized weights to disk.
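
A minimal load-and-generate sketch with a pre-quantized repo (the repo name is one example of the mlx-community 4-bit pattern; assumes a recent mlx-lm):

```python
from mlx_lm import load, generate

# Pre-quantized 4-bit weights come down from Hugging Face as-is;
# nothing gets expanded to FP16 in unified memory.
model, tokenizer = load("mlx-community/Mistral-7B-Instruct-v0.3-4bit")

text = generate(model, tokenizer,
                prompt="Explain unified memory in one sentence.",
                max_tokens=64)
print(text)
```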

#5

Mac is genuinely too small for the workload

Diagnose

16 GB unified-memory Macs cap out at 7B Q4 inference; 24 GB at 13B; 32 GB at 32B; 64 GB at 70B Q4. If you're trying to run beyond your tier, no setting will save you.

Fix

Run a smaller model, or upgrade. For 70B+ workloads, an M-series Mac Studio with 96+ GB of unified memory is the path.
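
The tiers above as a lookup, a sketch that encodes this guide's rule of thumb rather than any hard MLX limit:

```python
# Largest comfortable Q4 model per unified-memory tier (rule of thumb from above)
MAX_MODEL_BY_TIER_GB = {16: "7B Q4", 24: "13B Q4", 32: "32B Q4", 64: "70B Q4"}

def max_model(unified_gb: int) -> str:
    tiers = [t for t in MAX_MODEL_BY_TIER_GB if t <= unified_gb]
    return MAX_MODEL_BY_TIER_GB[max(tiers)] if tiers else "nothing practical"

print(max_model(48))  # "32B Q4": a 48 GB Mac falls back to the 32 GB tier
```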

#6

macOS aggressively swapping to disk under memory pressure, tanking inference speed

Diagnose

MLX loads the model but generation slows to 1-3 tok/s after the first few tokens. Activity Monitor → Memory tab shows high swap usage and yellow/red memory pressure. macOS is paging unified memory to the internal SSD — a 70B Q4 model squeezes into 64 GB but macOS reserves ~20 GB for system, creating swap pressure.

Fix

Close all non-AI apps before inference. Disable hardware acceleration in browsers (Chrome: Settings → System → 'Use graphics acceleration when available' OFF). If production workloads require 70B+ inference on Apple Silicon, you need a machine with 96+ GB unified memory. For the 64 GB Mac Studio/MacBook Pro: cap models at 32B Q5 for comfortable headroom, or 70B Q4 at short context (2K-4K).
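
To confirm swap is the culprit while tokens stream, poll swap usage during generation; a sketch using the third-party psutil package:

```python
import time
import psutil  # third-party: pip install psutil

# Climbing swap alongside falling tok/s means weights or KV cache
# are being paged out to the SSD.
for _ in range(10):
    swap = psutil.swap_memory()
    vm = psutil.virtual_memory()
    print(f"swap used: {swap.used / 2**30:5.1f} GB   "
          f"available RAM: {vm.available / 2**30:5.1f} GB")
    time.sleep(2)
```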

#7

iogpu.wired_limit_mb default too conservative, creating artificial OOM

Diagnose

macOS has a default wired-memory ceiling that's well below total unified memory. MLX needs to wire (pin in physical memory) model weights for GPU access, but macOS refuses to allocate more than the default ceiling — usually ~60-70% of unified memory. The OOM error fires before the Mac is actually full.

Fix

Bump the wired limit: `sudo sysctl iogpu.wired_limit_mb=49152` for a 64 GB Mac (sets the ceiling to 48 GB). The setting lasts only until reboot; for a permanent fix, add `iogpu.wired_limit_mb=49152` to `/etc/sysctl.conf`. Set the value to roughly 70-75% of unified memory, lower on smaller machines that need more system headroom: 24 GB Mac → 16 GB (16384), 32 GB → 24 GB (24576), 64 GB → 48 GB (49152), 96 GB → 72 GB (73728).
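
The mapping above as a helper you can adapt (the fractions are this guide's recommendation, not Apple-documented values):

```python
def recommended_wired_limit_mb(unified_gb: int, fraction: float = 0.75) -> int:
    """Suggested iogpu.wired_limit_mb: a fraction of total unified memory.
    Drop the fraction on smaller machines that need more system headroom."""
    return int(unified_gb * 1024 * fraction)

print(recommended_wired_limit_mb(64))        # 49152, the 48 GB ceiling above
print(recommended_wired_limit_mb(24, 2/3))   # 16384, matching the 24 GB row
```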

If this keeps happening — the next decision is hardware

MLX OOM on a Mac usually means the unified-memory tier was undersized for the workload. The buying decision then becomes which tier the next machine needs.

Frequently asked questions

How much usable memory does macOS leave for MLX?

Roughly 70-75% of unified memory at sustained load. On 64 GB: ~45 GB usable. On 32 GB: ~22 GB. On 16 GB: ~11 GB. macOS doesn't expose a hard line; it manages memory pressure dynamically.

Should I use MLX or llama.cpp on Apple Silicon?

MLX for Apple-native fine-tuning and research workflows; llama.cpp for general inference and broader model coverage (every GGUF on Hugging Face just works). MLX is faster on operations tuned for Apple Silicon; llama.cpp is more portable.

Why did my Mac freeze instead of erroring cleanly?

macOS handles memory pressure by aggressively swapping + sometimes killing apps. If the AI process consumes too much wired memory, the OS becomes unresponsive before the process gets a clean OOM error. Bump iogpu.wired_limit_mb proactively.

How do I check my current wired memory limit?

`sysctl iogpu.wired_limit_mb` prints the current ceiling in MB. If the output shows a very high number (999999 or similar), the limit is effectively disabled. If it shows a number well below your expected budget (~60-70% of total unified memory), you're being artificially capped. Bump it as described.

Does Apple Silicon handle swap differently than a dedicated GPU?

Yes — and it's both a blessing and a curse. Unified memory means the GPU and CPU share the same physical RAM, so there's no explicit 'transfer to GPU' step. But when memory pressure hits, macOS starts swapping the entire unified memory pool to disk. This is catastrophic for inference because model weights + KV cache get paged together. A dedicated GPU with its own VRAM can page layers via UVM (PCIe to system RAM), which is slow but at least the OS isn't also paging the display compositor and your browser tabs through the same bottleneck.

What can I safely close to free unified memory before running MLX?

Biggest consumers on macOS: Chrome/Safari with many tabs (2-4 GB+ with hardware acceleration), Xcode + iOS Simulator (3-6 GB), Docker Desktop (2-4 GB even idle), Final Cut Pro / Logic Pro, Adobe apps, and Slack/Discord/VS Code (1-2 GB each). Restart the machine for a clean slate — macOS accumulates wired memory from long-running processes that `purge` can't always reclaim. A fresh boot on a 64 GB machine typically leaves 50-55 GB free.


When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this machine doesn't have enough unified memory for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your hardware genuinely can't fit the model you need, it's upgrade time.