
MPS fallback to CPU — get Apple Silicon GPU usage back

PyTorch on Apple Silicon quietly ends up on the CPU when an op isn't supported by MPS. Set PYTORCH_ENABLE_MPS_FALLBACK=1 to make the fallback visible, then fix the actual op (cast the dtype, disable flash-attention, lower the batch size).

PyTorch · Apple Silicon (M1-M4) · Metal · ComfyUI on Mac · Hugging Face Transformers
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

Operation not supported by MPS backend (most common)

Diagnose

Performance is 5-20x slower than expected. Set `PYTORCH_ENABLE_MPS_FALLBACK=1` and re-run — you'll see warnings like 'The operator aten::xyz is not currently implemented for the MPS device, falling back to run on the CPU.'

Fix

Either accept the fallback (the env var allows it) or restructure your code to avoid the unsupported op. Common offenders: complex64 ops, certain reduction ops, custom CUDA kernels.
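The flag only works if it reaches the environment before `torch` is imported, so the most reliable route is the shell: `PYTORCH_ENABLE_MPS_FALLBACK=1 python train.py`. A minimal in-process sketch of the same thing — the matmul at the end is just a stand-in for whatever op you suspect:

```python
import os

# The flag is read when torch is imported, so it must be set first.
os.environ["PYTORCH_ENABLE_MPS_FALLBACK"] = "1"

import torch

device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
x = torch.randn(8, 8, device=device)

# With the flag set, an op lacking an MPS kernel runs on CPU and emits a
# UserWarning instead of raising NotImplementedError. Which ops trigger it
# depends on your torch version; this matmul is MPS-native and only a stand-in.
y = x @ x
print(y.device)
```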

#2

PyTorch version too old for current macOS

Diagnose

`python -c 'import torch; print(torch.__version__)'` shows a version below 2.5. Recent macOS releases (14+, 15) need PyTorch 2.5+ for full MPS coverage.

Fix

Upgrade: `pip install --upgrade torch torchvision`. Newer PyTorch ships better MPS op coverage every release.
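A quick sanity check after upgrading — if `is_built()` prints False, the wheel itself lacks MPS support and no amount of configuration will help:

```python
import torch

print(torch.__version__)                  # want 2.5+ on recent macOS
print(torch.backends.mps.is_built())      # was this wheel compiled with MPS support?
print(torch.backends.mps.is_available())  # can PyTorch reach the Metal device right now?
```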

#3

Tensor dtype unsupported on MPS

Diagnose

The error mentions complex64, bfloat16 (in some ops), or a specific quantization dtype. MPS supports float32, float16, and integer dtypes, but coverage is uneven.

Fix

Cast tensors to a supported dtype before the failing op: `tensor.to(torch.float32)`. For models that demand bf16, fall back to CPU for that specific op via PYTORCH_ENABLE_MPS_FALLBACK=1.
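A sketch of the casting pattern. The softmax here is only a placeholder — which ops reject which dtypes varies by PyTorch version, so substitute the op that actually fails for you:

```python
import torch

x = torch.randn(4, 4, dtype=torch.float16, device="mps")

# Cast up just for the problem op, then cast back so downstream code
# keeps the model's working precision.
y = torch.softmax(x.to(torch.float32), dim=-1).to(x.dtype)
print(y.dtype)  # torch.float16
```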

#4

Model architecture using flash-attention (NVIDIA-only)

Diagnose

The error mentions FlashAttention or 'no kernel image' when the model runs on MPS. The model's config tries to load flash-attn-2, which has no Apple Silicon build.

Fix

In Transformers: pass `attn_implementation='eager'` or `attn_implementation='sdpa'` when loading. SDPA on MPS is reasonably fast; eager is the safe fallback.
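A loading sketch with a placeholder model id, assuming a reasonably recent Transformers release (older versions don't accept the `attn_implementation` kwarg):

```python
from transformers import AutoModelForCausalLM

model_id = "your-org/your-model"  # placeholder — use your actual checkpoint

model = AutoModelForCausalLM.from_pretrained(
    model_id,
    attn_implementation="sdpa",   # or "eager" if SDPA hits an unsupported path
    torch_dtype="auto",
)
model.to("mps")
```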

#5

Memory pressure forcing macOS to evict

Diagnose

`top` or Activity Monitor shows yellow/red memory pressure. Sudden 10x slowdown mid-inference — macOS is paging tensors to disk.

Fix

Close memory-heavy apps (browsers with hardware accel, Photoshop, Final Cut). Lower batch size. For 70B+ workloads, more unified memory is the only real answer.
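To check whether PyTorch itself is the process eating unified memory, the `torch.mps` module exposes allocator counters — a small sketch:

```python
import torch

gib = 2**30
print(torch.mps.current_allocated_memory() / gib, "GiB held by live tensors")
print(torch.mps.driver_allocated_memory() / gib, "GiB reserved from Metal")

# Hand cached blocks back between runs if pressure keeps climbing.
torch.mps.empty_cache()
```

PyTorch also honors the PYTORCH_MPS_HIGH_WATERMARK_RATIO env var, which caps how much unified memory the MPS allocator is allowed to claim.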

If this keeps happening — the next decision is hardware

MPS fallback warnings are usually a code-path issue, but if they appear on every workload, your Mac is undersized. The guide below frames the upgrade path.

Frequently asked questions

Is MPS as fast as CUDA?

No, but it's closer than people assume for inference. M4 Max (~546 GB/s) vs RTX 4090 (~1008 GB/s) — the memory-bandwidth gap is real. For LLM decode, an M4 Max reaches ~70-90% of 4090 throughput; for prefill (compute-bound), it's 30-50% slower. For training with PyTorch, MPS lags further behind.

Should I use MLX instead of PyTorch on Mac?

For Apple-native fine-tuning + research, yes — MLX is faster and built for Apple Silicon. For PyTorch-ecosystem code (Hugging Face Transformers, diffusers, anything that imports torch), stick with PyTorch + MPS. The translation cost isn't worth it for inference-only workflows.

Why does my Mac feel slow during inference?

Three usual causes: (1) memory pressure forcing macOS to swap (close apps); (2) MPS falling back to CPU on unsupported ops (set the env var to surface them); (3) thermal throttling on MacBooks under sustained load (use a cooling pad and run on power, not battery).

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: