How to run local AI on iPhone (May 2026) — what actually works
iPhone 15 Pro / 16 Pro local-LLM operator's guide. Apple Intelligence vs MLC Chat vs Apollo for LLM vs MLX Swift apps. Realistic 3B/7B sizing, ANE vs CPU vs GPU, battery and thermal limits, and an honest list of what doesn't work yet.
The honest floor: what an iPhone is in May 2026
An iPhone 15 Pro or 16 Pro is, hardware-wise, a small Apple Silicon machine: 8 GB unified memory, a 38 TOPS Neural Engine on the A18 Pro (35 TOPS on the A17 Pro), a 6-core GPU, and memory bandwidth in the 50-60 GB/s range. That is enough to run a 3B-class LLM at INT4 in single-stream chat at usable speed, and a 7B-class LLM at Q4 with very tight context windows. It is not enough for frontier reasoning, long-context agents, or anything that requires sustained load for more than a handful of minutes — the device thermally throttles long before it runs out of memory.
Older and non-Pro iPhones (iPhone 14 Pro and earlier, plus the non-Pro iPhone 15 / 15 Plus) cap at 6 GB RAM. They run 1B-3B models at Q4 in short sessions, but the practical ceiling drops fast. If your phone isn't a 15 Pro / 15 Pro Max / 16 / 16 Plus / 16 Pro / 16 Pro Max, assume the 1-3B tier and adjust expectations.
Apple Intelligence vs third-party local LLMs
The iOS 18+ Apple Intelligence stack is a hybrid system: on-device 3B foundation model for short, simple requests; Private Cloud Compute servers for harder ones. From the operator's point of view that means Apple Intelligence is not what r/LocalLLaMA means by “local”: you cannot pin it to on-device only, you cannot bring your own weights, and you cannot inspect what does or does not get sent to Apple. It is useful for in-OS rewriting, summarization in Mail, and Siri quality, but it is not the thing this guide is about.
The third-party path — apps and frameworks that load open-weight models onto the device and run inference fully locally — is what this page covers. Five families in production today:
- MLC Chat (App Store) — community runtime built on MLC LLM. Ships pre-quantized 1-7B models. The closest thing iOS has to Ollama: download a model in-app, chat with it, no Mac required. See MLC LLM operational review.
- Apollo for LLM (App Store) — polished consumer app that wraps llama.cpp + GGUF models. Easier UX than MLC Chat, slightly narrower model selection. Both are fine starting points.
- MLX Swift apps you build yourself — Apple's first-party path. Same model checkpoints as desktop MLX-LM. Requires Xcode and Apple Developer membership; the right path if you're shipping an app, not just trying inference. See the iPhone on-device AI stack for the production-grade build recipe.
- llama.cpp inside iSH / a-Shell — works, slow, mostly a curiosity. Useful only if you want to tinker with llama.cpp builds on iOS without Xcode.
- Web-based runtimes (WebLLM in Safari) — Safari supports WebGPU (behind a feature flag on iOS 18, default-on in later releases). WebLLM-style demos run in the browser. Slower than native MLC Chat, but no install required.
Realistic model sizes by phone tier
Based on the 8 GB RAM ceiling on Pro models and ~6 GB on standard models, the operator-grade sizing table (single-stream, short context):
- iPhone 16 Pro / 16 Pro Max (A18 Pro, 8 GB): Llama 3.2 3B at INT4 (~1.9 GB on disk) is comfortable. Phi-3.5 Mini at Q4 (~2.3 GB) fits with smaller context. 7B at Q4 (~4 GB weights) technically loads but leaves almost nothing for the OS — expect crashes if anything else is running.
- iPhone 15 Pro / 15 Pro Max (A17 Pro, 8 GB): same envelope. Sustained throughput is a hair lower than A18 Pro because ANE bandwidth and clock are both incrementally lower; the practical difference is <10%.
- iPhone 16 / 16 Plus (A18, 8 GB): same RAM, smaller GPU, no Pro-tier sustained-thermal headroom. Treat it as 3B-class.
- iPhone 14 Pro / 15 (A16, 6 GB): 1B-3B class. Llama 3.2 1B or SmolLM2 1.7B at Q4 are the realistic targets.
Numbers above are sizing, not throughput. Decode tok/s varies considerably by runtime, quant format, and thermal state. We deliberately do not publish single-figure tok/s estimates here; see /benchmarks/mobile-edge for measured values where they exist and the gap report for what we're still missing. If you want a measurement on a specific iPhone tier, file it at /benchmarks/request.
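If you want to sanity-check the sizing table yourself, it is a two-line calculation. A back-of-envelope sketch in Swift (the parameter counts, bits-per-weight, and overhead factor are rough assumptions, not measured file sizes):

```swift
// Rough weight footprint: params × bits-per-weight ÷ 8, times a small
// overhead factor for embeddings, norms, and runtime buffers.
func approxWeightGB(params: Double, bitsPerWeight: Double, overhead: Double = 1.1) -> Double {
    params * bitsPerWeight / 8.0 * overhead / 1_000_000_000
}

let threeB_int4 = approxWeightGB(params: 3.2e9, bitsPerWeight: 4.5)  // ≈ 2.0 GB
let sevenB_q4   = approxWeightGB(params: 7.0e9, bitsPerWeight: 4.5)  // ≈ 4.3 GB

// On an 8 GB phone, iOS and resident apps typically hold 2-3 GB before you
// load anything, which is why 3B is comfortable and 7B is on the edge.
```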
ANE vs CPU vs GPU on iPhone — what actually runs where
The three compute paths on an iPhone, with the honest 2026 status:
- Neural Engine (ANE): 38 TOPS on A18 Pro. Apple uses it for Apple Intelligence and for Core ML operators. For third-party LLMs, the ANE is not directly programmable the way CUDA is — your model has to ship as a Core ML package, and then Core ML decides which ops run on ANE vs GPU vs CPU. The honest read: ANE accelerates certain transformer ops well, but full-model ANE residency for decoder-only LLMs is rare today. Most third-party iOS LLMs run on GPU + CPU and treat ANE as a bonus.
- GPU (Metal): where MLX Swift, MLC LLM, and llama.cpp's Metal backend actually run. This is the workhorse path on iPhone in 2026.
- CPU: fallback. llama.cpp will use the CPU if Metal isn't set up; throughput drops by 3-5×.
If a vendor or app claims “runs on the Neural Engine”, assume that means “Core ML schedules some ops on ANE”, not “the whole transformer lives on ANE”. The latter is rare and usually requires Apple-published model variants.
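For Core ML packages specifically, the only control you get is a preference. A minimal Swift sketch of what "request the ANE" actually looks like; the model name is a placeholder, and Core ML still makes the final op-by-op placement call:

```swift
import CoreML

// You state a preference; Core ML decides op-by-op where things actually run.
// "MyLLM.mlmodelc" is a placeholder for whatever compiled package you ship.
func loadPreferringANE() throws -> MLModel {
    let config = MLModelConfiguration()
    config.computeUnits = .cpuAndNeuralEngine   // or .all, .cpuAndGPU, .cpuOnly
    guard let url = Bundle.main.url(forResource: "MyLLM", withExtension: "mlmodelc") else {
        throw CocoaError(.fileNoSuchFile)
    }
    return try MLModel(contentsOf: url, configuration: config)
}
// Ops the ANE can't run fall back to the other allowed units silently;
// Xcode's Core ML performance report shows the expected placement.
```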
Battery cost and thermal throttling
The two limits you'll hit before you hit memory:
- Battery: editorial estimate, 5-10% drained per 10-minute active chat session on iPhone 16 Pro running a 3B INT4 model via MLC Chat. A 30-minute session is real-world possible; a 90-minute session is a phone-dies-by-lunch problem.
- Thermals: phones throttle aggressively. Sustained decode at full clock for 5-10 minutes is enough to trigger a 25-40% throughput drop in most cases. Apple does this silently — you won't see a warning, just slower tokens.
The deployment pattern that works in practice: short bursts (≤2 minutes of active inference) interleaved with cooling time. The deployment pattern that fails: agent loops with continuous tool calls for 30+ minutes.
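iOS won't warn you, but it will answer if you ask. A small Swift sketch of the burst-gating pattern; the pause hook is a stand-in for whatever stop or cancel call your runtime exposes:

```swift
import Foundation

// Gate each inference burst on the current thermal state instead of running
// an open-ended loop. ProcessInfo reports four levels: nominal, fair,
// serious, critical.
func okToStartBurst() -> Bool {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal, .fair:       return true    // fine to decode
    case .serious, .critical:   return false   // back off, let the phone cool
    @unknown default:           return false
    }
}

// React mid-session too; keep a reference to the observer so it stays registered.
let thermalObserver = NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil, queue: .main
) { _ in
    if !okToStartBurst() {
        // Pause or cancel generation here; the exact hook depends on your runtime.
        print("thermal state degraded; pausing decode")
    }
}
```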
What doesn't work on iPhone in 2026
The honest list of things people ask about that don't work today:
- Frontier-reasoning agent loops. Anything that runs continuous tool calls for 10+ minutes will throttle and tank quality.
- Long-context summarization (32k+ tokens). KV cache eats multiple gigabytes before you count the model weights (see the arithmetic sketch after this list). Practical ceiling on 8 GB iPhones is 4-8k tokens of context.
- Multi-modal (vision-language) at usable latency. Smaller VLMs technically run, but image preprocessing on top of decode crushes thermals fast.
- Sustained RAG over a large local corpus. Indexing and retrieval are fine; the throughput problem is decode under sustained load.
- Models above 7B at any quant. 8 GB RAM is the ceiling. 7B Q4 already cuts close to OS pressure.
- FP16 anything. Always Q4-Q5 GGUF, INT4 Core ML, or MLX 4-bit. FP16 for a 3B model would consume 6 GB just for weights.
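The long-context item is just arithmetic. A Swift sketch of the KV-cache math; the layer counts and head dims below are the published Llama-family shapes, and your model's own config is the source of truth:

```swift
// KV cache per token = 2 (K and V) × layers × kvHeads × headDim × bytesPerElement.
func kvCacheGB(layers: Double, kvHeads: Double, headDim: Double,
               bytesPerElement: Double, contextTokens: Double) -> Double {
    2 * layers * kvHeads * headDim * bytesPerElement * contextTokens / 1_000_000_000
}

// Llama 3.2 3B (28 layers, 8 KV heads via GQA, head dim 128), fp16 cache:
let threeB32k = kvCacheGB(layers: 28, kvHeads: 8, headDim: 128,
                          bytesPerElement: 2, contextTokens: 32_768)   // ≈ 3.8 GB
// A 7B-class model without GQA (32 layers, 32 KV heads) is ~4× worse:
let sevenB32k = kvCacheGB(layers: 32, kvHeads: 32, headDim: 128,
                          bytesPerElement: 2, contextTokens: 32_768)   // ≈ 17 GB
// Add weights and OS overhead on top and 32k does not fit in 8 GB; 4-8k does.
```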
Install paths in order of effort
- MLC Chat (5 minutes). App Store → MLC Chat → pick a model from the in-app catalog → wait for download → chat. This is the “will it run on my phone” smoke test.
- Apollo for LLM (5 minutes). Same shape as MLC Chat with a more polished UI and a different (smaller) model selection. Pick whichever you find first; both work.
- WebLLM in Safari (no install). Visit the WebLLM demo page in Safari iOS 18+. Browser caches the model on first run. Slower than native; useful for “is this idea worth a real install” testing.
- MLX Swift app (4-8 hours). Xcode + Apple Developer account + sample MLX Swift LLM app + your chosen quantized model. See the iPhone on-device AI stack for the full step-by-step.
Common failure modes
- OOM crash mid-session. iOS aggressively reclaims RAM from background apps. A 7B Q4 model that “fits” on cold-start can be killed if you switch apps. Stick to 3B-class on 8 GB phones for stability.
- Model won't download in MLC Chat. Cellular data caps + 1-3 GB model files. Switch to Wi-Fi.
- tok/s drops sharply 5 minutes in. Thermal throttle. Let the phone cool; a fan or cool surface helps.
- App Store rejection for an MLX-Swift LLM app. Apple has rejected apps with very large bundled models in the past; ship the model as a download-on-first-launch resource, not a 2 GB app binary (a minimal download sketch follows this list).
- iOS upgrade breaks Core ML model. Quantized Core ML packages built for an older iOS occasionally fail to load on new iOS releases. Re-export from your Mac MLX environment.
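The fix for the oversized-bundle problem is to keep weights out of the binary and fetch them on first launch. A minimal sketch, assuming a hypothetical weights URL and filename, and skipping the progress UI and resume handling a real app needs:

```swift
import Foundation

// Download model weights on first launch instead of bundling them in the app.
// The URL and filename are placeholders.
func ensureModelOnDisk() async throws -> URL {
    let support = try FileManager.default.url(for: .applicationSupportDirectory,
                                              in: .userDomainMask,
                                              appropriateFor: nil, create: true)
    var dest = support.appendingPathComponent("llama-3.2-3b-q4.gguf")

    if !FileManager.default.fileExists(atPath: dest.path) {
        let remote = URL(string: "https://example.com/models/llama-3.2-3b-q4.gguf")!
        let (tmp, _) = try await URLSession.shared.download(from: remote)
        try FileManager.default.moveItem(at: tmp, to: dest)
    }

    // Keep multi-GB weights out of iCloud backups.
    var values = URLResourceValues()
    values.isExcludedFromBackup = true
    try dest.setResourceValues(values)
    return dest
}
```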
Going deeper
- iPhone on-device AI stack — the production-grade MLX Swift build recipe.
- Run local AI on Android — the cross-platform comparison.
- Best mobile AI runtimes — MLC LLM vs ExecuTorch vs Qualcomm AI Hub vs llama.cpp.
- Can phones run local LLMs? — the honest yes/no/depends.
- Mobile / edge benchmark gap report — what we measure, what we don't.
- Will it run? — pick a model + your phone tier, get a verdict.
Next step for iPhone operators
Run the 5-minute MLC Chat smoke test on your phone before investing in the MLX Swift stack.