Capability notes
On-device iPhone AI in 2026 splits into two lanes: **Apple Intelligence** (system-level, first-party feature set) and **third-party local inference** (MLX Swift, llama.cpp iOS, Private LLM apps).
**Apple Intelligence** runs on the Neural Engine + GPU of [Apple A17 Pro](/hardware/apple-a17-pro) and [Apple A18 Pro](/hardware/apple-a18-pro) chips. As of iOS 19 (mid-2026), Apple ships on-device models for notification summarization (~3B parameter class, distilled), writing tools, image generation (Image Playground + Genmoji), and the improved Siri intent system. These models are baked into the OS, updated with system updates, and not user-replaceable. Apple Intelligence requires a device with 8GB+ RAM — iPhone 15 Pro, 15 Pro Max, iPhone 16, 16 Plus, 16 Pro, 16 Pro Max, and all iPhone 17 series.
**Third-party local inference** is where operators get real leverage. [MLX Swift](https://github.com/ml-explore/mlx-swift) — Apple's Metal-accelerated ML framework — enables iOS apps to run transformer models at competitive token rates. [llama.cpp](/tools/llama-cpp) provides iOS bindings. The model size ceiling on an 8GB iPhone is 4-5GB usable for weights after OS overhead: Llama-3.1-8B at Q4 (4.5GB, 15-18 tok/s on A18 Pro, 12-15 on A17 Pro) or Qwen-3-8B at Q4. 1-3B models run at 30-50 tok/s. 7B at Q2 fits but degrades visibly. 13B-class models do not fit at usable quantizations on 8GB devices. On iPhone 16 Pro (A18 Pro, 8GB), 3B Q4 reaches 35-45 tok/s at 4K context.
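Whether a given model and context length fit under that ceiling can be sanity-checked with a back-of-envelope KV-cache estimate. A minimal Swift sketch, assuming an FP16 cache and Llama-3.1-8B-style geometry (32 layers, 8 KV heads via GQA, head dimension 128); these figures are illustrative, real runtimes may quantize the cache, and older non-GQA models need several times more.

```swift
// Rough KV-cache sizing: two tensors (K and V) per layer, each
// [contextLength x kvHeads x headDim] at the cache's element width.
// Non-GQA models scale this by (attentionHeads / kvHeads).
func kvCacheBytes(layers: Int, kvHeads: Int, headDim: Int,
                  contextLength: Int, bytesPerElement: Int = 2) -> Int {
    2 * layers * kvHeads * headDim * contextLength * bytesPerElement
}

// Illustrative Llama-3.1-8B-class geometry, FP16 cache, 4K context:
// roughly 0.5 GB on top of ~4.5 GB of Q4 weights.
let bytes = kvCacheBytes(layers: 32, kvHeads: 8, headDim: 128, contextLength: 4096)
print(String(format: "KV cache ≈ %.2f GB", Double(bytes) / 1_073_741_824))
```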
The privacy advantage is architecturally enforced for local-only apps: zero network calls during inference, no telemetry, all state on-device. For regulated industries (healthcare, legal, defense), this is the primary argument for on-device inference over cloud APIs. What does not work: sustained multi-turn agentic loops (thermal throttle within 8-12 minutes), models above ~10B, concurrent model serving, on-device fine-tuning (LoRA on 3B requires >8GB), and full-resolution vision-language models.
If you just want to try this
Lowest-friction path to a working setup.
Install [LM Studio iOS](https://apps.apple.com/us/app/lm-studio/id…) from the App Store on an iPhone 15 Pro or newer. The app handles model download, quantization selection, and Metal-accelerated inference without a terminal.
Step 1: Open the App Store, search "LM Studio," and install. The app is free.
Step 2: Browse the in-app model catalog. For a responsive chat experience, select **Qwen-3-8B (Q4_K_M)** at 4.5GB — this fits on 8GB devices and runs at 15-18 tok/s on A17 Pro, 18-22 tok/s on A18 Pro. Q4_K_M is the practical quality floor for coherent multi-turn conversation.
Step 3: Download the model over Wi-Fi (4.5GB). Cellular download works but carriers may throttle beyond 5GB.
Step 4: Switch to Airplane Mode to verify offline operation. Tap "Load Model," wait 8-12 seconds for Metal shader compilation on first load, then start chatting.
Alternate path: **Private LLM** from the App Store, which supports both [llama.cpp](/tools/llama-cpp) and MLX backends and is equally offline-capable. Both apps are functionally equivalent for single-model chat.
What you get: fully offline AI chat on your phone. No API keys, no cloud, no server logs. Single-turn Q&A and writing assistance feel indistinguishable from cloud chatbots at 15-25 tok/s. Multi-turn reasoning with tool use and context beyond 4K are where the gap vs server models opens.
Skip if: you need 70B-class models, concurrent multi-user serving, or your phone has <8GB RAM (pre-iPhone 15 Pro devices lack the Neural Engine capabilities and memory floor).
For production deployment
Operator-grade recommendation.
On-device iOS AI strategy operates within three hard constraints: App Store policy, device memory ceiling, and iOS background-task behavior.
**Model shipping vs download.** Shipping a [GGUF](/tools/llama-cpp) or MLX model inside the IPA guarantees offline-first availability. The cost: a 4GB Q4 model pushes the IPA past Apple's 200MB cellular download limit, triggering a mandatory Wi-Fi prompt. On-Demand Resources exemption requires App Review justification. The safer path: ship a 100MB placeholder model plus in-app download flow. [Ollama's iOS app](/tools/ollama) uses this pattern.
**App Store policy.** Apple permits local ML inference under the standard developer agreement. Guideline 2.5.2 (Developer Q&A 2025) clarifies that GGUF and MLX weight files are data, not executable code, and are permitted as in-app downloadable content. BGAppRefreshTask gives ~30 seconds of background execution — insufficient for model loading plus inference. For background inference, a VoIP push or processing entitlement is required, both constrained by App Review.
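To make the ~30-second budget concrete, here is a minimal BGAppRefreshTask registration; the task identifier is hypothetical. The expiration handler is the key constraint: it fires long before a multi-gigabyte model finishes loading, which is why background inference needs a separate entitlement.

```swift
import BackgroundTasks

// Hypothetical identifier; it must also be listed in Info.plist under
// BGTaskSchedulerPermittedIdentifiers, and registration has to happen
// before the app finishes launching.
let refreshID = "com.example.localllm.refresh"

func registerRefreshTask() {
    BGTaskScheduler.shared.register(forTaskWithIdentifier: refreshID, using: nil) { task in
        guard let refresh = task as? BGAppRefreshTask else { return }

        // iOS grants roughly 30 seconds: enough to check for a new pinned
        // model version, not to load multi-gigabyte weights and decode.
        refresh.expirationHandler = {
            // Cancel any in-flight work before the system suspends the process.
        }

        // ... lightweight housekeeping only ...
        refresh.setTaskCompleted(success: true)
    }
}

func scheduleNextRefresh() {
    let request = BGAppRefreshTaskRequest(identifier: refreshID)
    request.earliestBeginDate = Date(timeIntervalSinceNow: 15 * 60)
    try? BGTaskScheduler.shared.submit(request)
}
```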
**Update cadence.** Model updates inside the app bundle require App Review (24-72 hours). In-app downloaded models update server-side instantly. For managed fleets, use MDM-enforced model versioning via Managed App Configuration — pin model versions across devices without App Review. Apple's MDM framework (com.apple.configuration.managed) supports key-value configuration that your app reads at launch.
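Reading the MDM-pushed configuration is a one-liner against UserDefaults under the reserved com.apple.configuration.managed key; the keys inside the dictionary (pinnedModelVersion, modelURL) are whatever your configuration profile defines, so treat the names here as hypothetical.

```swift
import Foundation

// Managed App Configuration arrives under this reserved key when the app
// is deployed through MDM (Jamf, Kandji, Intune, ...).
let managedConfig = UserDefaults.standard.dictionary(forKey: "com.apple.configuration.managed")

// Example keys defined by your own configuration profile (hypothetical names).
let pinnedVersion = managedConfig?["pinnedModelVersion"] as? String ?? "default"
let modelURL = (managedConfig?["modelURL"] as? String).flatMap(URL.init(string:))

// Re-read on change: MDM can push updated configuration while the app is running.
NotificationCenter.default.addObserver(
    forName: UserDefaults.didChangeNotification,
    object: nil, queue: .main
) { _ in
    // Refresh the pinned model version and trigger a download if it changed.
}
```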
**MDM deployment.** Via Jamf, Kandji, or Microsoft Intune, deploy your private LLM app. Enforce per-app VPN for model download from internal infrastructure. Set a Managed App Configuration key to pin the approved model version. The app downloads the pinned model, verifies its checksum, and enters offline-only mode. Audit trails exist in MDM logs (deployment confirmation, config push receipt). No inference data leaves the device.
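A sketch of the download-then-verify step, assuming the MDM config supplies the model URL and an expected SHA-256 digest (hypothetical parameter names); hashing uses CryptoKit and streams the file in chunks so a 4-5GB weight file never has to fit in memory.

```swift
import CryptoKit
import Foundation

enum ModelError: Error { case checksumMismatch }

// Hash a downloaded weight file in 1 MB chunks.
func sha256Hex(of fileURL: URL) throws -> String {
    let handle = try FileHandle(forReadingFrom: fileURL)
    defer { try? handle.close() }
    var hasher = SHA256()
    while let chunk = try handle.read(upToCount: 1_048_576), !chunk.isEmpty {
        hasher.update(data: chunk)
    }
    return hasher.finalize().map { String(format: "%02x", $0) }.joined()
}

// Hypothetical usage: download, verify against the MDM-pinned digest,
// then hand the file off to the inference engine.
func fetchAndVerifyModel(from remote: URL, expectedSHA256: String) async throws -> URL {
    let (tempURL, _) = try await URLSession.shared.download(from: remote)
    guard try sha256Hex(of: tempURL) == expectedSHA256.lowercased() else {
        throw ModelError.checksumMismatch   // reject tampered or truncated weights
    }
    return tempURL
}
```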
**When on-device wins.** Field use cases with intermittent/no connectivity (field inspections, secure facilities, aviation, maritime). Regulated environments where cloud processing is banned (HIPAA, attorney-client privilege, classified settings). Latency-critical UI interactions under 200ms round-trip.
**When on-device loses.** Workloads requiring >10B parameter models. Multi-user concurrent serving. Real-time vector database indexing. Training or fine-tuning. Full-resolution vision-language tasks.
What breaks
Failure modes operators see in the wild.
- **Battery drain on sustained inference.** A 3B Q4 model consumes 15-20% battery per hour; 7B Q4 pushes that to 22-28%. The GPU is always engaged for attention layers — the Neural Engine alone cannot handle the full transformer stack. Symptom: phone drops from 80% to 30% in under two hours. Mitigation: batch inference into short, bounded bursts and design the UX around single-round completion. Monitor thermal state via ProcessInfo.processInfo.thermalState and throttle when it reaches .serious or .critical (see the monitoring sketch after this list).
- **Thermal throttling.** iPhones lack active cooling. After 8-12 minutes of continuous 7B inference, the A18 Pro reaches ~95°C junction temperature and throttles GPU from 1.4GHz to ~900MHz. Token rate drops 35-50%. External glass reaches 44-48°C. Symptom: tok/s drops from 18 to 9 across a long conversation. Mitigation: cap context at 4K. Defer heavy reasoning to a server when connectivity permits.
- **Model size ceiling on 8GB devices.** After iOS reserves ~2.5GB, 5.5GB remains for your app. A 7B Q4 at 4K uses ~4.5GB for weights + 0.8-1.5GB for KV cache. At 8K context, KV cache exceeds available memory. Symptom: Metal buffer allocation error or <1 tok/s. Mitigation: use 3B-4B models for 8K+ context. 7B at 8K requires 12GB+ RAM — no iPhone ships with this.
- **iOS background task restrictions.** On app switch, iOS suspends the process after ~5 seconds; BGAppRefreshTask grants only ~30 seconds. Symptom: user switches apps mid-inference and returns to find the model unloaded and the conversation reset. Mitigation: serialize model state and KV cache to disk when the app enters the background, and display a resume spinner on foreground return (a persistence sketch follows this list).
- **App Store policy risk.** Apple permits open-weight model inference but restricts NSFW-capable models and models trained on copyrighted data in specific domains (music generation, voice cloning). Symptom: App Review rejection citing content policies when your app loads uncensored models. Mitigation: ship refusal guardrails in the app layer, document model provenance and training data compliance in App Review notes.
- **First-launch Metal shader compilation.** First model load after install compiles GPU shaders: 8-15 seconds on A17 Pro, 6-10 seconds on A18 Pro. Subsequent loads use cached shaders (<1 second). Symptom: 12-second unresponsive spinner on first use. Mitigation: pre-warm shader compilation on app first-launch in the background before the user enters chat. Display a progress indicator during compilation.
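The thermal-monitoring mitigation above boils down to observing ProcessInfo's thermal state and cutting work as it degrades. A minimal sketch; the per-state token budgets are illustrative placeholders, not measured values.

```swift
import Foundation

// Observe thermal pressure and shrink inference work as the device heats up.
final class ThermalGovernor {
    private var observer: NSObjectProtocol?

    // Illustrative generation budgets per thermal state.
    var maxNewTokens: Int {
        switch ProcessInfo.processInfo.thermalState {
        case .nominal:  return 1024
        case .fair:     return 512
        case .serious:  return 128   // finish the current turn, then pause
        case .critical: return 0     // refuse new generations until cooled
        @unknown default: return 128
        }
    }

    func startObserving(onChange: @escaping () -> Void) {
        observer = NotificationCenter.default.addObserver(
            forName: ProcessInfo.thermalStateDidChangeNotification,
            object: nil, queue: .main
        ) { _ in onChange() }
    }
}
```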
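For the suspension issue, the usual pattern is to snapshot whatever lets you resume cheaply (the transcript, plus the KV cache if your runtime can export it) the moment the app leaves the foreground. A sketch using the UIKit lifecycle notification; SessionSnapshot and the save/restore hooks are hypothetical names standing in for your inference engine's own API.

```swift
import UIKit

// Hypothetical snapshot of resumable state: the transcript is cheap to keep;
// the KV cache is only worth persisting if the runtime can serialize it.
struct SessionSnapshot: Codable {
    var transcript: [String]
    var kvCachePath: String?   // path to an engine-exported cache file, if any
}

final class SessionPersistence {
    private let url = FileManager.default.urls(for: .documentDirectory,
                                               in: .userDomainMask)[0]
        .appendingPathComponent("session.json")

    func install(snapshotProvider: @escaping () -> SessionSnapshot) {
        let destination = url
        NotificationCenter.default.addObserver(
            forName: UIApplication.didEnterBackgroundNotification,
            object: nil, queue: .main
        ) { _ in
            // iOS may suspend the process within seconds: write immediately.
            if let data = try? JSONEncoder().encode(snapshotProvider()) {
                try? data.write(to: destination, options: .atomic)
            }
        }
    }

    func restore() -> SessionSnapshot? {
        guard let data = try? Data(contentsOf: url) else { return nil }
        return try? JSONDecoder().decode(SessionSnapshot.self, from: data)
    }
}
```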
Hardware guidance
**Hobbyist: iPhone 15 Pro (8GB, A17 Pro)**
Entry point. Runs 1B-3B at 25-35 tok/s, 7B Q4 at 12-15 tok/s with 2K context. Sufficient for personal offline chat, writing assistance, basic summarization. One device, no fleet.
**Hobbyist: iPhone 16 (8GB, A18)**
Same 8GB ceiling, 15-20% faster token rates due to A18 architectural improvements. 3B Q4 at 35-45 tok/s, 7B Q4 at 15-18 tok/s. Better sustained thermals than A17 Pro (~10 min longer before throttle). The non-Pro value pick.
**SMB: iPhone 16 Pro fleet (8GB, A18 Pro)**
Practical ceiling for single-user on-device AI. 3B Q4 at 40-50 tok/s, 7B Q4 at 18-22 tok/s, 8K context on 3B. Deploy via MDM with Managed App Configuration for model version pinning. Ten iPhone 16 Pro devices running private LLM apps = field-team AI access with zero server infrastructure and zero data egress. Cost: ~$999/device + MDM licensing ($4-8/device/month via Jamf Pro or Kandji). Flat per-device cost model, no inference-per-query charges.
**Enterprise: iPhone fleet + private app + MDM**
Build a private iOS app wrapping [MLX Swift](https://github.com/ml-explore/mlx-swift) with your fine-tuned model. Distribute via Apple Business Manager with MDM-enforced configuration. All inference data stays on-device; audit trail via MDM logs. For 100+ device fleets, per-device savings vs cloud API at $0.01-0.03/query break even in 3-6 months. Plan 1-2 iOS engineers full-time for a production-quality private LLM app.
**Frontier: Not applicable**
iOS devices cannot serve multi-user concurrent inference, cannot train or fine-tune, and are architecturally unsuited for >10B parameter models. The iPhone is the inference edge node — it handles user-facing inference; heavy lifting runs on [NVIDIA L40S](/hardware/nvidia-l40s), [RTX 5090](/hardware/rtx-5090), or cloud infrastructure. The A19 Pro (iPhone 17 Pro, late 2026) is rumored at 12GB RAM, which would unlock 8B-10B Q4 at 8K context — a meaningful capability jump. Until then, 8GB is the ceiling.
Runtime guidance
Decision tree for iOS on-device inference:
**Swift-native app targeting Apple Intelligence integration → MLX Swift**
[MLX Swift](https://github.com/ml-explore/mlx-swift) provides Metal-accelerated inference with unified memory management across CPU, GPU, and Neural Engine. Best token rate on Apple Silicon, lowest battery drain per token, native Swift API. Supports MLX model format — convert GGUF via mlx-lm convert. Model coverage: Mistral, Llama, Phi, Qwen, Gemma architectures well-supported. Correct choice for consumer iOS apps where UX polish and battery life are priorities.
**Cross-platform app (iOS + Android) or broadest model support → llama.cpp iOS**
[llama.cpp](/tools/llama-cpp) iOS bindings cover 40+ model architectures via Metal acceleration. Token rate is 10-15% lower than MLX Swift due to manual buffer management vs Metal's unified allocator. Battery drain is 5-10% higher per hour. Benefit: GGUF format is the de facto standard — thousands of pre-quantized variants on Hugging Face reduce conversion overhead. Use when model availability or cross-platform code sharing matter more than the last 10% of performance.
**Apple Intelligence-aware (notification summaries, writing tools, Siri) → CoreML + Apple Intelligence APIs**
Apple Intelligence runs on [CoreML](https://developer.apple.com/documentation/coreml) models deployed with iOS. Not user-replaceable, not open-weight. Interact through App Intents and Writing Tools API. Zero inference management overhead. The model is a black box — you accept Apple's quality and update cadence. Use for enhancing iOS workflows without shipping your own model.
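If the goal is simply to surface your app's on-device summarizer to Siri and Shortcuts, App Intents is the integration point. A minimal sketch; SummarizeTextIntent and the summarize(_:) stub are hypothetical stand-ins for your own code.

```swift
import AppIntents

// Placeholder for the app's local inference call.
func summarize(_ text: String) async throws -> String {
    // ... run the on-device model here ...
    return String(text.prefix(200))
}

// Hypothetical intent exposing the on-device summarizer to Siri / Shortcuts.
struct SummarizeTextIntent: AppIntent {
    static var title: LocalizedStringResource = "Summarize Text On-Device"

    @Parameter(title: "Text")
    var text: String

    func perform() async throws -> some IntentResult & ReturnsValue<String> {
        let summary = try await summarize(text)
        return .result(value: summary)
    }
}
```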
**Comparison:**
- Token rate (7B Q4, [A18 Pro](/hardware/apple-a18-pro)): MLX Swift 20-22 tok/s, llama.cpp iOS 17-19 tok/s, CoreML N/A
- Model format: MLX (.safetensors converted), GGUF, CoreML .mlpackage
- Battery efficiency: MLX Swift best, llama.cpp iOS good, CoreML excellent
- Model coverage: MLX Swift (Mistral/Llama/Phi/Qwen/Gemma), llama.cpp iOS (40+ architectures), CoreML (Apple-curated)
- First-launch compile: MLX Swift 6-12s, llama.cpp iOS 8-15s, CoreML pre-compiled
**Hybrid approach:** MLX Swift for primary chat surface (best UX + battery), llama.cpp as fallback for unsupported architectures, CoreML for system-level integrations (Siri shortcuts, App Intents).
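One way to keep the hybrid approach tidy is to hide the runtimes behind a single protocol and pick a backend per model at load time. A structural sketch only; the protocol and type names are hypothetical, and the real MLX Swift or llama.cpp calls would live inside the conforming types.

```swift
import Foundation

// Hypothetical abstraction over on-device runtimes; conforming types wrap
// MLX Swift, the llama.cpp bindings, or an App Intents / CoreML bridge.
protocol InferenceBackend {
    func load(modelAt url: URL) async throws
    func generate(prompt: String, maxTokens: Int) -> AsyncThrowingStream<String, Error>
}

enum ModelFormat { case mlx, gguf }

// Route each model to the runtime that supports it: MLX preferred for
// battery and token rate, llama.cpp as the fallback for exotic architectures.
func makeBackend(for format: ModelFormat,
                 mlx: @autoclosure () -> InferenceBackend,
                 llamaCpp: @autoclosure () -> InferenceBackend) -> InferenceBackend {
    switch format {
    case .mlx:  return mlx()
    case .gguf: return llamaCpp()
    }
}
```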