iPhone AI

On-device AI on iPhone. Apple Intelligence (A17 Pro and later), MLC-LLM iOS apps, third-party MLX-on-iOS deployment.

Capability notes

On-device iPhone AI in 2026 splits into two lanes: **Apple Intelligence** (system-level, first-party feature set) and **third-party local inference** (MLX Swift, llama.cpp iOS, Private LLM apps).

**Apple Intelligence** runs on the Neural Engine + GPU of [Apple A17 Pro](/hardware/apple-a17-pro) and [Apple A18 Pro](/hardware/apple-a18-pro) chips. As of iOS 19 (mid-2026), Apple ships on-device models for notification summarization (~3B parameter class, distilled), writing tools, image generation (Image Playground + Genmoji), and the improved Siri intent system. These models are baked into the OS, updated with system updates, and not user-replaceable. Apple Intelligence requires a device with 8GB+ RAM — iPhone 15 Pro, 15 Pro Max, iPhone 16, 16 Plus, 16 Pro, 16 Pro Max, and all iPhone 17 series.

**Third-party local inference** is where operators get real leverage. [MLX Swift](https://github.com/ml-explore/mlx-swift) — Apple's Metal-accelerated ML framework — enables iOS apps to run transformer models at competitive token rates. [llama.cpp](/tools/llama-cpp) provides iOS bindings. The model size ceiling on an 8GB iPhone is 4-5GB usable for weights after OS overhead (rough arithmetic below): Llama-3.1-8B at Q4 (4.5GB, 15-18 tok/s on A18 Pro, 12-15 on A17 Pro) or Qwen-3-8B at Q4. 1-3B models run at 30-50 tok/s. 7B at Q2 fits but degrades visibly. 13B-class models do not fit at usable quantizations on 8GB devices. On iPhone 16 Pro (A18 Pro, 8GB), 3B Q4 reaches 35-45 tok/s at 4K context.

The privacy advantage is architecturally enforced for local-only apps: zero network calls during inference, no telemetry, all state on-device. For regulated industries (healthcare, legal, defense), this is the primary argument for on-device inference over cloud APIs.

What does not work: sustained multi-turn agentic loops (thermal throttle within 8-12 minutes), models above ~10B, concurrent model serving, on-device fine-tuning (LoRA on 3B requires >8GB), and full-resolution vision-language models.
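A back-of-the-envelope fit check makes that ceiling concrete. This is a sketch, not a measurement: the ~2.5GB OS reserve, the 4.5 effective bits/weight for Q4_K_M, and the Llama-3.1-8B-style geometry (32 layers, 8 KV heads, head dim 128) are all illustrative assumptions.

```swift
import Foundation

// Rough fit check for a quantized model on an 8 GB iPhone.
// All constants are illustrative assumptions, not measurements.
struct ModelFitEstimate {
    let paramsBillions: Double   // 8.0 for an 8B model
    let bitsPerWeight: Double    // ~4.5 effective bits for Q4_K_M
    let layers: Int              // transformer depth
    let kvHeads: Int             // KV heads (8 under grouped-query attention)
    let headDim: Int
    let contextTokens: Int

    // Weights: params x bits / 8 -> bytes
    var weightBytes: Double { paramsBillions * 1e9 * bitsPerWeight / 8 }

    // KV cache: 2 (K and V) x layers x kvHeads x headDim x context x 2 bytes (FP16)
    var kvCacheBytes: Double {
        2 * Double(layers * kvHeads * headDim * contextTokens) * 2
    }

    // Assume iOS keeps ~2.5 GB, leaving ~5.5 GB of an 8 GB device for the app.
    func fits(deviceRAMGB: Double = 8, osReserveGB: Double = 2.5) -> Bool {
        weightBytes + kvCacheBytes < (deviceRAMGB - osReserveGB) * 1e9
    }
}

// Llama-3.1-8B-like geometry at Q4: ~4.5 GB weights + ~0.5 GB KV at 4K context.
let eightB = ModelFitEstimate(paramsBillions: 8, bitsPerWeight: 4.5,
                              layers: 32, kvHeads: 8, headDim: 128,
                              contextTokens: 4096)
print(eightB.fits())  // true at 4K; the same model no longer fits at 8K context
```

Doubling context doubles KV-cache bytes, which is why the 8K failure shows up well before the weights themselves become the problem.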

If you just want to try this

Lowest-friction path to a working setup.

Install [LM Studio iOS](https://apps.apple.com/us/app/lm-studio/id…) from the App Store on an iPhone 15 Pro or newer. The app handles model download, quantization selection, and Metal-accelerated inference without a terminal.

  1. Open the App Store, search "LM Studio," and install. The app is free.
  2. Browse the in-app model catalog. For a responsive chat experience, select **Qwen-3-8B (Q4_K_M)** at 4.5GB — this fits on 8GB devices and runs at 15-18 tok/s on A17 Pro, 18-22 tok/s on A18 Pro. Q4_K_M is the practical quality floor for coherent multi-turn conversation.
  3. Download the model over Wi-Fi (4.5GB). Cellular download works, but carriers may throttle beyond 5GB.
  4. Switch to Airplane Mode to verify offline operation. Tap "Load Model," wait 8-12 seconds for Metal shader compilation on first load, then start chatting.

Alternate path: **[Private LLM](/tools/llama-cpp)** from the App Store, which supports [llama.cpp](/tools/llama-cpp) and MLX backends and is equally offline-capable. Both apps are functionally equivalent for single-model chat.

What you get: fully offline AI chat on your phone. No API keys, no cloud, no server logs. Single-turn Q&A and writing assistance feel indistinguishable from cloud chatbots at 15-25 tok/s. Multi-turn reasoning with tool use and context beyond 4K are where the gap vs server models opens.

Skip if: you need 70B-class models, concurrent multi-user serving, or your phone has <8GB RAM (pre-iPhone 15 Pro devices lack the Neural Engine capability and memory floor).

For production deployment

Operator-grade recommendation.

On-device iOS AI strategy operates within three hard constraints: App Store policy, device memory ceiling, and iOS background-task behavior.

**Model shipping vs download.** Shipping a [GGUF](/tools/llama-cpp) or MLX model inside the IPA guarantees offline-first availability. The cost: a 4GB Q4 model pushes the IPA past Apple's 200MB cellular download limit, triggering a mandatory Wi-Fi prompt. On-Demand Resources exemption requires App Review justification. The safer path: ship a 100MB placeholder model plus in-app download flow. [Ollama's iOS app](/tools/ollama) uses this pattern.

**App Store policy.** Apple permits local ML inference under the standard developer agreement. Guideline 2.5.2 (Developer Q&A 2025) clarifies that GGUF and MLX weight files are data, not executable code, and are permitted as in-app downloadable content. BGAppRefreshTask gives ~30 seconds of background execution — insufficient for model loading plus inference. For background inference, a VoIP push or processing entitlement is required, both constrained by App Review.

**Update cadence.** Model updates inside the app bundle require App Review (24-72 hours). In-app downloaded models update server-side instantly. For managed fleets, use MDM-enforced model versioning via Managed App Configuration — pin model versions across devices without App Review. Apple's MDM framework (com.apple.configuration.managed) supports key-value configuration that your app reads at launch.

**MDM deployment.** Via Jamf, Kandji, or Microsoft Intune, deploy your private LLM app. Enforce per-app VPN for model download from internal infrastructure. Set a Managed App Configuration key to pin the approved model version. The app downloads the pinned model, verifies its checksum, and enters offline-only mode (sketched below). Audit trails exist in MDM logs (deployment confirmation, config push receipt). No inference data leaves the device.

**When on-device wins.** Field use cases with intermittent/no connectivity (field inspections, secure facilities, aviation, maritime). Regulated environments where cloud processing is banned (HIPAA, attorney-client privilege, classified settings). Latency-critical UI interactions under 200ms round-trip.

**When on-device loses.** Workloads requiring >10B parameter models. Multi-user concurrent serving. Real-time vector database indexing. Training or fine-tuning. Full-resolution vision-language tasks.
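A minimal sketch of that pin-and-verify step. The com.apple.configuration.managed defaults key is Apple's standard Managed App Configuration channel; the ModelVersion and ModelSHA256 keys inside it are hypothetical names your MDM profile would define.

```swift
import Foundation
import CryptoKit

enum ManagedModelConfig {
    // Read the MDM-pushed configuration at launch.
    // "ModelVersion"/"ModelSHA256" are hypothetical keys set in your MDM profile.
    static func pinnedModel() -> (version: String, sha256: String)? {
        guard let managed = UserDefaults.standard
                .dictionary(forKey: "com.apple.configuration.managed"),
              let version = managed["ModelVersion"] as? String,
              let digest  = managed["ModelSHA256"] as? String else { return nil }
        return (version, digest)
    }

    // Stream the downloaded model file through SHA-256 in 4 MB chunks
    // and compare against the pinned digest before loading it.
    static func verify(modelAt url: URL, expectedSHA256: String) throws -> Bool {
        let handle = try FileHandle(forReadingFrom: url)
        defer { try? handle.close() }
        var hasher = SHA256()
        while let chunk = try handle.read(upToCount: 4 << 20), !chunk.isEmpty {
            hasher.update(data: chunk)
        }
        let digest = hasher.finalize().map { String(format: "%02x", $0) }.joined()
        return digest == expectedSHA256.lowercased()
    }
}
```

If verification fails, refuse to load the model and report via your MDM feedback channel; never fall back to an unpinned download.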

What breaks

Failure modes operators see in the wild.

- **Battery drain on sustained inference.** A 3B Q4 model consumes 15-20% battery per hour; 7B Q4 pushes to 22-28%. The GPU is always engaged for attention layers — the Neural Engine alone cannot handle the full transformer stack. Symptom: phone drops from 80% to 30% in under two hours. Mitigation: batch inference into closed-loop bursts. Design UX around single-round completion. Monitor thermal state via ProcessInfo().thermalState and throttle when .serious or .critical.
- **Thermal throttling.** iPhones lack active cooling. After 8-12 minutes of continuous 7B inference, the A18 Pro reaches ~95°C junction temperature and throttles the GPU from 1.4GHz to ~900MHz. Token rate drops 35-50%. External glass reaches 44-48°C. Symptom: tok/s drops from 18 to 9 across a long conversation. Mitigation: cap context at 4K. Defer heavy reasoning to a server when connectivity permits.
- **Model size ceiling on 8GB devices.** After iOS reserves ~2.5GB, 5.5GB remains for your app. A 7B Q4 at 4K uses ~4.5GB for weights + 0.8-1.5GB for KV cache. At 8K context, KV cache exceeds available memory. Symptom: Metal buffer allocation error or <1 tok/s. Mitigation: use 3B-4B models for 8K+ context. 7B at 8K requires 12GB+ RAM — no iPhone ships with this.
- **iOS background task restrictions.** On app switch, iOS suspends after ~5 seconds. BGAppRefreshTask gives ~30 seconds. Symptom: user switches apps mid-inference, returns to find the model unloaded and the conversation reset. Mitigation: serialize model state and KV cache to disk on the background event (see the sketch after this list). Display a resume spinner on foreground return.
- **App Store policy risk.** Apple permits open-weight model inference but restricts NSFW-capable models and models trained on copyrighted data in specific domains (music generation, voice cloning). Symptom: App Review rejection citing content policies when your app loads uncensored models. Mitigation: ship refusal guardrails in the app layer, document model provenance and training-data compliance in App Review notes.
- **First-launch Metal shader compilation.** First model load after install compiles GPU shaders: 8-15 seconds on A17 Pro, 6-10 seconds on A18 Pro. Subsequent loads use cached shaders (<1 second). Symptom: 12-second unresponsive spinner on first use. Mitigation: pre-warm shader compilation in the background on first launch, before the user enters chat, and display a progress indicator during compilation.
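A minimal guard for the thermal and background failure modes above, using the real ProcessInfo and UIApplication notification APIs. The checkpoint closure is a placeholder for whatever serialization hook your runtime actually exposes.

```swift
import UIKit

// Watches thermal state and app lifecycle so inference can throttle or
// checkpoint instead of dying silently.
final class InferenceGuard {
    private let checkpoint: () -> Void

    // True once iOS reports .serious or .critical thermal pressure.
    var shouldThrottle: Bool {
        switch ProcessInfo.processInfo.thermalState {
        case .serious, .critical: return true
        default: return false
        }
    }

    init(checkpoint: @escaping () -> Void) {
        self.checkpoint = checkpoint
        let center = NotificationCenter.default
        center.addObserver(forName: ProcessInfo.thermalStateDidChangeNotification,
                           object: nil, queue: .main) { [weak self] _ in
            if self?.shouldThrottle == true {
                // Cap context, shrink batch size, or pause generation here.
                print("thermal pressure: throttling inference")
            }
        }
        center.addObserver(forName: UIApplication.didEnterBackgroundNotification,
                           object: nil, queue: .main) { [weak self] _ in
            // iOS suspends the process within seconds of backgrounding:
            // serialize model state + KV cache now, resume on foreground.
            self?.checkpoint()
        }
    }
}

// Usage: let watcher = InferenceGuard { saveKVCacheToDisk() }
// (saveKVCacheToDisk is a hypothetical hook into your inference runtime)
```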

Hardware guidance

**Hobbyist: iPhone 15 Pro (8GB, A17 Pro).** Entry point. Runs 1B-3B at 25-35 tok/s, 7B Q4 at 12-15 tok/s with 2K context. Sufficient for personal offline chat, writing assistance, basic summarization. One device, no fleet.

**Hobbyist: iPhone 16 (8GB, A18).** Same 8GB ceiling, 15-20% faster token rates due to A18 architectural improvements. 3B Q4 at 35-45 tok/s, 7B Q4 at 15-18 tok/s. Better sustained thermals than A17 Pro (~10 min longer before throttle). The non-Pro value pick.

**SMB: iPhone 16 Pro fleet (8GB, A18 Pro).** Practical ceiling for single-user on-device AI. 3B Q4 at 40-50 tok/s, 7B Q4 at 18-22 tok/s, 8K context on 3B. Deploy via MDM with Managed App Configuration for model version pinning. Ten iPhone 16 Pro devices running private LLM apps = field-team AI access with zero server infrastructure and zero data egress. Cost: ~$999/device + MDM licensing ($4-8/device/month via Jamf Pro or Kandji). Flat per-device cost model, no inference-per-query charges.

**Enterprise: iPhone fleet + private app + MDM.** Build a private iOS app wrapping [MLX Swift](https://github.com/ml-explore/mlx-swift) with your fine-tuned model. Distribute via Apple Business Manager with MDM-enforced configuration. All inference data stays on-device; audit trail via MDM logs. For 100+ device fleets, per-device savings vs cloud API at $0.01-0.03/query break even in 3-6 months (worked example below). Plan 1-2 iOS engineers full-time for a production-quality private LLM app.

**Frontier: not applicable.** iOS devices cannot serve multi-user concurrent inference, cannot train or fine-tune, and are architecturally unsuited for >10B parameter models. The iPhone is the inference edge node — it handles user-facing inference; heavy lifting runs on [NVIDIA L40S](/hardware/nvidia-l40s), [RTX 5090](/hardware/rtx-5090), or cloud infrastructure. The A19 Pro (iPhone 17 Pro, late 2026) is rumored at 12GB RAM, which would unlock 8B-10B Q4 at 8K context — a meaningful capability jump. Until then, 8GB is the ceiling.
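The 3-6 month break-even claim is sensitive to per-device query volume. A sketch of the arithmetic, with every input a hypothetical to replace with your own numbers:

```swift
// Illustrative fleet math only; all inputs below are assumptions.
func breakEvenMonths(devices: Int, devicePrice: Double,
                     mdmPerDeviceMonthly: Double,
                     queriesPerDevicePerDay: Double,
                     cloudPricePerQuery: Double) -> Double {
    let capex            = Double(devices) * devicePrice
    let cloudBillAvoided = Double(devices) * queriesPerDevicePerDay * 30
                                           * cloudPricePerQuery
    let mdmCost          = Double(devices) * mdmPerDeviceMonthly
    return capex / (cloudBillAvoided - mdmCost)
}

// 100 devices at $999, $6/device/mo MDM, 400 queries/device/day at $0.02/query:
// 99,900 / (24,000 - 600) ≈ 4.3 months — inside the quoted 3-6 month window.
// At 150 queries/device/day the same fleet takes ~12 months, so measure volume first.
print(breakEvenMonths(devices: 100, devicePrice: 999, mdmPerDeviceMonthly: 6,
                      queriesPerDevicePerDay: 400, cloudPricePerQuery: 0.02))
```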

Runtime guidance

Decision tree for iOS on-device inference:

**Swift-native app targeting Apple Intelligence integration → MLX Swift.** [MLX Swift](https://github.com/ml-explore/mlx-swift) provides Metal-accelerated inference with unified memory management across CPU, GPU, and Neural Engine. Best token rate on Apple Silicon, lowest battery drain per token, native Swift API. Supports the MLX model format — convert Hugging Face checkpoints via mlx-lm's convert tool. Model coverage: Mistral, Llama, Phi, Qwen, Gemma architectures well-supported. Correct choice for consumer iOS apps where UX polish and battery life are priorities (generation sketch below).

**Cross-platform app (iOS + Android) or broadest model support → llama.cpp iOS.** [llama.cpp](/tools/llama-cpp) iOS bindings cover 40+ model architectures via Metal acceleration. Token rate is 10-15% lower than MLX Swift due to manual buffer management vs Metal's unified allocator. Battery drain is 5-10% higher per hour. Benefit: GGUF is the de facto standard format — thousands of pre-quantized variants on Hugging Face reduce conversion overhead. Use when model availability or cross-platform code sharing matter more than the last 10% of performance.

**Apple Intelligence-aware (notification summaries, writing tools, Siri) → CoreML + Apple Intelligence APIs.** Apple Intelligence runs on [CoreML](https://developer.apple.com/documentation/coreml) models deployed with iOS. Not user-replaceable, not open-weight. Interact through App Intents and the Writing Tools API. Zero inference management overhead. The model is a black box — you accept Apple's quality and update cadence. Use for enhancing iOS workflows without shipping your own model.

**Comparison:**

- Token rate (7B Q4, [A18 Pro](/hardware/apple-a18-pro)): MLX Swift 20-22 tok/s, llama.cpp iOS 17-19 tok/s, CoreML N/A
- Model format: MLX (.safetensors, converted), GGUF, CoreML .mlpackage
- Battery efficiency: MLX Swift best, llama.cpp iOS good, CoreML excellent
- Model coverage: MLX Swift (Mistral/Llama/Phi/Qwen/Gemma), llama.cpp iOS (40+ architectures), CoreML (Apple-curated)
- First-launch compile: MLX Swift 6-12s, llama.cpp iOS 8-15s, CoreML pre-compiled

**Hybrid approach:** MLX Swift for the primary chat surface (best UX + battery), llama.cpp as fallback for unsupported architectures, CoreML for system-level integrations (Siri shortcuts, App Intents).
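A minimal generation sketch for the MLX Swift branch, modeled on the MLXLLM and MLXLMCommon packages from Apple's mlx-swift-examples repo. The exact types and signatures are assumptions that may lag the repo's current API, and the model ID is one of the mlx-community pre-quantized uploads.

```swift
import MLXLLM
import MLXLMCommon

// Fetches (or loads from cache) a pre-converted 4-bit MLX model and runs a
// capped completion. Signatures follow mlx-swift-examples as understood here;
// verify against the repo before building on this.
func chat(_ prompt: String) async throws {
    let container = try await LLMModelFactory.shared.loadContainer(
        configuration: ModelConfiguration(id: "mlx-community/Llama-3.2-3B-Instruct-4bit"))

    let result = try await container.perform { context in
        let input = try await context.processor.prepare(input: UserInput(prompt: prompt))
        return try MLXLMCommon.generate(
            input: input,
            parameters: GenerateParameters(temperature: 0.6),
            context: context
        ) { tokens in
            tokens.count < 512 ? .more : .stop   // cap reply length
        }
    }
    print(result.output)                         // decoded completion
    print("\(result.tokensPerSecond) tok/s")     // compare against the table above
}
```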

Setup walkthrough

  1. Install MLC Chat from the iOS App Store (free, open-source).
  2. Open the app → tap "Model Store" → download Llama 3.2 3B Q4_K_M (~2 GB, downloads in ~1-2 minutes on Wi-Fi).
  3. Tap the model → type "Explain quantum computing in simple terms." First response in 2-5 seconds on iPhone 15 Pro or newer.
  4. For larger models: download Qwen 2.5 7B Q4_K_M (~4.5 GB) — runs at 8-15 tok/s on A18 Pro (iPhone 16 Pro).
  5. For Apple Intelligence features (summarization, writing tools): Settings → Apple Intelligence → turn on. Works on iPhone 15 Pro+ with iOS 18+.
  6. For developer workflows: install MLX Swift (Apple's on-device ML framework) and build custom inference apps (package wiring sketched below).

All processing is on-device. No network needed after download.
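For step 6, a minimal Package.swift shows the dependency wiring. The version pin is an assumption (check mlx-swift's current tags), and note the higher-level LLM helpers (MLXLLM, MLXLMCommon) ship separately in the mlx-swift-examples repo.

```swift
// swift-tools-version: 5.9
// Minimal SwiftPM wiring for an MLX-based iOS target; version pin is an assumption.
import PackageDescription

let package = Package(
    name: "LocalInference",
    platforms: [.iOS(.v17)],
    dependencies: [
        .package(url: "https://github.com/ml-explore/mlx-swift", from: "0.18.0"),
    ],
    targets: [
        .target(
            name: "LocalInference",
            dependencies: [
                .product(name: "MLX", package: "mlx-swift"),     // core arrays + Metal
                .product(name: "MLXNN", package: "mlx-swift"),   // neural-net layers
            ]
        )
    ]
)
```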

The cheap setup

iPhone 15 Pro (A17 Pro, 8 GB RAM, ~$700-800 used). Runs Llama 3.2 3B at 15-25 tok/s, Qwen 2.5 7B Q4 at 8-12 tok/s. Battery impact: ~5% per 10 minutes of inference. iPhone 16 (A18, 8 GB, ~$800 new) runs 7B models at 10-15 tok/s. iPhone SE 3rd gen (A15, 4 GB, ~$250 used) runs Llama 3.2 1B at 15-20 tok/s but 3B models are tight on 4 GB. For serious on-device AI, minimum is A17 Pro + 8 GB. An iPhone 15 Pro at $700 is the cheapest entry point for competent on-device LLM inference.

The serious setup

iPhone 16 Pro Max (A18 Pro, 8 GB RAM, ~$1,200 new). Runs Qwen 2.5 7B at 12-18 tok/s, Llama 3.2 3B at 25-40 tok/s. MLC Chat supports model switching in <5 seconds. For the absolute best iPhone AI: iPhone 16 Pro Max + MLC Chat + Qwen 2.5 7B for general chat + Stable Diffusion (Draw Things app) for on-device image gen (SD 1.5 at ~30s per 512×512 on A18 Pro). Total investment: ~$1,200-1,500. This is the ceiling — iPhones max out at 8 GB RAM, so 7B-8B models are the limit regardless of price tier.

Common beginner mistake

The mistake: Expecting iPhone on-device AI to match desktop GPU performance — downloading a 7B model and getting 8 tok/s vs. 60 tok/s on an RTX 3060. Why it fails: The A18 Pro Neural Engine + GPU combined deliver ~35 TOPS (INT8). An RTX 3060 delivers ~100+ TOPS (FP16/INT8) with dedicated CUDA cores and 12 GB of dedicated VRAM. The iPhone shares 8 GB between the OS, apps, and model weights. The fix: Use quantized models (Q4_K_M or lower). Use the smallest model that meets your quality needs — Llama 3.2 3B is the sweet spot for iPhone (fast, fits in 4-6 GB). For heavy inference, use the iPhone for quick queries on the go and do batch work on a desktop GPU. iPhone AI is for convenience, not throughput.
