iPhone on-device AI stack — Llama 3.2 3B / Phi-3.5 Mini via MLX Swift
App-bundled local LLM inference on iPhone 15 Pro / 16 Pro using MLX Swift + a 3B-class quantized model. The mobile-AI stack you ship in production iOS apps — battery-aware, thermal-aware, App Store reviewable. No fake numbers; honest about the throttle curve.
What this stack accomplishes
This is the iOS-app-bundled local LLM inference stack for production deployment in 2026. Apple Intelligence reshaped the conversation in 2024-2025; by 2026, shipping a 3B-class on-device model in your iOS app is operationally viable for summarization, classification, voice transcription post-processing, and offline-first features.
The honest framing of what this is and isn't:
- Is: a production-grade path for a 3B-class model running on-device, app-bundled, no network calls, App Store reviewable.
- Is not: a replacement for cloud LLMs. iPhone tok/s lags desktop GPU tok/s by 4-10×. Sustained workloads thermal-throttle.
- Is not: a 7B-class deployment. iPhone RAM (8 GB) bottlenecks anything past 4B.
Hardware required
iPhone 15 Pro or newer (A17 Pro+ Neural Engine) · iPad M4 (38 TOPS NPU + 120 GB/s memory bandwidth) for tablet-tier · iOS 17.4+ for MLX Swift Apple Intelligence-class deployment · Mac with Xcode 15.4+ for the build/sign toolchain · ~5 GB Mac storage for the model + Xcode caches
Components — what to install and why
- 01 · Hardware · Target SoC (iPhone 16 Pro) · apple-a18-pro
  A18 Pro 38 TOPS Neural Engine + 8 GB RAM. The 8 GB floor is what makes 3B-class models viable on-device — A17 Pro at 8 GB also works, but with tighter KV-cache headroom.
- 02 · Hardware · Tablet-tier alternative · apple-m4-ipad
  iPad Pro M4 has 120 GB/s memory bandwidth (vs 60 on phones) — sustained-load throughput is meaningfully higher. The right target if your app is iPad-first or supports both form factors.
- 03 · Tool · On-device runtime (Apple first-party Swift API) · mlx-swift
  MLX Swift is Apple's first-party path. Same model checkpoints as desktop MLX-LM (write once, run on Mac + iPhone + iPad). Active Apple maintenance — updated alongside iOS releases. The catch: iOS-only.
- 04 · Model · Primary 3B chat model · llama-3.2-3b-instruct
  3B at INT4 quant (~1.9 GB on disk) fits comfortably in the 8 GB iPhone RAM with 4K context. The Llama Community License permits app-bundling. Apple's MLX Swift example apps demonstrate this exact configuration.
- 05 · Model · Alternative 3.8B model with stronger instruction-following · phi-3.5-mini-instruct
  Phi-3.5 Mini is 3.8B and slightly heavier than Llama 3.2 3B, but with better instruction-following polish. MIT licensed. Pick it when prompt adherence matters more than raw throughput.
- 06 · Model · Multilingual 3B alternative · qwen-2.5-3b-instruct
  Qwen 2.5 3B at INT4 is the multilingual choice. Note the Qwen License for the 3B size class (not Apache 2.0). Similar memory footprint to Llama 3.2 3B.
Step-by-step setup (Swift Package + model checkpoint)
1. Add MLX Swift to your Xcode project
// Package.swift
dependencies: [
    .package(url: "https://github.com/ml-explore/mlx-swift", from: "0.18.0"),
    .package(
        url: "https://github.com/ml-explore/mlx-swift-examples",
        branch: "main"
    )
],
targets: [
    .executableTarget(
        name: "MyApp",
        dependencies: [
            .product(name: "MLX", package: "mlx-swift"),
            .product(name: "MLXLLM", package: "mlx-swift-examples"),
        ]
    )
]

2. Bundle a quantized model with the app
# On your Mac (model conversion)
pip install mlx-lm
mlx_lm.convert \
    --hf-path meta-llama/Llama-3.2-3B-Instruct \
    --quantize \
    --q-bits 4 \
    --mlx-path ./Llama-3.2-3B-Instruct-mlx-int4
# Output: ~1.9 GB. Add to Xcode project as a folder reference
# under YourApp/Resources/Models/. App Store binary cap is 4 GB, so
# 3B-INT4 fits easily; 7B-INT4 would not.

3. Load + run inference (Swift)
import MLX
import MLXLLM

let modelURL = Bundle.main.url(
    forResource: "Llama-3.2-3B-Instruct-mlx-int4",
    withExtension: nil
)!
let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: .init(directory: modelURL)
)
let result = try await modelContainer.perform { context in
    let input = try await context.processor.prepare(
        input: .init(prompt: "Summarize this in one sentence: ...")
    )
    return try generate(
        input: input,
        parameters: .init(maxTokens: 200, temperature: 0.4),
        context: context
    )
}
print(result.output)
print("Tokens/sec: \(result.tokensPerSecond)")

4. Pre-warm at app launch (avoid first-token cliff)
// In SceneDelegate or App.init:
Task.detached(priority: .userInitiated) {
    try? await modelContainer.perform { context in
        // Warm-up generate of 1 token to load weights into NPU cache
        let warmup = try await context.processor.prepare(
            input: .init(prompt: " ")
        )
        _ = try generate(
            input: warmup,
            parameters: .init(maxTokens: 1),
            context: context
        )
    }
}
// First-real-query latency drops from 2-3s cold to <500ms warm.

Thermal + battery reality check
Mobile NPU + GPU inference is thermally bound, not compute-bound. The first 2-3 minutes of inference run at peak tok/s; past 5-10 minutes the device throttles 25-50% under sustained load. Plan your UX around this:
- Bursty UX wins. 30-second summarization of an article: fast and snappy.
- Continuous chat falls off. A 20-minute conversational session will visibly slow.
- Background continuity: iOS aggressively suspends apps. Use BGProcessingTask for long-running summaries; expect interruptions.
- Battery: ~3-7% per 10-min active inference session on iPhone 16 Pro (editorial estimate). Measure on your workload.
- Charging mitigates thermal throttling but adds heat in the other direction. Test the user experience while plugged in vs unplugged.
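The throttle-aware UX above can be driven directly from the OS: ProcessInfo exposes the device's thermal state, and Foundation posts a notification when it changes. A minimal sketch — the token budgets here are illustrative assumptions, not measured values:

```swift
import Foundation

// Map the current thermal state to a generation budget.
// Budget values are illustrative; tune them against your own measurements.
func maxTokenBudget() -> Int {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal:  return 512   // device is cool; full budget
    case .fair:     return 256   // mild throttling likely
    case .serious:  return 128   // visibly reduced tok/s; shorten outputs
    case .critical: return 0     // defer inference entirely
    @unknown default: return 128
    }
}

// React to state changes mid-session, e.g. to surface a
// "device warming up" indicator in the UI.
let observer = NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil,
    queue: .main
) { _ in
    let state = ProcessInfo.processInfo.thermalState
    print("Thermal state changed: \(state.rawValue)")
}
```

Checking the budget before each generation call (and capping maxTokens accordingly) keeps long sessions from running full-length completions on a device that is already throttling.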
Expected outcome
Ship an iOS app that loads a 3B-model checkpoint at app start (~2-3 sec on iPhone 16 Pro), serves single-stream LLM inference at editorial-estimated 8-15 tok/s decode (cold), and gracefully degrades when the device thermal-throttles after 5-10 min of sustained load. Battery cost: ~3-7% per 10-min session at peak; verify on your specific device + workload before shipping.
App Store review considerations
- App size: 3B-INT4 model is ~1.9 GB. Apple's app size cap on initial install (4 GB) tolerates this; cellular install limits (200 MB without override) do not. Use NSBundleResourceRequest for on-demand resource download if you need cellular installs.
- Privacy disclosures: on-device inference is the simplest privacy story possible — disclose that AI runs on-device, no data leaves the device.
- Battery transparency: heavy AI usage will get flagged in iOS Battery settings. Make this clear in your onboarding so users aren't surprised.
- License compliance: bundling Llama 3.2 3B requires Llama Community License attribution in your About / Settings screen. Phi-3.5 (MIT) and Qwen 2.5 (Qwen License) have different requirements — check before submission.
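For the cellular-install case, the On-Demand Resources path looks roughly like this. A hedged sketch: the ODR tag "llama-3b-int4" is a hypothetical name you would assign to the model folder in Xcode.

```swift
import Foundation

// Fetch the model as an On-Demand Resource instead of bundling it in the
// initial install. The tag "llama-3b-int4" is hypothetical; it must match
// the ODR tag assigned to the model folder in Xcode.
let request = NSBundleResourceRequest(tags: ["llama-3b-int4"])
request.loadingPriority = NSBundleResourceRequestLoadingPriorityUrgent

do {
    try await request.beginAccessingResources()
    // Resources now resolve through Bundle.main as usual.
    let modelURL = Bundle.main.url(
        forResource: "Llama-3.2-3B-Instruct-mlx-int4",
        withExtension: nil
    )
    print("Model available at \(String(describing: modelURL))")
    // Call request.endAccessingResources() when done with the resources.
} catch {
    print("ODR download failed: \(error)")
}
```

Note that ODR asset packs have per-pack size limits, so a ~1.9 GB model may need to be split across multiple tags — or downloaded from your own CDN — rather than shipped as a single pack; verify against current App Store limits before committing to this route.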
Failure modes you'll hit
- Cold-start latency feels broken. First model load on app launch is 2-3 seconds on iPhone 16 Pro. Without pre-warm, the first user query feels frozen. Always pre-warm at app launch.
- Memory pressure crashes. 8 GB iPhone RAM is shared with iOS, your UI, and any other apps. The 3B-INT4 model + KV cache + activations consume ~3-3.5 GB of working memory; combined with iOS + your app, you can hit memory pressure on the iPhone 15 Pro (8 GB total). Test with os_proc_available_memory() instrumentation.
- Backgrounding kills inference mid-stream. If the user backgrounds your app during a long generation, iOS suspends the process. Save partial state and resume on foreground.
- Thermal throttling looks like the model got dumber. Throttled tok/s drops 30-50% under sustained load. UX-wise this can feel like degraded quality; instrument tok/s and surface a "device warming up" indicator if your UX needs it.
- iOS 17.4+ requirement. MLX Swift requires recent iOS. Check deployment target before assuming the API is available.
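The memory-pressure failure mode above can be guarded with the instrumentation the list mentions. A sketch, assuming the ~3-3.5 GB working-set estimate from this section; canRunInference is a hypothetical helper, and os_proc_available_memory() is iOS-only:

```swift
import os

// Gate a generation request on currently available memory.
// The threshold is an editorial estimate from this section, not a measured
// constant; calibrate it on your own workload.
func canRunInference() -> Bool {
    // Bytes this process may still allocate before hitting memory limits
    // (os_proc_available_memory is available on iOS, not macOS).
    let available = UInt64(os_proc_available_memory())
    let needed: UInt64 = 3_500_000_000  // ~3.5 GB for model + KV cache + activations
    if available < needed {
        print("Low memory: \(available / 1_000_000) MB free; deferring inference")
        return false
    }
    return true
}
```

Pair this with a UIApplication.didReceiveMemoryWarningNotification observer that drops the KV cache, so a warning during generation frees memory before iOS kills the process.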
Troubleshooting
Symptom: model loads but generation is silent / hangs. Check that your model directory is added as a folder reference (blue folder icon in Xcode), not a group (yellow folder). Group references flatten the contents into the bundle root and break the MLX loader's file lookup.
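A quick runtime check for the folder-reference problem: with a proper folder reference the directory URL resolves, and the MLX checkpoint's config.json resolves inside it. A sketch — the folder name matches the conversion step earlier; adjust to yours:

```swift
import Foundation

// Sanity-check that the model directory made it into the bundle as a folder
// reference: the checkpoint's config.json should resolve inside it.
if let dir = Bundle.main.url(
    forResource: "Llama-3.2-3B-Instruct-mlx-int4",
    withExtension: nil
) {
    let config = dir.appendingPathComponent("config.json")
    let present = FileManager.default.fileExists(atPath: config.path)
    print("config.json present: \(present)")
} else {
    // A nil URL here usually means the folder was added as a group (yellow),
    // so its contents were flattened into the bundle root.
    print("Model directory missing from bundle; check the folder reference")
}
```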
Symptom: tok/s is 2-3× slower than expected. Verify the device has cooled down (no recent heavy CPU or GPU usage). Thermal-throttled measurements are not representative of the cold tok/s numbers.
Symptom: works on simulator, crashes on device. The simulator runs MLX on Apple Silicon Mac hardware; on-device runs on iPhone NPU. Memory mapping and quant kernel coverage differ. Always test on physical device before committing architecture decisions.
Variations and alternatives
Phi-3.5 Mini variant: swap the model bundle for Phi-3.5 Mini. Slightly heavier (3.8B vs 3B) but better instruction-following polish. MIT license simplifies attribution.
Multilingual variant: swap to Qwen 2.5 3B for stronger non-English support. Note Qwen License requires attribution.
iPad-first deployment: target iPad M4. 120 GB/s memory bandwidth (vs 60 on phones) sustains higher tok/s under load.
Cross-platform alternative: if you need Android too, see the Android on-device AI stack. MLX Swift is iOS-only.
Who should avoid this stack
- Cross-platform apps — MLX Swift is iOS-only. Use MLC LLM if you need a shared toolchain.
- 7B+ model requirements — iPhone RAM doesn't fit. Cloud or device-as-thin-client is the right answer.
- Continuous-use workloads (live tutoring, real-time translation): thermal throttling will visibly degrade the experience.
- App Store-cellular-install-critical apps — 1.9 GB model bundle won't install over cellular without on-demand resources rework.
Going deeper
- Android on-device AI stack — the cross-platform sibling with the same architectural pattern.
- Apple A18 Pro hardware page — SoC specs + NPU capabilities.
- MLX Swift operational review — runtime depth.
- Request an iPhone tok/s benchmark — see the queue and reproduce protocol.