iPhone on-device AI stack — Llama 3.2 3B / Phi-3.5 Mini via MLX Swift
App-bundled local LLM inference on iPhone 15 Pro / 16 Pro using MLX Swift + a 3B-class quantized model. The mobile-AI stack you ship in production iOS apps — battery-aware, thermal-aware, App Store reviewable. No fake numbers; honest about the throttle curve.
What this stack accomplishes
This is the iOS-app-bundled local LLM inference stack for production deployment in 2026. Apple Intelligence reshaped the conversation in 2024-2025; by 2026, shipping a 3B-class on-device model in your iOS app is operationally viable for summarization, classification, voice transcription post-processing, and offline-first features.
The honest framing of what this is and isn't:
- Is: a production-grade path for a 3B-class model running on-device, app-bundled, no network calls, App Store reviewable.
- Is not: a replacement for cloud LLMs. iPhone tok/s lags desktop GPU tok/s by 4-10×. Sustained workloads thermal-throttle.
- Is not: a 7B-class deployment. iPhone RAM (8 GB) bottlenecks anything past 4B.
Hardware required
iPhone 15 Pro or newer (A17 Pro+ Neural Engine) · iPad M4 (38 TOPS NPU + 120 GB/s memory bandwidth) for tablet-tier · iOS 17.4+ for MLX Swift Apple Intelligence-class deployment · Mac with Xcode 15.4+ for the build/sign toolchain · ~5 GB Mac storage for the model + Xcode caches
Components — what to install and why
- 01 · Hardware · Target SoC (iPhone 16 Pro) · apple-a18-pro
  A18 Pro 38 TOPS Neural Engine + 8 GB RAM. The 8 GB floor is what makes 3B-class models viable on-device — A17 Pro at 8 GB also works, but with tighter KV-cache headroom.
- 02 · Hardware · Tablet-tier alternative · apple-m4-ipad
  iPad Pro M4 has 120 GB/s memory bandwidth (vs 60 on phones) — sustained-load throughput is meaningfully higher. The right target if your app is iPad-first or supports both form factors.
- 03 · Tool · On-device runtime (Apple first-party Swift API) · mlx-swift
  MLX Swift is Apple's first-party path. Same model checkpoints as desktop MLX-LM (write once, run on Mac + iPhone + iPad). Active Apple maintenance — updated alongside iOS releases. The catch: iOS-only.
- 04 · Model · Primary 3B chat model · llama-3.2-3b-instruct
  3B at INT4 quant (~1.9 GB on disk) fits comfortably in the 8 GB iPhone RAM with 4K context. The Llama Community License permits app-bundling. Apple's MLX Swift example apps demonstrate this exact configuration.
- 05 · Model · Alternative 3.8B model with stronger instruction-following · phi-3.5-mini-instruct
  Phi-3.5 Mini is 3.8B and slightly heavier than Llama 3.2 3B, but with better instruction-following polish. MIT licensed. Pick it when prompt adherence matters more than raw throughput.
- 06 · Model · Multilingual 3B alternative · qwen-2.5-3b-instruct
  Qwen 2.5 3B at INT4 is the multilingual choice. Note the Qwen License for the 3B size class (not Apache 2.0). Similar memory footprint to Llama 3.2 3B.
Step-by-step setup (Swift Package + model checkpoint)
1. Add MLX Swift to your Xcode project
// Package.swift
dependencies: [
    .package(url: "https://github.com/ml-explore/mlx-swift", from: "0.18.0"),
    .package(
        url: "https://github.com/ml-explore/mlx-swift-examples",
        branch: "main"
    )
],
targets: [
    .executableTarget(
        name: "MyApp",
        dependencies: [
            .product(name: "MLX", package: "mlx-swift"),
            .product(name: "MLXLLM", package: "mlx-swift-examples"),
        ]
    )
]

2. Bundle a quantized model with the app
# On your Mac (model conversion)
pip install mlx-lm
mlx_lm.convert \
    --hf-path meta-llama/Llama-3.2-3B-Instruct \
    --quantize \
    --q-bits 4 \
    --mlx-path ./Llama-3.2-3B-Instruct-mlx-int4
# Output: ~1.9 GB. Add to Xcode project as a folder reference
# under YourApp/Resources/Models/. App Store binary cap is 4 GB, so
# 3B-INT4 fits easily; 7B-INT4 would not.

3. Load + run inference (Swift)
import MLX
import MLXLLM

let modelURL = Bundle.main.url(
    forResource: "Llama-3.2-3B-Instruct-mlx-int4",
    withExtension: nil
)!
let modelContainer = try await LLMModelFactory.shared.loadContainer(
    configuration: .init(directory: modelURL)
)
let result = try await modelContainer.perform { context in
    let input = try await context.processor.prepare(
        input: .init(prompt: "Summarize this in one sentence: ...")
    )
    return try generate(
        input: input,
        parameters: .init(maxTokens: 200, temperature: 0.4),
        context: context
    )
}
print(result.output)
print("Tokens/sec: \(result.tokensPerSecond)")

4. Pre-warm at app launch (avoid first-token cliff)
// In SceneDelegate or App.init:
Task.detached(priority: .userInitiated) {
    try? await modelContainer.perform { context in
        // Warm-up generate of 1 token to load weights into NPU cache
        let warmup = try await context.processor.prepare(
            input: .init(prompt: " ")
        )
        _ = try generate(
            input: warmup,
            parameters: .init(maxTokens: 1),
            context: context
        )
    }
}
// First-real-query latency drops from 2-3s cold to <500ms warm.

Thermal + battery reality check
Mobile NPU + GPU inference is thermally bound, not compute-bound. The first 2-3 minutes of inference run at peak tok/s; past 5-10 minutes the device throttles 25-50% under sustained load. Plan your UX around this:
- Bursty UX wins. 30-second summarization of an article: fast and snappy.
- Continuous chat falls off. A 20-minute conversational session will visibly slow.
- Background continuity: iOS aggressively suspends apps. Use BGProcessingTask for long-running summaries; expect interruptions.
- Battery: ~3-7% per 10-min active inference session on iPhone 16 Pro (editorial estimate). Measure on your workload.
- Charging mitigates thermal throttling but adds heat in the other direction. Test the user experience while plugged in vs unplugged.
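The throttle-aware UX above can be driven directly from the OS: ProcessInfo exposes the device's thermal state, and Foundation posts a notification when it changes. A minimal sketch — the token budgets here are illustrative assumptions, not measured values:

```swift
import Foundation

// Map the current thermal state to a generation budget.
// Budget values are illustrative; tune them against your own measurements.
func maxTokenBudget() -> Int {
    switch ProcessInfo.processInfo.thermalState {
    case .nominal:  return 512   // device is cool; full budget
    case .fair:     return 256   // mild throttling likely
    case .serious:  return 128   // visibly reduced tok/s; shorten outputs
    case .critical: return 0     // defer inference entirely
    @unknown default: return 128
    }
}

// React to state changes mid-session, e.g. to surface a
// "device warming up" indicator in the UI.
let observer = NotificationCenter.default.addObserver(
    forName: ProcessInfo.thermalStateDidChangeNotification,
    object: nil,
    queue: .main
) { _ in
    let state = ProcessInfo.processInfo.thermalState
    print("Thermal state changed: \(state.rawValue)")
}
```

Checking the budget before each generation call (and capping maxTokens accordingly) keeps long sessions from running full-length completions on a device that is already throttling.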
Expected outcome
Ship an iOS app that loads a 3B-model checkpoint at app start (~2-3 sec on iPhone 16 Pro), serves single-stream LLM inference at editorial-estimated 8-15 tok/s decode (cold), and gracefully degrades when the device thermal-throttles after 5-10 min of sustained load. Battery cost: ~3-7% per 10-min session at peak; verify on your specific device + workload before shipping.
App Store review considerations
- App size: 3B-INT4 model is ~1.9 GB. Apple's app size cap on initial install (4 GB) tolerates this; cellular install limits (200 MB without override) do not. Use NSBundleResourceRequest for on-demand resource download if you need cellular installs.
- Privacy disclosures: on-device inference is the simplest privacy story possible — disclose that AI runs on-device, no data leaves the device.
- Battery transparency: heavy AI usage will get flagged in iOS Battery settings. Make this clear in your onboarding so users aren't surprised.
- License compliance: bundling Llama 3.2 3B requires Llama Community License attribution in your About / Settings screen. Phi-3.5 (MIT) and Qwen 2.5 (Qwen License) have different requirements — check before submission.
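For the cellular-install case, the On-Demand Resources path looks roughly like this. A hedged sketch: the ODR tag "llama-3b-int4" is a hypothetical name you would assign to the model folder in Xcode.

```swift
import Foundation

// Fetch the model as an On-Demand Resource instead of bundling it in the
// initial install. The tag "llama-3b-int4" is hypothetical; it must match
// the ODR tag assigned to the model folder in Xcode.
let request = NSBundleResourceRequest(tags: ["llama-3b-int4"])
request.loadingPriority = NSBundleResourceRequestLoadingPriorityUrgent

do {
    try await request.beginAccessingResources()
    // Resources now resolve through Bundle.main as usual.
    let modelURL = Bundle.main.url(
        forResource: "Llama-3.2-3B-Instruct-mlx-int4",
        withExtension: nil
    )
    print("Model available at \(String(describing: modelURL))")
    // Call request.endAccessingResources() when done with the resources.
} catch {
    print("ODR download failed: \(error)")
}
```

Note that ODR asset packs have per-pack size limits, so a ~1.9 GB model may need to be split across multiple tags — or downloaded from your own CDN — rather than shipped as a single pack; verify against current App Store limits before committing to this route.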
Failure modes you'll hit
- Cold-start latency feels broken. First model load on app launch is 2-3 seconds on iPhone 16 Pro. Without pre-warm, the first user query feels frozen. Always pre-warm at app launch.
- Memory pressure crashes. 8 GB iPhone RAM is shared with iOS, your UI, and any other apps. The 3B-INT4 model + KV cache + activations consume ~3-3.5 GB of working memory; combined with iOS + your app, you can hit memory pressure on the iPhone 15 Pro (8 GB total). Test with os_proc_available_memory() instrumentation.
- Backgrounding kills inference mid-stream. If the user backgrounds your app during a long generation, iOS suspends the process. Save partial state and resume on foreground.
- Thermal throttling looks like the model got dumber. Throttled tok/s drops 30-50% under sustained load. UX-wise this can feel like degraded quality; instrument tok/s and surface a "device warming up" indicator if your UX needs it.
- iOS 17.4+ requirement. MLX Swift requires recent iOS. Check deployment target before assuming the API is available.
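The memory-pressure failure mode above can be guarded with the instrumentation the list mentions. A sketch, assuming the ~3-3.5 GB working-set estimate from this section; canRunInference is a hypothetical helper, and os_proc_available_memory() is iOS-only:

```swift
import os

// Gate a generation request on currently available memory.
// The threshold is an editorial estimate from this section, not a measured
// constant; calibrate it on your own workload.
func canRunInference() -> Bool {
    // Bytes this process may still allocate before hitting memory limits
    // (os_proc_available_memory is available on iOS, not macOS).
    let available = UInt64(os_proc_available_memory())
    let needed: UInt64 = 3_500_000_000  // ~3.5 GB for model + KV cache + activations
    if available < needed {
        print("Low memory: \(available / 1_000_000) MB free; deferring inference")
        return false
    }
    return true
}
```

Pair this with a UIApplication.didReceiveMemoryWarningNotification observer that drops the KV cache, so a warning during generation frees memory before iOS kills the process.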
Troubleshooting
Symptom: model loads but generation is silent / hangs. Check that your model directory is added as a folder reference (blue folder icon in Xcode), not a group (yellow folder). Group references flatten the contents into the bundle root and break the MLX loader's file lookup.
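A quick runtime check for the folder-reference problem: with a proper folder reference the directory URL resolves, and the MLX checkpoint's config.json resolves inside it. A sketch — the folder name matches the conversion step earlier; adjust to yours:

```swift
import Foundation

// Sanity-check that the model directory made it into the bundle as a folder
// reference: the checkpoint's config.json should resolve inside it.
if let dir = Bundle.main.url(
    forResource: "Llama-3.2-3B-Instruct-mlx-int4",
    withExtension: nil
) {
    let config = dir.appendingPathComponent("config.json")
    let present = FileManager.default.fileExists(atPath: config.path)
    print("config.json present: \(present)")
} else {
    // A nil URL here usually means the folder was added as a group (yellow),
    // so its contents were flattened into the bundle root.
    print("Model directory missing from bundle; check the folder reference")
}
```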
Symptom: tok/s is 2-3× slower than expected. Verify the device has cooled down (no recent heavy CPU or GPU usage). Thermal-throttled measurements are not representative of the cold tok/s numbers.
Symptom: works on simulator, crashes on device. The simulator runs MLX on Apple Silicon Mac hardware; on-device runs on iPhone NPU. Memory mapping and quant kernel coverage differ. Always test on physical device before committing architecture decisions.
Variations and alternatives
Phi-3.5 Mini variant: swap the model bundle for Phi-3.5 Mini. Slightly heavier (3.8B vs 3B) but better instruction-following polish. MIT license simplifies attribution.
Multilingual variant: swap to Qwen 2.5 3B for stronger non-English support. Note Qwen License requires attribution.
iPad-first deployment: target iPad M4. 120 GB/s memory bandwidth (vs 60 on phones) sustains higher tok/s under load.
Cross-platform alternative: if you need Android too, see the Android on-device AI stack. MLX Swift is iOS-only.
Who should avoid this stack
- Cross-platform apps — MLX Swift is iOS-only. Use MLC LLM if you need a shared toolchain.
- 7B+ model requirements — iPhone RAM doesn't fit. Cloud or device-as-thin-client is the right answer.
- Continuous-use workloads (live tutoring, real-time translation): thermal throttling will visibly degrade the experience.
- App Store-cellular-install-critical apps — 1.9 GB model bundle won't install over cellular without on-demand resources rework.
Going deeper
- Android on-device AI stack — the cross-platform sibling with the same architectural pattern.
- Apple A18 Pro hardware page — SoC specs + NPU capabilities.
- MLX Swift operational review — runtime depth.
- Request an iPhone tok/s benchmark — see the queue and reproduce protocol.