Mobile + edge AI benchmark gap report
The honest answer to “can I run AI on my phone / NPU / Jetson?” — what we have measurements for, what we've queued, and which devices the catalog doesn't even cover yet. We don't fake mobile numbers. If a device has no measured tok/s, this page says so.
Devices we have measurements for
Mobile / edge hardware rows in the catalog. Benchmark count comes from the editorial benchmark table. A device with zero benchmarks is in the catalog because the row is editorially curated, but we have no measured tok/s for it — it's an open measurement target.
Don't see your device? Request a mobile benchmark.
The mobile + edge hardware ecosystem moves fast. If the measurement you need isn't in the roadmap below, request it explicitly — editorial reviews and accepts specific, well-motivated requests within a week.
Pending mobile + edge benchmark opportunities
Pulled from the public benchmark roadmap and filtered to mobile / edge runtimes + hardware. These are the combos we'd like measured next. If you have the rig, click “I can measure this” to land on a prefilled submission form. A sketch of the decode-throughput measurement we expect follows the list.
- Medium priority · target: 10-22 tok/s decode (Adreno GPU path)
Snapdragon 8 Elite + Llama 3.2 3B (MLC LLM, GPU)
Llama 3.2 3B Instruct on Qualcomm Snapdragon 8 Elite · MLC LLM · Q4_K_M (TVM-quant)
Why we want this: MLC LLM is cross-platform and the most-deployed mobile LLM runtime. The Adreno-vs-Hexagon comparison on the same SoC determines whether NPU lock-in is worth the throughput gain.
- Medium priority · target: 20-35 tok/s decode (cold); throttle curve TBD
iPad M4 + Qwen 2.5 3B (MLX, sustained-load curve)
Qwen 2.5 3B Instruct on Apple M4 (iPad Pro) · MLX-LM · MLX-4bit
Why we want this: Tablet-class on-device viability for journaling / long-form summarization. Needs the throttle curve, not just peak tok/s.
- Medium priority · target: 18-35 tok/s decode (estimate)
Intel Lunar Lake + Phi-3.5 Mini (OpenVINO NPU)
Phi-3.5 Mini Instruct on Intel Core Ultra 7 258V (Lunar Lake) · ONNX Runtime Mobile · INT8
Why we want this: Lunar Lake is the Intel reference for Copilot+ PCs. The comparison vs the Snapdragon X NPU determines which Copilot+ chip operators should prefer for on-device LLMs.
- Medium priority · target: 20-40 tok/s decode (estimate)
Snapdragon X Elite + Phi-3.5 Mini (ONNX Runtime + DirectML NPU)
Phi-3.5 Mini Instruct on Qualcomm Snapdragon X Elite · ONNX Runtime Mobile · INT8
Why we want this: The Copilot+ PC ecosystem is rapidly expanding. The Snapdragon X NPU vs Lunar Lake NPU vs CPU-fallback comparison is the operator decision for Windows on-device deployments.
- High priority · target: 12-25 tok/s decode (Hexagon NPU, estimate)
Snapdragon 8 Elite + Phi-3.5 Mini (Qualcomm AI Hub, INT8)
Phi-3.5 Mini Instruct on Qualcomm Snapdragon 8 Elite · Qualcomm AI Hub · INT8
Why we want this: Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android-on-device guidance.
- High priority · target: 8-15 tok/s decode (estimate, sustained)
iPhone 16 Pro + Llama 3.2 3B (MLX Swift, INT4)
Llama 3.2 3B Instruct on Apple A18 Pro · MLX Swift · MLX-INT4
Why we want this: Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery-drain + thermal-throttling curve answers “can I ship this in my app?”
- Medium priority · target: 4-9 tok/s decode (Thunderbolt 5 inter-node)
4× Mac Mini M4 Pro Exo cluster + Llama 3.1 70B (MLX-4bit)
Llama 3.1 70B Instruct on — · Exo · MLX-4bit
Why we want this: Multi-Mac Exo clusters are an emerging pattern. The cluster-vs-single-Mac-Studio comparison establishes whether the cluster is ever the right answer outside extreme memory targets.
- High priority · target: 8-14 tok/s decode (single stream)
Mac Studio M3 Ultra 192GB + Qwen 3.5 235B-A17B (MLX-4bit)
Qwen 3.5 235B-A17B (MoE) on — · MLX-LM · MLX-4bit
Why we want this: The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. The editorial estimate is 25-30% of NVIDIA throughput; a measured value would close the loop.
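For reference, the decode tok/s these targets describe is cheap to produce once the rig exists. Here is a minimal sketch for the Apple-silicon entries, assuming the mlx-lm Python package; the repo id and prompt are illustrative, and mlx-lm's verbose output (which separates prompt from generation throughput) is the number the targets above quote.

```python
# Minimal decode-throughput measurement via mlx-lm (assumptions: mlx-lm is
# installed and the 4-bit community conversion below exists; the repo id is
# illustrative).
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

prompt = "Summarize the tradeoffs of on-device LLM inference in five bullets."
start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Coarse end-to-end figure: it folds prefill into decode. Passing
# verbose=True to generate() prints mlx-lm's own prompt/generation split,
# which is the decode number the roadmap targets refer to.
generated_tokens = len(tokenizer.encode(text))
print(f"~{generated_tokens / elapsed:.1f} tok/s end-to-end over {elapsed:.2f}s")
```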
Devices we want measurements for but don't have catalog rows for
Hand-curated editorial opinion — devices that matter to mobile / edge AI operators but where we either don't have a complete hardware row, don't have measurements, or both. These aren't pulled from the database; they reflect the editorial judgement of where the gaps hurt operators most. We link to manufacturer pages where useful; we don't reproduce specs we haven't verified.
- Editorial · uncovered
iPhone 15 Pro / Apple Neural Engine (A17 Pro / A18 Pro)
App-bundled local LLM inference on iOS is the most-asked mobile question we get. The Neural Engine is exposed through Core ML and MLX Swift — but there's no first-party tok/s benchmark from Apple, and we haven't independently measured it.
Runtime ecosystem: Core ML · MLX Swift · ExecuTorch (Metal/CoreML backends)
- Editorial · uncovered
Snapdragon X Elite (X1E-84-100)
Copilot+ PC reference NPU with 45 TOPS. Operators ask whether the X Elite is a viable Llama 3.2 / Phi 3.5 host. The catalog has the SoC row but no measured tok/s yet.
Runtime ecosystem: ONNX Runtime Mobile · Qualcomm AI Hub · DirectML · IPEX-LLM (CPU path)
- Editorial · uncovered
Snapdragon 8 Elite (mobile flagship)
The 2024-2025 Android flagship SoC with Hexagon NPU. Qualcomm AI Hub publishes vendor numbers; we want operator-reproduced tok/s for Phi 3.5 Mini and Llama 3.2 1B/3B on shipping handsets.
Runtime ecosystem: Qualcomm AI Hub · MLC LLM (Adreno GPU) · ONNX Runtime Mobile
- Editorial · uncovered
Intel Lunar Lake NPU (Core Ultra 200V)
48 TOPS NPU shipping in late-2024 / 2025 thin-and-lights. OpenVINO and IPEX-LLM both target it. We have a Lunar Lake hardware row but no measured local-LLM tok/s yet — vendor numbers exist but haven't been reproduced.
Runtime ecosystem: OpenVINO · IPEX-LLM · ONNX Runtime Mobile · DirectML
- Editorial · uncovered
AMD Ryzen AI 300-series NPU (Strix Point)
50 TOPS XDNA 2 NPU. Ryzen AI 9 HX 370 is in our catalog as a row, but the NPU path through ONNX Runtime + AMD's Ryzen AI software is poorly documented relative to Intel/Qualcomm. Operators want measurements.
Runtime ecosystem: AMD Ryzen AI software · ONNX Runtime · DirectML
- Editorial · uncovered
NVIDIA Jetson Orin Nano / AGX Orin
The reference edge AI dev kit family. AGX Orin (275 TOPS) is the production target for robotics + edge inference; Orin Nano (40 TOPS) is the hobbyist tier. CUDA + TensorRT-LLM both target Jetson, but we have no Jetson rows in the catalog yet.
Runtime ecosystem: TensorRT-LLM · llama.cpp (CUDA) · vLLM (limited) · NVIDIA NIM Edge
- Editorial · uncovered
Raspberry Pi 5 + AI Hat+ (Hailo-8L)
26 TOPS Hailo NPU on a Pi-5-shaped expansion board. Hugely popular for edge / IoT operators. LLM support is limited — Hailo's compiler targets vision models more than transformer decoders — but operator demand is real.
Runtime ecosystem: Hailo Runtime · llama.cpp (CPU on Pi 5) · ONNX Runtime
- Editorial · uncovered
Google Coral TPU (USB / M.2)
Edge TPU at 4 TOPS INT8. Predates the current LLM wave; designed for vision / classification, not transformer decode. We list it for honesty: operators repeatedly ask, and the honest answer is "not a viable LLM host today."
Runtime ecosystem: TensorFlow Lite · Edge TPU compiler
Mobile-edge requests open for claiming
Operators have asked for these measurements via /benchmarks/request and editorial accepted them. Each row is open for any operator with the matching rig to claim and measure. The filter is hardware-slug exact-match against a curated mobile/edge list — if your device fits one of these, claiming costs nothing and the measurement lands on the public roadmap.
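For transparency, the filter is nothing fancier than exact set membership. A sketch of its shape in Python, with illustrative slugs and field names rather than the site's actual schema:

```python
# Exact-match hardware-slug filter (slugs and schema are illustrative, not
# the site's real ones). Exact match only: a near-miss slug won't surface.
MOBILE_EDGE_SLUGS = {
    "snapdragon-8-elite",
    "snapdragon-x-elite",
    "apple-a18-pro",
    "intel-core-ultra-7-258v",
    "amd-ryzen-ai-9-hx-370",
}

def mobile_edge_requests(open_requests: list[dict]) -> list[dict]:
    """Keep only requests whose hardware slug is on the curated mobile/edge list."""
    return [r for r in open_requests if r.get("hardware_slug") in MOBILE_EDGE_SLUGS]
```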
No mobile-edge requests open for claiming right now.
That doesn't mean the gap is closed — most mobile hardware in the editorial list above has no request row yet either. Be the first to request one via the request form.
Mobile-friendly workflows worth pairing with on-device hardware
Editorial guidance — not pulled from the registry. Hand-curated pairings of workflow + silicon + runtime that we've seen actually work in 2026, with an honest one-line rationale. If you're looking for a starting point on a phone or laptop NPU, these are the shapes that ship.
- Editorial · workflow
Voice transcription
Apple M-series + iPhone (mlx-swift on iOS)
Runtime: Whisper.cpp / WhisperKit. Whisper-large-v3-turbo runs comfortably on M2/M3/M4, transcribing faster than real time; mlx-swift exposes the same model to iOS apps through Core ML or MLX directly. The decoded transcript never leaves the device. A minimal transcription sketch follows this list.
- Editorial · workflow
On-device chat assistant
Snapdragon X Elite (Copilot+ PC)
Runtime: ONNX Runtime Mobile · llama.cpp (CPU path). Phi-3.5 Mini (3.8B) at INT4 fits comfortably in 16GB of unified memory and produces serviceable chat output without invoking the GPU. The Hexagon NPU path through ONNX Runtime is the speed-tilted option but is less reproducible; operator-reported numbers vary widely with driver versions. The CPU path is sketched after this list.
- Editorial · workflow
Mobile RAG over personal docs
iPhone 15 Pro / iPhone 16 (A17 Pro / A18 Pro)
Runtime: Llama 3.2 3B via mlx-swift or ExecuTorch. A 3B-parameter model paired with a small embedding model (e.g. all-MiniLM-L6-v2 ported to Core ML) is enough to answer questions over a personal Notes / iMessage corpus. The vector index can live in SQLite via sqlite-vss; the entire pipeline runs offline. The retrieval shape is sketched after this list.
- Editorial · workflow
Edge speech assistant
NVIDIA Jetson Orin Nano (40 TOPS)
Runtime: Whisper.cpp + llama.cpp (CUDA). Pair Whisper-small for STT with a 1-3B LLM at INT4 for response generation; the full chain is sketched after this list. Latency is acceptable for kiosk / robotics use cases, and the Orin Nano is the sweet-spot dev kit for shipping. AGX Orin handles 7-8B LLMs comfortably for higher-quality responses.
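To make the pairings concrete, here are minimal sketches of each shape, all in Python so they read uniformly. First, transcription: this uses the mlx-whisper package as a Python stand-in for the Whisper.cpp / WhisperKit paths named above; the model repo id is illustrative.

```python
# On-device transcription on Apple silicon (assumption: mlx-whisper is
# installed; the model repo id is illustrative).
import mlx_whisper

result = mlx_whisper.transcribe(
    "memo.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
print(result["text"])  # the transcript never leaves the device
```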
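Next, the chat workflow's reproducible CPU path, sketched with llama-cpp-python; the GGUF filename is an assumption, not a file we ship.

```python
# CPU-path chat with a quantized Phi-3.5 Mini GGUF via llama-cpp-python
# (assumption: the GGUF exists locally under this illustrative name).
from llama_cpp import Llama

llm = Llama(model_path="Phi-3.5-mini-instruct-Q4_K_M.gguf", n_ctx=4096)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Draft a two-line standup update."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```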
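Third, the mobile RAG workflow's retrieval step, sketched as a brute-force cosine scan over embeddings stored in SQLite. That is adequate at personal-corpus scale; on device, sqlite-vss replaces the scan with an indexed lookup. Schema and embedding dimensions are illustrative.

```python
# Offline retrieval over a personal corpus: embeddings stored as float32
# blobs in SQLite, scored with a brute-force cosine scan (illustrative
# schema; sqlite-vss would replace the scan on device).
import sqlite3
import numpy as np

db = sqlite3.connect("notes.db")
db.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, text TEXT, emb BLOB)")

def top_k(query_emb: np.ndarray, k: int = 4) -> list[str]:
    scored = []
    for text, blob in db.execute("SELECT text, emb FROM chunks"):
        emb = np.frombuffer(blob, dtype=np.float32)
        cos = float(query_emb @ emb) / (np.linalg.norm(query_emb) * np.linalg.norm(emb))
        scored.append((cos, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```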
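Finally, the edge speech assistant is the previous two pieces chained: whisper.cpp's CLI for STT, its transcript fed to the same llama.cpp chat call. Treat the binary name, model filenames, and paths as assumptions for your checkout (the CLI is whisper-cli in recent builds, main in older ones).

```python
# Edge speech-assistant loop: whisper.cpp CLI for STT feeding a small GGUF
# model via llama-cpp-python (assumptions: whisper.cpp and llama.cpp built
# with CUDA on the Jetson; filenames are illustrative).
import subprocess
from llama_cpp import Llama

llm = Llama(model_path="llama-3.2-1b-instruct-Q4_K_M.gguf", n_gpu_layers=-1)

def answer(wav_path: str) -> str:
    # -otxt -of reply writes the transcript to reply.txt
    subprocess.run(
        ["./whisper-cli", "-m", "ggml-small.bin", "-f", wav_path, "-otxt", "-of", "reply"],
        check=True,
    )
    with open("reply.txt") as f:
        transcript = f.read().strip()
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": transcript}],
        max_tokens=128,
    )
    return out["choices"][0]["message"]["content"]
```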
OS / NPU / runtime coverage matrix
Honest status for the mobile + edge silicon × runtime combinations operators ask about most. “Covered” means the catalog has measurements an operator could reproduce. “Partial” means the runtime path exists but isn't fully exercised in our corpus. “Uncovered” means the path is technically supported but we have no measured tok/s yet. “Not supported” means the silicon isn't structurally a viable host for the workload in 2026 — we say so plainly rather than imply otherwise.
| Combo | Status | Note |
|---|---|---|
| Snapdragon X Elite + ONNX Runtime + Phi 3.5 | Uncovered | Hardware row exists; ONNX Runtime path through Hexagon NPU is documented; we have no measured tok/s yet. |
| Snapdragon X Elite + ExecuTorch | Partial | ExecuTorch's QNN backend targets the Hexagon NPU but documentation is sparse. Vendor-published numbers exist; operator reproduction is rare. |
| Apple Neural Engine + decoder-only LLMs (any runtime) | Not supported | ANE in 2026 is structurally not a viable LLM accelerator — Core ML's transformer compiler covers encoders + small decoders only; production LLMs run on the GPU through MLX or llama.cpp Metal instead. |
| Apple Silicon GPU + MLX + Llama 3.2 | Covered | MLX-LM is the production path on macOS / iPadOS; small models also run on iPhone via mlx-swift. We have measurements on M-series. |
| Intel Lunar Lake NPU + ONNX Runtime + Llama 3 | Uncovered | NPU path through OpenVINO and IPEX-LLM is documented; Lunar Lake 258V hardware row exists; we have no measured local-LLM tok/s yet. |
| AMD Ryzen AI NPU + DirectML + Phi | Partial | DirectML reaches the XDNA 2 NPU on Windows; AMD's Ryzen AI software stack works for Phi 3.5 but operator-reproduced numbers are scarce relative to Intel / Qualcomm. |
| NVIDIA Jetson Orin Nano + llama.cpp (CUDA) | Partial | Path is well-supported in the upstream llama.cpp project; Jetson hardware rows aren't yet in our catalog, so measurements live in operator threads rather than here. |
| Hailo-8L (Pi 5 AI Hat+) + LLM decode | Not supported | Hailo's compiler is vision-tilted; transformer decoder support is experimental and not production-ready in 2026. Treat the AI Hat+ as a vision accelerator, not an LLM host. |
Where to go next
Every model+hardware combo we want measured next, mobile and otherwise.