Mobile + edge AI benchmark gap report
The honest answer to “can I run AI on my phone / NPU / Jetson?” — what we have measurements for, what we've queued, and which devices the catalog doesn't even cover yet. We don't fake mobile numbers. If a device has no measured tok/s, this page says so.
Devices we have measurements for
Mobile / edge hardware rows in the catalog. Benchmark count comes from the editorial benchmark table. A device with zero benchmarks is in the catalog because the row is editorially curated, but we have no measured tok/s for it — it's an open measurement target.
Don't see your device? Request a mobile benchmark.
The mobile + edge hardware ecosystem moves fast. If the measurement you need isn't in the roadmap below, request it explicitly — editorial reviews and accepts specific, well-motivated requests within a week.
Pending mobile + edge benchmark opportunities
Pulled from the public benchmark roadmap and filtered to mobile / edge runtimes + hardware. These are the combos we'd like measured next. If you have the rig, click “I can measure this” to land on a prefilled submission form. A sketch of the decode-throughput measurement we expect follows the list.
- Medium priority · target: 10-22 tok/s decode (Adreno GPU path)
Snapdragon 8 Elite + Llama 3.2 3B (MLC LLM, GPU)
Llama 3.2 3B Instruct on Qualcomm Snapdragon 8 Elite · MLC LLM · Q4_K_M (TVM-quant)
Why we want this: MLC LLM is cross-platform and the most-deployed mobile LLM runtime. The Adreno-vs-Hexagon comparison on the same SoC determines whether NPU lock-in is worth the throughput gain.
- Medium priority · target: 20-35 tok/s decode (cold); throttle curve TBD
iPad M4 + Qwen 2.5 3B (MLX, sustained-load curve)
Qwen 2.5 3B Instruct on Apple M4 (iPad Pro) · MLX-LM · MLX-4bit
Why we want this: Tablet-class on-device viability for journaling / long-form summarization. Needs the throttle curve, not just peak tok/s.
- Medium priority · target: 18-35 tok/s decode (estimate)
Intel Lunar Lake + Phi-3.5 Mini (OpenVINO NPU)
Phi-3.5 Mini Instruct on Intel Core Ultra 7 258V (Lunar Lake) · ONNX Runtime Mobile · INT8
Why we want this: Lunar Lake is the Intel reference for Copilot+ PCs. The comparison vs the Snapdragon X NPU determines which Copilot+ chip operators should prefer for on-device LLMs.
- Medium priority · target: 20-40 tok/s decode (estimate)
Snapdragon X Elite + Phi-3.5 Mini (ONNX Runtime + DirectML NPU)
Phi-3.5 Mini Instruct on Qualcomm Snapdragon X Elite · ONNX Runtime Mobile · INT8
Why we want this: The Copilot+ PC ecosystem is rapidly expanding. The Snapdragon X NPU vs Lunar Lake NPU vs CPU-fallback comparison is the operator decision for Windows on-device deployments.
- High priority · target: 12-25 tok/s decode (Hexagon NPU, estimate)
Snapdragon 8 Elite + Phi-3.5 Mini (Qualcomm AI Hub, INT8)
Phi-3.5 Mini Instruct on Qualcomm Snapdragon 8 Elite · Qualcomm AI Hub · INT8
Why we want this: Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android-on-device guidance.
- High priority · target: 8-15 tok/s decode (estimate, sustained)
iPhone 16 Pro + Llama 3.2 3B (MLX Swift, INT4)
Llama 3.2 3B Instruct on Apple A18 Pro · MLX Swift · MLX-INT4
Why we want this: Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery-drain + thermal-throttling curve answers “can I ship this in my app?”
- Medium priority · target: 4-9 tok/s decode (Thunderbolt 5 inter-node)
4× Mac Mini M4 Pro Exo cluster + Llama 3.1 70B (MLX-4bit)
Llama 3.1 70B Instruct on — · Exo · MLX-4bit
Why we want this: Multi-Mac Exo clusters are an emerging pattern. The cluster-vs-single-Mac-Studio comparison establishes whether the cluster is ever the right answer outside extreme memory targets.
- High priority · target: 8-14 tok/s decode (single stream)
Mac Studio M3 Ultra 192GB + Qwen 3.5 235B-A17B (MLX-4bit)
Qwen 3.5 235B-A17B (MoE) on — · MLX-LM · MLX-4bit
Why we want this: The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. The editorial estimate is 25-30% of NVIDIA throughput; a measured value would close the loop.
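For reference, the decode tok/s these targets describe is cheap to produce once the rig exists. Here is a minimal sketch for the Apple-silicon entries, assuming the mlx-lm Python package; the repo id and prompt are illustrative, and mlx-lm's verbose output (which separates prompt from generation throughput) is the number the targets above quote.

```python
# Minimal decode-throughput measurement via mlx-lm (assumptions: mlx-lm is
# installed and the 4-bit community conversion below exists; the repo id is
# illustrative).
import time
from mlx_lm import load, generate

model, tokenizer = load("mlx-community/Llama-3.2-3B-Instruct-4bit")

prompt = "Summarize the tradeoffs of on-device LLM inference in five bullets."
start = time.perf_counter()
text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
elapsed = time.perf_counter() - start

# Coarse end-to-end figure: it folds prefill into decode. Passing
# verbose=True to generate() prints mlx-lm's own prompt/generation split,
# which is the decode number the roadmap targets refer to.
generated_tokens = len(tokenizer.encode(text))
print(f"~{generated_tokens / elapsed:.1f} tok/s end-to-end over {elapsed:.2f}s")
```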
Devices we want measurements for but don't have catalog rows for
Hand-curated editorial opinion — devices that matter to mobile / edge AI operators but where we either don't have a complete hardware row, don't have measurements, or both. These aren't pulled from the database; they reflect the editorial judgement of where the gaps hurt operators most. We link to manufacturer pages where useful; we don't reproduce specs we haven't verified.
- Editorial · uncovered
iPhone 15 Pro / Apple Neural Engine (A17 Pro / A18 Pro)
App-bundled local LLM inference on iOS is the most-asked mobile question we get. The Neural Engine is exposed through Core ML and MLX Swift — but there's no first-party tok/s benchmark from Apple, and we haven't independently measured it.
Runtime ecosystem: Core ML · MLX Swift · ExecuTorch (Metal/CoreML backends)
- Editorial · uncovered
Snapdragon X Elite (X1E-84-100)
Copilot+ PC reference NPU with 45 TOPS. Operators ask whether the X Elite is a viable Llama 3.2 / Phi 3.5 host. The catalog has the SoC row but no measured tok/s yet.
Runtime ecosystem: ONNX Runtime Mobile · Qualcomm AI Hub · DirectML · IPEX-LLM (CPU path)
- Editorial · uncovered
Snapdragon 8 Elite (mobile flagship)
The 2024-2025 Android flagship SoC with Hexagon NPU. Qualcomm AI Hub publishes vendor numbers; we want operator-reproduced tok/s for Phi 3.5 Mini and Llama 3.2 1B/3B on shipping handsets.
Runtime ecosystem: Qualcomm AI Hub · MLC LLM (Adreno GPU) · ONNX Runtime Mobile
- Editorial · uncovered
Intel Lunar Lake NPU (Core Ultra 200V)
48 TOPS NPU shipping in late-2024 / 2025 thin-and-lights. OpenVINO and IPEX-LLM both target it. We have a Lunar Lake hardware row but no measured local-LLM tok/s yet — vendor numbers exist but haven't been reproduced.
Runtime ecosystem: OpenVINO · IPEX-LLM · ONNX Runtime Mobile · DirectML
- Editorial · uncovered
AMD Ryzen AI 300-series NPU (Strix Point)
50 TOPS XDNA 2 NPU. Ryzen AI 9 HX 370 is in our catalog as a row, but the NPU path through ONNX Runtime + AMD's Ryzen AI software is poorly documented relative to Intel/Qualcomm. Operators want measurements.
Runtime ecosystem: AMD Ryzen AI software · ONNX Runtime · DirectML
- Editorial · uncovered
NVIDIA Jetson Orin Nano / AGX Orin
The reference edge AI dev kit family. AGX Orin (275 TOPS) is the production target for robotics + edge inference; Orin Nano (40 TOPS) is the hobbyist tier. CUDA + TensorRT-LLM both target Jetson, but we have no Jetson rows in the catalog yet.
Runtime ecosystem: TensorRT-LLM · llama.cpp (CUDA) · vLLM (limited) · NVIDIA NIM Edge
- Editorial · uncovered
Raspberry Pi 5 + AI Hat+ (Hailo-8L)
26 TOPS Hailo NPU on a Pi-5-shaped expansion board. Hugely popular for edge / IoT operators. LLM support is limited — Hailo's compiler targets vision models more than transformer decoders — but operator demand is real.
Runtime ecosystem: Hailo Runtime · llama.cpp (CPU on Pi 5) · ONNX Runtime
- Editorial · uncovered
Google Coral TPU (USB / M.2)
Edge TPU at 4 TOPS INT8. Predates the current LLM wave; designed for vision / classification, not transformer decode. We list it for honesty: operators repeatedly ask, and the honest answer is "not a viable LLM host today."
Runtime ecosystem: TensorFlow Lite · Edge TPU compiler
Mobile-edge requests open for claiming
Operators have asked for these measurements via /benchmarks/request and editorial accepted them. Each row is open for any operator with the matching rig to claim and measure. The filter is hardware-slug exact-match against a curated mobile/edge list — if your device fits one of these, claiming costs nothing and the measurement lands on the public roadmap.
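For transparency, the filter is nothing fancier than exact set membership. A sketch of its shape in Python, with illustrative slugs and field names rather than the site's actual schema:

```python
# Exact-match hardware-slug filter (slugs and schema are illustrative, not
# the site's real ones). Exact match only: a near-miss slug won't surface.
MOBILE_EDGE_SLUGS = {
    "snapdragon-8-elite",
    "snapdragon-x-elite",
    "apple-a18-pro",
    "intel-core-ultra-7-258v",
    "amd-ryzen-ai-9-hx-370",
}

def mobile_edge_requests(open_requests: list[dict]) -> list[dict]:
    """Keep only requests whose hardware slug is on the curated mobile/edge list."""
    return [r for r in open_requests if r.get("hardware_slug") in MOBILE_EDGE_SLUGS]
```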
No mobile-edge requests open for claiming right now.
That doesn't mean the gap is closed — most mobile hardware in the editorial list above has no request row yet either. Be the first to request one via the request form.
Mobile-friendly workflows worth pairing with on-device hardware
Editorial guidance — not pulled from the registry. Hand-curated pairings of workflow + silicon + runtime that we've seen actually work in 2026, with an honest one-line rationale. If you're looking for a starting point on a phone or laptop NPU, these are the shapes that ship.
- Editorial · workflow
Voice transcription
Apple M-series + iPhone (mlx-swift on iOS)
Runtime: Whisper.cpp / WhisperKit. Whisper-large-v3-turbo runs comfortably on M2/M3/M4, transcribing faster than real time; mlx-swift exposes the same model to iOS apps through Core ML or MLX directly. The decoded transcript never leaves the device. A minimal transcription sketch follows this list.
- Editorial · workflow
On-device chat assistant
Snapdragon X Elite (Copilot+ PC)
Runtime: ONNX Runtime Mobile · llama.cpp (CPU path). Phi-3.5 Mini (3.8B) at INT4 fits comfortably in 16GB of unified memory and produces serviceable chat output without invoking the GPU. The Hexagon NPU path through ONNX Runtime is the speed-tilted option but is less reproducible; operator-reported numbers vary widely with driver versions. The CPU path is sketched after this list.
- Editorial · workflow
Mobile RAG over personal docs
iPhone 15 Pro / iPhone 16 (A17 Pro / A18 Pro)
Runtime: Llama 3.2 3B via mlx-swift or ExecuTorch. A 3B-parameter model paired with a small embedding model (e.g. all-MiniLM-L6-v2 ported to Core ML) is enough to answer questions over a personal Notes / iMessage corpus. The vector index can live in SQLite via sqlite-vss; the entire pipeline runs offline. The retrieval shape is sketched after this list.
- Editorial · workflow
Edge speech assistant
NVIDIA Jetson Orin Nano (40 TOPS)
Runtime: Whisper.cpp + llama.cpp (CUDA). Pair Whisper-small for STT with a 1-3B LLM at INT4 for response generation; the full chain is sketched after this list. Latency is acceptable for kiosk / robotics use cases, and the Orin Nano is the sweet-spot dev kit for shipping. AGX Orin handles 7-8B LLMs comfortably for higher-quality responses.
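To make the pairings concrete, here are minimal sketches of each shape, all in Python so they read uniformly. First, transcription: this uses the mlx-whisper package as a Python stand-in for the Whisper.cpp / WhisperKit paths named above; the model repo id is illustrative.

```python
# On-device transcription on Apple silicon (assumption: mlx-whisper is
# installed; the model repo id is illustrative).
import mlx_whisper

result = mlx_whisper.transcribe(
    "memo.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
print(result["text"])  # the transcript never leaves the device
```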
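Next, the chat workflow's reproducible CPU path, sketched with llama-cpp-python; the GGUF filename is an assumption, not a file we ship.

```python
# CPU-path chat with a quantized Phi-3.5 Mini GGUF via llama-cpp-python
# (assumption: the GGUF exists locally under this illustrative name).
from llama_cpp import Llama

llm = Llama(model_path="Phi-3.5-mini-instruct-Q4_K_M.gguf", n_ctx=4096)
out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Draft a two-line standup update."}],
    max_tokens=200,
)
print(out["choices"][0]["message"]["content"])
```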
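Third, the mobile RAG workflow's retrieval step, sketched as a brute-force cosine scan over embeddings stored in SQLite. That is adequate at personal-corpus scale; on device, sqlite-vss replaces the scan with an indexed lookup. Schema and embedding dimensions are illustrative.

```python
# Offline retrieval over a personal corpus: embeddings stored as float32
# blobs in SQLite, scored with a brute-force cosine scan (illustrative
# schema; sqlite-vss would replace the scan on device).
import sqlite3
import numpy as np

db = sqlite3.connect("notes.db")
db.execute("CREATE TABLE IF NOT EXISTS chunks (id INTEGER PRIMARY KEY, text TEXT, emb BLOB)")

def top_k(query_emb: np.ndarray, k: int = 4) -> list[str]:
    scored = []
    for text, blob in db.execute("SELECT text, emb FROM chunks"):
        emb = np.frombuffer(blob, dtype=np.float32)
        cos = float(query_emb @ emb) / (np.linalg.norm(query_emb) * np.linalg.norm(emb))
        scored.append((cos, text))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]
```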
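Finally, the edge speech assistant is the previous two pieces chained: whisper.cpp's CLI for STT, its transcript fed to the same llama.cpp chat call. Treat the binary name, model filenames, and paths as assumptions for your checkout (the CLI is whisper-cli in recent builds, main in older ones).

```python
# Edge speech-assistant loop: whisper.cpp CLI for STT feeding a small GGUF
# model via llama-cpp-python (assumptions: whisper.cpp and llama.cpp built
# with CUDA on the Jetson; filenames are illustrative).
import subprocess
from llama_cpp import Llama

llm = Llama(model_path="llama-3.2-1b-instruct-Q4_K_M.gguf", n_gpu_layers=-1)

def answer(wav_path: str) -> str:
    # -otxt -of reply writes the transcript to reply.txt
    subprocess.run(
        ["./whisper-cli", "-m", "ggml-small.bin", "-f", wav_path, "-otxt", "-of", "reply"],
        check=True,
    )
    with open("reply.txt") as f:
        transcript = f.read().strip()
    out = llm.create_chat_completion(
        messages=[{"role": "user", "content": transcript}],
        max_tokens=128,
    )
    return out["choices"][0]["message"]["content"]
```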
OS / NPU / runtime coverage matrix
Honest status for the mobile + edge silicon × runtime combinations operators ask about most. “Covered” means the catalog has measurements an operator could reproduce. “Partial” means the runtime path exists but isn't fully exercised in our corpus. “Uncovered” means the path is technically supported but we have no measured tok/s yet. “Not supported” means the silicon isn't structurally a viable host for the workload in 2026 — we say so plainly rather than imply otherwise.
| Combo | Status | Note |
|---|---|---|
| Snapdragon X Elite + ONNX Runtime + Phi 3.5 | Uncovered | Hardware row exists; ONNX Runtime path through Hexagon NPU is documented; we have no measured tok/s yet. |
| Snapdragon X Elite + ExecuTorch | Partial | ExecuTorch's QNN backend targets the Hexagon NPU but documentation is sparse. Vendor-published numbers exist; operator reproduction is rare. |
| Apple Neural Engine + decoder-only LLMs (any runtime) | Not supported | ANE in 2026 is structurally not a viable LLM accelerator — Core ML's transformer compiler covers encoders + small decoders only; production LLMs run on the GPU through MLX or llama.cpp Metal instead. |
| Apple Silicon GPU + MLX + Llama 3.2 | Covered | MLX-LM is the production path on macOS / iPadOS; small models also run on iPhone via mlx-swift. We have measurements on M-series. |
| Intel Lunar Lake NPU + ONNX Runtime + Llama 3 | Uncovered | NPU path through OpenVINO and IPEX-LLM is documented; Lunar Lake 258V hardware row exists; we have no measured local-LLM tok/s yet. |
| AMD Ryzen AI NPU + DirectML + Phi | Partial | DirectML reaches the XDNA 2 NPU on Windows; AMD's Ryzen AI software stack works for Phi 3.5 but operator-reproduced numbers are scarce relative to Intel / Qualcomm. |
| NVIDIA Jetson Orin Nano + llama.cpp (CUDA) | Partial | Path is well-supported in the upstream llama.cpp project; Jetson hardware rows aren't yet in our catalog, so measurements live in operator threads rather than here. |
| Hailo-8L (Pi 5 AI Hat+) + LLM decode | Not supported | Hailo's compiler is vision-tilted; transformer decoder support is experimental and not production-ready in 2026. Treat the AI Hat+ as a vision accelerator, not an LLM host. |
Where to go next
Every model+hardware combo we want measured next, mobile and otherwise.