
Mobile + edge AI benchmark gap report

The honest answer to “can I run AI on my phone / NPU / Jetson?” — what we have measurements for, what we've queued, and which devices the catalog doesn't even cover yet. We don't fake mobile numbers. If a device has no measured tok/s, this page says so.

By Fredoline Eruo · Last reviewed 2026-05-07

Mobile / edge device coverage in the catalog

Mobile / edge hardware rows in the catalog. Benchmark count comes from the editorial benchmark table. A device with zero benchmarks is in the catalog because the row is editorially curated, but we have no measured tok/s for it — it's an open measurement target.

Catalog rows: 19 · With benchmarks: 0 · Uncovered: 19 · Mobile opportunities: 8
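
The counters above are simple arithmetic over catalog rows: a row with zero benchmark entries counts as uncovered. A minimal sketch of that logic in Python; the row shape and slugs are illustrative assumptions, not the site's actual schema.

```python
# Hypothetical row shape and slugs; the real catalog is a curated database.
catalog = [
    {"slug": "amd-ryzen-ai-9-hx-370", "kind": "pc-npu",     "benchmarks": 0},
    {"slug": "apple-a18-pro",         "kind": "mobile-soc", "benchmarks": 0},
    {"slug": "snapdragon-8-elite",    "kind": "mobile-soc", "benchmarks": 0},
]

rows = len(catalog)
with_bench = sum(1 for r in catalog if r["benchmarks"] > 0)
print(f"catalog rows: {rows}, with benchmarks: {with_bench}, "
      f"uncovered: {rows - with_bench}")
```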
  • AMD Ryzen AI 9 HX 370 (Strix Point) · pc-npu · amd · NPU · No measurements yet · 50 INT8 TOPS · I can measure this →
  • Apple A17 Pro · mobile-soc · apple · NPU · No measurements yet · 35 INT8 TOPS · I can measure this →
  • Apple A18 Pro · mobile-soc · apple · NPU · No measurements yet · 38 INT8 TOPS · I can measure this →
  • Apple M1 Max · soc · apple · No measurements yet · I can measure this →
  • Apple M1 Ultra · soc · apple · No measurements yet · I can measure this →
  • Apple M2 Max · soc · apple · No measurements yet · I can measure this →
  • Apple M2 Ultra · soc · apple · No measurements yet · I can measure this →
  • Apple M3 Max · soc · apple · No measurements yet · I can measure this →
  • Apple M3 Ultra · soc · apple · No measurements yet · I can measure this →
  • Apple M4 (iPad Pro) · mobile-soc · apple · NPU · No measurements yet · 38 INT8 TOPS · I can measure this →
  • Apple M4 Max · soc · apple · No measurements yet · I can measure this →
  • Apple M4 Pro · soc · apple · No measurements yet · I can measure this →
  • Apple M4 Ultra · soc · apple · No measurements yet · I can measure this →
  • Google Tensor G4 · mobile-soc · google · NPU · No measurements yet · 35 INT8 TOPS · I can measure this →
  • Intel Core Ultra 7 258V (Lunar Lake) · pc-npu · intel · NPU · No measurements yet · 48 INT8 TOPS · I can measure this →
  • Qualcomm Snapdragon 8 Elite · mobile-soc · qualcomm · NPU · No measurements yet · 80 INT8 TOPS · I can measure this →
  • Qualcomm Snapdragon 8 Gen 3 · mobile-soc · qualcomm · NPU · No measurements yet · 45 INT8 TOPS · I can measure this →
  • Qualcomm Snapdragon X Elite · soc · qualcomm · No measurements yet · I can measure this →
  • Qualcomm Snapdragon X Plus · soc · qualcomm · No measurements yet · I can measure this →

Don't see your device? Request a mobile benchmark.

The mobile + edge hardware ecosystem moves fast. If the measurement you need isn't in the roadmap below, request it explicitly — editorial reviews and accepts specific, well-motivated requests within a week.

Request a mobile benchmark → · Full benchmark roadmap →

Pending mobile + edge benchmark opportunities

Pulled from the public benchmark roadmap and filtered to mobile / edge runtimes and hardware, these are the combos we'd like measured next. If you have the rig, click “I can measure this” to land on a prefilled submission form. A minimal measurement sketch follows the list.

  • Snapdragon 8 Elite + Llama 3.2 3B (MLC LLM, GPU) · Medium · target: 10-22 tok/s decode (Adreno GPU path)
    Llama 3.2 3B Instruct on Qualcomm Snapdragon 8 Elite · MLC LLM · Q4_K_M (TVM-quant)
    Why we want this: MLC LLM is cross-platform and the most-deployed mobile LLM runtime. The Adreno-vs-Hexagon comparison on the same SoC determines whether NPU lock-in is worth the throughput gain.
    I can measure this → · Model page · Hardware page
  • iPad M4 + Qwen 2.5 3B (MLX, sustained-load curve) · Medium · target: 20-35 tok/s decode (cold); throttle curve TBD
    Qwen 2.5 3B Instruct on Apple M4 (iPad Pro) · MLX-LM · MLX-4bit
    Why we want this: Tablet-class on-device viability for journaling / long-form summarization. Needs the throttle curve, not just peak tok/s.
    I can measure this → · Model page · Hardware page
  • Intel Lunar Lake + Phi-3.5 Mini (OpenVINO NPU) · Medium · target: 18-35 tok/s decode (estimate)
    Phi-3.5 Mini Instruct on Intel Core Ultra 7 258V (Lunar Lake) · ONNX Runtime Mobile · INT8
    Why we want this: Lunar Lake is the Intel reference for Copilot+ PCs. The comparison vs the Snapdragon X NPU determines which Copilot+ chip operators should prefer for on-device LLMs.
    I can measure this → · Model page · Hardware page
  • Snapdragon X Elite + Phi-3.5 Mini (ONNX Runtime + DirectML NPU) · Medium · target: 20-40 tok/s decode (estimate)
    Phi-3.5 Mini Instruct on Qualcomm Snapdragon X Elite · ONNX Runtime Mobile · INT8
    Why we want this: The Copilot+ PC ecosystem is rapidly expanding. The Snapdragon X NPU vs Lunar Lake NPU vs CPU-fallback comparison is the operator decision for Windows on-device deployments.
    I can measure this → · Model page · Hardware page
  • Snapdragon 8 Elite + Phi-3.5 Mini (Qualcomm AI Hub, INT8) · High · target: 12-25 tok/s decode (Hexagon NPU, estimate)
    Phi-3.5 Mini Instruct on Qualcomm Snapdragon 8 Elite · Qualcomm AI Hub · INT8
    Why we want this: Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android on-device guidance.
    I can measure this → · Model page · Hardware page
  • iPhone 16 Pro + Llama 3.2 3B (MLX Swift, INT4) · High · target: 8-15 tok/s decode (estimate, sustained)
    Llama 3.2 3B Instruct on Apple A18 Pro · MLX Swift · MLX-INT4
    Why we want this: Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery-drain + thermal-throttling curve answers “can I ship this in my app?”
    I can measure this → · Model page · Hardware page
  • 4× Mac Mini M4 Pro Exo cluster + Llama 3.1 70B (MLX-4bit) · Medium · target: 4-9 tok/s decode (Thunderbolt 5 inter-node)
    Llama 3.1 70B Instruct on — · Exo · MLX-4bit
    Why we want this: Multi-Mac Exo clusters are an emerging pattern. The cluster-vs-single-Mac-Studio comparison establishes whether the cluster is ever the right answer outside extreme memory targets.
    I can measure this → · Model page
  • Mac Studio M3 Ultra 192GB + Qwen 3.5 235B-A17B (MLX-4bit) · High · target: 8-14 tok/s decode (single stream)
    Qwen 3.5 235B-A17B (MoE) on — · MLX-LM · MLX-4bit
    Why we want this: The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. Our editorial estimate is 25-30% of NVIDIA throughput; a measured value would close the loop.
    I can measure this → · Model page
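
Several of these targets are MLX-based and explicitly ask for a sustained-load curve rather than a single peak number. Here is a minimal sketch of how a submission might collect one with mlx_lm on Apple Silicon; the model repo and run count are assumptions, and re-tokenizing the output gives a rough rate that includes prefill (mlx_lm's verbose=True prints separate prompt and generation speeds).

```python
import time
from mlx_lm import load, generate

# Assumed model repo; substitute whatever MLX 4-bit build you are testing.
model, tokenizer = load("mlx-community/Qwen2.5-3B-Instruct-4bit")
prompt = "Summarize the history of the transistor in three paragraphs."

# Repeated back-to-back runs expose thermal throttling on fanless devices:
# plot run index vs tok/s to get the sustained-load curve, not just the peak.
for run in range(10):
    start = time.perf_counter()
    text = generate(model, tokenizer, prompt=prompt, max_tokens=256)
    elapsed = time.perf_counter() - start
    n_tokens = len(tokenizer.encode(text))  # rough: re-tokenizes the output
    print(f"run {run}: {n_tokens / elapsed:.1f} tok/s (prefill included)")
```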

Devices we want measured but don't fully cover yet

Hand-curated editorial opinion — devices that matter to mobile / edge AI operators but where we either don't have a complete hardware row, don't have measurements, or both. These aren't pulled from the database; they reflect the editorial judgement of where the gaps hurt operators most. We link to manufacturer pages where useful; we don't reproduce specs we haven't verified.

  • iPhone 15 Pro / Apple Neural Engine (A17 Pro / A18 Pro) · Editorial · uncovered
    App-bundled local LLM inference on iOS is the most-asked mobile question we get. The Neural Engine is exposed through Core ML and MLX Swift — but there's no first-party tok/s benchmark from Apple, and we haven't independently measured it.
    Runtime ecosystem: Core ML · MLX Swift · ExecuTorch (Metal/CoreML backends) · Vendor page →
  • Snapdragon X Elite (X1E-84-100) · Editorial · uncovered
    Copilot+ PC reference NPU with 45 TOPS. Operators ask whether the X Elite is a viable Llama 3.2 / Phi 3.5 host. The catalog has the SoC row but no measured tok/s yet.
    Runtime ecosystem: ONNX Runtime Mobile · Qualcomm AI Hub · DirectML · IPEX-LLM (CPU path) · Vendor page →
  • Snapdragon 8 Elite (mobile flagship) · Editorial · uncovered
    The 2024-2025 Android flagship SoC with Hexagon NPU. Qualcomm AI Hub publishes vendor numbers; we want operator-reproduced tok/s for Phi 3.5 Mini and Llama 3.2 1B/3B on shipping handsets.
    Runtime ecosystem: Qualcomm AI Hub · MLC LLM (Adreno GPU) · ONNX Runtime Mobile · Vendor page →
  • Intel Lunar Lake NPU (Core Ultra 200V) · Editorial · uncovered
    48 TOPS NPU shipping in late-2024 / 2025 thin-and-lights. OpenVINO and IPEX-LLM both target it. We have a Lunar Lake hardware row but no measured local-LLM tok/s yet — vendor numbers exist but haven't been reproduced.
    Runtime ecosystem: OpenVINO · IPEX-LLM · ONNX Runtime Mobile · DirectML · Vendor page →
  • AMD Ryzen AI 300-series NPU (Strix Point) · Editorial · uncovered
    50 TOPS XDNA 2 NPU. The Ryzen AI 9 HX 370 has a catalog row, but the NPU path through ONNX Runtime + AMD's Ryzen AI software is poorly documented relative to Intel/Qualcomm. Operators want measurements (a quick ONNX Runtime provider check follows this list).
    Runtime ecosystem: AMD Ryzen AI software · ONNX Runtime · DirectML · Vendor page →
  • NVIDIA Jetson Orin Nano / AGX Orin · Editorial · uncovered
    The reference edge-AI dev kit family. AGX Orin (275 TOPS) is the production target for robotics + edge inference; Orin Nano (40 TOPS) is the hobbyist tier. CUDA + TensorRT-LLM both target Jetson, but we have no Jetson rows in the catalog yet.
    Runtime ecosystem: TensorRT-LLM · llama.cpp (CUDA) · vLLM (limited) · NVIDIA NIM Edge · Vendor page →
  • Raspberry Pi 5 + AI Hat+ (Hailo-8L) · Editorial · uncovered
    26 TOPS Hailo NPU on a Pi-5-shaped expansion board. Hugely popular with edge / IoT operators. LLM support is limited — Hailo's compiler targets vision models more than transformer decoders — but operator demand is real.
    Runtime ecosystem: Hailo Runtime · llama.cpp (CPU on Pi 5) · ONNX Runtime · Vendor page →
  • Google Coral TPU (USB / M.2) · Editorial · uncovered
    Edge TPU at 4 TOPS INT8. Predates the current LLM wave; designed for vision / classification, not transformer decode. We list it for honesty: operators repeatedly ask, and the honest answer is "not a viable LLM host today."
    Runtime ecosystem: TensorFlow Lite · Edge TPU compiler · Vendor page →
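
Several of the Windows-side paths above funnel through ONNX Runtime execution providers, and the first sanity check on any NPU machine is which providers the installed build actually exposes. A short check using documented onnxruntime calls; the model path is a placeholder.

```python
import onnxruntime as ort

# NPU paths surface as execution providers: "QNNExecutionProvider" (Qualcomm),
# "OpenVINOExecutionProvider" (Intel), "DmlExecutionProvider" (DirectML).
print(ort.get_available_providers())

# Ask for the NPU first, fall back to CPU, then confirm what was actually used.
# "model.onnx" is a placeholder path.
sess = ort.InferenceSession(
    "model.onnx",
    providers=["QNNExecutionProvider", "CPUExecutionProvider"],
)
print(sess.get_providers())  # the provider list the session really resolved to
```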

Mobile-edge requests open for claiming

Operators have asked for these measurements via /benchmarks/request and editorial accepted them. Each row is open for any operator with the matching rig to claim and measure. The filter is hardware-slug exact-match against a curated mobile/edge list — if your device fits one of these, claiming costs nothing and the measurement lands on the public roadmap.

No mobile-edge requests open for claiming right now.

That doesn't mean the gap is closed — most mobile hardware in the editorial list above has no request row yet either. Be the first to request one via the request form.
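
For operators scripting against the roadmap, the claiming filter described above is nothing exotic: accepted, unclaimed request rows whose hardware slug exact-matches a curated set. A sketch of that shape; the slugs and field names are hypothetical, not the site's actual schema.

```python
# Hypothetical slugs and field names; the real curated list lives server-side.
MOBILE_EDGE_SLUGS = {"snapdragon-8-elite", "apple-a18-pro", "jetson-orin-nano"}

def open_mobile_requests(requests: list[dict]) -> list[dict]:
    """Accepted, unclaimed requests whose hardware slug is in the curated list."""
    return [
        r for r in requests
        if r["status"] == "accepted"
        and r.get("claimed_by") is None
        and r["hardware_slug"] in MOBILE_EDGE_SLUGS
    ]
```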

Mobile-friendly workflows worth pairing with on-device hardware

Editorial guidance — not pulled from the registry. Hand-curated pairings of workflow + silicon + runtime that we've seen actually work in 2026, with an honest one-line rationale. If you're looking for a starting point on a phone or laptop NPU, these are the shapes that ship.

  • Voice transcription · Apple M-series + iPhone (mlx-swift on iOS) · Editorial · workflow
    Runtime: Whisper.cpp / WhisperKit
    Whisper-large-v3-turbo transcribes faster than real time on M2/M3/M4; mlx-swift exposes the same model to iOS apps through Core ML or MLX directly. The decoded transcript never leaves the device. (A transcription sketch follows this list.)
  • On-device chat assistant · Snapdragon X Elite (Copilot+ PC) · Editorial · workflow
    Runtime: ONNX Runtime Mobile · llama.cpp (CPU path)
    Phi-3.5 Mini (3.8B) at INT4 fits comfortably in 16GB unified memory and produces serviceable chat output without invoking the GPU. The Hexagon NPU path through ONNX Runtime is the speed-tilted option but is less reproducible: operator-reported numbers vary widely with driver versions. (A minimal llama.cpp chat sketch follows this list.)
  • Mobile RAG over personal docs · iPhone 15 Pro / iPhone 16 (A17 Pro / A18 Pro) · Editorial · workflow
    Runtime: Llama 3.2 3B via mlx-swift or ExecuTorch
    A 3B-parameter model paired with a small embedding model (e.g. all-MiniLM-L6-v2 ported to Core ML) is enough to answer questions over a personal Notes / iMessage corpus. The vector index can live in SQLite via sqlite-vss; the entire pipeline runs offline. (A desktop prototype of the retrieval step follows this list.)
  • Edge speech assistant · NVIDIA Jetson Orin Nano (40 TOPS) · Editorial · workflow
    Runtime: Whisper.cpp + llama.cpp (CUDA)
    Pair Whisper-small for STT with a 1-3B LLM at INT4 for response generation. Latency is acceptable for kiosk / robotics use cases, and the Orin Nano is the sweet-spot dev kit for shipping. AGX Orin handles 7-8B LLMs comfortably for higher-quality responses.
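
For the voice-transcription workflow, the quickest desktop-side reproduction is mlx-whisper on an M-series Mac (the iOS equivalent goes through WhisperKit or mlx-swift). A minimal sketch; the audio file and model repo are placeholder assumptions.

```python
import mlx_whisper

# Runs Whisper large-v3-turbo locally on the Apple Silicon GPU.
# "memo.wav" and the repo name are placeholders for your own inputs.
result = mlx_whisper.transcribe(
    "memo.wav",
    path_or_hf_repo="mlx-community/whisper-large-v3-turbo",
)
print(result["text"])  # the transcript never leaves the machine
```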
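For the on-device chat shape, the reproducible CPU path is easiest to demonstrate with llama-cpp-python and an INT4 GGUF; this is a sketch of the pattern, not a tuned Snapdragon build, and the model filename is a placeholder.

```python
from llama_cpp import Llama

# INT4 GGUF of Phi-3.5 Mini; the file name is a placeholder for your local copy.
# With no n_gpu_layers set, llama.cpp runs entirely on the CPU.
llm = Llama(model_path="phi-3.5-mini-instruct-Q4_K_M.gguf", n_ctx=4096)

out = llm.create_chat_completion(
    messages=[{"role": "user", "content": "Draft a two-line standup update."}],
    max_tokens=128,
)
print(out["choices"][0]["message"]["content"])
```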
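And for the mobile RAG pairing, the retrieval step prototypes cleanly on a desktop before the pieces get ported to Core ML and sqlite-vss. A sketch using sentence-transformers with brute-force cosine similarity over a toy corpus.

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# all-MiniLM-L6-v2 is the embedding model named in the card; corpus is toy data.
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

notes = [
    "Dentist appointment moved to Friday.",
    "Wifi password for the cabin is in the shared photo album.",
    "Flight lands at 18:40, terminal 2.",
]
index = embedder.encode(notes, normalize_embeddings=True)  # shape (n, 384)

query = embedder.encode(["when does my flight arrive?"], normalize_embeddings=True)
scores = index @ query.T              # cosine similarity (vectors are normalized)
best = int(np.argmax(scores))
print(notes[best])  # this chunk becomes context for the 3B model's answer
```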

OS / NPU / runtime coverage matrix

Honest status for the mobile + edge silicon × runtime combinations operators ask about most. “Covered” means the catalog has measurements an operator could reproduce. “Partial” means the runtime path exists but isn't fully exercised in our corpus. “Uncovered” means the path is technically supported but we have no measured tok/s yet. “Not supported” means the silicon isn't structurally a viable host for the workload in 2026 — we say so plainly rather than imply otherwise.

  • Snapdragon X Elite + ONNX Runtime + Phi 3.5 · Uncovered
    Hardware row exists; the ONNX Runtime path through the Hexagon NPU is documented; we have no measured tok/s yet.
  • Snapdragon X Elite + ExecuTorch · Partial
    ExecuTorch's QNN backend targets the Hexagon NPU but documentation is sparse. Vendor-published numbers exist; operator reproduction is rare.
  • Apple Neural Engine + decoder-only LLMs (any runtime) · Not supported
    The ANE in 2026 is structurally not a viable LLM accelerator — Core ML's transformer compiler covers encoders + small decoders only; production LLMs run on the GPU through MLX or llama.cpp Metal instead.
  • Apple Silicon GPU + MLX + Llama 3.2 · Covered
    MLX-LM is the production path on macOS / iPadOS; small models also run on iPhone via mlx-swift. We have measurements on M-series.
  • Intel Lunar Lake NPU + ONNX Runtime + Llama 3 · Uncovered
    The NPU path through OpenVINO and IPEX-LLM is documented and the Lunar Lake 258V hardware row exists; we have no measured local-LLM tok/s yet.
  • AMD Ryzen AI NPU + DirectML + Phi · Partial
    DirectML reaches the XDNA 2 NPU on Windows; AMD's Ryzen AI software stack works for Phi 3.5, but operator-reproduced numbers are scarce relative to Intel / Qualcomm.
  • NVIDIA Jetson Orin Nano + llama.cpp (CUDA) · Partial
    The path is well-supported in the upstream llama.cpp project; Jetson hardware rows aren't yet in our catalog, so measurements live in operator threads rather than here.
  • Hailo-8L (Pi 5 AI Hat+) + LLM decode · Not supported
    Hailo's compiler is vision-tilted; transformer decoder support is experimental and not production-ready in 2026. Treat the AI Hat+ as a vision accelerator, not an LLM host.

Where to go next

Every model+hardware combo we want measured next, mobile and otherwise.

See the full benchmark roadmap
Or: Submit a benchmark · Browse all benchmarks