How to run local AI on Android (May 2026) — Snapdragon, Pixel, and the honest path
An operator's guide to local LLMs on Android: MLC LLM, llama.cpp via Termux, ONNX Runtime Mobile, ExecuTorch, and Qualcomm AI Hub for the Hexagon NPU. Realistic 1-7B model sizing, the OEM ROM fragmentation reality, and battery and thermal limits.
The Android device-tier reality
Android in 2026 is not one platform. From the operator's point of view, it's at least four (a runtime detection sketch follows this list):
- Snapdragon 8 Elite (Galaxy S25, OnePlus 13, etc.): 16 GB RAM common, Hexagon NPU at ~80 TOPS INT8, Adreno 830 GPU. The current ceiling — 7B Q4 LLMs run at usable speed; 13B Q4 loads with patience.
- Snapdragon 8 Gen 3 (Galaxy S24, OnePlus 12): 12 GB RAM common, Hexagon NPU at ~45 TOPS, Adreno 750 GPU. 7B Q4 viable, 3B is the sweet spot for sustained use.
- Pixel 8 Pro / 9 / 9 Pro (Tensor G3 / G4): 12-16 GB RAM. Tensor NPU TOPS aren't officially published; community measurements suggest mid-Snapdragon parity. Gemini Nano runs natively on these.
- Anything older or mid-range: 6-8 GB RAM, Snapdragon 7-series or 8 Gen 1, MediaTek Dimensity. 1-3B class, short sessions, expect throttling.
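If you need to bucket user devices at runtime, total RAM plus the SoC model string gets you most of the way. A minimal Kotlin probe under stated assumptions: the tier thresholds are illustrative, and Android reports slightly less than the marketing RAM figure (a "16 GB" phone shows ~15.x GB).

```kotlin
import android.app.ActivityManager
import android.content.Context
import android.os.Build

// Crude device-tier bucketing by total RAM and SoC identity.
// Thresholds are illustrative, tuned to the tiers above.
fun deviceTier(context: Context): String {
    val am = context.getSystemService(Context.ACTIVITY_SERVICE) as ActivityManager
    val mem = ActivityManager.MemoryInfo().also { am.getMemoryInfo(it) }
    val totalGb = mem.totalMem / (1024.0 * 1024 * 1024)

    // Build.SOC_MODEL exists only on Android 12 (API 31) and later.
    val soc = if (Build.VERSION.SDK_INT >= Build.VERSION_CODES.S) Build.SOC_MODEL else "unknown"

    return when {
        totalGb >= 15.0 -> "16GB-flagship ($soc): 7B Q4 viable, 13B Q4 with patience"
        totalGb >= 11.0 -> "12GB-flagship ($soc): 3B sweet spot, 7B Q4 short sessions"
        else            -> "mid-range ($soc): 1-3B class, short sessions"
    }
}
```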
The single fact that changes everything on Android: the OS doesn't guarantee NPU access to third-party apps in a portable way. NNAPI is being deprecated; vendor-specific paths (Qualcomm AI Hub, Samsung ENN, MediaTek NeuroPilot) are stable but proprietary. We'll come back to that.
The five runtimes you can actually use
The honest map of what runs on Android today:
- MLC LLM: the cross-device runtime. Same checkpoint compiles for Adreno GPU, iOS Metal, and WebGPU. Adreno path doesn't use Hexagon NPU but runs reliably on any flagship Android. Best operator pick for cross-platform shipping.
- llama.cpp via Termux or in-process JNI: works, GGUF-compatible, ARM SIMD path is mature in 2026. Slower than MLC LLM's Adreno path but simplest to debug if you already know llama.cpp on desktop; a JNI sketch follows this list. CPU-only unless you build with the OpenCL / Vulkan backend.
- ONNX Runtime Mobile: Microsoft's mobile inference runtime. Stable but model-conversion-heavy. A good pick if you're already in an ONNX pipeline; less compelling for greenfield mobile LLM work.
- ExecuTorch: PyTorch-native mobile runtime. Backend-pluggable: CPU, Vulkan, NNAPI (deprecated), and a growing set of vendor delegates. The right pick if your model authoring is PyTorch-native and you want one toolchain across mobile + edge.
- Qualcomm AI Hub: Snapdragon-locked, NPU-first. Qualcomm publishes pre-quantized checkpoints tuned for Hexagon. Per Qualcomm's published numbers, ~30-50% faster than MLC LLM Adreno path on the same phone. The catch: Snapdragon-only — no Tensor, MediaTek, or Exynos NPU support.
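For the in-process JNI route above, the integration surface is a thin bridge you write yourself against a llama.cpp static library built with the NDK. A sketch of the usual shape; the Kotlin-side function names and the `llama_android` library name are hypothetical placeholders for your own C++ stubs, not llama.cpp's actual C API:

```kotlin
// Hypothetical JNI bridge: each external fun corresponds to a JNIEXPORT
// stub in your C++ code that forwards to llama.cpp's llama_* functions.
object LlamaBridge {
    init {
        // Name of your own CMake target linking llama.cpp (assumption).
        System.loadLibrary("llama_android")
    }

    external fun loadModel(ggufPath: String, nCtx: Int): Long      // opaque native handle
    external fun generate(handle: Long, prompt: String, maxTokens: Int): String
    external fun free(handle: Long)
}

// Usage sketch: load once, keep the handle resident between queries.
// val h = LlamaBridge.loadModel("${context.filesDir}/llama-3.2-3b-q4.gguf", 2048)
// val reply = LlamaBridge.generate(h, "Summarize this note: ...", 256)
```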
Realistic model sizing by device tier
Single-stream, short context, Q4 quantization unless noted (a footprint estimator follows this list):
- Snapdragon 8 Elite, 16 GB RAM: Llama 3.2 3B Q4 for snappy chat; Phi-3.5 Mini 3.8B for stronger instruction-following; 7B Q4 (~4 GB weights) for short bursts; 13B Q4 (~7 GB) loads but throttles fast.
- Snapdragon 8 Gen 3, 12 GB RAM: 3B is the comfort tier; 7B Q4 works for short sessions but pushes RAM pressure.
- Pixel 9 / 9 Pro, 16 GB RAM: 3-7B class on the Adreno path; Gemini Nano for Google's on-device APIs.
- Older / mid-range, 6-8 GB RAM: Llama 3.2 1B Q4 (~700 MB) and SmolLM 2 1.7B are the realistic targets.
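The sizing rules above reduce to simple arithmetic: Q4_K_M-class quants cost roughly 4.8 bits per weight, and the KV cache adds a context-dependent overhead on top. A rough Kotlin estimator; both constants are assumptions (the KV figure is in the ballpark of a GQA Llama-3-class model at fp16), not measurements:

```kotlin
// Back-of-envelope RAM footprint for a Q4 model.
// 4.8 bits/weight matches the figures above: 7B -> ~4.2 GB, 1B -> ~0.6 GB.
fun q4FootprintGb(
    paramsBillions: Double,
    contextTokens: Int = 2048,
    kvBytesPerToken: Long = 130_000  // assumption: fp16 KV cache, GQA, ~30 layers
): Double {
    val weightsGb = paramsBillions * 4.8 / 8.0
    val kvGb = contextTokens * kvBytesPerToken / 1e9
    return weightsGb + kvGb
}

// q4FootprintGb(7.0) -> ~4.5 GB: fits a 16 GB phone, tight on 12 GB
// q4FootprintGb(1.0) -> ~0.9 GB: fine on 6-8 GB mid-rangers
```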
Throughput varies considerably by runtime and thermal state. See /benchmarks/mobile-edge for measured Android device coverage and /benchmarks/wanted for the gaps. If you've measured your phone and want to contribute, submit at /submit/benchmark.
Install paths in order of effort
- MLC Chat APK (10 minutes). The community-built MLC Chat Android APK. Sideload, pick a model, chat. Useful for smoke-testing whether your phone actually runs LLMs at all.
- Layla / Local AI / similar Play Store apps (5 minutes). Several consumer apps in 2026 bundle llama.cpp under the hood. Quality varies; treat as “does it run” testing, not a production decision.
- llama.cpp via Termux (30-60 minutes). Install Termux from F-Droid (the Play Store version is deprecated), `apt install clang make`, clone llama.cpp, build with `make -j$(nproc)`, pull a GGUF, run `./llama-cli`. CPU-only by default; add `GGML_OPENCL=1` for Adreno GPU if your driver cooperates. Fragile but rewarding.
- ExecuTorch or MLC LLM in your own Android app (4-12 hours). Android Studio + NDK + your runtime SDK + a quantized model. See Android on-device AI stack for the full step-by-step.
- Qualcomm AI Hub (Snapdragon flagship only, 4-12 hours). Sign up, pick a Qualcomm-published quant, integrate the QNN SDK into your app. Best raw throughput; locks you to Snapdragon.
OEM fragmentation, NNAPI deprecation, and other Android-specific pain
The Android-specific failure modes you don't hit on iPhone:
- NNAPI is deprecated. Google announced NNAPI deprecation in 2024 and is steering developers to LiteRT (the successor to TensorFlow Lite) and vendor-specific delegates. If you wrote NNAPI code, plan a migration. Most third-party LLM runtimes never relied on NNAPI heavily because the NPU path was inconsistent.
- OEM ROMs aggressively kill background processes. Xiaomi MIUI, OnePlus OxygenOS, and Samsung One UI all kill long-running foreground services in different ways. An on-device LLM keeping a model in RAM for 30 minutes between queries is the kind of thing OEM ROMs treat as suspicious; a foreground-service sketch follows this list.
- Adreno OpenCL drivers vary by ROM. The same Snapdragon 8 Gen 3 phone with two different ROMs can have wildly different OpenCL stability. llama.cpp builds that work on stock Pixel can crash on a flashed AOSP build.
- Vendor NPU SDKs are mutually exclusive. Qualcomm QNN, Samsung ENN, MediaTek NeuroPilot, Google EdgeTPU. If you write to one, you've picked your hardware. Cross-vendor NPU access is a problem nobody has solved.
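For the process-killing problem specifically, the baseline mitigation is a foreground service with a visible notification; aggressive ROMs may still require a manual whitelist. A minimal Kotlin sketch (note that Android 14+ additionally requires a `foregroundServiceType` declaration in the manifest):

```kotlin
import android.app.Notification
import android.app.NotificationChannel
import android.app.NotificationManager
import android.app.Service
import android.content.Intent
import android.os.IBinder

// Keeps the process at foreground priority so the loaded model can stay
// resident in RAM between queries. The notification is mandatory.
class InferenceService : Service() {
    override fun onStartCommand(intent: Intent?, flags: Int, startId: Int): Int {
        val channel = NotificationChannel(
            "inference", "On-device model", NotificationManager.IMPORTANCE_LOW
        )
        getSystemService(NotificationManager::class.java).createNotificationChannel(channel)

        val notification = Notification.Builder(this, "inference")
            .setContentTitle("Model loaded")
            .setSmallIcon(android.R.drawable.stat_notify_sync)
            .build()

        startForeground(1, notification)
        return START_STICKY  // ask the OS to restart the service if killed
    }

    override fun onBind(intent: Intent?): IBinder? = null
}
```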
Battery and thermal reality
Editorial estimates, single-stream Q4 inference on a flagship Snapdragon 8 Gen 3 phone running a 3B-class model:
- Battery: 5-12% drained per 10-minute active chat session. The variance is real — Adreno GPU path is more efficient than CPU path; NPU path is more efficient still when it's available.
- Thermal throttle: 4-8 minutes of sustained inference is enough to trigger a 25-40% throughput drop on most phones. Pixel devices throttle faster than Galaxy S; OnePlus traditionally has more thermal headroom. A throttle-detection sketch follows this list.
- Charging while inferencing: surprisingly, this often makes throttling worse, not better, because charging adds heat. Run on battery for benchmarking.
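You can detect throttling instead of guessing: Android 10+ reports thermal status to apps. A minimal Kotlin listener, under the assumption that your inference loop pauses or slows generation once the OS reports severe status:

```kotlin
import android.content.Context
import android.os.Build
import android.os.PowerManager

// Calls onThrottle(true) when thermal status reaches SEVERE, which is
// roughly where the 25-40% throughput drops above kick in.
fun watchThermals(context: Context, onThrottle: (Boolean) -> Unit) {
    if (Build.VERSION.SDK_INT < Build.VERSION_CODES.Q) return  // API 29+ only
    val pm = context.getSystemService(Context.POWER_SERVICE) as PowerManager
    pm.addThermalStatusListener { status ->
        onThrottle(status >= PowerManager.THERMAL_STATUS_SEVERE)
    }
}
```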
What doesn't work on Android in 2026
- Universal NPU offload across vendors. Pick a vendor SDK or accept GPU-only.
- Continuous agent loops on battery. Same physics as iOS — thermals throttle, OEM ROMs kill the process.
- 13B+ at usable speed. Loads on 16 GB phones, throttles before you finish a long response.
- Production deployment to non-flagship phones. The 1-3B range is realistic; assume your users have 6-8 GB RAM, not 16.
- Cross-OEM consistent throughput numbers. The same model on the same SoC will benchmark differently on different OEM ROMs. This is why our mobile-edge gap report tracks device + ROM, not just SoC.
Common failure modes
- Termux build fails with linker error. F-Droid Termux is the supported version; the Play Store one is stale. Run `pkg upgrade` first.
- OpenCL driver crash mid-decode. Adreno OpenCL is known to be flaky on some ROMs. Fall back to a CPU build (`GGML_OPENCL=0`).
- App killed in background. OEM battery optimizer. Whitelist the app in OS settings or run as a foreground service with a notification (see the sketch in the fragmentation section above).
- QNN model fails to load. Qualcomm AI Hub quants are tied to specific Snapdragon generations. Loading an 8 Elite quant on an 8 Gen 3 will silently degrade or crash.
- Storage exhaustion. 7B Q4 GGUF + cache + your app's state is 5-8 GB on disk. Cheap phones with 64 GB internal storage run out; the guard sketch below is the cheap defense.
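For the storage failure mode, a pre-download guard is cheap insurance. A minimal Kotlin check; the 2 GB headroom default is an arbitrary assumption:

```kotlin
import java.io.File

// Refuse a model download unless it fits with headroom to spare.
fun canFitModel(
    targetDir: File,
    modelBytes: Long,
    headroomBytes: Long = 2L * 1024 * 1024 * 1024  // assumption: 2 GB spare
): Boolean = targetDir.usableSpace > modelBytes + headroomBytes

// e.g. canFitModel(context.filesDir, 4_200_000_000) before pulling a 7B Q4 GGUF
```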
Going deeper
- Android on-device AI stack — the production-grade build recipe.
- Run local AI on iPhone — the cross-platform comparison.
- Best mobile AI runtimes — the runtime tier list.
- Mobile / edge benchmark gap report — device coverage status.
- Will it run? — model + phone tier verdict tool.
Next step for Android operators
Sideload the MLC Chat APK and smoke-test on your own phone before investing in Termux builds or app development.