ONNX Runtime
Microsoft's cross-platform inference runtime for ONNX models. The reference path when you need a single runtime that targets CUDA + DirectML + CoreML + OpenVINO + ROCm from one binary. Stronger on classical models (vision, NLP, speech) than on LLMs, where vLLM and llama.cpp lead.
Overview
What ONNX Runtime actually is
ONNX Runtime is Microsoft's cross-platform inference runtime for the ONNX model format, and the only meaningful single-runtime path that targets CUDA, DirectML, CoreML, OpenVINO, and ROCm from one binary. It is not a new training framework, not a new model format outside ONNX, and not specifically an LLM engine — it is a graph-execution runtime that runs whatever ONNX model you hand it on whatever Execution Provider (EP) the host hardware supports.
That positioning means ONNX Runtime is strongest on classical ML workloads (vision, NLP encoders, speech, embeddings, classical transformers used in feature pipelines) and weakest, comparatively, on bleeding-edge LLM serving. For a 70B-class generative model in production, vLLM and TensorRT-LLM outclass it on throughput. For a vision model that has to run on a customer's Surface laptop and a Mac and a Linux box from one binary, ONNX Runtime is unrivaled.
Where it fits in the stack
ONNX Runtime lives at the runtime layer for cross-platform model deployment. The typical stack:
- Source model: PyTorch / TensorFlow / scikit-learn → exported to ONNX
- Optimization: `onnxruntime.quantization` for INT8 / W4A16, `olive` for graph-level fusion
- Runtime: ONNX Runtime + the right Execution Provider for the target
- Hardware: anything an EP exists for — NVIDIA, AMD, Intel CPU/GPU/NPU, Apple Silicon, Snapdragon NPUs
It is not the right runtime for a 70B FP8 chatbot on H100 (use TensorRT-LLM), and it is not the right runtime for a GGUF-only homelab (use llama.cpp). It is the runtime for "this model has to ship on five different OS / GPU combinations and I need one inference path."
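A minimal sketch of the export step at the top of that stack; the torchvision ResNet-18, input shape, and opset below are illustrative assumptions, not requirements:

import torch
import torchvision

# Illustrative: any traceable PyTorch module works the same way
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)          # one 224x224 RGB image

torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},    # keep batch size dynamic
    opset_version=17,
)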
Best use cases
- Cross-platform desktop ML. Vision, OCR, speech, embeddings, classical transformers shipping inside a desktop app on Windows + macOS + Linux from one ONNX file.
- Windows + AMD GPU inference via DirectML. The DirectML EP is the cleanest "AMD GPU on Windows" path that exists outside llama.cpp's HIPBLAS.
- NPU-targeted inference. Snapdragon X / Lunar Lake NPUs both expose ONNX Runtime EPs. See /stacks/android-on-device-ai.
- On-device embeddings for RAG. Sentence-transformers exported to ONNX run fast on CPU and fit cleanly into a desktop app; a minimal sketch follows this list.
- Mobile ML. ONNX Runtime Mobile (a separate but related build) is the Android-default for non-LLM ML.
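A minimal sketch of that embeddings path, assuming a sentence-transformers model already exported to embedder.onnx whose graph takes BERT-style inputs and returns last_hidden_state; the file name and model ID are placeholders, and the real input names should be checked with sess.get_inputs():

import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
sess = ort.InferenceSession("embedder.onnx", providers=["CPUExecutionProvider"])

batch = tok(["local RAG on a laptop"], padding=True, return_tensors="np")
inputs = {i.name for i in sess.get_inputs()}
hidden = sess.run(None, {k: v for k, v in batch.items() if k in inputs})[0]  # (batch, seq, dim)

# Mean-pool over non-padding tokens to get one vector per sentence
mask = batch["attention_mask"][..., None].astype(hidden.dtype)
embeddings = (hidden * mask).sum(axis=1) / mask.sum(axis=1)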
OS support
| OS | Quality | Notes |
|---|---|---|
| Windows 10 / 11 | excellent | reference platform; full DirectML EP coverage |
| Linux (x86_64) | excellent | full CUDA, ROCm, OpenVINO, CPU |
| macOS (Apple Silicon) | excellent | CoreML EP for ANE / GPU; CPU baseline |
| Linux (ARM64) | good | CPU + Vulkan-class fallbacks |
| Android | good | via ONNX Runtime Mobile |
| iOS | good | via CoreML EP + ONNX Runtime Mobile |
Hardware / backend support
The EP catalog as of May 2026 (only the EPs relevant to AI workloads are listed; a provider-selection sketch follows the list):
- CUDA EP (NVIDIA all generations)
- TensorRT EP (NVIDIA; uses TensorRT under the hood for maximum throughput on supported ops)
- DirectML EP (Windows; AMD + Intel + NVIDIA + Snapdragon NPU)
- CoreML EP (Apple Silicon; targets ANE + GPU + CPU)
- OpenVINO EP (Intel CPU / iGPU / NPU; see OpenVINO)
- ROCm EP (AMD on Linux; see ROCm)
- CPU EP (every platform; the always-available fallback)
- QNN EP (Qualcomm Snapdragon NPU; the Snapdragon X Elite path)
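To see which of those EPs a given build and machine actually expose, and to set a preference order with the CPU EP as the fallback, the provider list is passed explicitly; the model path is a placeholder:

import onnxruntime as ort

available = ort.get_available_providers()     # EPs compiled into this build and usable here
print(available)

# Prefer a GPU EP when present; the CPU EP is the always-available fallback
gpu = [p for p in ("CUDAExecutionProvider", "DmlExecutionProvider") if p in available]
sess = ort.InferenceSession("model.onnx", providers=gpu + ["CPUExecutionProvider"])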
Model / quant format support
- FP32 / FP16 / BF16 — the baseline
- INT8 — `onnxruntime.quantization` produces both static and dynamic INT8 models (a minimal sketch follows this section)
- W4A16 (INT4 weights, FP16 activations) — supported via the Olive toolkit; the LLM-relevant precision
- NF4 / FP8 — partial; lags TensorRT-LLM
- No GGUF, no AWQ-INT4 directly, no EXL2, no MLX — different ecosystem
For the cross-runtime quant ladder see /systems/quantization-formats.
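A minimal sketch of the dynamic-INT8 path referenced in the list above, using the quantize_dynamic entry point; file names are placeholders, and static INT8 additionally requires a calibration data reader:

from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic INT8: weights are quantized offline, activations at runtime
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    weight_type=QuantType.QInt8,
)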
Setup path
The Python install:
pip install onnxruntime-gpu # CUDA EP
# or:
pip install onnxruntime-directml # DirectML EP
A minimal inference call:
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
ids = np.asarray([[101, 7592, 102]], dtype=np.int64)   # example token IDs for a single-input model
out = sess.run(None, {"input_ids": ids})
For an LLM-shaped workflow, export from Hugging Face with `optimum-cli export onnx`, then optimize with the Olive toolkit. The full pipeline is documented at onnxruntime.ai.
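The same export can also be driven from Python through Optimum's ONNX Runtime integration; a minimal sketch, assuming the optimum package is installed and using gpt2 purely as an illustrative model ID:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"                                   # illustrative; any exporter-supported model
tok = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)  # HF checkpoint -> ONNX
model.save_pretrained("gpt2-onnx")                  # persist the exported graph

out = model.generate(**tok("ONNX Runtime is", return_tensors="pt"), max_new_tokens=16)
print(tok.decode(out[0]))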
What breaks first
- EP fallback to CPU silently. If the GPU EP fails to initialize (driver mismatch, CUDA version skew, missing library), ONNX Runtime falls back to the CPU EP without raising. Always log the active EP at startup; a check sketch follows this list.
- HF → ONNX conversion drift. Newer architectures (novel attention, MoE routers) sometimes need patched exporters; the conversion step is the most common source of "the ONNX model produces different outputs than the HF original."
- DirectML quirks on the AMD path. Some ops fall back to CPU; per-op profiling is the only way to find them.
- CUDA + cuDNN version pinning. The CUDA EP is built against specific cuDNN majors; mixing minors can produce silent corruption.
- Mobile build size. ONNX Runtime Mobile needs explicit op-set pruning to keep APK / IPA size sane.
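A minimal startup check for the silent-CPU-fallback failure above, assuming CUDA was the intended provider; it logs the active EPs and fails loudly instead of quietly running on CPU:

import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

active = sess.get_providers()        # providers actually in use, in priority order
print(f"active EPs: {active}")
if active[0] != "CUDAExecutionProvider":
    raise RuntimeError("CUDA EP failed to initialize; refusing to run on the CPU fallback")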
Alternatives by intent
| If you want… | Reach for |
|---|---|
| LLM-tuned high-throughput serving | vLLM, TensorRT-LLM |
| GGUF-native local LLMs | llama.cpp or Ollama |
| Apple-native | MLX-LM (LLMs) or CoreML directly (classical) |
| Intel CPU / iGPU / NPU first-party | OpenVINO directly (no ONNX layer) |
| Mobile-only | ExecuTorch, MLC-LLM, or ONNX Runtime Mobile |
Best pairings
- OpenVINO EP for Intel hardware — the cleanest "Intel CPU + iGPU + NPU" path
- DirectML EP + a Windows desktop app — the cleanest cross-vendor Windows GPU path
- CoreML EP + a macOS app — the ANE-aware Apple path
- Snapdragon X Elite + QNN EP — the laptop NPU path
- Apple A18 Pro + CoreML EP via ONNX Runtime Mobile — the iOS NPU path
Who should avoid ONNX Runtime
- Operators serving 70B+ generative LLMs in production. The throughput tier above ONNX Runtime exists; use vLLM or TensorRT-LLM.
- Homelabs running GGUF-native models. No reason to go through an ONNX export step.
- Workloads that need maximum AWQ / EXL2 / FP8 throughput. Wrong runtime; pick CUDA-server engines.
- Single-platform deployments. If you're deploying only on Linux + NVIDIA, the cross-platform overhead is wasted; pick a CUDA-native runtime directly.
Related
- Stacks: /stacks/android-on-device-ai, /stacks/private-rag-laptop
- System guides: /systems/quantization-formats, /setup
- Hardware: Snapdragon X Elite, Apple A18 Pro, RTX 4090
- Errors: /errors/wsl2-gpu-not-detected
Pros
- Cross-platform + cross-backend with one runtime — rare in this space
- DirectML provider unlocks Windows + AMD + NPU paths most Linux-native runtimes can't reach
- Microsoft-maintained — production-grade roadmap
Cons
- LLM-specific optimizations behind vLLM and llama.cpp
- Hugging Face → ONNX conversion is an extra step vs direct GGUF / safetensors
- Quant ecosystem narrower than the Hugging Face mainline
Compatibility
| Operating systems | Windows · macOS · Linux |
| GPU backends | NVIDIA CUDA · DirectML · CoreML · OpenVINO · ROCm |
| License | MIT · free + open source |
Runtime health
Operator-grade signals on how actively ONNX Runtime is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.
Release cadence
Derived from the most recent editorial signal on this row.
6 days since last refresh · source: lastUpdated
Benchmark freshness
How recent the editorial measurements on this runtime are.
No editorial benchmarks for this runtime yet.
Community reproduction
Submissions that match an editorial measurement on similar hardware.
No community reproductions on file yet.
Get ONNX Runtime
Frequently asked
Is ONNX Runtime free?
Yes. It is free and open source under the MIT license, with no paid tier.
What operating systems does ONNX Runtime support?
Windows 10 / 11, Linux (x86_64 and ARM64), and macOS on Apple Silicon, plus Android and iOS via ONNX Runtime Mobile.
Which GPUs work with ONNX Runtime?
NVIDIA via the CUDA and TensorRT EPs, AMD via DirectML (Windows) and ROCm (Linux), Intel CPU / iGPU / NPU via OpenVINO and DirectML, Apple Silicon via the CoreML EP, and Qualcomm NPUs via the QNN EP.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
Related — keep moving
Verify ONNX Runtime runs on your specific hardware before committing money.