ONNX Runtime
Microsoft's cross-platform inference runtime for ONNX models. The reference path when you need a single runtime that targets CUDA + DirectML + CoreML + OpenVINO + ROCm from one binary. Stronger on classical models (vision, NLP, speech) than on LLMs, where vLLM and llama.cpp lead.
Overview
What ONNX Runtime actually is
ONNX Runtime is Microsoft's cross-platform inference runtime for the ONNX model format, and the only meaningful single-runtime path that targets CUDA, DirectML, CoreML, OpenVINO, and ROCm from one binary. It is not a new training framework, not a new model format outside ONNX, and not specifically an LLM engine — it is a graph-execution runtime that runs whatever ONNX model you hand it on whatever Execution Provider (EP) the host hardware supports.
That positioning means ONNX Runtime is strongest on classical ML workloads (vision, NLP encoders, speech, embeddings, classical transformers used in feature pipelines) and weakest, comparatively, on bleeding-edge LLM serving. For a 70B-class generative model in production, vLLM and TensorRT-LLM outclass it on throughput. For a vision model that has to run on a customer's Surface laptop and a Mac and a Linux box from one binary, ONNX Runtime is unrivaled.
Where it fits in the stack
ONNX Runtime lives at the runtime layer for cross-platform model deployment. The typical stack:
- Source model: PyTorch / TensorFlow / scikit-learn → exported to ONNX
- Optimization: `onnxruntime.quantization` for INT8 / W4A16, `olive` for graph-level fusion
- Runtime: ONNX Runtime + the right Execution Provider for the target
- Hardware: anything an EP exists for — NVIDIA, AMD, Intel CPU/GPU/NPU, Apple Silicon, Snapdragon NPUs
It is not the right runtime for a 70B FP8 chatbot on H100 (use TensorRT-LLM), and it is not the right runtime for a GGUF-only homelab (use llama.cpp). It is the runtime for "this model has to ship on five different OS / GPU combinations and I need one inference path."
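A minimal sketch of the export step at the top of that stack; the torchvision ResNet-18, input shape, and opset below are illustrative assumptions, not requirements:

import torch
import torchvision

# Illustrative: any traceable PyTorch module works the same way
model = torchvision.models.resnet18(weights=None).eval()
dummy = torch.randn(1, 3, 224, 224)          # one 224x224 RGB image

torch.onnx.export(
    model, dummy, "model.onnx",
    input_names=["input"], output_names=["logits"],
    dynamic_axes={"input": {0: "batch"}},    # keep batch size dynamic
    opset_version=17,
)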
Best use cases
- Cross-platform desktop ML. Vision, OCR, speech, embeddings, classical transformers shipping inside a desktop app on Windows + macOS + Linux from one ONNX file.
- Windows + AMD GPU inference via DirectML. The DirectML EP is the cleanest "AMD GPU on Windows" path that exists outside llama.cpp's HIPBLAS.
- NPU-targeted inference. Snapdragon X / Lunar Lake NPUs both expose ONNX Runtime EPs. See /stacks/android-on-device-ai.
- On-device embeddings for RAG. Sentence-transformers exported to ONNX run fast on CPU and fit cleanly into a desktop app; a minimal sketch follows this list.
- Mobile ML. ONNX Runtime Mobile (a separate but related build) is the Android-default for non-LLM ML.
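A minimal sketch of that embeddings path, assuming a sentence-transformers model already exported to embedder.onnx whose graph takes BERT-style inputs and returns last_hidden_state; the file name and model ID are placeholders, and the real input names should be checked with sess.get_inputs():

import onnxruntime as ort
from transformers import AutoTokenizer

tok = AutoTokenizer.from_pretrained("sentence-transformers/all-MiniLM-L6-v2")
sess = ort.InferenceSession("embedder.onnx", providers=["CPUExecutionProvider"])

batch = tok(["local RAG on a laptop"], padding=True, return_tensors="np")
inputs = {i.name for i in sess.get_inputs()}
hidden = sess.run(None, {k: v for k, v in batch.items() if k in inputs})[0]  # (batch, seq, dim)

# Mean-pool over non-padding tokens to get one vector per sentence
mask = batch["attention_mask"][..., None].astype(hidden.dtype)
embeddings = (hidden * mask).sum(axis=1) / mask.sum(axis=1)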
OS support
| OS | Quality | Notes |
|---|---|---|
| Windows 10 / 11 | excellent | reference platform; full DirectML EP coverage |
| Linux (x86_64) | excellent | full CUDA, ROCm, OpenVINO, CPU |
| macOS (Apple Silicon) | excellent | CoreML EP for ANE / GPU; CPU baseline |
| Linux (ARM64) | good | CPU + Vulkan-class fallbacks |
| Android | good | via ONNX Runtime Mobile |
| iOS | good | via CoreML EP + ONNX Runtime Mobile |
Hardware / backend support
The EP catalog as of May 2026 (only the EPs relevant to AI workloads are listed; a provider-selection sketch follows the list):
- CUDA EP (NVIDIA all generations)
- TensorRT EP (NVIDIA; uses TensorRT under the hood for maximum throughput on supported ops)
- DirectML EP (Windows; AMD + Intel + NVIDIA + Snapdragon NPU)
- CoreML EP (Apple Silicon; targets ANE + GPU + CPU)
- OpenVINO EP (Intel CPU / iGPU / NPU; see OpenVINO)
- ROCm EP (AMD on Linux; see ROCm)
- CPU EP (every platform; the always-available fallback)
- QNN EP (Qualcomm Snapdragon NPU; the Snapdragon X Elite path)
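To see which of those EPs a given build and machine actually expose, and to set a preference order with the CPU EP as the fallback, the provider list is passed explicitly; the model path is a placeholder:

import onnxruntime as ort

available = ort.get_available_providers()     # EPs compiled into this build and usable here
print(available)

# Prefer a GPU EP when present; the CPU EP is the always-available fallback
gpu = [p for p in ("CUDAExecutionProvider", "DmlExecutionProvider") if p in available]
sess = ort.InferenceSession("model.onnx", providers=gpu + ["CPUExecutionProvider"])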
Model / quant format support
- FP32 / FP16 / BF16 — the baseline
- INT8 — `onnxruntime.quantization` produces both static and dynamic INT8 models (a minimal sketch follows this section)
- W4A16 (INT4 weights, FP16 activations) — supported via the Olive toolkit; the LLM-relevant precision
- NF4 / FP8 — partial; lags TensorRT-LLM
- No GGUF, no AWQ-INT4 directly, no EXL2, no MLX — different ecosystem
For the cross-runtime quant ladder see /systems/quantization-formats.
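A minimal sketch of the dynamic-INT8 path referenced in the list above, using the quantize_dynamic entry point; file names are placeholders, and static INT8 additionally requires a calibration data reader:

from onnxruntime.quantization import QuantType, quantize_dynamic

# Dynamic INT8: weights are quantized offline, activations at runtime
quantize_dynamic(
    model_input="model.onnx",
    model_output="model.int8.onnx",
    weight_type=QuantType.QInt8,
)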
Setup path
The Python install:
pip install onnxruntime-gpu # CUDA EP
# or:
pip install onnxruntime-directml # DirectML EP
A minimal inference call:
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
ids = np.asarray([[101, 7592, 102]], dtype=np.int64)   # example token IDs for a single-input model
out = sess.run(None, {"input_ids": ids})
For an LLM-shaped workflow, export from Hugging Face with `optimum-cli export onnx`, then optimize with the Olive toolkit. The full pipeline is documented at onnxruntime.ai.
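The same export can also be driven from Python through Optimum's ONNX Runtime integration; a minimal sketch, assuming the optimum package is installed and using gpt2 purely as an illustrative model ID:

from optimum.onnxruntime import ORTModelForCausalLM
from transformers import AutoTokenizer

model_id = "gpt2"                                   # illustrative; any exporter-supported model
tok = AutoTokenizer.from_pretrained(model_id)
model = ORTModelForCausalLM.from_pretrained(model_id, export=True)  # HF checkpoint -> ONNX
model.save_pretrained("gpt2-onnx")                  # persist the exported graph

out = model.generate(**tok("ONNX Runtime is", return_tensors="pt"), max_new_tokens=16)
print(tok.decode(out[0]))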
What breaks first
- EP fallback to CPU silently. If the GPU EP fails to initialize (driver mismatch, CUDA version skew, missing library), ONNX Runtime falls back to the CPU EP without raising. Always log the active EP at startup; a check sketch follows this list.
- HF → ONNX conversion drift. Newer architectures (novel attention, MoE routers) sometimes need patched exporters; the conversion step is the most common source of "the ONNX model produces different outputs than the HF original."
- DirectML quirks on the AMD path. Some ops fall back to CPU; per-op profiling is the only way to find them.
- CUDA + cuDNN version pinning. The CUDA EP is built against specific cuDNN majors; mixing minors can produce silent corruption.
- Mobile build size. ONNX Runtime Mobile needs explicit op-set pruning to keep APK / IPA size sane.
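A minimal startup check for the silent-CPU-fallback failure above, assuming CUDA was the intended provider; it logs the active EPs and fails loudly instead of quietly running on CPU:

import onnxruntime as ort

sess = ort.InferenceSession(
    "model.onnx",
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)

active = sess.get_providers()        # providers actually in use, in priority order
print(f"active EPs: {active}")
if active[0] != "CUDAExecutionProvider":
    raise RuntimeError("CUDA EP failed to initialize; refusing to run on the CPU fallback")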
Alternatives by intent
| If you want… | Reach for |
|---|---|
| LLM-tuned high-throughput serving | vLLM, TensorRT-LLM |
| GGUF-native local LLMs | llama.cpp or Ollama |
| Apple-native | MLX-LM (LLMs) or CoreML directly (classical) |
| Intel CPU / iGPU / NPU first-party | OpenVINO directly (no ONNX layer) |
| Mobile-only | ExecuTorch, MLC-LLM, or ONNX Runtime Mobile |
Best pairings
- OpenVINO EP for Intel hardware — the cleanest "Intel CPU + iGPU + NPU" path
- DirectML EP + a Windows desktop app — the cleanest cross-vendor Windows GPU path
- CoreML EP + a macOS app — the ANE-aware Apple path
- Snapdragon X Elite + QNN EP — the laptop NPU path
- Apple A18 Pro + CoreML EP via ONNX Runtime Mobile — the iOS NPU path
Who should avoid ONNX Runtime
- Operators serving 70B+ generative LLMs in production. The throughput tier above ONNX Runtime exists; use vLLM or TensorRT-LLM.
- Homelabs running GGUF-native models. No reason to go through an ONNX export step.
- Workloads that need maximum AWQ / EXL2 / FP8 throughput. Wrong runtime; pick CUDA-server engines.
- Single-platform deployments. If you're deploying only on Linux + NVIDIA, the cross-platform overhead is wasted; pick a CUDA-native runtime directly.
Related
- Stacks: /stacks/android-on-device-ai, /stacks/private-rag-laptop
- System guides: /systems/quantization-formats, /setup
- Hardware: Snapdragon X Elite, Apple A18 Pro, RTX 4090
- Errors: /errors/wsl2-gpu-not-detected
Pros
- Cross-platform + cross-backend with one runtime — rare in this space
- DirectML provider unlocks Windows + AMD + NPU paths most Linux-native runtimes can't reach
- Microsoft-maintained — production-grade roadmap
Cons
- LLM-specific optimizations behind vLLM and llama.cpp
- Hugging Face → ONNX conversion is an extra step vs direct GGUF / safetensors
- Quant ecosystem narrower than the Hugging Face mainline
Compatibility
| Operating systems | Windows · macOS · Linux |
| GPU backends | NVIDIA CUDA · DirectML · CoreML · OpenVINO · ROCm |
| License | MIT · free + open source |
Runtime health
Operator-grade signals on how actively ONNX Runtime is being maintained, how fresh its measurements are, and what failure classes operators have flagged. Every label below is anchored to a real date or count — we never infer maintainer activity we can't show.
Release cadence
Derived from the most recent editorial signal on this row.
6 days since last refresh · source: lastUpdated
Benchmark freshness
How recent the editorial measurements on this runtime are.
No editorial benchmarks for this runtime yet.
Community reproduction
Submissions that match an editorial measurement on similar hardware.
No community reproductions on file yet.
Get ONNX Runtime
Frequently asked
Is ONNX Runtime free?
Yes. It is free and open source under the MIT license, with no paid tier.
What operating systems does ONNX Runtime support?
Windows 10 / 11, Linux (x86_64 and ARM64), and macOS on Apple Silicon, plus Android and iOS via ONNX Runtime Mobile.
Which GPUs work with ONNX Runtime?
NVIDIA via the CUDA and TensorRT EPs, AMD via DirectML (Windows) and ROCm (Linux), Intel CPU / iGPU / NPU via OpenVINO and DirectML, Apple Silicon via the CoreML EP, and Qualcomm NPUs via the QNN EP.
Reviewed by RunLocalAI Editorial. See our editorial policy for how we evaluate tools.
Related — keep moving
Verify ONNX Runtime runs on your specific hardware before committing money.