
ONNX Runtime uses CPU instead of GPU — force the CUDA provider

ONNX Runtime silently falls back to CPU even with a GPU present. Fix the provider registration, package choice, and model export to get GPU inference working.

ONNX Runtime · NVIDIA CUDA · Python · Transformers
By Fredoline Eruo · Last verified 2026-05-08

Diagnostic order — most likely first

#1

pip installed onnxruntime instead of onnxruntime-gpu

Diagnose

`pip list | grep onnxruntime` shows `onnxruntime` (not `onnxruntime-gpu`). The CPU-only package is the default. ONNX Runtime doesn't error — it just silently falls back to CPU.

Fix

`pip uninstall onnxruntime && pip install onnxruntime-gpu`. The two packages conflict; you can only have one installed. Always verify with `python -c 'import onnxruntime; print(onnxruntime.get_available_providers())'` — the list must include `CUDAExecutionProvider`.

#2

CUDA provider not registered in the session options

Diagnose

`onnxruntime.get_available_providers()` returns `['CPUExecutionProvider']` only, even with `onnxruntime-gpu` installed. The CUDA shared libraries aren't on the library path.

Fix

Add CUDA and cuDNN to your library path. On Linux: `export LD_LIBRARY_PATH=/usr/local/cuda/lib64:$LD_LIBRARY_PATH`. On Windows: add CUDA's `bin` directory to the system PATH. Then verify with `onnxruntime.get_device()` — it should return `'GPU'`.
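
As a minimal sketch (the library paths above are typical Linux defaults and may differ on your system), verify from Python after updating the environment:

```python
# Minimal sketch: check that ONNX Runtime can now see the GPU.
# Run this in a fresh shell so the updated LD_LIBRARY_PATH / PATH is picked up.
import onnxruntime as ort

print(ort.get_device())               # should print 'GPU'
print(ort.get_available_providers())  # 'CUDAExecutionProvider' should now be listed
```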

#3

Wrong execution provider order in session options

Diagnose

`onnxruntime.get_available_providers()` shows CUDA, but the session was created with `providers=['CPUExecutionProvider']`, or CUDA appears after CPU in the providers list. ONNX Runtime assigns each graph node to the first provider in the list that supports it, so a CPU-first list never reaches the GPU.

Fix

Create the session with CUDA first: `session = onnxruntime.InferenceSession('model.onnx', providers=['CUDAExecutionProvider', 'CPUExecutionProvider'])`. With CUDA listed first, the session only falls back to CPU if the CUDA provider fails to initialize.
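
A minimal sketch of the full pattern (`model.onnx` is a placeholder path); checking `get_providers()` right after creation catches a silent fallback immediately:

```python
# Minimal sketch: request CUDA first, keep CPU as the explicit fallback,
# and verify which provider the session actually ended up with.
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # 'CUDAExecutionProvider' should be the first entry
```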

#4

Model was exported with operators not supported by the CUDA provider

Diagnose

Session creation logs a warning: `Failed to create CUDAExecutionProvider. Falling back to CPUExecutionProvider.` The model contains ops that the CUDA EP can't execute, typically custom ops or ops from a newer ONNX opset than your runtime supports.

Fix

During export, set `opset_version` to a widely-supported version (opset 17 is the current sweet spot for ONNX Runtime 1.17+). Avoid custom ops unless you've registered them with the CUDA EP. Test the exported ONNX model with `onnx.checker.check_model()` and `onnxruntime.InferenceSession(..., providers=['CUDAExecutionProvider'])` immediately after export.
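
A minimal export-and-verify sketch, assuming a PyTorch model (the tiny `Sequential` below is a stand-in for your own module):

```python
# Minimal sketch: export at opset 17, validate the graph, then open it on
# the CUDA provider immediately so unsupported ops surface at export time.
import onnx
import onnxruntime as ort
import torch

model = torch.nn.Sequential(torch.nn.Linear(16, 8), torch.nn.ReLU()).eval()  # stand-in model
dummy = torch.randn(1, 16)  # dummy input matching the model's expected shape

torch.onnx.export(model, dummy, "model.onnx", opset_version=17)

onnx.checker.check_model(onnx.load("model.onnx"))
session = ort.InferenceSession("model.onnx", providers=["CUDAExecutionProvider"])
print(session.get_providers())
```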

Frequently asked questions

How do I verify ONNX Runtime is actually using the GPU?

Run `nvidia-smi -l 1` during inference. If ONNX Runtime uses GPU, you'll see VRAM allocation from the `python` process. Also check `session.get_providers()` — the first entry should be `CUDAExecutionProvider`. If CPU is the first provider or the only one, GPU is not in use.

Is ONNX Runtime faster than PyTorch for inference?

For single-batch, single-stream inference (the typical local AI use case): rarely. PyTorch's eager execution with `torch.compile` is competitive. ONNX Runtime shines for batched, multi-stream serving where graph optimizations and kernel fusion pay off. For interactive chat, the difference is usually negligible.

Can I use ONNX Runtime with AMD GPUs?

Yes, via the ROCm execution provider: install `onnxruntime-rocm` instead of `onnxruntime-gpu`. The setup is similar — ROCm must be installed and on the library path. However, ONNX Runtime ROCm support lags behind CUDA; some ops fall back to CPU. Check the ONNX Runtime docs for your version's ROCm support matrix.
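
As a hedged sketch (package and provider names vary across ONNX Runtime releases, so check the docs for yours), the session code mirrors the CUDA case with the ROCm provider substituted:

```python
# Minimal sketch: same pattern as CUDA, with the ROCm execution provider first.
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",  # placeholder path
    providers=["ROCMExecutionProvider", "CPUExecutionProvider"],
)
print(session.get_providers())  # ops the ROCm EP can't handle fall back to CPU
```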

Related troubleshooting

When the fix is hardware

A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: