
AMD ROCm: running local AI on Radeon

For: Owners of RX 7900 XTX, 7900 XT, or RX 9070 cards willing to commit to Linux. By the end: A ROCm-backed inference service on Linux running 13-32B-class models, with a pinned driver stack you can re-create.

By Fredoline Eruo · 8 milestones · Last reviewed 2026-05-07

Be honest with yourself first: ROCm on consumer Radeon is Linux-only in any practical production sense, the official support matrix moves slowly, and a kernel upgrade can ruin your weekend. The reward is a 24GB card (7900 XTX) that actually runs modern open-weight models for half the price of a 4090. This path walks through the eight disciplines that keep that reward from costing you your sanity.

Confirm your card is on the supported list

Older cards (RX 6000 series, Vega) work in some configs, but support is research-quality, not production. The current targets are the RX 7900 XTX (24GB), RX 7900 XT (20GB), and the newer RX 9070 series. If you have a 6800/6900 you can make it work, but you're off the paved road; many of the steps below assume RDNA 3 or newer.
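
Don't trust the marketing name on the box; pull the exact device ID and match it against the matrix. A quick check from any Linux session, live USB included (the output line is illustrative, not a promise):

```bash
# Show the exact GPU device ID to match against AMD's support matrix.
lspci -nn | grep -Ei 'vga|display'
# Illustrative output: ... Navi 31 [Radeon RX 7900 XTX] [1002:744c]
```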

When this is done you should have
Card confirmed in AMD's official ROCm support matrix for the version you intend to install. RX 7900 XTX / 7900 XT / 9070 / 9070 XT are the realistic 2026 targets.

Install Ubuntu LTS or Debian Stable, nothing else

ROCm is officially supported against specific Ubuntu LTS and RHEL-family kernels. Arch / Fedora / openSUSE work, sometimes, but you become the QA team. If you want a machine that survives a year of updates, Ubuntu LTS is the right answer. This isn't taste — it's how AMD ships.
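
Once the install is done, write down exactly what you're standing on. A minimal capture (output values are illustrative):

```bash
# Record the OS release and running kernel before ROCm touches anything.
lsb_release -ds    # e.g. Ubuntu 24.04.2 LTS
uname -r           # e.g. 6.8.0-xx-generic
```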

When this is done you should have
Ubuntu 22.04 / 24.04 LTS or Debian Stable installed. Kernel version recorded. No rolling-release temptation.

Pin the kernel and the driver to a known-good combo

The number-one ROCm pain story: an unattended kernel upgrade leaves the amdgpu DKMS module unable to rebuild against the new kernel's headers. The card no longer initializes and you're booting to recovery. The fix: hold the kernel packages, upgrade on a schedule, and test before committing.

Record the working combo (kernel X, ROCm Y, driver Z) in a file. When you upgrade later, you'll need it.
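
A sketch of both halves, assuming AMD's `amdgpu-dkms` package and a standard `/opt/rocm` install; adjust package names if you installed ROCm another way:

```bash
# Hold the kernel metapackages so unattended-upgrades can't pull a new kernel.
sudo apt-mark hold linux-generic linux-image-generic linux-headers-generic
apt-mark showhold    # confirm the holds took

# Append one line per known-good combo: kernel X, ROCm Y, driver Z.
echo "$(date -I) kernel=$(uname -r) rocm=$(cat /opt/rocm/.info/version) dkms=$(dpkg-query -W -f='${Version}' amdgpu-dkms)" \
  >> ~/rocm-known-good.txt
```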

When this is done you should have
Kernel packages on hold (`apt-mark hold linux-image-*`). amdgpu kernel module loaded. ROCm version installed and pinned.

Verify ROCm sees the card

Before you try a model, verify the card. rocminfo should list your GPU as a compute device. rocm-smi should report non-zero VRAM and reasonable idle temperature. If either of these fails, no inference framework will work — debug here, not in vLLM.

Common gotcha: a fresh ROCm install requires the user to be in the render and video groups. Check `groups` and re-login if needed.
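
A minimal check sequence, assuming a standard ROCm install (the `usermod` line is the fix for the group gotcha above; it only takes effect at next login):

```bash
# The GPU should appear as an agent with a gfx* target (gfx1100 = 7900 XTX/XT).
rocminfo | grep -E 'Marketing Name|gfx'

# Non-zero VRAM and a sane idle temperature, or stop here and debug.
rocm-smi --showmeminfo vram --showtemp

# Group fix: membership only applies after you log back in.
groups | grep -qE 'render' || sudo usermod -aG render,video "$USER"
```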

When this is done you should have
rocminfo lists your card. rocm-smi shows reasonable temps + memory. clinfo (or similar) confirms compute is available.

Run llama.cpp with the ROCm backend

llama.cpp is the most reliable runtime on consumer ROCm. Build with the HIP backend enabled (`cmake -DGGML_HIP=ON`; recent llama.cpp builds with CMake, not `make`) or fetch pre-built binaries that include HIP. Run a 7B Q4 model and check tokens per second. On a 7900 XTX expect roughly 70-100 tok/s on a 7B Q4 model; much lower means a build flag is wrong or the wrong kernels are being selected.
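
A build-and-smoke-test sketch, assuming ROCm's toolchain sits at the default paths; `gfx1100` covers the 7900 XTX/XT (swap it for your card), and the model path is yours to substitute:

```bash
git clone https://github.com/ggml-org/llama.cpp && cd llama.cpp

# HIP backend build targeting RDNA 3 (gfx1100 = RX 7900 XTX / XT).
HIPCXX="$(hipconfig -l)/clang" HIP_PATH="$(hipconfig -R)" \
  cmake -S . -B build -DGGML_HIP=ON -DAMDGPU_TARGETS=gfx1100 -DCMAKE_BUILD_TYPE=Release
cmake --build build -j

# Offload all layers (-ngl 99) and measure tokens per second.
./build/bin/llama-bench -m /path/to/model-7b-q4_k_m.gguf -ngl 99
```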

When this is done you should have
llama.cpp built with the HIP backend, a 7B model running fully on the GPU well above 30 tok/s (70-100 on a 7900 XTX). rocm-smi, the nvidia-smi equivalent, shows the card in use.

Try vLLM ROCm — and have a fallback ready

vLLM ROCm exists, works, and lags the CUDA path on features and stability. For batched serving, when it works, it's significantly faster than llama.cpp. For interactive single-stream use, the gap is much smaller. Try it; if it fights you for more than an hour, fall back to llama.cpp without shame.
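
The lowest-friction route on ROCm is usually AMD's prebuilt Docker image rather than a source build. A sketch assuming the `rocm/vllm` image; the model name is illustrative and must fit your VRAM:

```bash
# Expose the ROCm device nodes to the container (no special GPU runtime needed).
docker run -it --rm \
  --device /dev/kfd --device /dev/dri \
  --group-add video --security-opt seccomp=unconfined \
  -p 8000:8000 rocm/vllm \
  vllm serve Qwen/Qwen2.5-14B-Instruct --max-model-len 8192

# Smoke-test the OpenAI-compatible endpoint from another shell.
curl http://localhost:8000/v1/models
```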

When this is done you should have
vLLM ROCm wheel installed, a model loaded, the OpenAI-compatible endpoint running. Or: a documented fallback to llama.cpp if vLLM doesn't cooperate.

Pick a model that actually fits

7900 XTX 24GB comfortably handles a 32B model in Q4 with 8K-16K context. 7900 XT 20GB is comfortable at 14B and tight at 32B. Don't try to load 70B on a single 24GB card — you can with severe quantization, but the quality cost usually isn't worth it.
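
The arithmetic behind those claims, roughly: a Q4_K_M-class quant runs about 4.8 bits per weight, so weights alone take params × bits-per-weight / 8 gigabytes, and KV cache plus activations sit on top. A back-of-envelope check (4.8 is a typical value, not a constant):

```bash
# Weights-only VRAM estimate: params(B) * bits-per-weight / 8 = GB.
awk 'BEGIN {
  params_b = 32; bpw = 4.8            # 32B model, Q4_K_M-class quant
  gb = params_b * bpw / 8             # ~19.2 GB of weights
  printf "weights ~%.1f GB; the rest of a 24 GB card is KV-cache headroom\n", gb
}'
```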

When this is done you should have
A 13-32B model loaded in Q4-Q5, generating tokens at usable speed. Headroom verified for context.

Document the working stack and the rebuild path

ROCm setups are not stable across updates the way NVIDIA setups can be. Treat your working configuration as something you'll have to recreate, and write it down accordingly. Three months from now, future-you will be grateful when a kernel update breaks something and the recovery is a matter of following your own document.
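
A sketch of the snapshot half, assuming AMD's Ubuntu packaging; the rebuild procedure itself still has to be written by hand:

```bash
# Capture every version future-you needs to rebuild this machine.
{
  echo "## ROCm stack snapshot $(date -I)"
  echo "os:     $(lsb_release -ds)"
  echo "kernel: $(uname -r)"
  echo "rocm:   $(cat /opt/rocm/.info/version 2>/dev/null)"
  dpkg -l | awk '/rocm|amdgpu/ {print $2, $3}'
} >> ~/rocm-rebuild-notes.md
```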

When this is done you should have
A document with the exact kernel + ROCm + runtime versions, build flags, and the rebuild procedure for a fresh machine.

Next recommended step

Operator-grade Linux paths for ROCm and CUDA, with the maintenance and observability cross-cuts.