System guide · Operations

Observability for local AI

GPU utilization, VRAM leaks, KV-cache pressure, decode tok/s, queue depth, request latency. Prometheus + Grafana setup, dcgm-exporter, vLLM metrics, alert thresholds that actually catch the right failures.

By Fredoline Eruo · Reviewed 2026-05-07 · ~1,900 words

Why observability matters more than performance

A local-AI deployment that runs at 90% of theoretical peak with no monitoring will fail in production. A deployment that runs at 60% of peak with good monitoring will outlast it. The reason is mundane: you cannot fix what you cannot see, and the failure modes for local AI are slow-motion — VRAM creep, thermal drift, KV-cache pressure, queue depth growth — none of which show up as crashes until they suddenly do.

This guide covers the metrics that matter, the architecture that surfaces them, and the alert thresholds that catch the right failures without paging on noise.

The four signals that matter for local AI

Borrowed from Google's SRE book and adapted: latency, traffic, errors, saturation. For local AI specifically:

  • Latency. Time-to-first-token (TTFT) and decode tokens-per-second. End-to-end request latency p50/p95/p99.
  • Traffic. Concurrent request count, queued request count, tokens generated per minute.
  • Errors. OOM events, model load failures, tool-call parse failures, dropped requests.
  • Saturation. KV-cache utilization, GPU utilization, VRAM utilization, queue depth.

Cloud LLM operators monitor a similar set; the difference for local AI is that saturation hits earlier and harder. A 4090 saturating at 24 GB VRAM is one user away from OOM; an H100 cluster at 60% saturation has plenty of headroom.

Architecture: Prometheus + Grafana + dcgm-exporter + runtime metrics

The canonical local-AI observability stack in 2026:

  • Prometheus as the time-series database (homelab can run a single instance; production wants HA).
  • Grafana for dashboards and alerting.
  • nvidia-dcgm-exporter for GPU temp / power / utilization / ECC / memory-junction temp.
  • Runtime-native metrics from vLLM, SGLang, or your inference engine. Most modern engines expose /metrics in Prometheus format.
  • node_exporter for host CPU / RAM / disk / network.

For Apple Silicon, the picture is thinner — there's no equivalent of dcgm-exporter on the MLX side as of mid-2026. powermetrics sampled into Prometheus via a sidecar is the practical workaround. MLX-LM exposes basic metrics; you'll write a small adapter.
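
A minimal sketch of that sidecar, assuming the prometheus_client library and a powermetrics output format roughly like recent macOS releases print. The exact lines vary by macOS version, so the regexes and metric names below are illustrative, not a finished exporter:

  # Sketch: sample powermetrics periodically and re-expose the numbers as
  # Prometheus gauges. Metric names are made up for illustration; powermetrics
  # needs elevated privileges, so run the sidecar accordingly.
  import re
  import subprocess
  import time

  from prometheus_client import Gauge, start_http_server

  GPU_POWER_MW = Gauge("apple_gpu_power_milliwatts", "GPU power reported by powermetrics")
  CPU_POWER_MW = Gauge("apple_cpu_power_milliwatts", "CPU power reported by powermetrics")

  def sample_once() -> str:
      # One short sample; samplers and interval are adjustable.
      out = subprocess.run(
          ["powermetrics", "--samplers", "cpu_power,gpu_power", "-i", "1000", "-n", "1"],
          capture_output=True, text=True, check=True,
      )
      return out.stdout

  def export(text: str) -> None:
      # Recent macOS prints lines roughly like "GPU Power: 4321 mW".
      if m := re.search(r"GPU Power:\s*(\d+)\s*mW", text):
          GPU_POWER_MW.set(float(m.group(1)))
      if m := re.search(r"CPU Power:\s*(\d+)\s*mW", text):
          CPU_POWER_MW.set(float(m.group(1)))

  if __name__ == "__main__":
      start_http_server(9101)   # add localhost:9101 as a Prometheus scrape target
      while True:
          export(sample_once())
          time.sleep(15)        # roughly match the Prometheus scrape interval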

What to alert on (and what NOT to alert on)

The default failure mode of new observability stacks is alert fatigue. The discipline is to alert only on conditions that require action now. Examples that pass that test:

  • GPU temperature ≥ 84 °C for ≥ 5 minutes (thermal throttling imminent).
  • KV-cache utilization ≥ 90% for ≥ 60 seconds (next request OOMs).
  • p99 request latency ≥ 5× baseline for ≥ 10 minutes (something has broken).
  • Inference engine process not running (obvious, but easy to forget).
  • Disk free space < 10% on Docker volume mount (will become an outage in hours).

Examples not to alert on, despite the temptation: GPU utilization < 50%; cold-start latency on the first request after deploy; transient queue-depth spikes during agentic bursts. These are dashboards, not alerts. The alertable conditions above translate naturally into Prometheus queries; a sketch follows.
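
Here is one way to express a few of those thresholds as PromQL and evaluate them against Prometheus's standard /api/v1/query endpoint. In a real deployment these belong in Prometheus alerting rules with for: durations; the avg_over_time windows below stand in for that, and the metric and label names are assumptions to check against your exporter versions:

  # Hedged sketch: evaluate a few of the alert conditions above via the
  # Prometheus HTTP API. Label selectors (job names, mountpoints) are
  # assumptions; vllm:gpu_cache_usage_perc is assumed to be a 0-1 fraction.
  import requests

  PROMETHEUS = "http://localhost:9090/api/v1/query"

  ALERT_QUERIES = {
      "gpu_hot_5m":        "avg_over_time(DCGM_FI_DEV_GPU_TEMP[5m]) >= 84",
      "kv_cache_90pct_1m": "avg_over_time(vllm:gpu_cache_usage_perc[1m]) >= 0.9",
      "engine_down":       'up{job="vllm"} == 0',
      "disk_low": '(node_filesystem_avail_bytes{mountpoint="/var/lib/docker"}'
                  ' / node_filesystem_size_bytes{mountpoint="/var/lib/docker"}) < 0.10',
  }

  def firing(expr: str) -> bool:
      # A non-empty result vector means the condition holds right now.
      resp = requests.get(PROMETHEUS, params={"query": expr}, timeout=5)
      resp.raise_for_status()
      return bool(resp.json()["data"]["result"])

  for name, expr in ALERT_QUERIES.items():
      if firing(expr):
          print(f"ALERT {name}: {expr}")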

GPU metrics: temperature, power, ECC, memory junction

The dcgm-exporter metrics that earn their keep (a quick scrape sketch follows the list):

  • DCGM_FI_DEV_GPU_TEMP — die temp. Sustained ≥ 84 °C means more airflow.
  • DCGM_FI_DEV_MEMORY_TEMP — memory junction temp. The 4090's GDDR6X junction temperature creeps up over time as thermal pads age. ≥ 95 °C is the warning signal.
  • DCGM_FI_DEV_POWER_USAGE — sustained near TDP means PSU under stress.
  • DCGM_FI_DEV_ECC_UNCORRECTABLE_ERRORS — non-zero on consumer cards is rare. On datacenter cards it's a hardware-replacement signal.
  • DCGM_FI_DEV_FB_USED — VRAM used. Track over time; sustained creep is a memory leak.
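
Grafana is the usual consumer, but the same gauges can be read by hand. A small sketch, assuming dcgm-exporter's common default port 9400 and using prometheus_client's exposition parser:

  # Ad-hoc check of the DCGM gauges above, read straight from dcgm-exporter's
  # /metrics endpoint (port 9400 is the common default; adjust to your setup).
  import requests
  from prometheus_client.parser import text_string_to_metric_families

  WATCHED = {
      "DCGM_FI_DEV_GPU_TEMP",
      "DCGM_FI_DEV_MEMORY_TEMP",
      "DCGM_FI_DEV_POWER_USAGE",
      "DCGM_FI_DEV_ECC_UNCORRECTABLE_ERRORS",  # counters may carry a _total suffix
      "DCGM_FI_DEV_FB_USED",
  }

  text = requests.get("http://localhost:9400/metrics", timeout=5).text
  for family in text_string_to_metric_families(text):
      if family.name in WATCHED:
          for sample in family.samples:
              gpu = sample.labels.get("gpu", "?")
              print(f"gpu={gpu}  {sample.name} = {sample.value}")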

vLLM / SGLang / Ollama metrics

vLLM's Prometheus endpoint is the gold standard. The key series:

  • vllm:e2e_request_latency_seconds — request-level p50/p95/p99 latency.
  • vllm:gpu_cache_usage_perc — KV-cache utilization. Above 90% sustained = next OOM.
  • vllm:num_requests_running — concurrent decode count.
  • vllm:num_requests_waiting — queue depth. Above 0 sustained = capacity problem.
  • vllm:prompt_tokens_total + vllm:generation_tokens_total — throughput.

SGLang exposes similar metrics with a slightly different naming scheme; the prefix-cache hit rate is the metric that justifies SGLang over vLLM (high hit rate = your workload benefits; low hit rate = you might as well run vLLM).

Ollama in 2026 still doesn't expose Prometheus metrics natively. The workaround is to scrape /api/ps for loaded-model state and parse log output for token-rate signals. Not great, but Ollama's audience is solo users for whom dashboards are overkill anyway.
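
Such a sidecar is straightforward to sketch. This assumes the /api/ps response carries a models array with per-model VRAM fields; verify the exact field names (e.g. size_vram) against your Ollama version:

  # Minimal sidecar sketch: poll Ollama's /api/ps and re-expose loaded-model
  # state as Prometheus gauges. Field names like "size_vram" are assumptions
  # to verify against the /api/ps response of your Ollama version.
  import time

  import requests
  from prometheus_client import Gauge, start_http_server

  OLLAMA_PS = "http://localhost:11434/api/ps"

  LOADED = Gauge("ollama_loaded_models", "Models currently resident in memory")
  VRAM = Gauge("ollama_model_vram_bytes", "VRAM used by a loaded model", ["model"])

  def poll() -> None:
      models = requests.get(OLLAMA_PS, timeout=5).json().get("models", [])
      LOADED.set(len(models))
      for m in models:
          VRAM.labels(model=m.get("name", "unknown")).set(m.get("size_vram", 0))

  if __name__ == "__main__":
      start_http_server(9102)   # scrape target for Prometheus
      while True:
          poll()
          time.sleep(15)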

VRAM leak detection workflow

Real VRAM leaks are rare in mature inference engines. Apparent VRAM leaks are common, and almost always due to KV-cache that didn't release on request completion, model weights that got loaded twice, or fragmentation. The workflow to disambiguate (a trend-check sketch follows the steps):

  1. Plot DCGM_FI_DEV_FB_USED over a 24h window.
  2. If steady (saw-toothed but bounded), no leak — workload is just heavy.
  3. If monotonically increasing, restart the inference engine and watch. If the leak resumes immediately, it's the engine; if it takes hours, it's an upstream consumer.
  4. If you see fragmentation symptoms (OOM at the same context length that worked yesterday), restart the engine. See /systems/local-ai-maintenance.
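
Steps 2 and 3 amount to asking whether FB_USED has a sustained upward trend. One hedged way to automate the check is a predict_linear query against Prometheus; this assumes a single GPU, and the threshold is illustrative rather than calibrated:

  # Trend check for the workflow above: extrapolate the last 6h of
  # DCGM_FI_DEV_FB_USED 24h forward with predict_linear. A projection well
  # above current usage suggests creep rather than a bounded saw-tooth.
  # Assumes a single GPU; add a gpu="..." label selector otherwise.
  import requests

  PROM = "http://localhost:9090/api/v1/query"

  def scalar(expr: str) -> float:
      result = requests.get(PROM, params={"query": expr}, timeout=5).json()["data"]["result"]
      return float(result[0]["value"][1]) if result else 0.0

  current_mib = scalar("DCGM_FI_DEV_FB_USED")
  projected_mib = scalar("predict_linear(DCGM_FI_DEV_FB_USED[6h], 24 * 3600)")

  if projected_mib > current_mib * 1.2:   # illustrative threshold, not calibrated
      print(f"VRAM creep suspected: {current_mib:.0f} MiB now, ~{projected_mib:.0f} MiB in 24h")
  else:
      print("FB_USED looks bounded over the sampled window")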

KV-cache pressure interpretation

KV-cache utilization is the most operator-actionable metric in the local-AI dashboard. Low (< 30%) means your effective context budget is way under-used and you can extend context or take more concurrent users. Mid (30-70%) is the healthy band. High (70-90%) means you're one long prompt away from queueing. ≥ 90% sustained means OOM is coming.

The mistake to avoid: chasing 100% KV-cache utilization as a peak-throughput target. What matters is where utilization sits in steady state, not where it briefly peaks.
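
The bands are easy to encode if you want to annotate a dashboard or a periodic report. A tiny helper, assuming the usual 0-1 utilization gauge:

  # The KV-cache bands above as a helper function. Assumes the utilization
  # gauge (e.g. vllm:gpu_cache_usage_perc) is a 0-1 fraction.
  def kv_cache_band(usage: float) -> str:
      if usage < 0.30:
          return "under-used: room to extend context or admit more users"
      if usage < 0.70:
          return "healthy"
      if usage < 0.90:
          return "pressure: one long prompt away from queueing"
      return "critical: OOM likely on the next large request"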

Latency tracking: TTFT vs decode vs end-to-end

Three latency dimensions, each with a different failure mode:

  • TTFT (time-to-first-token). Dominated by prefill — model attention over the prompt. Long prompts × lots of concurrent users = TTFT explosion. Mitigation: prefix cache (SGLang).
  • Decode rate (tokens-per-second after first token). Dominated by GPU memory bandwidth. Falls when concurrent decode contention rises.
  • End-to-end latency (request submitted → completion). The user-felt metric. The sum of TTFT + (output_length / decode_rate) + queueing.

Track all three. If end-to-end latency rises but TTFT and decode are flat, queueing is the cause — capacity, not engine performance.
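
A back-of-envelope check makes the decomposition concrete; the numbers below are invented for illustration:

  # If observed end-to-end latency is much larger than TTFT plus
  # output_tokens / decode_rate, the gap is queueing: a capacity problem,
  # not an engine problem. All numbers here are illustrative.
  def expected_e2e_seconds(ttft_s: float, output_tokens: int, decode_tok_per_s: float) -> float:
      return ttft_s + output_tokens / decode_tok_per_s

  expected = expected_e2e_seconds(ttft_s=0.8, output_tokens=600, decode_tok_per_s=45.0)
  observed_p95 = 25.0
  queueing = max(0.0, observed_p95 - expected)
  print(f"expected ~{expected:.1f}s, implied queueing ~{queueing:.1f}s")   # ~14.1s and ~10.9s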

Logs: when Loki helps, when it doesn't

Logs are non-negotiable for production; optional for solo homelab. Loki is the right log store for a Prometheus-shaped stack — same labels, same retention semantics. The failure mode is volume: a chatty inference engine plus a chatty agent loop plus a chatty Open WebUI fills 50 GB / month easily. Set retention realistically (30 days for solo, 90 for production with audit-log retention concerns).

Failure modes specific to observability stacks

Observability stacks fail too:

  • Prometheus disk fill. The single most common observability outage. Set retention; alert on disk free.
  • Grafana auto-update. Grafana sometimes ships breaking changes to dashboards on minor bumps. Pin the image SHA.
  • dcgm-exporter version drift against driver. Match dcgm-exporter version to driver major version.
  • Alertmanager notification storms on transient blips. Use a for: duration (5m is a sane default) liberally so alerts fire only on sustained conditions.

vs cloud monitoring: where the patterns differ

Cloud LLM observability emphasizes per-tenant dimensions — usage by user, model, customer. Local AI emphasizes per-resource dimensions — VRAM, KV cache, GPU temp. In the cloud, the billing model is the source of truth; locally, resource saturation is. Stitch them carefully if you run a hybrid (LiteLLM gateway in front of both local + cloud).

Bringing it together: an example dashboard

A working homelab dashboard has five panels (the queries behind them are sketched after the list):

  • GPU temp + power (DCGM time series, last 24h).
  • KV-cache utilization (vLLM gauge + 24h sparkline).
  • Request latency p50/p95/p99 (vLLM histogram, last 1h).
  • Concurrent / queued requests (vLLM gauges).
  • Token throughput per minute (vLLM counter rate).
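
One plausible set of queries behind those panels, kept as plain PromQL strings for reference; rate windows and label selectors are assumptions to adapt to your scrape config:

  # Reference PromQL for the five panels above (illustrative, not canonical).
  PANEL_QUERIES = {
      "gpu_temp_and_power": [
          "DCGM_FI_DEV_GPU_TEMP",
          "DCGM_FI_DEV_POWER_USAGE",
      ],
      "kv_cache_utilization": ["vllm:gpu_cache_usage_perc"],
      "request_latency_p50_p95_p99": [
          "histogram_quantile(0.50, rate(vllm:e2e_request_latency_seconds_bucket[5m]))",
          "histogram_quantile(0.95, rate(vllm:e2e_request_latency_seconds_bucket[5m]))",
          "histogram_quantile(0.99, rate(vllm:e2e_request_latency_seconds_bucket[5m]))",
      ],
      "concurrent_and_queued": ["vllm:num_requests_running", "vllm:num_requests_waiting"],
      "tokens_per_minute": ["rate(vllm:generation_tokens_total[5m]) * 60"],
  }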

Those five panels are enough to keep a single-workstation deployment honest. For multi-user serving (see /workflows/multi-user-local-ai-server), add per-user latency breakdown + cost-equivalent panels. For homelab API gateways (see /workflows/homelab-ai-api), add per-key throughput.

Adjacent: maintenance covers what to do when the dashboard reveals a problem; security covers the audit-log retention requirements.