Methodology
Editorial · Last reviewed 2026-05-08

Regression methodology

How /benchmarks/regressions reaches its possible regression / possible improvement / insufficient evidence labels — the rules, the inputs, and the reasons we never publish "X regressed by Y%" as fact.

Why we don't publish confirmed regressions

A single divergent measurement between two runtime versions is never enough to claim a regression. Real-world benchmark divergence has at least four causes other than the version change: different operators, different hardware lots, different OS / driver / kernel versions, and pure measurement noise (room temperature, thermal envelope, GPU age, PCIe lane count). A confirmed regression requires ruling out all of these.

Public "regression confirmed" claims do real damage — maintainers get blamed for noise, contributors stop submitting because they don't want to be the source of a witch-hunt, and editorial credibility burns. The conservative path: surface the candidate, name the missing evidence, invite reproduction. If two independent operators reproduce, the candidate becomes a confirmed change in the editorial record. Until then it stays a candidate.

What counts as a paired measurement

The detector compares two benchmark rows that match on the (model, hardware, runtime, quant, context) tuple but differ on exactly ONE version axis (e.g. vLLM 0.7.0 vs vLLM 0.7.3, everything else identical). If multiple axes vary, the comparison is rejected: there's no way to attribute the delta.

Both rows must come from the editorial benchmark corpus or the reproduced community corpus. Pending and rejected submissions are excluded — we don't want a single unreviewed outlier to manufacture a candidate.
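To make the pairing rule concrete, here is a minimal Python sketch of it. The schema is an assumption for illustration (the `BenchRow` type, the `status` values, and the `versions` mapping are hypothetical names, not the detector's published internals):

```python
from dataclasses import dataclass, field
from typing import Optional

# Identity tuple: both rows must match on all five keys.
IDENTITY_KEYS = ("model", "hardware", "runtime", "quant", "context")

@dataclass
class BenchRow:
    model: str
    hardware: str
    runtime: str          # runtime name; its version lives in `versions`
    quant: str
    context: int
    tok_s: float          # measured throughput
    status: str           # "editorial", "reproduced", "pending", "rejected"
    versions: dict = field(default_factory=dict)  # version axis -> version string

def eligible(row: BenchRow) -> bool:
    # Only the editorial and reproduced-community corpora may seed a candidate.
    return row.status in ("editorial", "reproduced")

def paired_axis(a: BenchRow, b: BenchRow) -> Optional[str]:
    """Return the single differing version axis, or None if the pair is rejected."""
    if not (eligible(a) and eligible(b)):
        return None
    if any(getattr(a, k) != getattr(b, k) for k in IDENTITY_KEYS):
        return None       # identity tuple must match exactly
    differing = [ax for ax in set(a.versions) | set(b.versions)
                 if a.versions.get(ax) != b.versions.get(ax)]
    # Exactly ONE axis may vary; zero or several means the delta can't be attributed.
    return differing[0] if len(differing) == 1 else None
```

Note that this sketch treats an axis recorded on one side but missing on the other as a difference; how the detector handles missing slots is described below.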

Version axes we track

The detector knows about twelve version axes. A delta is attributed to whichever axis differs:

  • Runtime version (vLLM, llama.cpp, Ollama, MLX, SGLang, TensorRT-LLM, ExLlamaV2)
  • Driver version (NVIDIA driver, AMDGPU, Apple GPU driver)
  • CUDA version
  • ROCm version
  • Metal version
  • PyTorch / underlying framework version
  • OS major version
  • Kernel version (Linux only)
  • Quantization library version (llama.cpp's ggml format, AWQ, GPTQ)
  • Tokenizer / chat-template version (when separately versioned)
  • Flash-attention version (when used)
  • Engine config version (e.g. paged-attention block size, batch policy)

When operators submit benchmarks, each axis gets a slot in the row. Missing axes count as missing evidence — the candidate is still surfaced, but with the missing fields listed so the reproducer knows what to record.
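A sketch of that missing-evidence check, with the caveat that the axis names below are illustrative shorthand rather than the submission form's actual field names:

```python
# The twelve tracked version axes, in shorthand. Not every axis applies to
# every rig (e.g. rocm on an NVIDIA host); the sketch ignores applicability.
VERSION_AXES = (
    "runtime", "driver", "cuda", "rocm", "metal", "framework",
    "os_major", "kernel", "quant_lib", "tokenizer", "flash_attn", "engine_config",
)

def missing_axes(versions: dict) -> list:
    """Axes the operator did not record; listed alongside the candidate."""
    return [ax for ax in VERSION_AXES if not versions.get(ax)]
```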

Threshold rules

|Δ| ≥ 15%: the candidate is labeled possible regression (negative delta) or possible improvement (positive delta). 15% is wider than any plausible measurement noise on a warm-cache, same-rig comparison.

5% ≤ |Δ| < 15%: the comparison is labeled insufficient evidence; the delta is real-looking but within plausible operator noise. We ask for a reproduction before promoting.

|Δ| < 5%: suppressed entirely. Below this floor the comparison is unlikely to mean anything actionable.

Sample size: if either side has fewer than 2 measurements, we surface the candidate but downgrade the confidence to insufficient regardless of the delta.
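The threshold logic collapses to a few lines. A minimal sketch, assuming `delta_pct` is the signed percentage change in tok/s from the older to the newer version and `n_old`/`n_new` are the measurement counts on each side:

```python
from typing import Optional

def classify(delta_pct: float, n_old: int, n_new: int) -> Optional[str]:
    if abs(delta_pct) < 5:
        return None                      # suppressed entirely: below the floor
    if min(n_old, n_new) < 2:
        return "insufficient evidence"   # surfaced, but downgraded on sample size
    if abs(delta_pct) < 15:
        return "insufficient evidence"   # real-looking, within operator noise
    return "possible regression" if delta_pct < 0 else "possible improvement"
```

One reading of the rules above: suppression is applied before the sample-size downgrade, so a single-measurement 20% swing is surfaced but never promoted past insufficient evidence.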

Label semantics

  • Possible regression — tok/s dropped ≥15% on a single-axis paired comparison. Real regressions usually look like this; so do operator error, thermal-throttle pairings, and lot-variance pairings. Reproduction is the gate.
  • Possible improvement — tok/s rose ≥15%. We surface improvements at the same bar we surface regressions. Asymmetric thresholds would launder vendor-friendly framing.
  • Insufficient evidence — either the delta is in the 5-15% noise band, or one side has too few measurements, or a missing version axis means we can't attribute the delta cleanly.

Confirming or dismissing

A candidate becomes a confirmed change only after an independent operator reproduces the comparison and the reproduction lands in the same direction. The reproduction must:

  1. Use the same model, hardware tier, quantization, and context length.
  2. Differ from the original ONLY on the same version axis.
  3. Land within ±10% of the candidate's newer measurement.
  4. Pass moderation under the standard benchmark policy.

Until those conditions are met, the candidate stays a candidate. The reproduction queue surfaces these comparisons as priority targets for community operators.
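A sketch of the confirmation gate, reusing the hypothetical `BenchRow`, `IDENTITY_KEYS`, and `paired_axis` helpers from the pairing sketch above (it also equates the hardware tier with the `hardware` identity key, a simplification):

```python
def confirms(cand_old: BenchRow, cand_new: BenchRow,
             rep_old: BenchRow, rep_new: BenchRow,
             moderation_passed: bool) -> bool:
    axis = paired_axis(cand_old, cand_new)
    # Conditions 1 and 2: same identity tuple, same single version axis.
    if axis is None or paired_axis(rep_old, rep_new) != axis:
        return False
    if any(getattr(rep_new, k) != getattr(cand_new, k) for k in IDENTITY_KEYS):
        return False
    # Condition 3: newer-side tok/s within ±10% of the candidate's newer measurement.
    if abs(rep_new.tok_s - cand_new.tok_s) > 0.10 * cand_new.tok_s:
        return False
    # Condition 4: the reproduction passed standard benchmark moderation.
    return moderation_passed
```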

Known noise sources

  • Thermal envelope drift (cold rig vs warm rig: 5-12% tok/s difference is common)
  • GPU lot variance (same SKU, different silicon batch: 3-7%)
  • PCIe lane count differences on the host (x16 vs x8 mid-load)
  • Background processes (browser tabs, monitoring agents — measurable on Apple Silicon)
  • Driver-side power-management state at measurement start
  • Sampler/temperature drift if not pinned exactly
  • Tokenizer template differences when comparing across runtimes (chat-template drift between vLLM and Ollama)

What this engine cannot do

  • Detect regressions where every operator upgraded together (no paired data)
  • Detect regressions in correctness — only throughput. Use /evaluations for accuracy / pass@1 / MMLU-style signal.
  • Attribute multi-axis regressions (e.g. driver upgrade AND CUDA upgrade in the same run)
  • Detect mobile / NPU regressions while the mobile-edge corpus is still thin
  • Tell you whether the regression is intentional (engine team made a tradeoff) or accidental