Methodology · Anti-patterns

Local AI benchmarking mistakes — the ten that show up everywhere

The negative-space companion to the benchmarking pillar. Batch mismatch, prompt mismatch, quant mismatch, driver mismatch, thermal throttling, cherry-picking, conflating eval scores with throughput, ignoring TTFT, cross-runtime comparisons without cohort matching, and not recording the command line. Each mistake with the honest version next to it.

By Fredoline Eruo · Last reviewed 2026-05-08 · ~1,200 words

Why these patterns hide

Almost no one publishes a wrong benchmark on purpose. The mistakes below are unintentional — the result of measuring with one configuration and reading the number under a different one. The fix in every case is the same: pin the cohort, label what you measured, and don't let the framing drift past the configuration. This is the negative-space companion to /guides/how-to-benchmark-local-ai; read that one first if you haven't.

1. Batch mismatch

Comparing a single-stream tok/s number to a batch=N tok/s number. Batched serving in vLLM, SGLang, or TensorRT-LLM can deliver 4-8× the aggregate throughput of a single-stream llama.cpp run on the same GPU because the same weight reads serve multiple decode steps. The single-stream user does not see that aggregate. Both numbers are correct; comparing them as if they describe the same workload is the mistake.

The honest version: “Single-stream: 35-45 tok/s on llama.cpp. Aggregate batch=8 throughput: ~220 tok/s on vLLM, ~28 tok/s per stream perceived by each user.” Two numbers, two cohorts, no implied equivalence.
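If it helps to see the arithmetic, here is a minimal sketch of the conversion, plugging in the illustrative figures above. It assumes fully saturated, fairly scheduled streams, so treat the per-stream number as an upper bound:

```python
def per_stream_tok_s(aggregate_tok_s: float, concurrent_streams: int) -> float:
    """Rough per-user decode rate under continuous batching.

    Assumes saturated, fairly scheduled streams; real servers interleave
    prefill with decode, so this is an upper bound on what a user feels.
    """
    return aggregate_tok_s / concurrent_streams

# The two cohorts from the example above; illustrative figures, not measurements.
single_stream = 40.0                          # llama.cpp, batch=1
aggregate_batch8 = 220.0                      # vLLM, batch=8, summed across streams
print(per_stream_tok_s(aggregate_batch8, 8))  # ~27.5 tok/s felt by each user
```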

2. Prompt mismatch

TTFT and prefill tok/s scale with prompt length. A 200-token prompt and a 4,000-token prompt produce different latency profiles on the same hardware. Comparing “my 4090 hits 0.3s TTFT” against someone else's “my 4090 hits 1.8s TTFT” without specifying the prompt length is comparing different measurements that happen to share a unit.

The honest version: “TTFT 0.3s at 200-token prompt; 1.8s at 4,096-token prompt; same model, same runtime, same GPU.” The prompt length is part of the cohort.
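One way to keep prompt length inside the cohort is to sweep it explicitly rather than measure at whatever length you happened to type. A sketch using llama-bench from llama.cpp; the model path is a placeholder, and the flags should be checked against llama-bench --help on your build:

```python
import subprocess

MODEL = "qwen2.5-32b-q4_k_m.gguf"  # hypothetical path; substitute your own file

# Make prompt length an explicit axis of the cohort instead of a hidden
# variable. Flags per recent llama.cpp builds: -p prompt tokens,
# -n generation tokens, -r repetitions.
for n_prompt in (200, 1024, 4096):
    subprocess.run(
        ["llama-bench", "-m", MODEL, "-p", str(n_prompt), "-n", "128", "-r", "5"],
        check=True,
    )
# Report each result as "prefill/TTFT at <n_prompt>-token prompt",
# never as a bare number.
```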

3. Quant mismatch

Q4_K_M (GGUF), AWQ-INT4, GPTQ-INT4, EXL2-4.5bpw, and MLX-4bit are all “4-bit” in the loose sense, and none of them produce the same model. Q4_K_M uses mixed precision per layer; AWQ runs different kernel paths; EXL2 has variable bits-per-weight assignments; MLX-4bit is its own scheme. Comparing tok/s across these without naming them precisely lets a cohort mismatch sneak in unannounced.

The honest version: “Qwen2.5-32B Q4_K_M (GGUF, llama.cpp): 25 tok/s. Qwen2.5-32B AWQ-INT4 (vLLM): 38 tok/s. Different quantizations, different runtimes, do not interchange.”
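A small record type makes the precise naming hard to skip. The field names here are ours, not any standard:

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class QuantLabel:
    """A 'what exactly did you run' record; 'int4' alone is not enough."""
    model: str    # e.g. "Qwen2.5-32B"
    scheme: str   # e.g. "Q4_K_M", "AWQ-INT4", "EXL2-4.5bpw", "MLX-4bit"
    runtime: str  # e.g. "llama.cpp b3812", "vLLM 0.6.3"

a = QuantLabel("Qwen2.5-32B", "Q4_K_M", "llama.cpp b3812")
b = QuantLabel("Qwen2.5-32B", "AWQ-INT4", "vLLM 0.6.3")
assert a != b  # same model name, different artifacts; do not interchange
```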

4. Driver mismatch

A benchmark measured on NVIDIA driver 555.x and reproduced on 570.x can swing 5-15% in either direction. ROCm releases routinely move llama.cpp throughput by 10-25%. Apple Silicon updates can change MLX kernels overnight. If you don't pin and report the driver/runtime stack, your reproduction will look like a measurement disagreement when it's actually a stack difference.

The honest version: “Measured on NVIDIA driver 570.86.10, CUDA 12.6, vLLM 0.6.3.post1, Linux kernel 6.8.0.” Anyone who reproduces against a different stack labels their attempt as a different cohort. See /resources/versioned-benchmarking.
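A sketch of capturing that stack at measurement time. This version is NVIDIA-specific; swap the driver query for rocm-smi or sw_vers on other platforms:

```python
import platform
import subprocess

def stack_fingerprint() -> dict:
    """Record the stack the number depends on, at the moment of measurement."""
    driver = subprocess.check_output(
        ["nvidia-smi", "--query-gpu=driver_version", "--format=csv,noheader"],
        text=True,
    ).strip()
    return {
        "driver": driver,
        "kernel": platform.release(),
        # Pin runtime versions from the environment too, e.g.
        # importlib.metadata.version("vllm")
    }

print(stack_fingerprint())
```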

5. Thermal throttling

Laptop GPUs and desktops with constrained cooling throttle under sustained inference load. A 30-second burst on a Razer Blade or an MSI gaming laptop will produce a number that the same machine cannot hold for two minutes. Reporting that 30-second number as the throughput is reporting the boost clock, not the workload.

The honest version: “Burst (first 30s, cold): 42 tok/s. Sustained (5 min continuous generation, GPU 78°C, no fan policy override): 31 tok/s.” Both numbers, labeled. Sustained is what an operator actually feels.
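One way to see the throttle is to bucket generation into fixed windows rather than averaging the whole run. A minimal sketch, assuming you log (timestamp, tokens generated) samples during a long generation:

```python
def windowed_tok_s(samples: list[tuple[float, int]],
                   window_s: float = 30.0) -> list[float]:
    """Throughput per fixed time window from (timestamp_s, tokens) samples.

    If the first window is much faster than the last, you measured the
    boost clock, not the workload.
    """
    if not samples:
        return []
    start = samples[0][0]
    buckets: dict[int, int] = {}
    for t, tokens in samples:
        i = int((t - start) // window_s)
        buckets[i] = buckets.get(i, 0) + tokens
    return [buckets.get(i, 0) / window_s for i in range(max(buckets) + 1)]
```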

6. Cherry-picking

Run five times, report the fastest. This is not measurement; it's selecting a sample after the fact. The right approach is to report the median and the standard deviation across at least five runs, with the cold first run discarded as warmup. llama-bench -r 5 repeats the run and reports a mean ± stddev for you; vLLM's benchmark_serving reports percentiles by default.

The honest version: “5 runs after 1 warmup. Median 38.4 tok/s, stddev 1.2 tok/s, range 36.9-40.1.” Same data, completely different signal.
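The summary takes a few lines with the standard library. The input numbers below are chosen to echo the example above, not measured:

```python
import statistics

def summarize(tok_s_runs: list[float]) -> str:
    """Median + stddev after discarding the cold first run as warmup."""
    warm = tok_s_runs[1:]
    return (f"{len(warm)} runs after 1 warmup. "
            f"Median {statistics.median(warm):.1f} tok/s, "
            f"stddev {statistics.stdev(warm):.1f} tok/s, "
            f"range {min(warm):.1f}-{max(warm):.1f}.")

print(summarize([33.5, 38.4, 39.0, 36.9, 40.1, 38.2]))  # first run is warmup
```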

7. Conflating eval scores with throughput

MMLU, GPQA, and HumanEval measure correctness. Tok/s measures speed. They are different axes, and a benchmark that mixes them — “the 32B model is faster and smarter than the 14B” — is reporting two unrelated measurements as one. A Q2_K quantization can be faster than Q4_K_M while scoring measurably worse on lm-evaluation-harness. A coding model can be slower than a general model while solving more HumanEval problems. Throughput says nothing about quality.

The honest version: “Throughput: 38 tok/s. Quality: HumanEval 71.3%, MMLU 78.1% (lm-evaluation-harness commit a3f2b1).” Two columns, both pinned, neither implies the other.
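If it helps, the honest report is literally two separate columns. A sketch of that shape using the figures from the example above; the field names are ours:

```python
# Keep the two axes as separate, separately pinned columns.
result = {
    "throughput": {"tok_s": 38.0, "cohort": "Q4_K_M, llama.cpp, single-stream"},
    "quality": {
        "humaneval_pass": 0.713,
        "mmlu": 0.781,
        "harness_commit": "a3f2b1",  # pin the eval harness too
    },
}
# Nothing in result["throughput"] can be derived from result["quality"],
# or vice versa.
```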

8. Ignoring TTFT

Decode tok/s is the single number most benchmarks publish. For a chat session it's the right one — but agent loops, voice interfaces, and streaming UIs are dominated by latency to first token. A runtime that decodes at 50 tok/s but takes 1.5 seconds to begin streaming feels worse for an agent loop than a runtime that decodes at 35 tok/s but begins streaming in 200ms. Measuring decode-only flatters the runtime that prefills slowly and hurts the runtime that prefills fast.

The honest version: “TTFT median 0.34s at 512-token prompt. Decode 38 tok/s sustained. For agent loops with short prompts, latency dominates; for long-form generation, decode dominates.”
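Measuring TTFT takes little more than a stopwatch around a streaming request. A sketch against an OpenAI-compatible endpoint, which vLLM and llama.cpp's server both expose; the URL and model name are placeholders:

```python
import time
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def measure_ttft(prompt: str, model: str) -> float:
    """Seconds from request send to the first streamed chunk.

    The first chunk may carry only role metadata; for a stricter TTFT,
    wait for the first chunk with non-empty content.
    """
    start = time.perf_counter()
    stream = client.chat.completions.create(
        model=model,
        messages=[{"role": "user", "content": prompt}],
        stream=True,
        max_tokens=64,
    )
    for _ in stream:  # first chunk stops the TTFT clock
        return time.perf_counter() - start
    return float("nan")
```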

9. Cross-runtime comparison without context

“vLLM does 80 tok/s, llama.cpp does 60.” Without specifying that vLLM was running batch=8 with AWQ-INT4 on a continuous-batching server while llama.cpp was running single-stream Q4_K_M with no batching, the comparison is meaningless. Both numbers are correct for their cohorts. The cohorts aren't the same workload.

The honest version: “vLLM 0.6.3, AWQ-INT4, batch=8 continuous-batching server: 80 tok/s aggregate, ~10 tok/s perceived per stream. llama.cpp b3812, Q4_K_M, single-stream: 60 tok/s. Different deployment shapes — pick the one that matches your workload.” See /benchmarks/cohorts for how the catalog enforces cohort matching.
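A guard that refuses mismatched comparisons is one way to enforce this mechanically. The cohort keys below are illustrative, not a schema the catalog prescribes:

```python
def compare_tok_s(a: dict, b: dict) -> str:
    """Refuse to compare throughput numbers whose cohorts differ."""
    cohort_keys = ("quant", "batch", "prompt_tokens", "serving_mode")
    mismatched = [k for k in cohort_keys if a.get(k) != b.get(k)]
    if mismatched:
        raise ValueError(f"not the same workload, refusing to compare: {mismatched}")
    return f"{a['tok_s']} vs {b['tok_s']} tok/s on a matched cohort"
```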

10. Not recording the command line

The command you ran is the only artifact that lets someone else reproduce the measurement. “I ran Qwen2.5-32B on a 4090 and got 30 tok/s” is not a benchmark. A benchmark is the command, the runtime version, the driver state, the model file checksum, the prompt, the output token count, and the median+stddev across runs. Without that bundle, the number is a vibe, not data.

The honest version: Publish the full reproduction artifact. /resources/reproduction-guide walks through what that looks like; /benchmarks/reproduce is the live queue of operators independently re-running each other's measurements.
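A sketch of the minimum bundle, assuming you already have per-run tok/s numbers in hand; extend the dict with whatever else your number depends on:

```python
import hashlib
import json
import platform
import statistics

def reproduction_artifact(cmd: list[str], model_path: str,
                          tok_s_runs: list[float]) -> str:
    """The minimum bundle that turns a number into a benchmark."""
    h = hashlib.sha256()
    with open(model_path, "rb") as f:                     # checksum in chunks,
        for chunk in iter(lambda: f.read(1 << 20), b""):  # not all in RAM
            h.update(chunk)
    warm = tok_s_runs[1:]  # discard the cold warmup run
    return json.dumps({
        "command": cmd,
        "model_sha256": h.hexdigest(),
        "kernel": platform.release(),
        "median_tok_s": statistics.median(warm),
        "stddev_tok_s": statistics.stdev(warm),
        # also pin: runtime version, driver, prompt, output token count
    }, indent=2)
```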

The pattern under all ten

Every mistake on this list is a cohort that drifted past the measurement. The fix is the same in every case: write down what you measured, pin every variable that the number depends on, and label the comparison cohort honestly. If you can't name the cohort, the number doesn't mean anything yet. If you can name it, it can graduate up the trust ladder at /trust/benchmarks and earn a confidence tier from /resources/confidence-methodology.

Next recommended step

The how-to that pairs with this list of how-not-to: /guides/how-to-benchmark-local-ai.

Benchmarking errors compound fast: a 2-bit quant on one GPU versus an 8-bit quant on another tells you nothing about hardware performance — it only tells you quantization is the dominant variable. Comparing results across hardware classes, quantization levels, and context lengths without controlling for those variables produces numbers that mislead more than they inform. Standardized methodology matters more than the GPU model name printed on the chart.

The standardized baselines worth comparing against: best GPU for Llama and custom GPU comparison.