Will-It-Run methodology

The exact math behind our hardware compatibility predictions. Nothing hidden, nothing hand-waved. If a number on the site is wrong, it's wrong here too — we'd rather you find the bug than trust us blindly.

Last reviewed: May 6, 2026

Every prediction on /will-it-run is one of three things:

  • Measured — we ran the exact combination on owner hardware
  • Extrapolated — we have a measurement on similar-bandwidth hardware and scaled it
  • Estimated — we computed it from first principles using the formulas below

Each row of the results page tells you which it is. Trust the green badges most, the gray badges least.

VRAM accounting

The total memory needed to load and run a model is:

total_vram = weights + kv_cache + activations + runtime_overhead

Weights

Calculated from the model's parameter count and quantization precision:

weights_gb = (parameter_count_b * bits_per_param) / 8

The non-obvious part: bits_per_param isn't always what the quant name suggests. Q4_K_M keeps a subset of the attention and feed-forward tensors at 6-bit precision and the rest of the model at 4-bit — its effective average works out to 4.83 bits/param, not 4. The same logic applies to the other K-quant variants. We use these calibrated values, not naive ones.
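A minimal sketch of the calculation in Python — the bits-per-param table here is illustrative; the engine's calibrated table covers many more quant formats:

EFFECTIVE_BITS = {
    "Q4_K_M": 4.83,  # K-quant mix of 6-bit and 4-bit tensors
    "Q5_K_M": 5.7,   # illustrative; the engine holds the calibrated value
    "Q8_0": 8.5,     # illustrative
    "FP16": 16.0,
}

def weights_gb(parameter_count_b: float, quant: str) -> float:
    """Weight footprint in GB from parameter count (in billions) and quant name."""
    return parameter_count_b * EFFECTIVE_BITS[quant] / 8

# Llama 3.1 8B (8.03B params) at Q4_K_M: ~4.85 GB, consistent with the ~4.9 GB
# figure used in the speed example further down.
print(round(weights_gb(8.03, "Q4_K_M"), 2))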

KV cache (the part most calculators get wrong)

Every token in the context window costs key-value cache memory. With Grouped-Query Attention (GQA, used by nearly every current model), the formula is:

kv_cache_bytes = 2 * num_layers * num_kv_heads * head_dim
                * context_length * batch_size * bytes_per_element

The 2 accounts for both K and V tensors. The crucial detail: it uses num_kv_heads, NOT num_attention_heads. For Llama 3.1 8B that's 8 KV heads vs 32 attention heads — a 4× smaller cache than a naive calculator would predict. Llama 3.3 70B has 80 layers × 8 KV heads × 128 head_dim, which means at FP16 each token of context costs 2 × 80 × 8 × 128 × 2 = 327,680 bytes, or 320 MB per 1K tokens.

FP8 KV cache (newer runners on Hopper/Ada hardware) halves this. INT4 KV cache quarters it.
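A sketch of the GQA-aware calculation (the field names are illustrative, not our exact schema):

def kv_cache_gb(num_layers, num_kv_heads, head_dim, context_length,
                batch_size=1, bytes_per_element=2):
    """KV cache in GB. bytes_per_element: 2 = FP16, 1 = FP8, 0.5 = INT4."""
    kv_bytes = (2 * num_layers * num_kv_heads * head_dim
                * context_length * batch_size * bytes_per_element)
    return kv_bytes / 1e9

# Llama 3.3 70B (80 layers, 8 KV heads, head_dim 128) at FP16 with an 8K context:
# ~2.7 GB. The same context with INT4 KV would be ~0.67 GB.
print(round(kv_cache_gb(80, 8, 128, 8192), 2))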

Activations and runtime overhead

Activation memory empirically scales as 0.05 × weights_gb + 0.000001 × context_length (both terms in GB) — about 5% of weights, plus ~1 KB per context token for intermediate buffers.

Runtime overhead (CUDA driver, kernel buffers, allocator fragmentation) is fixed per backend: ~1.8 GB for CUDA, ~2.0 GB for ROCm, ~0.7 GB for Metal/MLX, ~0.5 GB for CPU paths.
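Putting the pieces together, a rough end-to-end budget check — a sketch, with the overhead table mirroring the numbers above:

RUNTIME_OVERHEAD_GB = {"cuda": 1.8, "rocm": 2.0, "metal": 0.7, "cpu": 0.5}

def total_vram_gb(weights_gb, kv_cache_gb, context_length, backend="cuda"):
    # ~5% of weights plus ~1 KB per context token for intermediate buffers
    activations_gb = 0.05 * weights_gb + context_length / 1e6
    return weights_gb + kv_cache_gb + activations_gb + RUNTIME_OVERHEAD_GB[backend]

# Llama 3.1 8B at Q4_K_M with an 8K context on CUDA:
# 4.85 (weights) + 0.54 (KV) + 0.25 (activations) + 1.8 (overhead) ≈ 7.4 GB
print(round(total_vram_gb(4.85, 0.54, 8192), 2))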

Speed prediction

Decode speed (tokens per second)

LLM token generation is memory-bandwidth bound, not compute bound. Every token requires reading the full model weights from memory. So the theoretical ceiling is:

peak_tps = bandwidth_gb_per_s / weights_gb

Real-world tok/s is some fraction of peak depending on the runner and backend:

  • llama.cpp on CUDA — ~65% of peak (the most common path)
  • ExLlamaV2 on CUDA — ~80% (highest, custom kernels)
  • vLLM on CUDA — ~70% single-stream
  • llama.cpp on ROCm — ~55%
  • MLX on Metal — ~70%
  • llama.cpp Vulkan — ~45%
  • CPU only — ~45%
  • Hybrid GPU/CPU offload — ~30% (the harmonic-mean penalty)

Example: an RTX 5080 has 960 GB/s bandwidth. Llama 3.1 8B at Q4_K_M is 4.9 GB. Theoretical peak: 960 / 4.9 ≈ 196 tok/s. With llama.cpp CUDA at 65% efficiency, predicted ~127 tok/s. Real measurements typically land within 15% of that.
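In code, the decode prediction is just the bandwidth ceiling scaled by the runner factor. A sketch using the efficiency table above (the runner keys are illustrative):

EFFICIENCY = {
    "llama.cpp-cuda": 0.65, "exllamav2-cuda": 0.80, "vllm-cuda": 0.70,
    "llama.cpp-rocm": 0.55, "mlx-metal": 0.70, "llama.cpp-vulkan": 0.45,
    "cpu": 0.45, "hybrid-offload": 0.30,
}

def decode_tps(bandwidth_gb_per_s, weights_gb, runner="llama.cpp-cuda"):
    """Predicted decode speed: bandwidth ceiling scaled by runner efficiency."""
    return bandwidth_gb_per_s / weights_gb * EFFICIENCY[runner]

# RTX 5080 (960 GB/s), Llama 3.1 8B at Q4_K_M (4.9 GB): ~127 tok/s
print(round(decode_tps(960, 4.9)))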

Time to first token (TTFT)

Prefill (processing the prompt before generating any tokens) is compute-bound, not memory-bound. The full prompt is processed in parallel, so the GPU's matrix-multiplication throughput is the bottleneck:

ttft_ms = (prompt_tokens * parameter_count_b * 2) / (compute_tflops * efficiency)

The factor of 2 is the ~2 FLOPs per parameter per prompt token; with parameter_count_b in billions and compute in TFLOPS, the result comes out directly in milliseconds. For most consumer cards on a typical 512-token prompt, this is well under a second. Old cards or huge prompts can push it past 2 seconds — at which point we surface "TTFT: slow" on the row. We currently predict TTFT only when the hardware has known FP16 TFLOPS in our database.
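A sketch — the 50% compute-efficiency default here is an assumption for illustration, not the engine's calibrated per-backend value:

def ttft_ms(prompt_tokens, parameter_count_b, fp16_tflops, efficiency=0.5):
    """Prefill time in ms: ~2 FLOPs per parameter per prompt token."""
    return (prompt_tokens * parameter_count_b * 2) / (fp16_tflops * efficiency)

# 512-token prompt on an 8B model, a card with ~50 FP16 TFLOPS: ~330 ms
print(round(ttft_ms(512, 8, 50)))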

CPU offload speed

When a model doesn't fit fully in VRAM, llama.cpp can split layers between GPU and CPU. The combined speed is the harmonic mean weighted by layer fraction:

combined_tps = 1 / (gpu_frac/gpu_tps + cpu_frac/cpu_tps)

CPU tok/s depends on memory bandwidth class — dual-channel DDR5-5600 is ~80 GB/s effective, dual-channel DDR4-3200 ~40 GB/s, and an 8-channel DDR5 EPYC server ~360 GB/s. We multiply by 0.4 to account for AVX2/AVX-512 efficiency versus theoretical FP32 throughput. This is why our tool marks "CPU is the bottleneck" when the CPU side of the offload is the slower path — most calculators don't tell you this, and the right fix is often counterintuitive (upgrade RAM, not GPU).
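A sketch of the offload math (the worked numbers in the comment are illustrative):

def cpu_only_tps(ram_bandwidth_gb_per_s, weights_gb):
    # 0.4 factor for AVX2/AVX-512 efficiency vs theoretical throughput
    return ram_bandwidth_gb_per_s * 0.4 / weights_gb

def offload_tps(gpu_frac, gpu_tok_s, cpu_tok_s):
    """Weighted harmonic mean of the GPU-resident and CPU-resident decode speeds."""
    cpu_frac = 1 - gpu_frac
    return 1 / (gpu_frac / gpu_tok_s + cpu_frac / cpu_tok_s)

# 70% of a 13 GB model on a GPU doing 40 tok/s, the rest on dual-channel DDR5-5600:
# the CPU path runs at ~2.5 tok/s and drags the combined speed down to ~7.2 tok/s.
print(round(offload_tps(0.7, 40, cpu_only_tps(80, 13)), 1))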

Apple Silicon (special case)

Unified memory means CPU and GPU share one pool. There's no PCIe transfer cost. Effective bandwidth is the chip's memory bandwidth (e.g., 546 GB/s on M4 Max, 819 GB/s on M3 Ultra). We treat system_ram - 8 GB as the effective "VRAM" budget (leaving 8 GB for OS and apps). MLX-LM gets a 0.70 efficiency factor; llama.cpp Metal gets 0.62.
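In sketch form (the 64 GB M4 Max in the comment is just an illustrative configuration):

def apple_budget_gb(system_ram_gb):
    """Unified memory: the whole pool minus ~8 GB reserved for the OS and apps."""
    return system_ram_gb - 8

def apple_decode_tps(memory_bandwidth_gb_per_s, weights_gb, runner="mlx"):
    efficiency = 0.70 if runner == "mlx" else 0.62  # llama.cpp Metal
    return memory_bandwidth_gb_per_s / weights_gb * efficiency

# 64 GB M4 Max (546 GB/s): a 56 GB budget and ~78 tok/s on Llama 3.1 8B Q4_K_M via MLX
print(apple_budget_gb(64), round(apple_decode_tps(546, 4.9)))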

Mixture-of-experts (MoE) handling

For MoE models like Qwen 3 235B-A22B (235B total / 22B active) or Llama 4 Scout (109B / 17B active), VRAM is gated by total parameters but speed is gated by active parameters. We use active_parameter_count_b in the speed formula and parameter_count_b in the VRAM formula. This is why an MoE model can fit on a card that struggles with a dense model of the same total size, but runs slower than a dense model of the active size.
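A sketch of how the two parameter counts feed the two formulas:

def moe_predictions(total_params_b, active_params_b, bits_per_param,
                    bandwidth_gb_per_s, efficiency=0.65):
    """MoE: VRAM gated by total params, decode speed gated by active params."""
    weights_total_gb = total_params_b * bits_per_param / 8
    weights_active_gb = active_params_b * bits_per_param / 8
    return round(weights_total_gb, 1), round(bandwidth_gb_per_s / weights_active_gb * efficiency)

# Llama 4 Scout (109B total / 17B active) at Q4_K_M: ~66 GB of weights to hold,
# but per-token reads like a ~10 GB dense model (~61 tok/s at 960 GB/s, if it fits).
print(moe_predictions(109, 17, 4.83, 960))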

Use-case ranking

When you pick a use case (Coding, Reasoning, Vision, etc.), we re-rank the comfortable tier by a fit score:

  • Direct tag matches on the model's use_cases field
  • Family heuristics (e.g., DeepSeek R1 family scores higher for reasoning)
  • Slug heuristics (e.g., anything with "coder" in the slug for coding)
  • Hard filters (Vision = multimodal-only)
  • Parameter-count modifiers (bigger models score higher for harder tasks like reasoning)

The final ranking is 0.6 × use_case_fit + 0.4 × predicted_tokens_per_sec — weighted toward fit but speed still matters.
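As a sketch — the page doesn't spell out how raw tok/s is put on the same 0–1 scale as the fit score, so normalizing against the fastest model in the tier is an assumption here:

def rank_score(use_case_fit, predicted_tps, max_tps_in_tier):
    # use_case_fit is 0-1; speed is normalized to the fastest model in the tier (assumed)
    return 0.6 * use_case_fit + 0.4 * (predicted_tps / max_tps_in_tier)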

Confidence levels

Every tok/s number gets a short badge:

  • M (green) — Measured. We ran exactly this model + hardware + quant + runner combo. Most accurate.
  • M~ (green) — Measured-near. We ran this model on this hardware but at a different context size or quant; we extrapolated.
  • C (blue) — Community. Submitted by a third party with citation. Useful for breadth, less reliable than measured.
  • ~ (yellow) — Extrapolated. We have a measurement for this model on similar-bandwidth hardware (e.g., we tested Llama 3.1 8B on RTX 5080 → predict for RTX 5070 Ti by bandwidth ratio).
  • E (gray) — Estimated. Pure formula. No benchmark data for this model on this hardware tier yet.

Limitations and caveats

  • We don't currently model thermal throttling. A poorly-cooled laptop will perform 10–30% below predictions during sustained inference.
  • Driver versions matter. New CUDA / ROCm / Metal drivers can change real performance by 5–15%. We only track this when it shows up in measured benchmarks.
  • Multi-GPU tensor parallelism is approximated as 2× VRAM, ~1.5× speed. Real numbers vary by interconnect (PCIe vs NVLink) and runner support.
  • Speculative decoding and draft-model setups can multiply speed by 1.5–3× but require paired models. We don't predict this yet.
  • KV cache quantization (FP8 / INT4 KV) is supported in our math but defaults to FP16. We mark it as a future expansion.

Sources and calibration

The constants in our engine were calibrated against:

Found a wrong number?

Please tell us: corrections@runlocalai.co. Include the URL, the model + hardware combo, and what you measured. We update the engine weekly; verified corrections ship within a week and get a credit on the page.

The full spec

The complete engineering spec for the tool is in our repo at docs/will-it-run-spec.md. It covers the future roadmap (multi-GPU, speculative decoding, FP8 KV cache, sponsored benchmark vetting) and the architectural decisions behind the schema and ranking.