Benchmark roadmap — what we want measured
This is the public version of our benchmark queue. Each entry is a model+hardware combo we'd like a measurement for, and why that measurement would unlock useful pages or sharpen a confidence tier. If you have the rig, click “I can measure this” — the submission form arrives prefilled with model, hardware, and runtime.
0 measured · 0 wanted · 168 unstarted
| Hardware ↓ / Model → | DeepSeek V4 Pro | Qwen 3.5 235B-A17B | Qwen 3 235B-A22B | Llama 3.1 8B Instruct | DeepSeek R1 (671B) | DeepSeek V4 Flash | Llama 4 Scout | Qwen 3 30B-A3B | Qwen 2.5 Coder 32B | Llama 3.3 70B Instruct | Qwen 3 32B | Gemma 4 31B Dense | Qwen 3 8B | Mistral Medium 3.5 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| AMD Instinct MI325X | | | | | | | | | | | | | | |
| NVIDIA H200 | | | | | | | | | | | | | | |
| NVIDIA L40S | | | | | | | | | | | | | | |
| NVIDIA B200 | | | | | | | | | | | | | | |
| NVIDIA H100 PCIe | | | | | | | | | | | | | | |
| NVIDIA L40 | | | | | | | | | | | | | | |
| NVIDIA RTX PRO 6000 Blackwell | | | | | | | | | | | | | | |
| NVIDIA RTX 6000 Ada Generation | | | | | | | | | | | | | | |
| AMD Instinct MI300X | | | | | | | | | | | | | | |
| NVIDIA GB200 NVL72 | | | | | | | | | | | | | | |
| AMD Instinct MI355X | | | | | | | | | | | | | | |
| AMD Instinct MI300A (APU) | | | | | | | | | | | | | | |
Have a rig? Run a benchmark in 5 minutes.
The community-benchmark scripts capture power source, GPU clock, CUDA version, thermal state — every variable that affects the tok/s number. Output is a paste-ready result block. Nothing uploads automatically.
Windows (PowerShell): .\scripts\community-benchmark\run-benchmark.ps1
macOS / Linux: ./scripts/community-benchmark/run-benchmark.sh
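For context on what ends up in the result block, here is a minimal sketch of the kind of environment capture the script performs on an NVIDIA rig. It is illustrative only (the script's own output format is authoritative) and assumes nvidia-smi and nvcc are on PATH.

```bash
# Illustrative only: roughly the environment facts run-benchmark.sh records
# alongside the tok/s numbers. Assumes an NVIDIA GPU and the CUDA toolkit.

# GPU model, SM clock, power draw, and temperature at capture time
nvidia-smi --query-gpu=name,clocks.sm,power.draw,temperature.gpu \
           --format=csv,noheader

# Driver and CUDA toolkit versions
nvidia-smi --query-gpu=driver_version --format=csv,noheader
nvcc --version | grep release

# OS and kernel
uname -srm
```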
How this works
1. We mark a model+hardware combo as “wanted” when measuring it would unlock new pages, sharpen a confidence tier, or fill a gap that operators are repeatedly asking about.
2. When you click “I can measure this,” the submission form opens with the model, hardware, and runtime prefilled. You add your measurements (tok/s, VRAM, context, runtime version, OS).
3. Submissions still go through editorial review. We don't auto-publish. If your numbers are plausible and well-documented, we mark the opportunity as measured and the benchmark goes live with full attribution.
- Critical · target: 55-75 tok/s decode (single stream)
Single RTX 5090 + Qwen 3 Coder 32B (vLLM, AWQ-INT4)
Qwen 3 Coder 32B on NVIDIA GeForce RTX 5090 · vLLM · AWQ-INT4
Why we want this: The single-5090 baseline is the comparison anchor for every multi-GPU recommendation on this site. Without it, the 'should I just buy one bigger card?' question can't be answered with confidence.
Unlocks: /hardware/rtx-5090 · /guides/choosing-a-gpu-for-local-ai-2026 · /guides/running-local-ai-on-multiple-gpus-2026 · /will-it-run/rtx-5090
Run this benchmark on your rig: MODEL="qwen-3:32b" ./scripts/community-benchmark/run-benchmark.sh
- High · target: 12-25 tok/s decode (Hexagon NPU, estimate)
Snapdragon 8 Elite + Phi-3.5 Mini (Qualcomm AI Hub, INT8)
Phi-3.5 Mini Instruct on Qualcomm Snapdragon 8 Elite · Qualcomm AI Hub · INT8
Why we want this: Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android on-device guidance.
Unlocks: /hardware/snapdragon-8-elite · /tools/qualcomm-ai-hub · /stacks/android-on-device-ai
Run this benchmark on your rig: MODEL="phi-3.5:phi-3.5-mini-instruct" ./scripts/community-benchmark/run-benchmark.sh
- High · target: 8-15 tok/s decode (estimate, sustained)
iPhone 16 Pro + Llama 3.2 3B (MLX Swift, INT4)
Llama 3.2 3B Instruct on Apple A18 Pro · MLX Swift · MLX-INT4
Why we want this: Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery drain + thermal throttling curve answers 'can I ship this in my app?'
Unlocks: /hardware/apple-a18-pro · /systems/mobile-local-ai · /stacks/iphone-on-device-ai
Run this benchmark on your rig: MODEL="llama-3.2:3b" ./scripts/community-benchmark/run-benchmark.sh
- High · target: 100-160 tok/s decode (single stream)
4× H100 SXM + DeepSeek V4 Flash (vLLM TP-4, INT4)
DeepSeek V4 Flash (284B MoE) on 4× H100 SXM · vLLM · AWQ-INT4
Why we want this: DeepSeek V4 Flash with the MTP head is claimed to be the throughput leader. Verifying the MTP advantage on production hardware is high-value for V4-Pro-vs-V4-Flash decision-making.
Unlocks: /hardware-combos/vllm-tensor-parallel-h100-workstation · /stacks/h100-tensor-parallel-workstation · /models/deepseek-v4-flash
Run this benchmark on your rig: MODEL="deepseek-v4:deepseek-v4-flash" ./scripts/community-benchmark/run-benchmark.sh
- High · target: 8-14 tok/s decode (single stream)
Mac Studio M3 Ultra 192GB + Qwen 3.5 235B-A17B (MLX-4bit)
Qwen 3.5 235B-A17B (MoE) on Mac Studio M3 Ultra 192GB · MLX-LM · MLX-4bit
Why we want this: The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. The editorial estimate is 25-30% of NVIDIA throughput; a measured value would close the loop.
Unlocks: /hardware-combos/mac-studio-m3-ultra-192gb · /stacks/apple-silicon-ai · /will-it-run/combo/mac-studio-m3-ultra-192gb · /guides/running-local-ai-on-multiple-gpus-2026
Run this benchmark on your rig: MODEL="qwen-3.5:17b" ./scripts/community-benchmark/run-benchmark.sh
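If you have the hardware but not the script handy, a quick MLX-LM spot check looks roughly like the sketch below. The model path is a placeholder for any 4-bit MLX checkpoint; treat it as a sanity read and use the community script's output for actual submissions.

```bash
# Illustrative MLX-LM spot check on Apple Silicon (not a substitute for the
# community script). The --model value is a placeholder for a 4-bit MLX checkpoint.
pip install mlx-lm
python -m mlx_lm.generate \
  --model <path-or-hub-id-of-4bit-mlx-checkpoint> \
  --prompt "Summarize the trade-offs of mixture-of-experts models." \
  --max-tokens 256
# mlx_lm.generate reports prompt and generation tokens-per-second in its output.
```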
- High · target: 60-90 tok/s decode (single stream)
4× H100 SXM + Qwen 3.5 235B-A17B (vLLM TP-4, FP8)
Qwen 3.5 235B-A17B (MoE) on 4× H100 SXM · vLLM · FP8
Why we want this: This is the frontier-MoE production reference. Organizations weighing $200k+ DGX-class purchases against cloud rental need measured throughput to model cost-per-million-tokens accurately.
Unlocks: /hardware-combos/vllm-tensor-parallel-h100-workstation · /stacks/h100-tensor-parallel-workstation · /will-it-run/combo/vllm-tensor-parallel-h100-workstation · /guides/running-local-ai-on-multiple-gpus-2026
Run this benchmark on your rig: MODEL="qwen-3.5:17b" ./scripts/community-benchmark/run-benchmark.sh
- High · target: 28-36 tok/s decode (PCIe only)
Dual RTX 4090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)
Llama 3.3 70B Instruct on dual RTX 4090 · vLLM · AWQ-INT4
Why we want this: Pairs with the dual-3090 measurement to quantify the NVLink-vs-PCIe penalty. The 4090's lack of NVLink is the single most-misunderstood spec gap; a measured comparison ends the speculation. (A hedged vLLM launch sketch follows the dual-3090 entry below.)
Unlocks: /hardware-combos/dual-rtx-4090 · /stacks/dual-4090-workstation · /will-it-run/combo/dual-rtx-4090 · /guides/running-local-ai-on-multiple-gpus-2026
Run this benchmark on your rig: MODEL="llama-3.3:70b" ./scripts/community-benchmark/run-benchmark.sh
- High · target: 25-32 tok/s decode (NVLink)
Dual RTX 3090 + Llama 3.3 70B Q4 (vLLM tensor-parallel)
Llama 3.3 70B Instruct on dual RTX 3090 · vLLM · AWQ-INT4
Why we want this: The dual-3090 NVLink build is the most-recommended prosumer multi-GPU configuration on this site. Without a measured benchmark, the 25-32 tok/s estimate carries editorial-only confidence — operators making $1,500+ buying decisions deserve real numbers.
Unlocks: /hardware-combos/dual-rtx-3090 · /stacks/dual-3090-workstation · /will-it-run/combo/dual-rtx-3090 · /guides/running-local-ai-on-multiple-gpus-2026
Run this benchmark on your rig: MODEL="llama-3.3:70b" ./scripts/community-benchmark/run-benchmark.sh
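For both dual-GPU entries above, a minimal vLLM launch sketch, assuming an AWQ-quantized Llama 3.3 70B checkpoint is already downloaded (the model ID is a placeholder):

```bash
# Illustrative 2-GPU tensor-parallel launch for the dual-3090 / dual-4090 entries.
# The model ID is a placeholder for an AWQ-INT4 Llama 3.3 70B checkpoint.
vllm serve <awq-int4-llama-3.3-70b-checkpoint> \
  --tensor-parallel-size 2 \
  --quantization awq \
  --max-model-len 8192
# vLLM logs generation throughput; the community script measures single-stream
# decode tok/s against the OpenAI-compatible endpoint it exposes.
```

The quad-GPU entries on this page follow the same pattern with --tensor-parallel-size 4.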
- Medium · target: 10-22 tok/s decode (Adreno GPU path)
Snapdragon 8 Elite + Llama 3.2 3B (MLC LLM, GPU)
Llama 3.2 3B Instruct on Qualcomm Snapdragon 8 Elite · MLC LLM · Q4_K_M (TVM-quant)
Why we want this: MLC LLM is cross-platform and the most-deployed mobile LLM runtime. The Adreno-vs-Hexagon comparison on the same SoC determines whether NPU lock-in is worth the throughput gain.
Unlocks: /hardware/snapdragon-8-elite · /tools/mlc-llm · /stacks/android-on-device-ai
Run this benchmark on your rig: MODEL="llama-3.2:3b" ./scripts/community-benchmark/run-benchmark.sh
- Medium · target: 20-35 tok/s decode (cold); throttle curve TBD
iPad M4 + Qwen 2.5 3B (MLX, sustained-load curve)
Qwen 2.5 3B Instruct on Apple M4 (iPad Pro) · MLX-LM · MLX-4bit
Why we want this: Tablet-class on-device viability for journaling and long-form summarization. Needs the throttle curve, not just peak tok/s.
Unlocks: /hardware/apple-m4-ipad · /systems/mobile-local-ai
Run this benchmark on your rig: MODEL="qwen-2.5:3b" ./scripts/community-benchmark/run-benchmark.sh
- Medium · target: 18-35 tok/s decode (estimate)
Intel Lunar Lake + Phi-3.5 Mini (OpenVINO NPU)
Phi-3.5 Mini Instruct on Intel Core Ultra 7 258V (Lunar Lake) · ONNX Runtime Mobile · INT8
Why we want this: Lunar Lake is the Intel reference platform for Copilot+ PCs. Comparison against the Snapdragon X NPU determines which Copilot+ chip operators should prefer for on-device LLMs.
Unlocks: /hardware/intel-lunar-lake-258v · /systems/mobile-local-ai
Run this benchmark on your rig: MODEL="phi-3.5:phi-3.5-mini-instruct" ./scripts/community-benchmark/run-benchmark.sh
- Medium · target: 20-40 tok/s decode (estimate)
Snapdragon X Elite + Phi-3.5 Mini (ONNX Runtime + DirectML NPU)
Phi-3.5 Mini Instruct on Qualcomm Snapdragon X Elite · ONNX Runtime Mobile · INT8
Why we want this: The Copilot+ PC ecosystem is rapidly expanding. The Snapdragon X NPU vs Lunar Lake NPU vs CPU-fallback comparison is the central operator decision for Windows on-device deployments.
Unlocks: /hardware/snapdragon-x-elite · /tools/onnx-runtime-mobile · /systems/mobile-local-ai
Run this benchmark on your rig: MODEL="phi-3.5:phi-3.5-mini-instruct" ./scripts/community-benchmark/run-benchmark.sh
- Medium · target: 30-45 tok/s per stream × 4-32 concurrent
Ray Serve 4-node × 2× 4090 + Qwen 3 32B (concurrency scan)
Qwen 3 32B on a 4-node × 2× 4090 Ray Serve cluster · Ray Serve · AWQ-INT4
Why we want this: Distributed-serving patterns differ from tensor parallelism — replicas scale aggregate concurrency, not single-stream model size. The concurrency scan reveals where Ray Serve replicas plateau.
Unlocks: /hardware-combos/ray-serve-distributed-multi-node · /stacks/distributed-inference-homelab
Run this benchmark on your rig: MODEL="qwen-3:32b" ./scripts/community-benchmark/run-benchmark.sh
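As a rough illustration of what the concurrency scan means in practice, the sketch below fires batches of parallel requests at an OpenAI-compatible endpoint and times each batch. It only measures wall-clock time per batch (no token counting), so it shows where aggregate throughput plateaus rather than exact per-stream tok/s; the endpoint URL and model name are placeholders.

```bash
# Hypothetical concurrency scan against an OpenAI-compatible endpoint exposed by
# the Ray Serve deployment. URL and model name are placeholders.
URL="http://localhost:8000/v1/completions"
for c in 1 2 4 8 16 32; do
  start=$(date +%s)
  for _ in $(seq "$c"); do
    curl -s "$URL" -H "Content-Type: application/json" \
      -d '{"model":"qwen-3-32b-awq","prompt":"Write a haiku about GPUs.","max_tokens":256}' \
      > /dev/null &
  done
  wait   # wait for all c in-flight requests to finish
  echo "concurrency=$c wall_seconds=$(( $(date +%s) - start ))"
done
```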
- Medium · target: 4-9 tok/s decode (Thunderbolt 5 inter-node)
4× Mac Mini M4 Pro Exo cluster + Llama 3.1 70B (MLX-4bit)
Llama 3.1 70B Instruct on 4× Mac Mini M4 Pro · Exo · MLX-4bit
Why we want this: Multi-Mac Exo clusters are an emerging pattern. The cluster-vs-single-Mac-Studio comparison establishes whether the cluster is ever the right answer outside extreme memory targets.
Unlocks: /hardware-combos/quad-mac-mini-m4-pro-exo · /stacks/multi-machine-apple-cluster · /will-it-run/combo/quad-mac-mini-m4-pro-exo
Run this benchmark on your rig: MODEL="llama-3.1:70b" ./scripts/community-benchmark/run-benchmark.sh
- Medium · target: 20-28 tok/s decode (with thinking-mode bloat)
4× RTX 3090 + DeepSeek R1 Distill Llama 70B (vLLM TP-4)
DeepSeek R1 Distill Llama 70B on 4× RTX 3090 · vLLM · AWQ-INT4
Why we want this: Quad-3090 is the prosumer-ceiling stack. R1 reasoning workloads are a high-traffic use case, but thinking-mode token bloat changes the throughput calculation — a measurement is needed to set realistic operator expectations.
Unlocks: /hardware-combos/quad-rtx-3090 · /stacks/quad-3090-workstation · /will-it-run/combo/quad-rtx-3090
Run this benchmark on your rig: MODEL="deepseek-r1:70b" ./scripts/community-benchmark/run-benchmark.sh
- Low · target: 10-16 tok/s (asymmetric layer-split)
llama.cpp layer-split + Mixtral 8x22B (mixed 4090+3090)
Mixtral 8x22B Instruct on mixed RTX 4090 + RTX 3090 · llama.cpp · Q4_K_M
Why we want this: Research-tier benchmark — mixed-GPU setups are editorially discouraged for new builds, but the numbers are useful for users who already own asymmetric pairs.
Unlocks: /hardware-combos/mixed-4090-3090 · /stacks/mixed-4090-3090-workstation
Run this benchmark on your rig: MODEL="mixtral-8x22b:22b" ./scripts/community-benchmark/run-benchmark.sh
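For anyone attempting this one, a hedged llama.cpp launch sketch for an asymmetric 4090+3090 split: the GGUF path is a placeholder and the --tensor-split ratio is a starting guess, not a tuned value.

```bash
# Illustrative llama.cpp launch for a mixed 4090+3090 pair. GGUF path is a
# placeholder; the split weights favor the faster card and will need tuning.
./llama-cli \
  -m mixtral-8x22b-instruct-q4_k_m.gguf \
  --n-gpu-layers 999 \
  --split-mode layer \
  --tensor-split 0.55,0.45 \
  -p "Explain layer-split inference in two sentences."
```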
Need a measurement we don't cover?
Got a measurement? Submit it directly — editorial adds the corresponding opportunity row and credits the gap to you.
Need one measured? Request it. Editorial reviews and surfaces accepted requests in the section above.