Help improve the benchmark corpus
Reproduction is the moat. Every benchmark below would measurably improve the dataset if you reproduced it on your rig. Prefill links open the submission form with model, hardware, quant, and context already populated.
Total public reproductions on file: 0. Total queue items below: 8.
Editorial-priority opportunities
High/critical-priority benchmark gaps from the editorial roadmap.
Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android-on-device guidance.
Unlocks: /hardware/snapdragon-8-elite, /tools/qualcomm-ai-hub, /stacks/android-on-device-ai
Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery drain + thermal throttling curve answers 'can I ship this in my app?'
Unlocks: /hardware/apple-a18-pro, /systems/mobile-local-ai, /stacks/iphone-on-device-ai
The single-5090 baseline is the comparison anchor for every multi-GPU recommendation on this site. Without it, the 'should I just buy one bigger card?' question can't be answered with confidence.
Unlocks: /hardware/rtx-5090, /guides/choosing-a-gpu-for-local-ai-2026, /guides/running-local-ai-on-multiple-gpus-2026
DeepSeek V4 Flash with the MTP head is claimed to be the throughput leader. Verifying the MTP advantage on production hardware is high-value for V4-Pro-vs-V4-Flash decision-making.
Unlocks: /hardware-combos/vllm-tensor-parallel-h100-workstation, /stacks/h100-tensor-parallel-workstation, /models/deepseek-v4-flash
The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. Editorial estimate is 25-30% of NVIDIA throughput; measured value would close the loop.
Unlocks: /hardware-combos/mac-studio-m3-ultra-192gb, /stacks/apple-silicon-ai, /will-it-run/combo/mac-studio-m3-ultra-192gb
The frontier-MoE production reference. Organizations weighing $200k+ DGX-class purchases vs cloud rental need measured throughput to model cost-per-million-tokens accurately.
Unlocks: /hardware-combos/vllm-tensor-parallel-h100-workstation, /stacks/h100-tensor-parallel-workstation, /will-it-run/combo/vllm-tensor-parallel-h100-workstation
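The cost-per-million-tokens modeling mentioned above reduces to simple arithmetic once throughput is measured. A minimal sketch with illustrative placeholder numbers (hardware price, power draw, electricity rate, and throughput are hypothetical, not measured values from this site):

```python
def cost_per_million_tokens(hw_cost_usd, amortize_hours, power_kw,
                            elec_usd_per_kwh, tok_per_s):
    """Amortized hardware cost plus electricity, per 1M generated tokens."""
    hourly_cost = hw_cost_usd / amortize_hours + power_kw * elec_usd_per_kwh
    tokens_per_hour = tok_per_s * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

# Illustrative inputs only; a real comparison needs a measured benchmark.
usd = cost_per_million_tokens(
    hw_cost_usd=200_000,      # hypothetical DGX-class purchase price
    amortize_hours=3 * 8760,  # 3-year amortization, running 24/7
    power_kw=5.0,             # assumed system power draw
    elec_usd_per_kwh=0.12,    # assumed electricity rate
    tok_per_s=900,            # hypothetical aggregate throughput
)
print(round(usd, 2))
```

The `tok_per_s` input is exactly the number this queue item would produce; everything else is knowable from an invoice and a power meter.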
- moderate: Llama 3.3 70B Instruct on (any)
Pairs with the dual-3090 measurement to quantify the NVLink-vs-PCIe penalty. The 4090 NVLink absence is the single most-misunderstood spec gap; a measured comparison ends the speculation.
Unlocks: /hardware-combos/dual-rtx-4090, /stacks/dual-4090-workstation, /will-it-run/combo/dual-rtx-4090
- moderate: Llama 3.3 70B Instruct on (any)
The dual-3090 NVLink build is the most-recommended prosumer multi-GPU configuration on this site. Without a measured benchmark, the 25-32 tok/s estimate carries editorial-only confidence — operators making $1,500+ buying decisions deserve real numbers.
Unlocks: /hardware-combos/dual-rtx-3090, /stacks/dual-3090-workstation, /will-it-run/combo/dual-rtx-3090
How reproduction lifts confidence
The four-tier confidence ladder is editorial-driven, never automatic.
±15% match. A measurement within 15% of the original on tok/s, with a matching quant and matching context bucket, triggers a confidence lift on the original.
Two independent reproducers. When two distinct operators reproduce the same benchmark, its badge upgrades to independently-reproduced.
Editorial review. Submissions are never auto-published; editorial reviews each one within 1-7 days. Read the trust standards for the full process.
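The ±15% match rule above can be sketched as a simple predicate. A minimal sketch; the field names (`tok_s`, `quant`, `context_bucket`) are illustrative, not the site's actual submission schema:

```python
def reproduces(original, repro, tolerance=0.15):
    """True if a reproduction qualifies for a confidence lift:
    tok/s within ±15% of the original, same quant, same context bucket."""
    within_tolerance = (abs(repro["tok_s"] - original["tok_s"])
                        <= tolerance * original["tok_s"])
    return (within_tolerance
            and repro["quant"] == original["quant"]
            and repro["context_bucket"] == original["context_bucket"])

# Hypothetical example values, not measurements from the queue.
original = {"tok_s": 28.0, "quant": "Q4_K_M", "context_bucket": "8k"}
repro    = {"tok_s": 26.1, "quant": "Q4_K_M", "context_bucket": "8k"}
print(reproduces(original, repro))  # → True: 26.1 is within 15% of 28.0
```

Note the tolerance is relative to the original measurement, so faster reproductions are bounded just as tightly as slower ones.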