Help improve the benchmark corpus
Reproduction is the moat. Every benchmark below would measurably improve the dataset if you reproduced it on your rig. Prefill links open the submission form with model, hardware, quant, and context already populated.
Total public reproductions on file: 0. Total queue items below: 8.
Editorial-priority opportunities
High/critical-priority benchmark gaps from the editorial roadmap.
Snapdragon 8 Elite is the mid-2025 flagship for Android on-device LLM inference. Establishing the NPU-vs-GPU-fallback tradeoff numbers is critical for the Android-on-device guidance.
Unlocks: /hardware/snapdragon-8-elite, /tools/qualcomm-ai-hub, /stacks/android-on-device-ai
Mobile on-device LLM viability is the most-asked question in the iPhone-developer ecosystem in 2026. A measured tok/s + battery drain + thermal throttling curve answers 'can I ship this in my app?'
Unlocks: /hardware/apple-a18-pro, /systems/mobile-local-ai, /stacks/iphone-on-device-ai
The single-5090 baseline is the comparison anchor for every multi-GPU recommendation on this site. Without it, the 'should I just buy one bigger card?' question can't be answered with confidence.
Unlocks: /hardware/rtx-5090, /guides/choosing-a-gpu-for-local-ai-2026, /guides/running-local-ai-on-multiple-gpus-2026
DeepSeek V4 Flash with the MTP head is claimed to be the throughput leader. Verifying the MTP advantage on production hardware is high-value for V4-Pro-vs-V4-Flash decision-making.
Unlocks: /hardware-combos/vllm-tensor-parallel-h100-workstation, /stacks/h100-tensor-parallel-workstation, /models/deepseek-v4-flash
The Apple-vs-NVIDIA comparison at the frontier-MoE tier is the most-asked question for Mac Studio buyers. Editorial estimate is 25-30% of NVIDIA throughput; measured value would close the loop.
Unlocks: /hardware-combos/mac-studio-m3-ultra-192gb, /stacks/apple-silicon-ai, /will-it-run/combo/mac-studio-m3-ultra-192gb
The frontier-MoE production reference. Organizations weighing $200k+ DGX-class purchases vs cloud rental need measured throughput to model cost-per-million-tokens accurately.
Unlocks: /hardware-combos/vllm-tensor-parallel-h100-workstation, /stacks/h100-tensor-parallel-workstation, /will-it-run/combo/vllm-tensor-parallel-h100-workstation
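The cost-per-million-tokens modeling mentioned above reduces to simple arithmetic once throughput is measured. A minimal sketch with illustrative placeholder numbers (hardware price, power draw, electricity rate, and throughput are hypothetical, not measured values from this site):

```python
def cost_per_million_tokens(hw_cost_usd, amortize_hours, power_kw,
                            elec_usd_per_kwh, tok_per_s):
    """Amortized hardware cost plus electricity, per 1M generated tokens."""
    hourly_cost = hw_cost_usd / amortize_hours + power_kw * elec_usd_per_kwh
    tokens_per_hour = tok_per_s * 3600
    return hourly_cost / tokens_per_hour * 1_000_000

# Illustrative inputs only; a real comparison needs a measured benchmark.
usd = cost_per_million_tokens(
    hw_cost_usd=200_000,      # hypothetical DGX-class purchase price
    amortize_hours=3 * 8760,  # 3-year amortization, running 24/7
    power_kw=5.0,             # assumed system power draw
    elec_usd_per_kwh=0.12,    # assumed electricity rate
    tok_per_s=900,            # hypothetical aggregate throughput
)
print(round(usd, 2))
```

The `tok_per_s` input is exactly the number this queue item would produce; everything else is knowable from an invoice and a power meter.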
- moderate: Llama 3.3 70B Instruct on (any)
Pairs with the dual-3090 measurement to quantify the NVLink-vs-PCIe penalty. The 4090 NVLink absence is the single most-misunderstood spec gap; a measured comparison ends the speculation.
Unlocks: /hardware-combos/dual-rtx-4090, /stacks/dual-4090-workstation, /will-it-run/combo/dual-rtx-4090
- moderate: Llama 3.3 70B Instruct on (any)
The dual-3090 NVLink build is the most-recommended prosumer multi-GPU configuration on this site. Without a measured benchmark, the 25-32 tok/s estimate carries editorial-only confidence — operators making $1,500+ buying decisions deserve real numbers.
Unlocks: /hardware-combos/dual-rtx-3090, /stacks/dual-3090-workstation, /will-it-run/combo/dual-rtx-3090
How reproduction lifts confidence
The four-tier confidence ladder is editorial-driven, never automatic.
±15% match. A measurement within 15% of the original on tok/s, with a matching quant and matching context bucket, triggers a confidence lift on the original.
Two independent reproducers. When two distinct operators reproduce the same benchmark, its badge upgrades to independently-reproduced.
Editorial review. Submissions are never auto-published; editorial reviews each one within 1-7 days. Read the trust standards for the full process.
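The ±15% match rule above can be sketched as a simple predicate. A minimal sketch; the field names (`tok_s`, `quant`, `context_bucket`) are illustrative, not the site's actual submission schema:

```python
def reproduces(original, repro, tolerance=0.15):
    """True if a reproduction qualifies for a confidence lift:
    tok/s within ±15% of the original, same quant, same context bucket."""
    within_tolerance = (abs(repro["tok_s"] - original["tok_s"])
                        <= tolerance * original["tok_s"])
    return (within_tolerance
            and repro["quant"] == original["quant"]
            and repro["context_bucket"] == original["context_bucket"])

# Hypothetical example values, not measurements from the queue.
original = {"tok_s": 28.0, "quant": "Q4_K_M", "context_bucket": "8k"}
repro    = {"tok_s": 26.1, "quant": "Q4_K_M", "context_bucket": "8k"}
print(reproduces(original, repro))  # → True: 26.1 is within 15% of 28.0
```

Note the tolerance is relative to the original measurement, so faster reproductions are bounded just as tightly as slower ones.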