Community submitted

Editorial benchmark

DeepSeek R1 Distill Qwen 32B on NVIDIA GeForce RTX 4090

Measured this month.


Measurement

tok/s: 32.5
TTFT: 165 ms
VRAM used: 22.4 GB
RAM used: 5.1 GB
Power: 384 W
Quant: AWQ-INT4
Context: 32K
Run date: 2026-05-06
Source: community

V36.52 rigor detail


Efficiency: 0.085 tok/s per W
Runs captured: 1
Scenario: Single-stream
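
The efficiency figure matches the measured throughput divided by the measured power draw (32.5 tok/s at 384 W). Below is a minimal sketch of that arithmetic, assuming this is how the value is derived. The function names are illustrative and not part of any RunLocalAI tooling.

```python
def efficiency_tok_s_per_watt(tok_per_s: float, power_w: float) -> float:
    """Throughput per watt: (tokens/second) / watts, i.e. tokens per joule."""
    if power_w <= 0:
        raise ValueError("power_w must be positive")
    return tok_per_s / power_w


def joules_per_token(tok_per_s: float, power_w: float) -> float:
    """Inverse view of the same ratio: energy spent per generated token."""
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return power_w / tok_per_s


# Values taken from the Measurement block on this page.
eff = efficiency_tok_s_per_watt(32.5, 384.0)
energy = joules_per_token(32.5, 384.0)
print(f"{eff:.3f} tok/s per W, {energy:.1f} J per token")
```

Because tok/s divided by W is tokens per joule, the same number can be read as roughly 12 J of board power per generated token.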

Editorial notes

DeepSeek R1 Distill Qwen 32B AWQ-INT4 on RTX 4090 via vLLM. Reasoning-token emission is the operationally significant detail: at the measured 32.5 tok/s, a query that emits 2000 thinking tokens before the answer adds roughly 60 seconds before any user-visible response.
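
The delay quoted in the note follows from the measured decode rate. Below is a rough sketch of the estimate, assuming the decode rate stays constant through the reasoning phase and that TTFT covers prompt processing. The helper name is hypothetical.

```python
def thinking_delay_s(thinking_tokens: int, tok_per_s: float, ttft_ms: float = 0.0) -> float:
    """Estimate seconds until the first answer token is visible.

    Assumes reasoning tokens are generated at the steady-state decode rate
    and must all be emitted before the visible answer starts.
    """
    if tok_per_s <= 0:
        raise ValueError("tok_per_s must be positive")
    return ttft_ms / 1000.0 + thinking_tokens / tok_per_s


# 2000 thinking tokens at the measured 32.5 tok/s and 165 ms TTFT.
delay = thinking_delay_s(2000, 32.5, ttft_ms=165)
print(f"{delay:.1f} s before the answer starts")
```

With these inputs the estimate lands a little over one minute, so capping or streaming the reasoning tokens changes perceived latency far more than TTFT does.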

Why this confidence tier?

Moderate confidence

Confidence is rule-based. Every factor below contributed to the tier. We never expose a single numeric score; the tier label is auditable through this explanation alone.

Factors

+ Source: community submission

How to improve this benchmark's confidence

Reproduce this benchmark → An independent reproduction with matching numbers lifts the tier and reduces single-source risk.
Read the confidence methodology → Full editorial standards for tiering.
Why we don't use percentages → Tier labels: auditable, no opaque score.

Cohort intelligence

How this measurement compares to the rest of the corpus. Only comparable rows (same model + hardware first, with relaxations labelled) are used. We never average across runtimes or quant formats unless explicitly told to.

Insufficient comparison data: 0 comparable measurements in the cohort. Outlier detection requires ≥5.

Same hardware, different model

6 matching rows

What else this rig can run at the same quant bucket.

Median tok/s: 37.4
Spread: 8.0 – 150.0
CoV: 99%

36.5 tok/s · rtx-4090 · AWQ-INT4 · Editorial
38.2 tok/s · rtx-4090 · AWQ-INT4 · Editorial
38.2 tok/s · rtx-4090 · AWQ-INT4 · Editorial
14.8 tok/s · rtx-4090 · Q4_K_M · Editorial
8.0 tok/s · rtx-4090 · Q4_K_M · Editorial
+1 more
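
The summary figures for this cohort can be checked against the listed rows. Below is a minimal sketch under two assumptions the page does not confirm. First, the one row not shown is taken to be the 150.0 tok/s maximum implied by the spread, since no visible row reaches it. Second, CoV is taken to be the population standard deviation divided by the mean. The function name is hypothetical.

```python
from statistics import mean, median, pstdev


def cohort_summary(tok_per_s: list[float]) -> dict:
    """Return median, (min, max) spread, and coefficient of variation."""
    if not tok_per_s:
        raise ValueError("need at least one measurement")
    avg = mean(tok_per_s)
    return {
        "median": median(tok_per_s),
        "spread": (min(tok_per_s), max(tok_per_s)),
        # ASSUMPTION: population std dev; a sample std dev would give a larger CoV.
        "cov": pstdev(tok_per_s) / avg,
    }


# Five visible rows plus one ASSUMED hidden row (150.0, inferred from the spread).
rows = [36.5, 38.2, 38.2, 14.8, 8.0, 150.0]
summary = cohort_summary(rows)
print(summary)
```

Under those assumptions the sketch gives a median of 37.35 tok/s and a CoV of about 0.99, in line with the 37.4 and 99% shown. A CoV that high is expected here because the cohort mixes different models and quant formats on the same card.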

Reproduce this benchmark

Got the same model + hardware combo? Run the same measurement and submit your numbers. We'll pre-fill model, hardware, quant, and context — you just add your tok/s, VRAM, runtime version. If your numbers match within ±15%, this benchmark gets a confidence lift and a reproduction badge.
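
The ±15% match rule can be sketched as a relative-tolerance check. This assumes the tolerance is measured against the published value and applied to tok/s; the page does not say which metrics are compared, and the function name is illustrative only.

```python
def within_tolerance(reference: float, candidate: float, tolerance: float = 0.15) -> bool:
    """True if candidate is within +/- tolerance (as a fraction) of reference."""
    if reference <= 0:
        raise ValueError("reference must be positive")
    return abs(candidate - reference) <= tolerance * reference


published_tok_s = 32.5  # from the Measurement block on this page

# Hypothetical reproduction results, for illustration only.
for reproduced in (28.0, 37.0, 27.0):
    verdict = "match" if within_tolerance(published_tok_s, reproduced) else "outside band"
    print(f"{reproduced} tok/s: {verdict}")
```

For the published 32.5 tok/s, this band runs from about 27.6 to 37.4 tok/s.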


Drill into the entity pages for this measurement.

DeepSeek R1 Distill Qwen 32B model page

NVIDIA GeForce RTX 4090 hardware page

All measurements for this exact pair

Try NVIDIA GeForce RTX 4090 in the build engine

Cite or export

Cite this benchmark in your work or paste it into a README. Multiple formats are available; the license is CC-BY-4.0 (attribution to RunLocalAI required).

OG card (PNG): 1200x630, social-preview ready
SVG: vector card, scales cleanly

Embed this benchmark

Paste into a Reddit thread, blog post, or README — attribution baked in.

<a href="https://runlocalai.co/benchmarks/336" rel="noopener">RunLocalAI: DeepSeek R1 Distill Qwen 32B on NVIDIA GeForce RTX 4090 — 32.5 tok/s</a>

Direct download: .json · .md · .bib · .svg

