Field note · Multi-GPU

Dual 3090 vs single 5090 — the deployment we actually built

We needed 70B inference for a small team. Spent three weekends comparing two paths. Shipped the one the spec sheet said was second-best. Here's why.

The problem we were solving

Small team — four developers — needed local inference on Llama 3.3 70B Q4_K_M for an internal RAG pipeline over a document corpus. Cloud was out for compliance reasons. The workload pattern was: 2-3 of us actively using it during work hours, mostly chat-shaped queries with 4-12K context. Tail case was a long-context RAG batch maybe twice a week, running unattended overnight.

Budget ceiling was $3,000 for the GPU(s). Existing chassis (a Fractal Design Define R5) had room for two GPUs but only if we pulled the upper drive cage. PSU was a 1000W Seasonic Prime PX, comfortable for a single 4090 but tight for dual 3090s.

The two paths that fit the budget were:

  • Two used RTX 3090s, ~$800 each, 48 GB combined VRAM, running tensor parallel via NCCL over PCIe (no NVLink bridge).
  • One new RTX 5090, ~$2,200 retail, 32 GB on one card, no multi-GPU complexity.

The spec sheet said dual 3090 wins on combined VRAM (48 vs 32) and total throughput on tensor-parallel workloads. Single 5090 wins on simplicity, power efficiency, and bandwidth per card. We figured a weekend of testing would settle it.

Weekend one: built the dual-3090 rig

Bought two 3090s — one EVGA FTW3 (the same one I covered in the six-month used-3090 field note) and an Asus Strix from a different seller. Pulled the upper drive cage. Pulled an optical drive bay we hadn't used since 2014. Both cards fit. Barely. Clearance between the Asus in the upper PCIe slot and the chassis side panel came down to about 4mm.

Power delivery went OK. The 1000W PSU has dual EPS rails and five PCIe 8-pin connectors. Each card needed two, so four of the five connectors were in use, with one to spare. PSU sag under sustained dual-card load was ~3% on the 12V rail — within spec but noticeable.
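For anyone re-running the power math before buying: a back-of-envelope headroom check, in the spirit of what we did on paper. The wattages below are assumptions (stock power limits, a typical platform draw), not measurements from our rig.

```python
# Rough PSU headroom check for a dual-3090 build.
# All wattages are assumptions, not measurements.

GPU_POWER_W = 350        # stock RTX 3090 power limit, per card
N_GPUS = 2
PLATFORM_W = 200         # CPU + board + drives + fans, rough guess
TRANSIENT_FACTOR = 1.3   # Ampere cards are known for brief power spikes

PSU_W = 1000

sustained = GPU_POWER_W * N_GPUS + PLATFORM_W
transient = GPU_POWER_W * N_GPUS * TRANSIENT_FACTOR + PLATFORM_W

print(f"sustained: {sustained} W ({sustained / PSU_W:.0%} of PSU)")
print(f"transient estimate: {transient:.0f} W ({transient / PSU_W:.0%} of PSU)")
# sustained: 900 W (90% of PSU) -- in spec, but the thin margin is
# consistent with the 3-4% 12V sag we saw under load.
```

Ninety percent sustained load is exactly the "within spec but not comfortable" territory the what-we'd-do-differently list below lands on.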

Then came the actual problem: thermal. With both cards installed, the upper card (Asus Strix) was getting ~85% of its intake air pre-heated by the lower card (EVGA FTW3). Under sustained 70B inference, the upper card hit 82°C and started clock-throttling to 1,400 MHz. Bottom card held 76°C. Throughput on tensor-parallel decode was about 18 tok/s sustained, but the upper card was the bottleneck.

We added a 140mm intake fan to the side panel after pulling the cover off and cutting a hole. That's how serious we got about it. Brought the upper card down to 78°C. Throughput rose to 21 tok/s sustained. PSU sag stayed at 3-4%.
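We caught the throttling by watching temperatures and SM clocks during sustained decode. A minimal sketch of the poller, using nvidia-smi's query interface (the query fields and flags are standard nvidia-smi; the 5-second interval and the slot-to-index mapping are just our setup):

```python
# Poll GPU temperature, SM clock, and power draw every few seconds.
# Useful for spotting the upper card throttling under sustained load.
import subprocess
import time

QUERY = "index,temperature.gpu,clocks.sm,power.draw"

def sample():
    out = subprocess.run(
        ["nvidia-smi", f"--query-gpu={QUERY}", "--format=csv,noheader"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [line.split(", ") for line in out.strip().splitlines()]

while True:
    for idx, temp, clock, power in sample():
        # In our rig GPU 0 was the upper (Strix) card; it crossed
        # ~82C and dropped SM clocks toward 1,400 MHz before the fan mod.
        print(f"GPU{idx}: {temp}C  {clock}  {power}")
    time.sleep(5)
```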

Weekend two: built the single-5090 rig

Returned one of the 3090s (within the seller's window). Bought a 5090. Installed it in the same chassis. Power delivery via the included 12VHPWR adapter. No thermal engineering required. Card runs hot — 78°C sustained — but in spec, with no compromise from a sibling card.

Throughput on Llama 3.3 70B Q4 with offload (because the model + KV cache + embedding model exceeds 32 GB at our target context window) was 16-18 tok/s. The offload to system RAM was the bottleneck, not the GPU. The 5090's bandwidth advantage didn't matter when we were waiting on DDR5 to feed it tensors.
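The fit math behind that parenthetical is worth showing, because it's the same math that decided weekend three. Llama 3.3 70B's layer count, KV-head count, and head dim are public; the weights size and overhead figures below are rough assumptions, and the sketch ignores tensor-parallel duplication and allocator fragmentation.

```python
# Will-it-fit estimate for Llama 3.3 70B Q4_K_M.
# weights_gib and overhead_gib are approximations, not measurements.

GIB = 1024**3

weights_gib  = 40.0    # Q4_K_M GGUF is ~42.5 GB on disk, ~40 GiB
n_layers     = 80      # Llama 3.3 70B
n_kv_heads   = 8       # grouped-query attention
head_dim     = 128
cache_bytes  = 2       # fp16 K and V entries
overhead_gib = 2.0     # embedding model + runtime buffers, rough guess

def needed_gib(ctx_tokens):
    # K and V, per layer, per token
    kv = 2 * n_layers * n_kv_heads * head_dim * cache_bytes * ctx_tokens
    return weights_gib + kv / GIB + overhead_gib

for ctx in (4096, 8192, 12288):
    need = needed_gib(ctx)
    print(f"{ctx:>5} ctx: {need:4.1f} GiB  "
          f"(one 32 GiB card: {'fits' if need <= 32 else 'offload'}, "
          f"2x24 GiB: {'fits' if need <= 48 else 'offload'})")
# 12288 ctx: ~45.8 GiB -- offload on a single 32 GiB card,
# fits across two 24 GiB cards.
```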

We tried Q3_K_M to fit the model fully on-card. Throughput rose to 22-24 tok/s. Quality on the long-context RAG queries dropped noticeably — citations became less accurate, and factual recall on the document corpus got measurably worse. We didn't ship Q3 because the quality regression was too obvious to the team.

Weekend three: the decision we made

We shipped the dual 3090 rig. Slower per-card peak throughput, more thermal engineering, more complexity to maintain — but the 48 GB combined VRAM let us run Llama 3.3 70B Q4 fully on-GPU at the context length we actually wanted. The 5090's bandwidth advantage was irrelevant because the workload bottleneck was VRAM capacity, not memory bandwidth.

Sustained throughput in production: ~21 tok/s on dual 3090, vs ~17 tok/s on the 5090 with offload. That's roughly a 24% difference in raw tok/s. Per-user latency was within margin of error — both setups put first token in the user's chat within 800ms. The throughput difference mattered for the twice-weekly batch RAG runs (50 queries × ~60s on dual 3090 vs ~75s on single 5090, so about 12-13 minutes saved per batch). Over 18 months that's roughly 32 hours, which we don't think about much but is real.
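The batch arithmetic, for anyone checking our numbers:

```python
# Batch RAG savings, using the production figures above.
queries_per_batch = 50
saved_per_query_s = 75 - 60   # single 5090 w/ offload vs dual 3090
batches_per_week  = 2
weeks             = 78        # ~18 months

per_batch_min = queries_per_batch * saved_per_query_s / 60
total_hours   = per_batch_min * batches_per_week * weeks / 60
print(f"{per_batch_min:.1f} min/batch, ~{total_hours:.0f} h over 18 months")
# 12.5 min/batch, ~32 h over 18 months
```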

Total cost: $1,640 for two 3090s + $40 for the case fan + $30 in zip ties and cable extensions, so $1,710 all-in, vs $2,200 for the 5090 alone. Dual 3090 was about 22% cheaper AND faster on our specific workload.

What the spec sheet got least right

We expected the bandwidth difference to dominate (5090 has about 1.8 TB/s vs 3090's 936 GB/s — almost 2× per card). On a fits-fully-in-VRAM benchmark, the 5090 absolutely is faster per card. But our actual workload's bottleneck was upstream of bandwidth — we were waiting on system RAM to page model weights into the GPU because we couldn't fit the full model on a single card. Bandwidth doesn't help when you're waiting on DDR5.

The lesson — which I'd known abstractly but hadn't internalized — is that spec-sheet bandwidth advantages assume your workload fits in VRAM. The moment you're paging through system RAM, the GPU bandwidth conversation is over.
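A crude roofline makes the same point with numbers. Decode is memory-bound, so a ceiling estimate is bytes touched per token divided by the bandwidth of the link they cross. The two VRAM figures are the ones quoted above; the PCIe and DDR5 numbers are typical values, not measurements from our box.

```python
# Per-link decode ceilings: weights are read once per generated token,
# so tok/s is capped near link_bandwidth / bytes_read_per_token.

model_gb = 40.0   # approx bytes read per token for a Q4 70B

links_gb_s = {
    "3090 VRAM":    936,    # from the spec comparison above
    "5090 VRAM":    1792,
    "PCIe 4.0 x16": 32,     # path to any offloaded layers (assumption)
    "DDR5 (2ch)":   80,     # where spilled weights live (assumption)
}

for name, bw in links_gb_s.items():
    print(f"{name:>12}: ceiling ~{bw / model_gb:5.1f} tok/s")
# 3090 VRAM ceiling ~23 tok/s, 5090 ~45 tok/s -- but any slice of the
# model that sits behind PCIe/DDR5 is capped at single digits, which is
# why the 5090-with-offload run sagged to 16-18 tok/s overall.
```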

What we'd do differently

  • Don't reuse the PSU we already had. The 1000W Seasonic was within spec but not comfortable for dual 3090s under sustained load. We'd buy a 1200W or 1300W unit on day one for the dual-card path. Cost delta: $80-120.
  • Plan thermal from the start. The two weekends of "let's see if the existing chassis works" cost us a Saturday of cutting case panels with a Dremel. A chassis with proper dual-GPU airflow (Fractal Meshify, NZXT H7 Flow) avoids that. $120-150 vs the time we spent.
  • Run a fits-in-VRAM check before the decision. The single-5090 path's offload bottleneck was predictable from the model + context + KV cache math. We could have modeled it in a spreadsheet before buying anything. We didn't, because we were confident the spec-sheet bandwidth would dominate. It didn't, because our workload was VRAM-capacity-bound. The will-it-run checker we now ship would have told us this in 30 seconds.
  • Don't return the second 3090 for testing. We went through the eBay return process to test the 5090, then re-bought a second 3090 from a different seller. Lost $40 on shipping and three days of waiting. If we had it to do over we'd have rented cloud GPU time for the 5090 test ($15 in vast.ai credit) and avoided the return loop entirely.

Who this story doesn't apply to

If your workload does fit comfortably on a single 24 GB card — most 14B-32B Q4 use cases — the dual-3090 complexity buys you nothing. A single used 3090 or single 4090 is the saner answer. Multi-GPU is a workload-specific decision; the deployment story above only matters when the single-card option puts you into offload territory.

If your workload is image generation rather than LLM inference, the bandwidth conversation reverses. Image diffusion is compute-bound; the 5090's tensor cores beat the 3090's by a wider margin than the spec sheet suggests. For pure image-gen production at our budget, single 5090 would have been the right call.

If you're staffing a team that needs to maintain the rig and you're not the one doing it, the dual-card complexity adds operational overhead for whoever inherits it. Single-card builds are easier to support. We made our choice partly because I was going to maintain it personally.


This is a field note — a specific deployment story. The full head-to-head with spec tables is at dual RTX 3090 vs RTX 5090. Our broader buyer-guide framework is at best GPU for local AI. Editorial principles at editorial philosophy.