Hardware tier reality check
Editorial

Laptop vs consumer GPU vs workstation vs homelab vs rack

Hardware tiers are not a continuum: each step up is a discrete jump in what it takes to operate. This matrix surfaces what each tier can actually run day-to-day, what breaks first, and what it costs you in ongoing maintenance, not just purchase price.

| Dimension | Laptop iGPU (M-series / Strix Halo) | Consumer GPU (RTX 4070-4090, 7900 XTX) | Workstation (RTX 6000 / dual 4090) | Homelab rack (2-4× consumer + UPS) | Datacenter (H100/H200/B200) |
| --- | --- | --- | --- | --- | --- |
| Largest model (4-bit) practical. Roughly the biggest model you can actually use day-to-day. | Limited: ≤32B at Q4 on 64GB unified memory; smaller is better. | Strong: 70B at Q4 fits a single 24-32GB card with tight context. | Excellent: 120B at Q4; 70B at Q5 with full context. | Excellent: 405B at Q4 with tensor parallel across 4 cards. | Excellent: frontier-scale models, full FP16, multi-tenant. |
| Sustained vs burst speed. Tok/s under continuous load (matters for agents and long contexts). | Limited: throttles within minutes; sustained ≈40-60% of burst. | Strong: holds 90%+ if cooled; the 4090 is known to hit its thermal cap on hot days. | Excellent: designed for sustained load; near-100% indefinitely. | Excellent: if your room AC can handle it; otherwise thermal-bound. | Excellent: sustained is the design point. |
| Power draw (typical inference). Wall power during a normal workload. | Excellent: 30-80 W; works on battery for short runs. | Strong: 200-450 W per GPU + 100 W system. | Acceptable: 300-500 W (RTX 6000 Ada is more efficient than a 4090). | Limited: 1-2 kW with 4 cards; needs a dedicated circuit. | Limited: 700+ W per GPU; rack-scale planning. |
| What breaks first. The failure mode that ends your weekend. | Acceptable: thermal throttle after 20-30 min sustained; battery wear if plugged in 24/7. | Acceptable: driver mismatch + Windows update + CUDA version drift. | Strong: same software issues as consumer; thermal is rarely the limit. | Limited: PSU + circuit breaker + summer thermals; SSD wear from constant model loads. | Strong: hardware is managed; software is the operator's problem. |
| Multi-user serving. Concurrent inference for a small team. | Poor: single-user only. | Limited: 2-4 concurrent on vLLM; quality of service degrades fast. | Acceptable: 10-20 concurrent on an RTX 6000; production-borderline. | Strong: vLLM tensor parallel across 4 cards; 30-60 concurrent feasible. | Excellent: hundreds to thousands; the design point. |
| Operator complexity. Hours per month maintaining the rig. | Excellent: effectively zero; macOS or Windows handles it. | Strong: 1-3 hours/month on driver/runtime updates. | Strong: same as consumer, plus the occasional ECC investigation. | Limited: 5-15 hours/month: cooling, restarts, kernel pinning, SSH access. | Limited: full SRE responsibility; you have a job now. |
| Privacy / offline capability. Can you run with the network unplugged? | Excellent: yes; smaller models work fine offline. | Excellent: yes; this is the design case for owning a GPU. | Excellent: yes. | Excellent: yes; an airgap is a real option for sensitive work. | Limited: network-dependent unless you own the rack. |
| $ to entry. Realistic 2026 acquisition cost. | Strong: $1.5-3.5k for a usable Apple Silicon machine; AMD Strix Halo similar. | Strong: $600-2.5k per card; full system $1.5-4k. | Limited: $5-12k system. | Limited: $8-20k+ depending on cards + cooling + UPS. | Poor: $30k+ per H100; rack-scale is 5-7 figures. |
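
To make the "largest model practical" row concrete, here is a back-of-envelope memory estimate for a quantized model. The formula (bits per weight × parameter count, plus a KV-cache term) is standard arithmetic; the specific figures below (4.5 effective bits/weight for a Q4-style quant, a 32B model with 64 layers, 8 KV heads, head dimension 128) are illustrative assumptions, not vendor specs.

```python
# Back-of-envelope memory for running a quantized model locally.
# Assumptions: Q4-style quants land around 4.5 bits/weight once scales and
# zero-points are counted; KV cache is fp16 with grouped-query attention.

def weights_gib(params_billion: float, bits_per_weight: float = 4.5) -> float:
    """Weight storage in GiB."""
    return params_billion * 1e9 * bits_per_weight / 8 / 2**30

def kv_cache_gib(layers: int, kv_heads: int, head_dim: int,
                 context_tokens: int, bytes_per_elem: int = 2) -> float:
    """KV cache in GiB: K and V, per layer, per KV head, per token."""
    return 2 * layers * kv_heads * head_dim * context_tokens * bytes_per_elem / 2**30

# Hypothetical 32B-class model (64 layers, 8 KV heads, head_dim 128) at Q4
# with an 8192-token context.
total = weights_gib(32) + kv_cache_gib(64, 8, 128, 8192)
print(f"~{total:.1f} GiB before runtime overhead")  # ≈ 19 GiB, well inside 64GB unified memory
```

Swap in 70 or 405 billion parameters and a longer context to see why the consumer, workstation, and homelab columns exist at all.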

Tier-jump tipping points

Laptop → consumer: you want a model larger than 32B, or you need sustained tok/s for agents that run for an hour at a time.
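
If you want to check where your current machine lands on the sustained-vs-burst row before spending anything, a crude test is to hammer a local server for a while and compare early throughput against late throughput. A minimal sketch, assuming a local OpenAI-compatible completions endpoint on localhost:8000 (llama.cpp's server and vLLM both expose one); the URL, model name, prompt, and run length are placeholders, and the `usage` field is only present on servers that report token counts.

```python
import time
import requests  # pip install requests

# Hypothetical local OpenAI-compatible server (llama.cpp server, vLLM, etc.).
URL = "http://localhost:8000/v1/completions"
BODY = {"model": "local-model",
        "prompt": "Write a long story about a robot.",
        "max_tokens": 512}

rates = []
start = time.time()
while time.time() - start < 20 * 60:            # hammer it for ~20 minutes
    t0 = time.time()
    resp = requests.post(URL, json=BODY, timeout=600).json()
    toks = resp["usage"]["completion_tokens"]   # assumes the server reports usage
    rates.append(toks / (time.time() - t0))

burst = sum(rates[:3]) / 3          # first few requests, cold and cool
sustained = sum(rates[-3:]) / 3     # last few requests, heat-soaked
print(f"burst ≈ {burst:.1f} tok/s, sustained ≈ {sustained:.1f} tok/s "
      f"({100 * sustained / burst:.0f}% of burst)")
```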

Consumer → workstation: you're running production inference for paying users, or you've had three driver-related Saturdays in a row.

Workstation → homelab: you want a model that needs >48 GB VRAM, or you're serving a small team and need vLLM tensor-parallel.
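
The tensor-parallel step itself is mostly a one-line change in vLLM. A minimal sketch; the checkpoint name is a placeholder for whatever quantized 70B-class model you actually run, and four identical cards are assumed.

```python
from vllm import LLM, SamplingParams  # pip install vllm

# Placeholder checkpoint: substitute the quantized model you actually serve.
# tensor_parallel_size=4 shards the weights across four cards; it must divide
# the model's attention-head count evenly, and identical GPUs keep life simple.
llm = LLM(model="hugging-quants/Meta-Llama-3.1-70B-Instruct-AWQ-INT4",
          tensor_parallel_size=4)

outputs = llm.generate(["Summarize the trade-offs of running LLMs at home."],
                       SamplingParams(max_tokens=256))
print(outputs[0].outputs[0].text)
```

For serving a team rather than a script, recent vLLM versions expose the same thing as an OpenAI-compatible server via `vllm serve <model> --tensor-parallel-size 4`.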

Homelab → datacenter: you have an actual SLA, or you're training, or you're running 405B+ frontier models. Otherwise stay homelab.
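
One quiet reason "stay homelab" has a ceiling is the wall circuit flagged in the power row above. A back-of-envelope check, assuming a US 120 V / 15 A branch circuit, the common 80% continuous-load rule of thumb, and made-up but plausible card numbers; swap in your own breaker rating and measured draw.

```python
# Will a multi-GPU box trip the breaker? Back-of-envelope only.
VOLTS, BREAKER_AMPS = 120, 15        # assumption: ordinary US household circuit
CONTINUOUS_DERATE = 0.8              # common rule of thumb for sustained loads

budget_w = VOLTS * BREAKER_AMPS * CONTINUOUS_DERATE   # 1440 W usable
gpus, per_gpu_w, system_w = 4, 350, 150               # assumed power-limited cards
draw_w = gpus * per_gpu_w + system_w                  # 1550 W

print(f"budget {budget_w:.0f} W vs draw {draw_w} W -> "
      f"{'needs a dedicated circuit' if draw_w > budget_w else 'fits'}")
```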