Laptop vs consumer GPU vs workstation vs homelab vs rack
Hardware tiers are not a continuum — each step up changes the operator complexity in a discrete way. This matrix surfaces what each tier can actually run day-to-day, what breaks first, and what it costs you in ongoing maintenance, not just purchase price.
| Dimension | Laptop iGPU (M-series / Strix Halo) | Consumer GPU (RTX 4070-4090, 7900 XTX) | Workstation (RTX 6000 / dual 4090) | Homelab rack (2-4× consumer + UPS) | Datacenter (H100/H200/B200) |
|---|---|---|---|---|---|
| **Largest model (4-bit) practical.** Roughly the biggest model you can actually use day-to-day (sizing sketch below the table). | **Limited.** ≤32B at Q4 on 64 GB unified memory; smaller is snappier. | **Strong.** 32B at Q4 with full context; 70B only at Q2-Q3 or with partial CPU offload on a 24-32 GB card. | **Excellent.** 70B at Q4-Q5 with full context on 48 GB; 120B-class at Q4 needs a 96 GB card. | **Excellent.** 100-200B-class models at Q4 with tensor parallel across 4 cards; 405B only with heavy CPU offload. | **Excellent.** Frontier-scale models, full FP16, multi-tenant. |
| **Sustained vs burst speed.** Tok/s under continuous load; this is what matters for agents and long contexts (throughput probe below the table). | **Limited.** Throttles within minutes; sustained is roughly 40-60% of burst. | **Strong.** Holds 90%+ if well cooled; a 4090 can still hit its thermal cap on hot days. | **Excellent.** Designed for sustained load; near-100% indefinitely. | **Excellent.** If your room's AC can keep up; otherwise thermal-bound. | **Excellent.** Sustained throughput is the design point. |
| **Power draw (typical inference).** Wall power during a normal workload (power-polling sketch below the table). | **Excellent.** 30-80 W; works on battery for short runs. | **Strong.** 200-450 W per GPU plus ~100 W for the rest of the system. | **Acceptable.** 300-500 W (an RTX 6000 Ada is more efficient than a 4090). | **Limited.** 1-2 kW with 4 cards; needs a dedicated circuit. | **Limited.** 700+ W per GPU; rack-scale power planning. |
| **What breaks first.** The failure mode that ends your weekend. | **Acceptable.** Thermal throttling after 20-30 min sustained; battery wear if plugged in 24/7. | **Acceptable.** Driver mismatch + Windows update + CUDA version drift. | **Strong.** Same software issues as consumer; thermals are rarely the limit. | **Limited.** PSU + circuit breaker + summer heat; SSD wear from constant model loads. | **Strong.** Hardware is managed for you; software is the operator's problem. |
| **Multi-user serving.** Concurrent inference for a small team. | **Poor.** Single user only. | **Limited.** 2-4 concurrent on vLLM; quality of service degrades fast. | **Acceptable.** 10-20 concurrent on an RTX 6000; borderline for production. | **Strong.** vLLM tensor parallel across 4 cards; 30-60 concurrent is feasible. | **Excellent.** Hundreds to thousands; this is the design point. |
| **Operator complexity.** Hours per month spent maintaining the rig. | **Excellent.** Effectively zero; macOS or Windows handles it. | **Strong.** 1-3 hours/month on driver and runtime updates. | **Strong.** Same as consumer, plus the occasional ECC investigation. | **Limited.** 5-15 hours/month: cooling, restarts, kernel pinning, SSH access. | **Limited.** Full SRE responsibility; you have a job now. |
| **Privacy / offline capability.** Can you run with the network unplugged? | **Excellent.** Yes; smaller models work fine offline. | **Excellent.** Yes; this is the design case for owning a GPU. | **Excellent.** Yes. | **Excellent.** Yes; an airgap is a real option for sensitive work. | **Limited.** Network-dependent unless you own the rack. |
| **$ to entry.** Realistic 2026 acquisition cost. | **Strong.** $1.5-3.5k for a usable Apple Silicon machine; AMD Strix Halo is similar. | **Strong.** $600-2.5k per card; $1.5-4k for a full system. | **Limited.** $5-12k for a system. | **Limited.** $8-20k+ depending on cards, cooling, and UPS. | **Poor.** $30k+ per H100; rack scale runs five to seven figures. |
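The "largest model" row is mostly arithmetic: quantized weight size is roughly parameter count × bits-per-weight ÷ 8, plus a KV cache that grows with context length. Here is a back-of-the-envelope sketch; the overhead constant, layer count, and KV width are assumptions typical of a 70B-class GQA model, not measured values for any specific checkpoint.

```python
def q4_footprint_gb(params_b: float, ctx_tokens: int = 8192,
                    n_layers: int = 80, kv_width: int = 1024,
                    bits_per_weight: float = 4.5) -> float:
    """Back-of-the-envelope memory estimate for a Q4-quantized dense model.

    params_b: parameters in billions (70 for a 70B model).
    bits_per_weight: ~4.5 covers typical Q4 GGUF variants (scales/zero-points).
    n_layers / kv_width: assumed values for a 70B-class GQA model; kv_width is
    n_kv_heads * head_dim, and the cache is stored in fp16 (2 bytes/value).
    """
    weights_gb = params_b * 1e9 * bits_per_weight / 8 / 1e9
    kv_cache_gb = 2 * n_layers * ctx_tokens * kv_width * 2 / 1e9  # keys + values
    runtime_overhead_gb = 1.5  # assumed: activations, CUDA/Metal buffers
    return weights_gb + kv_cache_gb + runtime_overhead_gb

print(f"{q4_footprint_gb(70):.1f} GB")  # ~44 GB: wants a 48 GB card, not a 24 GB one
print(f"{q4_footprint_gb(32):.1f} GB")  # ~22 GB with the same defaults; just fits 24 GB
```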
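Burst benchmarks hide throttling, which is why the sustained-vs-burst row exists. Below is a minimal probe that hammers a local OpenAI-compatible server (llama.cpp or vLLM style) for half an hour and shows whether tok/s sags as the chassis heats up; the endpoint URL, model name, and the presence of a `usage` block in the response are assumptions about your local setup.

```python
import time

import requests  # assumes a llama.cpp / vLLM style OpenAI-compatible server is running

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder URL for your local server
MODEL = "local-model"                              # placeholder model name

def tok_per_sec(prompt: str, max_tokens: int = 256) -> float:
    """One non-streaming generation; returns completion tokens per wall-clock second."""
    t0 = time.time()
    r = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": prompt, "max_tokens": max_tokens},
        timeout=600,
    )
    r.raise_for_status()
    completion_tokens = r.json()["usage"]["completion_tokens"]  # OpenAI-schema usage block
    return completion_tokens / (time.time() - t0)

# Run back-to-back for ~30 minutes; if the current rate sags well below the
# average of the first few runs, you are looking at sustained, not burst, speed.
rates: list[float] = []
start = time.time()
while time.time() - start < 30 * 60:
    rates.append(tok_per_sec("Summarize the history of the transistor."))
    early_avg = sum(rates[:5]) / min(len(rates), 5)
    print(f"run {len(rates):3d}: {rates[-1]:6.1f} tok/s (early avg {early_avg:.1f})")
```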
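The power column is easy to verify on NVIDIA hardware, since `nvidia-smi` reports per-board power draw (whole-system draw needs a wall meter, and Apple Silicon uses `powermetrics` instead). A small polling sketch; the ~100 W system overhead mentioned in the comment is an assumption, not something the tool reports.

```python
import subprocess
import time

def gpu_power_watts() -> list[float]:
    """Per-GPU board power, via nvidia-smi's CSV query interface."""
    out = subprocess.run(
        ["nvidia-smi", "--query-gpu=power.draw", "--format=csv,noheader,nounits"],
        capture_output=True, text=True, check=True,
    ).stdout
    return [float(line) for line in out.strip().splitlines()]

# Log once per second during an inference run (Ctrl-C to stop). Add roughly
# 100 W for CPU, fans, and PSU losses; that overhead figure is an assumption,
# not something nvidia-smi can report.
while True:
    readings = gpu_power_watts()
    print(f"{time.strftime('%H:%M:%S')}  total {sum(readings):6.1f} W  per-GPU {readings}")
    time.sleep(1)
```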
Tier-jump tipping points
- **Laptop → consumer:** you want a model larger than 32B, or you need sustained tok/s for agents that run for an hour at a time.
- **Consumer → workstation:** you're running production inference for paying users, or you've had three driver-related Saturdays in a row.
- **Workstation → homelab:** you want a model that needs more than 48 GB of VRAM, or you're serving a small team and need vLLM tensor parallelism (serving sketch below).
- **Homelab → datacenter:** you have an actual SLA, or you're training, or you're running 405B+ frontier models. Otherwise stay homelab.
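The concrete capability behind the workstation → homelab jump is tensor parallelism: vLLM shards one model's weights across every card, so a single large model fits where no single card could hold it. A minimal offline-batching sketch follows; the model name and sizes are placeholders, and the same split is available when serving over HTTP via `vllm serve <model> --tensor-parallel-size 4`.

```python
from vllm import LLM, SamplingParams

# Shard one model's weights across 4 GPUs. The model name is a placeholder;
# pick a quantized checkpoint that actually fits your combined VRAM.
llm = LLM(
    model="your-org/your-quantized-model",  # placeholder
    tensor_parallel_size=4,                 # one weight shard per card
    gpu_memory_utilization=0.90,
    max_model_len=16384,
)

params = SamplingParams(temperature=0.7, max_tokens=256)

# Throughput for a small team comes from batching: vLLM schedules all of these
# prompts concurrently rather than one after another.
prompts = [f"Write a one-line status summary for ticket #{i}." for i in range(32)]
for output in llm.generate(prompts, params):
    print(output.outputs[0].text.strip())
```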