Capability notes
Open-weight LLMs at the 7B scale produce functional chat and single-step instruction-following at 50–100 tok/s on consumer GPUs but degrade below 85% on TruthfulQA and fail multi-step reasoning. The 32B class ([Qwen 3 32B](/models/qwen-3-32b), [Command R+ 08-2024](/models/command-r-plus-08-2024), [Gemma 4 31B](/models/gemma-4-31b)) is the practical sweet spot — strong instruction-following across 20+ languages, solid JSON-mode reliability, and 32K–128K context handling. The 70B dense class ([Llama 3.3 70B](/models/llama-3-3-70b)) dominates the cost/capability Pareto frontier at 94.5% IFEval reliability with mature tooling across every inference engine. The frontier MoE tier ([DeepSeek V4](/models/deepseek-v4), [DeepSeek V3](/models/deepseek-v3), [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b)) pushes into GPT-4-class capability at 10–50× lower active-parameter cost per token but introduces MoE routing latency variance and serving complexity.
Quality differentiators that matter operationally: instruction-following reliability (IFEval score above 94% for 70B+), long-context degradation measured by RULER at 32K/64K/128K, structured output reliability as % valid JSON under grammar constraints, and multilingual benchmarks like MMLU-pro and Global MMLU. The gap between 70B dense and frontier MoE narrows yearly. [Llama 3.3 70B](/models/llama-3-3-70b) in 2026 performs at roughly GPT-4-0314 levels on LMSys Chatbot Arena ELO (~1250), while frontier MoE models score within 20 ELO points of GPT-5 mini on reasoning benchmarks. For general text workloads — email drafting, summarization, data extraction, Q&A, content generation — a well-tuned 70B model with 16K+ context covers 90% of production use cases. Frontier MoE matters for competitive reasoning benchmarks and when prompt complexity demands above-70B reasoning depth.
Specialized instruction-tuned variants ([Aya Expanse 32B](/models/aya-expanse-32b) for 23-language support, [Command R 35B](/models/command-r-35b) for tool-use alignment) outperform general models within narrow domains but underperform on broad benchmarks. The operational reality: run the smallest model that meets your quality threshold. Each parameter class above your need costs 2–3× in VRAM and latency with diminishing quality returns.
If you just want to try this
Lowest-friction path to a working setup.
Download [Ollama](/tools/ollama) (ollama.com/download for Windows/macOS/Linux), install it, then open a terminal and run:
```
ollama run llama3.3:70b-instruct-q4_K_M
```
This pulls [Llama 3.3 70B](/models/llama-3-3-70b) at 4-bit quantization (~40 GB). You need a GPU with at least 16 GB VRAM — the model will partially offload to GPU with the remainder on CPU/RAM, producing roughly 8–15 tok/s on an [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) or [RTX 4070](/hardware/rtx-4070). On an [RTX 4090](/hardware/rtx-4090) with 24 GB VRAM, the full model fits and you get 22–28 tok/s. On a [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) at 128 GB unified, the model runs entirely on the SoC at 25–35 tok/s.
If 70B is too large for your hardware, step down to the 32B tier:
```
ollama run qwen3:32b
```
[Qwen 3 32B](/models/qwen-3-32b) at Q4 uses ~19 GB and runs at 40–55 tok/s on any GPU with 12 GB+ VRAM. The quality gap from 32B to 70B is noticeable on complex multi-step prompts and non-English languages but the 32B model handles 80% of general chat tasks.
For a GUI, install [LM Studio](/tools/lm-studio), search "llama-3.3-70b" or "qwen-3-32b" in the in-app model browser, pick a GGUF quant (Q4_K_M for the VRAM/quality best-fit), and start chatting. LM Studio handles GPU offload automatically and shows VRAM usage before loading.
Avoid the 7B tier for general text generation unless your hardware is strictly limited to 6–8 GB VRAM. The quality gap from 7B to 32B is large enough that spending $300 on a used [RTX 3060 12GB](/hardware/rtx-3060-12gb) to comfortably run 32B models pays for itself in output quality.
For production deployment
Operator-grade recommendation.
Production text generation serving uses [vLLM](/tools/vllm) (0.7.0+) with FP8 quantization for 70B dense models or FP8 MoE routing for [DeepSeek V4](/models/deepseek-v4)-class frontier models. Throughput numbers on [H100 PCIe](/hardware/nvidia-h100-pcie) at batch=1: [Llama 3.3 70B](/models/llama-3-3-70b) at FP8 delivers ~140 tok/s per stream, [Qwen 3 32B](/models/qwen-3-32b) at FP8 delivers ~280 tok/s, and [Qwen 3 30B-A3B](/models/qwen-3-30b-a3b) at FP8 delivers ~420 tok/s. Under continuous batching with 16 concurrent requests, aggregate throughput on a single H100 PCIe hits 2500–3500 tok/s for 70B FP8.
Multi-tenant isolation requires separating inference pools. Run distinct vLLM instances per tenant on dedicated GPU groups (e.g., 2× H100 PCIe per tenant for 70B models) rather than multiplexing a single instance across tenants with API-key routing. The latter leaks prompt timing information and risks cross-tenant KV cache interference under memory pressure. vLLM's prefix caching reduces latency on multi-tenant workloads with shared system prompts — a 30–40% latency reduction on customer-facing chatbots with common preamble text.
Use [SGLang](/tools/sglang) when you need structured output (grammar-constrained generation with FSM-based decoding) or radix-aware prefix caching that outperforms vLLM's prefix cache by 10–20% on workloads with high prompt overlap. Use [TensorRT-LLM](/tools/tensorrt-llm) when serving at NVIDIA-datacenter scale with inflight batching on [H200](/hardware/nvidia-h200) or [B200](/hardware/nvidia-b200) — TensorRT-LLM squeezes 15–25% more throughput than vLLM on the same NVIDIA hardware through engine-level optimizations, at the cost of a complex build pipeline and ~4-hour model compilation time.
When NOT to self-host: if your QPS is under 5 and latency tolerance is above 2 seconds, API providers (OpenRouter, Together, Anthropic) are cheaper than provisioning a dedicated GPU. At 5–20 QPS with bursty traffic, a single H100 PCIe at ~$2.50/hr on Runpod breaks even with API costs. Above 20 sustained QPS, self-hosted with reserved instances (~$1.80/hr on Lambda Labs) wins on $/token by 30–60% vs API pricing for 70B-class models.
Critical operational metrics to monitor: time-to-first-token (TTFT) p50/p95/p99, inter-token latency variance (jitter above 50ms degrades streaming UX), KV cache hit rate (above 60% for shared-prompt workloads), and GPU memory pressure (keep VRAM below 85% utilization to avoid OOM spikes under load bursts). Set max-model-len to your actual maximum prompt+completion to avoid overallocation. Disable vLLM's default automatic prefix caching on models under 10K tokens context — the overhead exceeds the benefit.
What breaks
Failure modes operators see in the wild.
**KV cache exhaustion under load.** Symptom: OOM crashes or 10–30× latency spikes when concurrent requests exceed provisioned KV cache slots. Cause: each request reserves KV blocks proportional to max-model-len, not actual prompt length. Mitigation: set max-model-len to actual maximum expected prompt+completion (not 128K), enforce per-request token limits via API gateway, and provision KV cache at 1.2× peak concurrent request count × average tokens/request.
**Repetition loops.** Symptom: model outputs the same phrase or paragraph continuously without progressing. Cause: temperature=0 with greedy decoding hitting a local maximum in token probability space. Mitigation: use temperature=0.1 minimum for production, apply repetition penalty 1.05–1.10, and set frequency penalty 0.1–0.3. Greedy-only (temperature=0) deterministic sampling creates brittle output paths.
**Context window overflow.** Symptom: model ignores or contradicts content in the second half of long prompts. Cause: after 50–70% of context window, most models exhibit "lost-in-the-middle" attention degradation where middle-passage tokens receive ~40% lower attention weight than beginning or end tokens. Mitigation: place critical instructions at the start AND end of prompts, trim context to 80% of nominal window, and use [RULER](https://github.com/hsiehjackson/RULER) evaluation to measure your specific model's context utilization curve.
**Prompt injection at scale.** Symptom: user inputs override system instructions (e.g., "ignore all previous instructions"). Cause: LLMs treat all input tokens with equal authority — no architectural boundary exists between system prompt and user input. Mitigation: wrap user inputs in XML-style delimiters (`<user_input>...</user_input>`), apply a separate classification model to detect injection patterns before the generation model receives input, and never concatenate raw user text into system prompts.
**Temperature=0 non-determinism.** Symptom: identical prompts produce different outputs across identical model versions and hardware. Cause: floating-point non-determinism from CUDA kernel scheduling, batch-size-dependent attention path differences, and GPU architecture-specific floating-point reduction ordering. Mitigation: accept that true determinism is not achievable at reasonable throughput. For compliance, implement output caching on prompt + model hash and validate post-hoc rather than relying on runtime determinism.
Hardware guidance
**Hobbyist tier ($300–900 GPU).** [RTX 3060 12GB](/hardware/rtx-3060-12gb) runs 7B–13B models fully on GPU at 50–80 tok/s. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) runs 32B Q4 with partial offload at 15–25 tok/s. [Intel Arc B580](/hardware/intel-arc-b580) at 12 GB runs 7B–13B via [llama.cpp](/tools/llama-cpp) with SYCL backend — functional but 30–40% slower than equivalent NVIDIA at similar VRAM. [MacBook Pro 16" M4 Max](/hardware/macbook-pro-16-m4-max) at 64 GB unified runs 70B Q4 fully on SoC at 25–35 tok/s — the only laptop-class device that genuinely runs 70B.
**SMB tier ($1,500–3,000 GPU).** [RTX 4090](/hardware/rtx-4090) at 24 GB runs 70B Q4 at 22–28 tok/s with moderate offload, or 32B FP16 fully on GPU at 45–60 tok/s. [RTX 5090](/hardware/rtx-5090) at 32 GB with 1.79 TB/s bandwidth runs 70B Q4 fully on GPU (no offload) at 40–55 tok/s — the fastest single-consumer-card 70B inference path. Dual [RTX 3090](/hardware/rtx-3090) (2× 24 GB = 48 GB pool) runs 70B Q4 with no offload at 25–35 tok/s aggregate — roughly $1,500 used for a 48 GB VRAM pool.
**Enterprise tier ($8,000–25,000 per card).** [NVIDIA L40S](/hardware/nvidia-l40s) at 48 GB with 864 GB/s bandwidth runs 70B FP8 at 80–100 tok/s in a datacenter form factor. [RTX 6000 Ada](/hardware/rtx-6000-ada) at 48 GB with 960 GB/s is the workstation-datacenter hybrid for on-prem 70B serving at 90–110 tok/s. [NVIDIA A100 80GB SXM](/hardware/nvidia-a100-80gb-sxm) at 2.0 TB/s runs 70B FP16 at 130–150 tok/s — the baseline enterprise 70B serving card.
**Frontier tier ($25,000+ per card).** [H100 PCIe](/hardware/nvidia-h100-pcie) at 80 GB with 2.0 TB/s runs 70B FP8 at 140–170 tok/s and frontier MoE models at 40–70 tok/s per stream. [H200](/hardware/nvidia-h200) at 141 GB with 4.8 TB/s bandwidth is the current 2026 throughput king for text generation — 200–250 tok/s on 70B FP8, and the 141 GB VRAM enables full MoE models at FP8 without offload. [B200](/hardware/nvidia-b200) at 180 GB with 8 TB/s doubles H200 throughput for MoE workloads. [AMD MI300X](/hardware/amd-mi300x) at 192 GB with 5.3 TB/s is the VRAM ceiling pick — fits every open-weight model at FP8/FP16, throughput competitive with H100 via vLLM's ROCm backend. [Apple M3 Ultra](/hardware/apple-m3-ultra) with 192–256 GB unified memory runs 70B FP16 entirely on SoC at 30–50 tok/s — a niche for researchers needing memory ceiling without a datacenter.
Runtime guidance
For interactive use with zero config: [Ollama](/tools/ollama). One binary handles model download, GPU detection, quantization, and OpenAI-compatible API. Tradeoff: no continuous batching, no prefix caching, single concurrent request default, throughput caps at ~60% of [vLLM](/tools/vllm) on the same hardware.
For production serving: [vLLM](/tools/vllm) is the default. PagedAttention KV cache, continuous batching, FP8 quantization, and prefix caching deliver 2–3× Ollama throughput on identical hardware. Supports FP8 on NVIDIA Ada/Hopper/Blackwell and AMD MI300X. Setup is ~30 minutes vs Ollama's 2 minutes. Multi-node tensor parallelism and pipeline parallelism are vLLM-native.
For structured output: [SGLang](/tools/sglang) outperforms vLLM on grammar-constrained decoding (FSM-based) and radix-aware prefix caching — it matches prefix substrings across requests, not just full prompts. Better for JSON-mode extraction with schema constraints. Tradeoff: smaller ecosystem, thinner documentation, model support lags vLLM by 3–6 months.
For CPU-only or hybrid CPU/GPU: [llama.cpp](/tools/llama-cpp) with GGUF quantization. Runs on any CPU (x86, ARM, Apple Silicon) with GPU offload via CUDA/Metal/Vulkan/SYCL. CPU-only 7B Q4 = 15–25 tok/s; 70B Q4 = 3–8 tok/s. No continuous batching, single-request-per-process. Use when GPU hardware is unavailable or for broadest platform support.
[tensorrt-llm](/tools/tensorrt-llm) is the highest-throughput option for NVIDIA-only datacenter — 15–25% faster than vLLM via engine-level graph optimizations. Tradeoff: 2–6 hour model compilation, GPU-architecture-specific engines, NVIDIA Container Toolkit + TensorRT SDK toolchain. Use only when the throughput gain justifies the complexity — typically 50+ concurrent streams across 4+ GPUs.