Capability notes
Browser-side inference via WebGPU runs transformer models directly in a browser tab — no install, no server, no API key. Three frameworks dominate: **WebLLM** (Apache TVM + WebGPU), **Transformers.js** (ONNX Runtime Web + WebGPU), and **MLC-LLM** (WebGPU-native compilation). Chrome 130+, Edge 130+, and Opera ship WebGPU; Firefox is behind a flag; Safari support is incomplete.
**What works.** 1-3B parameter models at Q4 run at usable speeds on any WebGPU-capable GPU. On an [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) or [Intel Arc A770 16GB](/hardware/intel-arc-a770-16gb), Llama-3.2-3B Q4 via WebLLM reaches 40-55 tok/s. Integrated graphics deliver 15-25 tok/s — slower but functional. Context up to 4K is practical; 8K pushes against browser tab memory limits (~4GB WebGPU allocation on 16GB system RAM). Once downloaded, models cache in IndexedDB — subsequent loads initialize in 5-15 seconds. CDN caching via Cache-Control headers is the primary operator lever for improving multi-user first-load times.
**What does not work.** Models above ~3B. A 7B Q4 (4.5GB) plus KV cache exceeds the browser's GPU memory budget and triggers OOM. Vision-language models at usable resolution need more GPU memory than a tab is granted. Multi-tab inference crashes the second tab because tabs share a GPU memory pool. WebGPU runs at 60-80% of native Vulkan/CUDA on the same hardware; the resulting slowdown on 7B (10 tok/s in-browser vs 18 tok/s native) makes the browser path unusable for real-time chat above 3B.
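The 7B OOM claim above can be checked with back-of-envelope arithmetic. The KV-cache geometry below (32 layers, 32 heads, head dim 128, fp16) is an illustrative assumption for a 7B Llama-class model, not a measurement:

```javascript
// Why 7B Q4 overflows a ~4GB WebGPU tab budget.
const GiB = 1024 ** 3;

function kvCacheBytes({ layers, heads, headDim, contextLen, bytesPerValue }) {
  // Two tensors (K and V) per layer, one vector per head per position
  return 2 * layers * heads * headDim * contextLen * bytesPerValue;
}

const budget = 4 * GiB;     // typical per-tab WebGPU allocation on 16GB RAM
const weights7B = 4.5 * GiB; // 7B at Q4, per the figure above
const kv = kvCacheBytes({
  layers: 32, heads: 32, headDim: 128,
  contextLen: 4096, bytesPerValue: 2, // fp16
});

console.log((kv / GiB).toFixed(2));   // "2.00" — KV cache alone at 4K context
console.log(weights7B + kv > budget); // true — weights + cache blow the budget
```

A 3B Q4 model (~1.9GB) plus a proportionally smaller cache fits with room to spare, which is why ~3B is the practical ceiling.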
**First-load model download** is the UX bottleneck. A 3B Q4 model is 1.8-2.5GB. Download on 50 Mbps takes 5-8 minutes; 10 Mbps takes 25-40 minutes. The entire model must download before inference starts — no streaming. WebLLM caches weights in IndexedDB (Chrome storage quota is 60% of free disk). The privacy model: inference is client-side only, verifiable via Chrome DevTools Network tab. For regulated environments that ban native app installs but permit browser access, WebGPU is the only zero-friction AI path.
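The download-time ranges above follow from simple bandwidth arithmetic; a minimal estimator (illustrative only — real throughput varies with CDN and TCP ramp-up):

```javascript
// First-load wait: model size (decimal GB) vs. link speed (Mbps).
function downloadMinutes(modelGB, linkMbps) {
  const bits = modelGB * 8e9;          // GB to bits
  return bits / (linkMbps * 1e6) / 60; // seconds to minutes
}

console.log(downloadMinutes(2.0, 50).toFixed(1)); // "5.3" — inside the 5-8 min range
console.log(downloadMinutes(2.0, 10).toFixed(1)); // "26.7" — why 10 Mbps is painful
```

Showing this estimate on the loading screen (size ÷ measured throughput) is cheap and measurably reduces abandonment.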
If you just want to try this
Lowest-friction path to a working setup.
Open Chrome 130+ and navigate to the [WebLLM demo](https://chat.webllm.ai/). No install, no terminal, no API keys.
Step 1: Select **Llama-3.2-3B (Q4f16)** from the model picker — the WebLLM team's recommended starter at 1.9GB. Chrome shows a download progress bar. Wait 5-8 minutes on a 50 Mbps connection.
Step 2: After download, WebLLM compiles WebGPU shaders. First compile takes 10-20 seconds. Subsequent visits use cached shaders and load in under 5 seconds.
Step 3: Type a prompt. Inference runs locally. Verify offline: disconnect Wi-Fi, reload, load the model from cache, send a prompt — it works with zero network.
Alternate: [Transformers.js playground](https://huggingface.co/docs/transformers.js) uses ONNX Runtime Web with WebGPU. Token rates are 10-20% lower than WebLLM on equivalent models due to ONNX overhead, but the Hugging Face catalog is much larger.
What you get: a browser-tab chatbot. Response quality on Llama-3.2-3B is identical to native — same weights, same logic. Speed is 60-80% of native and context caps at 4K vs 8K+ native. For single-turn Q&A and short summarization, the experience is indistinguishable from cloud chatbots.
Hardware: any GPU with Chrome WebGPU support — NVIDIA GTX 10-series+, AMD RDNA 1+, [Intel Arc](/hardware/intel-arc-a770-16gb), [Apple M1+](/hardware/apple-m4-pro), Intel UHD 630+. Integrated graphics deliver 12-18 tok/s on 3B Q4. Phones and tablets are not supported; WebLLM falls back to WebAssembly CPU on mobile (10-20x slower, unusable).
For production deployment
Operator-grade recommendation.
A browser-AI strategy delivers AI access with zero infrastructure, zero installs, and no data leaving the device. WebGPU inference is viable under specific constraints.
**When browser inference is correct.** Training environments needing instant AI without IT-administered software. Internal tools where a browser tab is the universal interface. Low-stakes tasks (summarization, classification, rewriting) where 3B quality is sufficient. Air-gapped environments where the browser is the only permitted runtime. Economics: zero server cost, zero API cost, zero maintenance.
**Model serving.** Host weights on a CDN (Cloudflare R2, S3 + CloudFront) with aggressive Cache-Control (immutable, max-age=31536000). WebLLM handles IndexedDB caching. For 500+ users, 2GB × 500 = 1TB bandwidth, negligible cost. Implement CSP headers to lock the page: no external scripts, no analytics. Serve the WebLLM library from your CDN for supply-chain integrity. Serve everything from controlled origins.
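One way to express those headers, sketched as an nginx config; paths are placeholders, and depending on the WebLLM build you may also need `'wasm-unsafe-eval'` in `script-src` for its WASM runtime:

```nginx
# Immutable caching for model shards; weights never change under a given URL.
location /models/ {
    add_header Cache-Control "public, max-age=31536000, immutable";
}

# Lock the page to controlled origins: no external scripts, no analytics.
location / {
    add_header Content-Security-Policy "default-src 'self'; script-src 'self'; connect-src 'self'; worker-src 'self'";
}
```

The same two policies translate directly to Cloudflare R2 custom headers or CloudFront response-header policies.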
**Browser compatibility.** Chrome 130+ and Edge 130+ support WebGPU with full compute shaders (subgroups, fp16 storage). Firefox 135+ has WebGPU behind a flag — do not depend on Firefox for production. Safari 18+ has WebGPU but missing subgroups and fp16 storage buffer support breaks WebLLM's attention path. Chrome/Edge only for production in 2026.
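Rather than sniffing user agents, probe for the two WebGPU features the attention path needs. The decision logic below is a pure helper; the browser calls are sketched in comments (`showMessage` is a hypothetical UI hook):

```javascript
// Which required WebGPU features are missing from an adapter's feature set?
function missingFeatures(supported) {
  const needed = ["shader-f16", "subgroups"]; // WebGPU feature names
  return needed.filter((f) => !supported.has(f));
}

// In the browser (sketch):
// const adapter = await navigator.gpu?.requestAdapter();
// if (!adapter) showMessage("WebGPU unavailable — use Chrome/Edge 130+");
// else {
//   const gaps = missingFeatures(adapter.features);
//   if (gaps.length) showMessage(`Unsupported browser, missing: ${gaps.join(", ")}`);
// }

console.log(missingFeatures(new Set(["shader-f16"]))); // [ 'subgroups' ]
```

This catches Safari's gaps at page load instead of surfacing them as a validation error mid-inference.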
**Tab background throttling.** Chrome throttles background tab timers to 1 Hz after 5 minutes. While WebGPU dispatch isn't directly throttled, the JavaScript event loop that feeds prompts is. Inference in a background tab stalls. Resolution: keep the inference tab as the active window. Do not architect for background-tab inference unless using a Chrome extension with persistent background permission.
**Operational limits.** Maximum model size: ~3B Q4 on 16GB systems, ~1.5B on 8GB. 7B models OOM in-browser; for 7B+ native inference see [RTX 5090](/hardware/rtx-5090) or [H100](/hardware/nvidia-h100-pcie). Token rate drops 20-40% when the user scrolls or interacts during inference (compositor thread shares GPU). For max token rate, disable UI during inference.
**Enterprise deployment.** Host a static SPA. WebLLM loads from your CDN, model weights from your CDN, all inference client-side. Use a Service Worker to pre-cache weights on first visit. Never send a Clear-Site-Data header, which would wipe the cached weights. Total cost: static hosting + CDN bandwidth, zero per-query. For 5,000 users at 20 queries/day, savings vs the GPT-4o-mini API (input ~$0.15/1M tokens) is ~$5,500/month.
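One way to reproduce that savings figure. The per-query token counts are assumptions (chat context grows quickly across turns), and the output price of $0.60/1M is the typical companion rate for a $0.15/1M-input model:

```javascript
// Monthly API cost avoided by moving inference into the browser.
const users = 5000, queriesPerDay = 20, days = 30;
const inputTok = 4000, outputTok = 2000;           // assumed per query
const inPrice = 0.15 / 1e6, outPrice = 0.60 / 1e6; // $ per token

const queries = users * queriesPerDay * days;      // 3,000,000 per month
const monthly = queries * (inputTok * inPrice + outputTok * outPrice);
console.log(Math.round(monthly));                  // 5400 — in line with ~$5,500/month
```

Halve the token assumptions and the savings halve too; the conclusion (zero per-query vs thousands per month) holds across any plausible usage profile.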
What breaks
Failure modes operators see in the wild.
- **First-load model download time.** A 2GB model takes 4-15 minutes on a home connection. No streaming — full download before inference. Symptom: 30-40% abandon rate on loading screen. Mitigation: Service Worker pre-caching on first visit. Display estimated download time. For corporate deployments, serve weights from LAN CDN for sub-minute download.
- **WebGPU shader compilation stalls.** First inference compiles compute shaders synchronously on the main thread for 5-20 seconds. Symptom: page appears frozen after model loads; browser may show "page unresponsive." Mitigation: run a 1-token warm-up inference before showing chat UI. Display a "compiling shaders" progress indicator.
- **Browser tab background throttling.** Chrome throttles background tab timers to 1 Hz after 5 minutes. Symptom: user switches tabs during inference, returns 10 minutes later to find 2 tokens generated. Mitigation: design for foreground-only inference. If background is required, use a Chrome extension with persistent service worker — increases deployment complexity.
- **Safari WebGPU incomplete.** Missing subgroups and fp16 storage buffer support. Symptom: WebLLM fails with validation error or runs at <5% Chrome performance. Mitigation: detect Safari and display "Switch to Chrome/Edge." Safari's WebGPU team targets subgroups for Safari 20 (late 2026).
- **Shared GPU memory with browser rendering.** Compositor thread competes with WebGPU compute for GPU time. Symptom: token rate drops 20-40% when user scrolls during streaming. Mitigation: batch token rendering (5 at a time). Pause rendering during scroll. Disable UI interactivity during inference for max throughput.
- **IndexedDB storage eviction.** Chrome evicts IndexedDB when disk space is low (<2GB). Symptom: cached model deleted without warning; user faces re-download. Mitigation: request persistent storage via `navigator.storage.persist()`. Verify model checksum on page load; auto-re-download if cache missing.
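The eviction mitigation in the last bullet can be sketched as follows. The warning threshold is a pure function so it is testable anywhere; the browser calls are commented, and `showStorageWarning` is a hypothetical UI hook:

```javascript
// Warn before caching if remaining quota can't hold the model plus 10% headroom.
function shouldWarnLowStorage(quotaBytes, usageBytes, modelBytes) {
  return quotaBytes - usageBytes < modelBytes * 1.1;
}

// In the browser (sketch):
// const persisted = await navigator.storage.persist(); // ask Chrome not to evict
// const { quota, usage } = await navigator.storage.estimate();
// if (shouldWarnLowStorage(quota, usage, 2e9)) showStorageWarning();

console.log(shouldWarnLowStorage(10e9, 9e9, 2e9)); // true — 1GB free < 2.2GB needed
```

`persist()` returns whether the browser granted the request; when it refuses (no install prompt engagement, low disk), fall back to checksum-on-load plus silent re-download.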
Hardware guidance
**Hobbyist: Integrated graphics laptop (Intel UHD 630+, AMD Vega+)**
WebGPU runs on integrated graphics. On Intel UHD 630 (8th-gen Core i5, 8GB), Llama-3.2-1B Q4 runs at 10-15 tok/s. 3B models are borderline (5-8 tok/s) and may OOM on 8GB. Hardware floor: a Chromebook-class laptop runs an LLM in-browser. Education and personal experimentation only.
**Hobbyist: Entry dGPU (RTX 3050, RX 6500M, Arc A370M)**
Any 4GB+ dGPU delivers 25-40 tok/s on 3B Q4. Dedicated VRAM eliminates system RAM contention. An [Intel Arc A770 16GB](/hardware/intel-arc-a770-16gb) runs 3B Q4 at 40-50 tok/s. This is the enthusiast sweet spot — in-browser AI feels like a native app.
**SMB: Standardized dGPU laptop fleet**
Deploy WebLLM as an internal web app on Chrome. IT provisions nothing beyond the browser. Push URL via group policy. Each user downloads the 2GB model once; subsequent accesses hit IndexedDB. Infrastructure: one static page server + one CDN bucket. For 50-200 users with dGPU laptops, this is the cheapest possible AI deployment. Constraint: all users need dGPUs and Chrome 130+.
**Enterprise: Air-gapped browser AI**
For classified/air-gapped environments where no software install is permitted but Chrome is available. Host WebLLM + weights on internal origin inside the enclave. Service Worker pre-caches models. All inference client-side; no data crosses the network boundary. Satisfies "no executable code installation" policy — JavaScript runs in the browser sandbox. Quality ceiling is 3B — sufficient for document summarization and classification in contexts where local AI beats zero AI.
**Frontier: Not applicable**
WebGPU cannot run 70B+ models. The ~4GB GPU memory ceiling makes >3B impractical. Frontier inference lives on [RTX 5090](/hardware/rtx-5090) or [H100](/hardware/nvidia-h100-pcie). WebGPU is the zero-install complement for small, private inference — not a frontier platform.
Runtime guidance
**Best token rate on Chrome/Edge with minimal setup? → WebLLM**
[WebLLM](https://github.com/mlc-ai/web-llm) uses Apache TVM compiled to WebGPU. TVM optimizes attention kernels for WebGPU constraints (workgroup size, shared memory). Token rates: 40-55 tok/s on [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb), 40-50 tok/s on [Intel Arc A770 16GB](/hardware/intel-arc-a770-16gb). Curated catalog of ~30 popular models (Llama, Mistral, Phi, Gemma, Qwen). Automatic download, IndexedDB caching, and shader compilation. For most use cases, WebLLM balances performance and ease of deployment.
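A minimal WebLLM flow looks like the sketch below. It runs only in a WebGPU-capable browser; the model id is assumed from WebLLM's prebuilt catalog, so verify it against the current list before shipping:

```javascript
// Download (or load from IndexedDB), compile shaders, then chat.
const MODEL_ID = "Llama-3.2-3B-Instruct-q4f16_1-MLC"; // assumed catalog id

function buildRequest(prompt) {
  return { messages: [{ role: "user", content: prompt }], temperature: 0.7 };
}

async function run(prompt) {
  const { CreateMLCEngine } = await import("@mlc-ai/web-llm");
  const engine = await CreateMLCEngine(MODEL_ID, {
    initProgressCallback: (p) => console.log(p.text), // download/compile progress
  });
  const reply = await engine.chat.completions.create(buildRequest(prompt));
  return reply.choices[0].message.content;
}
```

The `chat.completions.create` call mirrors the OpenAI client shape, so swapping a cloud backend for in-browser inference is largely a drop-in change.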
**Hugging Face model catalog with WebGPU? → Transformers.js**
[Transformers.js](https://github.com/huggingface/transformers.js) wraps Hugging Face Transformers via ONNX Runtime Web. Any ONNX-exported model works in-browser. Token rates 10-20% below WebLLM due to ONNX-to-WebGPU overhead. Benefit: 1,000+ models including specialized architectures (BERT, T5, Whisper, CLIP) that WebLLM lacks. For multi-model pipelines (embedding + generation + classification), Transformers.js is the only unified WebGPU framework.
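The Transformers.js equivalent is similarly compact (sketch). `device: "webgpu"` selects the WebGPU backend and `dtype: "q4"` requests 4-bit weights; the model id is an assumption, so substitute any ONNX-exported text-generation model from the Hub:

```javascript
// Pipeline-style generation with the WebGPU backend.
const CONFIG = { device: "webgpu", dtype: "q4" }; // assumed options for this model

async function generate(prompt) {
  const { pipeline } = await import("@huggingface/transformers");
  const generator = await pipeline(
    "text-generation",
    "onnx-community/Llama-3.2-1B-Instruct", // assumed Hub model id
    CONFIG,
  );
  const out = await generator(prompt, { max_new_tokens: 128 });
  return out[0].generated_text;
}
```

The same `pipeline()` call also covers the non-LLM tasks mentioned above (embedding, classification, Whisper transcription), which is the framework's main draw for multi-model pipelines.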
**Maximum optimization + Firefox/Safari roadmap? → MLC-LLM**
MLC-LLM is the native compilation engine behind WebLLM. Direct use gives model compilation control: specify workgroup sizes, tune kernel fusion, produce per-model WebGPU binaries. Highest achievable token rate. MLC team works directly with browser vendors on WebGPU spec compliance — the most active pathway toward Firefox and Safari support.
**Comparison:**
- Token rate (3B Q4, RTX 4060 Ti): WebLLM 40-55 tok/s, Transformers.js 30-45, MLC-LLM 45-60
- Model catalog: WebLLM ~30, Transformers.js 1,000+, MLC-LLM (compile any model)
- Setup: WebLLM 1-line import, Transformers.js 3-line pipeline, MLC-LLM compile + configure
- Library size: WebLLM ~15MB, Transformers.js ~8MB, MLC-LLM ~12MB
- Browser: Chrome/Edge for all; partial Firefox/Safari CPU fallback for Transformers.js