Capability notes
Browser-side inference via WebGPU runs transformer models directly in a browser tab — no install, no server, no API key. Three frameworks dominate: **WebLLM** (Apache TVM + WebGPU), **Transformers.js** (ONNX Runtime Web + WebGPU), and **MLC-LLM** (WebGPU-native compilation). Chrome 130+, Edge 130+, and Opera ship WebGPU; Firefox is behind a flag; Safari support is incomplete.
**What works.** 1-3B parameter models at Q4 run at usable speeds on any WebGPU-capable GPU. On an [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) or [Intel Arc A770 16GB](/hardware/intel-arc-a770-16gb), Llama-3.2-3B Q4 via WebLLM reaches 40-55 tok/s. Integrated graphics deliver 15-25 tok/s — slower but functional. Context up to 4K is practical; 8K pushes against browser tab memory limits (~4GB WebGPU allocation on 16GB system RAM). Once downloaded, models cache in IndexedDB — subsequent loads initialize in 5-15 seconds. CDN caching via Cache-Control headers is the primary operator lever for improving multi-user first-load times.
**What does not work.** Models above ~3B. A 7B Q4 (4.5GB) plus KV cache exceeds the browser's GPU memory budget and triggers OOM. Vision-language models at usable resolution need more GPU memory than a tab is granted. Multi-tab inference crashes the second tab because tabs share a GPU memory pool. WebGPU runs at 60-80% of native Vulkan/CUDA on the same hardware; the resulting slowdown on 7B (10 tok/s in-browser vs 18 tok/s native) makes the browser path unusable for real-time chat above 3B.
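The 7B OOM claim above can be checked with back-of-envelope arithmetic. The KV-cache geometry below (32 layers, 32 heads, head dim 128, fp16) is an illustrative assumption for a 7B Llama-class model, not a measurement:

```javascript
// Why 7B Q4 overflows a ~4GB WebGPU tab budget.
const GiB = 1024 ** 3;

function kvCacheBytes({ layers, heads, headDim, contextLen, bytesPerValue }) {
  // Two tensors (K and V) per layer, one vector per head per position
  return 2 * layers * heads * headDim * contextLen * bytesPerValue;
}

const budget = 4 * GiB;     // typical per-tab WebGPU allocation on 16GB RAM
const weights7B = 4.5 * GiB; // 7B at Q4, per the figure above
const kv = kvCacheBytes({
  layers: 32, heads: 32, headDim: 128,
  contextLen: 4096, bytesPerValue: 2, // fp16
});

console.log((kv / GiB).toFixed(2));   // "2.00" — KV cache alone at 4K context
console.log(weights7B + kv > budget); // true — weights + cache blow the budget
```

A 3B Q4 model (~1.9GB) plus a proportionally smaller cache fits with room to spare, which is why ~3B is the practical ceiling.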
**First-load model download** is the UX bottleneck. A 3B Q4 model is 1.8-2.5GB. Download on 50 Mbps takes 5-8 minutes; 10 Mbps takes 25-40 minutes. The entire model must download before inference starts — no streaming. WebLLM caches weights in IndexedDB (Chrome storage quota is 60% of free disk). The privacy model: inference is client-side only, verifiable via Chrome DevTools Network tab. For regulated environments that ban native app installs but permit browser access, WebGPU is the only zero-friction AI path.
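The download-time ranges above follow from simple bandwidth arithmetic; a minimal estimator (illustrative only — real throughput varies with CDN and TCP ramp-up):

```javascript
// First-load wait: model size (decimal GB) vs. link speed (Mbps).
function downloadMinutes(modelGB, linkMbps) {
  const bits = modelGB * 8e9;          // GB to bits
  return bits / (linkMbps * 1e6) / 60; // seconds to minutes
}

console.log(downloadMinutes(2.0, 50).toFixed(1)); // "5.3" — inside the 5-8 min range
console.log(downloadMinutes(2.0, 10).toFixed(1)); // "26.7" — why 10 Mbps is painful
```

Showing this estimate on the loading screen (size ÷ measured throughput) is cheap and measurably reduces abandonment.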
If you just want to try this
Lowest-friction path to a working setup.
Open Chrome 130+ and navigate to the [WebLLM demo](https://chat.webllm.ai/). No install, no terminal, no API keys.
Step 1: Select **Llama-3.2-3B (Q4f16)** from the model picker — the WebLLM team's recommended starter at 1.9GB. Chrome shows a download progress bar. Wait 5-8 minutes on a 50 Mbps connection.
Step 2: After download, WebLLM compiles WebGPU shaders. First compile takes 10-20 seconds. Subsequent visits use cached shaders and load in under 5 seconds.
Step 3: Type a prompt. Inference runs locally. Verify offline: disconnect Wi-Fi, reload, load the model from cache, send a prompt — it works with zero network.
Alternate: [Transformers.js playground](https://huggingface.co/docs/transformers.js) uses ONNX Runtime Web with WebGPU. Token rates are 10-20% lower than WebLLM on equivalent models due to ONNX overhead, but the Hugging Face catalog is much larger.
What you get: a browser-tab chatbot. Response quality on Llama-3.2-3B is identical to native — same weights, same logic. Speed is 60-80% of native and context caps at 4K vs 8K+ native. For single-turn Q&A and short summarization, the experience is indistinguishable from cloud chatbots.
Hardware: any GPU with Chrome WebGPU support — NVIDIA GTX 10-series+, AMD RDNA 1+, [Intel Arc](/hardware/intel-arc-a770-16gb), [Apple M1+](/hardware/apple-m4-pro), Intel UHD 630+. Integrated graphics deliver 12-18 tok/s on 3B Q4. Phones and tablets are not supported; WebLLM falls back to WebAssembly CPU on mobile (10-20x slower, unusable).
For production deployment
Operator-grade recommendation.
A browser-AI strategy delivers AI access with zero infrastructure, zero installs, and no data leaving the device. WebGPU inference is viable under specific constraints.
**When browser inference is correct.** Training environments needing instant AI without IT-administered software. Internal tools where a browser tab is the universal interface. Low-stakes tasks (summarization, classification, rewriting) where 3B quality is sufficient. Air-gapped environments where the browser is the only permitted runtime. Economics: zero server cost, zero API cost, zero maintenance.
**Model serving.** Host weights on a CDN (Cloudflare R2, S3 + CloudFront) with aggressive Cache-Control (immutable, max-age=31536000). WebLLM handles IndexedDB caching. For 500+ users, 2GB × 500 = 1TB bandwidth, negligible cost. Implement CSP headers to lock the page: no external scripts, no analytics. Serve the WebLLM library from your CDN for supply-chain integrity. Serve everything from controlled origins.
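One way to express those headers, sketched as an nginx config; paths are placeholders, and depending on the WebLLM build you may also need `'wasm-unsafe-eval'` in `script-src` for its WASM runtime:

```nginx
# Immutable caching for model shards; weights never change under a given URL.
location /models/ {
    add_header Cache-Control "public, max-age=31536000, immutable";
}

# Lock the page to controlled origins: no external scripts, no analytics.
location / {
    add_header Content-Security-Policy "default-src 'self'; script-src 'self'; connect-src 'self'; worker-src 'self'";
}
```

The same two policies translate directly to Cloudflare R2 custom headers or CloudFront response-header policies.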
**Browser compatibility.** Chrome 130+ and Edge 130+ support WebGPU with full compute shaders (subgroups, fp16 storage). Firefox 135+ has WebGPU behind a flag — do not depend on Firefox for production. Safari 18+ has WebGPU but missing subgroups and fp16 storage buffer support breaks WebLLM's attention path. Chrome/Edge only for production in 2026.
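Rather than sniffing user agents, probe for the two WebGPU features the attention path needs. The decision logic below is a pure helper; the browser calls are sketched in comments (`showMessage` is a hypothetical UI hook):

```javascript
// Which required WebGPU features are missing from an adapter's feature set?
function missingFeatures(supported) {
  const needed = ["shader-f16", "subgroups"]; // WebGPU feature names
  return needed.filter((f) => !supported.has(f));
}

// In the browser (sketch):
// const adapter = await navigator.gpu?.requestAdapter();
// if (!adapter) showMessage("WebGPU unavailable — use Chrome/Edge 130+");
// else {
//   const gaps = missingFeatures(adapter.features);
//   if (gaps.length) showMessage(`Unsupported browser, missing: ${gaps.join(", ")}`);
// }

console.log(missingFeatures(new Set(["shader-f16"]))); // [ 'subgroups' ]
```

This catches Safari's gaps at page load instead of surfacing them as a validation error mid-inference.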
**Tab background throttling.** Chrome throttles background tab timers to 1 Hz after 5 minutes. While WebGPU dispatch isn't directly throttled, the JavaScript event loop that feeds prompts is. Inference in a background tab stalls. Resolution: keep the inference tab as the active window. Do not architect for background-tab inference unless using a Chrome extension with persistent background permission.
**Operational limits.** Maximum model size: ~3B Q4 on 16GB systems, ~1.5B on 8GB. 7B models OOM in-browser; for 7B+ native inference see [RTX 5090](/hardware/rtx-5090) or [H100](/hardware/nvidia-h100-pcie). Token rate drops 20-40% when the user scrolls or interacts during inference (compositor thread shares GPU). For max token rate, disable UI during inference.
**Enterprise deployment.** Host a static SPA. WebLLM loads from your CDN, model weights from your CDN, all inference client-side. Use a Service Worker to pre-cache weights on first visit. Never send a Clear-Site-Data header, which would wipe the cached weights. Total cost: static hosting + CDN bandwidth, zero per-query. For 5,000 users at 20 queries/day, savings vs the GPT-4o-mini API (input ~$0.15/1M tokens) is ~$5,500/month.
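One way to reproduce that savings figure. The per-query token counts are assumptions (chat context grows quickly across turns), and the output price of $0.60/1M is the typical companion rate for a $0.15/1M-input model:

```javascript
// Monthly API cost avoided by moving inference into the browser.
const users = 5000, queriesPerDay = 20, days = 30;
const inputTok = 4000, outputTok = 2000;           // assumed per query
const inPrice = 0.15 / 1e6, outPrice = 0.60 / 1e6; // $ per token

const queries = users * queriesPerDay * days;      // 3,000,000 per month
const monthly = queries * (inputTok * inPrice + outputTok * outPrice);
console.log(Math.round(monthly));                  // 5400 — in line with ~$5,500/month
```

Halve the token assumptions and the savings halve too; the conclusion (zero per-query vs thousands per month) holds across any plausible usage profile.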
What breaks
Failure modes operators see in the wild.
- **First-load model download time.** A 2GB model takes 4-15 minutes on a home connection. No streaming — full download before inference. Symptom: 30-40% abandon rate on loading screen. Mitigation: Service Worker pre-caching on first visit. Display estimated download time. For corporate deployments, serve weights from LAN CDN for sub-minute download.
- **WebGPU shader compilation stalls.** First inference compiles compute shaders synchronously on the main thread for 5-20 seconds. Symptom: page appears frozen after model loads; browser may show "page unresponsive." Mitigation: run a 1-token warm-up inference before showing chat UI. Display a "compiling shaders" progress indicator.
- **Browser tab background throttling.** Chrome throttles background tab timers to 1 Hz after 5 minutes. Symptom: user switches tabs during inference, returns 10 minutes later to find 2 tokens generated. Mitigation: design for foreground-only inference. If background is required, use a Chrome extension with persistent service worker — increases deployment complexity.
- **Safari WebGPU incomplete.** Missing subgroups and fp16 storage buffer support. Symptom: WebLLM fails with validation error or runs at <5% Chrome performance. Mitigation: detect Safari and display "Switch to Chrome/Edge." Safari's WebGPU team targets subgroups for Safari 20 (late 2026).
- **Shared GPU memory with browser rendering.** Compositor thread competes with WebGPU compute for GPU time. Symptom: token rate drops 20-40% when user scrolls during streaming. Mitigation: batch token rendering (5 at a time). Pause rendering during scroll. Disable UI interactivity during inference for max throughput.
- **IndexedDB storage eviction.** Chrome evicts IndexedDB when disk space is low (<2GB). Symptom: cached model deleted without warning; user faces re-download. Mitigation: request persistent storage via `navigator.storage.persist()`. Verify model checksum on page load; auto-re-download if cache missing.
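The eviction mitigation in the last bullet can be sketched as follows. The warning threshold is a pure function so it is testable anywhere; the browser calls are commented, and `showStorageWarning` is a hypothetical UI hook:

```javascript
// Warn before caching if remaining quota can't hold the model plus 10% headroom.
function shouldWarnLowStorage(quotaBytes, usageBytes, modelBytes) {
  return quotaBytes - usageBytes < modelBytes * 1.1;
}

// In the browser (sketch):
// const persisted = await navigator.storage.persist(); // ask Chrome not to evict
// const { quota, usage } = await navigator.storage.estimate();
// if (shouldWarnLowStorage(quota, usage, 2e9)) showStorageWarning();

console.log(shouldWarnLowStorage(10e9, 9e9, 2e9)); // true — 1GB free < 2.2GB needed
```

`persist()` returns whether the browser granted the request; when it refuses (no install prompt engagement, low disk), fall back to checksum-on-load plus silent re-download.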
Hardware guidance
**Hobbyist: Integrated graphics laptop (Intel UHD 630+, AMD Vega+)**
WebGPU runs on integrated graphics. On Intel UHD 630 (8th-gen Core i5, 8GB), Llama-3.2-1B Q4 runs at 10-15 tok/s. 3B models are borderline (5-8 tok/s) and may OOM on 8GB. Hardware floor: a Chromebook-class laptop runs an LLM in-browser. Education and personal experimentation only.
**Hobbyist: Entry dGPU (RTX 3050, RX 6500M, Arc A370M)**
Any 4GB+ dGPU delivers 25-40 tok/s on 3B Q4. Dedicated VRAM eliminates system RAM contention. An [Intel Arc A770 16GB](/hardware/intel-arc-a770-16gb) runs 3B Q4 at 40-50 tok/s. This is the enthusiast sweet spot — in-browser AI feels like a native app.
**SMB: Standardized dGPU laptop fleet**
Deploy WebLLM as an internal web app on Chrome. IT provisions nothing beyond the browser. Push URL via group policy. Each user downloads the 2GB model once; subsequent accesses hit IndexedDB. Infrastructure: one static page server + one CDN bucket. For 50-200 users with dGPU laptops, this is the cheapest possible AI deployment. Constraint: all users need dGPUs and Chrome 130+.
**Enterprise: Air-gapped browser AI**
For classified/air-gapped environments where no software install is permitted but Chrome is available. Host WebLLM + weights on internal origin inside the enclave. Service Worker pre-caches models. All inference client-side; no data crosses the network boundary. Satisfies "no executable code installation" policy — JavaScript runs in the browser sandbox. Quality ceiling is 3B — sufficient for document summarization and classification in contexts where local AI beats zero AI.
**Frontier: Not applicable**
WebGPU cannot run 70B+ models. The ~4GB GPU memory ceiling makes >3B impractical. Frontier inference lives on [RTX 5090](/hardware/rtx-5090) or [H100](/hardware/nvidia-h100-pcie). WebGPU is the zero-install complement for small, private inference — not a frontier platform.
Runtime guidance
**Best token rate on Chrome/Edge with minimal setup? → WebLLM**
[WebLLM](https://github.com/mlc-ai/web-llm) uses Apache TVM compiled to WebGPU. TVM optimizes attention kernels for WebGPU constraints (workgroup size, shared memory). Token rates: 40-55 tok/s on [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb), 40-50 tok/s on [Intel Arc A770 16GB](/hardware/intel-arc-a770-16gb). Curated catalog of ~30 popular models (Llama, Mistral, Phi, Gemma, Qwen). Automatic download, IndexedDB caching, and shader compilation. For most use cases, WebLLM balances performance and ease of deployment.
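A minimal WebLLM flow looks like the sketch below. It runs only in a WebGPU-capable browser; the model id is assumed from WebLLM's prebuilt catalog, so verify it against the current list before shipping:

```javascript
// Download (or load from IndexedDB), compile shaders, then chat.
const MODEL_ID = "Llama-3.2-3B-Instruct-q4f16_1-MLC"; // assumed catalog id

function buildRequest(prompt) {
  return { messages: [{ role: "user", content: prompt }], temperature: 0.7 };
}

async function run(prompt) {
  const { CreateMLCEngine } = await import("@mlc-ai/web-llm");
  const engine = await CreateMLCEngine(MODEL_ID, {
    initProgressCallback: (p) => console.log(p.text), // download/compile progress
  });
  const reply = await engine.chat.completions.create(buildRequest(prompt));
  return reply.choices[0].message.content;
}
```

The `chat.completions.create` call mirrors the OpenAI client shape, so swapping a cloud backend for in-browser inference is largely a drop-in change.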
**Hugging Face model catalog with WebGPU? → Transformers.js**
[Transformers.js](https://github.com/huggingface/transformers.js) wraps Hugging Face Transformers via ONNX Runtime Web. Any ONNX-exported model works in-browser. Token rates 10-20% below WebLLM due to ONNX-to-WebGPU overhead. Benefit: 1,000+ models including specialized architectures (BERT, T5, Whisper, CLIP) that WebLLM lacks. For multi-model pipelines (embedding + generation + classification), Transformers.js is the only unified WebGPU framework.
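The Transformers.js equivalent is similarly compact (sketch). `device: "webgpu"` selects the WebGPU backend and `dtype: "q4"` requests 4-bit weights; the model id is an assumption, so substitute any ONNX-exported text-generation model from the Hub:

```javascript
// Pipeline-style generation with the WebGPU backend.
const CONFIG = { device: "webgpu", dtype: "q4" }; // assumed options for this model

async function generate(prompt) {
  const { pipeline } = await import("@huggingface/transformers");
  const generator = await pipeline(
    "text-generation",
    "onnx-community/Llama-3.2-1B-Instruct", // assumed Hub model id
    CONFIG,
  );
  const out = await generator(prompt, { max_new_tokens: 128 });
  return out[0].generated_text;
}
```

The same `pipeline()` call also covers the non-LLM tasks mentioned above (embedding, classification, Whisper transcription), which is the framework's main draw for multi-model pipelines.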
**Maximum optimization + Firefox/Safari roadmap? → MLC-LLM**
MLC-LLM is the native compilation engine behind WebLLM. Direct use gives model compilation control: specify workgroup sizes, tune kernel fusion, produce per-model WebGPU binaries. Highest achievable token rate. MLC team works directly with browser vendors on WebGPU spec compliance — the most active pathway toward Firefox and Safari support.
**Comparison:**
- Token rate (3B Q4, RTX 4060 Ti): WebLLM 40-55 tok/s, Transformers.js 30-45, MLC-LLM 45-60
- Model catalog: WebLLM ~30, Transformers.js 1,000+, MLC-LLM (compile any model)
- Setup: WebLLM 1-line import, Transformers.js 3-line pipeline, MLC-LLM compile + configure
- Library size: WebLLM ~15MB, Transformers.js ~8MB, MLC-LLM ~12MB
- Browser: Chrome/Edge for all; partial Firefox/Safari CPU fallback for Transformers.js