Local AI on Windows 11 — operator's guide (May 2026)
The honest Windows 11 local-AI operating manual. WSL2 vs native Windows, NVIDIA driver setup, ROCm-on-Windows reality, DirectML for Intel/AMD, Ollama and LM Studio installs, llama.cpp builds, and the WSL2 GPU-passthrough breakage that hits after every Windows update.
Why Windows is a hybrid local-AI platform
Windows 11 in 2026 is genuinely viable for local AI — much more so than it was three years ago. NVIDIA's CUDA-on-WSL2 path is mature, Ollama and LM Studio ship native Windows installers, and DirectML gives Intel and AMD users a path that doesn't depend on ROCm-on-Windows (which is still incomplete). What Windows isn't is the production-deployment platform for serious serving — that role still belongs to Linux. See /systems/linux-local-ai for the production architecture; this page is for operators who want to run local AI on their actual Windows desktop or laptop.
The honest tier framing for Windows:
- Hobbyist / single-user chat / dev environment: native Windows is fine. Ollama, LM Studio, and llama.cpp builds all work without WSL2.
- Serious development against vLLM / SGLang / CUDA Python tooling: WSL2 is required. Most of the Linux-first runtime ecosystem doesn't support native Windows.
- Production multi-tenant serving: deploy to Linux, not Windows. WSL2 is a development tool, not a deployment substrate.
WSL2 vs native Windows — architecture and tradeoff
The architectural question every Windows local-AI user has to answer first.
Native Windows means running an AI runtime built for Windows directly: Ollama's Windows installer, LM Studio, a llama.cpp build with the Windows toolchain (MSVC or MinGW). The GPU is accessed via the Windows driver model (WDDM): CUDA on NVIDIA, DirectML on Intel/AMD, ROCm-on-Windows where it exists.
WSL2 runs a real Linux kernel inside Windows, with GPU passthrough from the Windows driver to the Linux side. From inside WSL2, you install vLLM, SGLang, llama.cpp, Python, conda — all the Linux-first tooling — and they run as if on bare-metal Linux, with measurable but small performance overhead.
The decision criteria:
- Pick native Windows if you want minimum setup cost, you're running Ollama or LM Studio for chat, you're on Intel or AMD GPU and want DirectML, or your only requirement is “a working local LLM on this laptop”.
- Pick WSL2 if you want vLLM / SGLang / TensorRT-LLM, your tooling is Python-heavy and Linux-first, you want the same scripts to work on a Linux server, or you need recent CUDA / cuDNN versions that take longer to land on native Windows.
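If you land on WSL2, setup is short. A minimal sketch, assuming Windows 11 with virtualization enabled (the default distro is Ubuntu):

```powershell
# Elevated PowerShell. Installs the WSL2 platform plus the default Ubuntu distro.
wsl --install
# After the reboot it prompts for, confirm you're on WSL2, not WSL1:
wsl --status
wsl --list --verbose    # VERSION column should read 2
```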
NVIDIA stack on Windows — driver, CUDA, and the WSL2 path
NVIDIA support on Windows in 2026 is the strongest of any vendor. Two viable paths:
Native Windows + CUDA: install the latest Game Ready or Studio driver, install the CUDA Toolkit for Windows, build or install your runtime. llama.cpp ships pre-built Windows binaries with CUDA support. Ollama's Windows installer auto-detects CUDA.
WSL2 + CUDA-on-WSL: NVIDIA's WSL2 driver extension exposes the Windows-side GPU to the Linux kernel inside WSL2 with no changes to the Linux CUDA toolkit. From the WSL2 side you apt install cuda-toolkit-12-X and your scripts work as on a Linux box. This is the canonical path for vLLM / SGLang on a Windows workstation.
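A minimal sketch of the WSL2 side, assuming an Ubuntu distro and NVIDIA's wsl-ubuntu CUDA repository (the keyring and toolkit version numbers below are illustrative; NVIDIA's CUDA-on-WSL docs are authoritative):

```bash
# Inside WSL2 Ubuntu. The GPU driver lives Windows-side — never install a
# Linux NVIDIA driver in here; only the toolkit.
nvidia-smi   # passthrough check: should list the Windows GPU

# Add NVIDIA's CUDA repo for WSL-Ubuntu, then install the toolkit.
wget https://developer.download.nvidia.com/compute/cuda/repos/wsl-ubuntu/x86_64/cuda-keyring_1.1-1_all.deb
sudo dpkg -i cuda-keyring_1.1-1_all.deb
sudo apt-get update
sudo apt-get install -y cuda-toolkit-12-6   # match the 12.x version your runtime expects
```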
The footnote nobody tells you: WSL2 on NVIDIA is robust until a major Windows update lands. The driver-update flow runs Windows-side, not WSL2-side, and WSL2 GPU passthrough can briefly stop working until you run wsl --shutdown and reopen the WSL distro. Save your work before driver updates.
AMD on Windows — ROCm status, DirectML fallback, what actually runs
AMD on Windows is the messiest story in the local-AI Venn diagram and worth being honest about.
ROCm-on-Windows: AMD has been shipping ROCm-on-Windows previews since 2023, and support in 2026 is still partial. A subset of HIP / ROCm tooling works. PyTorch ROCm Windows builds exist, with caveats. The set of cards officially supported on Windows is narrower than on Linux, and the user-experience cost is real: less community knowledge, fewer tutorials, bugs that only show up on Windows. See the ROCm operational review; the Linux path is still recommended for serious AMD work.
DirectML: Microsoft's DirectX-based ML backend. Vendor-agnostic — runs on AMD, Intel, NVIDIA. Used by ONNX Runtime, TensorFlow-DirectML, and a number of LLM tools. Slower than CUDA or ROCm-native but actually works across the AMD-on-Windows fleet. See DirectML operational review.
Vulkan via llama.cpp: vendor-agnostic GPU compute. AMD users on Windows running llama.cpp with the Vulkan backend get GPU acceleration without ROCm. Not the fastest path on AMD, but the most portable.
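For example, a pre-built Vulkan release of llama.cpp runs on AMD with no ROCm installed at all (binary and model filenames below are illustrative):

```powershell
# From an unzipped llama.cpp win-vulkan release. -ngl 99 offloads all layers
# to the GPU; --port picks the OpenAI-compatible server port.
.\llama-server.exe -m .\models\llama-3.2-3b-q4_k_m.gguf -ngl 99 --port 8080
```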
The honest pick for a Windows-only AMD operator in 2026: llama.cpp Vulkan or DirectML for chat; dual-boot Linux if you seriously need ROCm. Some users have success with WSL2 + ROCm, but the configuration is more fragile than NVIDIA's WSL2 path.
Intel on Windows — DirectML, OpenVINO, IPEX-LLM
Intel Arc (A770, B580) and Lunar Lake / Meteor Lake NPUs on Windows are well-supported in 2026:
- IPEX-LLM: Intel's LLM-specific extension for PyTorch. Native Windows installer, Arc and Lunar Lake support, accepts most Hugging Face models. The right pick for Intel GPU operators on Windows.
- OpenVINO: Intel's general inference runtime. Windows-native, works with Arc and Intel iGPU. Stable, mature.
- DirectML: works on Intel as on AMD. Good fallback when you don't want to deal with IPEX-LLM's Python stack.
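A quick way to confirm the Intel GPU is visible to an inference stack is an OpenVINO device query — a sketch, assuming Python is installed and the post-2023 OpenVINO Python API:

```powershell
pip install openvino
# List the compute devices OpenVINO can target. Expect CPU plus GPU on an
# Arc/iGPU system, and NPU on Lunar Lake / Meteor Lake with the NPU driver.
python -c "from openvino import Core; print(Core().available_devices)"
```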
Ollama Windows installer — the easy default
For most Windows users asking “how do I run a local LLM”, the answer is Ollama for Windows. Download the installer from ollama.com, install, open PowerShell or cmd, run ollama pull llama3.2 then ollama run llama3.2. That's it. NVIDIA is auto-detected; AMD support shipped in 2024. The OpenAI-compatible API runs at http://localhost:11434.
The Ollama Windows workflow is the lowest-friction starting point and we recommend it for anyone whose answer to “why local AI” is privacy or offline use rather than performance optimization. For development against the same API you'll deploy to Linux, this is also fine — Ollama's API surface is identical across platforms.
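Once the server is up, you can verify the API from PowerShell — a sketch against Ollama's native /api/generate endpoint:

```powershell
# Minimal non-streaming generation request.
$body = @{ model = "llama3.2"; prompt = "Say hello"; stream = $false } | ConvertTo-Json
Invoke-RestMethod -Uri "http://localhost:11434/api/generate" `
  -Method Post -Body $body -ContentType "application/json"
```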
LM Studio — the desktop GUI workflow
LM Studio ships a Windows installer with a polished GUI for downloading models, switching between them, and chatting. It runs llama.cpp under the hood and exposes an OpenAI-compatible API. The right pick for non-developer Windows users who want a desktop app, not a terminal.
Operationally, LM Studio and Ollama overlap a lot. Use LM Studio when GUI matters; use Ollama when you want CLI + scripting. Both are appropriate for the “local LLM on my Windows machine” use case.
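LM Studio's server works the same way once enabled in the app — a sketch, assuming the default local-server port 1234 and whatever model identifier the app shows for your loaded model:

```powershell
$body = @{
  model    = "local-model"   # placeholder: use the identifier LM Studio displays
  messages = @(@{ role = "user"; content = "Hello" })
} | ConvertTo-Json -Depth 4
Invoke-RestMethod -Uri "http://localhost:1234/v1/chat/completions" `
  -Method Post -Body $body -ContentType "application/json"
```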
llama.cpp builds on Windows — native vs WSL2
For operators who want to build llama.cpp from source on Windows, two paths:
Native MSVC build: install Visual Studio Build Tools, clone the repo, build with CMake. Works for CPU, CUDA, Vulkan, OpenCL backends. The CMake invocations are well-documented in the llama.cpp README.
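A condensed version of that flow — backend flag names follow recent llama.cpp CMake options (GGML_CUDA, GGML_VULKAN) and may drift, so treat the README as authoritative:

```powershell
# From a "Developer PowerShell for VS" prompt, so MSVC is on PATH.
git clone https://github.com/ggml-org/llama.cpp
cd llama.cpp
cmake -B build -DGGML_CUDA=ON          # or -DGGML_VULKAN=ON for AMD/Intel
cmake --build build --config Release
.\build\bin\Release\llama-cli.exe -m .\model.gguf -p "Hello"
```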
WSL2 build: git clone + CMake inside WSL2. Easier than MSVC if you're already comfortable with Linux toolchains, and you get the same binary you'd build on a Linux server.
Pre-built Windows binaries are also published on llama.cpp's releases page; for most users they're sufficient.
WSL2 GPU passthrough breakage and how to recover
The Windows-specific failure mode worth flagging on its own. WSL2 GPU passthrough depends on a coordinated set of components: Windows graphics driver, NVIDIA WSL2 extension (or the AMD/Intel equivalent), the WSL2 kernel, and the GPU's state on the Windows side.
Common breakages:
- Major Windows update lands and nvidia-smi inside WSL2 returns nothing. Recovery: wsl --shutdown, reopen WSL, retry. If still broken, reinstall the NVIDIA driver Windows-side.
- NVIDIA driver update Windows-side leaves WSL2 with a stale CUDA library set. Recovery: rebuild the runtime (vLLM, llama.cpp) inside WSL2 against the matching CUDA toolkit version.
- WSL2 kernel update breaks GPU access. Recovery: wsl --update in PowerShell, then wsl --shutdown.
- GPU is visible but compute fails silently. Sometimes a Windows-side reboot is the only fix. Save first.
The pattern is: every major Windows update is a potential WSL2 GPU regression. If you depend on this for work, defer Windows updates until you've verified the WSL2 stack still works, and keep notes on which driver version your WSL2 environment last worked with.
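A post-update sanity check worth scripting — a sketch of the sequence, run Windows-side:

```powershell
# Run after any Windows or NVIDIA driver update, before trusting WSL2 again.
wsl --shutdown            # cleanly restart the WSL2 utility VM
wsl nvidia-smi            # passthrough check: should list the GPU
wsl --version             # record kernel + WSL versions in your notes
```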
Windows vs Linux vs macOS for local AI
The honest cross-platform tier ranking:
- Linux: production-default. Every runtime supported. Every deployment pattern documented. See /systems/linux-local-ai.
- macOS (Apple Silicon): best-in-class for unified memory + Metal. MLX, llama.cpp Metal, Ollama Metal all ship. See /systems/macos-local-ai.
- Windows + WSL2: viable development environment. Closest you can get to Linux without leaving Windows.
- Windows native: hobbyist tier. Ollama and LM Studio are excellent here; serious server-style deployment belongs on Linux or WSL2.
Common failure modes
- Ollama loads but inference is CPU-only. Driver mismatch — Windows GPU driver version is below Ollama's CUDA / ROCm requirement. Update the Windows driver.
- Antivirus blocks llama.cpp binary. Defender or third-party AV occasionally false-positives compiled inference binaries. Add an exception or rebuild from source under your own signing key.
- Long path errors on model download. The Hugging Face cache uses long paths. Enable Windows long-path support (git config --system core.longpaths true plus the registry setting documented by Microsoft; see the sketch after this list).
- WSL2 model file corruption on hibernate. Sleeping the laptop while WSL2 holds open file descriptors occasionally corrupts the WSL2 filesystem. Stop runtimes before hibernating.
- DirectML inference is much slower than expected. DirectML is meaningfully slower than vendor-native CUDA / ROCm paths. If throughput matters and your hardware supports a vendor-native runtime, switch.
- Model doesn't fit but no clear OOM error. Windows' GPU memory accounting differs from Linux — model allocation can succeed and then fail at first inference. Check Task Manager → GPU → Dedicated memory, and reduce model size or quant.
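For the long-path item above, the two-part fix looks like this (registry path per Microsoft's documentation; run elevated):

```powershell
# 1) OS-level long-path support.
New-ItemProperty -Path "HKLM:\SYSTEM\CurrentControlSet\Control\FileSystem" `
  -Name LongPathsEnabled -Value 1 -PropertyType DWORD -Force
# 2) Git-level long paths, for repo clones and model downloads via git.
git config --system core.longpaths true
```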
Production hardening on Windows (when you actually need it)
If you really must run a local-AI service on Windows in production (rare, but happens — corporate IT may mandate Windows), the hardening checklist:
- Run the runtime as a Windows Service (NSSM is the standard wrapper for non-service binaries).
- Pin driver versions; defer Windows Update for the workstation; test driver upgrades in a staging machine.
- Use WSL2 with a fixed kernel version; wsl --version is the version surface that matters.
- Reverse-proxy via Caddy or IIS for TLS; never expose port 11434 (Ollama) or 8000 (vLLM via WSL2) directly to the network.
- Resource limits: WSL2 has a global memory-limit setting in %USERPROFILE%\.wslconfig; cap it well below physical RAM to keep Windows responsive (see the sketch after this list).
- For real production: deploy to Linux instead. Windows production local-AI serving is a constrained-by-policy choice, not a performance choice.
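For the .wslconfig item, a sketch that caps the WSL2 VM on a hypothetical 64 GB workstation (sizes are assumptions; tune for your machine):

```powershell
# Write global WSL2 resource caps, then restart the VM to apply them.
@"
[wsl2]
memory=48GB
processors=12
swap=8GB
"@ | Set-Content "$env:USERPROFILE\.wslconfig"
wsl --shutdown
```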
Going deeper
- Local AI on Linux — production-default OS guide.
- Local AI on macOS — Apple Silicon path.
- Setup path-finder — pick OS + hardware.
- Runtime compatibility matrix.
- Runtime health dashboard.
- Common local AI setup mistakes.
Next step on Windows
The lowest-friction path to a working local LLM on a Windows machine: install Ollama, pull a model, run it.