Build an RTX 4090 AI workstation stack (May 2026)
A general-purpose AI workstation built around a single RTX 4090 24GB — runs a 32B-class coding model, a 14B chat model, and serves agent workloads to a small team on the same box.
- 01 · Hardware · GPU (the hardware that defines this stack): rtx-4090
24GB VRAM is the first-class consumer tier in May 2026 — 4080 16GB doesn't have headroom for 32K context on 32B models; 5090 helps but is 2-3x the price for ~30% more throughput. The 4090 stays the sweet spot until 5090 supply normalises.
- 02 · Tool · Inference engine (production-grade serving): vllm
vLLM over Ollama for the production-serving role on this box — continuous batching matters when 3-5 users hit the same model concurrently, and the OpenAI-compatible endpoint makes Open WebUI / AnythingLLM / OpenHands plug in without adapter code. Keep Ollama installed alongside for ad-hoc model swaps.
- 03 · Tool · Model-swap layer (ad-hoc experiments): ollama
Ollama lives next to vLLM, not as competition: it owns the 'I want to try a new model right now' surface. One-line model pulls beat re-rendering vLLM Docker configs every time. Run on a different port (11434) to avoid clashes.
- 04 · Model · Coding model (32B class): qwen-2.5-coder-32b-instruct
Qwen 2.5 Coder 32B AWQ-INT4 is the strongest model that fits 24GB with real context room — beats DeepSeek Coder V2 Lite on coding benchmarks at the same VRAM budget. Reserve 8-10GB of VRAM for KV cache; 32K context is the sweet spot.
- 05 · Model · Chat model (low-latency general-purpose): qwen-3-14b
Qwen 3 14B at FP16 fits with massive headroom; serves chat, summaries, and tool-call workloads at 60+ tok/s with single-digit-ms TTFT on warm prefix. The right default when you don't need coding-class reasoning.
- 06 · Tool · Team chat frontend: openwebui
Open WebUI over AnythingLLM for the chat-frontend role on a workstation: better multi-user ergonomics, cleaner pipelines for tool calls. AnythingLLM wins for RAG-first workspaces; Open WebUI wins when you want a polished chat UI for a small team.
- 07 · Tool · RAG workspace frontend: anythingllm
Pairs with Open WebUI on the same box — different roles. AnythingLLM owns the 'chat with my documents' workflow; Open WebUI owns 'chat with the model directly.' Each runs as its own Docker container and points at the same vLLM endpoint.
Why this stack on this hardware
The 4090's 24GB VRAM creates a specific architectural window. 16GB cards cannot run a 32B-class coding model with real context room; 48GB+ cards (L40S / 6000 Ada / 5090 paired) shift you into a different cost tier. The 4090 sits in the middle and rewards a stack that respects both constraints — fits the model AND keeps headroom for batch serving across multiple users.
The headline architectural choice this stack makes: vLLM and Ollama coexist on the same machine, serving different workflows. Most guides treat them as competitors; on a 24GB workstation they're complementary — vLLM owns the “production endpoint we serve to frontends” role; Ollama owns the “I just want to try this model right now” role. Different ports, different lifecycle expectations, different model rotation rates.
Step-by-step setup
1. Bring up vLLM as the production endpoint
# Run vLLM on port 8000 — the production-facing endpoint
docker run --gpus all -d --name vllm \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
--restart unless-stopped \
vllm/vllm-openai:v0.17.1 \
--model Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
--gpu-memory-utilization 0.85 \
--max-model-len 32768 \
--enable-chunked-prefill

Note: --gpu-memory-utilization 0.85 rather than 0.9 — leaving 15% (3.6GB) headroom for Ollama to coexist on the same card. Ollama doesn't pre-reserve VRAM the way vLLM does, so the same allocation that worked at 0.9 in a single-runtime setup will OOM the moment Ollama loads something.
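Once the container is up, a quick smoke test against the OpenAI-compatible route confirms the model is actually serving. The prompt below is an arbitrary example; the model name must match the --model value passed above.

# Smoke-test the production endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "Qwen/Qwen2.5-Coder-32B-Instruct-AWQ",
        "messages": [{"role": "user", "content": "Write a one-line Python hello world."}],
        "max_tokens": 64
      }'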
2. Add Ollama on a different port for ad-hoc work
# Ollama on its default port (11434) — no Docker, native install
curl -fsSL https://ollama.com/install.sh | sh
# Pull a chat model that's distinct from the coding model in vLLM
ollama pull qwen3:14b
# Verify both runtimes are alive on different ports
curl http://localhost:8000/v1/models # vLLM (Qwen Coder 32B)
curl http://localhost:11434/api/tags   # Ollama (Qwen 3 14B)

Both can run concurrently because they hold different model weights — vLLM has Qwen Coder loaded; Ollama loads Qwen 3 14B on demand and unloads it when idle. Total VRAM under load: ~22GB; idle Ollama drops to ~12GB.
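To confirm the Ollama side generates (and to watch it load and then release the weights), a one-shot request against its native API works; the prompt is just a placeholder.

# One-shot generation through Ollama's native API
curl http://localhost:11434/api/generate -d '{
  "model": "qwen3:14b",
  "prompt": "Summarise what continuous batching does in one sentence.",
  "stream": false
}'
# Watch VRAM while it runs; Ollama frees the model again after its keep-alive window
nvidia-smi --query-gpu=memory.used --format=csv -l 1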
3. Wire Open WebUI as the team frontend
docker run -d --name open-webui \
-p 3000:8080 \
--restart unless-stopped \
-v open-webui:/app/backend/data \
-e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
-e OPENAI_API_KEYS="any-string" \
-e ENABLE_OLLAMA_API=true \
-e OLLAMA_BASE_URLS="http://host.docker.internal:11434" \
ghcr.io/open-webui/open-webui:latest

Open WebUI sees both endpoints and presents them as model options. Users pick “Qwen Coder 32B (vLLM)” for coding tasks and “Qwen 3 14B (Ollama)” for chat. Same UI; different runtimes; the model switcher is transparent.
4. Add AnythingLLM for the RAG-workspace surface
docker run -d --name anythingllm \
-p 3001:3001 \
--restart unless-stopped \
--cap-add SYS_ADMIN \
-v anythingllm-storage:/app/server/storage \
-e LLM_PROVIDER="generic-openai" \
-e GENERIC_OPEN_AI_BASE_PATH="http://host.docker.internal:8000/v1" \
-e GENERIC_OPEN_AI_MODEL_PREF="Qwen/Qwen2.5-Coder-32B-Instruct-AWQ" \
mintplexlabs/anythingllm

AnythingLLM gets the same vLLM endpoint as Open WebUI — they share the model, isolate the workspace. Different role: Open WebUI for direct chat, AnythingLLM for “chat with my documents.” Both alive on different ports (3000 and 3001).
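Before handing links to the team, a quick check that both frontends answer on their mapped ports:

# Both frontends should return an HTTP status line on their host ports
curl -sI http://localhost:3000 | head -n 1    # Open WebUI
curl -sI http://localhost:3001 | head -n 1    # AnythingLLM
docker ps --format '{{.Names}}: {{.Ports}}'   # confirm container names and port mappings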
OS-level tuning that actually matters
The configuration that affects throughput on this stack (and the configuration that doesn't):
- NVIDIA driver >= 555. Older drivers have FlashAttention-2 kernel selection bugs that silently halve vLLM throughput. Run nvidia-smi --query-gpu=driver_version --format=csv to verify.
- nvidia-persistenced running. Without it, the GPU re-initializes on every CUDA context create, adding 100-300ms to first-token latency on cold starts. Enable with sudo systemctl enable --now nvidia-persistenced.
- NVMe scheduler set to none. The default mq-deadline scheduler adds latency on model load. echo none | sudo tee /sys/block/nvme0n1/queue/scheduler, or pin it via a rule in /etc/udev/rules.d/.
- System RAM 64GB minimum. The OS file cache holds the model weights between vLLM cold starts; 32GB systems re-read from disk on every restart, adding 20-40 seconds to startup.
- Power limit set to 350W for thermal sustainability under continuous load. The 4090's 450W TDP is fine for bursts but reduces card lifetime in sustained inference. nvidia-smi -pl 350 sets it; pin it in a systemd service to make it persistent (a minimal unit sketch follows this list).
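To make the 350W power limit survive reboots, a small oneshot systemd unit is one option. This is a minimal sketch; the unit name nvidia-power-limit.service is illustrative, not anything NVIDIA ships, and paths may differ by distro.

# /etc/systemd/system/nvidia-power-limit.service  (illustrative unit name)
[Unit]
Description=Pin the RTX 4090 power limit to 350W at boot
After=nvidia-persistenced.service

[Service]
Type=oneshot
ExecStart=/usr/bin/nvidia-smi -pl 350

[Install]
WantedBy=multi-user.target

# Enable it once the file is in place
sudo systemctl daemon-reload
sudo systemctl enable --now nvidia-power-limit.service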
What does NOT meaningfully matter on this stack: PCIe bifurcation (single-card workloads don't care), RAM frequency above DDR5-5600 (CPU-side bandwidth isn't the bottleneck), CPU core count above 8 (vLLM threading is GPU-bound).
Failure modes you'll hit
- OOM when both vLLM and Ollama load. vLLM at --gpu-memory-utilization 0.9 + Ollama loading a 14B model = OOM. Drop vLLM to 0.85 (recommended above) or run Ollama with OLLAMA_KEEP_ALIVE=0 so it unloads aggressively.
- Open WebUI can't see vLLM models. host.docker.internal doesn't resolve on Linux by default. Either run with --add-host=host.docker.internal:host-gateway or use host network mode (--network=host); a patched run command is sketched after this list.
- Coil whine on light load. 4090s sing audibly under low-utilization GPU load. Power-limiting to 350W usually cures it; if it doesn't, the card is within spec (NVIDIA's position) but you may want to RMA. Most stack-builders accept it as the price of consumer-tier hardware.
- Thermal throttling at 30+ minutes of sustained load. Stock 4090 cooling handles bursts, but a tight chassis with one 120mm exhaust will hit 87°C and throttle. Verify with nvidia-smi --query-gpu=temperature.gpu --format=csv -l 1 during a long generation; add chassis fans or undervolt if it climbs past 80°C.
- Open WebUI persistent-volume corruption. Killing the container during a write can corrupt the SQLite db. Mitigate with --restart unless-stopped (above) and an explicit volume backup before any docker rm.
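For the host.docker.internal failure above, the smallest fix on Linux is to recreate the frontend container with the extra host mapping. A sketch of the step-3 command with only the added --add-host flag; the same flag applies to the AnythingLLM container.

# Recreate Open WebUI with host.docker.internal mapped to the host gateway (Linux)
docker rm -f open-webui
docker run -d --name open-webui \
  -p 3000:8080 \
  --restart unless-stopped \
  --add-host=host.docker.internal:host-gateway \
  -v open-webui:/app/backend/data \
  -e OPENAI_API_BASE_URLS="http://host.docker.internal:8000/v1" \
  -e OPENAI_API_KEYS="any-string" \
  -e ENABLE_OLLAMA_API=true \
  -e OLLAMA_BASE_URLS="http://host.docker.internal:11434" \
  ghcr.io/open-webui/open-webui:latest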
Variations and alternatives
5090 swap. If you have a 5090, the architectural shape doesn't change — same vLLM + Ollama + frontends pattern. You get ~30% more throughput and FP4 support; not enough to justify the price jump on its own, but if you already own one, no reconfiguration needed.
Multi-GPU 4090 variation. 2x 4090 with NVLink isn't a thing on consumer SKUs (NVIDIA disabled NVLink on Ada consumer); 2x over PCIe loses 30-40% of throughput to interconnect bandwidth. Usually not worth it unless the model genuinely won't fit. See /systems/distributed-inference for the math.
SGLang variation. If your team workflow is heavy on agent loops with stable system prompts (10+ tool calls per task on a fixed prefix), SGLang can replace vLLM for 1.3-1.7x aggregate throughput. The frontends and Ollama side stay the same.
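A rough sketch of that swap, assuming SGLang is installed and serving the same model on the same port so the frontends keep working unchanged; flag names beyond --model-path and --port can vary between SGLang versions.

# Stop the vLLM container, then serve the same model with SGLang on port 8000
docker rm -f vllm
python -m sglang.launch_server \
  --model-path Qwen/Qwen2.5-Coder-32B-Instruct-AWQ \
  --port 8000 \
  --mem-fraction-static 0.85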
Linux vs Windows-WSL2. Both work. Linux is ~5% faster on raw throughput due to lower CUDA driver overhead and better filesystem performance for the model cache. WSL2 catches up on most workloads; the only place it regresses meaningfully is rapid model-swap workflows where the file cache matters most.
Going deeper
- RTX 4090 catalog entry — VRAM math, thermal characteristics, the long-tail of quirks and how to tune them out.
- vLLM operational review — the runtime-specific operator detail behind the production endpoint pick.
- Inference runtime ecosystem map — full landscape of what could replace vLLM or Ollama in this stack.
- Local coding-agent stack — the specialised workstation variation focused on coding.