SGLang server not responding — debug the most common hangs
Most SGLang server hangs at startup or under load trace to one of four causes: request batching saturation, KV cache mis-sizing, scheduler deadlock, or a runtime/CUDA version mismatch. Work through them in this order.
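Before digging in, confirm whether the server is alive at all versus silently wedged. A quick probe, assuming the default port 30000 (adjust if you launched with `--port`):

```shell
# A hung scheduler often still accepts TCP connections but never answers,
# so give curl a hard deadline.
curl --max-time 5 http://localhost:30000/health && echo "server alive"

# In a second terminal, watch VRAM while you send a test request.
nvidia-smi --query-gpu=memory.used,memory.total --format=csv -l 2
```

If the health probe times out but the process is running, the scheduler or KV cache diagnoses below are the likely culprits; if VRAM is pinned at the card's limit, start with the first one.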
Diagnostic order — most likely first
KV cache size too aggressive for your VRAM
Server starts, accepts a few requests, then hangs. `nvidia-smi` shows VRAM 99% full. Subsequent requests queue indefinitely.
Lower `--mem-fraction-static`, the fraction of VRAM SGLang reserves for model weights plus KV cache. Try 0.85 first. If memory is still tight, also cap `--max-running-requests` at 32 or 64 so fewer requests compete for cache blocks.
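As a concrete starting point, a launch command with the memory knobs dialed down (the model path is a placeholder; flag names reflect recent SGLang releases, so check `--help` on your version):

```shell
# Leave VRAM headroom instead of packing the card, and cap concurrency
# so the KV cache can't be oversubscribed by a burst of requests.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --mem-fraction-static 0.85 \
  --max-running-requests 64
```

If `nvidia-smi` still shows the card at its limit after this, drop the fraction to 0.8 before touching anything else.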
Scheduler deadlock from oversubscribed batches
Multiple long-context requests in flight simultaneously. Server logs show pending queue growing while throughput drops to zero.
Set `--schedule-policy fcfs` (first-come-first-served) so requests are admitted in arrival order instead of the default longest-prefix-match policy, which can starve decode under heavy prefill. Or cap concurrency with `--max-running-requests 16`.
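Combining both mitigations in one launch command (model path is a placeholder; verify flag names against your SGLang version):

```shell
# Strict arrival-order scheduling plus a low in-flight cap: throughput
# drops a little, but long-context requests can no longer starve the queue.
python -m sglang.launch_server \
  --model-path meta-llama/Llama-3.1-8B-Instruct \
  --schedule-policy fcfs \
  --max-running-requests 16
```

Once the hang stops reproducing, raise `--max-running-requests` gradually until you find the ceiling your VRAM and workload actually tolerate.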
Compilation phase taking longer than the client timeout
First request after startup hangs for 60-120 seconds. Subsequent requests are fast.
SGLang JIT-compiles kernels and captures CUDA graphs on first use. Either raise the client timeout for the first call (e.g. `timeout=300`), or pass `--disable-cuda-graph` to skip graph capture, the slowest part of warm-up, at a small steady-state throughput cost.
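A simple way to absorb the warm-up from the client side is to send one throwaway request with a generous deadline right after startup, via the OpenAI-compatible endpoint (default port 30000 assumed; the `"default"` model name is a placeholder for whatever you served):

```shell
# Warm-up request: allow up to 5 minutes for kernel compilation and
# CUDA graph capture before giving up. Later requests can use a normal timeout.
curl --max-time 300 http://localhost:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "ping"}]}'
```

Baking this into your deployment's readiness check keeps real user traffic from ever hitting the compilation stall.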
FlashInfer / Triton version mismatch
Logs show `RuntimeError: CUDA kernel compilation failed` or `flashinfer not installed`. SGLang uses FlashInfer for its attention kernels by default.
Reinstall FlashInfer against your CUDA and torch versions: `pip install flashinfer -i https://flashinfer.ai/whl/cu124/torch2.5/`. Match the `cu1xx` and `torch2.x` suffixes to your environment.
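The index URL follows a fixed pattern, so you can assemble it from your environment rather than guessing. A small sketch (`flashinfer_index` is a hypothetical helper, not part of any tool; the URL pattern comes from the install command above):

```shell
# Build the FlashInfer wheel index URL from CUDA and torch versions.
# Feed it the output of:
#   python -c "import torch; print(torch.version.cuda, torch.__version__)"
flashinfer_index() {
  cuda_tag="cu$(echo "$1" | tr -d .)"   # 12.4 -> cu124
  torch_tag="torch$2"                   # 2.5  -> torch2.5
  echo "https://flashinfer.ai/whl/${cuda_tag}/${torch_tag}/"
}

flashinfer_index 12.4 2.5   # -> https://flashinfer.ai/whl/cu124/torch2.5/
```

If the printed CUDA version doesn't match any published wheel index, that mismatch, not SGLang itself, is what you need to fix first.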
Model architecture not supported by SGLang yet
Loading a brand-new architecture (Qwen3, DeepSeek V3 variants). Server starts but errors at first inference.
Check SGLang's supported model list. New architectures lag vLLM by a few weeks. For unsupported models, fall back to vLLM or llama.cpp until SGLang adds support.
Frequently asked questions
When should I use SGLang instead of vLLM?
SGLang wins on: structured generation (JSON-constrained outputs, RegEx grammars), multi-turn workflows where prefix caching matters, agent-style requests with shared system prompts. vLLM wins on: maturity, broader model coverage, simpler deployment. Both have similar raw throughput.
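For a taste of why structured generation is SGLang's home turf: its native `/generate` endpoint accepts constraint fields directly in the sampling parameters. A sketch against a local server (default port assumed; field names reflect recent releases, so check your version's docs):

```shell
# Constrain the model's output to match a regex, server-side,
# instead of validating and retrying on the client.
curl http://localhost:30000/generate \
  -H "Content-Type: application/json" \
  -d '{
        "text": "Is the sky blue? Answer yes or no: ",
        "sampling_params": {"max_new_tokens": 8, "regex": "(yes|no)"}
      }'
```

Getting the equivalent guarantee from an unconstrained server means client-side parsing and retry loops, which is exactly the overhead SGLang's grammar support removes.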
What hardware fits SGLang for production?
Any CUDA card with 16+ GB VRAM handles 7B-14B models comfortably; 32B generally needs 4-bit quantization or a 24+ GB card. For 70B+ at production throughput, use multi-GPU tensor parallelism (e.g. 2x 48 GB or 4x 24 GB). SGLang's KV cache reuse pays off most on larger cards serving mixed-context workloads with shared prefixes.
Can I run SGLang on AMD or Apple Silicon?
AMD/ROCm: experimental support, lagging features. Apple Silicon: not supported as of 2026. For non-NVIDIA, use llama.cpp or MLX (Apple) for serving.
Related troubleshooting
vLLM ships pre-built wheels against specific CUDA versions. When your system CUDA differs, you get cryptic kernel-image errors. Here's the version matrix and the fix order.
FlashAttention 2 / 3 require specific compute capabilities. Older GPUs and consumer Pascal/Turing cards don't support it. Here's the support matrix and the runtime fallbacks.
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to the same diagnosis: the card doesn't have enough VRAM for the workload. If you're still hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's time to upgrade.