Very slow first token / OOM only at long prompts
Cause
Prefill (the prompt-processing phase) is compute-bound. The attention part of the work grows roughly quadratically with prompt length, while the rest of the model (MLPs and projections) grows linearly. A 64K prompt has 32× the tokens of a 2K prompt, but the prefill cost is 32–1000× higher depending on how much of the work is attention at that length. Flash Attention does not change this scaling; it avoids materializing the full attention matrix, which cuts memory use and speeds up each step, but the quadratic growth in attention compute remains.
A second cause: the KV cache grows linearly with prompt length, so the cache for a long prompt may not fit in VRAM, triggering an OOM only once the prompt crosses a threshold even though shorter prompts are fine.
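A back-of-envelope sketch makes both effects concrete. The numbers below are illustrative, assuming an 8B-class model (32 layers, 8 KV heads of dimension 128, fp16 cache); substitute your model's dimensions.
# Attention work is ~quadratic in prompt length; KV cache memory is linear.
prompt_short, prompt_long = 2048, 65536
print(prompt_long / prompt_short)           # 32x more tokens (the linear parts of prefill)
print((prompt_long / prompt_short) ** 2)    # ~1024x more attention work

def kv_cache_gib(tokens, layers=32, kv_heads=8, head_dim=128, bytes_per_elem=2):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30   # 2x for K and V

print(kv_cache_gib(2048))    # ~0.25 GiB -- fits easily
print(kv_cache_gib(65536))   # ~8.0 GiB  -- on top of the weights, enough to tip a card into OOM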
Solution
1. Enable Flash Attention (most runners support it, but some don't enable it by default, especially on older GPUs):
# llama.cpp
./llama-server -m model.gguf -fa on # or --flash-attn
# vLLM picks the FlashAttention backend automatically on Ampere+ GPUs; no flag needed
VLLM_ATTENTION_BACKEND=FLASH_ATTN vllm serve <model>   # force it explicitly if unsure
# Transformers (needs the flash-attn package and fp16/bf16 weights)
model = AutoModelForCausalLM.from_pretrained(
    name, attn_implementation="flash_attention_2", torch_dtype=torch.bfloat16)
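Before relying on the Transformers path, it's worth checking that the environment can actually use it. A minimal sketch, assuming a CUDA GPU is present (flash-attn wheels generally require compute capability 8.0+, i.e. Ampere or newer):
import torch
print(torch.cuda.get_device_capability())   # (8, 0) or higher means Ampere+
try:
    import flash_attn                        # the package attn_implementation="flash_attention_2" dispatches to
    print("flash-attn", flash_attn.__version__)
except ImportError:
    print("flash-attn not installed; passing attn_implementation=\"flash_attention_2\" will raise")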
2. Use prefix caching if the long context is repeated across requests (system prompt, RAG context):
vllm serve <model> --enable-prefix-caching
The first request pays the full prefill cost; subsequent requests with a matching prefix skip it.
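For the cache to hit, the shared context has to be byte-identical across requests. A rough usage sketch against vLLM's OpenAI-compatible endpoint (the URL, file name, and questions are placeholders):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")
shared_context = open("rag_context.txt").read()    # keep these bytes identical on every request

for question in ["What does section 2 say?", "Summarize the key risks."]:
    resp = client.chat.completions.create(
        model="<model>",
        messages=[
            {"role": "system", "content": shared_context},   # prefill cached after the first request
            {"role": "user", "content": question},
        ],
    )
    print(resp.choices[0].message.content)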
3. Quantize the KV cache to fit longer context in the same VRAM:
./llama-server -m model.gguf -fa on -c 65536 --cache-type-k q8_0 --cache-type-v q8_0
(llama.cpp requires flash attention to be enabled when the V cache is quantized)
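The saving is easy to estimate: q8_0 stores roughly one byte per element against two for f16, so the same VRAM holds about twice the context. Using the same illustrative 8B-class dimensions as above:
def kv_cache_gib(tokens, bytes_per_elem, layers=32, kv_heads=8, head_dim=128):
    return 2 * layers * kv_heads * head_dim * bytes_per_elem * tokens / 2**30

print(kv_cache_gib(65536, 2))   # f16:  ~8.0 GiB
print(kv_cache_gib(65536, 1))   # q8_0: ~4.0 GiB (ignoring small per-block scale overhead)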
4. Chunk the prompt. If you're feeding a 200K-token document, summarize segments first (see the sketch below) or use a model whose architecture was trained for long context (Llama 4 Scout's 10M context was trained for it; Llama 3 extended to 128K via YaRN degrades noticeably past 32K in practice).
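If you take the chunking route, a simple map-reduce pass is often enough: summarize each segment, then summarize the summaries. A hedged sketch against a local OpenAI-compatible server (the base URL, model name, file name, and chunk size are assumptions to adjust for your setup):
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="unused")

def summarize(text, instruction="Summarize the following text in a few sentences."):
    resp = client.chat.completions.create(
        model="<model>",
        messages=[{"role": "user", "content": f"{instruction}\n\n{text}"}],
    )
    return resp.choices[0].message.content

document = open("big_document.txt").read()
chunk_chars = 16000    # roughly 4K tokens; keep each chunk well under the model's context window
chunks = [document[i:i + chunk_chars] for i in range(0, len(document), chunk_chars)]

partials = [summarize(c) for c in chunks]     # map: summarize each segment independently
print(summarize("\n\n".join(partials), "Combine these partial summaries into one coherent summary."))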
5. Confirm the model's attention layers are fully GPU-resident. Leaving a "tail" of layers on the CPU (for example, -ngl 28 on a 32-layer model keeps 4 layers on the CPU) gives correct output but kills prefill speed; offload every layer (e.g. -ngl 99) if VRAM allows.
Did this fix it?
If your case was different, email support@runlocalai.co with what you saw and we'll update the page. If it worked but took different commands on your platform, we want to know that too.