FlashAttention not supported — fall back, or upgrade
FlashAttention 2 / 3 require specific compute capabilities; older GPUs, including consumer Pascal and Turing cards, don't support them. Here's the support matrix and the runtime fallbacks.
Diagnostic order — most likely first
GPU compute capability below FlashAttention 2 requirement
`nvidia-smi --query-gpu=compute_cap --format=csv,noheader` shows < 8.0. FlashAttention 2 needs sm_80+ (Ampere or newer). FA3 needs sm_90+ (Hopper) or sm_100+ (Blackwell).
Use the standard PyTorch attention (still functional, just slower). In Transformers: `attn_implementation='eager'`. In vLLM: it falls back automatically. To get FA2 perf, you need a 30-series card or newer.
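A minimal sketch of that decision in Transformers (the model id is just a placeholder; swap in whatever you run locally):

```python
import torch
from transformers import AutoModelForCausalLM

# Pick the attention implementation from the GPU's compute capability:
# sm_80+ (Ampere or newer) can use FlashAttention 2, older cards use eager.
major, _ = torch.cuda.get_device_capability(0)
attn_impl = "flash_attention_2" if major >= 8 else "eager"

model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B-Instruct",  # placeholder model id
    torch_dtype=torch.float16,
    attn_implementation=attn_impl,       # "flash_attention_2" also requires flash-attn installed
)
```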
FlashAttention not installed in the environment
GPU is supported but `import flash_attn` errors. Logs say 'flash-attention not available, falling back.'
Install: `pip install flash-attn --no-build-isolation`. This compiles against your CUDA — takes 5-15 minutes the first time. Pre-built wheels exist for common configs (CUDA 12.4 + Python 3.11 + Torch 2.5).
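A quick sanity check after the install (assumes a CUDA build of PyTorch; version numbers will vary):

```python
import torch
import flash_attn

print(flash_attn.__version__)               # e.g. 2.x.y
print(torch.cuda.get_device_capability(0))  # should be (8, 0) or higher for FA2
```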
FlashAttention version doesn't support your model architecture
FA installs and loads but errors on a specific model: 'FlashAttention does not support this attention mask shape.'
Some models (sliding window attention, custom masks) need FA 2.5+ or specific patches. Upgrade flash-attn (`pip install --upgrade flash-attn`). For models still unsupported, fall back to eager attention.
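One way to make that fallback automatic is to wrap the model load, roughly like this (note that some unsupported-mask errors only surface at the first forward pass, not at load time):

```python
import torch
from transformers import AutoModelForCausalLM

def load_with_attention_fallback(model_id: str):
    """Try FlashAttention 2 first; fall back to eager if the architecture
    (mask shape, sliding window, etc.) isn't supported by the installed flash-attn."""
    try:
        return AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            attn_implementation="flash_attention_2",
        )
    except (ImportError, ValueError, RuntimeError) as err:
        print(f"flash_attention_2 failed ({err}); retrying with eager attention")
        return AutoModelForCausalLM.from_pretrained(
            model_id,
            torch_dtype=torch.float16,
            attn_implementation="eager",
        )
```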
AMD GPU expecting FlashAttention (it's NVIDIA-only)
Running on an RX 7900 XTX or similar AMD card. The flash-attn pip package is built for CUDA; there are no official ROCm wheels to install.
Use the AMD-compatible alternatives: xformers (partial ROCm support), Triton-based attention, or fall back to PyTorch SDPA, which has a memory-efficient path on ROCm.
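For the SDPA route, here's a sketch that pins PyTorch to its memory-efficient backend. It uses `torch.nn.attention.sdpa_kernel`, available from PyTorch 2.3; whether the memory-efficient path is actually compiled in depends on your ROCm/PyTorch build.

```python
import torch
import torch.nn.functional as F
from torch.nn.attention import SDPBackend, sdpa_kernel

# Dummy tensors: (batch, heads, seq_len, head_dim). On ROCm the device is still "cuda".
q = torch.randn(1, 8, 4096, 64, device="cuda", dtype=torch.float16)
k, v = torch.randn_like(q), torch.randn_like(q)

# Restrict SDPA to the memory-efficient backend instead of letting it auto-select.
with sdpa_kernel(SDPBackend.EFFICIENT_ATTENTION):
    out = F.scaled_dot_product_attention(q, k, v, is_causal=True)
print(out.shape)  # torch.Size([1, 8, 4096, 64])
```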
Frequently asked questions
How much faster is FlashAttention vs standard attention?
1.5-3x throughput on long-context inference, 2-4x on training. Memory savings are larger: FA2 reduces attention memory from O(n²) to O(n). For 16K+ context windows, the difference is 'fits or doesn't.'
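Rough arithmetic behind the "fits or doesn't" claim, assuming a 32-head model in fp16 (the head count is illustrative):

```python
# Standard attention materializes an n x n score matrix per head;
# FlashAttention computes the same result in tiles and never stores it.
n_tokens, n_heads, bytes_fp16 = 16_384, 32, 2
per_head = n_tokens ** 2 * bytes_fp16                   # 0.5 GiB for one head
per_layer_gib = per_head * n_heads / 2 ** 30
print(f"{per_layer_gib:.0f} GiB of scores per layer")   # ~16 GiB
```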
Do I need FlashAttention for local AI to work?
No. It's an optimization. Models work with standard attention; they just use more VRAM and run slower at long context. For 4K-8K typical usage on a 24 GB card, the difference is small. For 32K+ context, FA is closer to mandatory.
What about FlashAttention 3?
FA3 launched in 2024 for Hopper (H100). It uses async tensor cores and FP8 paths to push another 1.5-2x over FA2. Consumer Blackwell (RTX 5090) gets partial FA3 support; Hopper is the reference platform.
Can I use FlashAttention on an RTX 3060 or 3070?
Yes. Every consumer RTX 30-series card (Ampere) is compute capability sm_86, which clears the sm_80 minimum for FlashAttention 2. The 3060 12 GB is actually a great budget card for FA2-enabled inference because the 12 GB buffer lets you run the larger context windows that benefit most from FA2's memory savings.
Is SDPA (PyTorch's built-in) enough if I can't run FlashAttention?
For short-context inference (≤4K tokens): yes. SDPA with `attn_implementation='sdpa'` is within 10-20% of FlashAttention on most consumer GPUs at short context. The gap widens at 8K+ contexts, where SDPA is likelier to fall back to its O(n²) math path and start hitting VRAM limits. For training: SDPA works, but FA2's kernels are faster and handle more mask variants; at scale, FA2 is effectively mandatory above 4K context.
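To see which SDPA backends PyTorch is willing to use on your install (these flags report what's enabled, not what your GPU ultimately supports for a given shape):

```python
import torch

print("flash backend enabled:   ", torch.backends.cuda.flash_sdp_enabled())
print("memory-efficient enabled:", torch.backends.cuda.mem_efficient_sdp_enabled())
print("math fallback enabled:   ", torch.backends.cuda.math_sdp_enabled())
```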
Can I compile FlashAttention from source for my specific GPU?
Yes, but it's a heavy build. `pip install flash-attn --no-build-isolation` compiles against your installed CUDA + PyTorch. The build takes 5-15 minutes and needs significant RAM (16+ GB recommended, 32+ GB for multi-arch builds). Pass `TORCH_CUDA_ARCH_LIST='8.6'` (for 3060/3070) or `'8.9'` (for 4090) to target only your GPU — cuts build time by 60-70%. Pre-built wheels exist for CUDA 12.4 + Python 3.10-3.12 + Torch 2.4-2.5 combos on Linux; pip should pull them automatically if your combo matches.
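A small helper for the arch flag: it prints the value to export before running the pip build, so only your GPU's architecture gets compiled (assumes a single-GPU machine):

```python
import torch

# Compute capability of GPU 0, formatted the way TORCH_CUDA_ARCH_LIST expects.
major, minor = torch.cuda.get_device_capability(0)
print(f"export TORCH_CUDA_ARCH_LIST={major}.{minor}")  # e.g. 8.6 on a 3060/3070
```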
Related troubleshooting
Why CUDA OOM happens during local LLM inference and image gen, how to confirm the real cause, and the four real fixes (smaller quant, shorter context, gradient checkpointing, or more VRAM).
vLLM ships pre-built wheels against specific CUDA versions. When your system CUDA differs, you get cryptic kernel-image errors. Here's the version matrix and the fix order.
When the fix is hardware
A surprising fraction of troubleshooting tickets resolve to: this card doesn't have enough VRAM for what you're asking it to do. If you're hitting OOM after every reasonable fix, or your GPU genuinely can't fit the model you need, it's upgrade time: