Sliding Window Attention (SWA)
Also known as: swa, sliding-attention
Sliding Window Attention (SWA) is an attention pattern in which each token attends only to a fixed-size window of nearby tokens rather than to the full context. This bounds per-token compute and the per-layer KV cache at a constant set by the window size, instead of letting them grow linearly with context length. Mistral 7B popularized SWA at scale in open-weight models; because the windows in successive layers stack, information can still propagate across distances much longer than a single window.
Deeper dive
Standard self-attention costs O(N²) compute in total (O(N) per token) over a context of N tokens, and its KV cache grows linearly with N. SWA caps the receptive field at a window of W tokens (typically 4,096 or 8,192): each token attends only to the previous W tokens. Compute per token becomes O(W) instead of O(N), and the KV cache for a layer is bounded at W tokens. Stacking multiple SWA layers extends the effective receptive field: with W = 4096 and L = 32 layers, information from the very first token can reach the last token through L hops of up to W positions each, an effective range of roughly L × W ≈ 131K tokens. The tradeoff is that this only approximates full-context attention; long-range dependencies are not preserved as faithfully as with full attention, and long-context reasoning benchmarks often show a measurable regression versus models that keep full attention and manage the KV cache aggressively instead.
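To make the pattern concrete, the sketch below builds the boolean mask SWA implies. It is illustrative only (NumPy, with a made-up helper name), not Mistral's actual implementation.

```python
# Minimal sketch of a sliding-window attention mask (illustrative, not taken
# from any particular implementation). Query position i may attend only to key
# positions j with i - W < j <= i, i.e. the W most recent tokens including itself.
import numpy as np

def sliding_window_mask(n_tokens: int, window: int) -> np.ndarray:
    """Boolean mask of shape (n_tokens, n_tokens); True means attention is allowed."""
    i = np.arange(n_tokens)[:, None]  # query positions (rows)
    j = np.arange(n_tokens)[None, :]  # key positions (columns)
    causal = j <= i                   # never attend to future tokens
    recent = j > i - window           # only the last `window` tokens
    return causal & recent

print(sliding_window_mask(n_tokens=8, window=3).astype(int))
# Each row has at most 3 ones: token i sees tokens i-2, i-1, and i.
# Stacked over L layers, information can hop up to roughly L * W positions,
# which is where the ~131K effective range for W = 4096, L = 32 comes from.
```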
Practical example
Mistral 7B v0.1 ships with W = 4096 and 32 layers. At a 16K context, caching every token with that model's shape (FP16, 32 layers, 8 KV heads of dimension 128, i.e. its GQA layout) would take roughly 2 GB of KV cache; the sliding window caps the cache at 4K tokens regardless of how much context is fed in, which keeps it around 512 MB. The cost shows up on tasks where the model must recall a specific fact from the front of a long document: full-attention models tend to score higher on needle-in-a-haystack evaluations than SWA-only models at equivalent parameter count.
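The figures above follow from simple arithmetic; the sketch below reproduces them under the stated assumptions (Mistral 7B v0.1's published shape: 32 layers, 8 KV heads of dimension 128, FP16), ignoring the allocator and paging overhead a real serving stack adds.

```python
# Back-of-the-envelope KV-cache sizing for full attention vs. SWA.
layers, kv_heads, head_dim, bytes_per_value = 32, 8, 128, 2  # FP16 = 2 bytes

per_token = 2 * layers * kv_heads * head_dim * bytes_per_value  # keys + values
print(per_token // 1024, "KiB per token")                        # 128 KiB

full_16k = 16_384 * per_token  # full attention caches every token in a 16K context
swa_cap = 4_096 * per_token    # SWA never caches more than W = 4096 tokens
print(round(full_16k / 2**30, 1), "GiB for full attention at 16K")  # ~2.0 GiB
print(round(swa_cap / 2**20), "MiB cap with SWA")                   # ~512 MiB
```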
Workflow example
Operators don't enable SWA; it's a baked-in architecture choice. The relevance is in model selection: when comparing two models for a long-context workload, check whether either uses SWA (the model card or the sliding_window field in config.json will say so; see the sketch below). For a workload where the model must reliably surface details from anywhere in the context, prefer a model with full attention plus GQA (Llama 3.1's approach) over one with SWA. For a workload where the model needs to keep recent state but won't be tested on distant recall, SWA is the more VRAM-efficient choice.
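If you want to automate that check, here is a minimal sketch using Hugging Face transformers; the model IDs are only examples (gated repositories may require authentication), and not every architecture exposes a sliding_window field.

```python
# Inspect a model's config to see whether it declares a sliding window.
from transformers import AutoConfig

for model_id in ["mistralai/Mistral-7B-v0.1", "meta-llama/Llama-3.1-8B"]:  # example IDs
    cfg = AutoConfig.from_pretrained(model_id)
    window = getattr(cfg, "sliding_window", None)  # absent or None => full attention
    if window:
        print(f"{model_id}: SWA with a {window}-token window")
    else:
        print(f"{model_id}: no sliding window declared (full attention over the context)")
```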
Related terms
KV cache, Grouped-Query Attention (GQA), context window