
Sliding Window Attention (SWA)

Also known as: swa, sliding-attention

Sliding Window Attention (SWA) is an attention pattern where each token attends only to a fixed-size window of nearby tokens rather than the full context. It bounds per-token compute and the KV cache at a constant determined by the window size, rather than letting them grow with context length. Mistral 7B popularized SWA in open-weight models; stacked layers can still propagate information across distances longer than a single window.
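To make the pattern concrete, here is a minimal sketch that builds a sliding-window causal mask alongside a full causal mask. It is illustrative only; the function names and shapes are not from any particular library.

```python
import numpy as np

def causal_mask(n: int) -> np.ndarray:
    # Full causal attention: token i attends to every token j <= i.
    return np.tril(np.ones((n, n), dtype=bool))

def sliding_window_mask(n: int, w: int) -> np.ndarray:
    # SWA: token i attends only to tokens j with i - w < j <= i.
    rows = np.arange(n)[:, None]
    cols = np.arange(n)[None, :]
    return causal_mask(n) & (rows - cols < w)

mask = sliding_window_mask(n=8, w=3)
print(mask.astype(int))
# Each row has at most 3 ones: the token itself plus the 2 before it.
```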

Deeper dive

Standard causal self-attention costs O(N²) total compute for a context of N tokens: each new token attends to all tokens before it, so per-token compute is O(N), and the KV cache grows as O(N). SWA caps the receptive field at a window W (typically 4,096 or 8,192): every token attends only to the last W tokens. Per-token compute becomes O(W) instead of O(N), and each layer's KV cache is bounded at W tokens. Stacking multiple SWA layers extends the effective receptive field: with W=4096 and L=32 layers, information from the very first token can reach the last token through L hops of length W, an effective range of L × W ≈ 131K tokens. The tradeoff is that this only approximates full-context attention: long-range dependencies are not preserved as faithfully as under full attention, and long-context reasoning benchmarks often show measurable regression versus models that pair full attention with aggressive KV-cache management.
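A back-of-envelope sketch of this arithmetic, using Mistral 7B v0.1's published window and layer count (the context lengths in the loop are illustrative):

```python
W, L = 4096, 32  # Mistral 7B v0.1: window size, layer count

# Per-token attention work: full attention scales with context length N,
# SWA is capped at the window size.
for N in (4_096, 16_384, 131_072):
    print(f"N={N:>7}: full attends to {N} keys, SWA to {min(N, W)} keys")

# Stacking L sliding-window layers lets information hop W tokens per layer,
# so the theoretical receptive field is L * W.
print(f"effective receptive field ~= {L * W} tokens")  # 131072 (~128K)
```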

Practical example

Mistral 7B v0.1 ships with W=4096 and 32 layers. At a 16K context, a full-attention model with the same config would need ~2 GB of KV cache (FP16, 8 KV heads via GQA, head dim 128); the SWA variant caps its cache at 4K tokens no matter how much context is fed in, which keeps it around ~512 MB. The cost shows up on tasks where the model must recall a specific fact from the front of a long document: full-attention models tend to score higher on needle-in-a-haystack evaluations than SWA-only models at equivalent parameter count.
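The cache figures above follow directly from the config. A sketch of the arithmetic (constants taken from Mistral 7B v0.1's config; computed, not measured):

```python
# Per-token KV bytes = 2 (K and V) * layers * kv_heads * head_dim * bytes/elem
layers, kv_heads, head_dim, bytes_fp16 = 32, 8, 128, 2

def kv_cache_bytes(tokens: int) -> int:
    return 2 * layers * kv_heads * head_dim * bytes_fp16 * tokens

full_16k = kv_cache_bytes(16_384)  # full attention caches every token seen
swa_cap  = kv_cache_bytes(4_096)   # SWA caches at most W=4096 tokens
print(f"full attention @16K: {full_16k / 2**30:.1f} GiB")  # ~2.0 GiB
print(f"SWA cap (W=4096):    {swa_cap / 2**20:.0f} MiB")   # ~512 MiB
```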

Workflow example

Operators don't enable SWA; it's a baked-in architecture choice. The relevance is in model selection: when comparing two models for a long-context workload, check whether either uses SWA (the model card, or the sliding_window field in config.json, will say, as sketched below). For a workload where the model must reliably surface details from anywhere in the context, prefer a model with full attention plus GQA (Llama 3.1's approach) over one with SWA. For a workload where the model needs recent state but won't be tested on distant recall, SWA is the more VRAM-efficient choice.
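One way to run that check before downloading any weights, assuming the transformers library and Hub access (the model IDs are examples, and gated repos such as Llama 3.1 need an accepted license and token):

```python
from transformers import AutoConfig

for model_id in ("mistralai/Mistral-7B-v0.1", "meta-llama/Meta-Llama-3.1-8B"):
    cfg = AutoConfig.from_pretrained(model_id)
    # None or absent means full attention; an integer is the window size.
    window = getattr(cfg, "sliding_window", None)
    print(f"{model_id}: sliding_window={window}")
# Mistral 7B v0.1 reports 4096; Llama 3.1 has no window (full attention + GQA).
```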

Related terms

  • Context Window
  • Flash Attention
  • Mistral

Reviewed by Fredoline Eruo. See our editorial policy.
