Transformer & LLM components
SwiGLU
SwiGLU is a gated feed-forward activation: SwiGLU(x) = W3 · (W1·x ⊙ swish(W2·x)), i.e. the up-projection W1·x is gated elementwise by a swish-activated projection of the same input, then mapped back down by W3. It replaces the GELU/ReLU MLP in modern transformer FFN blocks and is used in PaLM, Llama, Mistral, Qwen, and DeepSeek.
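A minimal sketch of the block in PyTorch (assumed here; the layer names w_gate/w_up/w_down are illustrative, not taken from any particular model's source):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SwiGLUFFN(nn.Module):
    """Gated feed-forward block: down(up(x) * swish(gate(x)))."""

    def __init__(self, hidden_dim: int, ffn_dim: int):
        super().__init__()
        self.w_gate = nn.Linear(hidden_dim, ffn_dim, bias=False)  # W2 in the formula above
        self.w_up   = nn.Linear(hidden_dim, ffn_dim, bias=False)  # W1
        self.w_down = nn.Linear(ffn_dim, hidden_dim, bias=False)  # W3

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # silu is swish with beta = 1: x * sigmoid(x)
        return self.w_down(self.w_up(x) * F.silu(self.w_gate(x)))
```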
The gating gives a small but consistent perplexity improvement over a GELU MLP at the same parameter budget. The cost is a third weight matrix (the gate projection, W2 above): at the same FFN width, a SwiGLU block has ~50% more parameters than a two-matrix GELU MLP. This is why Llama-style architectures set the FFN dimension to roughly 8/3 ≈ 2.67× the hidden size instead of the classic 4×, keeping total FFN parameters roughly unchanged.
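A back-of-the-envelope check of that sizing (illustrative numbers only, not taken from any model card):

```python
# Matching parameter counts: 2 matrices at 4x width vs 3 matrices at ~8/3x width.
hidden = 4096

gelu_ffn   = 4 * hidden            # classic 4x expansion
swiglu_ffn = int(8 / 3 * hidden)   # ~2.67x expansion

gelu_params   = 2 * hidden * gelu_ffn     # up + down
swiglu_params = 3 * hidden * swiglu_ffn   # gate + up + down

print(gelu_params, swiglu_params)  # both come out to about 8 * hidden**2
```

In practice the SwiGLU intermediate size is usually rounded up to a hardware-friendly multiple, so real configs land slightly above the exact 8/3 factor.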
For inference, SwiGLU adds one extra matmul per layer compared with a GELU MLP; modern fused FFN kernels handle the gated form natively, so the practical overhead is small.