LoRA (Low-Rank Adaptation)
LoRA is a parameter-efficient fine-tuning technique that adapts a large pre-trained model by training small low-rank matrices added alongside the original weights, leaving the base model frozen.
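A minimal sketch of the mechanism in PyTorch may make this concrete. The class name LoRALinear and the hyperparameters here are illustrative, not any particular library's API: the pre-trained weight W stays frozen while two small matrices A (r × in_features) and B (out_features × r) learn the update, so the effective weight is W + BA.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Frozen linear layer plus a trainable low-rank update (illustrative sketch)."""
    def __init__(self, base: nn.Linear, r: int = 16, alpha: int = 32):
        super().__init__()
        self.base = base
        self.base.weight.requires_grad_(False)   # freeze the pre-trained weight
        if self.base.bias is not None:
            self.base.bias.requires_grad_(False)
        # Low-rank factors: A projects down to rank r, B projects back up.
        self.lora_A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        self.lora_B = nn.Parameter(torch.zeros(base.out_features, r))  # zero init: no change at start
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # Frozen path plus the scaled low-rank adapter path.
        return self.base(x) + (x @ self.lora_A.T @ self.lora_B.T) * self.scale
```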
Instead of updating all 8 billion parameters of a Llama 3.1 8B model (which needs 60+ GB of VRAM and hours of compute), LoRA trains rank-16 or rank-32 adapter matrices, typically well under 2% of the original parameter count. You can fine-tune a 7B model on a single 16-24 GB GPU in an afternoon.
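To see where the savings come from, consider a single 4096 × 4096 projection matrix (a size used here only as an illustrative assumption): a rank-r adapter adds r·(d_in + d_out) trainable parameters instead of d_in·d_out.

```python
d_in, d_out, r = 4096, 4096, 16

full = d_in * d_out        # parameters touched by full fine-tuning of this one layer
lora = r * (d_in + d_out)  # parameters in the rank-r adapter (A: r x d_in, B: d_out x r)

print(f"full: {full:,}  lora: {lora:,}  ratio: {lora / full:.2%}")
# full: 16,777,216  lora: 131,072  ratio: 0.78%
```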
LoRA adapters are tiny (~50-200 MB) and modular: you can load a base model once and swap adapters at inference time for different tasks. QLoRA combines LoRA with 4-bit quantization of the base model, cutting the base model's memory footprint roughly 4× versus 16-bit and making fine-tuning of ~30B models possible on a single 24 GB consumer card (and of 65-70B models on a single 48 GB GPU).
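A hedged sketch of that workflow, assuming the Hugging Face transformers, peft, and bitsandbytes libraries; the model id and adapter paths are placeholders:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import PeftModel

# Load the base model once, quantized to 4-bit (the QLoRA-style setup).
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-3.1-8B",        # placeholder model id
    quantization_config=bnb,
    device_map="auto",
)

# Attach one adapter, register another, and switch between them per task.
model = PeftModel.from_pretrained(
    base, "path/to/summarization-adapter",  # placeholder adapter path
    adapter_name="summarize",
)
model.load_adapter("path/to/sql-adapter", adapter_name="sql")  # placeholder path
model.set_adapter("sql")  # only the small adapter changes; the base weights stay loaded
```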