QLoRA

QLoRA combines LoRA fine-tuning with 4-bit quantization of the base model. Introduced by Tim Dettmers et al. in 2023, it cut the VRAM cost of fine-tuning by roughly 4× relative to 16-bit LoRA and made single-GPU fine-tuning of 65B-parameter models practical for the first time: the original paper tuned a 65B model on one 48 GB GPU.

The technique: the base model weights are loaded in 4-bit NF4 (NormalFloat-4, a data type whose 16 levels are quantiles of a normal distribution), while the LoRA adapter weights are kept in 16-bit. Forward passes dequantize the base weights on the fly; gradients are backpropagated through the frozen quantized weights, but only the adapter parameters are updated. The paper adds two further memory tricks: double quantization (quantizing the per-block scales themselves) and paged optimizers to absorb memory spikes.
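A minimal NumPy sketch of the blockwise NF4 quantize/dequantize step described above (the code book values are the 16 NF4 levels published in the QLoRA reference implementation; the block size of 64 matches the paper's default, and the function names are illustrative, not a real library API):

```python
import numpy as np

# The 16 NF4 code-book values (quantiles of a standard normal, scaled to [-1, 1]),
# as published in the QLoRA reference implementation.
NF4_CODEBOOK = np.array([
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
])

def quantize_nf4(w, block_size=64):
    """Blockwise absmax quantization: store a 4-bit index per weight
    plus one floating-point scale per block."""
    w = w.reshape(-1, block_size)
    scales = np.abs(w).max(axis=1, keepdims=True)  # absmax scale per block
    normed = w / scales                            # now in [-1, 1]
    # nearest code-book entry for each normalized weight
    idx = np.abs(normed[..., None] - NF4_CODEBOOK).argmin(axis=-1)
    return idx.astype(np.uint8), scales

def dequantize_nf4(idx, scales):
    """The on-the-fly dequantization a QLoRA forward pass performs."""
    return NF4_CODEBOOK[idx] * scales
```

Quantizing a weight tensor and dequantizing it back gives a small, bounded reconstruction error per block (at most half the largest code-book gap times the block's scale), which is why the frozen base model remains usable at 4 bits.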

Practical impact: a Llama 3.1 70B base model in NF4 occupies about 35 GB, so a QLoRA fine-tune fits on a single 48 GB GPU (e.g., an RTX A6000), and an 8B model fits comfortably on a 24 GB RTX 4090, where full 16-bit fine-tuning of 70B would require multiple 80 GB A100s. Tools like Unsloth optimize QLoRA further, claiming roughly 2× the speed of the reference Hugging Face implementation.
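The arithmetic behind these figures can be sketched with a rough VRAM estimator (the 1% adapter fraction and two FP32 Adam moments per trainable parameter are illustrative assumptions; activation memory, which depends on batch size and sequence length, is deliberately omitted):

```python
def qlora_vram_gb(n_params, adapter_frac=0.01, quant_bits=4):
    """Rough weight + optimizer VRAM estimate for a QLoRA fine-tune, in GB.

    adapter_frac: assumed fraction of parameters held in LoRA adapters
                  (illustrative; depends on rank and target modules).
    Activation memory is NOT included.
    """
    base = n_params * quant_bits / 8          # frozen 4-bit base weights
    adapter = n_params * adapter_frac * 2     # 16-bit adapter weights
    optimizer = n_params * adapter_frac * 8   # two FP32 Adam moments per
                                              # trainable (adapter) parameter
    return (base + adapter + optimizer) / 1e9

print(f"70B: ~{qlora_vram_gb(70e9):.0f} GB")  # fits a 48 GB GPU
print(f" 8B: ~{qlora_vram_gb(8e9):.1f} GB")   # fits a 24 GB GPU
```

Because the optimizer states exist only for the tiny adapter, the 4-bit base weights dominate the budget, which is the core of QLoRA's savings.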
