ORPO (Odds Ratio Preference Optimization)
ORPO (Odds Ratio Preference Optimization) is a fine-tuning method that combines supervised fine-tuning (SFT) and preference alignment into a single training stage. Standard alignment pipelines run SFT first, then DPO or RLHF as a second pass. ORPO collapses both objectives into one loss function: the model learns to follow instructions and to prefer good responses over bad ones simultaneously, removing the need for a separate reward model and meaningfully reducing total training compute.
Deeper dive
The ORPO loss adds an odds-ratio term to the standard cross-entropy SFT loss: alongside maximizing the log-probability of the chosen response, it maximizes the log odds ratio of the chosen response over the rejected one, penalizing the model whenever the rejected response is the more likely of the two. Unlike DPO, ORPO does not require a separately trained reference model; the odds ratio is computed entirely from the probabilities of the model being trained. This removes a memory pressure point (DPO must hold the reference model in memory in addition to the policy) and removes the SFT pre-stage requirement. The result is a single-stage preference fine-tune that operators can run on a single 24 GB consumer GPU for 7-8B models with LoRA, where the equivalent DPO pipeline would need either more VRAM or a smaller batch size. Quality at the same compute budget is generally competitive with DPO; the main caveats are that the preference signal is weaker (one term in a combined loss, versus DPO's dedicated preference objective) and that ORPO is newer, with less mature training recipes.
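In code, the combined objective is compact. The following is a minimal sketch of the per-batch loss, assuming the caller supplies length-normalized log-probabilities for each full response; the function and argument names are illustrative, and lam plays the role of the weighting coefficient (beta in TRL's implementation):

    import torch
    import torch.nn.functional as F

    def orpo_loss(chosen_logps, rejected_logps, lam=0.1):
        # chosen_logps / rejected_logps: length-normalized log P(response | prompt),
        # shape (batch,); values are strictly negative for a nondegenerate model.
        # log odds(y) = log p - log(1 - p); log1p(-exp(log p)) is a stable log(1 - p).
        log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
        log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))

        # Odds-ratio term: push the chosen response's log-odds above the rejected one's
        or_loss = -F.logsigmoid(log_odds_chosen - log_odds_rejected)

        # SFT term: ordinary negative log-likelihood on the chosen response
        sft_loss = -chosen_logps

        return (sft_loss + lam * or_loss).mean()

Because both terms come from the same forward passes over the chosen and rejected sequences, no second model or extra inference pass is needed.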
Practical example
An operator fine-tuning Llama 3.1 8B for a domain task — say, a customer-support assistant — could collect ~5,000 (prompt, chosen, rejected) triples by editing model outputs and use ORPO to bake the preferences in alongside the instruction-following training. On an RTX 4090 with QLoRA, the training run might take a few hours, where the equivalent SFT-then-DPO pipeline would require two separate training stages and roughly double the wall-clock time. The resulting model encodes both the task formatting (from the chosen responses) and the rejection signal (from the contrast against rejected ones) without ever loading a separate reward model into memory.
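A single triple from such a dataset might look like the sketch below (field names follow TRL's paired-preference convention; the content and file name are invented for illustration):

    from datasets import Dataset

    # Invented example row; a real dataset would hold ~5,000 operator-edited triples
    rows = [
        {
            "prompt": "My order arrived damaged. What should I do?",
            "chosen": "Sorry to hear that! Reply with your order number and a photo "
                      "of the damage and we'll ship a replacement right away.",
            "rejected": "Damaged items are not our responsibility once shipped.",
        },
    ]

    Dataset.from_list(rows).to_json("my-preferences.jsonl")

The chosen responses carry the desired tone and formatting; the rejected ones supply the contrast the odds-ratio term trains against.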
Workflow example
ORPO is supported in Hugging Face's TRL library as ORPOTrainer. The setup mirrors SFTTrainer, with the addition of a paired-preference dataset format: each row needs prompt, chosen, and rejected fields. The training command is similar to standard TRL fine-tuning, e.g. python orpo_train.py --model meta-llama/Llama-3.1-8B-Instruct --dataset my-preferences --output-dir ./orpo-out. Checkpoints save in HF format and can be loaded with AutoModelForCausalLM, then quantized to GGUF for local Ollama / llama.cpp inference via the standard convert-hf-to-gguf path.
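A minimal training script along those lines might look like the following (sketched against a recent TRL release; keyword names such as processing_class vs. the older tokenizer vary between versions, and the hyperparameters are placeholders, not a tuned recipe):

    from datasets import load_dataset
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from trl import ORPOConfig, ORPOTrainer

    model_name = "meta-llama/Llama-3.1-8B-Instruct"
    model = AutoModelForCausalLM.from_pretrained(model_name)
    tokenizer = AutoTokenizer.from_pretrained(model_name)

    # Each row must carry prompt / chosen / rejected fields
    dataset = load_dataset("json", data_files="my-preferences.jsonl", split="train")

    config = ORPOConfig(
        output_dir="./orpo-out",
        beta=0.1,                       # weight on the odds-ratio term
        per_device_train_batch_size=2,
        num_train_epochs=1,
    )

    trainer = ORPOTrainer(
        model=model,
        args=config,
        train_dataset=dataset,
        processing_class=tokenizer,
    )
    trainer.train()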