ORPO (Odds Ratio Preference Optimization)

Also known as: odds ratio preference optimization

ORPO (Odds Ratio Preference Optimization) is a fine-tuning method that combines supervised fine-tuning (SFT) and preference alignment into a single training stage. Standard alignment pipelines run SFT first, then DPO or RLHF as a second pass. ORPO collapses both objectives into one loss function — the model learns to follow instructions AND to prefer good responses over bad ones simultaneously, removing the need for a separate reward model and reducing total training compute meaningfully.
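
Written out, the single-stage objective as defined in the ORPO paper (Hong et al., 2024) is, in LaTeX notation:

  \mathcal{L}_{\mathrm{ORPO}} = \mathbb{E}_{(x,\,y_w,\,y_l)}\left[\mathcal{L}_{\mathrm{SFT}} + \lambda \cdot \mathcal{L}_{\mathrm{OR}}\right],
  \qquad
  \mathcal{L}_{\mathrm{OR}} = -\log \sigma\!\left(\log \frac{\mathrm{odds}_\theta(y_w \mid x)}{\mathrm{odds}_\theta(y_l \mid x)}\right),
  \qquad
  \mathrm{odds}_\theta(y \mid x) = \frac{P_\theta(y \mid x)}{1 - P_\theta(y \mid x)}

Here y_w is the chosen response, y_l the rejected one, and λ weights the preference term against the ordinary SFT cross-entropy; minimizing the odds-ratio term pushes the odds of the chosen response above those of the rejected one using only the policy's own probabilities.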

Deeper dive

The ORPO loss adds an odds-ratio term to the standard cross-entropy SFT loss: alongside maximizing the log-probability of the chosen response, it pushes up the odds ratio of the chosen response over the rejected one. Unlike DPO, ORPO does not require a frozen reference model; the odds ratio is computed from the policy model itself. This removes a memory pressure point (DPO must hold the reference model in memory in addition to the policy) and removes the SFT pre-stage requirement. The result is a one-shot preference fine-tune that operators can run on a single 24 GB consumer GPU for 7-8B models with LoRA, whereas the equivalent DPO pipeline would need either more VRAM or a smaller batch size. Quality at the same compute budget is generally competitive with DPO; the main caveats are that the preference signal is weaker (one term in a combined loss, vs. DPO's dedicated preference objective) and that ORPO is newer, with fewer established training recipes.
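
A minimal PyTorch sketch of that loss, assuming the length-normalized sequence log-probabilities and the SFT cross-entropy have already been computed from the policy model (function and variable names here are illustrative, not TRL's internals):

import torch
import torch.nn.functional as F

def orpo_loss(chosen_logps, rejected_logps, chosen_nll, lam=0.1):
    # chosen_logps / rejected_logps: average per-token log P(response | prompt)
    # chosen_nll: standard cross-entropy (SFT) loss on the chosen response
    # lam: weight on the odds-ratio term (TRL exposes the analogous knob as `beta`)
    log_odds_chosen = chosen_logps - torch.log1p(-torch.exp(chosen_logps))
    log_odds_rejected = rejected_logps - torch.log1p(-torch.exp(rejected_logps))
    # reward cases where the chosen response's odds exceed the rejected one's
    odds_ratio_term = F.logsigmoid(log_odds_chosen - log_odds_rejected)
    return (chosen_nll - lam * odds_ratio_term).mean()

Because both log-odds come from the same forward pass of the policy, no second model ever needs to be resident in VRAM, which is exactly where the memory saving over DPO comes from.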

Practical example

An operator fine-tuning Llama 3.1 8B for a domain task, say a customer-support assistant, could collect ~5,000 (prompt, chosen, rejected) triples by editing model outputs and use ORPO to bake the preferences in alongside the instruction-following training. On an RTX 4090 with QLoRA, the training run might take a few hours, whereas the equivalent SFT-then-DPO pipeline would require two separate training stages and roughly double the wall-clock time. The resulting model encodes both the task formatting (from the chosen responses) and the rejection signal (from the contrast against rejected ones) without ever loading a separate reward model into memory.
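
Each triple is just three strings in the prompt/chosen/rejected layout used below; a made-up customer-support row (purely illustrative, not from any real dataset) might look like:

{
    "prompt": "How do I reset my password?",
    "chosen": "Go to Settings > Security and choose 'Reset password'. You'll get an email link that expires after 30 minutes.",
    "rejected": "Contact support, they deal with passwords."
}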

Workflow example

ORPO is supported in Hugging Face's TRL library as ORPOTrainer. The setup mirrors SFTTrainer plus a paired-preference dataset format: each row needs prompt, chosen, and rejected fields. The training command is similar to standard TRL fine-tuning: python orpo_train.py --model meta-llama/Llama-3.1-8B-Instruct --dataset my-preferences --output-dir ./orpo-out. Checkpoints save in HF format and can be loaded with AutoModelForCausalLM, then quantized to GGUF for local Ollama / llama.cpp inference using the standard convert-hf-to-gguf path.
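
A minimal ORPOTrainer setup might look like the sketch below. It assumes a recent TRL release; the dataset name is the placeholder from the command above, the hyperparameters are illustrative, and exact argument names (e.g. tokenizer vs. processing_class) vary between TRL versions:

from datasets import load_dataset
from peft import LoraConfig
from transformers import AutoModelForCausalLM, AutoTokenizer
from trl import ORPOConfig, ORPOTrainer

model_id = "meta-llama/Llama-3.1-8B-Instruct"
model = AutoModelForCausalLM.from_pretrained(model_id)
tokenizer = AutoTokenizer.from_pretrained(model_id)

# each row must provide "prompt", "chosen", and "rejected" fields
dataset = load_dataset("my-preferences", split="train")

config = ORPOConfig(
    output_dir="./orpo-out",
    beta=0.1,                      # weight on the odds-ratio term (lambda in the paper)
    max_length=1024,
    max_prompt_length=512,
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,
    learning_rate=8e-6,
    num_train_epochs=1,
)

trainer = ORPOTrainer(
    model=model,
    args=config,
    train_dataset=dataset,
    tokenizer=tokenizer,
    peft_config=LoraConfig(r=16, lora_alpha=32, task_type="CAUSAL_LM"),  # LoRA keeps an 8B run inside 24 GB
)
trainer.train()

After training, trainer.save_model() writes a standard Hugging Face checkpoint into the output directory, which is the artifact you then convert to GGUF as described above.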

Related terms

  • Fine-tuning
  • RLHF (Reinforcement Learning from Human Feedback)
  • Alignment
  • Direct Preference Optimization (DPO)

Reviewed by Fredoline Eruo. See our editorial policy.
