Capability notes
Reranking is the second stage of two-stage retrieval: a fast first-stage retriever (embedding + vector search) returns 50–200 candidates with high recall but moderate precision, then a reranker cross-encodes each (query, document) pair and assigns a relevance score, reordering the candidates for precision. [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) (BAAI, 568M params, 8192-token context, MIT license) is the canonical open-weight reranker.
The accuracy gain is substantial. [BGE-M3](/models/bge-m3) first-stage dense search NDCG@10 = 65.8. Adding BGE Reranker V2 M3 on top-100 raises NDCG@10 to 71.4 — an 8.5% relative gain, moving retrieval from "good enough for casual search" to "production-grade for legal/medical/financial retrieval." The reranker catches false positives the embedder's cosine similarity misranks — documents topically adjacent but irrelevant to the query.
When reranking matters: (1) precision-sensitive applications (legal — missing a case is malpractice risk; medical — missing contraindication study is liability), (2) high-document-count retrieval (1M+ documents where cosine similarity clusters thousands around common topics), (3) complex queries where embedders struggle with multi-constraint semantics. When reranking doesn't help: simple keyword queries where first-stage already returns rank-1, corpora under 1,000 documents, or latency-critical applications where reranker's 10–30ms per candidate is too slow.
The reranker-embedder relationship: a reranker trained on different data than the embedder can disagree in ways that degrade retrieval. BGE Reranker V2 M3 is trained to complement [BGE-M3](/models/bge-m3) specifically — using them together is the designed path. Mixing OpenAI embeddings with BGE Reranker works but creates edge cases where the reranker disagrees with first-stage results inconsistently.
If you just want to try this
Lowest-friction path to a working setup.
Deploy [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) with one Docker command via [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference):
```bash
docker run -p 8080:80 --gpus all -e MODEL_ID=BAAI/bge-reranker-v2-m3 ghcr.io/huggingface/text-embeddings-inference:latest
```
Send query + candidates to `/rerank`:
```bash
curl http://localhost:8080/rerank -X POST -H "Content-Type: application/json" -d '{"query": "What is the warranty period?", "texts": ["2-year warranty covers...", "Shipping takes 3-5 days...", "Returns within 30 days..."]}'
```
Returns scored, sorted results:
```json
[{"index": 0, "score": 0.92}, {"index": 2, "score": 0.45}, {"index": 1, "score": 0.12}]
```
Hardware: 568M params (~1.1 GB VRAM FP16). Any GPU with 4 GB+ VRAM ([RTX 3060 12GB](/hardware/rtx-3060-12gb), [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb)). On CPU: 5–15 pairs/sec — viable for low-volume use with <50 candidates per query.
To wire reranking into a Python RAG retrieval pipeline:
```python
import httpx

async def rerank(query: str, docs: list[str]) -> list[dict]:
    """POST the query plus candidate texts to the TEI /rerank endpoint."""
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8080/rerank",
                                 json={"query": query, "texts": docs})
        resp.raise_for_status()
        return resp.json()

# After first-stage vector search; vector_db and query_embed come from your own setup,
# and the await below runs inside your async request handler.
candidates = vector_db.search(query_embed, top_k=100)
reranked = await rerank(query, [c.text for c in candidates])
# Map scored indices back to the original candidates, best first
top5 = [candidates[r["index"]]
        for r in sorted(reranked, key=lambda x: x["score"], reverse=True)[:5]]
```
The reranker adds ~10–30ms per document. Top-100 candidates: 1–3 seconds on CPU or 100–300ms on GPU (batch inference). Acceptable when quality improvement justifies latency.
For production deployment
Operator-grade recommendation.
Production two-stage retrieval: fast first-stage retriever + reranker for precision, with a latency budget constraining reranking depth.
**Pipeline architecture.** Stage 1: embed query via [BGE-M3](/models/bge-m3) on TEI (5–20ms) → HNSW vector search for top-100 (1–10ms). Stage 2: rerank top-100 via [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) on TEI (100–300ms GPU batch=100). Total: 150–400ms. NDCG@10 improves from 65.8 to 71.4 — 8.5% gain.
**Latency budget.** Interactive (<200ms total): rerank top-20 (20 × 10ms = 200ms). Batch (<2s): rerank top-200. Index-time enrichment (no constraint): rerank top-1,000 per document, store reranked order. Cap reranking at 50 for interactive, 200 for batch unless benchmarks show quality improvement beyond those depths.
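That budget arithmetic can live in a small helper. A minimal sketch, assuming the conservative 10ms-per-pair sequential cost from above (GPU batching allows much deeper cuts within the same budget):
```python
def max_rerank_depth(rerank_budget_ms: float, per_pair_ms: float = 10.0, cap: int = 50) -> int:
    """How many candidates fit in the reranking slice of the latency budget,
    hard-capped per the guidance above (50 interactive, 200 batch)."""
    return max(1, min(cap, int(rerank_budget_ms // per_pair_ms)))

depth = max_rerank_depth(200)  # interactive budget: rerank top-20
```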
**Batch reranking.** A single TEI `/rerank` request carries one query plus all of its candidate texts, and TEI scores those pairs in one batched GPU pass; concurrent requests are queued and batched server-side as well. Keep per-query requests concurrent rather than sequential: 10 queries × 100 candidates = 1,000 pairs processed in ~200–400ms on [RTX 4090](/hardware/rtx-4090), an effective throughput of 2,500–5,000 pairs/sec.
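A sketch of keeping per-query requests concurrent with asyncio and httpx; the endpoint URL and timeout are assumptions for a local TEI instance:
```python
import asyncio
import httpx

TEI_RERANK_URL = "http://localhost:8080/rerank"  # assumed local TEI endpoint

async def rerank_one(client: httpx.AsyncClient, query: str, texts: list[str]) -> list[dict]:
    resp = await client.post(TEI_RERANK_URL, json={"query": query, "texts": texts})
    resp.raise_for_status()
    return resp.json()

async def rerank_many(batches: list[tuple[str, list[str]]]) -> list[list[dict]]:
    """One /rerank request per query, fired concurrently; TEI batches the GPU work."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        return await asyncio.gather(*(rerank_one(client, q, texts) for q, texts in batches))

# results = asyncio.run(rerank_many([("warranty period?", docs_a), ("return policy?", docs_b)]))
```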
**Score calibration.** Reranker scores are not calibrated probabilities — 0.8 doesn't mean "80% chance of relevance." Scores are relative within a batch. For applications needing consistent thresholds, calibrate against labeled dataset: run reranker on labeled query-document corpus, map raw scores to precision-at-k curves, set per-application thresholds based on observed precision.
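One way to run that calibration; a sketch that picks the lowest score cutoff still meeting a target precision on a labeled evaluation set (the `(score, is_relevant)` input format is an assumption):
```python
def threshold_for_precision(scored_labels: list[tuple[float, bool]],
                            target_precision: float = 0.9) -> float:
    """Lowest reranker score cutoff whose precision above the cutoff still meets
    the target, measured on labeled (score, is_relevant) pairs from your corpus."""
    ranked = sorted(scored_labels, key=lambda pair: pair[0], reverse=True)
    threshold = float("inf")  # nothing passes if the target is never reached
    relevant = 0
    for rank, (score, is_relevant) in enumerate(ranked, start=1):
        relevant += is_relevant
        if relevant / rank >= target_precision:
            threshold = score  # deepest cutoff so far that keeps precision on target
    return threshold
```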
**Two-tier reranking for scale (100M+ documents).** Stage 1: dense embedding → top-1,000. Stage 2: lightweight lexical rescoring (BM25, or [BGE-M3](/models/bge-m3) sparse embeddings) → top-100. Stage 3: cross-encoder reranker → top-10. NDCG@10 of ~73 (1.6 points above two-stage) for high-scale retrieval.
**When NOT to rerank.** Skip when: corpus under 10,000 docs (first-stage precise enough), queries are keyword/exact-match, latency budget under 50ms, or first-stage quality meets needs (general FAQ, internal wiki). Measure first-stage NDCG before engineering a reranking stage.
What breaks
Failure modes operators see in the wild.
**Reranker latency bottleneck.** Symptom: adding reranker increases retrieval from 20ms to 500ms+. Cause: reranking 200 candidates synchronously one at a time (200 × 10ms = 2,000ms). Reranker's cross-encoder processes each pair through full transformer — no benefit from pre-computed embeddings. Mitigation: use TEI batch inference — send all candidates in one request, processed in parallel on GPU. Caps 200-document reranking at 100–300ms. Cap candidate count at 50 for interactive. For extreme latency-sensitivity, lightweight bi-encoder reranker (DistilBERT-based) at 1–2ms per document — 80% quality for 20% latency.
**Score calibration drift.** Symptom: same query-document pair scores 0.85 today, 0.72 tomorrow after model update or candidate pool change. Cause: reranker scores are relative to batch — adding highly-relevant documents pushes down moderate scores. Mitigation: never rely on absolute scores for thresholds. Use rank ordering (top-k). If thresholds required, calibrate against fixed reference set scored periodically.
**Cross-encoder context window truncation.** Symptom: reranker assigns low scores to clearly relevant documents because key passage truncated at 8192-token limit. Cause: query + document combined exceed 8192 — document truncated from end. Mitigation: truncate documents before sending, not after. For long documents, chunk into 3,000-token segments, rerank each, use max segment score as document score. "Max-pooling over chunks" preserves ability to find relevant passages anywhere.
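A sketch of max-pooling over chunks against a local TEI endpoint; whitespace-based chunking is a rough stand-in for tokenizer-based splitting, with `chunk_words` approximating the 3,000-token segments mentioned above:
```python
import httpx

TEI_RERANK_URL = "http://localhost:8080/rerank"  # assumed local TEI endpoint

def rerank_long_document(query: str, document: str, chunk_words: int = 2000) -> float:
    """Split a long document into chunks, rerank every chunk against the query,
    and take the best chunk score as the document score (max-pooling over chunks)."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)] or [document]
    resp = httpx.post(TEI_RERANK_URL, json={"query": query, "texts": chunks}, timeout=30.0)
    resp.raise_for_status()
    return max(item["score"] for item in resp.json())
```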
**Reranker-embedder model mismatch.** Symptom: reranker and embedder disagree — embedder ranks A above B, reranker flips, final quality degrades. Cause: embedders optimize semantic similarity; rerankers optimize passage relevance — related but distinct objectives. Mitigation: use matched pairs — BGE-M3 + BGE Reranker V2 M3 is the designed pair. If different embedder, evaluate agreement rate (reranker top-10 in embedder top-50). Below 60% indicates mismatch.
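A quick way to measure that agreement rate; document IDs stand in for whatever keys your store uses, and the threshold is the ~60% figure from above:
```python
def agreement_rate(embedder_ranking: list[str], reranker_ranking: list[str]) -> float:
    """Fraction of the reranker's top-10 that also appears in the embedder's top-50.
    Average over a sample of queries; values below ~0.6 suggest a model mismatch."""
    embedder_top50 = set(embedder_ranking[:50])
    reranker_top10 = reranker_ranking[:10]
    if not reranker_top10:
        return 0.0
    return sum(doc_id in embedder_top50 for doc_id in reranker_top10) / len(reranker_top10)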
**Reranking irrelevant candidates.** Symptom: reranker assigns moderate scores (0.6–0.8) to completely irrelevant documents — cross-encoder optimized for relative ordering within batch, not absolute relevance detection. If top-100 are all irrelevant, reranker still assigns "best" to least-worst — cannot detect all candidates wrong. Mitigation: minimum first-stage score threshold — if BGE-M3 cosine similarity below 0.4 (1024-dim), document is too distant regardless of reranker output.
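A minimal guard for that mitigation; the 0.4 floor comes from the text above and the candidate attribute name is an assumption:
```python
MIN_FIRST_STAGE_COSINE = 0.4  # floor from the guidance above; tune against your own corpus

def candidates_worth_reranking(candidates):
    """Drop first-stage hits whose cosine similarity is already too low to trust, so the
    reranker never promotes a 'least-worst' but still irrelevant document.
    Assumes each candidate exposes .score as its BGE-M3 cosine similarity."""
    return [c for c in candidates if c.score >= MIN_FIRST_STAGE_COSINE]
```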
Hardware guidance
Reranker hardware requirements are the lowest after embeddings. BGE Reranker V2 M3 (568M params, ~1.1 GB FP16) runs on CPU at production latency for low volume.
**CPU-only ($0).** Throughput on modern desktop: 5–15 pairs/sec sequential, 20–40 pairs/sec with TEI batching. For 10 queries/hour with top-50 reranking each, CPU is fine. 100 queries/hour with top-50: ~20% CPU utilization. CPU viable for batch/preprocessing with flexible latency. [Apple M4 Pro](/hardware/apple-m4-pro) via CoreML: 8–12 pairs/sec.
**Entry GPU ($300–600).** [RTX 3060 12GB](/hardware/rtx-3060-12gb): 50–100 pairs/sec at batch=32. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 80–150 pairs/sec. Top-50 reranking: 0.3–1 second per query on $300 GPU — acceptable for interactive search.
**SMB tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090): 200–400 pairs/sec — top-100 in 250–500ms. [RTX 5080](/hardware/rtx-5080): 150–300 pairs/sec. At this tier, reranking is imperceptible — latency dominated by network, not inference.
**Enterprise ($8,000+).** Overkill — 1.1 GB VRAM model on enterprise GPU leaves 95%+ VRAM idle. [L40S](/hardware/nvidia-l40s) at 48 GB: ~500 pairs/sec at 2.3% VRAM utilization. Co-deploy the reranker on the same GPU as the embedding or generation model — its tiny footprint coexists without contention.
**Co-deployment.** Typical RAG server: BGE-M3 (1.1 GB) + BGE Reranker (1.1 GB) + 7B generation (4–8 GB) on single [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) — 6–10 GB used, 6–10 GB headroom. For 70B RAG, generation dominates VRAM but embedder + reranker add negligible 2.2 GB. Reranking infrastructure cost is near-zero when co-deployed.
**Latency scaling.** BGE Reranker V2 M3 per-pair at batch=1: CPU 50–200ms, GPU 5–15ms. At batch=50: CPU sequential 2.5–10 seconds; GPU parallel 50–150ms. GPU advantage is nonlinear — single GPU handles 50–100× more throughput than single CPU core.
Runtime guidance
**Text Embeddings Inference (TEI) with reranker support — the only production reranker serving path.**
[Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) serves both embeddings and reranking from one Docker container by switching MODEL_ID. For reranking: `MODEL_ID=BAAI/bge-reranker-v2-m3`. The `/rerank` endpoint accepts `{"query": "...", "texts": [...]}` and returns scored results. Batch inference processes all pairs in parallel — 100 candidates in one request vs 100 sequential requests = 2–3× throughput.
TEI is the only production-grade open-weight reranker serving option. No Ollama support for reranking. llama.cpp does not serve rerankers. sentence-transformers can load the model (`CrossEncoder('BAAI/bge-reranker-v2-m3')`) for programmatic use but provides no serving layer — you build the HTTP API yourself.
**Vector database integration.** [Qdrant](/tools/qdrant) supports native reranking — pass reranker endpoint URL in config, Qdrant automatically reranks vector search results. Simplest path: deploy Qdrant for search, point at TEI reranker, Qdrant handles two-stage retrieval internally. [pgvector](/tools/pgvector): no native reranker integration. Implement in app code: Postgres `SELECT ... ORDER BY embedding <=> $1 LIMIT 100` → TEI `/rerank` → reorder. Straightforward but requires app-layer orchestration. [Weaviate](/tools/weaviate): reranker modules via module system — configure `reranker-transformers` pointing at TEI in config YAML. Exposes reranked search via GraphQL parameter.
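A sketch of the pgvector path described above, using psycopg 3 and httpx; the table name, column names, and TEI URL are assumptions:
```python
import httpx
import psycopg  # psycopg 3; the pgvector extension is assumed installed

def search_and_rerank(conn: psycopg.Connection, query: str,
                      query_embedding: list[float], top_k: int = 5):
    """First stage in Postgres (cosine-distance ORDER BY), second stage via TEI /rerank."""
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, body FROM documents ORDER BY embedding <=> %s::vector LIMIT 100",
            (vec_literal,),
        )
        rows = cur.fetchall()
    scores = httpx.post(
        "http://localhost:8080/rerank",
        json={"query": query, "texts": [body for _, body in rows]},
        timeout=30.0,
    ).json()
    best = sorted(scores, key=lambda s: s["score"], reverse=True)[:top_k]
    return [(rows[s["index"]][0], rows[s["index"]][1], s["score"]) for s in best]
```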
**Decision tree.** Simplest production: TEI (reranker) + Qdrant (native integration) — one API call for search + rerank. Infrastructure-minimal: TEI + pgvector — app code orchestrates two-stage pipeline. Multi-tenant hybrid: TEI + Weaviate — native multi-tenancy + hybrid search + reranking in one GraphQL query.
**When to add reranking.** Measure first-stage NDCG@10. If quality meets requirements, reranking adds unnecessary complexity. If it falls 5%+ below target, add reranking and measure the improvement — typical gain is 5–15% on NDCG@10. Reranking cannot lose documents (it reorders rather than removes), though a mismatched reranker can still hurt ordering (see above); the real costs are latency and infrastructure. Deploy when the quality gain justifies the latency budget and the reranker can co-deploy on existing GPU infrastructure at near-zero marginal hardware cost.
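If you need that NDCG@10 number before deciding, a minimal linear-gain implementation; it assumes you have graded relevance labels for the returned documents in retrieval order:
```python
import math

def ndcg_at_10(relevance_in_rank_order: list[int]) -> float:
    """NDCG@10 with linear gains: DCG of the first 10 results divided by the DCG
    of the ideal ordering of the same relevance labels."""
    def dcg(rels: list[int]) -> float:
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:10]))
    ideal = dcg(sorted(relevance_in_rank_order, reverse=True))
    return dcg(relevance_in_rank_order) / ideal if ideal > 0 else 0.0
```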