Capability notes
Reranking is the second stage of two-stage retrieval: a fast first-stage retriever (embedding + vector search) returns 50–200 candidates with high recall but moderate precision, then a reranker cross-encodes each (query, document) pair and assigns a relevance score, reordering the candidates for precision. [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) (BAAI, 568M params, 8192-token context, MIT license) is the canonical open-weight reranker.
The accuracy gain is substantial. [BGE-M3](/models/bge-m3) first-stage dense search NDCG@10 = 65.8. Adding BGE Reranker V2 M3 on top-100 raises NDCG@10 to 71.4 — an 8.5% relative gain, moving retrieval from "good enough for casual search" to "production-grade for legal/medical/financial retrieval." The reranker catches false positives the embedder's cosine similarity misranks — documents topically adjacent but irrelevant to the query.
When reranking matters: (1) precision-sensitive applications (legal — missing a case is malpractice risk; medical — missing contraindication study is liability), (2) high-document-count retrieval (1M+ documents where cosine similarity clusters thousands around common topics), (3) complex queries where embedders struggle with multi-constraint semantics. When reranking doesn't help: simple keyword queries where first-stage already returns rank-1, corpora under 1,000 documents, or latency-critical applications where reranker's 10–30ms per candidate is too slow.
The reranker-embedder relationship: a reranker trained on different data than the embedder can disagree in ways that degrade retrieval. BGE Reranker V2 M3 is trained to complement [BGE-M3](/models/bge-m3) specifically — using them together is the designed path. Mixing OpenAI embeddings with BGE Reranker works but creates edge cases where the reranker disagrees with first-stage results inconsistently.
If you just want to try this
Lowest-friction path to a working setup.
Deploy [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) with one Docker command via [Text Embeddings Inference (TEI)](/tools/text-embeddings-inference):
```bash
docker run -p 8080:80 --gpus all -e MODEL_ID=BAAI/bge-reranker-v2-m3 ghcr.io/huggingface/text-embeddings-inference:latest
```
Send query + candidates to `/rerank`:
```bash
curl http://localhost:8080/rerank -X POST -H "Content-Type: application/json" -d '{"query": "What is the warranty period?", "texts": ["2-year warranty covers...", "Shipping takes 3-5 days...", "Returns within 30 days..."]}'
```
Returns scored, sorted results:
```json
[{"index": 0, "score": 0.92}, {"index": 2, "score": 0.45}, {"index": 1, "score": 0.12}]
```
Hardware: 568M params (~1.1 GB VRAM FP16). Any GPU with 4 GB+ VRAM ([RTX 3060 12GB](/hardware/rtx-3060-12gb), [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb)). On CPU: 5–15 pairs/sec — viable for low-volume use with <50 candidates per query.
To wire reranking into a Python RAG retrieval pipeline:
```python
import httpx

async def rerank(query: str, docs: list[str]) -> list[dict]:
    """POST the query plus candidate texts to the TEI /rerank endpoint."""
    async with httpx.AsyncClient() as client:
        resp = await client.post("http://localhost:8080/rerank",
                                 json={"query": query, "texts": docs})
        resp.raise_for_status()
        return resp.json()

# After first-stage vector search; vector_db and query_embed come from your own setup,
# and the await below runs inside your async request handler.
candidates = vector_db.search(query_embed, top_k=100)
reranked = await rerank(query, [c.text for c in candidates])
# Map scored indices back to the original candidates, best first
top5 = [candidates[r["index"]]
        for r in sorted(reranked, key=lambda x: x["score"], reverse=True)[:5]]
```
The reranker adds ~10–30ms per document. Top-100 candidates: 1–3 seconds on CPU or 100–300ms on GPU (batch inference). Acceptable when quality improvement justifies latency.
For production deployment
Operator-grade recommendation.
Production two-stage retrieval: fast first-stage retriever + reranker for precision, with a latency budget constraining reranking depth.
**Pipeline architecture.** Stage 1: embed query via [BGE-M3](/models/bge-m3) on TEI (5–20ms) → HNSW vector search for top-100 (1–10ms). Stage 2: rerank top-100 via [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) on TEI (100–300ms GPU batch=100). Total: 150–400ms. NDCG@10 improves from 65.8 to 71.4 — 8.5% gain.
**Latency budget.** Interactive (<200ms total): rerank top-20 (20 × 10ms = 200ms). Batch (<2s): rerank top-200. Index-time enrichment (no constraint): rerank top-1,000 per document, store reranked order. Cap reranking at 50 for interactive, 200 for batch unless benchmarks show quality improvement beyond those depths.
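That budget arithmetic can live in a small helper. A minimal sketch, assuming the conservative 10ms-per-pair sequential cost from above (GPU batching allows much deeper cuts within the same budget):
```python
def max_rerank_depth(rerank_budget_ms: float, per_pair_ms: float = 10.0, cap: int = 50) -> int:
    """How many candidates fit in the reranking slice of the latency budget,
    hard-capped per the guidance above (50 interactive, 200 batch)."""
    return max(1, min(cap, int(rerank_budget_ms // per_pair_ms)))

depth = max_rerank_depth(200)  # interactive budget: rerank top-20
```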
**Batch reranking.** A single TEI `/rerank` request carries one query plus all of its candidate texts, and TEI scores those pairs in one batched GPU pass; concurrent requests are queued and batched server-side as well. Keep per-query requests concurrent rather than sequential: 10 queries × 100 candidates = 1,000 pairs processed in ~200–400ms on [RTX 4090](/hardware/rtx-4090), an effective throughput of 2,500–5,000 pairs/sec.
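A sketch of keeping per-query requests concurrent with asyncio and httpx; the endpoint URL and timeout are assumptions for a local TEI instance:
```python
import asyncio
import httpx

TEI_RERANK_URL = "http://localhost:8080/rerank"  # assumed local TEI endpoint

async def rerank_one(client: httpx.AsyncClient, query: str, texts: list[str]) -> list[dict]:
    resp = await client.post(TEI_RERANK_URL, json={"query": query, "texts": texts})
    resp.raise_for_status()
    return resp.json()

async def rerank_many(batches: list[tuple[str, list[str]]]) -> list[list[dict]]:
    """One /rerank request per query, fired concurrently; TEI batches the GPU work."""
    async with httpx.AsyncClient(timeout=30.0) as client:
        return await asyncio.gather(*(rerank_one(client, q, texts) for q, texts in batches))

# results = asyncio.run(rerank_many([("warranty period?", docs_a), ("return policy?", docs_b)]))
```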
**Score calibration.** Reranker scores are not calibrated probabilities — 0.8 doesn't mean "80% chance of relevance." Scores are relative within a batch. For applications needing consistent thresholds, calibrate against labeled dataset: run reranker on labeled query-document corpus, map raw scores to precision-at-k curves, set per-application thresholds based on observed precision.
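One way to run that calibration; a sketch that picks the lowest score cutoff still meeting a target precision on a labeled evaluation set (the `(score, is_relevant)` input format is an assumption):
```python
def threshold_for_precision(scored_labels: list[tuple[float, bool]],
                            target_precision: float = 0.9) -> float:
    """Lowest reranker score cutoff whose precision above the cutoff still meets
    the target, measured on labeled (score, is_relevant) pairs from your corpus."""
    ranked = sorted(scored_labels, key=lambda pair: pair[0], reverse=True)
    threshold = float("inf")  # nothing passes if the target is never reached
    relevant = 0
    for rank, (score, is_relevant) in enumerate(ranked, start=1):
        relevant += is_relevant
        if relevant / rank >= target_precision:
            threshold = score  # deepest cutoff so far that keeps precision on target
    return threshold
```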
**Two-tier reranking for scale (100M+ documents).** Stage 1: dense embedding → top-1,000. Stage 2: lightweight lexical rescoring (BM25, or [BGE-M3](/models/bge-m3) sparse embeddings) → top-100. Stage 3: cross-encoder reranker → top-10. NDCG@10 of ~73 (1.6 points above two-stage) for high-scale retrieval.
**When NOT to rerank.** Skip when: corpus under 10,000 docs (first-stage precise enough), queries are keyword/exact-match, latency budget under 50ms, or first-stage quality meets needs (general FAQ, internal wiki). Measure first-stage NDCG before engineering a reranking stage.
What breaks
Failure modes operators see in the wild.
**Reranker latency bottleneck.** Symptom: adding reranker increases retrieval from 20ms to 500ms+. Cause: reranking 200 candidates synchronously one at a time (200 × 10ms = 2,000ms). Reranker's cross-encoder processes each pair through full transformer — no benefit from pre-computed embeddings. Mitigation: use TEI batch inference — send all candidates in one request, processed in parallel on GPU. Caps 200-document reranking at 100–300ms. Cap candidate count at 50 for interactive. For extreme latency-sensitivity, lightweight bi-encoder reranker (DistilBERT-based) at 1–2ms per document — 80% quality for 20% latency.
**Score calibration drift.** Symptom: same query-document pair scores 0.85 today, 0.72 tomorrow after model update or candidate pool change. Cause: reranker scores are relative to batch — adding highly-relevant documents pushes down moderate scores. Mitigation: never rely on absolute scores for thresholds. Use rank ordering (top-k). If thresholds required, calibrate against fixed reference set scored periodically.
**Cross-encoder context window truncation.** Symptom: reranker assigns low scores to clearly relevant documents because key passage truncated at 8192-token limit. Cause: query + document combined exceed 8192 — document truncated from end. Mitigation: truncate documents before sending, not after. For long documents, chunk into 3,000-token segments, rerank each, use max segment score as document score. "Max-pooling over chunks" preserves ability to find relevant passages anywhere.
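A sketch of max-pooling over chunks against a local TEI endpoint; whitespace-based chunking is a rough stand-in for tokenizer-based splitting, with `chunk_words` approximating the 3,000-token segments mentioned above:
```python
import httpx

TEI_RERANK_URL = "http://localhost:8080/rerank"  # assumed local TEI endpoint

def rerank_long_document(query: str, document: str, chunk_words: int = 2000) -> float:
    """Split a long document into chunks, rerank every chunk against the query,
    and take the best chunk score as the document score (max-pooling over chunks)."""
    words = document.split()
    chunks = [" ".join(words[i:i + chunk_words])
              for i in range(0, len(words), chunk_words)] or [document]
    resp = httpx.post(TEI_RERANK_URL, json={"query": query, "texts": chunks}, timeout=30.0)
    resp.raise_for_status()
    return max(item["score"] for item in resp.json())
```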
**Reranker-embedder model mismatch.** Symptom: reranker and embedder disagree — embedder ranks A above B, reranker flips, final quality degrades. Cause: embedders optimize semantic similarity; rerankers optimize passage relevance — related but distinct objectives. Mitigation: use matched pairs — BGE-M3 + BGE Reranker V2 M3 is the designed pair. If different embedder, evaluate agreement rate (reranker top-10 in embedder top-50). Below 60% indicates mismatch.
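A quick way to measure that agreement rate; document IDs stand in for whatever keys your store uses, and the threshold is the ~60% figure from above:
```python
def agreement_rate(embedder_ranking: list[str], reranker_ranking: list[str]) -> float:
    """Fraction of the reranker's top-10 that also appears in the embedder's top-50.
    Average over a sample of queries; values below ~0.6 suggest a model mismatch."""
    embedder_top50 = set(embedder_ranking[:50])
    reranker_top10 = reranker_ranking[:10]
    if not reranker_top10:
        return 0.0
    return sum(doc_id in embedder_top50 for doc_id in reranker_top10) / len(reranker_top10)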
**Reranking irrelevant candidates.** Symptom: reranker assigns moderate scores (0.6–0.8) to completely irrelevant documents — cross-encoder optimized for relative ordering within batch, not absolute relevance detection. If top-100 are all irrelevant, reranker still assigns "best" to least-worst — cannot detect all candidates wrong. Mitigation: minimum first-stage score threshold — if BGE-M3 cosine similarity below 0.4 (1024-dim), document is too distant regardless of reranker output.
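A minimal guard for that mitigation; the 0.4 floor comes from the text above and the candidate attribute name is an assumption:
```python
MIN_FIRST_STAGE_COSINE = 0.4  # floor from the guidance above; tune against your own corpus

def candidates_worth_reranking(candidates):
    """Drop first-stage hits whose cosine similarity is already too low to trust, so the
    reranker never promotes a 'least-worst' but still irrelevant document.
    Assumes each candidate exposes .score as its BGE-M3 cosine similarity."""
    return [c for c in candidates if c.score >= MIN_FIRST_STAGE_COSINE]
```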
Hardware guidance
Reranker hardware requirements are the lowest after embeddings. BGE Reranker V2 M3 (568M params, ~1.1 GB FP16) runs on CPU at production latency for low volume.
**CPU-only ($0).** Throughput on modern desktop: 5–15 pairs/sec sequential, 20–40 pairs/sec with TEI batching. For 10 queries/hour with top-50 reranking each, CPU is fine. 100 queries/hour with top-50: ~20% CPU utilization. CPU viable for batch/preprocessing with flexible latency. [Apple M4 Pro](/hardware/apple-m4-pro) via CoreML: 8–12 pairs/sec.
**Entry GPU ($300–600).** [RTX 3060 12GB](/hardware/rtx-3060-12gb): 50–100 pairs/sec at batch=32. [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb): 80–150 pairs/sec. Top-50 reranking: 0.3–1 second per query on $300 GPU — acceptable for interactive search.
**SMB tier ($1,500–2,500).** [RTX 4090](/hardware/rtx-4090): 200–400 pairs/sec — top-100 in 250–500ms. [RTX 5080](/hardware/rtx-5080): 150–300 pairs/sec. At this tier, reranking is imperceptible — latency dominated by network, not inference.
**Enterprise ($8,000+).** Overkill — 1.1 GB VRAM model on enterprise GPU leaves 95%+ VRAM idle. [L40S](/hardware/nvidia-l40s) at 48 GB: ~500 pairs/sec at 2.3% VRAM utilization. Co-deploy the reranker on the same GPU as the embedding or generation model — its tiny footprint coexists without contention.
**Co-deployment.** Typical RAG server: BGE-M3 (1.1 GB) + BGE Reranker (1.1 GB) + 7B generation (4–8 GB) on single [RTX 4060 Ti 16GB](/hardware/rtx-4060-ti-16gb) — 6–10 GB used, 6–10 GB headroom. For 70B RAG, generation dominates VRAM but embedder + reranker add negligible 2.2 GB. Reranking infrastructure cost is near-zero when co-deployed.
**Latency scaling.** BGE Reranker V2 M3 per-pair at batch=1: CPU 50–200ms, GPU 5–15ms. At batch=50: CPU sequential 2.5–10 seconds; GPU parallel 50–150ms. GPU advantage is nonlinear — single GPU handles 50–100× more throughput than single CPU core.
Runtime guidance
**Text Embeddings Inference (TEI) with reranker support — the only production reranker serving path.**
[Text Embeddings Inference (TEI)](/tools/text-embeddings-inference) serves both embeddings and reranking from one Docker container by switching MODEL_ID. For reranking: `MODEL_ID=BAAI/bge-reranker-v2-m3`. The `/rerank` endpoint accepts `{"query": "...", "texts": [...]}` and returns scored results. Batch inference processes all pairs in parallel — 100 candidates in one request vs 100 sequential requests = 2–3× throughput.
TEI is the only production-grade open-weight reranker serving option. No Ollama support for reranking. llama.cpp does not serve rerankers. sentence-transformers can load the model (`CrossEncoder('BAAI/bge-reranker-v2-m3')`) for programmatic use but provides no serving layer — you build the HTTP API yourself.
**Vector database integration.** [Qdrant](/tools/qdrant) supports native reranking — pass reranker endpoint URL in config, Qdrant automatically reranks vector search results. Simplest path: deploy Qdrant for search, point at TEI reranker, Qdrant handles two-stage retrieval internally. [pgvector](/tools/pgvector): no native reranker integration. Implement in app code: Postgres `SELECT ... ORDER BY embedding <=> $1 LIMIT 100` → TEI `/rerank` → reorder. Straightforward but requires app-layer orchestration. [Weaviate](/tools/weaviate): reranker modules via module system — configure `reranker-transformers` pointing at TEI in config YAML. Exposes reranked search via GraphQL parameter.
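A sketch of the pgvector path described above, using psycopg 3 and httpx; the table name, column names, and TEI URL are assumptions:
```python
import httpx
import psycopg  # psycopg 3; the pgvector extension is assumed installed

def search_and_rerank(conn: psycopg.Connection, query: str,
                      query_embedding: list[float], top_k: int = 5):
    """First stage in Postgres (cosine-distance ORDER BY), second stage via TEI /rerank."""
    vec_literal = "[" + ",".join(str(x) for x in query_embedding) + "]"
    with conn.cursor() as cur:
        cur.execute(
            "SELECT id, body FROM documents ORDER BY embedding <=> %s::vector LIMIT 100",
            (vec_literal,),
        )
        rows = cur.fetchall()
    scores = httpx.post(
        "http://localhost:8080/rerank",
        json={"query": query, "texts": [body for _, body in rows]},
        timeout=30.0,
    ).json()
    best = sorted(scores, key=lambda s: s["score"], reverse=True)[:top_k]
    return [(rows[s["index"]][0], rows[s["index"]][1], s["score"]) for s in best]
```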
**Decision tree.** Simplest production: TEI (reranker) + Qdrant (native integration) — one API call for search + rerank. Infrastructure-minimal: TEI + pgvector — app code orchestrates two-stage pipeline. Multi-tenant hybrid: TEI + Weaviate — native multi-tenancy + hybrid search + reranking in one GraphQL query.
**When to add reranking.** Measure first-stage NDCG@10. If quality meets requirements, reranking adds unnecessary complexity. If it falls 5%+ below target, add reranking and measure the improvement — typical gain is 5–15% on NDCG@10. Reranking cannot lose documents (it reorders rather than removes), though a mismatched reranker can still hurt ordering (see above); the real costs are latency and infrastructure. Deploy when the quality gain justifies the latency budget and the reranker can co-deploy on existing GPU infrastructure at near-zero marginal hardware cost.
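If you need that NDCG@10 number before deciding, a minimal linear-gain implementation; it assumes you have graded relevance labels for the returned documents in retrieval order:
```python
import math

def ndcg_at_10(relevance_in_rank_order: list[int]) -> float:
    """NDCG@10 with linear gains: DCG of the first 10 results divided by the DCG
    of the ideal ordering of the same relevance labels."""
    def dcg(rels: list[int]) -> float:
        return sum(rel / math.log2(rank + 2) for rank, rel in enumerate(rels[:10]))
    ideal = dcg(sorted(relevance_in_rank_order, reverse=True))
    return dcg(relevance_in_rank_order) / ideal if ideal > 0 else 0.0
```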