Capability notes
Enterprise search indexes internal data sources — Confluence, Slack, Google Drive, SharePoint, GitHub, Notion, Jira — into a unified index that answers natural-language queries across organizational knowledge. The core challenge is not retrieval quality (largely solved with [BGE-M3](/models/bge-m3)) but **connector coverage** (indexing 10+ heterogeneous sources with different APIs, rate limits, and formats) and **permission-aware retrieval** (employees only see results from documents they're authorized to access).
**Hybrid retrieval** combines dense embeddings (semantic similarity via [BGE-M3](/models/bge-m3)) with sparse lexical retrieval (BM25 or SPLADE) to handle both conversational queries ("how do I request PTO") and precise queries ("INC-2024-0842 resolution notes"). For enterprise queries spanning that range, hybrid retrieval improves recall by 20-40% over dense-only or sparse-only.
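A minimal sketch of the merge step, assuming dense and lexical retrieval each return a ranked list of chunk IDs; Reciprocal Rank Fusion is one common way to fuse them (the `k=60` constant is a conventional default, not something mandated by this stack).

```python
# Reciprocal Rank Fusion: combine ranked lists from dense and sparse retrieval.
# Each document's fused score is the sum over lists of 1 / (k + rank).
def rrf_merge(ranked_lists: list[list[str]], k: int = 60, top_n: int = 20) -> list[str]:
    scores: dict[str, float] = {}
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]

# Usage: chunk IDs ranked by BGE-M3 similarity and by BM25, respectively.
dense_hits = ["doc-42", "doc-7", "doc-13"]
sparse_hits = ["doc-7", "doc-99", "doc-42"]
print(rrf_merge([dense_hits, sparse_hits]))  # doc-7 and doc-42 float to the top
```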
**Connector coverage**: [AnythingLLM](https://anythingllm.com) supports ~20 connectors (Confluence, Slack, Google Drive, GitHub, Notion, YouTube, web scraping). For unsupported sources: custom connector authenticating to source API, polling for changes, extracting text, chunking, embedding, upserting into vector store. Writing a production connector takes 1-3 days per source — API integration is simple; the hard parts are incremental sync, rate limit handling, and permission mapping.
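A rough sketch of what one incremental sync pass looks like, with the hard parts (sync cursor, permission extraction, rate-limit handling) made explicit; `source_client`, `index`, `chunk_text`, `embed`, and `RateLimitError` are hypothetical stand-ins, not APIs of any tool named above.

```python
import time

def sync_source(source_client, index, state):
    """One incremental sync pass. source_client, index, and state are
    hypothetical interfaces standing in for a real source API, a vector
    store, and a persisted sync cursor."""
    cursor = state.get("last_sync", 0)
    for doc in source_client.list_changed_since(cursor):      # incremental, not a full crawl
        try:
            text = source_client.fetch_text(doc["id"])
            acl = source_client.fetch_permissions(doc["id"])   # pull permissions alongside content
        except RateLimitError as e:                            # hypothetical exception type
            time.sleep(e.retry_after)
            continue
        for i, chunk in enumerate(chunk_text(text, max_tokens=1000, overlap=0.15)):
            index.upsert(
                id=f'{doc["id"]}:{i}',
                vector=embed(chunk),                           # e.g. BGE-M3
                payload={"source": "confluence", "url": doc["url"],
                         "acl": acl, "last_modified": doc["modified"]},
            )
    state["last_sync"] = time.time()
```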
**Access-aware search**: Different employees have access to different documents — HR sees personnel files, engineering sees repos, execs see financial reports. Naive search that indexes everything and ranks purely by relevance leaks restricted content. The index must store document-level ACLs and filter results per user based on ACL membership. This requires connectors pulling permissions alongside content, the index storing permissions per chunk, and query-time permission filtering.
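A minimal illustration of the query-time contract, assuming each chunk carries the set of groups allowed to read it; production stores push this filter into the engine itself (see the Qdrant example in the production section below).

```python
from dataclasses import dataclass, field

@dataclass
class Chunk:
    text: str
    score: float
    acl: set[str] = field(default_factory=set)  # groups allowed to read this chunk

def visible_to(user_groups: set[str], hits: list[Chunk]) -> list[Chunk]:
    # Drop any chunk whose ACL does not intersect the caller's groups,
    # before anything reaches the LLM or the UI.
    return [h for h in hits if h.acl & user_groups]

hits = [Chunk("PTO policy...", 0.91, {"all-employees"}),
        Chunk("Salary bands...", 0.88, {"hr-only"})]
print(visible_to({"all-employees", "engineering"}, hits))  # salary chunk is filtered out
```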
**Open-source vs commercial**: Open-source (custom LangChain/LlamaIndex + [Qdrant](/tools/qdrant) or [pgvector](/tools/pgvector) + [BGE-M3](/models/bge-m3) + [Llama 3.3 70B](/models/llama-3-3-70b)) provides full control, zero per-seat pricing, and auditability — at 2-6 months of engineering cost. Commercial (Glean, Algolia, Elastic Workplace Search) provides turnkey deployment with 100+ connectors, built-in permission filtering, and polished UX — at $10-50/user/month. Under 500 employees, commercial is cheaper. Over 5,000, self-hosted amortizes the engineering over a larger base.
If you just want to try this
Lowest-friction path to a working setup.
Download [AnythingLLM](https://anythingllm.com) — the simplest multi-source local search tool. Desktop app (Windows/Mac/Linux) bundles local vector database (LanceDB), embedding model (BGE-M3 compatible), and LLM integration into a single application. 5-10 minutes from download to first search.
Add data sources: AnythingLLM supports Confluence (API token), Slack (OAuth), GitHub repos (PAT), Google Drive (OAuth), Notion (API), and web scraping (URLs). For each source, authenticate and select which spaces/channels/folders to index. Sync runs in the background — 5,000 documents take 10-30 minutes on a modern laptop. Embeddings compute locally on CPU (no GPU required — [BGE-M3](/models/bge-m3) runs at 50-150 docs/sec on CPU).
For the LLM backend: [Ollama](/tools/ollama) with [Llama 3.3 70B](/models/llama-3-3-70b) Q4 if you have ~40 GB combined memory, or [Qwen 3 32B](/models/qwen-3-32b) Q4 (~20 GB). The LLM answers questions using retrieved document chunks as context. Without a GPU, configure AnythingLLM to use a cloud LLM API (OpenAI/Anthropic) while keeping document indexing and storage local — embeddings stay local (privacy-safe), LLM inference goes to the cloud (API cost).
First searches feel underwhelming. The index needs tuning — adjust chunk sizes (500-2000 tokens), overlap (10-20%), chunks retrieved per query (5-20). Start with defaults. If answers miss relevant information: increase chunks retrieved. If answers are unfocused: decrease chunks. Budget 1-2 hours tuning for your specific document corpus.
For production deployment
Operator-grade recommendation.
Production enterprise search for 10,000+ employees across 10+ data sources requires a custom pipeline — AnythingLLM isn't designed for this scale (single-machine, no permission filtering, limited customization). Four components: connectors, indexing pipeline, retrieval engine, search UI.
**Connector architecture**: Plugin system — each data source gets a connector class: `authenticate()`, `list_documents(since_timestamp)`, `get_document(id)`, `get_permissions(id)`, `subscribe_to_changes(webhook)`. Scheduled via Airflow or Temporal with retry logic. Each connector writes documents + permissions to Kafka/Redis Streams.
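A sketch of that plugin contract in Python, using the method names listed above; the type hints and docstrings are illustrative rather than taken from a specific framework.

```python
from abc import ABC, abstractmethod
from typing import Iterator

class Connector(ABC):
    """Contract every data-source plugin implements. The orchestrator
    (Airflow/Temporal) calls these; it never talks to source APIs directly."""

    @abstractmethod
    def authenticate(self) -> None:
        """Obtain/refresh credentials for the source API."""

    @abstractmethod
    def list_documents(self, since_timestamp: float) -> Iterator[str]:
        """Yield IDs of documents created or modified after the cursor."""

    @abstractmethod
    def get_document(self, id: str) -> dict:
        """Return raw content plus metadata (url, last_modified, type)."""

    @abstractmethod
    def get_permissions(self, id: str) -> list[str]:
        """Return the ACL (group/user principals) for a document."""

    @abstractmethod
    def subscribe_to_changes(self, webhook: str) -> None:
        """Register a webhook so changes arrive without waiting for the next poll."""
```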
**Indexing pipeline**: Consume from queue → chunk with type-aware strategy. Confluence: by heading section. Slack: by thread. Code repos: by function/class (tree-sitter AST). Embed with [BGE-M3](/models/bge-m3) via [Text Embeddings Inference](/tools/text-embeddings-inference). Store in vector DB with metadata: source, URL, last_modified, permissions (ACL), document_type, chunk_index. Versioned index — old chunks marked stale on update.
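A sketch of the type-aware dispatch and the per-chunk payload, assuming the source-specific splitters (`split_by_heading`, `split_by_thread`, `split_by_ast_node`, `split_by_tokens`) exist elsewhere; the field names mirror the metadata list above.

```python
def chunk_document(doc: dict) -> list[dict]:
    """Split with a type-aware strategy, then attach the metadata every chunk
    needs at query time. The split_* helpers are hypothetical stand-ins for
    heading-, thread-, and AST-based splitters."""
    splitters = {
        "confluence": split_by_heading,   # one chunk per heading section
        "slack": split_by_thread,         # one chunk per thread
        "code": split_by_ast_node,        # function/class units via tree-sitter
    }
    split = splitters.get(doc["type"], split_by_tokens)  # generic fallback
    return [
        {
            "text": chunk,
            "source": doc["source"],
            "url": doc["url"],
            "last_modified": doc["last_modified"],
            "acl": doc["acl"],
            "document_type": doc["type"],
            "chunk_index": i,
        }
        for i, chunk in enumerate(split(doc["content"]))
    ]
```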
**Retrieval engine**: Hybrid — dense in [Qdrant](/tools/qdrant) or [pgvector](/tools/pgvector) + lexical in [Elasticsearch](https://www.elastic.co/) or [OpenSearch](https://opensearch.org/). Query flow: embed → parallel dense + lexical search (top-50 each) → merge + deduplicate → rerank with [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) (top-50→top-20) → permission filter → return top-10 to LLM for synthesis.
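A condensed sketch of that query flow; the [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) call uses the `sentence-transformers` cross-encoder interface, while `embed`, `dense_search`, `lexical_search`, and `dedupe` are hypothetical wrappers around TEI, Qdrant/pgvector, and Elasticsearch/OpenSearch.

```python
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")  # cross-encoder rerank stage

def search(query: str, user_groups: set[str]) -> list[dict]:
    """Embed -> dense + lexical (run these in parallel in practice) ->
    merge/dedupe -> rerank -> permission filter -> top-10 for synthesis."""
    qvec = embed(query)
    dense = dense_search(qvec, limit=50)        # Qdrant / pgvector
    lexical = lexical_search(query, limit=50)   # Elasticsearch / OpenSearch BM25
    candidates = dedupe(dense + lexical)        # merge on chunk ID
    scores = reranker.predict([(query, c["text"]) for c in candidates])
    ranked = [c for _, c in sorted(zip(scores, candidates),
                                   key=lambda p: p[0], reverse=True)][:20]
    allowed = [c for c in ranked if set(c["acl"]) & user_groups]  # permission filter
    return allowed[:10]                         # context handed to the LLM
```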
**Permission-aware retrieval**: Pre-filter (add ACL filter clause, 5-20ms latency, zero leakage) vs post-filter (retrieve top-N by relevance, then filter by permissions, faster for users in many groups). Combine: coarse pre-filter by department-level group, fine-grained post-filter for document-level permissions.
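A sketch of the pre-filter variant against [Qdrant](/tools/qdrant), assuming each point's payload carries an `acl` list of group names; the collection name and group values are placeholders.

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchAny

client = QdrantClient(url="http://localhost:6333")

def dense_search_prefiltered(query_vector: list[float],
                             user_groups: list[str],
                             limit: int = 50):
    # The ACL condition is applied inside the ANN search itself, so chunks the
    # caller cannot read are never candidates (zero-leakage pre-filter).
    return client.search(
        collection_name="enterprise-docs",
        query_vector=query_vector,
        query_filter=Filter(
            must=[FieldCondition(key="acl", match=MatchAny(any=user_groups))]
        ),
        limit=limit,
    )
```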
**Scale to millions of documents**: 10M docs → 50-100M chunks. Vector index: 50M × 4.5 KB (embedding + metadata) ≈ 225 GB + HNSW overhead = 280-340 GB. Shard by org unit — Engineering docs in shard A, HR in shard B. Route queries to relevant shards by user org or intent detection. Sharding also simplifies permission filtering.
**Cost model** (5,000 employees, 1M docs, 10 connectors): Embedding server ([NVIDIA L4](/hardware/nvidia-l4)) $3K hardware or $0.50-1/hr cloud. LLM inference ([RTX A6000](/hardware/rtx-a6000)) $5-8K hardware. Vector DB + Elasticsearch server $5-15K hardware. Engineering: 3-6 months (2-4 engineers). Annual total (hardware amortized over 3 years): $50-150K, vs Glean at $10-50/user/month = $600K-3M/year for 5,000 users.
What breaks
Failure modes operators see in the wild.
- **Permission leakage.** Index returns document chunks the user cannot access — Slack private channel message, restricted Confluence page, HR document with salary data. Highest-severity failure — it's a data breach. Mitigation: always apply permission filter before LLM sees chunks. Regular permission audit (re-sync every 24 hours). Log which chunks were included in context, verify permissions post-hoc. Never cache search results across users.
- **Stale connector sync.** Document updated in Confluence but connector hasn't synced yet (15-minute polling cycle). User searches "latest API changes," gets last week's version. Mitigation: webhook-based change detection for real-time updates (Confluence, GitHub, Slack all support webhooks). Show "last indexed" timestamp on results. For critical documents, re-sync on every query (just-in-time).
- **Query intent mismatch.** "Q4 roadmap" in June 2026 returns Q4 2025 roadmap — search doesn't understand temporal intent. Mitigation: date-boosted ranking, query expansion ("Q4 roadmap" → "Q4 2026 roadmap"), LLM-based query understanding before retrieval.
- **Ranking decay on large indices.** At 50M+ chunks, the top candidates from vector search include moderately-relevant but not highly-relevant documents — the perfect document is buried at rank 30-50. Mitigation: multi-stage retrieval — coarse ANN retrieval, then cross-encoder reranking. Index partitioning by document type. [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) reranks top-100 to top-20; it adds 20-50ms but is worth it.
- **Connector API rate limiting.** A Confluence sync of 10,000 pages hits the rate limit (200 req/min) and the initial sync takes 6 hours instead of 30 minutes. Mitigation: respect Retry-After headers, exponential backoff with jitter (see the sketch after this list), request temporary rate limit increases for the initial sync. Prioritize recently-modified documents.
- **LLM hallucination in synthesized answers.** Retrieved correct documents but LLM adds hallucinated details — specific number, person's name, date absent from context. Mitigation: force inline citations linking to source. Display source chunks alongside answer. Run NLI verification on LLM claims against retrieved chunks before serving.
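The rate-limiting mitigation above is mostly a retry-policy problem; here is a minimal sketch of exponential backoff with jitter that honors Retry-After, where `RateLimited` is a hypothetical exception a connector raises on HTTP 429.

```python
import random
import time

class RateLimited(Exception):
    """Hypothetical exception raised on HTTP 429; carries the Retry-After value."""
    def __init__(self, retry_after: float | None = None):
        self.retry_after = retry_after

def with_backoff(call, max_retries: int = 6, base: float = 1.0, cap: float = 60.0):
    """Retry a connector API request, honoring Retry-After when the source
    sends one, otherwise sleeping base * 2^attempt with full jitter."""
    for attempt in range(max_retries):
        try:
            return call()
        except RateLimited as e:
            delay = e.retry_after if e.retry_after else random.uniform(
                0, min(cap, base * 2 ** attempt))
            time.sleep(delay)
    raise RuntimeError("rate-limited after max retries")
```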
Hardware guidance
Indexing pipeline (CPU + RAM + storage) and query-serving pipeline (GPU for LLM + vector search) scale independently.
**Hobbyist (<10K docs, personal)**: Any modern laptop 16+ GB RAM. [AnythingLLM](https://anythingllm.com) runs entirely locally — LanceDB on NVMe, [BGE-M3](/models/bge-m3) on CPU, [Ollama](/tools/ollama) with [Qwen 3 32B](/models/qwen-3-32b) on [RTX 3060 12GB](/hardware/rtx-3060-12gb). Indexing 5K docs in 10-30min. Query latency 5-15s. [MacBook Pro 16 M4 Max 64GB](/hardware/macbook-pro-16-m4-max) — best single-device, unified memory handles full stack.
**SMB (10K-100K docs, 10-50 people)**: Indexing server: 32-64 GB RAM, NVMe 2+ GB/s, 8-16 cores. [NVIDIA L4](/hardware/nvidia-l4) for [Text Embeddings Inference](/tools/text-embeddings-inference) — GPU embedding is 5-10× faster than CPU. Inference: [RTX 4090 24GB](/hardware/rtx-4090) for [Llama 3.3 70B](/models/llama-3-3-70b) Q4 (the ~40 GB of weights exceed 24 GB VRAM, so expect partial CPU offload) or [Qwen 3 32B](/models/qwen-3-32b) Q4 fully on-GPU. 50 concurrent users, <10s latency.
**Enterprise (100K-10M docs, 50-5,000 people)**: Indexing server: 128-256 GB RAM, NVMe RAID 5+ GB/s, 16-32 cores. [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB for 1000+ docs/sec embedding. Query server: [RTX A6000](/hardware/rtx-a6000) 48 GB for Llama 3.3 70B Q8. Vector DB on separate memory-optimized server (256-512 GB RAM for in-memory Qdrant). Elasticsearch/OpenSearch on separate server for lexical. 4-6 servers total. 500-5K concurrent, sub-5s latency.
**Frontier (10M+ docs, 5K+ employees)**: Horizontally scale — shard vector DB by org unit. Multiple LLM inference servers behind load balancer. [NVIDIA H100 PCIe](/hardware/nvidia-h100-pcie) for [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) synthesis. CDN for search UI static assets.
**Storage**: Vector index ≈ N_docs × avg_chunks_per_doc × (4096 bytes embedding + 500 bytes metadata). At ~10 chunks per document: 1M docs ≈ 45 GB; 10M docs ≈ 450 GB, plus 20-50% index overhead = 550-675 GB. Fits a single server with 1 TB RAM. Beyond 20M docs, sharding is more cost-effective than scaling to 2 TB RAM.
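The same back-of-envelope in code, assuming [BGE-M3](/models/bge-m3)'s 1024-dimensional float32 vectors (4096 bytes) and ~10 chunks per document.

```python
def index_size_gb(n_docs: int, chunks_per_doc: int = 10,
                  embed_bytes: int = 1024 * 4, meta_bytes: int = 500,
                  overhead: float = 0.35) -> float:
    # Bytes per chunk = embedding (1024-dim float32) + payload metadata,
    # plus HNSW/graph overhead (here assumed 35%, mid-range of 20-50%).
    raw = n_docs * chunks_per_doc * (embed_bytes + meta_bytes)
    return raw * (1 + overhead) / 1e9

print(index_size_gb(1_000_000))    # ~62 GB with overhead (~46 GB raw)
print(index_size_gb(10_000_000))   # ~620 GB with overhead
```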
Runtime guidance
**If individual/small team (<10 people, <10K docs)** → [AnythingLLM](https://anythingllm.com) desktop or self-hosted Docker. Simplest multi-source search. ~20 connectors, local embedding, local LLM via [Ollama](/tools/ollama). Limitations: single-machine, no permission filtering, limited connector customization. For a team of 5 with shared access, these limitations don't matter.
**If team of 10-500 needing pipeline control** → Custom [LangChain](https://www.langchain.com/) or [LlamaIndex](https://www.llamaindex.ai/) pipeline. Both provide document loaders, chunking strategies, embedding. Use [Qdrant](/tools/qdrant) (better 100K+ doc performance) or [pgvector](/tools/pgvector) (simpler if already on PostgreSQL). [BGE-M3](/models/bge-m3) via [Text Embeddings Inference](/tools/text-embeddings-inference). [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) for reranking. [vLLM](/tools/vllm) for LLM synthesis.
**If needing permission-aware search** → [OpenSearch](https://opensearch.org/) with the document-level security plugin. It supports attribute-based access control natively — queries auto-filter by the authenticated user's attributes. Pair with [Qdrant](/tools/qdrant) for semantic search with per-chunk permission metadata and post-retrieval filtering.
**If wanting commercial (no engineering)** → Glean, Algolia, Elastic Workplace Search. Glean is enterprise standard — 100+ connectors, permission filtering, polished UI, LLM synthesis. $10-50/user/month. Tradeoff: data leaves your environment. For regulated industries, self-hosted OpenSearch + Qdrant + custom pipeline for data residency.
**If needing hybrid retrieval** → [Elasticsearch](https://www.elastic.co/) 8.x+ with native dense vector support. Hybrid search combines the `knn` search option with BM25, merged via Reciprocal Rank Fusion. A single engine eliminates the complexity of running Elasticsearch + Qdrant side by side. Tradeoff: vector performance at 10M+ vectors lags dedicated vector DBs — plan for 2-4× more RAM.
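A sketch of that single-engine hybrid query through the Python client, assuming Elasticsearch 8.9+ (RRF ranking is gated to higher license tiers), a matching `elasticsearch-py` version, and an index with a `dense_vector` field named `embedding`; index and field names are placeholders.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch("http://localhost:9200")

def hybrid_search(query_text: str, query_vector: list[float], size: int = 20):
    # BM25 over the text field plus kNN over the dense_vector field,
    # fused with Reciprocal Rank Fusion so neither score scale dominates.
    return es.search(
        index="enterprise-docs",
        query={"match": {"text": query_text}},
        knn={"field": "embedding", "query_vector": query_vector,
             "k": 50, "num_candidates": 200},
        rank={"rrf": {}},
        size=size,
    )
```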
**Deployment checklist**: Connectors with webhook + polling, exponential backoff. [BGE-M3](/models/bge-m3) via [TEI](/tools/text-embeddings-inference) on [NVIDIA L4](/hardware/nvidia-l4). [Qdrant](/tools/qdrant) scalar quantization (4× compression, <1% recall loss). [OpenSearch](https://opensearch.org/) document-level security. [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) on [RTX 4090](/hardware/rtx-4090). [Llama 3.3 70B](/models/llama-3-3-70b) Q4 on [vLLM](/tools/vllm). Monitoring: query latency p50/p95, recall@10, index freshness, permission audit logs.