RAG & Search

Enterprise Search

Search across enterprise data sources (Confluence, Slack, Drive, internal docs) with permissions awareness. Self-hosted is the privacy wedge.

Capability notes

Enterprise search indexes internal data sources — Confluence, Slack, Google Drive, SharePoint, GitHub, Notion, Jira — into a unified index that answers natural language queries across organizational knowledge. The core challenge is not retrieval quality (solved with [BGE-M3](/models/bge-m3)) but **connector coverage** (indexing 10+ heterogeneous sources with different APIs, rate limits, and formats) and **permission-aware retrieval** (employees only see results from documents they're authorized to access).

**Hybrid retrieval** combines dense embeddings (semantic similarity via [BGE-M3](/models/bge-m3)) with sparse lexical retrieval (BM25 or SPLADE) to handle both conversational queries ("how do I request PTO") and precise queries ("INC-2024-0842 resolution notes"). Hybrid retrieval improves recall 20-40% over dense-only or sparse-only for enterprise queries spanning conversational to precise.

**Connector coverage**: [AnythingLLM](https://anythingllm.com) supports ~20 connectors (Confluence, Slack, Google Drive, GitHub, Notion, YouTube, web scraping). For unsupported sources, write a custom connector that authenticates to the source API, polls for changes, extracts text, chunks, embeds, and upserts into the vector store. Writing a production connector takes 1-3 days per source — API integration is simple; the hard parts are incremental sync, rate limit handling, and permission mapping.

**Access-aware search**: Different employees have access to different documents — HR sees personnel files, engineering sees repos, execs see financial reports. Naive search that indexes everything and returns results by relevance alone leaks restricted content. The index must store document-level ACLs and filter results per user based on ACL membership. This requires connectors that pull permissions alongside content, an index that stores permissions per chunk, and query-time permission filtering.

**Open-source vs commercial**: Open-source (custom LangChain/LlamaIndex + [Qdrant](/tools/qdrant) or [pgvector](/tools/pgvector) + [BGE-M3](/models/bge-m3) + [Llama 3.3 70B](/models/llama-3-3-70b)) provides full control, zero per-seat pricing, and auditability — at 2-6 months of engineering cost. Commercial (Glean, Algolia, Elastic Workplace Search) provides turnkey deployment with 100+ connectors, built-in permission filtering, and polished UX — at $10-50/user/month. Under 500 employees, commercial is cheaper. Over 5,000, self-hosted amortizes the engineering over a larger base.
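
To make permission-aware retrieval concrete, here is a minimal sketch of query-time ACL pre-filtering with the [Qdrant](/tools/qdrant) Python client, assuming each chunk's payload carries an `acl` list of group IDs written by the connector (the collection name, field name, and group model are illustrative):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import Filter, FieldCondition, MatchAny

client = QdrantClient(url="http://localhost:6333")

def search_with_acl(query_vector: list[float], user_groups: list[str], limit: int = 10):
    # Pre-filter: only chunks whose ACL overlaps the user's group memberships
    # are considered, so restricted content never reaches the LLM context.
    acl_filter = Filter(must=[FieldCondition(key="acl", match=MatchAny(any=user_groups))])
    return client.search(
        collection_name="enterprise",
        query_vector=query_vector,
        query_filter=acl_filter,
        limit=limit,
    )
```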

If you just want to try this

Lowest-friction path to a working setup.

Download [AnythingLLM](https://anythingllm.com) — the simplest multi-source local search tool. The desktop app (Windows/Mac/Linux) bundles a local vector database (LanceDB), an embedding model (BGE-M3 compatible), and LLM integration into a single application. Expect 5-10 minutes from download to first search. Add data sources: AnythingLLM supports Confluence (API token), Slack (OAuth), GitHub repos (PAT), Google Drive (OAuth), Notion (API), and web scraping (URLs). For each source, authenticate and select which spaces/channels/folders to index. Sync runs in the background — 5,000 documents take 10-30 minutes on a modern laptop, with embeddings computed locally on CPU (no GPU required — [BGE-M3](/models/bge-m3) runs at 50-150 docs/sec on CPU).

For the LLM backend: [Ollama](/tools/ollama) with [Llama 3.3 70B](/models/llama-3-3-70b) Q4 if you have ~40 GB of combined memory, or [Qwen 3 32B](/models/qwen-3-32b) Q4 (~20 GB). The LLM answers questions using retrieved document chunks as context. Without a GPU, configure AnythingLLM to use a cloud LLM API (OpenAI/Anthropic) while keeping document indexing and storage local — embeddings stay local (privacy-safe), LLM inference goes to the cloud (API cost).

The first searches often feel underwhelming. The index needs tuning — adjust chunk sizes (500-2000 tokens), overlap (10-20%), and chunks retrieved per query (5-20). Start with the defaults. If answers miss relevant information, increase chunks retrieved. If answers are unfocused, decrease them. Budget 1-2 hours of tuning for your specific document corpus.
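
To make the chunk-size and overlap knobs concrete, here is a rough sketch of the sliding-window chunking these settings control, using word count as a stand-in for tokens (this is an illustration, not AnythingLLM's internal implementation):

```python
def chunk_text(text: str, chunk_tokens: int = 1000, overlap: float = 0.15) -> list[str]:
    # Approximate tokens with words; real pipelines use a tokenizer.
    words = text.split()
    step = max(1, int(chunk_tokens * (1 - overlap)))  # 15% overlap between consecutive chunks
    return [" ".join(words[i:i + chunk_tokens]) for i in range(0, len(words), step)]
```

Larger chunks preserve more context per hit; smaller chunks retrieve more precisely but usually require raising the chunks-per-query setting.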

For production deployment

Operator-grade recommendation.

Production enterprise search for 10,000+ employees across 10+ data sources requires a custom pipeline — AnythingLLM isn't designed for this scale (single-machine, no permission filtering, limited customization). Four components: connectors, indexing pipeline, retrieval engine, search UI.

**Connector architecture**: Plugin system — each data source gets a connector class: `authenticate()`, `list_documents(since_timestamp)`, `get_document(id)`, `get_permissions(id)`, `subscribe_to_changes(webhook)`. Scheduled via Airflow or Temporal with retry logic. Each connector writes documents + permissions to Kafka/Redis Streams.

**Indexing pipeline**: Consume from the queue → chunk with a type-aware strategy. Confluence: by heading section. Slack: by thread. Code repos: by function/class (tree-sitter AST). Embed with [BGE-M3](/models/bge-m3) via [Text Embeddings Inference](/tools/text-embeddings-inference). Store in the vector DB with metadata: source, URL, last_modified, permissions (ACL), document_type, chunk_index. Versioned index — old chunks marked stale on update.

**Retrieval engine**: Hybrid — dense in [Qdrant](/tools/qdrant) or [pgvector](/tools/pgvector) + lexical in [Elasticsearch](https://www.elastic.co/) or [OpenSearch](https://opensearch.org/). Query flow: embed → parallel dense + lexical search (top-50 each) → merge + deduplicate → rerank with [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) (top-50→top-20) → permission filter → return top-10 to the LLM for synthesis.

**Permission-aware retrieval**: Pre-filter (add an ACL filter clause, 5-20ms latency, zero leakage) vs post-filter (retrieve top-N by relevance, then filter by permissions, faster for users in many groups). Combine: coarse pre-filter by department-level group, fine-grained post-filter for document-level permissions.

**Scale to millions of documents**: 10M docs → 50-100M chunks. Vector index: 50M × 4.5 KB (embedding + metadata) ≈ 225 GB + HNSW overhead = 280-340 GB. Shard by org unit — Engineering docs in shard A, HR in shard B. Route queries to relevant shards by user org or intent detection. Sharding also simplifies permission filtering.

**Cost model** (5,000 employees, 1M docs, 10 connectors): Embedding server ([NVIDIA L4](/hardware/nvidia-l4)) $3K hardware or $0.50-1/hr cloud. LLM inference ([RTX A6000](/hardware/rtx-a6000)) $5-8K hardware. Vector DB + Elasticsearch server $5-15K hardware. Engineering: 3-6 months (2-4 engineers). Annual total (hardware amortized over 3 years): $50-150K vs Glean at $10-50/user/month = $600K-3M for 5,000 users.
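
A minimal sketch of the connector interface described above, written as a Python abstract base class. The method names follow the list in the connector architecture paragraph; the return shapes and type hints are assumptions:

```python
from abc import ABC, abstractmethod

class Connector(ABC):
    """One implementation per data source (Confluence, Slack, Drive, ...)."""

    @abstractmethod
    def authenticate(self) -> None:
        """Obtain/refresh credentials for the source API."""

    @abstractmethod
    def list_documents(self, since_timestamp: float) -> list[str]:
        """Return IDs of documents created or updated since the last sync."""

    @abstractmethod
    def get_document(self, doc_id: str) -> dict:
        """Return e.g. {'text': ..., 'url': ..., 'last_modified': ...}."""

    @abstractmethod
    def get_permissions(self, doc_id: str) -> list[str]:
        """Return the ACL (group/user IDs) attached to the document."""

    @abstractmethod
    def subscribe_to_changes(self, webhook_url: str) -> None:
        """Register a webhook so updates arrive without polling."""
```

Each concrete connector (Confluence, Slack, Drive) implements these five methods, so the scheduler and indexing pipeline never need source-specific logic.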

What breaks

Failure modes operators see in the wild.

- **Permission leakage.** Index returns document chunks the user cannot access — a Slack private channel message, a restricted Confluence page, an HR document with salary data. Highest-severity failure — it's a data breach. Mitigation: always apply the permission filter before the LLM sees chunks. Regular permission audit (re-sync every 24 hours). Log which chunks were included in context and verify permissions post-hoc. Never cache search results across users.
- **Stale connector sync.** Document updated in Confluence but the connector hasn't synced yet (15-minute polling cycle). User searches "latest API changes," gets last week's version. Mitigation: webhook-based change detection for real-time updates (Confluence, GitHub, Slack all support webhooks). Show a "last indexed" timestamp on results. For critical documents, re-sync on every query (just-in-time).
- **Query intent mismatch.** "Q4 roadmap" in June 2026 returns the Q4 2025 roadmap — search doesn't understand temporal intent. Mitigation: date-boosted ranking, query expansion ("Q4 roadmap" → "Q4 2026 roadmap"), LLM-based query understanding before retrieval.
- **Ranking decay on large indices.** At 50M+ chunks, top candidates from vector search include moderately-relevant but not highly-relevant documents. The perfect document is buried at rank 30-50. Mitigation: multi-stage retrieval — coarse ANN → reranker → fine. Index partitioning by document type. The [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) cross-encoder reranks top-100 to top-20, adds 20-50ms but is worth it.
- **Connector API rate limiting.** A Confluence sync of 10,000 pages hits the rate limit (200 req/min). Initial sync takes 6 hours instead of 30 minutes. Mitigation: respect Retry-After headers, exponential backoff with jitter (see the sketch after this list), request temporary rate limit increases for the initial sync. Prioritize recently-modified documents.
- **LLM hallucination in synthesized answers.** Retrieved the correct documents but the LLM adds hallucinated details — a specific number, a person's name, a date absent from the context. Mitigation: force inline citations linking to sources. Display source chunks alongside the answer. Run NLI verification on LLM claims against retrieved chunks before serving.
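
For the rate-limiting failure above, a minimal backoff sketch. The endpoint, headers, and retry budget are placeholders; the pattern is to honor Retry-After when present and otherwise back off exponentially with jitter:

```python
import random
import time
import requests

def get_with_backoff(url: str, headers: dict, max_retries: int = 6) -> requests.Response:
    for attempt in range(max_retries):
        resp = requests.get(url, headers=headers, timeout=30)
        if resp.status_code != 429:
            resp.raise_for_status()   # surface non-rate-limit errors immediately
            return resp
        # Rate limited: prefer the server's Retry-After hint, else exponential backoff + jitter
        retry_after = resp.headers.get("Retry-After")
        delay = float(retry_after) if retry_after else (2 ** attempt) + random.uniform(0, 1)
        time.sleep(delay)
    raise RuntimeError(f"still rate-limited after {max_retries} retries: {url}")
```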

Hardware guidance

Indexing pipeline (CPU + RAM + storage) and query-serving pipeline (GPU for LLM + vector search) scale independently.

**Hobbyist (<10K docs, personal)**: Any modern laptop with 16+ GB RAM. [AnythingLLM](https://anythingllm.com) runs entirely locally — LanceDB on NVMe, [BGE-M3](/models/bge-m3) on CPU, [Ollama](/tools/ollama) with [Qwen 3 32B](/models/qwen-3-32b) on an [RTX 3060 12GB](/hardware/rtx-3060-12gb). Indexing 5K docs takes 10-30 min; query latency 5-15 s. The [MacBook Pro 16 M4 Max 64GB](/hardware/macbook-pro-16-m4-max) is the best single device — unified memory handles the full stack.

**SMB (10K-100K docs, 10-50 people)**: Indexing server: 32-64 GB RAM, NVMe at 2+ GB/s, 8-16 cores. [NVIDIA L4](/hardware/nvidia-l4) for [Text Embeddings Inference](/tools/text-embeddings-inference) — GPU embedding is 5-10× faster than CPU. Inference: [RTX 4090 24GB](/hardware/rtx-4090) for [Llama 3.3 70B](/models/llama-3-3-70b) Q4 (partial CPU offload at 24 GB VRAM). 50 concurrent users, <10 s latency.

**Enterprise (100K-10M docs, 50-5,000 people)**: Indexing server: 128-256 GB RAM, NVMe RAID at 5+ GB/s, 16-32 cores. [NVIDIA L40S](/hardware/nvidia-l40s) 48 GB for 1,000+ docs/sec embedding. Query server: [RTX A6000](/hardware/rtx-a6000) 48 GB for Llama 3.3 70B Q4. Vector DB on a separate memory-optimized server (256-512 GB RAM for in-memory Qdrant). Elasticsearch/OpenSearch on a separate server for lexical. 4-6 servers total. 500-5K concurrent users, sub-5 s latency.

**Frontier (10M+ docs, 5K+ employees)**: Scale horizontally — shard the vector DB by org unit. Multiple LLM inference servers behind a load balancer. [NVIDIA H100 PCIe](/hardware/nvidia-h100-pcie) for [Qwen 3 235B-A22B](/models/qwen-3-235b-a22b) synthesis. CDN for search UI static assets.

**Storage**: Vector index = N_docs × avg_chunks × (4096 bytes embedding + 500 bytes metadata). 1M docs: ~45 GB. 10M docs: ~450 GB + 20-50% index overhead = 540-675 GB. Fits a single server with 1 TB RAM. Beyond 20M docs, sharding is more cost-effective than scaling to 2 TB RAM.
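
A quick back-of-envelope calculator for the storage formula above. The 10 chunks-per-document average and the 35% index overhead are assumptions; substitute your own corpus statistics:

```python
def index_size_gb(n_docs: int, avg_chunks: int = 10, emb_bytes: int = 4096,
                  meta_bytes: int = 500, overhead: float = 0.35) -> float:
    # Raw payload per chunk is embedding + metadata; overhead covers HNSW graph links.
    raw = n_docs * avg_chunks * (emb_bytes + meta_bytes)
    return raw * (1 + overhead) / 1e9

print(index_size_gb(1_000_000))    # ~62 GB with overhead (~46 GB raw)
print(index_size_gb(10_000_000))   # ~620 GB with overhead
```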

Runtime guidance

**If individual/small team (<10 people, <10K docs)** → [AnythingLLM](https://anythingllm.com) desktop or self-hosted Docker. Simplest multi-source search. ~20 connectors, local embedding, local LLM via [Ollama](/tools/ollama). Limitations: single-machine, no permission filtering, limited connector customization. For a team of 5 with shared access, the limitations don't matter.

**If team of 10-500 needing pipeline control** → Custom [LangChain](https://www.langchain.com/) or [LlamaIndex](https://www.llamaindex.ai/) pipeline. Both provide document loaders, chunking strategies, and embedding integrations. Use [Qdrant](/tools/qdrant) (better performance at 100K+ docs) or [pgvector](/tools/pgvector) (simpler if already on PostgreSQL). [BGE-M3](/models/bge-m3) via [Text Embeddings Inference](/tools/text-embeddings-inference). [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) for reranking. [vLLM](/tools/vllm) for LLM synthesis.

**If needing permission-aware search** → [OpenSearch](https://opensearch.org/) with the document-level security plugin. Attribute-based access control is native — queries auto-filter by authenticated user attributes. Pair with [Qdrant](/tools/qdrant) for semantic search with per-chunk permission metadata and post-retrieval filtering.

**If wanting commercial (no engineering)** → Glean, Algolia, Elastic Workplace Search. Glean is the enterprise standard — 100+ connectors, permission filtering, polished UI, LLM synthesis. $10-50/user/month. Tradeoff: data leaves your environment. For regulated industries, a self-hosted OpenSearch + Qdrant + custom pipeline preserves data residency.

**If needing hybrid retrieval** → [Elasticsearch](https://www.elastic.co/) 8.x+ with native dense vector support. Hybrid search combines the `knn` query with BM25, fused via Reciprocal Rank Fusion. A single engine eliminates Elasticsearch + Qdrant complexity. Tradeoff: vector performance at 10M+ vectors lags dedicated vector DBs — plan for 2-4× more RAM.

**Deployment checklist**: Connectors with webhook + polling and exponential backoff. [BGE-M3](/models/bge-m3) via [TEI](/tools/text-embeddings-inference) on an [NVIDIA L4](/hardware/nvidia-l4). [Qdrant](/tools/qdrant) with scalar quantization (4× compression, <1% recall loss — see the sketch below). [OpenSearch](https://opensearch.org/) document-level security. [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) on an [RTX 4090](/hardware/rtx-4090). [Llama 3.3 70B](/models/llama-3-3-70b) Q4 on [vLLM](/tools/vllm). Monitoring: query latency p50/p95, recall@10, index freshness, permission audit logs.
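
For the scalar quantization item in the checklist, a minimal collection-setup sketch with the Qdrant Python client, assuming BGE-M3's 1024-dimensional dense vectors (the collection name and client URL are placeholders):

```python
from qdrant_client import QdrantClient
from qdrant_client.models import (Distance, ScalarQuantization,
                                  ScalarQuantizationConfig, ScalarType, VectorParams)

client = QdrantClient(url="http://localhost:6333")
client.create_collection(
    collection_name="enterprise",
    vectors_config=VectorParams(size=1024, distance=Distance.COSINE),  # BGE-M3 dense vectors
    quantization_config=ScalarQuantization(
        scalar=ScalarQuantizationConfig(type=ScalarType.INT8, always_ram=True)
    ),
)
```

INT8 scalar quantization stores one byte per dimension instead of four, which is where the roughly 4× memory reduction comes from.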

Setup walkthrough

  1. Install [Ollama](/tools/ollama), then `ollama pull nomic-embed-text` (embeddings) and `ollama pull llama3.1:8b` (generation).
  2. Install a vector DB: `docker run -p 6333:6333 qdrant/qdrant` (Qdrant) or `pip install chromadb`.
  3. Index your documents:
```python
import chromadb, requests

client = chromadb.Client()  # in-memory Chroma; use chromadb.PersistentClient(path=...) for disk
collection = client.create_collection("enterprise")

# Replace with your own loader (Confluence export, wiki dump, file crawl, etc.)
documents = [{"text": "PTO requests go through Workday under Time Off."}]

for i, doc in enumerate(documents):
    # Embed each document via the local Ollama embeddings endpoint
    resp = requests.post("http://localhost:11434/api/embeddings",
        json={"model": "nomic-embed-text", "prompt": doc["text"]})
    collection.add(documents=[doc["text"]],
                   embeddings=[resp.json()["embedding"]],
                   ids=[f"doc_{i}"])
```
  4. Query: embed the query → search top-20 → rerank with LLM → return top-5 with snippets (see the sketch after this list).
  5. For permissions-aware search: add metadata filters (department, classification level) to each document → filter at query time.
  6. First search result in <2 seconds for a 10K-document corpus on CPU.
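
A minimal sketch of step 4's embed-and-search stage, continuing from the indexing snippet in step 3 (it stops before the rerank step, and the example query is illustrative):

```python
import requests

query = "how do I request time off?"
resp = requests.post("http://localhost:11434/api/embeddings",
    json={"model": "nomic-embed-text", "prompt": query})

# `collection` is the Chroma collection created in step 3
hits = collection.query(query_embeddings=[resp.json()["embedding"]], n_results=20)
for snippet in hits["documents"][0][:5]:   # top-5 snippets to show or pass to the LLM
    print(snippet[:200])
```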

The cheap setup

Enterprise search is surprisingly cheap. Nomic Embed Text runs on any CPU — a $300 laptop indexes and searches 100K documents at sub-second query latency. ChromaDB or LanceDB (embedded, no Docker needed) handles the vector storage. The generation step (answer synthesis) needs a GPU only if you want LLM-generated answers from search results. For "search and return snippets" (no generation), any $300 machine works great. For LLM-synthesized answers, add a used GTX 1060 6 GB (~$60) — total ~$360.

The serious setup

A used [RTX 3090 24 GB](/hardware/rtx-3090) (~$700-900) runs the full pipeline: [BGE-M3](/models/bge-m3) embeddings (1,000+ docs/second), [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) for precise ranking, and LLM answer synthesis ([Qwen 3 32B](/models/qwen-3-32b) Q4 fits entirely on-GPU; [Llama 3.3 70B](/models/llama-3-3-70b) Q4 needs partial CPU offload at 24 GB VRAM). Qdrant/Weaviate for a production vector DB with filtering + permissions. Handles 1M+ documents with sub-500ms query latency. Pair with a Ryzen 7 7700X + 64 GB DDR5 + 2TB NVMe. Total: ~$1,800-2,200. For enterprise scale (10M+ docs, 100+ concurrent users): add a second GPU and switch to distributed Qdrant.

Common beginner mistake

**The mistake**: Using a single-stage keyword search (Elasticsearch BM25) for "enterprise AI search" and calling it done.

**Why it fails**: Keyword search can't handle semantic queries — searching "how to request time off" won't find a document titled "PTO Policy" because the words don't match. Users expect Google-level semantic understanding from enterprise search now.

**The fix**: Use hybrid search: BM25 (keywords) + dense embeddings (semantics) combined. Retrieve top-100 via each method, merge (reciprocal rank fusion), then rerank the top-50 with a cross-encoder (see the sketch below). This gives both exact-match precision AND semantic recall. The embedding model (Nomic/BGE) + reranker add ~200ms latency and ~1 GB VRAM overhead — trivially cheap for the quality improvement.
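
A compact sketch of that merge-then-rerank stage: reciprocal rank fusion over the two candidate lists, then cross-encoder rescoring with [BGE Reranker V2 M3](/models/bge-reranker-v2-m3) loaded through sentence-transformers. The function names, the k=60 constant, and the data shapes are illustrative:

```python
from collections import defaultdict
from sentence_transformers import CrossEncoder

def rrf_merge(bm25_ids: list[str], dense_ids: list[str], k: int = 60) -> list[str]:
    # Reciprocal Rank Fusion: each list contributes 1 / (k + rank) per document.
    scores: dict[str, float] = defaultdict(float)
    for ranked in (bm25_ids, dense_ids):
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

def rerank(query: str, candidates: list[str], texts: dict[str, str], top_n: int = 20) -> list[str]:
    # Cross-encoder scores each (query, passage) pair jointly: slower, far more precise.
    reranker = CrossEncoder("BAAI/bge-reranker-v2-m3")
    pairs = [(query, texts[doc_id]) for doc_id in candidates[:50]]
    scores = reranker.predict(pairs)
    order = sorted(range(len(pairs)), key=lambda i: scores[i], reverse=True)
    return [candidates[i] for i in order[:top_n]]
```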


Reality check

Local AI workloads have real hardware constraints that vary by task type. VRAM ceiling decides what model fits; bandwidth decides decode speed; compute decides prefill speed. Pick the GPU tier that fits your actual workload, not the spec sheet.

Common mistakes

  • Buying for spec-sheet VRAM without modeling KV cache + activation overhead
  • Underestimating quantization quality loss below Q4
  • Skipping flash-attention support (real perf gap on long context)
  • Ignoring sustained-load thermals (laptops thermal-throttle within 30 min)


