Enterprise Search (Agentic RAG)

"Design a search system that indexes Confluence, SharePoint, Slack, Google Drive, and ServiceNow KB. Users ask natural language queries. Must enforce per-user access permissions. Multi-tenant architecture with <500ms retrieval latency."

Table of Contents

  1. Clarifying Questions & Scope
  2. Back-of-Envelope Estimation
  3. High-Level Architecture
  4. Deep Dive 1: Ingestion Pipeline
  5. Deep Dive 2: Hybrid Search
  6. Deep Dive 3: Permission-Aware Filtering
  7. Scaling & Production
  8. ML & Evaluation
  9. Cheat Sheet

1 Clarifying Questions & Scope

  ┌───────────────────┬─────────────────────────────────────────────┬────────────────────────────────────────────────┐
  │ Dimension         │ Clarification                               │ Assumption                                     │
  ├───────────────────┼─────────────────────────────────────────────┼────────────────────────────────────────────────┤
  │ Document Volume   │ How many docs per customer?                 │ 100K-10M documents per customer                │
  │ Freshness         │ How quickly must new content be searchable? │ Fresh within minutes (near real-time)          │
  │ Access Control    │ How are permissions enforced?               │ ACL-based permissions per document/space       │
  │ Tenant Isolation  │ How strict is data isolation?               │ Strict isolation — zero cross-tenant leakage   │
  │ Answer Generation │ Return docs or generate answers?            │ Generate answers with citations to source docs │
  └───────────────────┴─────────────────────────────────────────────┴────────────────────────────────────────────────┘

2 Back-of-Envelope Estimation

Scale Numbers

  • 350 customers x 1M docs avg = 350M documents
  • 350M docs x 5 chunks per doc (~500 tokens each) = 1.75 billion chunks total
  • 1,536-dim fp32 embeddings x 4 bytes x 1.75B chunks = ~10.8TB vector storage (~5.4TB at fp16)
  • Query load: ~30 QPS across all tenants
  ┌──────────────────────────┬───────────────────────────────────────┐
  │ Metric                   │ Target                                │
  ├──────────────────────────┼───────────────────────────────────────┤
  │ Retrieval Latency        │ <500ms (vector + keyword + rerank)    │
  │ End-to-End (with answer) │ <3 seconds (including LLM generation) │
  │ Ingestion Throughput     │ 10K documents/minute per connector    │
  │ Freshness SLA            │ <5 minutes from source update         │
  │ Permission Accuracy      │ 100% (zero unauthorized access)       │
  └──────────────────────────┴───────────────────────────────────────┘
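
These scale numbers can be sanity-checked in a few lines of Python (this assumes uncompressed fp32 vectors; fp16 halves the footprint and product quantization shrinks it further):

```python
# Back-of-envelope check of the scale numbers above.
customers = 350
docs_per_customer = 1_000_000
chunks_per_doc = 5
dims = 1536
bytes_per_float = 4  # fp32

docs = customers * docs_per_customer            # 350M documents
chunks = docs * chunks_per_doc                  # 1.75B chunks
vector_bytes = chunks * dims * bytes_per_float  # raw vector storage

print(f"{docs / 1e6:.0f}M docs, {chunks / 1e9:.2f}B chunks, "
      f"{vector_bytes / 1e12:.1f}TB fp32 vectors")
```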

3 High-Level Architecture

Ingestion Pipeline

  DATA SOURCES                    INGESTION PIPELINE                      STORAGE
  +-----------+                                                    +------------------+
  | Confluence|--+                                                 |   Vector DB      |
  +-----------+  |   +-----------+   +---------+   +---------+    |  (Pinecone/      |
  | SharePoint|--+-->| Connectors|-->| Extract |-->|  Chunk  |--+-|>  Qdrant)         |
  +-----------+  |   | REST /    |   |  Text   |   | 500-1K  |  | +------------------+
  |   Slack   |--+   | Graph API |   | (Tika)  |   | tokens  |  |
  +-----------+  |   | Webhooks  |   +---------+   | overlap |  | +------------------+
  |Google Drv |--+   | + Polling |                 | 100 tok |  | | Elasticsearch    |
  +-----------+  |   +-----------+                 +---------+  +-|> (BM25 keyword)  |
  |ServiceNow|--+                                      |         +------------------+
  +-----------+                                   +---------+
                                                  |  Embed  |    +------------------+
                                                  | ada-002 |    |  Metadata Store  |
                                                  +---------+    |  (PostgreSQL)    |
                                                                 +------------------+

  EACH CHUNK STORES:
  text | source_url | author | timestamp | section_title | ACL_metadata | tenant_id

Query Pipeline

  USER QUERY FLOW
  +--------+    +-------+    +----------+    +------------------+    +------------+
  |  User  |--->|  NLU  |--->|  Query   |--->| Hybrid Retrieval |--->| Permission |
  | Query  |    |Intent |    | Expansion|    |                  |    |   Filter   |
  +--------+    +-------+    |Synonyms  |    | Vector (cosine)  |    +-----+------+
                             +----------+    | + BM25 (keyword) |          |
                                             +------------------+          v
                                                                   +------------+
  +--------+    +----------+    +---------+                        |  Re-rank   |
  |  User  |<---|  Answer  |<---|  LLM    |<-----------------------| Cross-enc  |
  |Response|    |  + Cites  |    |Generate |                       |  Top-50    |
  +--------+    +----------+    +---------+                        +------------+

4 Deep Dive 1: Ingestion Pipeline

Source Connectors

Webhook + Polling Hybrid Strategy

Primary: Webhooks for real-time updates (<1 min latency). Each source sends change events to our ingestion queue.
Fallback: Polling every 5 minutes to catch missed webhooks. Reconciliation job compares source timestamps vs. our last-indexed timestamps. This guarantees no document is missed even if webhooks fail silently.
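
The reconciliation job reduces to a per-document timestamp comparison. A minimal sketch (the dict-based stores and the function name are illustrative, not a real connector API):

```python
from datetime import datetime

def find_missed_updates(source_index: dict[str, datetime],
                        indexed_at: dict[str, datetime]) -> list[str]:
    """Return doc IDs whose source timestamp is newer than, or missing
    from, our last-indexed record: candidates for re-ingestion."""
    stale = []
    for doc_id, source_ts in source_index.items():
        local_ts = indexed_at.get(doc_id)
        if local_ts is None or source_ts > local_ts:
            stale.append(doc_id)
    return stale
```

Running this every 5 minutes against each source's change-listing API catches any document whose webhook was dropped.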

Chunking Strategy
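
The 500-1K token chunks with 100-token overlap from the architecture diagram can be sketched as a sliding window (token lists here stand in for real tokenizer output, e.g. from tiktoken):

```python
def chunk_tokens(tokens: list[str], size: int = 500,
                 overlap: int = 100) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping size - overlap each time,
    so consecutive chunks share `overlap` tokens of context."""
    step = size - overlap  # assumes 0 <= overlap < size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

Production chunkers additionally respect structure (headings, paragraphs, code blocks) rather than cutting at fixed token offsets.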

Embedding Generation

5 Deep Dive 2: Hybrid Search

Why Hybrid? Two Failure Modes

Vector Search Wins (Semantic)

Query: "How do I reset my password?"
Matches document titled: "Credential Recovery Procedures"
BM25 would miss this entirely — no keyword overlap. Vector search captures semantic similarity.

BM25 Wins (Exact Match)

Query: "VPN-2847 error code"
Matches document containing: "Error VPN-2847: Certificate expired"
Vector search might return generic VPN docs. BM25 nails the exact error code.

Reciprocal Rank Fusion (RRF)

Combine results from both retrieval methods using RRF:

  RRF Score Formula:
  ─────────────────────────────────────────────────────
  score(doc) = 1/(k + rank_vector) + 1/(k + rank_keyword)

  where k = 60 (standard constant)

  Example:
  ┌──────────┬──────────────┬──────────────┬───────────┐
  │ Document │ Vector Rank  │ BM25 Rank    │ RRF Score │
  ├──────────┼──────────────┼──────────────┼───────────┤
  │ Doc A    │ 1            │ 5            │ 0.0318    │
  │ Doc B    │ 3            │ 2            │ 0.0320    │  ← Winner
  │ Doc C    │ 2            │ 8            │ 0.0308    │
  │ Doc D    │ 10           │ 1            │ 0.0307    │
  └──────────┴──────────────┴──────────────┴───────────┘

  Doc B ranks best overall — good in BOTH systems.
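
The fusion step is a few lines of code (a sketch; some engines also support RRF natively, so in production this may happen inside the search layer):

```python
def rrf_fuse(rankings: list[list[str]],
             k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked result lists: score(doc) = sum of 1/(k + rank)
    over every list the doc appears in; return docs sorted by score."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Applied to the ranks in the table above (A at vector rank 1 / BM25 rank 5, B at 3 / 2, and so on), Doc B comes out on top.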

Cross-Encoder Re-ranking
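
A cross-encoder scores each (query, passage) pair jointly instead of comparing precomputed embeddings, which is more accurate but far too slow for the full corpus, hence re-ranking only the top candidates. A sketch with a pluggable scorer (a real deployment might load a model such as `cross-encoder/ms-marco-MiniLM-L-6-v2` via sentence-transformers; the lexical-overlap scorer below is a toy stand-in):

```python
from typing import Callable

def rerank(query: str, passages: list[str],
           score: Callable[[str, str], float],
           top_k: int = 5) -> list[str]:
    """Score every (query, passage) pair and keep the top_k passages."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return ranked[:top_k]

def overlap_score(query: str, passage: str) -> float:
    # Toy scorer: fraction of query terms present in the passage.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)
```

With ~50 candidates and a small cross-encoder model, this stage fits in the ~100ms budget noted in the cheat sheet.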

6 Deep Dive 3: Permission-Aware Filtering

CRITICAL REQUIREMENT: A user must NEVER see content they don't have access to in the source system. This is a compliance and trust requirement — a single violation can lose a customer. Permission filtering is not optional; it is the #1 priority.

How It Works

  PERMISSION FILTERING FLOW
  ─────────────────────────────────────────────────────────────

  Query: "How to configure SSO?"  User: jane@acme.com

  ┌──────────────────┐
  │ Hybrid Retrieval │ → Top 200 results (no permission check yet)
  └────────┬─────────┘
           │
  ┌────────v─────────┐
  │ Resolve User ACL │ → jane@acme.com is in: [engineering, sso-admins, all-staff]
  │ (Redis cache 5m) │
  └────────┬─────────┘
           │
  ┌────────v─────────┐
  │ Filter by ACL    │ → 200 results → 47 accessible
  │ chunk.acl ∩ user │
  └────────┬─────────┘
           │
  ┌────────v─────────┐
  │ Cross-Encoder    │ → Re-rank 47 accessible (≤50) → Final 5 for answer generation
  │ Re-rank          │
  └────────┬─────────┘
           │
  ┌────────v─────────┐
  │ LLM Answer Gen   │ → "To configure SSO, follow these steps... [1][2][3]"
  │ with Citations   │
  └──────────────────┘
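
The filter step above reduces to a set intersection per chunk (field names like `acl` are illustrative; real ACL models must also resolve deny rules and nested groups, which here is assumed to happen when the user's groups are looked up):

```python
def filter_by_acl(results: list[dict], user_groups: set[str],
                  top_k: int = 50) -> list[dict]:
    """Keep only chunks whose ACL intersects the user's resolved groups.
    Over-retrieval upstream (200 candidates) compensates for drops here."""
    allowed = [r for r in results if user_groups & set(r["acl"])]
    return allowed[:top_k]
```

Because filtering happens after retrieval but before re-ranking and generation, no unauthorized chunk can ever reach the LLM prompt or the citations.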

Edge Cases

7 Scaling & Production

Sharding Strategy

Caching

Incremental Updates

Multi-Model Embeddings

8 ML & Evaluation

Fine-Tuning

Evaluation Metrics

  ┌────────────────────────────┬────────┬───────────────────────────────────────────────────────────────┐
  │ Metric                     │ Target │ How Measured                                                  │
  ├────────────────────────────┼────────┼───────────────────────────────────────────────────────────────┤
  │ MRR (Mean Reciprocal Rank) │ >0.65  │ Position of first relevant result in top-10                   │
  │ NDCG@10                    │ >0.70  │ Graded relevance of top-10 results                            │
  │ Answer Accuracy            │ >90%   │ Human eval of LLM-generated answers (weekly sample)           │
  │ Citation Accuracy          │ >95%   │ Do citations actually support the generated answer?           │
  │ Permission Accuracy        │ 100%   │ Automated audit: compare returned results against source ACLs │
  └────────────────────────────┴────────┴───────────────────────────────────────────────────────────────┘
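
MRR, for example, can be computed directly from binary relevance judgments of each query's top-10 (a sketch; graded metrics like NDCG additionally weight by relevance level and apply a log-rank discount):

```python
def mrr_at_10(judgments: list[list[bool]]) -> float:
    """judgments: per query, whether each of its top-10 results is relevant.
    MRR = mean over queries of 1/rank of the first relevant result
    (contributing 0 when no relevant result appears in the top-10)."""
    total = 0.0
    for relevant in judgments:
        for rank, is_rel in enumerate(relevant[:10], start=1):
            if is_rel:
                total += 1.0 / rank
                break
    return total / len(judgments)
```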

9 Cheat Sheet

Enterprise Search (Agentic RAG) — Key Numbers

  • 350M docs, 1.75B chunks, ~10.8TB fp32 vectors (~5.4TB fp16)
  • Hybrid retrieval: Vector (semantic) + BM25 (keyword) + RRF fusion
  • Cross-encoder re-ranks top-50 in ~100ms
  • Permission filtering: over-retrieve 200, filter by ACL, return 10
  • Redis permission cache TTL 5 min
  • <500ms retrieval, <3s end-to-end with answer
  • Shard by tenant_id for isolation + independent scaling
  • Webhook + polling hybrid for <5 min freshness
  • 100-token overlap between chunks preserves context
  • Content hash avoids re-embedding unchanged chunks (80% cost savings)