Enterprise Search (Agentic RAG)

"Design a search system that indexes Confluence, SharePoint, Slack, Google Drive, and ServiceNow KB. Users ask natural language queries. Must enforce per-user access permissions. Multi-tenant architecture with <500ms retrieval latency."

Table of Contents

  1. Clarifying Questions & Scope
  2. Back-of-Envelope Estimation
  3. High-Level Architecture
  4. Deep Dive 1: Ingestion Pipeline
  5. Deep Dive 2: Hybrid Search
  6. Deep Dive 3: Permission-Aware Filtering
  7. Scaling & Production
  8. ML & Evaluation
  9. Cheat Sheet

1 Clarifying Questions & Scope

  ┌───────────────────┬─────────────────────────────────────────────┬────────────────────────────────────────────────┐
  │ Dimension         │ Clarification                               │ Assumption                                     │
  ├───────────────────┼─────────────────────────────────────────────┼────────────────────────────────────────────────┤
  │ Document Volume   │ How many docs per customer?                 │ 100K-10M documents per customer                │
  │ Freshness         │ How quickly must new content be searchable? │ Fresh within minutes (near real-time)          │
  │ Access Control    │ How are permissions enforced?               │ ACL-based permissions per document/space       │
  │ Tenant Isolation  │ How strict is data isolation?               │ Strict isolation — zero cross-tenant leakage   │
  │ Answer Generation │ Return docs or generate answers?            │ Generate answers with citations to source docs │
  └───────────────────┴─────────────────────────────────────────────┴────────────────────────────────────────────────┘

2 Back-of-Envelope Estimation

Scale Numbers

  • 350 customers x 1M docs avg = 350M documents
  • 350M docs x 5 chunks per doc (~500 tokens each) = 1.75 billion chunks total
  • 1,536-dim fp32 embeddings x 4 bytes x 1.75B chunks = ~10.8TB vector storage (~5.4TB at fp16)
  • Query load: ~30 QPS across all tenants
  ┌──────────────────────────┬───────────────────────────────────────┐
  │ Metric                   │ Target                                │
  ├──────────────────────────┼───────────────────────────────────────┤
  │ Retrieval Latency        │ <500ms (vector + keyword + rerank)    │
  │ End-to-End (with answer) │ <3 seconds (including LLM generation) │
  │ Ingestion Throughput     │ 10K documents/minute per connector    │
  │ Freshness SLA            │ <5 minutes from source update         │
  │ Permission Accuracy      │ 100% (zero unauthorized access)       │
  └──────────────────────────┴───────────────────────────────────────┘
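
These scale numbers can be sanity-checked in a few lines of Python (this assumes uncompressed fp32 vectors; fp16 halves the footprint and product quantization shrinks it further):

```python
# Back-of-envelope check of the scale numbers above.
customers = 350
docs_per_customer = 1_000_000
chunks_per_doc = 5
dims = 1536
bytes_per_float = 4  # fp32

docs = customers * docs_per_customer            # 350M documents
chunks = docs * chunks_per_doc                  # 1.75B chunks
vector_bytes = chunks * dims * bytes_per_float  # raw vector storage

print(f"{docs / 1e6:.0f}M docs, {chunks / 1e9:.2f}B chunks, "
      f"{vector_bytes / 1e12:.1f}TB fp32 vectors")
```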

3 High-Level Architecture

Ingestion Pipeline

  DATA SOURCES                    INGESTION PIPELINE                      STORAGE
  +-----------+                                                    +------------------+
  | Confluence|--+                                                 |   Vector DB      |
  +-----------+  |   +-----------+   +---------+   +---------+    |  (Pinecone/      |
  | SharePoint|--+-->| Connectors|-->| Extract |-->|  Chunk  |--+-|>  Qdrant)         |
  +-----------+  |   | REST /    |   |  Text   |   | 500-1K  |  | +------------------+
  |   Slack   |--+   | Graph API |   | (Tika)  |   | tokens  |  |
  +-----------+  |   | Webhooks  |   +---------+   | overlap |  | +------------------+
  |Google Drv |--+   | + Polling |                 | 100 tok |  | | Elasticsearch    |
  +-----------+  |   +-----------+                 +---------+  +-|> (BM25 keyword)  |
  |ServiceNow|--+                                      |         +------------------+
  +-----------+                                   +---------+
                                                  |  Embed  |    +------------------+
                                                  | ada-002 |    |  Metadata Store  |
                                                  +---------+    |  (PostgreSQL)    |
                                                                 +------------------+

  EACH CHUNK STORES:
  text | source_url | author | timestamp | section_title | ACL_metadata | tenant_id

Query Pipeline

  USER QUERY FLOW
  +--------+    +-------+    +----------+    +------------------+    +------------+
  |  User  |--->|  NLU  |--->|  Query   |--->| Hybrid Retrieval |--->| Permission |
  | Query  |    |Intent |    | Expansion|    |                  |    |   Filter   |
  +--------+    +-------+    |Synonyms  |    | Vector (cosine)  |    +-----+------+
                             +----------+    | + BM25 (keyword) |          |
                                             +------------------+          v
                                                                   +------------+
  +--------+    +----------+    +---------+                        |  Re-rank   |
  |  User  |<---|  Answer  |<---|  LLM    |<-----------------------| Cross-enc  |
  |Response|    |  + Cites  |    |Generate |                       |  Top-50    |
  +--------+    +----------+    +---------+                        +------------+

4 Deep Dive 1: Ingestion Pipeline

Source Connectors

Webhook + Polling Hybrid Strategy

Primary: Webhooks for real-time updates (<1 min latency). Each source sends change events to our ingestion queue.
Fallback: Polling every 5 minutes to catch missed webhooks. Reconciliation job compares source timestamps vs. our last-indexed timestamps. This guarantees no document is missed even if webhooks fail silently.
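
The reconciliation job reduces to a per-document timestamp comparison. A minimal sketch (the dict-based stores and the function name are illustrative, not a real connector API):

```python
from datetime import datetime

def find_missed_updates(source_index: dict[str, datetime],
                        indexed_at: dict[str, datetime]) -> list[str]:
    """Return doc IDs whose source timestamp is newer than, or missing
    from, our last-indexed record: candidates for re-ingestion."""
    stale = []
    for doc_id, source_ts in source_index.items():
        local_ts = indexed_at.get(doc_id)
        if local_ts is None or source_ts > local_ts:
            stale.append(doc_id)
    return stale
```

Running this every 5 minutes against each source's change-listing API catches any document whose webhook was dropped.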

Chunking Strategy
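
The 500-1K token chunks with 100-token overlap from the architecture diagram can be sketched as a sliding window (token lists here stand in for real tokenizer output, e.g. from tiktoken):

```python
def chunk_tokens(tokens: list[str], size: int = 500,
                 overlap: int = 100) -> list[list[str]]:
    """Slide a window of `size` tokens, stepping size - overlap each time,
    so consecutive chunks share `overlap` tokens of context."""
    step = size - overlap  # assumes 0 <= overlap < size
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # last window already covers the tail
    return chunks
```

Production chunkers additionally respect structure (headings, paragraphs, code blocks) rather than cutting at fixed token offsets.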

Embedding Generation

5 Deep Dive 2: Hybrid Search

Why Hybrid? Two Failure Modes

Vector Search Wins (Semantic)

Query: "How do I reset my password?"
Matches document titled: "Credential Recovery Procedures"
BM25 would miss this entirely — no keyword overlap. Vector search captures semantic similarity.

BM25 Wins (Exact Match)

Query: "VPN-2847 error code"
Matches document containing: "Error VPN-2847: Certificate expired"
Vector search might return generic VPN docs. BM25 nails the exact error code.

Reciprocal Rank Fusion (RRF)

Combine results from both retrieval methods using RRF:

  RRF Score Formula:
  ─────────────────────────────────────────────────────
  score(doc) = 1/(k + rank_vector) + 1/(k + rank_keyword)

  where k = 60 (standard constant)

  Example:
  ┌──────────┬──────────────┬──────────────┬───────────┐
  │ Document │ Vector Rank  │ BM25 Rank    │ RRF Score │
  ├──────────┼──────────────┼──────────────┼───────────┤
  │ Doc A    │ 1            │ 5            │ 0.0318    │
  │ Doc B    │ 3            │ 2            │ 0.0320    │  ← Winner
  │ Doc C    │ 2            │ 8            │ 0.0308    │
  │ Doc D    │ 10           │ 1            │ 0.0307    │
  └──────────┴──────────────┴──────────────┴───────────┘

  Doc B ranks best overall — good in BOTH systems.
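
The fusion step is a few lines of code (a sketch; some engines also support RRF natively, so in production this may happen inside the search layer):

```python
def rrf_fuse(rankings: list[list[str]],
             k: int = 60) -> list[tuple[str, float]]:
    """Fuse ranked result lists: score(doc) = sum of 1/(k + rank)
    over every list the doc appears in; return docs sorted by score."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, doc in enumerate(ranked, start=1):
            scores[doc] = scores.get(doc, 0.0) + 1.0 / (k + rank)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```

Applied to the ranks in the table above (A at vector rank 1 / BM25 rank 5, B at 3 / 2, and so on), Doc B comes out on top.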

Cross-Encoder Re-ranking
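
A cross-encoder scores each (query, passage) pair jointly instead of comparing precomputed embeddings, which is more accurate but far too slow for the full corpus, hence re-ranking only the top candidates. A sketch with a pluggable scorer (a real deployment might load a model such as `cross-encoder/ms-marco-MiniLM-L-6-v2` via sentence-transformers; the lexical-overlap scorer below is a toy stand-in):

```python
from typing import Callable

def rerank(query: str, passages: list[str],
           score: Callable[[str, str], float],
           top_k: int = 5) -> list[str]:
    """Score every (query, passage) pair and keep the top_k passages."""
    ranked = sorted(passages, key=lambda p: score(query, p), reverse=True)
    return ranked[:top_k]

def overlap_score(query: str, passage: str) -> float:
    # Toy scorer: fraction of query terms present in the passage.
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / max(len(q), 1)
```

With ~50 candidates and a small cross-encoder model, this stage fits in the ~100ms budget noted in the cheat sheet.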

6 Deep Dive 3: Permission-Aware Filtering

CRITICAL REQUIREMENT: A user must NEVER see content they don't have access to in the source system. This is a compliance and trust requirement — a single violation can lose a customer. Permission filtering is not optional; it is the #1 priority.

How It Works

  PERMISSION FILTERING FLOW
  ─────────────────────────────────────────────────────────────

  Query: "How to configure SSO?"  User: jane@acme.com

  ┌──────────────────┐
  │ Hybrid Retrieval │ → Top 200 results (no permission check yet)
  └────────┬─────────┘
           │
  ┌────────v─────────┐
  │ Resolve User ACL │ → jane@acme.com is in: [engineering, sso-admins, all-staff]
  │ (Redis cache 5m) │
  └────────┬─────────┘
           │
  ┌────────v─────────┐
  │ Filter by ACL    │ → 200 results → 47 accessible
  │ chunk.acl ∩ user │
  └────────┬─────────┘
           │
  ┌────────v─────────┐
  │ Cross-Encoder    │ → Re-rank 47 accessible (≤50) → Final 5 for answer generation
  │ Re-rank          │
  └────────┬─────────┘
           │
  ┌────────v─────────┐
  │ LLM Answer Gen   │ → "To configure SSO, follow these steps... [1][2][3]"
  │ with Citations   │
  └──────────────────┘
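
The filter step above reduces to a set intersection per chunk (field names like `acl` are illustrative; real ACL models must also resolve deny rules and nested groups, which here is assumed to happen when the user's groups are looked up):

```python
def filter_by_acl(results: list[dict], user_groups: set[str],
                  top_k: int = 50) -> list[dict]:
    """Keep only chunks whose ACL intersects the user's resolved groups.
    Over-retrieval upstream (200 candidates) compensates for drops here."""
    allowed = [r for r in results if user_groups & set(r["acl"])]
    return allowed[:top_k]
```

Because filtering happens after retrieval but before re-ranking and generation, no unauthorized chunk can ever reach the LLM prompt or the citations.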

Edge Cases

7 Scaling & Production

Sharding Strategy

Caching

Incremental Updates

Multi-Model Embeddings

8 ML & Evaluation

Fine-Tuning

Evaluation Metrics

  ┌────────────────────────────┬────────┬───────────────────────────────────────────────────────────────┐
  │ Metric                     │ Target │ How Measured                                                  │
  ├────────────────────────────┼────────┼───────────────────────────────────────────────────────────────┤
  │ MRR (Mean Reciprocal Rank) │ >0.65  │ Position of first relevant result in top-10                   │
  │ NDCG@10                    │ >0.70  │ Graded relevance of top-10 results                            │
  │ Answer Accuracy            │ >90%   │ Human eval of LLM-generated answers (weekly sample)           │
  │ Citation Accuracy          │ >95%   │ Do citations actually support the generated answer?           │
  │ Permission Accuracy        │ 100%   │ Automated audit: compare returned results against source ACLs │
  └────────────────────────────┴────────┴───────────────────────────────────────────────────────────────┘
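
MRR, for example, can be computed directly from binary relevance judgments of each query's top-10 (a sketch; graded metrics like NDCG additionally weight by relevance level and apply a log-rank discount):

```python
def mrr_at_10(judgments: list[list[bool]]) -> float:
    """judgments: per query, whether each of its top-10 results is relevant.
    MRR = mean over queries of 1/rank of the first relevant result
    (contributing 0 when no relevant result appears in the top-10)."""
    total = 0.0
    for relevant in judgments:
        for rank, is_rel in enumerate(relevant[:10], start=1):
            if is_rel:
                total += 1.0 / rank
                break
    return total / len(judgments)
```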

9 Cheat Sheet

Enterprise Search (Agentic RAG) — Key Numbers

  • 350M docs, 1.75B chunks, ~10.8TB fp32 vectors (~5.4TB fp16)
  • Hybrid retrieval: Vector (semantic) + BM25 (keyword) + RRF fusion
  • Cross-encoder re-ranks top-50 in ~100ms
  • Permission filtering: over-retrieve 200, filter by ACL, return 10
  • Redis permission cache TTL 5 min
  • <500ms retrieval, <3s end-to-end with answer
  • Shard by tenant_id for isolation + independent scaling
  • Webhook + polling hybrid for <5 min freshness
  • 100-token overlap between chunks preserves context
  • Content hash avoids re-embedding unchanged chunks (80% cost savings)