10 real-world AI agent architectures covering orchestration, RAG, memory, sandboxed execution, knowledge graphs, and latency optimization — built for Staff-level interviews.
Enterprise-grade designs covering the full spectrum of agentic AI systems — from orchestration to optimization.
Core Product — Design a platform where an AI agent receives natural language requests and executes multi-step workflows across enterprise systems.
"Design a platform where an AI agent receives natural language requests from employees (via Slack/Teams), reasons about what actions to take, executes multi-step workflows across enterprise systems (ServiceNow, Jira, Salesforce, Okta), and returns results. Multi-tenant, reliable, <5 seconds."
350+ customers × 15K employees = ~5M users. 10% DAU = 500K/day. 3 requests each = 1.5M req/day. Peak QPS ~50. Per request: 1 LLM planning (1-3s) + 2-3 tool calls (200ms-2s) + 1 LLM response. Storage: 10TB/year.
User (Slack/Teams)
→ API Gateway (Auth, Rate Limit, WebSocket)
→ Session Manager (Redis: context, history, user profile)
→ REASONING ENGINE
┌───────────────┐
│ Planning (LLM)│ → Decompose request into steps
│ Execution Eng │ → Run tool calls with retry/CB
│ Observation │ → Evaluate, re-plan if needed
└───────────────┘
→ Tool Registry (per tenant)
→ State Manager (Redis + PG)
→ ServiceNow, Jira, Salesforce, Okta
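The plan → execute → observe loop above can be sketched in a few lines. The plan shape, tool registry, and retry policy here are hypothetical stand-ins; a real planner would produce the plan from the LLM call and re-plan on failed observations, with a circuit breaker around each connector.

```python
import time

def call_with_retry(tool, args, retries=2, backoff=0.0):
    """Execute one tool call with bounded retries (circuit breaker omitted)."""
    for attempt in range(retries + 1):
        try:
            return tool(**args)
        except Exception:
            if attempt == retries:
                raise
            time.sleep(backoff * (2 ** attempt))

def run_workflow(plan, tools):
    """Run an ordered plan; each step names a registered tool and its args."""
    observations = []
    for step in plan:
        result = call_with_retry(tools[step["tool"]], step["args"])
        observations.append({"step": step["tool"], "result": result})
    return observations

# Usage: two fake enterprise tools standing in for ServiceNow/Jira calls.
tools = {
    "create_ticket": lambda summary: {"id": "INC-1", "summary": summary},
    "assign": lambda ticket_id, group: {"id": ticket_id, "group": group},
}
plan = [
    {"tool": "create_ticket", "args": {"summary": "VPN down"}},
    {"tool": "assign", "args": {"ticket_id": "INC-1", "group": "Network"}},
]
obs = run_workflow(plan, tools)
```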
Confirmed Asked — Design a search system indexing Confluence, SharePoint, Slack, and Google Drive with natural language queries and per-user access permissions.
"Design a search system that indexes documents from Confluence, SharePoint, Slack, Google Drive, and ServiceNow KB. Employees search with natural language. Results must respect per-user access permissions. Multi-tenant, <500ms latency."
350 customers × 1M docs = 350M documents. Avg doc = 5 chunks × 500 tokens. Embeddings: 350M × 5 × 1536 dims × 4 bytes ≈ 11TB vector storage. Search QPS ~30. <500ms retrieval, <3s answer.
INGESTION PIPELINE:
Sources (Confluence, Slack, SharePoint) → Connectors → Extract Text → Chunk (500-1K tokens) → Embed → Store (Vector DB + Elasticsearch)
QUERY PIPELINE:
User Query → NLU (intent) → Query Expansion (synonyms) → Hybrid Retrieval (Vector + BM25) → Permission Filter → Re-rank (Cross-encoder) → Answer Gen (LLM + citations)
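One common way to implement the hybrid-retrieval and permission steps is reciprocal rank fusion over the vector and BM25 result lists, followed by a post-retrieval ACL check. The doc IDs and ACL structure below are illustrative only.

```python
def reciprocal_rank_fusion(ranked_lists, k=60):
    """Merge ranked doc-id lists (e.g. vector + BM25) by summing 1/(k + rank)."""
    scores = {}
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank + 1)
    return sorted(scores, key=scores.get, reverse=True)

def permission_filter(doc_ids, acl, user):
    """Drop documents the user cannot read (post-retrieval ACL check)."""
    return [d for d in doc_ids if user in acl.get(d, set())]

# Usage with made-up hits and ACLs.
vector_hits = ["d3", "d1", "d7"]
bm25_hits = ["d1", "d9", "d3"]
acl = {"d1": {"alice"}, "d3": {"alice", "bob"}, "d7": set(), "d9": {"bob"}}
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
visible = permission_filter(fused, acl, "alice")
```

Filtering after fusion keeps the retrieval indexes permission-agnostic; at scale you would also pre-filter at query time so pagination stays stable.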
Cody's Expertise — Automatically route IT support tickets to the correct assignment group with >95% accuracy using collective learning.
"Design a system that automatically routes incoming IT support tickets to the correct assignment group (out of hundreds) with >95% accuracy. Must handle small data per customer and learn across organizations."
350 customers × 1K tickets/day = 350K tickets/day. Peak ~15/sec. Inference <100ms (classification, not generation). Nightly batch retraining.
Ticket Created (ServiceNow/Jira)
│
Feature Extraction (ALL fields, not just short desc)
│
┌────▼────┐
│ BERT │ Fine-tuned multi-class classifier
│ Encoder │ Input: concatenated ticket fields
│ │ Output: probability per assignment group
└────┬────┘
│
Confidence Router
├── High (>0.95): Auto-route ✓
├── Medium (0.7-0.95): Route + flag for review
└── Low (<0.7): Human triage queue
│
Feedback Loop → Correct labels → Retrain
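The confidence router reduces to a threshold check on the classifier's top-class probability; the thresholds below mirror the ones in the diagram.

```python
def route(probs, high=0.95, low=0.70):
    """Route a ticket based on the classifier's top-class probability."""
    group = max(probs, key=probs.get)
    p = probs[group]
    if p > high:
        return ("auto", group)       # auto-route to the predicted group
    if p >= low:
        return ("review", group)     # route, but flag for human review
    return ("triage", None)          # send to the human triage queue

# Usage with a made-up probability vector.
decision = route({"network": 0.97, "database": 0.03})
```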
Real-World Experience — Design a system where AI agents request approvals from humans, with configurable chains, SLA escalation, and multi-channel delivery.
"Design a system where AI agents can request approvals from humans. Determine approval chain, notify approvers in real-time, track state, escalate overdue approvals, integrate with Slack/Email/Teams."
100K approval requests/day. Avg 2.5 approvers/chain = 250K notifications/day. Delivery <1s. State changes ~500K/day.
Agent Request → Rules Engine → Approval Chain Builder → Kafka Queue
               (determine       (ordered list              │
                approval chain)  of approvers)   ┌─────────┼─────────┐
                                                 ▼         ▼         ▼
                                               Slack     Email     Teams
                                               Worker    Worker    Worker
                                                           │
                                             State Tracker (Redis + PG)
                                                           │
                                             Escalation Scheduler
                                             (check SLA, notify manager)
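The escalation scheduler is essentially a periodic sweep for pending approvals past their SLA. The request shape and the 4-hour SLA below are assumptions for illustration.

```python
from datetime import datetime, timedelta

def overdue(requests, now, sla=timedelta(hours=4)):
    """Return IDs of pending approvals whose SLA has lapsed (escalation candidates)."""
    return [r["id"] for r in requests
            if r["state"] == "pending" and now - r["created"] > sla]

# Usage: one breached, one in-SLA, one already approved.
now = datetime(2024, 1, 1, 12, 0)
requests = [
    {"id": "A1", "state": "pending", "created": datetime(2024, 1, 1, 7, 0)},
    {"id": "A2", "state": "pending", "created": datetime(2024, 1, 1, 11, 0)},
    {"id": "A3", "state": "approved", "created": datetime(2024, 1, 1, 6, 0)},
]
late = overdue(requests, now)
```

In production this query runs against the State Tracker's PG table on a timer (or via Redis sorted-set deadlines), and each hit notifies the approver's manager.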
Enterprise Experience (MuleSoft/Integration) — Design an extensible platform connecting enterprise systems with different APIs, auth methods, and rate limits.
"Design a platform that lets enterprise customers connect their business systems (ServiceNow, Salesforce, Workday, SAP, Okta) to an AI agent. Different systems, API versions, auth methods. Extensible, reliable, secure."
350 customers × 10 connectors = 3,500 instances. 100 API calls/day each = 350K external calls/day. ~1K credential rotations/day. ~50 schema updates/day.
AI AGENT / PLATFORM CORE
→ CONNECTOR GATEWAY (Tenant Resolver, Request Validator, Routing)
→ CONNECTOR RUNTIME ENGINE
MIDDLEWARE CHAIN: Auth → Rate Limit → Transform → Execute → Retry → Log
→ CONNECTOR REGISTRY (PostgreSQL): Templates, Versions, Schemas
→ CONFIG STORE (PostgreSQL): Per-tenant configs, Field mappings
→ CREDENTIAL VAULT (HashiCorp Vault): OAuth tokens, API keys, mTLS certs
→ EXTERNAL SYSTEMS: ServiceNow, Salesforce, Workday, SAP, Okta
→ CROSS-CUTTING: Rate Limiter (Redis), Circuit Breaker, Metrics (OTel), Audit Log
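The middleware chain can be composed as nested handlers, outermost first. `auth_mw` and `logging_mw` below are hypothetical examples of two stages; real stages would carry tenant context and connector config.

```python
def build_chain(middlewares, execute):
    """Compose middleware (auth, rate limit, transform, ...) around execute."""
    handler = execute
    for mw in reversed(middlewares):   # first in list becomes outermost
        handler = mw(handler)
    return handler

def logging_mw(events):
    def wrap(next_handler):
        def handler(req):
            events.append(("log", req["path"]))
            return next_handler(req)
        return handler
    return wrap

def auth_mw(next_handler):
    def handler(req):
        if "token" not in req:
            raise PermissionError("missing credential")
        return next_handler(req)
    return handler

# Usage: auth runs before logging, which runs before the actual API call.
events = []
chain = build_chain([auth_mw, logging_mw(events)], lambda req: {"status": 200})
resp = chain({"path": "/incidents", "token": "t0"})
denied = False
try:
    chain({"path": "/incidents"})      # no token: rejected before execution
except PermissionError:
    denied = True
```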
For ML Staff — Design a system serving multiple LLM models with routing, A/B testing, canary deployments, evaluation, and automatic rollback.
"Design a system that serves multiple LLM models (GPT-4, Claude, open-source) for an enterprise AI agent platform. Support model routing, A/B testing, canary deployments, evaluation, and automatic rollback."
1.5M req/day × 2 LLM calls = 3M calls/day. Avg 700 tokens/call = 2.1B tokens/day. Simple (70%) at $0.002/1K = $2,940/day. Complex (30%) at $0.03/1K = $18,900/day. Total WITH routing: $21,840/day vs WITHOUT: $63,000/day. Savings: $41,160/day (65%).
Agent Request → Model Router → Model Gateway → LLM Provider
               (simple/complex)      │         (OpenAI/Anthropic/Self-hosted)
                               ┌─────▼──────┐
                               │ A/B Testing│
                               │ Framework  │
                               └─────┬──────┘
                               ┌─────▼──────┐
                               │ Evaluation │
                               │ Pipeline   │
                               └─────┬──────┘
                                     ▼
                           Metrics + Alerts → Auto-rollback
3-Tier Model Routing:
Fast Tier (Llama-3, Mistral): ~70% traffic
Mid Tier (Claude Haiku): ~10% traffic
Power Tier (GPT-4, Claude Opus): ~20% traffic
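The cost figures quoted above can be checked with straightforward arithmetic:

```python
# Back-of-envelope check of the routing savings quoted above.
calls_per_day = 3_000_000
tokens_per_call = 700
tokens = calls_per_day * tokens_per_call            # 2.1B tokens/day

cheap_rate, premium_rate = 0.002, 0.03              # $ per 1K tokens
with_routing = (tokens * 0.7 * cheap_rate + tokens * 0.3 * premium_rate) / 1000
without_routing = tokens * premium_rate / 1000
savings = without_routing - with_routing            # ≈ $41,160/day, ~65%
```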
In JD: 'Agent Memory' — Design a memory system with short-term, working, and long-term memory for an AI agent. Fast, scalable, privacy-compliant.
"Design a memory system for an AI agent that maintains short-term memory (conversation), working memory (intermediate results), and long-term memory (preferences, past interactions, learned patterns). Fast, scalable, privacy-compliant."
500K DAU × 3 conversations × 10 messages = 15M messages/day. 500K active sessions at peak. 5M users × 100 entries = 500M memory records. ~50 req/sec working memory writes. Redis: 24.8 GB. PostgreSQL: 0.93 TB text. Vector DB: 2.85 TB.
┌──────────────────────────────────────────┐
│              MEMORY SYSTEM               │
│ ┌─────────────┐  ┌──────────────────┐    │
│ │ Short-term  │  │ Working Memory   │    │
│ │ (Redis)     │  │ (Redis)          │    │
│ │ Session ctx │  │ Task state,      │    │
│ │ TTL: 24hr   │  │ intermediate     │    │
│ │             │  │ results TTL: 1hr │    │
│ └─────────────┘  └──────────────────┘    │
│ ┌──────────────────────────────────┐     │
│ │ Long-term Memory (PostgreSQL +   │     │
│ │ pgvector)                        │     │
│ │ • User preferences               │     │
│ │ • Past interaction summaries     │     │
│ │ • Learned patterns               │     │
│ │ • Semantic search via embeddings │     │
│ └──────────────────────────────────┘     │
└──────────────────────────────────────────┘
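A minimal in-process stand-in for the Redis-backed short-term/working tiers, showing TTL-based eviction on read (Redis would handle expiry natively via `EXPIRE`); the key names are illustrative.

```python
import time

class ShortTermMemory:
    """In-process stand-in for a Redis tier with TTL eviction."""
    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.store = {}

    def put(self, key, value, now=None):
        ts = now if now is not None else time.time()
        self.store[key] = (ts, value)

    def get(self, key, now=None):
        if key not in self.store:
            return None
        t = now if now is not None else time.time()
        ts, value = self.store[key]
        if t - ts > self.ttl:          # past TTL: evict and miss
            del self.store[key]
            return None
        return value

# Usage: working-memory tier (1h TTL); explicit clocks keep this deterministic.
wm = ShortTermMemory(ttl_seconds=3600)
wm.put("task:42", {"step": 2}, now=0)
hit = wm.get("task:42", now=100)       # within TTL
miss = wm.get("task:42", now=4000)     # past TTL, evicted
```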
In JD: 'Sandboxed Code Execution' — Design a system where an AI agent generates and executes code on behalf of enterprise users. Sandboxed, time-limited, auditable.
"Design a system where an AI agent can generate and execute code (Python, SQL, shell) on behalf of enterprise users. Sandboxed (can't access other users' data or host), time-limited, auditable."
50K executions/day. Avg 5s execution. Peak concurrent: 200 sandboxes. Storage: ~50MB/day. Container spin-up target: <2s.
Agent generates code → Code Validator (security scan) → Sandbox Pool (pre-warmed containers) → Execute (timeout enforced) → Return Output (text/table/chart/file)
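A sketch of the timeout-enforcement and output-capture pieces using a child process. A real deployment would run this inside a locked-down container (no network, read-only filesystem, seccomp profile, resource limits), not a bare subprocess.

```python
import subprocess
import sys

def run_sandboxed(code, timeout=5):
    """Run generated Python in a separate process with a hard timeout.

    Captures stdout for the audit log; kills the child on timeout.
    """
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True, text=True, timeout=timeout,
        )
        return {"ok": proc.returncode == 0, "stdout": proc.stdout}
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "error": "timeout"}

# Usage: a normal run and a runaway loop cut off by the timeout.
result = run_sandboxed("print(2 + 2)")
timed_out = run_sandboxed("while True: pass", timeout=1)
```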
In JD: 'Knowledge Graphs' — Design a knowledge graph of org structure, systems, and relationships that the AI agent uses for decision-making.
"Design a knowledge graph representing org structure (people, teams, roles), systems (apps, permissions, data), and relationships. AI agent uses this graph for decisions."
350 customers × 50K nodes = 17.5M total nodes. 10 edges/node = 175M edges. 1M graph queries/day (agent lookups). <50ms for 2-hop traversals.
Data Sources (AD, Workday, Okta, CMDB) → Sync Pipeline (CDC/webhooks + batch sync) → Graph Store (Neo4j or Neptune) → Query Engine (Cypher/GraphQL) → Agent (context for reasoning)
Example Graph: (John)─[MANAGES]→(Alice)─[HAS_ACCESS_TO]→(Salesforce)─[OWNED_BY]→(Sales Team)
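The <50ms 2-hop traversal is a bounded BFS; a plain adjacency map stands in for Neo4j/Neptune here, using the example graph above.

```python
def neighbors_within(graph, start, hops):
    """BFS up to `hops` edges out; graph maps node -> [(relation, node), ...]."""
    frontier, seen = {start}, {start}
    for _ in range(hops):
        frontier = {dst for n in frontier for _, dst in graph.get(n, [])} - seen
        seen |= frontier
    return seen - {start}

# Usage: the org/system example from above.
graph = {
    "John": [("MANAGES", "Alice")],
    "Alice": [("HAS_ACCESS_TO", "Salesforce")],
    "Salesforce": [("OWNED_BY", "Sales Team")],
}
two_hop = neighbors_within(graph, "John", 2)
```

The equivalent Cypher would be a bounded variable-length match such as `MATCH (p {name: "John"})-[*1..2]->(n) RETURN n`.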
In JD: 'Latency Optimization' — Reduce AI agent response time from 8-15s to under 3s while maintaining accuracy through caching, parallelism, and model routing.
"The AI agent takes 8-15s to respond. Design a system to reduce to under 3s while maintaining accuracy. Bottlenecks: LLM inference (2-4s), enterprise API calls (1-3s each), cold start for infrequent connectors."
Current: 8-15s. Target: <3s for 80%, <5s for 95%. 1.5M req/day, 70% common patterns.
Request → Query Classifier → Cache Hit? ──Yes──→ Instant Response (<100ms)
                                 │
                                 No
                                 │
                           Model Router
                            ├── Fast Model (simple) → Single Tool Call → Stream Response (<2s)
                            └── Powerful Model (complex) → Parallel Tool Calls + Speculative Execution → Stream (<4s)
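The cache-then-route front door can be sketched as one function; the classifier, models, and cache key scheme below are stubbed assumptions (a real system would use semantic cache keys and tenant scoping).

```python
def handle(query, cache, classify, fast, powerful):
    """Front door: cache lookup first, then route simple vs complex queries."""
    key = query.strip().lower()          # naive cache key; real systems hash + scope by tenant
    if key in cache:
        return cache[key]                # the <100ms instant-response path
    answer = fast(query) if classify(query) == "simple" else powerful(query)
    cache[key] = answer
    return answer

# Usage with stub classifier/models.
cache = {}
classify = lambda q: "simple" if len(q.split()) < 6 else "complex"
fast = lambda q: f"fast:{q}"
powerful = lambda q: f"powerful:{q}"
first = handle("reset my VPN", cache, classify, fast, powerful)
second = handle("reset my VPN", cache, classify, fast, powerful)   # cache hit
```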