Core Product — Design a platform where an AI agent receives natural language requests and executes multi-step workflows across enterprise systems.
"Design a platform where an AI agent receives natural language requests from employees (via Slack/Teams), reasons about what actions to take, executes multi-step workflows across enterprise systems (ServiceNow, Jira, Salesforce, Okta), and returns results. The system must be multi-tenant, reliable, and respond within 5 seconds."
| # | Your Question | Expected Answer |
|---|---|---|
| Q1 | Should I focus on the reasoning/LLM layer or the execution infrastructure? | Both, but emphasize execution infra as SWE |
| Q2 | Latency target: conversational (<5s) or async background tasks? | Conversational — users expect fast response |
| Q3 | How many enterprise systems per customer? | 5-20 connectors |
| Q4 | Multi-tenant with different configs per customer? | Yes |
| Q5 | Permission model: does agent act as user or as system? | As user (user-level permissions) |
USERS:
350+ customers x 15,000 employees avg = ~5 million total users
10% DAU (Daily Active Users) = 500,000 users/day
Avg 3 requests per user = 1.5 million requests/day

QPS (Queries Per Second):
1.5M requests / 86,400 seconds = ~17 req/sec (average)
Peak (3x average) = ~50 req/sec

PER-REQUEST BREAKDOWN:
1 LLM call for planning → 1-3 seconds
2-3 tool/API calls → 200ms-2s each
1 LLM call for response → 1-2 seconds
Target: < 5 seconds end-to-end

STORAGE:
Each conversation: ~5KB (messages + metadata)
1.5M conversations/day x 5KB = 7.5 GB/day
+ Audit logs = ~10 TB/year
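The arithmetic above can be sanity-checked in a few lines (the 10% DAU rate, 3x peak factor, and 5 KB/conversation are the estimate's stated assumptions):

```python
# Back-of-envelope check of the capacity estimates above.
total_users = 350 * 15_000             # 5.25M, rounded to ~5M in the estimate
dau = 500_000                          # ~10% of ~5M are daily actives
requests_per_day = dau * 3             # 3 requests per active user = 1.5M/day
avg_qps = requests_per_day / 86_400    # seconds in a day -> ~17 req/s
peak_qps = 3 * avg_qps                 # assumed 3x peak factor -> ~50 req/s
storage_gb_per_day = requests_per_day * 5 / 1e6   # 5 KB each -> 7.5 GB/day
```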
┌─────────────┐
│ Slack/Teams │ ──→ API Gateway (Auth, Rate Limit, WebSocket)
└──────┬──────┘ │
│ Session Manager (Redis: context, history, user profile)
│ │
│ ┌─────────┴──────────┐
│ │ REASONING ENGINE │
│ │ ┌───────────────┐ │
│ │ │ Planning (LLM)│ │ → Decompose request into steps
│ │ │ Execution Eng │ │ → Run tool calls with retry/CB
│ │ │ Observation │ │ → Evaluate, re-plan if needed
│ │ └───────────────┘ │
│ └────┬──────────┬─────┘
│ │ │
│ ┌────────┴──┐ ┌───┴────────┐
│ │Tool Registry│ │State Manager│
│ │(per tenant) │ │(Redis + PG) │
│ └────┬───────┘ └────────────┘
│ │
│ ┌──────┼──────┬──────────┐
│ ┴ ┴ ┴ ┴
│ ServiceNow Jira Salesforce Okta
The Reasoning Engine implements the ReAct (Reason + Act) pattern, the industry-standard approach for agentic AI systems. The Planning step emits a structured plan such as [{plugin:'servicenow', action:'create_ticket', params:{...}}, ...], which specifies which tools to call, in what order, and with what parameters.
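A minimal sketch of the Plan → Execute → Observe loop. The `Step` shape mirrors the plan format above; the `llm` and `tools` interfaces are assumptions for illustration, not a specific framework's API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Step:
    plugin: str                 # e.g. 'servicenow'
    action: str                 # e.g. 'create_ticket'
    params: dict[str, Any]

def react_loop(request: str, llm, tools, max_iterations: int = 5) -> str:
    """Plan -> Execute -> Observe, re-planning until the plan is empty."""
    observations: list[dict] = []
    for _ in range(max_iterations):
        # PLAN: the LLM decomposes the request into ordered tool calls
        plan: list[Step] = llm.plan(request, observations)
        if not plan:            # nothing left to do -> produce the final answer
            break
        # EXECUTE: run each tool call (retries/circuit breaking live in `tools`)
        for step in plan:
            result = tools.call(step.plugin, step.action, step.params)
            observations.append({"step": step, "result": result})
        # OBSERVE: the next iteration re-plans with the new observations
    # Final LLM call turns the observations into a user-facing response
    return llm.respond(request, observations)
```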
CIRCUIT BREAKER STATES:

┌────────┐  failures > threshold  ┌─────────┐
│ CLOSED │ ─────────────────────→ │  OPEN   │
│(normal)│                        │(fail    │
└────────┘                        │ fast)   │
    ▲                             └────┬────┘
    │ success              timeout     │
    │           ┌──────────────────────┘
    │           ▼
┌───┴───────────────┐
│     HALF-OPEN     │ ── failure → back to OPEN
│ (test one request)│
└───────────────────┘
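The three states above can be implemented in a few dozen lines. This is an illustrative in-process sketch; the threshold, timeout, and exception-based API are example choices, not a specific library's interface:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout   # seconds to stay OPEN before testing
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF-OPEN"     # timeout elapsed: let one test through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF-OPEN" or self.failures > self.failure_threshold:
                self.state = "OPEN"          # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"            # success closes the circuit
            return result
```

In the platform above, one breaker instance would wrap each downstream connector (ServiceNow, Jira, ...) and each LLM provider.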
Route ~70% of simple queries to fast-tier models ($0.002/1K tokens), ~10% to a mid tier ($0.005/1K tokens), and only ~20% of complex queries to powerful models ($0.03/1K tokens). Result: $21,840/day vs $63,000/day without routing.
Fast Tier (Llama-3, Mistral): ~70% traffic | $0.002/1K tokens
Mid Tier (Claude Haiku): ~10% traffic | $0.005/1K tokens
Power Tier (GPT-4, Claude Opus): ~20% traffic | $0.03/1K tokens
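A sketch of the routing decision. The word-count heuristic here is a stand-in assumption (real routers typically use a small classifier or an LLM judge), and the tier names and prices simply mirror the table above:

```python
# Illustrative tier table mirroring the breakdown above (prices in USD/1K tokens).
TIERS = {
    "fast":  {"models": ["llama-3", "mistral"],     "usd_per_1k_tokens": 0.002},
    "mid":   {"models": ["claude-haiku"],           "usd_per_1k_tokens": 0.005},
    "power": {"models": ["gpt-4", "claude-opus"],   "usd_per_1k_tokens": 0.03},
}

def route(query: str, requires_tools: bool) -> str:
    """Crude complexity heuristic: short, tool-free queries go to the fast tier."""
    words = len(query.split())
    if not requires_tools and words < 30:
        return "fast"        # e.g. "what is my PTO balance?"
    if words < 60:
        return "mid"         # short multi-step requests
    return "power"           # long or genuinely complex requests
```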
Model Gateway:
┌─────────────────────────────────┐
│ Circuit Breaker per provider │
│ Retry/Fallback between tiers │
│ Rate Limiter per customer │
│ Response Caching │
└─────────────────────────────────┘
│
Evaluation Pipeline (async — non-blocking):
Accuracy Score | Hallucination Detector | Latency Tracker | Cost Tracker
Rollback Controller:
Monitor metric trends → Compare vs baseline → Auto-rollback if degraded
| Decision | Rationale |
|---|---|
| 3-tier model routing | 70% cheap + 10% mid + 20% expensive = 65% cost savings |
| Async evaluation | Non-blocking; doesn't add latency to user requests |
| Circuit breaker per provider | If GPT-4 is down, fallback to Claude automatically |
| Canary deployment | Progressive rollout 5% → 25% → 100% reduces blast radius |
| Response caching | Identical queries get cached responses (Redis, TTL 1hr) |
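The response-caching row can be sketched as follows. An in-memory dict stands in for Redis here; the per-tenant key scheme is the point, since cached responses must never leak across customers:

```python
import hashlib
import time

CACHE: dict[str, tuple[float, str]] = {}   # stand-in for Redis: key -> (expiry, value)
TTL_SECONDS = 3600                         # 1 hour, matching the table above

def cache_key(tenant_id: str, query: str) -> str:
    """Scope keys per tenant so identical queries stay isolated per customer."""
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    return f"resp:{tenant_id}:{digest}"

def get_or_compute(tenant_id: str, query: str, compute) -> str:
    key = cache_key(tenant_id, query)
    entry = CACHE.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]                    # cache hit: skip the LLM pipeline
    value = compute(query)                 # cache miss: run the full pipeline
    CACHE[key] = (time.monotonic() + TTL_SECONDS, value)
    return value
```

With real Redis, the same scheme maps to SET with an EX expiry instead of the dict.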
SHORT-TERM MEMORY (Redis Cluster, TTL 24hr):
Session messages, turn count, metadata
~50 KB/session, Eventual consistency
Structure: {session_id: {messages: [...], user_context: {...}, created_at}}
WORKING MEMORY (Redis Cluster, TTL 1hr):
Task state, tool output, scratch pad
~20 KB/task, Strong consistency
Example: Step 1 returned user's manager → stored → Step 2 uses for approval
LONG-TERM MEMORY (PostgreSQL + pgvector, configurable TTL):
User prefs, interaction summaries, learned patterns, embeddings
~200 KB/user, Strong consistency
Semantic search via cosine similarity on embeddings
CONTEXT WINDOW MANAGER:
1. Gather short-term messages (current conversation)
2. Attach working memory (current task state)
3. Semantic search long-term memory for relevant past interactions
4. Summarize if total exceeds token limit
5. Assemble final prompt for LLM
The Context Window Manager is critical — it intelligently selects which information to include in the LLM's context window (8K-128K tokens), prioritizing recency, relevance, and task state.
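The five steps above can be sketched as follows; `memory_store.search`, `embed`, and `summarize` are assumed interfaces for illustration, not a specific library's API:

```python
def build_context(session, task_state, memory_store, embed,
                  token_limit=8_000, summarize=None):
    """Assemble the LLM prompt from the three memory tiers (illustrative sketch)."""
    parts = list(session["messages"])                    # 1. short-term: conversation
    parts.append(f"TASK STATE: {task_state}")            # 2. working memory: task state
    query_vec = embed(session["messages"][-1])           # embed latest turn as the query
    for hit in memory_store.search(query_vec, top_k=3):  # 3. long-term: semantic search
        parts.append(f"RELEVANT HISTORY: {hit}")
    prompt = "\n".join(parts)
    if len(prompt) // 4 > token_limit and summarize:     # 4. ~4 chars/token heuristic
        prompt = summarize(prompt, token_limit)
    return prompt                                        # 5. final assembled prompt
```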
[0-5 min] Ask 5 clarifying questions (show the table)
[5-10 min] Walk through estimation: users → QPS → latency → storage
[10-20 min] Draw the architecture diagram, explain all 5 layers
[20-35 min] Deep dive: Reasoning Engine (Plan → Execute → Observe)
Then: Circuit breaker pattern, Plugin Registry, State Manager
[35-45 min] Scaling: parallel execution, streaming, caching, model routing
Trade-offs: cost vs latency, consistency vs availability
Set a timer for 45 minutes. Talk through each section aloud. Record yourself and listen back for where you hesitate.
ARCHITECTURE: User → Gateway → Reasoning Engine → Tools → Response (streamed)
REASONING ENGINE: PLAN → EXECUTE → OBSERVE → (re-plan if needed)
KEY PATTERNS: Circuit Breaker, Retry with exp backoff, Parallel execution,
Idempotency keys, Dead Letter Queue
MULTI-TENANCY: Per-tenant Tool Registry, Per-tenant credentials in Vault,
User-level permissions on every tool call
DATA STORES: Redis (session state), PostgreSQL (audit logs), Vault (credentials)
NUMBERS: 5M users | 500K DAU | 1.5M req/day | ~50 peak QPS |
<5s latency | 10 TB/year storage