Core Product — Design a platform where an AI agent receives natural language requests and executes multi-step workflows across enterprise systems.
"Design a platform where an AI agent receives natural language requests from employees (via Slack/Teams), reasons about what actions to take, executes multi-step workflows across enterprise systems (ServiceNow, Jira, Salesforce, Okta), and returns results. The system must be multi-tenant, reliable, and respond within 5 seconds."
| # | Your Question | Expected Answer |
|---|---|---|
| Q1 | Should I focus on the reasoning/LLM layer or the execution infrastructure? | Both, but emphasize execution infra as SWE |
| Q2 | Latency target: conversational (<5s) or async background tasks? | Conversational — users expect fast response |
| Q3 | How many enterprise systems per customer? | 5-20 connectors |
| Q4 | Multi-tenant with different configs per customer? | Yes |
| Q5 | Permission model: does agent act as user or as system? | As user (user-level permissions) |
USERS:
350+ customers x 15,000 employees avg = ~5 million total users
10% DAU (Daily Active Users) = 500,000 users/day
Avg 3 requests per user = 1.5 million requests/day

QPS (Queries Per Second):
1.5M requests / 86,400 seconds = ~17 req/sec (average)
Peak (3x average) = ~50 req/sec

PER-REQUEST BREAKDOWN:
1 LLM call for planning → 1-3 seconds
2-3 tool/API calls → 200ms-2s each
1 LLM call for response → 1-2 seconds
Target: < 5 seconds end-to-end

STORAGE:
Each conversation: ~5KB (messages + metadata)
1.5M conversations/day x 5KB = 7.5 GB/day
+ Audit logs = ~10 TB/year
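The arithmetic above can be sanity-checked in a few lines (the 10% DAU rate, 3x peak factor, and 5 KB/conversation are the estimate's stated assumptions):

```python
# Back-of-envelope check of the capacity estimates above.
total_users = 350 * 15_000             # 5.25M, rounded to ~5M in the estimate
dau = 500_000                          # ~10% of ~5M are daily actives
requests_per_day = dau * 3             # 3 requests per active user = 1.5M/day
avg_qps = requests_per_day / 86_400    # seconds in a day -> ~17 req/s
peak_qps = 3 * avg_qps                 # assumed 3x peak factor -> ~50 req/s
storage_gb_per_day = requests_per_day * 5 / 1e6   # 5 KB each -> 7.5 GB/day
```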
┌─────────────┐
│ Slack/Teams │ ──→ API Gateway (Auth, Rate Limit, WebSocket)
└──────┬──────┘ │
│ Session Manager (Redis: context, history, user profile)
│ │
│ ┌─────────┴──────────┐
│ │ REASONING ENGINE │
│ │ ┌───────────────┐ │
│ │ │ Planning (LLM)│ │ → Decompose request into steps
│ │ │ Execution Eng │ │ → Run tool calls with retry/CB
│ │ │ Observation │ │ → Evaluate, re-plan if needed
│ │ └───────────────┘ │
│ └────┬──────────┬─────┘
│ │ │
│ ┌────────┴──┐ ┌───┴────────┐
│ │Tool Registry│ │State Manager│
│ │(per tenant) │ │(Redis + PG) │
│ └────┬───────┘ └────────────┘
│ │
│ ┌──────┼──────┬──────────┐
│ ┴ ┴ ┴ ┴
│ ServiceNow Jira Salesforce Okta
The Reasoning Engine implements the ReAct (Reason + Act) pattern, the industry-standard approach for agentic AI systems. The Planning step emits a structured plan such as [{plugin:'servicenow', action:'create_ticket', params:{...}}, ...], which specifies which tools to call, in what order, and with what parameters.
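A minimal sketch of the Plan → Execute → Observe loop. The `Step` shape mirrors the plan format above; the `llm` and `tools` interfaces are assumptions for illustration, not a specific framework's API:

```python
from dataclasses import dataclass
from typing import Any

@dataclass
class Step:
    plugin: str                 # e.g. 'servicenow'
    action: str                 # e.g. 'create_ticket'
    params: dict[str, Any]

def react_loop(request: str, llm, tools, max_iterations: int = 5) -> str:
    """Plan -> Execute -> Observe, re-planning until the plan is empty."""
    observations: list[dict] = []
    for _ in range(max_iterations):
        # PLAN: the LLM decomposes the request into ordered tool calls
        plan: list[Step] = llm.plan(request, observations)
        if not plan:            # nothing left to do -> produce the final answer
            break
        # EXECUTE: run each tool call (retries/circuit breaking live in `tools`)
        for step in plan:
            result = tools.call(step.plugin, step.action, step.params)
            observations.append({"step": step, "result": result})
        # OBSERVE: the next iteration re-plans with the new observations
    # Final LLM call turns the observations into a user-facing response
    return llm.respond(request, observations)
```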
CIRCUIT BREAKER STATES:

┌────────┐  failures > threshold  ┌─────────┐
│ CLOSED │ ─────────────────────→ │  OPEN   │
│(normal)│                        │(fail    │
└────────┘                        │ fast)   │
    ▲                             └────┬────┘
    │ success              timeout     │
    │           ┌──────────────────────┘
    │           ▼
┌───┴───────────────┐
│     HALF-OPEN     │ ── failure → back to OPEN
│ (test one request)│
└───────────────────┘
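The three states above can be implemented in a few dozen lines. This is an illustrative in-process sketch; the threshold, timeout, and exception-based API are example choices, not a specific library's interface:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker: CLOSED -> OPEN -> HALF-OPEN -> CLOSED."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout   # seconds to stay OPEN before testing
        self.failures = 0
        self.state = "CLOSED"
        self.opened_at = 0.0

    def call(self, fn, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.reset_timeout:
                self.state = "HALF-OPEN"     # timeout elapsed: let one test through
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            if self.state == "HALF-OPEN" or self.failures > self.failure_threshold:
                self.state = "OPEN"          # trip (or re-trip) the breaker
                self.opened_at = time.monotonic()
            raise
        else:
            self.failures = 0
            self.state = "CLOSED"            # success closes the circuit
            return result
```

In the platform above, one breaker instance would wrap each downstream connector (ServiceNow, Jira, ...) and each LLM provider.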
Route ~70% of simple queries to fast-tier models ($0.002/1K tokens), ~10% to a mid tier ($0.005/1K tokens), and only ~20% of complex queries to powerful models ($0.03/1K tokens). Result: $21,840/day vs $63,000/day without routing.
Fast Tier (Llama-3, Mistral): ~70% traffic | $0.002/1K tokens
Mid Tier (Claude Haiku): ~10% traffic | $0.005/1K tokens
Power Tier (GPT-4, Claude Opus): ~20% traffic | $0.03/1K tokens
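A sketch of the routing decision. The word-count heuristic here is a stand-in assumption (real routers typically use a small classifier or an LLM judge), and the tier names and prices simply mirror the table above:

```python
# Illustrative tier table mirroring the breakdown above (prices in USD/1K tokens).
TIERS = {
    "fast":  {"models": ["llama-3", "mistral"],     "usd_per_1k_tokens": 0.002},
    "mid":   {"models": ["claude-haiku"],           "usd_per_1k_tokens": 0.005},
    "power": {"models": ["gpt-4", "claude-opus"],   "usd_per_1k_tokens": 0.03},
}

def route(query: str, requires_tools: bool) -> str:
    """Crude complexity heuristic: short, tool-free queries go to the fast tier."""
    words = len(query.split())
    if not requires_tools and words < 30:
        return "fast"        # e.g. "what is my PTO balance?"
    if words < 60:
        return "mid"         # short multi-step requests
    return "power"           # long or genuinely complex requests
```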
Model Gateway:
┌─────────────────────────────────┐
│ Circuit Breaker per provider │
│ Retry/Fallback between tiers │
│ Rate Limiter per customer │
│ Response Caching │
└─────────────────────────────────┘
│
Evaluation Pipeline (async — non-blocking):
Accuracy Score | Hallucination Detector | Latency Tracker | Cost Tracker
Rollback Controller:
Monitor metric trends → Compare vs baseline → Auto-rollback if degraded
| Decision | Rationale |
|---|---|
| 3-tier model routing | 70% cheap + 10% mid + 20% expensive = 65% cost savings |
| Async evaluation | Non-blocking; doesn't add latency to user requests |
| Circuit breaker per provider | If GPT-4 is down, fallback to Claude automatically |
| Canary deployment | Progressive rollout 5% → 25% → 100% reduces blast radius |
| Response caching | Identical queries get cached responses (Redis, TTL 1hr) |
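The response-caching row can be sketched as follows. An in-memory dict stands in for Redis here; the per-tenant key scheme is the point, since cached responses must never leak across customers:

```python
import hashlib
import time

CACHE: dict[str, tuple[float, str]] = {}   # stand-in for Redis: key -> (expiry, value)
TTL_SECONDS = 3600                         # 1 hour, matching the table above

def cache_key(tenant_id: str, query: str) -> str:
    """Scope keys per tenant so identical queries stay isolated per customer."""
    normalized = " ".join(query.lower().split())
    digest = hashlib.sha256(normalized.encode()).hexdigest()
    return f"resp:{tenant_id}:{digest}"

def get_or_compute(tenant_id: str, query: str, compute) -> str:
    key = cache_key(tenant_id, query)
    entry = CACHE.get(key)
    if entry and entry[0] > time.monotonic():
        return entry[1]                    # cache hit: skip the LLM pipeline
    value = compute(query)                 # cache miss: run the full pipeline
    CACHE[key] = (time.monotonic() + TTL_SECONDS, value)
    return value
```

With real Redis, the same scheme maps to SET with an EX expiry instead of the dict.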
SHORT-TERM MEMORY (Redis Cluster, TTL 24hr):
Session messages, turn count, metadata
~50 KB/session, Eventual consistency
Structure: {session_id: {messages: [...], user_context: {...}, created_at}}
WORKING MEMORY (Redis Cluster, TTL 1hr):
Task state, tool output, scratch pad
~20 KB/task, Strong consistency
Example: Step 1 returned user's manager → stored → Step 2 uses for approval
LONG-TERM MEMORY (PostgreSQL + pgvector, configurable TTL):
User prefs, interaction summaries, learned patterns, embeddings
~200 KB/user, Strong consistency
Semantic search via cosine similarity on embeddings
CONTEXT WINDOW MANAGER:
1. Gather short-term messages (current conversation)
2. Attach working memory (current task state)
3. Semantic search long-term memory for relevant past interactions
4. Summarize if total exceeds token limit
5. Assemble final prompt for LLM
The Context Window Manager is critical — it intelligently selects which information to include in the LLM's context window (8K-128K tokens), prioritizing recency, relevance, and task state.
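The five steps above can be sketched as follows; `memory_store.search`, `embed`, and `summarize` are assumed interfaces for illustration, not a specific library's API:

```python
def build_context(session, task_state, memory_store, embed,
                  token_limit=8_000, summarize=None):
    """Assemble the LLM prompt from the three memory tiers (illustrative sketch)."""
    parts = list(session["messages"])                    # 1. short-term: conversation
    parts.append(f"TASK STATE: {task_state}")            # 2. working memory: task state
    query_vec = embed(session["messages"][-1])           # embed latest turn as the query
    for hit in memory_store.search(query_vec, top_k=3):  # 3. long-term: semantic search
        parts.append(f"RELEVANT HISTORY: {hit}")
    prompt = "\n".join(parts)
    if len(prompt) // 4 > token_limit and summarize:     # 4. ~4 chars/token heuristic
        prompt = summarize(prompt, token_limit)
    return prompt                                        # 5. final assembled prompt
```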
[0-5 min] Ask 5 clarifying questions (show the table)
[5-10 min] Walk through estimation: users → QPS → latency → storage
[10-20 min] Draw the architecture diagram, explain all 5 layers
[20-35 min] Deep dive: Reasoning Engine (Plan → Execute → Observe)
Then: Circuit breaker pattern, Plugin Registry, State Manager
[35-45 min] Scaling: parallel execution, streaming, caching, model routing
Trade-offs: cost vs latency, consistency vs availability
Set a timer for 45 minutes. Talk through each section aloud. Record yourself and listen back for where you hesitate.
ARCHITECTURE: User → Gateway → Reasoning Engine → Tools → Response (streamed)
REASONING ENGINE: PLAN → EXECUTE → OBSERVE → (re-plan if needed)
KEY PATTERNS: Circuit Breaker, Retry with exp backoff, Parallel execution,
Idempotency keys, Dead Letter Queue
MULTI-TENANCY: Per-tenant Tool Registry, Per-tenant credentials in Vault,
User-level permissions on every tool call
DATA STORES: Redis (session state), PostgreSQL (audit logs), Vault (credentials)
NUMBERS: 5M users | 500K DAU | 1.5M req/day | ~50 peak QPS |
<5s latency | 10 TB/year storage