🤖 AI Agent Orchestration Platform

Core Product — Design a platform where an AI agent receives natural language requests and executes multi-step workflows across enterprise systems.

Table of Contents

  1. The Question & What Interviewers Look For
  2. Step 1: Clarify Requirements (0-5 min)
  3. Step 2: Back-of-Envelope Estimation (5-10 min)
  4. Step 3: High-Level Architecture (10-20 min)
  5. Step 4: Deep Dives (20-35 min)
  6. Circuit Breaker Pattern
  7. Step 5: Scaling & Trade-offs (35-40 min)
  8. Step 6: ML Integration Layer (40-45 min)
  9. Conversation Memory Architecture
  10. Mock Interview Practice Script
  11. Common Mistakes to Avoid
  12. Summary Cheat Sheet

The Question

"Design a platform where an AI agent receives natural language requests from employees (via Slack/Teams), reasons about what actions to take, executes multi-step workflows across enterprise systems (ServiceNow, Jira, Salesforce, Okta), and returns results. The system must be multi-tenant, reliable, and respond within 5 seconds."

What the interviewer is looking for:

  • A structured approach: clarify → estimate → architect → deep dive → scale
  • Distributed-systems fundamentals: retries, circuit breakers, idempotency
  • Multi-tenancy and user-level permissions (this is an enterprise product)
  • Explicit trade-off reasoning: cost vs latency, consistency vs availability

1 Clarify Requirements (First 5 minutes)

  Q1: Should I focus on the reasoning/LLM layer or the execution infrastructure?
      → Both, but as a SWE, emphasize the execution infrastructure.
  Q2: Latency target: conversational (<5s) or async background tasks?
      → Conversational; users expect a fast response.
  Q3: How many enterprise systems per customer?
      → 5-20 connectors.
  Q4: Multi-tenant with different configs per customer?
      → Yes.
  Q5: Permission model: does the agent act as the user or as the system?
      → As the user (user-level permissions).

2 Back-of-Envelope Estimation (5-10 min)

USERS:
  350+ customers x 15,000 employees avg = ~5 Million total users
  10% DAU (Daily Active Users) = 500,000 users/day
  Avg 3 requests per user = 1.5 Million requests/day

QPS (Queries Per Second):
  1.5M requests / 86,400 seconds = ~17 req/sec (average)
  Peak (3x average) = ~50 req/sec

PER REQUEST BREAKDOWN:
  1 LLM call for planning     -> 1-3 seconds
  2-3 tool/API calls           -> 200ms - 2s each
  1 LLM call for response      -> 1-2 seconds
  Target: < 5 seconds end-to-end

STORAGE:
  Each conversation: ~5KB (messages + metadata)
  1.5M conversations/day x 5KB = 7.5 GB/day
  + Audit logs = 10 TB/year
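The estimation above can be sanity-checked with a few lines of arithmetic (the exact values land slightly above the rounded figures in the text, since the doc rounds 1.575M down to 1.5M before computing QPS):

```python
# Back-of-envelope numbers from the section above, unrounded.
customers = 350
employees_per_customer = 15_000
total_users = customers * employees_per_customer        # 5,250,000 ≈ 5M
dau = total_users // 10                                 # 525,000 ≈ 500K DAU
requests_per_day = dau * 3                              # 1,575,000 ≈ 1.5M/day
avg_qps = requests_per_day / 86_400                     # ~18 (doc rounds to ~17)
peak_qps = 3 * avg_qps                                  # ~55 (doc rounds to ~50)
storage_gb_per_day = requests_per_day * 5 / 1_000_000   # ~7.9 GB/day at 5 KB each
```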

3 High-Level Architecture (10-20 min)

┌─────────────┐
│ Slack/Teams  │ ──→ API Gateway (Auth, Rate Limit, WebSocket)
└──────┬──────┘            │
       │            Session Manager (Redis: context, history, user profile)
       │                   │
       │         ┌─────────┴──────────┐
       │         │  REASONING ENGINE   │
       │         │  ┌───────────────┐  │
       │         │  │ Planning (LLM)│  │ → Decompose request into steps
       │         │  │ Execution Eng │  │ → Run tool calls with retry/CB
       │         │  │ Observation   │  │ → Evaluate, re-plan if needed
       │         │  └───────────────┘  │
       │         └────┬──────────┬─────┘
       │              │          │
       │     ┌────────┴──┐  ┌───┴────────┐
       │     │Tool Registry│  │State Manager│
       │     │(per tenant) │  │(Redis + PG) │
       │     └────┬───────┘  └────────────┘
       │          │
       │   ┌──────┼──────┬──────────┐
       │   ┴      ┴      ┴          ┴
       │ ServiceNow  Jira  Salesforce  Okta

The 5 Layers Explained

  1. API Gateway: authentication, rate limiting, WebSocket to Slack/Teams
  2. Session Manager: Redis-backed conversation context, history, user profile
  3. Reasoning Engine: Plan (LLM) → Execute (tool calls) → Observe (re-plan)
  4. Tool Registry & State Manager: per-tenant connectors; Redis + PostgreSQL
  5. Enterprise Connectors: ServiceNow, Jira, Salesforce, Okta

4 Deep Dives (20-35 min — this is where you WIN)

Deep Dive #1: Reasoning Engine (The Brain)

ReAct Pattern: Plan → Execute → Observe → (Re-plan)

The Reasoning Engine implements the ReAct (Reason + Act) pattern, a widely used approach for agentic AI systems: the agent plans a step, executes it, observes the result, and re-plans until the task is complete.
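The Plan → Execute → Observe loop can be sketched as a small control loop. Here `call_llm` and `run_tool` are placeholders for the real LLM client and the per-tenant Tool Registry, and the LLM is assumed to return a structured plan (a dict with either a tool call or a final answer):

```python
# Minimal ReAct-style control loop, bounded by a step budget.
def run_agent(request, call_llm, run_tool, max_steps=5):
    context = [f"User request: {request}"]
    for _ in range(max_steps):
        # PLAN: ask the LLM for the next action (or a final answer)
        plan = call_llm("\n".join(context))
        if plan.get("final_answer"):
            return plan["final_answer"]
        # EXECUTE: run the chosen tool through the Tool Registry
        result = run_tool(plan["tool"], plan["args"])
        # OBSERVE: feed the result back so the LLM can re-plan
        context.append(f"Observation from {plan['tool']}: {result}")
    return "Could not complete the task within the step budget."
```

The step budget matters in an interview answer: it bounds worst-case latency and LLM spend when the model loops without converging.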

Deep Dive #2: Plugin/Connector Registry (Multi-Tenant)

Deep Dive #3: State Manager

Circuit Breaker Pattern

CIRCUIT BREAKER STATES:

  ┌────────┐  failures > threshold  ┌──────┐
  │ CLOSED │ ─────────────────────→ │ OPEN │
  │(normal)│                        │(fail │
  └────────┘                        │fast) │
       ▲                            └──┬───┘
       │                               │
       │    success                    │ timeout
       │                               │
  ┌────┴─────┐                    ┌───┴──────┐
  │  CLOSED  │ ◀── success ───── │HALF-OPEN │
  └──────────┘                    │(test one)│
                   failure ──→    └──────────┘
                   back to OPEN
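The state machine above translates into a small wrapper around each connector call. This is a minimal sketch (thresholds and timeout values are illustrative, not prescribed by the design):

```python
import time

class CircuitBreaker:
    """CLOSED -> OPEN after `threshold` consecutive failures;
    OPEN -> HALF_OPEN after `timeout` seconds (lets one test call through);
    HALF_OPEN closes on success, reopens on failure."""

    def __init__(self, threshold=5, timeout=30.0):
        self.threshold, self.timeout = threshold, timeout
        self.failures, self.state, self.opened_at = 0, "CLOSED", 0.0

    def call(self, fn, *args):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at < self.timeout:
                raise RuntimeError("circuit open: failing fast")
            self.state = "HALF_OPEN"  # timeout elapsed: allow one test call
        try:
            result = fn(*args)
        except Exception:
            self.failures += 1
            if self.state == "HALF_OPEN" or self.failures >= self.threshold:
                self.state, self.opened_at = "OPEN", time.monotonic()
            raise
        self.failures, self.state = 0, "CLOSED"
        return result
```

Failing fast is the point: when ServiceNow is down, the agent learns it in microseconds instead of burning its 5-second latency budget on doomed retries.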

5 Scaling & Trade-offs (35-40 min)

Latency Optimization

Reliability

Cost Optimization

~65% Cost Reduction with Model Routing

Route ~70% of queries (simple) to cheap models ($0.002/1K tokens), ~10% to mid-tier models ($0.005/1K tokens), and only ~20% (complex) to powerful models ($0.03/1K tokens). Result: $21,840/day vs $63,000/day without routing, a ~65% saving.

Security (Enterprise-grade)

Observability

6 ML Integration Layer (40-45 min)

Multiple Models

Model Router Architecture (3-Tier)

  Fast Tier  (Llama-3, Mistral):     ~70% traffic  |  $0.002/1K tokens
  Mid Tier   (Claude Haiku):          ~10% traffic  |  $0.005/1K tokens
  Power Tier (GPT-4, Claude Opus):    ~20% traffic  |  $0.03/1K tokens

  Model Gateway:
    ┌─────────────────────────────────┐
    │ Circuit Breaker per provider     │
    │ Retry/Fallback between tiers     │
    │ Rate Limiter per customer         │
    │ Response Caching                  │
    └─────────────────────────────────┘
                      │
  Evaluation Pipeline (async — non-blocking):
    Accuracy Score | Hallucination Detector | Latency Tracker | Cost Tracker

  Rollback Controller:
    Monitor metric trends → Compare vs baseline → Auto-rollback if degraded
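Tier selection plus cross-tier fallback can be sketched as follows. `call_model` stands in for the provider client behind the Model Gateway, and the complexity label is assumed to come from an upstream classifier:

```python
# Tiered routing: start at the tier matching query complexity, and
# escalate to stronger tiers if a provider fails (circuit open, outage).
TIERS = [
    ("fast",  ["llama-3", "mistral"]),    # ~70% of traffic
    ("mid",   ["claude-haiku"]),          # ~10%
    ("power", ["gpt-4", "claude-opus"]),  # ~20%
]

def route(prompt, complexity, call_model):
    start = {"simple": 0, "medium": 1, "complex": 2}[complexity]
    for _tier_name, models in TIERS[start:]:
        for model in models:
            try:
                return call_model(model, prompt)
            except Exception:
                continue  # provider failure: try the next model/tier
    raise RuntimeError("all model tiers exhausted")
```

Note that fallback only escalates upward, so an outage in the fast tier degrades cost, never quality.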

Key Design Decisions

  Decision                       Rationale
  3-tier model routing           70% cheap + 10% mid + 20% expensive = ~65% cost savings
  Async evaluation               Non-blocking; doesn't add latency to user requests
  Circuit breaker per provider   If GPT-4 is down, fall back to Claude automatically
  Canary deployment              Progressive rollout 5% → 25% → 100% reduces blast radius
  Response caching               Identical queries get cached responses (Redis, TTL 1hr)
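The response-caching decision is easy to demonstrate. The sketch below is an in-process stand-in for the production Redis cache (same idea: key on a hash of tenant + normalized query, expire after the TTL); the class and method names are illustrative:

```python
import hashlib
import time

class ResponseCache:
    """In-process stand-in for the Redis response cache (TTL 1hr in prod).
    Keys include the tenant ID so cached answers never leak across tenants."""

    def __init__(self, ttl=3600):
        self.ttl, self.store = ttl, {}

    def _key(self, tenant_id, query):
        normalized = " ".join(query.lower().split())  # cheap normalization
        return hashlib.sha256(f"{tenant_id}:{normalized}".encode()).hexdigest()

    def get(self, tenant_id, query):
        entry = self.store.get(self._key(tenant_id, query))
        if entry and time.monotonic() - entry[1] < self.ttl:
            return entry[0]
        return None  # miss or expired

    def put(self, tenant_id, query, response):
        self.store[self._key(tenant_id, query)] = (response, time.monotonic())
```

Scoping keys per tenant is the multi-tenancy detail interviewers look for; caching on the raw query string alone would be a cross-tenant data leak.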

A/B Testing & Canary Deployment

Roll out new models progressively (5% → 25% → 100% of traffic) while the async evaluation pipeline compares accuracy, latency, and cost against the baseline; the Rollback Controller reverts automatically if any metric degrades.

Conversation Memory Architecture

SHORT-TERM MEMORY (Redis Cluster, TTL 24hr):
  Session messages, turn count, metadata
  ~50 KB/session, Eventual consistency
  Structure: {session_id: {messages: [...], user_context: {...}, created_at}}

WORKING MEMORY (Redis Cluster, TTL 1hr):
  Task state, tool output, scratch pad
  ~20 KB/task, Strong consistency
  Example: Step 1 returned user's manager → stored → Step 2 uses for approval

LONG-TERM MEMORY (PostgreSQL + pgvector, configurable TTL):
  User prefs, interaction summaries, learned patterns, embeddings
  ~200 KB/user, Strong consistency
  Semantic search via cosine similarity on embeddings
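In production the retrieval is a pgvector query ordering by cosine distance, but the ranking logic itself is just cosine similarity over stored embeddings, which can be sketched as (the function names and the tiny example vectors are illustrative):

```python
import math

def cosine_similarity(a, b):
    # cos(theta) = (a . b) / (|a| * |b|)
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def search_memory(query_embedding, memories, top_k=3):
    """memories: list of (text, embedding) rows, as stored in pgvector."""
    ranked = sorted(
        memories,
        key=lambda m: cosine_similarity(query_embedding, m[1]),
        reverse=True,
    )
    return [text for text, _ in ranked[:top_k]]
```

pgvector pushes this ranking into the database (with an approximate index at scale) so only the top-k rows ever cross the wire.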

CONTEXT WINDOW MANAGER:
  1. Gather short-term messages (current conversation)
  2. Attach working memory (current task state)
  3. Semantic search long-term memory for relevant past interactions
  4. Summarize if total exceeds token limit
  5. Assemble final prompt for LLM

The Context Window Manager is critical — it intelligently selects which information to include in the LLM's context window (8K-128K tokens), prioritizing recency, relevance, and task state.
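The 5-step assembly above can be sketched as a single function. `estimate_tokens` and `summarize` are placeholders for a real tokenizer and a summarization LLM call:

```python
# Assemble the LLM prompt: long-term hits first, then task state,
# then the current conversation (most recent last), summarizing if
# the total exceeds the model's token budget.
def build_context(short_term, working, long_term_hits, token_limit,
                  estimate_tokens, summarize):
    parts = []
    parts.extend(long_term_hits)               # 3. relevant past interactions
    parts.append(f"Task state: {working}")     # 2. working memory
    parts.extend(short_term)                   # 1. current conversation
    prompt = "\n".join(parts)
    if estimate_tokens(prompt) > token_limit:  # 4. over budget: compress
        prompt = summarize(prompt, token_limit)
    return prompt                              # 5. final prompt for the LLM
```

Ordering is deliberate: the most recent conversation turns go last, where models weigh context most heavily, and long-term memory is the first candidate for summarization.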

Mock Interview Practice Script

[0-5 min]  Ask 5 clarifying questions (show the table)
[5-10 min] Walk through estimation: users → QPS → latency → storage
[10-20 min] Draw the architecture diagram, explain all 5 layers
[20-35 min] Deep dive: Reasoning Engine (Plan → Execute → Observe)
            Then: Circuit breaker pattern, Plugin Registry, State Manager
[35-45 min] Scaling: parallel execution, streaming, caching, model routing
            Trade-offs: cost vs latency, consistency vs availability

Set a timer for 45 minutes. Talk through each section aloud. Record yourself and listen back for where you hesitate.

Common Mistakes to AVOID

Don't make these errors in your interview:

  • ❌ Jumping into code without clarifying requirements first
  • ❌ Drawing a monolithic architecture (show you think distributed)
  • ❌ Ignoring multi-tenancy (this is an enterprise product!)
  • ❌ Not discussing failure modes (what happens when ServiceNow is down?)
  • ❌ Forgetting about permissions (agent acts as USER, not system)
  • ❌ Not mentioning observability (how do you debug in production?)

Summary Cheat Sheet

ARCHITECTURE:      User → Gateway → Reasoning Engine → Tools → Response (streamed)
REASONING ENGINE:  PLAN → EXECUTE → OBSERVE → (re-plan if needed)
KEY PATTERNS:      Circuit Breaker, Retry with exp backoff, Parallel execution,
                   Idempotency keys, Dead Letter Queue
MULTI-TENANCY:     Per-tenant Tool Registry, Per-tenant credentials in Vault,
                   User-level permissions on every tool call
DATA STORES:       Redis (session state), PostgreSQL (audit logs), Vault (credentials)
NUMBERS:           5M users | 500K DAU | 1.5M req/day | ~50 peak QPS |
                   <5s latency | 10 TB/year storage