"The AI agent takes 8-15 seconds to respond. Reduce to <3 seconds. Main bottlenecks: LLM inference (2-4s), API calls to external systems (1-3s each, done sequentially), and cold start overhead."
| Dimension | Clarification | Assumption |
|---|---|---|
| Current State | What's the current latency breakdown? | 5-12s total: LLM 2-4s, API calls 1-3s each (sequential), overhead 1-2s |
| Target | What latency is acceptable? | <3s for 80% of requests, <5s for 95% |
| Common Patterns | What % of queries are repeated/similar? | 70% are common patterns (status check, FAQ, simple lookup) |
| Quality Tradeoff | Can we trade some quality for speed? | Yes, for simple queries. Quality must be maintained for complex ones. |
| Streaming | Can we stream partial responses? | Yes — start showing response before full generation completes |
| Current Bottleneck | Current Latency | Optimized Latency | Technique |
|---|---|---|---|
| LLM Inference | 2-4 seconds | 0.5-1.5s | Model routing (fast model for simple queries) |
| API Calls (sequential) | 3-9 seconds (3 calls) | 1-3s | Parallel execution (asyncio.gather) |
| Cold Start | 1-2 seconds | 0s | Warm pool, connection pooling |
| Repeated Queries | Full pipeline | <100ms | Response caching (30-40% hit rate) |
LATENCY-OPTIMIZED AGENT PIPELINE
═══════════════════════════════════════════════════════════════════
┌──────────┐
│ Request │
└────┬─────┘
│
┌────v──────────┐
│ Classifier │ <50ms
│ (complexity) │
└────┬──────────┘
│
┌────v──────────┐ YES ┌──────────────┐
│ Cache Hit? │──────────> │ INSTANT │ <100ms
│ (semantic) │ │ Response │
└────┬──────────┘ └──────────────┘
│ NO
│
┌────v──────────┐
│ Model Router │
└────┬──────────┘
│
├── Simple (70%) ──> Fast Model (500ms) ──┐
│ │
└── Complex (30%) ──> Powerful Model ─────┤
(2.5s) │
│
┌───────────────────────────────────────────────v────┐
│ PARALLEL TOOL EXECUTION │
│ asyncio.gather( │
│ call_api_1(), ──┐ │
│ call_api_2(), ──┼── All run simultaneously │
│ call_api_3(), ──┘ Max latency = slowest call │
│ ) │
└────────────────────────────────┬───────────────────┘
│
┌────────────────────────────────v───────────────────┐
│ STREAMING RESPONSE │
│ Start sending tokens as they're generated. │
│ User sees first word in ~200ms (TTFT). │
└────────────────────────────────────────────────────┘
In a large-scale enterprise notification system, implemented response caching for AI-powered content delivery. Achieved 85% cache hit rate for frequently accessed content, resulting in 94% latency reduction for cached paths. The key insight: most users ask the same categories of questions. Caching common patterns eliminates the most expensive part of the pipeline (LLM inference) entirely.
SEQUENTIAL (BEFORE):
═══════════════════════════════════════════
Call API 1: ████████ (2s)
Call API 2: ██████ (1.5s)
Call API 3: ████████ (2s)
Total: 5.5 seconds
───────────────────────────────────────────
PARALLEL (AFTER):
═══════════════════════════════════════════
Call API 1: ████████ (2s)
Call API 2: ██████ (1.5s)
Call API 3: ████████ (2s)
Total: 2 seconds (= max of all calls)
───────────────────────────────────────────
SAVINGS: 5.5s → 2s = 64% reduction
results = await asyncio.gather(
lookup_user(user_id), # 800ms
get_ticket_status(ticket_id), # 1200ms
check_permissions(user_id), # 600ms
return_exceptions=True
)
# Total: 1200ms (not 2600ms)
In a production notification delivery pipeline, applied asyncio parallelism to parallelize independent operations. Previously, checking user preferences, rendering templates, and validating delivery channels happened sequentially. By parallelizing independent operations, reduced end-to-end processing time by 60%. Same principle applies here — identify independent API calls and run them concurrently.
| Tier | % Traffic | Avg Latency | Example Queries |
|---|---|---|---|
| Fast Model | 70% | 500ms | "What's the status of INC-4821?", "Reset my password", "Who is my manager?" |
| Powerful Model | 30% | 2.5s | "Analyze the trend in P1 incidents this quarter", "Debug this error across 3 systems" |
MODEL ROUTING IMPACT ON LATENCY
═══════════════════════════════════════════════════════
WITHOUT routing (all powerful model):
Average latency = 3.0 seconds
WITH routing:
Average = (0.70 x 0.5s) + (0.30 x 2.5s)
= 0.35s + 0.75s
= 1.10 seconds
Improvement: 3.0s → 1.1s = 63% faster
BEFORE: Sequential Pipeline (Worst Case)
Cold start: 1.5s + LLM (powerful): 3.0s + API call 1: 2.0s + API call 2: 1.5s + API call 3: 2.0s + Response formatting: 0.5s = 10.5 seconds
Let's calculate the weighted average across all optimization strategies:
| Path | % Requests | Latency | Contribution |
|---|---|---|---|
| Cache hit (instant) | 35% | 0.1s | 0.035s |
| Fast model + parallel APIs | 45% | 1.5s | 0.675s |
| Powerful model + parallel APIs | 18% | 3.0s | 0.540s |
| Fast model retry → Powerful | 2% | 4.0s | 0.080s |
TOTAL WEIGHTED AVERAGE ═══════════════════════════════════════════════════════ 0.035 + 0.675 + 0.540 + 0.080 = 1.33 seconds BEFORE: 10.5 seconds (sequential, worst case) AFTER: 1.33 seconds (weighted average) ┌──────────────────────────────────────────────────┐ │ │ │ IMPROVEMENT: 87% REDUCTION │ │ 10.5s → 1.33s │ │ │ │ P80 target (<3s): ✓ ACHIEVED (98% under 3s) │ │ P95 target (<5s): ✓ ACHIEVED (99.5% under 5s)│ │ │ └──────────────────────────────────────────────────┘ BREAKDOWN OF SAVINGS: ───────────────────────────────────────────────────── Caching: -35% of requests skip pipeline entirely Parallel execution: -64% on API call latency (5.5s → 2s) Model routing: -63% on LLM latency (3s → 1.1s avg) Connection pooling: -300ms per cold connection eliminated Streaming: TTFT 200ms (perceived latency near-instant)