Agent Latency Optimization System

"The AI agent takes 8-15 seconds to respond. Reduce to <3 seconds. Main bottlenecks: LLM inference (2-4s), API calls to external systems (1-3s each, done sequentially), and cold start overhead."

Table of Contents

  1. Clarifying Questions & Scope
  2. Back-of-Envelope Estimation
  3. High-Level Architecture
  4. Deep Dive 1: Response Caching
  5. Deep Dive 2: Parallel Execution + Streaming
  6. Deep Dive 3: Model Routing for Speed
  7. Before/After Breakdown
  8. Scaling & ML
  9. Cheat Sheet

1 Clarifying Questions & Scope

| Dimension | Clarification | Assumption |
|---|---|---|
| Current State | What's the current latency breakdown? | 8-15s total: LLM 2-4s, API calls 1-3s each (sequential), overhead 1-2s |
| Target | What latency is acceptable? | <3s for 80% of requests, <5s for 95% |
| Common Patterns | What % of queries are repeated/similar? | 70% are common patterns (status check, FAQ, simple lookup) |
| Quality Tradeoff | Can we trade some quality for speed? | Yes, for simple queries; quality must be maintained for complex ones |
| Streaming | Can we stream partial responses? | Yes: start showing the response before full generation completes |

2 Back-of-Envelope Estimation

Latency Targets

  • Current latency: 8-15 seconds (unacceptable)
  • Target: <3s for 80%, <5s for 95%
  • 70% of queries are common patterns (cacheable or fast-model eligible)
  • Optimization strategies: caching, parallelism, model routing, streaming
| Bottleneck | Current Latency | Optimized Latency | Technique |
|---|---|---|---|
| LLM inference | 2-4s | 0.5-1.5s | Model routing (fast model for simple queries) |
| API calls (sequential) | 3-9s (3 calls) | 1-3s | Parallel execution (asyncio.gather) |
| Cold start | 1-2s | ~0s | Warm pool, connection pooling |
| Repeated queries | Full pipeline | <100ms | Response caching (30-40% hit rate) |

3 High-Level Architecture

  LATENCY-OPTIMIZED AGENT PIPELINE
  ═══════════════════════════════════════════════════════════════════

  ┌──────────┐
  │ Request  │
  └────┬─────┘
       │
  ┌────v──────────┐
  │  Classifier   │  <50ms
  │  (complexity) │
  └────┬──────────┘
       │
  ┌────v──────────┐     YES    ┌───────────────┐
  │  Cache Hit?   │──────────> │  INSTANT      │  <100ms
  │  (semantic)   │            │  Response     │
  └────┬──────────┘            └───────────────┘
       │ NO
       │
  ┌────v──────────┐
  │  Model Router │
  └────┬──────────┘
       │
       ├── Simple (70%) ──> Fast Model (500ms) ──┐
       │                                          │
       └── Complex (30%) ──> Powerful Model ─────┤
                              (2.5s)              │
                                                  │
  ┌───────────────────────────────────────────────v────┐
  │  PARALLEL TOOL EXECUTION                           │
  │  asyncio.gather(                                   │
  │    call_api_1(),  ──┐                              │
  │    call_api_2(),  ──┼── All run simultaneously     │
  │    call_api_3(),  ──┘   Max latency = slowest call │
  │  )                                                 │
  └────────────────────────────────┬───────────────────┘
                                   │
  ┌────────────────────────────────v───────────────────┐
  │  STREAMING RESPONSE                                │
  │  Start sending tokens as they're generated.        │
  │  User sees first word in ~200ms (TTFT).            │
  └────────────────────────────────────────────────────┘
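The flow above can be sketched end to end. Everything below is a hypothetical skeleton: `classify`, `run_tools`, and `generate` are toy stand-ins for the real classifier, tool layer, and LLM call, and the cache is a plain dict rather than a semantic cache.

```python
import asyncio

# Toy stand-ins (all hypothetical) for the real pipeline components.
def classify(query: str) -> str:
    return "complex" if "analyze" in query.lower() else "simple"

_cache = {}  # exact-match stand-in for the semantic cache

async def run_tools(ctx: dict) -> list:
    # Independent API calls fan out in parallel (see Deep Dive 2).
    async def call(name):
        await asyncio.sleep(0.01)  # simulated API latency
        return name
    return await asyncio.gather(*(call(t) for t in ctx.get("tools", [])))

async def generate(model: str, query: str, tools: list) -> str:
    await asyncio.sleep(0.01)  # simulated inference
    return f"[{model}] answer using {len(tools)} tool results"

async def handle_request(query: str, ctx: dict) -> str:
    # Pipeline from the diagram: cache check -> classify -> route -> tools -> generate.
    if (hit := _cache.get(query)) is not None:
        return hit                                   # instant path, <100ms
    tier = classify(query)                           # <50ms complexity check
    model = "powerful-model" if tier == "complex" else "fast-model"
    tool_results = await run_tools(ctx)              # max, not sum, of call latencies
    response = await generate(model, query, tool_results)
    _cache[query] = response                         # warm the cache for repeats
    return response
```

Each stage maps to one box in the diagram; swapping a stub for a real implementation doesn't change the orchestration.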

4 Deep Dive 1: Response Caching

Semantic Cache (Not Just Exact Match)

PRODUCTION EXPERIENCE

In a large-scale enterprise notification system, we implemented response caching for AI-powered content delivery and achieved an 85% cache hit rate on frequently accessed content, a 94% latency reduction for cached paths. The key insight: most users ask the same categories of questions, so caching common patterns eliminates the most expensive part of the pipeline (LLM inference) entirely.
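A minimal sketch of a semantic cache, assuming an injectable `embed` function. In production this would be a sentence-embedding model backed by a vector index; the 0.9 threshold is illustrative.

```python
import math

class SemanticCache:
    """Cache keyed by embedding similarity rather than exact string match."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # text -> vector; supplied by the caller
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (embedding, cached_response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        if not self.entries:
            return None
        q = self.embed(query)
        score, response = max(
            ((self._cosine(q, emb), resp) for emb, resp in self.entries),
            key=lambda pair: pair[0],
        )
        # Similar-enough query: reuse the answer, skipping LLM + API calls.
        return response if score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

A linear scan is fine for a sketch; at scale the lookup would go through an approximate-nearest-neighbor index.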

Cache Invalidation
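One common scheme combines per-entry TTLs with explicit, tag-based eviction when underlying data changes (e.g. a ticket update). A minimal sketch; the tag format `ticket:<id>` used in the example is an assumption.

```python
import time

class TTLCache:
    """Response cache with per-entry TTL plus explicit invalidation by tag."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at, tag)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at, _tag = entry
        if time.monotonic() > expires_at:
            del self.store[key]  # expired: force a fresh pipeline run
            return None
        return value

    def put(self, key, value, tag=""):
        self.store[key] = (value, time.monotonic() + self.ttl, tag)

    def invalidate_tag(self, tag):
        # Evict every entry derived from data that just changed.
        self.store = {k: v for k, v in self.store.items() if v[2] != tag}
```

TTLs bound staleness for data that changes silently; tag invalidation handles changes the system observes directly.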

5 Deep Dive 2: Parallel Execution + Streaming

Parallel Tool/API Execution

  SEQUENTIAL (BEFORE):
  ═══════════════════════════════════════════
  Call API 1: ████████ (2s)
                       Call API 2: ██████ (1.5s)
                                         Call API 3: ████████ (2s)
  Total: 5.5 seconds
  ───────────────────────────────────────────

  PARALLEL (AFTER):
  ═══════════════════════════════════════════
  Call API 1: ████████ (2s)
  Call API 2: ██████   (1.5s)
  Call API 3: ████████ (2s)
  Total: 2 seconds (= max of all calls)
  ───────────────────────────────────────────

  SAVINGS: 5.5s → 2s = 64% reduction

asyncio.gather Implementation

Pattern: Parallel Tool Calls

import asyncio

async def gather_context(user_id, ticket_id):
    # All three calls are independent, so they run concurrently;
    # total latency = the slowest call, not the sum.
    results = await asyncio.gather(
        lookup_user(user_id),          # ~800ms
        get_ticket_status(ticket_id),  # ~1200ms
        check_permissions(user_id),    # ~600ms
        return_exceptions=True,        # one failure doesn't cancel the others
    )
    return results

# Total: ~1200ms (max of the calls), not 2600ms (their sum)

PRODUCTION EXPERIENCE

In a production notification delivery pipeline, we used asyncio to parallelize independent operations. Previously, checking user preferences, rendering templates, and validating delivery channels ran sequentially; running them concurrently cut end-to-end processing time by 60%. The same principle applies here: identify independent API calls and run them concurrently.

Token Streaming
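A minimal streaming sketch. `fake_llm_stream` is a stand-in for a real streaming LLM client; in production each token would be forwarded over SSE or a WebSocket the moment it arrives, so time-to-first-token (TTFT) is one generation step rather than the whole completion.

```python
import asyncio

async def fake_llm_stream(prompt):
    # Stand-in for a streaming LLM client: yields tokens as they're generated.
    for token in ["Your", " ticket", " INC-4821", " is", " resolved."]:
        await asyncio.sleep(0.01)  # per-token generation delay
        yield token

async def respond(prompt):
    # Forward each token as it arrives instead of waiting for the full
    # completion; the user sees the first word after one token's latency.
    chunks = []
    async for token in fake_llm_stream(prompt):
        chunks.append(token)  # in production: write to the client stream here
    return "".join(chunks)
```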

Speculative Execution
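Speculative execution overlaps a likely data fetch with the LLM's tool-selection step; if the guess is wrong, the task is cancelled and only some wasted I/O is paid. A sketch with hypothetical stubs (`lookup_user`, `decide_tools`):

```python
import asyncio

async def lookup_user(user_id):
    # Stand-in for a slow API call.
    await asyncio.sleep(0.05)
    return {"id": user_id, "name": "demo"}

async def decide_tools(query):
    # Stand-in for the LLM/router deciding which tools are needed.
    await asyncio.sleep(0.05)
    return ["lookup_user"] if "status" in query.lower() else []

async def handle(query, user_id):
    # Start the most-likely fetch *before* the decision; if the guess is
    # right, its latency overlapped the decision instead of adding to it.
    speculative = asyncio.create_task(lookup_user(user_id))
    plan = await decide_tools(query)
    if "lookup_user" in plan:
        return await speculative  # already (nearly) done
    speculative.cancel()          # wrong guess: discard the in-flight call
    try:
        await speculative         # reap the cancelled task cleanly
    except asyncio.CancelledError:
        pass
    return None
```

Speculation only pays off when the guess rate is high and the fetch is idempotent and side-effect free.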

6 Deep Dive 3: Model Routing for Speed

Speed-Optimized Routing

| Tier | % Traffic | Avg Latency | Example Queries |
|---|---|---|---|
| Fast model | 70% | 500ms | "What's the status of INC-4821?", "Reset my password", "Who is my manager?" |
| Powerful model | 30% | 2.5s | "Analyze the trend in P1 incidents this quarter", "Debug this error across 3 systems" |
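A router sketch; real systems typically use a small trained classifier, so the keyword heuristics and model names below are illustrative only.

```python
def route_model(query):
    # Crude complexity heuristic: multi-system analysis goes to the
    # powerful model; short lookups go to the fast one.
    complex_markers = ("analyze", "trend", "debug", "compare", "across")
    q = query.lower()
    if any(m in q for m in complex_markers) or len(q.split()) > 30:
        return "powerful-model"  # ~2.5s, full reasoning
    return "fast-model"          # ~500ms, ~70% of traffic
```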

Weighted Average Calculation

  MODEL ROUTING IMPACT ON LATENCY
  ═══════════════════════════════════════════════════════

  WITHOUT routing (all powerful model):
  Average latency = 3.0 seconds

  WITH routing:
  Average = (0.70 x 0.5s) + (0.30 x 2.5s)
          = 0.35s + 0.75s
          = 1.10 seconds

  Improvement: 3.0s → 1.1s = 63% faster

7 Before/After Breakdown

BEFORE: Sequential Pipeline (Worst Case)

  Cold start:           1.5s
  LLM (powerful):       3.0s
  API call 1:           2.0s
  API call 2:           1.5s
  API call 3:           2.0s
  Response formatting:  0.5s
  ──────────────────────────
  Total:               10.5s

AFTER: Optimized Pipeline (Weighted Average)

Let's calculate the weighted average across all optimization strategies:

| Path | % Requests | Latency | Contribution |
|---|---|---|---|
| Cache hit (instant) | 35% | 0.1s | 0.035s |
| Fast model + parallel APIs | 45% | 1.5s | 0.675s |
| Powerful model + parallel APIs | 18% | 3.0s | 0.540s |
| Fast model retry → powerful | 2% | 4.0s | 0.080s |
  TOTAL WEIGHTED AVERAGE
  ═══════════════════════════════════════════════════════

  0.035 + 0.675 + 0.540 + 0.080 = 1.33 seconds

  BEFORE:  10.5 seconds (sequential, worst case)
  AFTER:    1.33 seconds (weighted average)

  ┌──────────────────────────────────────────────────┐
  │                                                  │
  │   IMPROVEMENT: 87% REDUCTION                     │
  │   10.5s → 1.33s                                  │
  │                                                  │
  │   P80 target (<3s):  ✓ ACHIEVED (98% under 3s)   │
  │   P95 target (<5s):  ✓ ACHIEVED (99.5% under 5s) │
  │                                                  │
  └──────────────────────────────────────────────────┘

  BREAKDOWN OF SAVINGS:
  ─────────────────────────────────────────────────────
  Caching:              -35% of requests skip pipeline entirely
  Parallel execution:   -64% on API call latency (5.5s → 2s)
  Model routing:        -63% on LLM latency (3s → 1.1s avg)
  Connection pooling:   -300ms per cold connection eliminated
  Streaming:            TTFT 200ms (perceived latency near-instant)

8 Scaling & ML

Scaling Strategies

ML Enhancements

9 Cheat Sheet

Agent Latency Optimization — Key Numbers

  • Before: 10.5s → After: 1.33s weighted avg (87% reduction)
  • P80 <3s achieved, P95 <5s achieved
  • Response caching: 30-40% hit rate, <100ms response
  • Production experience: 85% cache hit rate, 94% latency reduction
  • Parallel execution: asyncio.gather, max(calls) not sum(calls)
  • Production experience: asyncio parallelism, 60% processing time reduction
  • Model routing: 70% fast (500ms) + 30% powerful (2.5s) = 1.1s avg
  • Streaming: TTFT 200ms, perceived latency near-instant
  • Connection pooling eliminates ~300ms cold connection overhead
  • Speculative execution: pre-fetch data before LLM decides