Agent Latency Optimization System

"The AI agent takes 8-15 seconds to respond. Reduce to <3 seconds. Main bottlenecks: LLM inference (2-4s), API calls to external systems (1-3s each, done sequentially), and cold start overhead."

Table of Contents

  1. Clarifying Questions & Scope
  2. Back-of-Envelope Estimation
  3. High-Level Architecture
  4. Deep Dive 1: Response Caching
  5. Deep Dive 2: Parallel Execution + Streaming
  6. Deep Dive 3: Model Routing for Speed
  7. Before/After Breakdown
  8. Scaling & ML
  9. Cheat Sheet

1 Clarifying Questions & Scope

| Dimension | Clarification | Assumption |
|---|---|---|
| Current State | What's the current latency breakdown? | 8-15s total: LLM 2-4s, API calls 1-3s each (sequential), overhead 1-2s |
| Target | What latency is acceptable? | <3s for 80% of requests, <5s for 95% |
| Common Patterns | What % of queries are repeated/similar? | 70% are common patterns (status check, FAQ, simple lookup) |
| Quality Tradeoff | Can we trade some quality for speed? | Yes, for simple queries; quality must be maintained for complex ones |
| Streaming | Can we stream partial responses? | Yes: start showing the response before full generation completes |

2 Back-of-Envelope Estimation

Latency Targets

  • Current latency: 8-15 seconds (unacceptable)
  • Target: <3s for 80%, <5s for 95%
  • 70% of queries are common patterns (cacheable or fast-model eligible)
  • Optimization strategies: caching, parallelism, model routing, streaming
| Bottleneck | Current Latency | Optimized Latency | Technique |
|---|---|---|---|
| LLM inference | 2-4s | 0.5-1.5s | Model routing (fast model for simple queries) |
| API calls (sequential) | 3-9s (3 calls) | 1-3s | Parallel execution (asyncio.gather) |
| Cold start | 1-2s | ~0s | Warm pool, connection pooling |
| Repeated queries | Full pipeline | <100ms | Response caching (30-40% hit rate) |

3 High-Level Architecture

  LATENCY-OPTIMIZED AGENT PIPELINE
  ═══════════════════════════════════════════════════════════════════

  ┌──────────┐
  │ Request  │
  └────┬─────┘
       │
  ┌────v──────────┐
  │  Classifier   │  <50ms
  │  (complexity) │
  └────┬──────────┘
       │
  ┌────v──────────┐     YES    ┌───────────────┐
  │  Cache Hit?   │──────────> │  INSTANT      │  <100ms
  │  (semantic)   │            │  Response     │
  └────┬──────────┘            └───────────────┘
       │ NO
       │
  ┌────v──────────┐
  │  Model Router │
  └────┬──────────┘
       │
       ├── Simple (70%) ──> Fast Model (500ms) ──┐
       │                                          │
       └── Complex (30%) ──> Powerful Model ─────┤
                              (2.5s)              │
                                                  │
  ┌───────────────────────────────────────────────v────┐
  │  PARALLEL TOOL EXECUTION                           │
  │  asyncio.gather(                                   │
  │    call_api_1(),  ──┐                              │
  │    call_api_2(),  ──┼── All run simultaneously     │
  │    call_api_3(),  ──┘   Max latency = slowest call │
  │  )                                                 │
  └────────────────────────────────┬───────────────────┘
                                   │
  ┌────────────────────────────────v───────────────────┐
  │  STREAMING RESPONSE                                │
  │  Start sending tokens as they're generated.        │
  │  User sees first word in ~200ms (TTFT).            │
  └────────────────────────────────────────────────────┘
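The flow above can be sketched end to end. Everything below is a hypothetical skeleton: `classify`, `run_tools`, and `generate` are toy stand-ins for the real classifier, tool layer, and LLM call, and the cache is a plain dict rather than a semantic cache.

```python
import asyncio

# Toy stand-ins (all hypothetical) for the real pipeline components.
def classify(query: str) -> str:
    return "complex" if "analyze" in query.lower() else "simple"

_cache = {}  # exact-match stand-in for the semantic cache

async def run_tools(ctx: dict) -> list:
    # Independent API calls fan out in parallel (see Deep Dive 2).
    async def call(name):
        await asyncio.sleep(0.01)  # simulated API latency
        return name
    return await asyncio.gather(*(call(t) for t in ctx.get("tools", [])))

async def generate(model: str, query: str, tools: list) -> str:
    await asyncio.sleep(0.01)  # simulated inference
    return f"[{model}] answer using {len(tools)} tool results"

async def handle_request(query: str, ctx: dict) -> str:
    # Pipeline from the diagram: cache check -> classify -> route -> tools -> generate.
    if (hit := _cache.get(query)) is not None:
        return hit                                   # instant path, <100ms
    tier = classify(query)                           # <50ms complexity check
    model = "powerful-model" if tier == "complex" else "fast-model"
    tool_results = await run_tools(ctx)              # max, not sum, of call latencies
    response = await generate(model, query, tool_results)
    _cache[query] = response                         # warm the cache for repeats
    return response
```

Each stage maps to one box in the diagram; swapping a stub for a real implementation doesn't change the orchestration.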

4 Deep Dive 1: Response Caching

Semantic Cache (Not Just Exact Match)

PRODUCTION EXPERIENCE

In a large-scale enterprise notification system, we implemented response caching for AI-powered content delivery and achieved an 85% cache hit rate on frequently accessed content, a 94% latency reduction for cached paths. The key insight: most users ask the same categories of questions, so caching common patterns eliminates the most expensive part of the pipeline (LLM inference) entirely.
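A minimal sketch of a semantic cache, assuming an injectable `embed` function. In production this would be a sentence-embedding model backed by a vector index; the 0.9 threshold is illustrative.

```python
import math

class SemanticCache:
    """Cache keyed by embedding similarity rather than exact string match."""

    def __init__(self, embed, threshold=0.9):
        self.embed = embed          # text -> vector; supplied by the caller
        self.threshold = threshold  # minimum cosine similarity for a hit
        self.entries = []           # list of (embedding, cached_response)

    @staticmethod
    def _cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(x * x for x in b))
        return dot / (na * nb) if na and nb else 0.0

    def get(self, query):
        if not self.entries:
            return None
        q = self.embed(query)
        score, response = max(
            ((self._cosine(q, emb), resp) for emb, resp in self.entries),
            key=lambda pair: pair[0],
        )
        # Similar-enough query: reuse the answer, skipping LLM + API calls.
        return response if score >= self.threshold else None

    def put(self, query, response):
        self.entries.append((self.embed(query), response))
```

A linear scan is fine for a sketch; at scale the lookup would go through an approximate-nearest-neighbor index.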

Cache Invalidation
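One common scheme combines per-entry TTLs with explicit, tag-based eviction when underlying data changes (e.g. a ticket update). A minimal sketch; the tag format `ticket:<id>` used in the example is an assumption.

```python
import time

class TTLCache:
    """Response cache with per-entry TTL plus explicit invalidation by tag."""

    def __init__(self, ttl_seconds=300.0):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, expires_at, tag)

    def get(self, key):
        entry = self.store.get(key)
        if entry is None:
            return None
        value, expires_at, _tag = entry
        if time.monotonic() > expires_at:
            del self.store[key]  # expired: force a fresh pipeline run
            return None
        return value

    def put(self, key, value, tag=""):
        self.store[key] = (value, time.monotonic() + self.ttl, tag)

    def invalidate_tag(self, tag):
        # Evict every entry derived from data that just changed.
        self.store = {k: v for k, v in self.store.items() if v[2] != tag}
```

TTLs bound staleness for data that changes silently; tag invalidation handles changes the system observes directly.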

5 Deep Dive 2: Parallel Execution + Streaming

Parallel Tool/API Execution

  SEQUENTIAL (BEFORE):
  ═══════════════════════════════════════════
  Call API 1: ████████ (2s)
                       Call API 2: ██████ (1.5s)
                                         Call API 3: ████████ (2s)
  Total: 5.5 seconds
  ───────────────────────────────────────────

  PARALLEL (AFTER):
  ═══════════════════════════════════════════
  Call API 1: ████████ (2s)
  Call API 2: ██████   (1.5s)
  Call API 3: ████████ (2s)
  Total: 2 seconds (= max of all calls)
  ───────────────────────────────────────────

  SAVINGS: 5.5s → 2s = 64% reduction

asyncio.gather Implementation

Pattern: Parallel Tool Calls

import asyncio

async def gather_context(user_id, ticket_id):
    # All three calls are independent, so they run concurrently;
    # total latency = the slowest call, not the sum.
    results = await asyncio.gather(
        lookup_user(user_id),          # ~800ms
        get_ticket_status(ticket_id),  # ~1200ms
        check_permissions(user_id),    # ~600ms
        return_exceptions=True,        # one failure doesn't cancel the others
    )
    return results

# Total: ~1200ms (max of the calls), not 2600ms (their sum)

PRODUCTION EXPERIENCE

In a production notification delivery pipeline, we used asyncio to parallelize independent operations. Previously, checking user preferences, rendering templates, and validating delivery channels ran sequentially; running them concurrently cut end-to-end processing time by 60%. The same principle applies here: identify independent API calls and run them concurrently.

Token Streaming
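A minimal streaming sketch. `fake_llm_stream` is a stand-in for a real streaming LLM client; in production each token would be forwarded over SSE or a WebSocket the moment it arrives, so time-to-first-token (TTFT) is one generation step rather than the whole completion.

```python
import asyncio

async def fake_llm_stream(prompt):
    # Stand-in for a streaming LLM client: yields tokens as they're generated.
    for token in ["Your", " ticket", " INC-4821", " is", " resolved."]:
        await asyncio.sleep(0.01)  # per-token generation delay
        yield token

async def respond(prompt):
    # Forward each token as it arrives instead of waiting for the full
    # completion; the user sees the first word after one token's latency.
    chunks = []
    async for token in fake_llm_stream(prompt):
        chunks.append(token)  # in production: write to the client stream here
    return "".join(chunks)
```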

Speculative Execution
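Speculative execution overlaps a likely data fetch with the LLM's tool-selection step; if the guess is wrong, the task is cancelled and only some wasted I/O is paid. A sketch with hypothetical stubs (`lookup_user`, `decide_tools`):

```python
import asyncio

async def lookup_user(user_id):
    # Stand-in for a slow API call.
    await asyncio.sleep(0.05)
    return {"id": user_id, "name": "demo"}

async def decide_tools(query):
    # Stand-in for the LLM/router deciding which tools are needed.
    await asyncio.sleep(0.05)
    return ["lookup_user"] if "status" in query.lower() else []

async def handle(query, user_id):
    # Start the most-likely fetch *before* the decision; if the guess is
    # right, its latency overlapped the decision instead of adding to it.
    speculative = asyncio.create_task(lookup_user(user_id))
    plan = await decide_tools(query)
    if "lookup_user" in plan:
        return await speculative  # already (nearly) done
    speculative.cancel()          # wrong guess: discard the in-flight call
    try:
        await speculative         # reap the cancelled task cleanly
    except asyncio.CancelledError:
        pass
    return None
```

Speculation only pays off when the guess rate is high and the fetch is idempotent and side-effect free.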

6 Deep Dive 3: Model Routing for Speed

Speed-Optimized Routing

| Tier | % Traffic | Avg Latency | Example Queries |
|---|---|---|---|
| Fast model | 70% | 500ms | "What's the status of INC-4821?", "Reset my password", "Who is my manager?" |
| Powerful model | 30% | 2.5s | "Analyze the trend in P1 incidents this quarter", "Debug this error across 3 systems" |
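A router sketch; real systems typically use a small trained classifier, so the keyword heuristics and model names below are illustrative only.

```python
def route_model(query):
    # Crude complexity heuristic: multi-system analysis goes to the
    # powerful model; short lookups go to the fast one.
    complex_markers = ("analyze", "trend", "debug", "compare", "across")
    q = query.lower()
    if any(m in q for m in complex_markers) or len(q.split()) > 30:
        return "powerful-model"  # ~2.5s, full reasoning
    return "fast-model"          # ~500ms, ~70% of traffic
```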

Weighted Average Calculation

  MODEL ROUTING IMPACT ON LATENCY
  ═══════════════════════════════════════════════════════

  WITHOUT routing (all powerful model):
  Average latency = 3.0 seconds

  WITH routing:
  Average = (0.70 x 0.5s) + (0.30 x 2.5s)
          = 0.35s + 0.75s
          = 1.10 seconds

  Improvement: 3.0s → 1.1s = 63% faster

7 Before/After Breakdown

BEFORE: Sequential Pipeline (Worst Case)

  Cold start:           1.5s
  LLM (powerful):       3.0s
  API call 1:           2.0s
  API call 2:           1.5s
  API call 3:           2.0s
  Response formatting:  0.5s
  ──────────────────────────
  Total:               10.5s

AFTER: Optimized Pipeline (Weighted Average)

Let's calculate the weighted average across all optimization strategies:

| Path | % Requests | Latency | Contribution |
|---|---|---|---|
| Cache hit (instant) | 35% | 0.1s | 0.035s |
| Fast model + parallel APIs | 45% | 1.5s | 0.675s |
| Powerful model + parallel APIs | 18% | 3.0s | 0.540s |
| Fast model retry → powerful | 2% | 4.0s | 0.080s |
  TOTAL WEIGHTED AVERAGE
  ═══════════════════════════════════════════════════════

  0.035 + 0.675 + 0.540 + 0.080 = 1.33 seconds

  BEFORE:  10.5 seconds (sequential, worst case)
  AFTER:    1.33 seconds (weighted average)

  ┌──────────────────────────────────────────────────┐
  │                                                  │
  │   IMPROVEMENT: 87% REDUCTION                     │
  │   10.5s → 1.33s                                  │
  │                                                  │
  │   P80 target (<3s):  ✓ ACHIEVED (98% under 3s)   │
  │   P95 target (<5s):  ✓ ACHIEVED (99.5% under 5s) │
  │                                                  │
  └──────────────────────────────────────────────────┘

  BREAKDOWN OF SAVINGS:
  ─────────────────────────────────────────────────────
  Caching:              -35% of requests skip pipeline entirely
  Parallel execution:   -64% on API call latency (5.5s → 2s)
  Model routing:        -63% on LLM latency (3s → 1.1s avg)
  Connection pooling:   -300ms per cold connection eliminated
  Streaming:            TTFT 200ms (perceived latency near-instant)

8 Scaling & ML

Scaling Strategies

ML Enhancements

9 Cheat Sheet

Agent Latency Optimization — Key Numbers

  • Before: 10.5s → After: 1.33s weighted avg (87% reduction)
  • P80 <3s achieved, P95 <5s achieved
  • Response caching: 30-40% hit rate, <100ms response
  • Production experience: 85% cache hit rate, 94% latency reduction
  • Parallel execution: asyncio.gather, max(calls) not sum(calls)
  • Production experience: asyncio parallelism, 60% processing time reduction
  • Model routing: 70% fast (500ms) + 30% powerful (2.5s) = 1.1s avg
  • Streaming: TTFT 200ms, perceived latency near-instant
  • Connection pooling eliminates ~300ms cold connection overhead
  • Speculative execution: pre-fetch data before LLM decides