LLM Model Serving & Evaluation Pipeline

"Serve multiple LLMs with intelligent model routing, A/B testing, canary deployments, evaluation pipeline, and auto-rollback capabilities."

Table of Contents

  1. Clarifying Questions & Scope
  2. Back-of-Envelope Estimation
  3. High-Level Architecture
  4. Deep Dive 1: Model Router
  5. Deep Dive 2: A/B Testing & Canary
  6. Deep Dive 3: Evaluation Pipeline
  7. Key Design Decisions
  8. Cheat Sheet

1 Clarifying Questions & Scope

| Dimension | Clarification | Assumption |
| --- | --- | --- |
| Model Count | How many models to serve? | 3-5 models: fast (GPT-3.5/Haiku), mid (GPT-4o-mini/Sonnet), powerful (GPT-4/Opus) |
| Routing | Who decides which model? | Automatic complexity-based routing with manual overrides |
| Deployment | How do we roll out new models? | Canary deployment: 5% → 25% → 50% → 100% |
| Evaluation | How do we measure quality? | Offline evals, online metrics, and human evaluation |
| Rollback | When do we auto-rollback? | If quality drops >5% or error rate >2% vs baseline |

2 Back-of-Envelope Estimation

Scale Numbers

  • 3M LLM calls/day across all tenants
  • Average 700 tokens/call = 2.1B tokens/day
  • Average ~35 calls/second; peak ~50 calls/second

Cost Savings with Intelligent Routing

WITHOUT routing (all powerful model):
3M calls x $0.021/call = $63,000/day

WITH routing (70% fast, 10% mid, 20% powerful):
2.1M x $0.002 + 0.3M x $0.007 + 0.6M x $0.021 = $18,900/day

Savings: 70% ($44,100/day ≈ $16M/year)
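A few lines make the arithmetic easy to re-check (a minimal sketch; the per-call rates are the blended estimates above, not real provider pricing):

```python
# Sanity-check the routing cost math (rates and split are illustrative estimates).
CALLS_PER_DAY = 3_000_000
RATES = {"fast": 0.002, "mid": 0.007, "power": 0.021}  # $/call
SPLIT = {"fast": 0.70, "mid": 0.10, "power": 0.20}     # traffic share

baseline = CALLS_PER_DAY * RATES["power"]  # everything on the power tier
routed = sum(CALLS_PER_DAY * SPLIT[t] * RATES[t] for t in RATES)
savings = baseline - routed

print(f"baseline: ${baseline:,.0f}/day")
print(f"routed:   ${routed:,.0f}/day")
print(f"savings:  ${savings:,.0f}/day ({savings / baseline:.0%})")
```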

Latency Budget Breakdown

| Component | Fast Model | Powerful Model |
| --- | --- | --- |
| Complexity Classification | <50ms | <50ms |
| Cache Check | <5ms | <5ms |
| Model Inference (TTFT) | 200-500ms | 500ms-2s |
| Token Generation | 300-800ms | 1-3s |
| Total | 500ms-1.3s | 1.5-5s |

3 High-Level Architecture

  LLM SERVING PLATFORM
  ═══════════════════════════════════════════════════════════════════

                         ┌───────────────────┐
                         │   API Gateway     │
                         │ Circuit Breaker   │
                         │ Retry + Cache     │
                         └────────┬──────────┘
                                  │
                         ┌────────v──────────┐
                         │  Model Router     │
                         │  (Complexity      │
                         │   Classifier)     │
                         └────────┬──────────┘
                                  │
              ┌───────────────────┼───────────────────┐
              │                   │                   │
      ┌────────v────────┐ ┌───────v────────┐ ┌────────v────────┐
      │   FAST TIER     │ │   MID TIER     │ │  POWER TIER     │
      │   (70%)         │ │   (10%)        │ │  (20%)          │
      │                 │ │                │ │                 │
      │ GPT-3.5-turbo   │ │ GPT-4o-mini    │ │ GPT-4 / Opus    │
      │ Claude Haiku    │ │ Claude Sonnet  │ │ Claude Opus     │
      │ Llama-70B       │ │                │ │                 │
      │                 │ │                │ │                 │
      │ ~500ms          │ │ ~1.5s          │ │ ~3s             │
      │ $0.002/call     │ │ $0.007/call    │ │ $0.021/call     │
      └─────────────────┘ └────────────────┘ └─────────────────┘
              │                   │                   │
              └───────────────────┼───────────────────┘
                                  │
                    ┌─────────────v──────────────┐
                    │    EVAL PIPELINE (Async)   │
                    │  Offline + Online + Human  │
                    └─────────────┬──────────────┘
                                  │
                    ┌─────────────v──────────────┐
                    │   ROLLBACK CONTROLLER      │
                     │  Auto-rollback if quality  │
                     │  drops >5% vs baseline     │
                    └────────────────────────────┘

4 Deep Dive 1: Model Router

Complexity Classifier

A lightweight classifier (<50ms) analyzes each request and routes to the appropriate model tier:

| Complexity Signal | Simple (Fast) | Complex (Powerful) |
| --- | --- | --- |
| Task Type | FAQ, status check, simple lookup | Multi-step reasoning, analysis, code gen |
| Token Count | <200 tokens expected | >500 tokens expected |
| Context Needed | Single source, no RAG | Multiple sources, complex RAG |
| Reasoning Depth | Direct answer, no chain-of-thought | Multi-hop reasoning required |
| Stakes | Informational, low consequence | Financial, compliance, customer-facing |

Routing Logic
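In production the design decisions below argue for an ML classifier, but the signal table above reduces to a simple rule sketch for illustration (signal names and thresholds here are assumptions, not a fixed schema):

```python
# Rule-based sketch of tier selection from the complexity signals above.
# A production router would replace this with the trained classifier.
from dataclasses import dataclass

@dataclass
class RequestSignals:
    task_type: str            # e.g. "faq", "code_gen", "analysis"
    expected_tokens: int      # rough output-length estimate
    needs_multi_source: bool  # complex RAG over several sources?
    needs_reasoning: bool     # multi-hop / chain-of-thought required?
    high_stakes: bool         # financial, compliance, customer-facing

def route(sig: RequestSignals) -> str:
    """Return 'fast', 'mid', or 'power' for a request."""
    # Complex signals go straight to the power tier.
    if sig.high_stakes or sig.needs_reasoning:
        return "power"
    if sig.needs_multi_source or sig.expected_tokens > 500:
        return "power"
    # Clearly simple requests (short, direct answers) stay on the fast tier.
    if sig.expected_tokens < 200:
        return "fast"
    # Medium complexity falls through to the mid tier.
    return "mid"
```

A heuristic like this also works as a fallback path when the ML classifier is unavailable or times out.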

Cost Savings Math

70% of requests are simple (password reset status, ticket update, basic lookup) and are handled accurately by the fast model at roughly 1/10th the cost. Only 20% truly need the powerful model; the remaining 10% are medium-complexity requests served by the mid tier. This routing alone saves roughly $16M/year.

5 Deep Dive 2: A/B Testing & Canary Deployment

Canary Rollout Stages

  CANARY DEPLOYMENT: New Model Rollout
  ═══════════════════════════════════════════════════════

  Stage 1: 5% traffic     │ Monitor for 2 hours
  ────────────────────────│ Metrics: error rate, latency, quality
  If healthy:             │
                          v
  Stage 2: 25% traffic    │ Monitor for 6 hours
  ────────────────────────│ Compare quality scores vs baseline
  If healthy:             │
                          v
  Stage 3: 50% traffic    │ Monitor for 24 hours
  ────────────────────────│ Statistical significance check
  If healthy:             │
                          v
  Stage 4: 100% traffic   │ Old model kept warm for 48h (rollback)
  ────────────────────────┘

  AUTO-ROLLBACK TRIGGERS:
  ─────────────────────────
  • Error rate > 2% (vs baseline)
  • Quality score drops > 5%
  • Latency P95 increases > 50%
  • Hallucination rate increases > 3%
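The stage ladder above can be expressed as data plus a single advance step (a sketch; stage layout comes from the diagram, but the health-check wiring and names are assumptions):

```python
# The canary ladder as data plus one advance step.
STAGES = [
    {"traffic_pct": 5,   "soak_hours": 2},   # monitor error rate, latency, quality
    {"traffic_pct": 25,  "soak_hours": 6},   # compare quality scores vs baseline
    {"traffic_pct": 50,  "soak_hours": 24},  # statistical significance check
    {"traffic_pct": 100, "soak_hours": 48},  # old model kept warm for rollback
]

def next_stage(current: int, healthy: bool) -> int:
    """Advance one stage after a healthy soak period.
    Returns the new stage index, or -1 to signal a full rollback."""
    if not healthy:
        return -1  # hand off to the rollback controller
    return min(current + 1, len(STAGES) - 1)
```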

A/B Testing Framework
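A common building block here is deterministic variant assignment: hash a stable request ID so the same user always lands in the same arm, and the canary percentage maps directly onto hash buckets. A minimal sketch (function and parameter names are illustrative):

```python
# Deterministic, sticky A/B assignment via hashing.
import hashlib

def assign_variant(user_id: str, experiment: str, canary_pct: int) -> str:
    """Return 'canary' for canary_pct% of users, 'baseline' otherwise."""
    # Salt with the experiment name so different experiments shuffle independently.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:8], 16) % 100  # stable bucket in [0, 100)
    return "canary" if bucket < canary_pct else "baseline"
```

Stickiness matters for LLM experiments: per-user quality signals (thumbs up/down) must be attributable to exactly one model, and ramping 5% → 25% only adds users to the canary arm without reshuffling existing ones.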

Auto-Rollback
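The triggers in the canary section reduce to a single predicate. A sketch with threshold values mirroring that list; metric names and the baseline snapshot are illustrative, and "error rate > 2%" is read here as an absolute rate:

```python
# Auto-rollback predicate over canary vs baseline metrics (illustrative names).
BASELINE = {"error_rate": 0.01, "quality": 0.92, "p95_ms": 3000, "halluc_rate": 0.02}

def should_rollback(canary: dict, base: dict = BASELINE) -> bool:
    return (
        canary["error_rate"] > 0.02                                        # error rate > 2%
        or (base["quality"] - canary["quality"]) / base["quality"] > 0.05  # quality drop > 5%
        or (canary["p95_ms"] - base["p95_ms"]) / base["p95_ms"] > 0.50     # P95 latency up > 50%
        or canary["halluc_rate"] - base["halluc_rate"] > 0.03              # hallucination up > 3%
    )
```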

6 Deep Dive 3: Evaluation Pipeline

Three-Tier Evaluation

  EVALUATION PIPELINE (3 TIERS)
  ═══════════════════════════════════════════════════════

  TIER 1: OFFLINE EVALUATION (before deployment)
  ┌───────────────────────────────────────────────┐
  │  1,000+ labeled test cases                    │
  │  Run new model on entire test suite           │
  │  Compare: accuracy, hallucination, latency    │
  │  Gate: must pass >= baseline on all metrics   │
  └───────────────────────────────────────────────┘
           │ passes
           v
  TIER 2: ONLINE EVALUATION (during canary)
  ┌───────────────────────────────────────────────┐
  │  Real traffic metrics:                        │
  │  • Task success rate (did agent complete?)    │
  │  • Thumbs up/down from users                  │
  │  • Error rate, timeout rate                   │
  │  • Response latency P50, P95, P99             │
  │  Compare new model vs baseline in real-time   │
  └───────────────────────────────────────────────┘
           │ passes
           v
  TIER 3: HUMAN EVALUATION (weekly)
  ┌───────────────────────────────────────────────┐
  │  100 random responses/week reviewed by humans │
  │  Grade: accuracy, helpfulness, safety         │
  │  Catch subtle issues ML metrics miss          │
  │  Feed back into offline test suite            │
  └───────────────────────────────────────────────┘
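Tier 1 reduces to "score the candidate on the labeled suite, then gate on baseline." A sketch, with exact-match standing in for the real graders and metric names assumed:

```python
# Offline eval: score a model over labeled cases, then apply the gate.
def evaluate(model_fn, test_cases):
    """model_fn: prompt -> answer; test_cases: list of (prompt, expected).
    Exact match is a stand-in for the real per-metric graders."""
    correct = sum(1 for prompt, expected in test_cases if model_fn(prompt) == expected)
    return correct / len(test_cases)

def passes_offline_gate(candidate_metrics: dict, baseline_metrics: dict) -> bool:
    # Gate: must meet or beat baseline on every metric
    # (metrics normalized so higher is better, e.g. 1 - hallucination rate).
    return all(candidate_metrics[m] >= baseline_metrics[m] for m in baseline_metrics)
```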

Key Metrics

| Metric | Target | Measurement |
| --- | --- | --- |
| Task Accuracy | >92% | Did the model produce the correct output on labeled test cases? |
| Hallucination Rate | <3% | LLM-as-judge: does the response contain claims not in the context? |
| Latency P50 | <1.5s | Median time to complete response |
| Latency P95 | <4s | 95th percentile — tail latency matters for UX |
| Latency P99 | <8s | 99th percentile — worst-case SLA |
| User Satisfaction | >85% positive | Thumbs up rate on responses where feedback is collected |
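The P50/P95/P99 targets assume some percentile definition; a nearest-rank sketch over raw latency samples:

```python
# Nearest-rank percentile over raw latency samples.
import math

def percentile(samples, p):
    """Nearest-rank percentile: p in (0, 100], samples non-empty."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```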

7 Key Design Decisions

| Decision | Choice | Why |
| --- | --- | --- |
| Routing approach | ML-based complexity classifier | Rules break on edge cases. ML adapts and improves with data. |
| Multi-provider | Abstract behind unified API | Swap providers without code changes. Negotiate better rates. |
| Caching strategy | Semantic cache (embedding similarity) | "What's my PTO balance?" and "How much PTO do I have?" should cache-hit. |
| Evaluation approach | 3-tier (offline + online + human) | Each tier catches different issues. Together they're comprehensive. |
| Rollback trigger | Automatic with configurable thresholds | Humans too slow. Bad model at 100% traffic = thousands of bad responses. |
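The semantic-cache decision can be sketched as "embed the query, return a cached answer when cosine similarity clears a threshold." The bag-of-words `embed()` below is a toy stand-in for a real embedding model; class and parameter names are illustrative:

```python
# Semantic cache sketch: embedding similarity lookup instead of exact match.
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; a real system uses a sentence-embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

class SemanticCache:
    def __init__(self, threshold: float = 0.8):
        self.threshold = threshold
        self.entries = []  # list of (query embedding, cached answer)

    def get(self, query: str):
        qv = embed(query)
        best = max(self.entries, key=lambda e: cosine(qv, e[0]), default=None)
        if best and cosine(qv, best[0]) >= self.threshold:
            return best[1]
        return None

    def put(self, query: str, answer: str):
        self.entries.append((embed(query), answer))
```

Note the stub only catches token overlap; a real embedding model is what makes true paraphrases like the PTO example collide. Production systems also use a vector index rather than the linear scan shown here.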

8 Cheat Sheet

LLM Serving & Evaluation — Key Numbers

  • 3M LLM calls/day, 2.1B tokens/day
  • 3-tier routing: Fast (70%), Mid (10%), Power (20%)
  • Cost with routing: $18,900/day vs $63,000/day without = 70% savings
  • Complexity classifier: <50ms, TF-Lite model
  • Canary stages: 5% → 25% → 50% → 100%
  • Auto-rollback: error >2%, quality drop >5%, latency P95 >50% increase
  • 3-tier eval: Offline (1000+ labeled), Online (real metrics), Human (100/week)
  • Hallucination target: <3%
  • Semantic cache for similar queries (not just exact match)
  • Multi-provider abstraction: swap models without code changes