LLM Model Serving & Evaluation Pipeline
FOR ML STAFF
Staff Level
40 min
"Serve multiple LLMs with intelligent model routing, A/B testing, canary deployments, evaluation pipeline, and auto-rollback capabilities."
2 Back-of-Envelope Estimation
Scale Numbers
- 3M LLM calls/day across all tenants
- Average 700 tokens/call = 2.1B tokens/day
- Peak: ~50 calls/second (average ~35/s)
Cost Savings with Intelligent Routing
WITHOUT routing (all powerful model):
3M calls x $0.021/call = $63,000/day
WITH routing (70% fast, 10% mid, 20% powerful):
2.1M x $0.002 + 0.3M x $0.007 + 0.6M x $0.021 = $18,900/day
Savings: 70% ($44,100/day ≈ $16M/year)
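The estimate above can be sketched as a small cost model; the per-call prices and the 70/10/20 split are the figures from this section.

```python
# Back-of-envelope cost model for tiered LLM routing.
DAILY_CALLS = 3_000_000
COST = {"fast": 0.002, "mid": 0.007, "power": 0.021}   # $/call per tier
SPLIT = {"fast": 0.70, "mid": 0.10, "power": 0.20}     # traffic share per tier

def daily_cost(split: dict[str, float]) -> float:
    """Blended daily cost given a tier traffic split."""
    return sum(DAILY_CALLS * share * COST[tier] for tier, share in split.items())

baseline = DAILY_CALLS * COST["power"]  # everything on the powerful model
routed = daily_cost(SPLIT)
savings = baseline - routed

print(f"baseline ${baseline:,.0f}/day, routed ${routed:,.0f}/day")
print(f"savings ${savings:,.0f}/day ({savings / baseline:.0%})")
```

Running this prints a $63,000/day baseline against an $18,900/day routed cost, a 70% saving.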
Latency Budget Breakdown
| Component | Fast Model | Powerful Model |
|---|---|---|
| Complexity Classification | <50ms | <50ms |
| Cache Check | <5ms | <5ms |
| Model Inference (TTFT) | 200-500ms | 500ms-2s |
| Token Generation | 300-800ms | 1-3s |
| Total | 500ms-1.3s | 1.5-5s |
3 High-Level Architecture
LLM SERVING PLATFORM
═══════════════════════════════════════════════════════════════════
┌───────────────────┐
│ API Gateway │
│ Circuit Breaker │
│ Retry + Cache │
└────────┬──────────┘
│
┌────────v──────────┐
│ Model Router │
│ (Complexity │
│ Classifier) │
└────────┬──────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌────────v────────┐ ┌───────v────────┐ ┌────────v────────┐
│ FAST TIER │ │ MID TIER │ │ POWER TIER │
│ (70%) │ │ (10%) │ │ (20%) │
│ │ │ │ │ │
│ GPT-3.5-turbo │ │ GPT-4o-mini │ │ GPT-4 / Opus │
│ Claude Haiku │ │ Claude Sonnet │ │ Claude Opus │
│ Llama-70B │ │ │ │ │
│ │ │ │ │ │
│ ~500ms │ │ ~1.5s │ │ ~3s │
│ $0.002/call │ │ $0.007/call │ │ $0.021/call │
└─────────────────┘ └────────────────┘ └─────────────────┘
│ │ │
└───────────────────┼───────────────────┘
│
┌─────────────v──────────────┐
│ EVAL PIPELINE (Async) │
│ Offline + Online + Human │
└─────────────┬──────────────┘
│
┌─────────────v──────────────┐
│ ROLLBACK CONTROLLER │
│ Auto-rollback if quality │
│ drops >5% vs baseline │
└────────────────────────────┘
4 Deep Dive 1: Model Router
Complexity Classifier
A lightweight classifier (<50ms) analyzes each request and routes it to the appropriate model tier:
| Complexity Signal | Simple (Fast) | Complex (Powerful) |
|---|---|---|
| Task Type | FAQ, status check, simple lookup | Multi-step reasoning, analysis, code gen |
| Token Count | <200 tokens expected | >500 tokens expected |
| Context Needed | Single source, no RAG | Multiple sources, complex RAG |
| Reasoning Depth | Direct answer, no chain-of-thought | Multi-hop reasoning required |
| Stakes | Informational, low consequence | Financial, compliance, customer-facing |
Routing Logic
- Primary classification: TF-Lite model trained on (query, best_model) pairs from historical data. Features: query length, intent, entity count, conversation depth.
- Fallback on low confidence: If classifier confidence <0.7, default to mid-tier model (safe middle ground).
- Manual override: Tenant config can force specific models for specific operations (e.g., "always use powerful for legal reviews").
- Upgrade path: If the fast model returns a low-quality response (detected by eval), automatically retry with the powerful model. The user sees slightly higher latency but gets a better answer.
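The routing rules above can be sketched as follows. The classifier call, the tenant override table, and the helper names (`classify`, `route_request`, `TENANT_OVERRIDES`) are illustrative stand-ins, not a real API; only the 0.7 confidence threshold and the tier names come from this section.

```python
# Sketch of the model-routing logic: tenant override first, then the
# complexity classifier, with a mid-tier fallback on low confidence.
CONFIDENCE_THRESHOLD = 0.7
TENANT_OVERRIDES = {("acme", "legal_review"): "power"}  # hypothetical tenant config

def classify(query: str) -> tuple[str, float]:
    """Stand-in for the TF-Lite complexity classifier (<50ms budget)."""
    # Crude heuristic for illustration only: long queries look complex.
    if len(query.split()) > 50:
        return "power", 0.9
    return "fast", 0.85

def route_request(tenant_id: str, operation: str, query: str) -> str:
    # 1. Manual override wins: tenants can pin operations to a tier.
    override = TENANT_OVERRIDES.get((tenant_id, operation))
    if override:
        return override
    # 2. Otherwise the classifier decides; low confidence falls back
    #    to the mid tier as the safe middle ground.
    tier, confidence = classify(query)
    if confidence < CONFIDENCE_THRESHOLD:
        return "mid"
    return tier
```

The override check runs before the classifier so that tenant policy (e.g. "always use powerful for legal reviews") can never be undone by a misclassification.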
Cost Savings Math
70% of requests are simple (password reset status, ticket update, basic lookup) and are answered well by the fast model at roughly 1/10th the cost of the powerful model. Only 20% truly need the powerful model; the remaining 10% are medium-complexity requests handled by the mid tier. This routing alone saves roughly $16M/year.
5 Deep Dive 2: A/B Testing & Canary Deployment
Canary Rollout Stages
CANARY DEPLOYMENT: New Model Rollout
═══════════════════════════════════════════════════════
Stage 1: 5% traffic │ Monitor for 2 hours
────────────────────────│ Metrics: error rate, latency, quality
If healthy: │
v
Stage 2: 25% traffic │ Monitor for 6 hours
────────────────────────│ Compare quality scores vs baseline
If healthy: │
v
Stage 3: 50% traffic │ Monitor for 24 hours
────────────────────────│ Statistical significance check
If healthy: │
v
Stage 4: 100% traffic │ Old model kept warm for 48h (rollback)
────────────────────────┘
AUTO-ROLLBACK TRIGGERS:
─────────────────────────
• Error rate > 2% (vs baseline)
• Quality score drops > 5%
• Latency P95 increases > 50%
• Hallucination rate increases > 3%
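The trigger list above can be sketched as a single predicate. The `Metrics` fields and the interpretation of the error-rate and hallucination triggers as deltas vs baseline are assumptions; the thresholds themselves (2%, 5%, 50%, 3%) are the ones listed.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float          # fraction of failed requests
    quality_score: float       # eval quality score, 0-1
    latency_p95_ms: float
    hallucination_rate: float  # fraction of responses flagged as hallucinating

def should_rollback(canary: Metrics, baseline: Metrics) -> bool:
    """True if any auto-rollback trigger fires; rollback is all-or-nothing."""
    return any([
        canary.error_rate - baseline.error_rate > 0.02,              # error rate >2% over baseline
        (baseline.quality_score - canary.quality_score)
            / baseline.quality_score > 0.05,                         # quality drops >5%
        (canary.latency_p95_ms - baseline.latency_p95_ms)
            / baseline.latency_p95_ms > 0.50,                        # P95 latency up >50%
        canary.hallucination_rate - baseline.hallucination_rate > 0.03,  # hallucinations up >3%
    ])
```

Using `any` makes each trigger independent: one bad signal is enough to shift traffic back to baseline without waiting for the others.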
A/B Testing Framework
- Traffic splitting: Consistent hashing by (tenant_id + user_id) ensures same user always hits same variant. Prevents confusing UX from model switching.
- Statistical significance: Require p-value <0.05 and minimum 1,000 samples per variant before drawing conclusions.
- Metrics collected: Task completion rate, user satisfaction (thumbs up/down), response latency, cost per request, hallucination rate.
- Experiment duration: Minimum 7 days to account for weekly patterns. Auto-stop if degradation detected.
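The sticky traffic split can be sketched with a stable hash of (tenant_id + user_id); this is a minimal stand-in for the consistent-hashing split described above, and the function name and bucket granularity are illustrative.

```python
import hashlib

def assign_variant(tenant_id: str, user_id: str, canary_pct: float) -> str:
    """Deterministically map a user to 'canary' or 'baseline'.

    The same (tenant_id, user_id) always lands in the same bucket, so a
    user never flips between variants mid-experiment.
    """
    key = f"{tenant_id}:{user_id}".encode()
    # Stable cryptographic hash -> uniform bucket in [0, 1)
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000 / 10_000
    return "canary" if bucket < canary_pct else "baseline"
```

Because each user's bucket is fixed, ramping `canary_pct` through the 5% → 25% → 50% → 100% stages only ever moves users from baseline into canary, never back and forth.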
Auto-Rollback
- Real-time monitoring: Datadog/Prometheus tracks all metrics per model variant. Alerts within 5 minutes of degradation.
- Automatic rollback: If any auto-rollback trigger fires, immediately shift all traffic back to baseline model. No human intervention needed.
- Post-mortem: After rollback, run offline evaluation to understand WHY the new model degraded. Was it a specific query type? A specific tenant?
6 Deep Dive 3: Evaluation Pipeline
Three-Tier Evaluation
EVALUATION PIPELINE (3 TIERS)
═══════════════════════════════════════════════════════
TIER 1: OFFLINE EVALUATION (before deployment)
┌──────────────────────────────────────────────┐
│ 1,000+ labeled test cases │
│ Run new model on entire test suite │
│ Compare: accuracy, hallucination, latency │
│ Gate: must pass >= baseline on all metrics │
└──────────────────────────────────────────────┘
│ passes
v
TIER 2: ONLINE EVALUATION (during canary)
┌──────────────────────────────────────────────┐
│ Real traffic metrics: │
│ • Task success rate (did agent complete?) │
│ • Thumbs up/down from users │
│ • Error rate, timeout rate │
│ • Response latency P50, P95, P99 │
│ Compare new model vs baseline in real-time │
└──────────────────────────────────────────────┘
│ passes
v
TIER 3: HUMAN EVALUATION (weekly)
┌──────────────────────────────────────────────┐
│ 100 random responses/week reviewed by humans │
│ Grade: accuracy, helpfulness, safety │
│ Catch subtle issues ML metrics miss │
│ Feed back into offline test suite │
└──────────────────────────────────────────────┘
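The Tier 1 gate ("must pass >= baseline on all metrics") can be sketched as below; the metric names and the higher/lower-is-better split are illustrative assumptions.

```python
# Sketch of the offline evaluation gate: a candidate model must match or
# beat the baseline on every tracked metric before entering the canary.
HIGHER_IS_BETTER = {"accuracy"}
LOWER_IS_BETTER = {"hallucination_rate", "latency_p95_ms"}

def passes_offline_gate(candidate: dict, baseline: dict) -> bool:
    """True only if the candidate is >= baseline on all metrics."""
    for metric in HIGHER_IS_BETTER:
        if candidate[metric] < baseline[metric]:
            return False
    for metric in LOWER_IS_BETTER:
        if candidate[metric] > baseline[metric]:
            return False
    return True
```

A single regression blocks deployment; there is no trading off a better accuracy against a worse hallucination rate.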
Key Metrics
| Metric | Target | Measurement |
|---|---|---|
| Task Accuracy | >92% | Did the model produce the correct output on labeled test cases? |
| Hallucination Rate | <3% | LLM-as-judge: does the response contain claims not in the context? |
| Latency P50 | <1.5s | Median time to complete response |
| Latency P95 | <4s | 95th percentile; tail latency matters for UX |
| Latency P99 | <8s | 99th percentile; worst-case SLA |
| User Satisfaction | >85% positive | Thumbs-up rate on responses where feedback is collected |
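Checking measured latencies against the P50/P95/P99 targets above can be sketched as follows; the nearest-rank percentile and the function names are illustrative choices, while the millisecond limits come from the table.

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0-100) of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

# Targets from the metrics table: P50 <1.5s, P95 <4s, P99 <8s.
TARGETS_MS = {50: 1500, 95: 4000, 99: 8000}

def within_sla(latencies_ms: list[float]) -> bool:
    """True if every tracked percentile is under its target."""
    return all(percentile(latencies_ms, p) < limit
               for p, limit in TARGETS_MS.items())
```

All three percentiles must pass together: a fast median does not excuse a blown P99 tail.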