LLM Model Serving & Evaluation Pipeline
FOR ML STAFF
Staff Level
40 min
"Serve multiple LLMs with intelligent model routing, A/B testing, canary deployments, evaluation pipeline, and auto-rollback capabilities."
2 Back-of-Envelope Estimation
Scale Numbers
- 3M LLM calls/day across all tenants
- Average 700 tokens/call = 2.1B tokens/day
- Peak: ~50 calls/second (average ~35/s)
Cost Savings with Intelligent Routing
WITHOUT routing (all powerful model):
3M calls x $0.021/call = $63,000/day
WITH routing (70% fast, 10% mid, 20% powerful):
2.1M x $0.002 + 0.3M x $0.007 + 0.6M x $0.021 = $18,900/day
Savings: 70% ($44,100/day ≈ $16M/year)
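The estimate above can be sketched as a small cost model; the per-call prices and the 70/10/20 split are the figures from this section.

```python
# Back-of-envelope cost model for tiered LLM routing.
DAILY_CALLS = 3_000_000
COST = {"fast": 0.002, "mid": 0.007, "power": 0.021}   # $/call per tier
SPLIT = {"fast": 0.70, "mid": 0.10, "power": 0.20}     # traffic share per tier

def daily_cost(split: dict[str, float]) -> float:
    """Blended daily cost given a tier traffic split."""
    return sum(DAILY_CALLS * share * COST[tier] for tier, share in split.items())

baseline = DAILY_CALLS * COST["power"]  # everything on the powerful model
routed = daily_cost(SPLIT)
savings = baseline - routed

print(f"baseline ${baseline:,.0f}/day, routed ${routed:,.0f}/day")
print(f"savings ${savings:,.0f}/day ({savings / baseline:.0%})")
```

Running this prints a $63,000/day baseline against an $18,900/day routed cost, a 70% saving.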
Latency Budget Breakdown
| Component | Fast Model | Powerful Model |
|---|---|---|
| Complexity Classification | <50ms | <50ms |
| Cache Check | <5ms | <5ms |
| Model Inference (TTFT) | 200-500ms | 500ms-2s |
| Token Generation | 300-800ms | 1-3s |
| Total | 500ms-1.3s | 1.5-5s |
3 High-Level Architecture
LLM SERVING PLATFORM
═══════════════════════════════════════════════════════════════════
┌───────────────────┐
│ API Gateway │
│ Circuit Breaker │
│ Retry + Cache │
└────────┬──────────┘
│
┌────────v──────────┐
│ Model Router │
│ (Complexity │
│ Classifier) │
└────────┬──────────┘
│
┌───────────────────┼───────────────────┐
│ │ │
┌────────v────────┐ ┌───────v────────┐ ┌────────v────────┐
│ FAST TIER │ │ MID TIER │ │ POWER TIER │
│ (70%) │ │ (10%) │ │ (20%) │
│ │ │ │ │ │
│ GPT-3.5-turbo │ │ GPT-4o-mini │ │ GPT-4 / Opus │
│ Claude Haiku │ │ Claude Sonnet │ │ Claude Opus │
│ Llama-70B │ │ │ │ │
│ │ │ │ │ │
│ ~500ms │ │ ~1.5s │ │ ~3s │
│ $0.002/call │ │ $0.007/call │ │ $0.021/call │
└─────────────────┘ └────────────────┘ └─────────────────┘
│ │ │
└───────────────────┼───────────────────┘
│
┌─────────────v──────────────┐
│ EVAL PIPELINE (Async) │
│ Offline + Online + Human │
└─────────────┬──────────────┘
│
┌─────────────v──────────────┐
│ ROLLBACK CONTROLLER │
│ Auto-rollback if quality │
│ drops >5% vs baseline │
└────────────────────────────┘
4 Deep Dive 1: Model Router
Complexity Classifier
A lightweight classifier (<50ms) analyzes each request and routes it to the appropriate model tier:
| Complexity Signal | Simple (Fast) | Complex (Powerful) |
|---|---|---|
| Task Type | FAQ, status check, simple lookup | Multi-step reasoning, analysis, code gen |
| Token Count | <200 tokens expected | >500 tokens expected |
| Context Needed | Single source, no RAG | Multiple sources, complex RAG |
| Reasoning Depth | Direct answer, no chain-of-thought | Multi-hop reasoning required |
| Stakes | Informational, low consequence | Financial, compliance, customer-facing |
Routing Logic
- Primary classification: TF-Lite model trained on (query, best_model) pairs from historical data. Features: query length, intent, entity count, conversation depth.
- Fallback on low confidence: If classifier confidence <0.7, default to mid-tier model (safe middle ground).
- Manual override: Tenant config can force specific models for specific operations (e.g., "always use powerful for legal reviews").
- Upgrade path: If the fast model returns a low-quality response (detected by eval), automatically retry with the powerful model. The user sees slightly higher latency but gets a better answer.
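The routing rules above can be sketched as follows. The classifier call, the tenant override table, and the helper names (`classify`, `route_request`, `TENANT_OVERRIDES`) are illustrative stand-ins, not a real API; only the 0.7 confidence threshold and the tier names come from this section.

```python
# Sketch of the model-routing logic: tenant override first, then the
# complexity classifier, with a mid-tier fallback on low confidence.
CONFIDENCE_THRESHOLD = 0.7
TENANT_OVERRIDES = {("acme", "legal_review"): "power"}  # hypothetical tenant config

def classify(query: str) -> tuple[str, float]:
    """Stand-in for the TF-Lite complexity classifier (<50ms budget)."""
    # Crude heuristic for illustration only: long queries look complex.
    if len(query.split()) > 50:
        return "power", 0.9
    return "fast", 0.85

def route_request(tenant_id: str, operation: str, query: str) -> str:
    # 1. Manual override wins: tenants can pin operations to a tier.
    override = TENANT_OVERRIDES.get((tenant_id, operation))
    if override:
        return override
    # 2. Otherwise the classifier decides; low confidence falls back
    #    to the mid tier as the safe middle ground.
    tier, confidence = classify(query)
    if confidence < CONFIDENCE_THRESHOLD:
        return "mid"
    return tier
```

The override check runs before the classifier so that tenant policy (e.g. "always use powerful for legal reviews") can never be undone by a misclassification.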
Cost Savings Math
70% of requests are simple (password reset status, ticket update, basic lookup) and are answered well by the fast model at roughly 1/10th the cost of the powerful model. Only 20% truly need the powerful model; the remaining 10% are medium-complexity requests handled by the mid tier. This routing alone saves roughly $16M/year.
5 Deep Dive 2: A/B Testing & Canary Deployment
Canary Rollout Stages
CANARY DEPLOYMENT: New Model Rollout
═══════════════════════════════════════════════════════
Stage 1: 5% traffic │ Monitor for 2 hours
────────────────────────│ Metrics: error rate, latency, quality
If healthy: │
v
Stage 2: 25% traffic │ Monitor for 6 hours
────────────────────────│ Compare quality scores vs baseline
If healthy: │
v
Stage 3: 50% traffic │ Monitor for 24 hours
────────────────────────│ Statistical significance check
If healthy: │
v
Stage 4: 100% traffic │ Old model kept warm for 48h (rollback)
────────────────────────┘
AUTO-ROLLBACK TRIGGERS:
─────────────────────────
• Error rate > 2% (vs baseline)
• Quality score drops > 5%
• Latency P95 increases > 50%
• Hallucination rate increases > 3%
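The trigger list above can be sketched as a single predicate. The `Metrics` fields and the interpretation of the error-rate and hallucination triggers as deltas vs baseline are assumptions; the thresholds themselves (2%, 5%, 50%, 3%) are the ones listed.

```python
from dataclasses import dataclass

@dataclass
class Metrics:
    error_rate: float          # fraction of failed requests
    quality_score: float       # eval quality score, 0-1
    latency_p95_ms: float
    hallucination_rate: float  # fraction of responses flagged as hallucinating

def should_rollback(canary: Metrics, baseline: Metrics) -> bool:
    """True if any auto-rollback trigger fires; rollback is all-or-nothing."""
    return any([
        canary.error_rate - baseline.error_rate > 0.02,              # error rate >2% over baseline
        (baseline.quality_score - canary.quality_score)
            / baseline.quality_score > 0.05,                         # quality drops >5%
        (canary.latency_p95_ms - baseline.latency_p95_ms)
            / baseline.latency_p95_ms > 0.50,                        # P95 latency up >50%
        canary.hallucination_rate - baseline.hallucination_rate > 0.03,  # hallucinations up >3%
    ])
```

Using `any` makes each trigger independent: one bad signal is enough to shift traffic back to baseline without waiting for the others.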
A/B Testing Framework
- Traffic splitting: Consistent hashing by (tenant_id + user_id) ensures same user always hits same variant. Prevents confusing UX from model switching.
- Statistical significance: Require p-value <0.05 and minimum 1,000 samples per variant before drawing conclusions.
- Metrics collected: Task completion rate, user satisfaction (thumbs up/down), response latency, cost per request, hallucination rate.
- Experiment duration: Minimum 7 days to account for weekly patterns. Auto-stop if degradation detected.
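The sticky traffic split can be sketched with a stable hash of (tenant_id + user_id); this is a minimal stand-in for the consistent-hashing split described above, and the function name and bucket granularity are illustrative.

```python
import hashlib

def assign_variant(tenant_id: str, user_id: str, canary_pct: float) -> str:
    """Deterministically map a user to 'canary' or 'baseline'.

    The same (tenant_id, user_id) always lands in the same bucket, so a
    user never flips between variants mid-experiment.
    """
    key = f"{tenant_id}:{user_id}".encode()
    # Stable cryptographic hash -> uniform bucket in [0, 1)
    bucket = int(hashlib.sha256(key).hexdigest(), 16) % 10_000 / 10_000
    return "canary" if bucket < canary_pct else "baseline"
```

Because each user's bucket is fixed, ramping `canary_pct` through the 5% → 25% → 50% → 100% stages only ever moves users from baseline into canary, never back and forth.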
Auto-Rollback
- Real-time monitoring: Datadog/Prometheus tracks all metrics per model variant. Alerts within 5 minutes of degradation.
- Automatic rollback: If any auto-rollback trigger fires, immediately shift all traffic back to baseline model. No human intervention needed.
- Post-mortem: After rollback, run offline evaluation to understand WHY the new model degraded. Was it a specific query type? A specific tenant?
6 Deep Dive 3: Evaluation Pipeline
Three-Tier Evaluation
EVALUATION PIPELINE (3 TIERS)
═══════════════════════════════════════════════════════
TIER 1: OFFLINE EVALUATION (before deployment)
┌──────────────────────────────────────────────┐
│ 1,000+ labeled test cases │
│ Run new model on entire test suite │
│ Compare: accuracy, hallucination, latency │
│ Gate: must pass >= baseline on all metrics │
└──────────────────────────────────────────────┘
│ passes
v
TIER 2: ONLINE EVALUATION (during canary)
┌──────────────────────────────────────────────┐
│ Real traffic metrics: │
│ • Task success rate (did agent complete?) │
│ • Thumbs up/down from users │
│ • Error rate, timeout rate │
│ • Response latency P50, P95, P99 │
│ Compare new model vs baseline in real-time │
└──────────────────────────────────────────────┘
│ passes
v
TIER 3: HUMAN EVALUATION (weekly)
┌──────────────────────────────────────────────┐
│ 100 random responses/week reviewed by humans │
│ Grade: accuracy, helpfulness, safety │
│ Catch subtle issues ML metrics miss │
│ Feed back into offline test suite │
└──────────────────────────────────────────────┘
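The Tier 1 gate ("must pass >= baseline on all metrics") can be sketched as below; the metric names and the higher/lower-is-better split are illustrative assumptions.

```python
# Sketch of the offline evaluation gate: a candidate model must match or
# beat the baseline on every tracked metric before entering the canary.
HIGHER_IS_BETTER = {"accuracy"}
LOWER_IS_BETTER = {"hallucination_rate", "latency_p95_ms"}

def passes_offline_gate(candidate: dict, baseline: dict) -> bool:
    """True only if the candidate is >= baseline on all metrics."""
    for metric in HIGHER_IS_BETTER:
        if candidate[metric] < baseline[metric]:
            return False
    for metric in LOWER_IS_BETTER:
        if candidate[metric] > baseline[metric]:
            return False
    return True
```

A single regression blocks deployment; there is no trading off a better accuracy against a worse hallucination rate.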
Key Metrics
| Metric | Target | Measurement |
|---|---|---|
| Task Accuracy | >92% | Did the model produce the correct output on labeled test cases? |
| Hallucination Rate | <3% | LLM-as-judge: does the response contain claims not in the context? |
| Latency P50 | <1.5s | Median time to complete response |
| Latency P95 | <4s | 95th percentile; tail latency matters for UX |
| Latency P99 | <8s | 99th percentile; worst-case SLA |
| User Satisfaction | >85% positive | Thumbs-up rate on responses where feedback is collected |
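Checking measured latencies against the P50/P95/P99 targets above can be sketched as follows; the nearest-rank percentile and the function names are illustrative choices, while the millisecond limits come from the table.

```python
def percentile(samples: list[float], p: float) -> float:
    """Nearest-rank percentile (p in 0-100) of a non-empty sample list."""
    ordered = sorted(samples)
    rank = max(0, int(round(p / 100 * len(ordered))) - 1)
    return ordered[rank]

# Targets from the metrics table: P50 <1.5s, P95 <4s, P99 <8s.
TARGETS_MS = {50: 1500, 95: 4000, 99: 8000}

def within_sla(latencies_ms: list[float]) -> bool:
    """True if every tracked percentile is under its target."""
    return all(percentile(latencies_ms, p) < limit
               for p, limit in TARGETS_MS.items())
```

All three percentiles must pass together: a fast median does not excuse a blown P99 tail.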