"Design a system that routes IT tickets to the correct assignment group (out of hundreds) with >95% accuracy. Handle small data per customer and learn across organizations."
Companies have 50-500 assignment groups. Manual triage takes 5-15 minutes per ticket. Misroutes add DAYS to resolution. The key challenge: each customer has too few tickets to train a good model alone. The answer is COLLECTIVE LEARNING.
| Dimension | Clarification | Assumption |
|---|---|---|
| Assignment Groups | How many groups per customer? | 50-500 groups per customer |
| Ticket Volume | Daily ticket volume per customer? | 100-10K tickets/day per customer |
| Cross-Customer Learning | Can we learn patterns across orgs? | Yes — Collective Learning (share weights, not data) |
| Input Fields | What ticket data is available? | All fields: short desc, description, category, subcategory, priority, department, location |
| Low Confidence | What happens when the model is unsure? | Low-confidence tickets are routed to a human for manual triage |
THE MAIN CHALLENGE IS SMALL DATA. A single organization with 2,000 employees might have only 1,000 training examples across 50 groups. That's just 20 tickets per group on average — far too few for any ML model to learn reliable patterns. The solution: COLLECTIVE LEARNING across all customers.
Think of it this way: individually, each customer has too little data. But collectively, 350 customers generate millions of tickets. The challenge is learning shared patterns (like "password" relates to "access management") while respecting that each customer's group names and routing rules are different.
| Confidence Band | Threshold | Action |
|---|---|---|
| HIGH | >0.95 | Auto-route immediately (no human in loop) |
| MEDIUM | 0.70 - 0.95 | Flag for human review with suggestion |
| LOW | <0.70 | Route to manual triage queue |
TICKET TRIAGE PIPELINE — 4 LAYERS
═══════════════════════════════════════════════════════════════════
LAYER 1: FEATURE EXTRACTION
┌──────────────────────────────────────────────────────────────┐
│  Incoming Ticket                                             │
│  ┌─────────┬────────────┬──────┬─────────┬────────┬───────┐  │
│  │ Short   │ Description│ Cat  │ SubCat  │Priority│ Dept  │  │
│  │ Desc    │            │      │         │        │ + Loc │  │
│  └────┬────┴─────┬──────┴──┬───┴────┬────┴───┬────┴───┬───┘  │
│       └──────────┴─────────┴────────┴────────┴────────┘      │
│                          ALL FIELDS                          │
└──────────────────────────────┬───────────────────────────────┘
                               │
LAYER 2: BERT ENCODER          v
┌──────────────────────────────────────────────────────────────┐
│  [SHORT] Cannot connect to VPN [DESC] Getting timeout error  │
│  when trying to access corporate VPN from home [CAT] Network │
│  [SUBCAT] VPN [PRIORITY] P2 [DEPT] Engineering [LOC] Remote  │
│                                                              │
│  Pre-trained BERT (shared across ALL customers)              │
│  + Fine-tuned classification head (PER customer)             │
└──────────────────────────────┬───────────────────────────────┘
                               │
LAYER 3: CONFIDENCE ROUTER     v
┌──────────────────────────────────────────────────────────────┐
│  ┌──────────┐   ┌──────────────┐   ┌──────────────────────┐  │
│  │  >0.95   │   │  0.70-0.95   │   │        <0.70         │  │
│  │AUTO-ROUTE│   │FLAG + SUGGEST│   │    MANUAL TRIAGE     │  │
│  │ (60-70%) │   │   (20-25%)   │   │       (5-10%)        │  │
│  └──────────┘   └──────────────┘   └──────────────────────┘  │
└──────────────────────────────┬───────────────────────────────┘
                               │
LAYER 4: FEEDBACK LOOP         v
┌──────────────────────────────────────────────────────────────┐
│  Human corrections → Labeled data → Nightly retrain          │
│  Misroutes tracked → Per-group accuracy dashboard            │
└──────────────────────────────────────────────────────────────┘
Classical ML approaches failed because they used only 1-2 fields (typically just the short description). The breakthrough insight is that ALL fields matter, especially structured fields like department and location.
TOKENIZED INPUT TO BERT:
[SHORT] Cannot connect to VPN
[DESC] Getting timeout error when trying to access corporate VPN
from home office. Started after laptop update yesterday.
[CAT] Network
[SUBCAT] VPN
[PRIORITY] P2
[DEPT] Engineering
[LOC] Remote
CRITICAL EXAMPLE — Why ALL Fields Matter:
The exact same ticket text "Cannot connect to VPN" routes to DIFFERENT groups depending on context:
Engineering + Remote → routes to "Network Security" (they manage VPN certificates for remote engineers)
Sales + London → routes to "EMEA Desktop Support" (regional team handles office connectivity)
Without department and location, the two tickets look identical to the model, so it has to send both to the same group and is guaranteed to misroute at least one of them.
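The tokenized input and the VPN example above can be sketched as a small serializer. This is a minimal illustration, not the production encoder: the `Ticket` dataclass, the `serialize_ticket` helper, and the choice to skip empty fields are all assumptions made for the example. With a real BERT tokenizer, the bracketed markers would typically also need to be registered as special tokens.

```python
from dataclasses import dataclass


@dataclass
class Ticket:
    """Hypothetical container for the ticket fields listed in the text."""
    short_description: str
    description: str = ""
    category: str = ""
    subcategory: str = ""
    priority: str = ""
    department: str = ""
    location: str = ""


# Field-to-marker mapping, in the order used by the tokenized input above.
MARKERS = {
    "short_description": "[SHORT]",
    "description": "[DESC]",
    "category": "[CAT]",
    "subcategory": "[SUBCAT]",
    "priority": "[PRIORITY]",
    "department": "[DEPT]",
    "location": "[LOC]",
}


def serialize_ticket(ticket: Ticket) -> str:
    """Join every non-empty field into one marker-prefixed string."""
    parts = []
    for name, marker in MARKERS.items():
        value = " ".join(getattr(ticket, name).split())  # collapse whitespace
        if value:  # assumption: empty fields are skipped, not emitted blank
            parts.append(f"{marker} {value}")
    return " ".join(parts)


engineering = Ticket("Cannot connect to VPN", category="Network", subcategory="VPN",
                     priority="P2", department="Engineering", location="Remote")
sales = Ticket("Cannot connect to VPN", category="Network", subcategory="VPN",
               priority="P2", department="Sales", location="London")

print(serialize_ticket(engineering))
print(serialize_ticket(sales))
```

The two printed strings share the same short description but differ in the `[DEPT]` and `[LOC]` segments, which is the only signal the model has for sending them to different groups.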
| Feature | Impact | Why |
|---|---|---|
| Short Description | HIGH | Core intent signal — what the user needs |
| Department | HIGH | Determines which team variant handles it |
| Location | HIGH | Regional routing (APAC vs EMEA vs Americas) |
| Category / SubCategory | MEDIUM | Pre-classification signal (if available) |
| Description | MEDIUM | Additional context, but noisy and verbose |
| Priority | LOW-MEDIUM | Some groups only handle P1s (e.g., "Major Incident") |
THE SMALL DATA PROBLEM
═══════════════════════════════════════════════════
Single Customer (Acme Corp):
┌─────────────────────────────────────────────────┐
│ 50 assignment groups                            │
│ × 20 tickets per group (average)                │
│ = 1,000 total training examples                 │
│                                                 │
│ That's like trying to teach someone 50 topics   │
│ with only 20 flashcards each. NOT ENOUGH.       │
└─────────────────────────────────────────────────┘
All Customers Combined:
┌─────────────────────────────────────────────────┐
│ 350 customers × 1,000 tickets/day               │
│ × 365 days = 127.75 MILLION tickets/year        │
│                                                 │
│ PLENTY of data to learn that "password"         │
│ relates to "access management" concepts.        │
└─────────────────────────────────────────────────┘
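The arithmetic behind the two boxes can be checked in a few lines. All figures come from the text; the only assumption is that 1,000 tickets/day is treated as a per-customer average.

```python
# Figures taken from the text above.
groups_per_customer = 50
tickets_per_group = 20
customers = 350
tickets_per_day = 1_000  # assumed to be an average per customer

single_customer_examples = groups_per_customer * tickets_per_group
pooled_tickets_per_year = customers * tickets_per_day * 365

print(single_customer_examples)        # 1000
print(pooled_tickets_per_year)         # 127750000
print(pooled_tickets_per_year / 1e6)   # 127.75
```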
1. Pre-train BERT on ALL customers' data
The shared BERT base learns universal IT patterns across all 350 customers. It learns that "password" relates to "access", "VPN" relates to "network", "printer" relates to "hardware". These are universal IT concepts that transfer across organizations.
2. Fine-tune per customer with customer-specific classification head
Each customer gets their own classification head (final layers) that maps the shared representations to THEIR specific assignment groups:
SHARED BERT BASE (trained on ALL customers)
┌──────────────────────────────────────────────┐
│ "password" → [access_concept_vector]         │
│ "VPN"      → [network_concept_vector]        │
│ "printer"  → [hardware_concept_vector]       │
└──────────────────────┬───────────────────────┘
                       │
        ┌──────────────┼──────────────┐
        │              │              │
┌───────v──────┐ ┌─────v────────┐ ┌───v──────────────┐
│ Acme Head    │ │ Beta Head    │ │ Gamma Head       │
│              │ │              │ │                  │
│ "password" → │ │ "password" → │ │ "password" →     │
│ Identity Team│ │ IAM Group    │ │ Access Mgmt Team │
└──────────────┘ └──────────────┘ └──────────────────┘
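The split between one shared encoder and many per-customer heads can be illustrated with a toy sketch. It is deliberately not BERT: `SHARED_CONCEPTS` is a hand-written keyword lookup standing in for the learned shared representation, and `CustomerHead` is a simple vote counter standing in for a fine-tuned classification layer. Both names, and the "Desktop Support" label used for Acme, are hypothetical. The point is only the structure: every customer calls the same `shared_encode`, while each head learns its own group names.

```python
from collections import Counter, defaultdict
from typing import Dict, List, Optional, Tuple

# Toy stand-in for the shared BERT base: a fixed keyword-to-concept map.
# In the design described above these would be learned dense vectors.
SHARED_CONCEPTS = {
    "password": "access", "login": "access",
    "vpn": "network", "wifi": "network",
    "printer": "hardware", "laptop": "hardware",
}


def shared_encode(text: str) -> Counter:
    """Map ticket text to concept counts; identical for every customer."""
    tokens = text.lower().split()
    return Counter(SHARED_CONCEPTS[t] for t in tokens if t in SHARED_CONCEPTS)


class CustomerHead:
    """Per-customer head: learns which of THIS customer's groups a concept maps to."""

    def __init__(self) -> None:
        self.concept_to_groups: Dict[str, Counter] = defaultdict(Counter)

    def fit(self, labeled: List[Tuple[str, str]]) -> None:
        for text, group in labeled:
            for concept, count in shared_encode(text).items():
                self.concept_to_groups[concept][group] += count

    def predict(self, text: str) -> Optional[str]:
        votes: Counter = Counter()
        for concept, count in shared_encode(text).items():
            for group, seen in self.concept_to_groups.get(concept, {}).items():
                votes[group] += count * seen
        # None means the head has no evidence for any of its groups.
        return votes.most_common(1)[0][0] if votes else None


heads = {"acme": CustomerHead(), "beta": CustomerHead(), "gamma": CustomerHead()}
heads["acme"].fit([("password reset needed", "Identity Team"),
                   ("printer offline", "Desktop Support")])
heads["beta"].fit([("forgot my password", "IAM Group")])
heads["gamma"].fit([("password locked after login attempts", "Access Mgmt Team")])

for customer, head in heads.items():
    print(customer, "->", head.predict("my password expired"))
```

The same ticket text resolves to "Identity Team", "IAM Group", and "Access Mgmt Team" respectively, mirroring the three heads in the diagram.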
3. Transfer learning for brand-new customers (0 tickets)
A new customer onboards with zero historical data. Start with the shared BERT base plus a generic classification head trained on similar-sized companies. Within 100 tickets of feedback, the customer-specific head starts outperforming the generic one.
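One way to decide when to switch from the generic head to the customer-specific one is to run both in shadow mode and compare them on human-confirmed tickets. The sketch below is an assumed policy, not something the text prescribes: `ShadowStats`, `choose_head`, and the use of 100 feedback tickets as a minimum sample size are illustrative choices built around the figure quoted above.

```python
from dataclasses import dataclass

# Assumption: treat the ~100-ticket figure from the text as a minimum sample size.
MIN_FEEDBACK_TICKETS = 100


@dataclass
class ShadowStats:
    """Running comparison of both heads against the final human label."""
    feedback_tickets: int = 0
    generic_correct: int = 0
    customer_correct: int = 0

    def record(self, generic_ok: bool, customer_ok: bool) -> None:
        self.feedback_tickets += 1
        self.generic_correct += int(generic_ok)
        self.customer_correct += int(customer_ok)


def choose_head(stats: ShadowStats) -> str:
    """Serve the generic head until the customer head has proven itself."""
    if stats.feedback_tickets < MIN_FEEDBACK_TICKETS:
        return "generic"
    if stats.customer_correct > stats.generic_correct:
        return "customer"
    return "generic"


stats = ShadowStats()
print(choose_head(stats))  # generic (no feedback yet)

# Simulated feedback: generic head right 50% of the time, customer head 75%.
for i in range(100):
    stats.record(generic_ok=(i % 2 == 0), customer_ok=(i % 4 != 0))
print(choose_head(stats))  # customer
```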
We share WEIGHTS, not DATA. No customer ever sees another customer's tickets. The shared BERT base is trained on aggregated patterns: it learns that "password" relates to access concepts, not that "John from Acme" had a password issue. This is similar in spirit to federated learning, where participants exchange model updates rather than raw data.
| Band | Confidence | Action | % of Tickets |
|---|---|---|---|
| AUTO-ROUTE | >0.95 | Route immediately, no human review | 60-70% |
| FLAG | 0.70 - 0.95 | Suggest group, human confirms/corrects | 20-25% |
| MANUAL | <0.70 | Route to manual triage queue | 5-10% |
Thresholds are tuned per customer. A customer with 500 groups needs a higher auto-route threshold than one with 50 groups (more groups means more room for confusion), and a healthcare customer needs a higher threshold than a retail one (a misroute costs more). Result: 96% accuracy on auto-routed tickets across the board.
| Ticket | Predicted Group | Confidence | Action |
|---|---|---|---|
| "Password reset for SAP" | Identity & Access Mgmt | 0.98 | AUTO-ROUTE |
| "Laptop screen flickering" | Desktop Support - HQ | 0.96 | AUTO-ROUTE |
| "Need access to Salesforce" | SaaS Provisioning | 0.82 | FLAG (suggest) |
| "Application running slow" | App Support? Infra? | 0.61 | MANUAL TRIAGE |
| "New hire setup for Tokyo" | APAC Onboarding | 0.93 | FLAG (suggest) |
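The three bands and the per-customer tuning can be sketched as a small routing function. The default thresholds come from the tables above; the `Thresholds` dataclass, the `route` function, and the stricter 0.98 override are hypothetical names and values chosen for illustration. Boundary values follow the table: exactly 0.95 falls in the FLAG band, and exactly 0.70 does too.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Thresholds:
    auto_route: float = 0.95  # strictly above this: auto-route
    flag: float = 0.70        # from this value up to auto_route: flag with suggestion


def route(confidence: float, thresholds: Thresholds = Thresholds()) -> str:
    """Map a model confidence score to one of the three action bands."""
    if not 0.0 <= confidence <= 1.0:
        raise ValueError(f"confidence must be in [0, 1], got {confidence}")
    if confidence > thresholds.auto_route:
        return "AUTO-ROUTE"
    if confidence >= thresholds.flag:
        return "FLAG"
    return "MANUAL"


# Confidence values from the example table above.
examples = {
    "Password reset for SAP": 0.98,
    "Laptop screen flickering": 0.96,
    "Need access to Salesforce": 0.82,
    "Application running slow": 0.61,
    "New hire setup for Tokyo": 0.93,
}
for text, confidence in examples.items():
    print(f"{text!r}: {route(confidence)}")

# Hypothetical stricter setting, e.g. for a customer where misroutes are costly.
strict = Thresholds(auto_route=0.98)
print(route(0.96, strict))  # FLAG
```

With the stricter setting, the 0.96 ticket drops from AUTO-ROUTE to FLAG, which is the trade-off per-customer tuning controls: fewer automatic routes in exchange for fewer automatic mistakes.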
THE DATA FLYWHEEL
═══════════════════════════════════════════════════
┌──────────────┐     ┌──────────────┐     ┌──────────────┐
│    Ticket    │────>│   ML Model   │────>│  Auto-Route  │
│  Submitted   │     │   Predicts   │     │   or Flag    │
└──────────────┘     └──────────────┘     └──────┬───────┘
       ^                                         │
       │                                         v
┌──────┴───────┐     ┌──────────────┐     ┌──────────────┐
│   Nightly    │<────│   Labeled    │<────│    Human     │
│   Retrain    │     │  Data Store  │     │  Correction  │
└──────────────┘     └──────────────┘     └──────────────┘
Every correction makes the model smarter.
More accuracy → more auto-routes → less human work → faster resolution.
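The "Human Correction → Labeled Data Store → Nightly Retrain" leg of the flywheel can be sketched as a tiny in-memory store. Everything here is an assumption for illustration: the `LabeledExample` and `LabeledDataStore` names, the in-memory list (a real system would persist this), and the "Infrastructure" group used in the sample correction.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class LabeledExample:
    text: str
    group: str            # final group after human review
    was_corrected: bool   # True when the human overrode the model


class LabeledDataStore:
    """Collects reviewed tickets until the next retraining run picks them up."""

    def __init__(self) -> None:
        self._pending: List[LabeledExample] = []

    def record(self, text: str, predicted_group: str, final_group: str) -> None:
        corrected = predicted_group != final_group
        self._pending.append(LabeledExample(text, final_group, corrected))

    def drain_for_retrain(self) -> List[LabeledExample]:
        """Return everything recorded since the last run and reset the buffer."""
        batch, self._pending = self._pending, []
        return batch


store = LabeledDataStore()
# Human confirmed the suggestion.
store.record("Need access to Salesforce", "SaaS Provisioning", "SaaS Provisioning")
# Human corrected the model's guess.
store.record("Application running slow", "App Support", "Infrastructure")

batch = store.drain_for_retrain()
print(len(batch), sum(e.was_corrected for e in batch))  # 2 1
```

Confirmations are stored alongside corrections on purpose: both are labeled data, and the `was_corrected` flag is what a misroute dashboard would count.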
| Metric | Example Value | Action Trigger |
|---|---|---|
| Overall Accuracy | 96.2% | Alert if drops below 94% |
| Auto-Route Rate | 67% | Investigate if drops below 55% |
| Per-Group Accuracy (worst) | "EMEA Infra": 89% | Flag groups below 90% for review |
| Common Misroute Pair | "Desktop Support" ↔ "Hardware" | Consider merging groups or adding features |
| New Group Detection | 3 new groups this month | Auto-trigger retraining with new labels |
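The dashboard metrics and their triggers can be computed from a log of reviewed tickets. The sketch below is one possible formulation under stated assumptions: the `dashboard` function and the record layout are hypothetical, per-group accuracy is measured against the final (human-confirmed) group, and misroute pairs are treated as unordered to match the "↔" in the table. Default thresholds are the trigger values from the table.

```python
from collections import Counter
from typing import Dict, List, Set, Tuple

Record = Tuple[str, str, bool]  # (predicted_group, final_group, was_auto_routed)


def dashboard(records: List[Record], known_groups: Set[str],
              min_overall: float = 0.94, min_auto_rate: float = 0.55,
              min_group: float = 0.90) -> Dict[str, object]:
    total = len(records)
    if total == 0:
        raise ValueError("no records to summarise")

    overall = sum(p == f for p, f, _ in records) / total
    auto_rate = sum(1 for _, _, auto in records if auto) / total

    # Per-group accuracy, measured against the final group.
    seen = Counter(f for _, f, _ in records)
    hits = Counter(f for p, f, _ in records if p == f)
    group_accuracy = {g: hits[g] / n for g, n in seen.items()}

    # Unordered misroute pairs, so A->B and B->A count together.
    pairs = Counter(tuple(sorted((p, f))) for p, f, _ in records if p != f)

    alerts = []
    if overall < min_overall:
        alerts.append("overall accuracy below threshold")
    if auto_rate < min_auto_rate:
        alerts.append("auto-route rate below threshold")

    return {
        "overall_accuracy": overall,
        "auto_route_rate": auto_rate,
        "groups_below_threshold": sorted(
            g for g, acc in group_accuracy.items() if acc < min_group),
        "top_misroute_pair": pairs.most_common(1)[0][0] if pairs else None,
        "new_groups": sorted(set(seen) - known_groups),
        "alerts": alerts,
    }


records = (
    [("Desktop Support", "Desktop Support", True)] * 4
    + [("Hardware", "Desktop Support", False), ("Desktop Support", "Hardware", False)]
    + [("Hardware", "Hardware", True)] * 2
    + [("EMEA Infra", "EMEA Infra", True)] * 2
)
report = dashboard(records, known_groups={"Desktop Support", "Hardware"})
for key, value in report.items():
    print(key, "=", value)
```

In this made-up sample, "Desktop Support" and "Hardware" both fall below the 90% line and form the top misroute pair, while "EMEA Infra" shows up as a group the model has not been trained on.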