AI Ticket Triage System

"Design a system that routes IT tickets to the correct assignment group (out of hundreds) with >95% accuracy. Handle small data per customer and learn across organizations."

Why It Matters

Companies have 50-500 assignment groups. Manual triage takes 5-15 minutes per ticket. Misroutes add DAYS to resolution. The key challenge: each customer has too few tickets to train a good model alone. The answer is COLLECTIVE LEARNING.

Table of Contents

  1. Clarifying Questions & Scope
  2. Key Insight: The Small Data Problem
  3. Back-of-Envelope Estimation
  4. High-Level Architecture
  5. Deep Dive 1: Feature Engineering
  6. Deep Dive 2: Collective Learning
  7. Deep Dive 3: Confidence Routing
  8. Example Output
  9. Scaling & Feedback Loop
  10. Cheat Sheet

1 Clarifying Questions & Scope

Dimension | Clarification | Assumption
--------- | ------------- | ----------
Assignment Groups | How many groups per customer? | 50-500 groups per customer
Ticket Volume | Daily ticket volume per customer? | 100-10K tickets/day per customer
Cross-Customer Learning | Can we learn patterns across orgs? | Yes — Collective Learning (share weights, not data)
Input Fields | What ticket data is available? | All fields: short desc, description, category, subcategory, priority, department, location
Low Confidence | What happens when model is unsure? | Low-confidence tickets routed to human for manual triage

2 Key Insight: The Small Data Problem

THE MAIN CHALLENGE IS SMALL DATA. A single organization with 2,000 employees might have only 1,000 training examples across 50 groups. That's just 20 tickets per group on average — far too few for any ML model to learn reliable patterns. The solution: COLLECTIVE LEARNING across all customers.

Think of it this way: individually, each customer has too little data. But collectively, 350 customers generate millions of tickets. The challenge is learning shared patterns (like "password" relates to "access management") while respecting that each customer's group names and routing rules are different.

3 Back-of-Envelope Estimation

Scale Numbers

  • 350 customers x 1,000 tickets/day avg = 350K tickets/day (~4 tickets/sec average)
  • Peak load: ~15 tickets/second (roughly 4x the average)
  • Inference latency: <100ms (classification, NOT generation)
  • Nightly retraining per customer (~35 min total pipeline)
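
The scale numbers above reduce to a few lines of arithmetic. A quick sanity script (the 4x peak-to-average ratio is an assumption, chosen to land near the quoted ~15 tickets/sec):

```python
# Back-of-envelope check for the ticket-triage scale numbers.
CUSTOMERS = 350
AVG_TICKETS_PER_DAY = 1_000          # per customer
PEAK_TO_AVG = 4                      # assumed burstiness factor

daily_total = CUSTOMERS * AVG_TICKETS_PER_DAY      # 350,000 tickets/day
avg_per_sec = daily_total / 86_400                 # ~4 tickets/sec average
peak_per_sec = avg_per_sec * PEAK_TO_AVG           # ~16 tickets/sec peak
yearly_total = daily_total * 365                   # ~127.75M tickets/year

print(daily_total, round(avg_per_sec, 1), round(peak_per_sec), yearly_total)
```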

4 High-Level Architecture

  TICKET TRIAGE PIPELINE — 4 LAYERS
  ═══════════════════════════════════════════════════════════════════

  LAYER 1: FEATURE EXTRACTION
  ┌──────────────────────────────────────────────────────────────┐
  │  Incoming Ticket                                             │
  │  ┌─────────┬────────────┬──────┬─────────┬────────┬───────┐ │
  │  │ Short   │ Description│ Cat  │ SubCat  │Priority│ Dept  │ │
  │  │ Desc    │            │      │         │        │ + Loc │ │
  │  └────┬────┴─────┬──────┴──┬───┴────┬────┴───┬────┴───┬───┘ │
  │       └──────────┴─────────┴────────┴────────┴────────┘     │
  │                         ALL FIELDS                           │
  └──────────────────────────────┬───────────────────────────────┘
                                 │
  LAYER 2: BERT ENCODER          v
  ┌──────────────────────────────────────────────────────────────┐
  │  [SHORT] Cannot connect to VPN [DESC] Getting timeout error  │
  │  when trying to access corporate VPN from home [CAT] Network │
  │  [SUBCAT] VPN [PRIORITY] P2 [DEPT] Engineering [LOC] Remote  │
  │                                                              │
  │  Pre-trained BERT (shared across ALL customers)              │
  │  + Fine-tuned classification head (PER customer)             │
  └──────────────────────────────┬───────────────────────────────┘
                                 │
  LAYER 3: CONFIDENCE ROUTER     v
  ┌──────────────────────────────────────────────────────────────┐
  │  ┌──────────┐  ┌─────────────┐  ┌──────────────────────┐    │
  │  │ >0.95    │  │ 0.70-0.95   │  │ <0.70                │    │
  │  │AUTO-ROUTE│  │FLAG+SUGGEST │  │MANUAL TRIAGE         │    │
  │  │ (60-70%) │  │ (20-25%)    │  │ (5-10%)              │    │
  │  └──────────┘  └─────────────┘  └──────────────────────┘    │
  └──────────────────────────────┬───────────────────────────────┘
                                 │
  LAYER 4: FEEDBACK LOOP         v
  ┌──────────────────────────────────────────────────────────────┐
  │  Human corrections → Labeled data → Nightly retrain          │
  │  Misroutes tracked → Per-group accuracy dashboard            │
  └──────────────────────────────────────────────────────────────┘

5 Deep Dive 1: Feature Engineering

Use ALL Fields — Not Just Description

Classical ML approaches failed because they used only 1-2 fields (typically just the short description). The breakthrough insight is that ALL fields matter, especially structured fields like department and location.

Input Format

  TOKENIZED INPUT TO BERT:

  [SHORT] Cannot connect to VPN
  [DESC] Getting timeout error when trying to access corporate VPN
         from home office. Started after laptop update yesterday.
  [CAT] Network
  [SUBCAT] VPN
  [PRIORITY] P2
  [DEPT] Engineering
  [LOC] Remote
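
Building that tagged string is a simple serialization step ahead of the tokenizer. A minimal sketch; the field names and special tokens follow the format above, while the helper itself is illustrative:

```python
def serialize_ticket(ticket: dict) -> str:
    """Concatenate all ticket fields into one tagged string for the encoder.

    Special tokens like [SHORT] let the model learn per-field attention;
    they would be added to the tokenizer vocabulary during pre-training.
    """
    field_tags = [
        ("short_description", "[SHORT]"),
        ("description",       "[DESC]"),
        ("category",          "[CAT]"),
        ("subcategory",       "[SUBCAT]"),
        ("priority",          "[PRIORITY]"),
        ("department",        "[DEPT]"),
        ("location",          "[LOC]"),
    ]
    parts = []
    for field, tag in field_tags:
        value = (ticket.get(field) or "").strip()
        if value:  # skip empty fields rather than emit bare tags
            parts.append(f"{tag} {value}")
    return " ".join(parts)

ticket = {
    "short_description": "Cannot connect to VPN",
    "category": "Network",
    "subcategory": "VPN",
    "priority": "P2",
    "department": "Engineering",
    "location": "Remote",
}
print(serialize_ticket(ticket))
# [SHORT] Cannot connect to VPN [CAT] Network [SUBCAT] VPN [PRIORITY] P2 [DEPT] Engineering [LOC] Remote
```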

CRITICAL EXAMPLE — Why ALL Fields Matter:

The exact same ticket text "Cannot connect to VPN" routes to DIFFERENT groups depending on context:

Engineering + Remote → routes to "Network Security" (they manage VPN certificates for remote engineers)

Sales + London → routes to "EMEA Desktop Support" (regional team handles office connectivity)

Without department and location, you'd route both to the same group — and be wrong 50% of the time.

Feature Importance Ranking

Feature | Impact | Why
------- | ------ | ---
Short Description | HIGH | Core intent signal — what the user needs
Department | HIGH | Determines which team variant handles it
Location | HIGH | Regional routing (APAC vs EMEA vs Americas)
Category / SubCategory | MEDIUM | Pre-classification signal (if available)
Description | MEDIUM | Additional context, but noisy and verbose
Priority | LOW-MEDIUM | Some groups only handle P1s (e.g., "Major Incident")

6 Deep Dive 2: Collective Learning

The Problem

  THE SMALL DATA PROBLEM
  ═══════════════════════════════════════════════════

  Single Customer (Acme Corp):
  ┌─────────────────────────────────────────────────┐
  │  50 assignment groups                            │
  │  × 20 tickets per group (average)                │
  │  = 1,000 total training examples                 │
  │                                                  │
  │  That's like trying to teach someone 50 topics   │
  │  with only 20 flashcards each. NOT ENOUGH.       │
  └─────────────────────────────────────────────────┘

  All Customers Combined:
  ┌─────────────────────────────────────────────────┐
  │  350 customers × 1,000 tickets/day               │
  │  × 365 days = 127.75 MILLION tickets/year        │
  │                                                  │
  │  PLENTY of data to learn that "password"          │
  │  relates to "access management" concepts.         │
  └─────────────────────────────────────────────────┘

The Solution: 3-Stage Training

1 Pre-train BERT on ALL customers' data

The shared BERT base learns universal IT patterns across all 350 customers. It learns that "password" relates to "access", "VPN" relates to "network", "printer" relates to "hardware". These are universal IT concepts that transfer across organizations.

2 Fine-tune per customer with customer-specific classification head

Each customer gets their own classification head (final layers) that maps the shared representations to THEIR specific assignment groups:

  SHARED BERT BASE (trained on ALL customers)
  ┌──────────────────────────────────────────────┐
  │  "password" → [access_concept_vector]         │
  │  "VPN"      → [network_concept_vector]        │
  │  "printer"  → [hardware_concept_vector]       │
  └──────────────────────┬───────────────────────┘
                         │
          ┌──────────────┼──────────────┐
          │              │              │
  ┌───────v──────┐ ┌────v─────────┐ ┌──v──────────────┐
  │ Acme Head    │ │ Beta Head    │ │ Gamma Head       │
  │              │ │              │ │                  │
  │ "password" → │ │ "password" → │ │ "password" →     │
  │ Identity Team│ │ IAM Group    │ │ Access Mgmt Team │
  └──────────────┘ └──────────────┘ └──────────────────┘
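
The split can be sketched end to end with a toy encoder standing in for BERT (bag-of-words over a tiny shared vocabulary) and nearest-centroid heads standing in for the fine-tuned classification layers. All names and example tickets are illustrative:

```python
import math

# Toy stand-in for the shared BERT base: a bag-of-words encoder over a
# vocabulary learned from ALL customers. One encoder, every tenant.
VOCAB = ["password", "reset", "vpn", "timeout", "printer", "access", "forgot"]

def shared_encode(text: str) -> list[float]:
    tokens = text.lower().split()
    vec = [float(tokens.count(w)) for w in VOCAB]
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

class CustomerHead:
    """Per-customer classification head: maps the shared representation to
    THAT customer's assignment groups. Small, cheap to retrain nightly."""
    def __init__(self, weights: dict[str, list[float]]):
        self.weights = weights  # one weight vector per assignment group

    def predict(self, embedding: list[float]) -> tuple[str, float]:
        scores = {g: sum(w * e for w, e in zip(wv, embedding))
                  for g, wv in self.weights.items()}
        mx = max(scores.values())
        exps = {g: math.exp(s - mx) for g, s in scores.items()}
        total = sum(exps.values())
        best = max(exps, key=exps.get)
        return best, exps[best] / total  # (group, softmax confidence)

def fit_head(examples: dict[str, list[str]]) -> CustomerHead:
    """Toy 'fine-tuning': each group's weights are the centroid of its
    example embeddings (a nearest-centroid classifier)."""
    return CustomerHead({
        g: [sum(col) / len(texts) for col in
            zip(*(shared_encode(t) for t in texts))]
        for g, texts in examples.items()
    })

# Same shared encoder, different heads: "password" maps to each
# customer's own group name, as in the diagram above.
acme = fit_head({"Identity Team": ["password reset", "password expired"],
                 "Network Ops":   ["vpn down", "vpn timeout"]})
beta = fit_head({"IAM Group":     ["password locked out"],
                 "Infra":         ["server vpn outage"]})

emb = shared_encode("forgot my password")
print(acme.predict(emb)[0])  # Identity Team
print(beta.predict(emb)[0])  # IAM Group
```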

3 Transfer learning for brand-new customers (0 tickets)

New customer onboards with zero historical data. Use the shared BERT base + a generic classification head trained on similar-sized companies. Within 100 tickets of feedback, the customer-specific head starts outperforming the generic one.

Privacy Guarantee

We share WEIGHTS, not DATA. No customer ever sees another customer's tickets. The shared BERT base is trained on aggregated patterns — it learns that "password" relates to access concepts, not that "John from Acme" had a password issue. This is the same principle behind federated learning.

7 Deep Dive 3: Confidence Routing

Three-Band Routing

Band | Confidence | Action | % of Tickets
---- | ---------- | ------ | ------------
AUTO-ROUTE | >0.95 | Route immediately, no human review | 60-70%
FLAG | 0.70-0.95 | Suggest group, human confirms/corrects | 20-25%
MANUAL | <0.70 | Route to manual triage queue | 5-10%
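
The three-band policy is a small pure function. The thresholds below are the defaults from the table; per-customer tuning would override them:

```python
def route(confidence: float, auto_threshold: float = 0.95,
          flag_threshold: float = 0.70) -> str:
    """Map a calibrated confidence score to one of the three routing bands.

    Thresholds are per-customer knobs: more groups, or a higher cost of
    misrouting, argue for raising them.
    """
    if confidence > auto_threshold:
        return "AUTO_ROUTE"  # route immediately, no human in the loop
    if confidence >= flag_threshold:
        return "FLAG"        # suggest a group, human confirms/corrects
    return "MANUAL"          # manual triage queue

print(route(0.98))  # AUTO_ROUTE
print(route(0.82))  # FLAG
print(route(0.61))  # MANUAL
```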

Why Variable Thresholds?

Thresholds are tuned per customer. A customer with 500 groups needs higher confidence than one with 50 groups (more room for confusion). A customer in healthcare needs higher confidence than one in retail (higher cost of misroute). Result: 96% accuracy on auto-routed tickets across the board.

Confidence Calibration

Raw softmax scores from a neural classifier are typically overconfident, which makes fixed thresholds meaningless. Calibrate on held-out, human-verified tickets (for example with temperature scaling) so that a reported 0.95 really corresponds to roughly 95% accuracy before applying the band thresholds.
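
One standard calibration technique is temperature scaling: divide the logits by a scalar T fitted on held-out labels. A minimal sketch with toy, made-up validation data; a grid search stands in for the usual gradient-based fit:

```python
import math

def softmax(logits, temperature=1.0):
    scaled = [z / temperature for z in logits]
    mx = max(scaled)
    exps = [math.exp(z - mx) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def fit_temperature(val_logits, val_labels):
    """Grid-search T to minimize negative log-likelihood on held-out,
    human-corrected tickets. T > 1 softens overconfident predictions."""
    def nll(t):
        return -sum(math.log(softmax(z, t)[y])
                    for z, y in zip(val_logits, val_labels))
    candidates = [0.5 + 0.1 * i for i in range(41)]  # 0.5 .. 4.5
    return min(candidates, key=nll)

# Toy overconfident model: large logit gaps, but some predictions wrong.
val_logits = [[4.0, 0.0, 0.0], [3.5, 0.2, 0.0],
              [3.0, 2.8, 0.1], [4.2, 0.1, 0.3]]
val_labels = [0, 0, 1, 2]

T = fit_temperature(val_logits, val_labels)
print(T > 1.0)  # True: calibration softens the confidences
```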

8 Example Output

Ticket | Predicted Group | Confidence | Action
------ | --------------- | ---------- | ------
"Password reset for SAP" | Identity & Access Mgmt | 0.98 | AUTO-ROUTE
"Laptop screen flickering" | Desktop Support - HQ | 0.96 | AUTO-ROUTE
"Need access to Salesforce" | SaaS Provisioning | 0.82 | FLAG (suggest)
"Application running slow" | App Support? Infra? | 0.61 | MANUAL TRIAGE
"New hire setup for Tokyo" | APAC Onboarding | 0.93 | FLAG (suggest)

9 Scaling & Feedback Loop

Model Architecture & Serving

Serving splits along the same line as training: one shared BERT base held in memory per inference node, with the lightweight per-customer head selected per request by tenant ID. Because each head is only the final layers, inference stays within the <100ms budget. Heads retrain nightly per customer; the shared base is updated weekly from aggregated, weights-only learning.

Data Flywheel

  THE DATA FLYWHEEL
  ═══════════════════════════════════════════════════

  ┌──────────────┐     ┌──────────────┐     ┌──────────────┐
  │  Ticket      │────>│  ML Model    │────>│  Auto-Route  │
  │  Submitted   │     │  Predicts    │     │  or Flag     │
  └──────────────┘     └──────────────┘     └──────┬───────┘
         ^                                         │
         │                                         v
  ┌──────┴───────┐     ┌──────────────┐     ┌──────────────┐
  │  Nightly     │<────│  Labeled     │<────│  Human       │
  │  Retrain     │     │  Data Store  │     │  Correction  │
  └──────────────┘     └──────────────┘     └──────────────┘

  Every correction makes the model smarter.
  More accuracy → more auto-routes → less human work → faster resolution.
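
The correction path in the flywheel is just an append to the labeled data store. A sketch; the record schema and names are illustrative:

```python
from datetime import datetime, timezone

def record_outcome(ticket_id, predicted, final, confidence, store):
    """Append one feedback record to the labeled data store. A human
    correction (final != predicted) is the highest-value training label."""
    store.append({
        "ticket_id": ticket_id,
        "predicted": predicted,
        "label": final,                        # ground truth for retraining
        "was_correction": final != predicted,
        "confidence": confidence,
        "ts": datetime.now(timezone.utc).isoformat(),
    })

labeled_store = []
record_outcome("T-1001", "Desktop Support", "Hardware Team", 0.81, labeled_store)
record_outcome("T-1002", "IAM Group", "IAM Group", 0.97, labeled_store)

corrections = [r for r in labeled_store if r["was_correction"]]
print(len(corrections))  # 1 misroute feeds the nightly retrain
```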

Monitoring Metrics

Metric | Example Value | Action Trigger
------ | ------------- | --------------
Overall Accuracy | 96.2% | Alert if drops below 94%
Auto-Route Rate | 67% | Investigate if drops below 55%
Per-Group Accuracy (worst) | "EMEA Infra": 89% | Flag groups below 90% for review
Common Misroute Pair | "Desktop Support" ↔ "Hardware" | Consider merging groups or adding features
New Group Detection | 3 new groups this month | Auto-trigger retraining with new labels
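
The trigger column maps directly onto simple alert rules. A sketch using the thresholds from the table; the metric names and `check_alerts` helper are illustrative:

```python
def check_alerts(metrics: dict) -> list[str]:
    """Evaluate the monitoring thresholds from the dashboard table."""
    alerts = []
    if metrics["overall_accuracy"] < 0.94:
        alerts.append("ACCURACY_DROP: overall accuracy below 94%")
    if metrics["auto_route_rate"] < 0.55:
        alerts.append("AUTO_ROUTE_DROP: auto-route rate below 55%")
    for group, acc in metrics["per_group_accuracy"].items():
        if acc < 0.90:
            alerts.append(f"GROUP_REVIEW: {group} accuracy below 90%")
    if metrics["new_groups_detected"] > 0:
        alerts.append("RETRAIN: new assignment groups need labels")
    return alerts

metrics = {
    "overall_accuracy": 0.962,
    "auto_route_rate": 0.67,
    "per_group_accuracy": {"EMEA Infra": 0.89, "Desktop Support": 0.97},
    "new_groups_detected": 3,
}
for alert in check_alerts(metrics):
    print(alert)
# GROUP_REVIEW: EMEA Infra accuracy below 90%
# RETRAIN: new assignment groups need labels
```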

10 Cheat Sheet

AI Ticket Triage — Key Numbers

  • 350K tickets/day, 15 tickets/sec peak
  • <100ms inference (classification, not generation)
  • Collective Learning: shared BERT base + per-customer head
  • Use ALL fields: short desc + desc + cat + subcat + priority + dept + location
  • 3-band confidence: >0.95 auto, 0.70-0.95 flag, <0.70 manual
  • 96% accuracy on auto-routed tickets
  • 60-70% of tickets auto-routed (no human needed)
  • Nightly retraining per customer, weekly shared base update
  • New customer: transfer learning from shared base, effective within 100 tickets
  • Share WEIGHTS not DATA (privacy preserved)