Multi-Tenant Plugin/Connector Platform

ENTERPRISE EXPERIENCE Staff Level 40 min

"Design a platform for customers to connect their business systems (ServiceNow, Jira, Salesforce, SAP, etc.) to an AI agent. Different systems, APIs, authentication methods, and data schemas."

Interviewer Signals
Clarifying Questions & Scope
Back-of-Envelope Estimation
High-Level Architecture
Key Design Decisions
Deep Dive 1: Template vs Instance
Deep Dive 2: Middleware Chain
Deep Dive 3: Sliding Window Rate Limiter
Deep Dive 4: Credential Management
Cheat Sheet

1 Interviewer Signals

Signal	What They Want to See
Abstraction	Can you design a unified interface over heterogeneous systems?
Multi-tenancy	How do you isolate tenant data, credentials, and rate limits?
Extensibility	How easy is it to add a new connector type (e.g., Workday)?
Resilience	How do you handle external API failures, rate limits, timeouts?
Security	How are credentials stored, rotated, and scoped?
Operational Maturity	Monitoring, alerting, debugging — can you operate this at scale?

2 Clarifying Questions & Scope

Dimension	Clarification	Assumption
Connector Count	How many systems to support?	10 connector types initially (ServiceNow, Jira, Salesforce, etc.)
Instances	How many instances per customer?	5-15 connectors per customer, 350 customers = ~3,500 instances
Operations	Read-only or read-write?	Both: read data + create/update records
Auth Methods	What auth types?	OAuth 2.0, API Key, Basic Auth, mTLS, SAML
Rate Limits	External API limits?	Per-tenant, per-connector. Must respect external system limits.

3 Back-of-Envelope Estimation

Scale Numbers

3,500 connector instances (350 customers x 10 avg)
350K API calls/day to external systems
1K credential rotations/day (OAuth token refreshes)
Peak: ~15 API calls/second
Avg response time target: <2s (including external API latency)
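
The arithmetic behind these figures can be checked in a few lines. The sketch below derives the average request rate from the daily volume and compares it with the stated peak; the variable names are illustrative and the inputs are the assumptions listed above.

```python
# Back-of-envelope check of the scale numbers above.
CUSTOMERS = 350
AVG_CONNECTORS_PER_CUSTOMER = 10
API_CALLS_PER_DAY = 350_000
PEAK_CALLS_PER_SECOND = 15
SECONDS_PER_DAY = 24 * 60 * 60

instances = CUSTOMERS * AVG_CONNECTORS_PER_CUSTOMER
calls_per_instance_per_day = API_CALLS_PER_DAY / instances
avg_calls_per_second = API_CALLS_PER_DAY / SECONDS_PER_DAY
peak_to_avg = PEAK_CALLS_PER_SECOND / avg_calls_per_second

print(f"instances: {instances}")
print(f"calls per instance per day: {calls_per_instance_per_day:.0f}")
print(f"average calls/second: {avg_calls_per_second:.2f}")
print(f"peak-to-average ratio: {peak_to_avg:.1f}x")
```

Under these assumptions the average is roughly 4 calls/second, so a 15 calls/second peak is a little under 4x the average.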

      

4 High-Level Architecture

  PLUGIN/CONNECTOR PLATFORM
  ═══════════════════════════════════════════════════════════════════

  AI Agent Request
       │
       v
  ┌──────────┐     ┌─────────────────────────────────────────────┐
  │ Gateway  │────>│          RUNTIME ENGINE                     │
  │ (Auth,   │     │  ┌─────────────────────────────────────┐    │
  │  Route)  │     │  │     MIDDLEWARE CHAIN                 │    │
  └──────────┘     │  │                                     │    │
                   │  │  Auth → RateLimit → Transform →     │    │
                   │  │  Execute → Retry → Log              │    │
                   │  │                                     │    │
                   │  └─────────────────────────────────────┘    │
                   └─────────────────────┬───────────────────────┘
                                         │
              ┌──────────────────────────┼──────────────────────┐
              │                          │                      │
       ┌──────v──────┐          ┌────────v────────┐     ┌──────v──────┐
       │  Registry   │          │ Config Store    │     │   Vault     │
       │ (Templates  │          │ (Per-tenant     │     │(Credentials │
       │ + Instances)│          │  settings)      │     │  secrets)   │
       └─────────────┘          └─────────────────┘     └─────────────┘

                   EXTERNAL SYSTEMS              CROSS-CUTTING
              ┌────────────────────┐        ┌─────────────────────┐
              │ ServiceNow │ Jira  │        │ Metrics │ Tracing   │
              │ Salesforce │ SAP   │        │ Alerts  │ Audit Log │
              │ Workday    │ etc.  │        └─────────────────────┘
              └────────────────────┘

5 Key Design Decisions

Decision	Choice	Why
Template vs Instance model	Separate template (blueprint) from instance (runtime config)	Like Docker Image vs Container. One "ServiceNow connector" template, many customer instances.
Middleware chain pattern	Ordered chain of composable middleware	Each concern (auth, rate-limit, transform) is isolated and testable. Easy to add new middleware.
Credential storage	HashiCorp Vault with dynamic secrets	Never store credentials in DB. Auto-rotation. Per-tenant isolation. Audit trail.
Rate limiting	Sliding window per (tenant, connector)	Respects external API limits. No burst spikes at window boundaries. Redis ZSET implementation.
Schema transformation	Declarative field mappings in JSON	Customers map their custom fields without code. "status" → "ticket_state", "assignee" → "owner".

6 Deep Dive 1: Template vs Instance

Analogy: Docker Image vs Container

A Template is like a Docker Image — it defines WHAT a connector can do. An Instance is like a Container — it's a running configuration for a specific tenant with their credentials and custom mappings.

Template (Blueprint)

  CONNECTOR TEMPLATE: ServiceNow
  ═══════════════════════════════════════════

  {
    "template_id": "servicenow-v2",
    "name": "ServiceNow ITSM Connector",
    "version": "2.3.1",
    "auth_types": ["oauth2", "basic_auth"],
    "base_url_pattern": "https://{instance}.service-now.com",
    "capabilities": [
      "read_tickets",
      "create_ticket",
      "update_ticket",
      "list_groups",
      "get_user",
      "search_kb_articles"
    ],
    "api_version": "v2",
    "rate_limit_default": 500,  // requests/minute
    "required_fields": ["instance_name"],
    "optional_fields": ["custom_table_prefix"]
  }

Instance (Tenant Configuration)

  CONNECTOR INSTANCE: Acme Corp's ServiceNow
  ═══════════════════════════════════════════

  {
    "instance_id": "inst-acme-snow-001",
    "tenant_id": "acme-corp",
    "template_id": "servicenow-v2",
    "config": {
      "instance_name": "acmecorp",
      "base_url": "https://acmecorp.service-now.com"
    },
    "credential_ref": "vault://acme-corp/servicenow/oauth",
    "field_mappings": {
      "short_description": "title",
      "assignment_group": "team",
      "u_custom_field_1": "business_unit",
      "u_location_code": "office_location"
    },
    "rate_limit_override": 300,  // Acme's ServiceNow plan limit
    "status": "active",
    "health_check_interval": 60  // seconds
  }
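
As a minimal sketch of how the field_mappings above might be applied, the functions below rename fields in both directions. It assumes the mapping is stored as external field to internal field, as in the instance example, and that unmapped fields pass through unchanged; both function names are hypothetical.

```python
from typing import Any, Dict

# Field mappings as stored on the instance: external field -> internal field.
FIELD_MAPPINGS: Dict[str, str] = {
    "short_description": "title",
    "assignment_group": "team",
    "u_custom_field_1": "business_unit",
    "u_location_code": "office_location",
}


def to_internal(record: Dict[str, Any], mappings: Dict[str, str]) -> Dict[str, Any]:
    """Rename external fields to internal names (used on responses)."""
    # Assumption: fields without a mapping are passed through unchanged.
    return {mappings.get(name, name): value for name, value in record.items()}


def to_external(record: Dict[str, Any], mappings: Dict[str, str]) -> Dict[str, Any]:
    """Rename internal fields to external names (used on requests)."""
    inverse = {internal: external for external, internal in mappings.items()}
    return {inverse.get(name, name): value for name, value in record.items()}


outbound = to_external({"title": "VPN down", "team": "network"}, FIELD_MAPPINGS)
inbound = to_internal({"short_description": "VPN down", "sys_id": "abc"}, FIELD_MAPPINGS)
```

Here `outbound` carries ServiceNow field names and `inbound` carries the internal names, with the unmapped `sys_id` left as-is.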

Registry

Template Registry: PostgreSQL table of all connector templates. Versioned — updates create new versions, instances can pin to specific versions.
Instance Registry: PostgreSQL table of all active instances. Includes health status, last successful call, error rate.
Discovery: AI agent queries registry: "What connectors does Acme have?" → Returns list of active instances with capabilities.
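
The relationship between the two registries can be sketched with two small record types. This is an illustrative in-memory model, not the PostgreSQL schema; the helper names are hypothetical, and it assumes the `{instance}` placeholder in base_url_pattern is filled from the instance's instance_name.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Template:
    template_id: str
    base_url_pattern: str
    capabilities: Tuple[str, ...]
    rate_limit_default: int  # requests/minute
    required_fields: Tuple[str, ...] = ()


@dataclass(frozen=True)
class Instance:
    instance_id: str
    tenant_id: str
    template_id: str
    config: Dict[str, str]
    rate_limit_override: Optional[int] = None
    status: str = "active"


def effective_rate_limit(template: Template, instance: Instance) -> int:
    """The instance override wins; otherwise fall back to the template default."""
    if instance.rate_limit_override is not None:
        return instance.rate_limit_override
    return template.rate_limit_default


def resolve_base_url(template: Template, instance: Instance) -> str:
    """Validate required config fields, then fill the URL pattern."""
    missing = [f for f in template.required_fields if f not in instance.config]
    if missing:
        raise ValueError(f"missing required fields: {missing}")
    return template.base_url_pattern.format(instance=instance.config["instance_name"])


def discover(tenant_id: str, templates: Dict[str, Template],
             instances: List[Instance]) -> List[dict]:
    """Answer 'what connectors does this tenant have?' with capabilities."""
    return [
        {"instance_id": i.instance_id,
         "capabilities": list(templates[i.template_id].capabilities)}
        for i in instances
        if i.tenant_id == tenant_id and i.status == "active"
    ]


snow = Template("servicenow-v2", "https://{instance}.service-now.com",
                ("read_tickets", "create_ticket"), 500, ("instance_name",))
acme = Instance("inst-acme-snow-001", "acme-corp", "servicenow-v2",
                {"instance_name": "acmecorp"}, rate_limit_override=300)
```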

7 Deep Dive 2: Middleware Chain

  REQUEST FLOW THROUGH MIDDLEWARE CHAIN
  ═══════════════════════════════════════════════════════

  Incoming Request
       │
  ┌────v────┐  Inject credentials from Vault. Handle OAuth
  │  AUTH   │  token refresh automatically. mTLS cert loading.
  └────┬────┘
       │
  ┌────v────────┐  Check sliding window. Per (tenant, connector).
  │ RATE LIMIT  │  429 if exceeded. Queue if near limit.
  └────┬────────┘
       │
  ┌────v──────────┐  Map internal schema → external API schema.
  │  TRANSFORM    │  Apply customer's field_mappings. Type coercion.
  └────┬──────────┘
       │
  ┌────v────────┐  HTTP call to external system. Connection pooling.
  │  EXECUTE    │  Timeout: 30s. Circuit breaker per instance.
  └────┬────────┘
       │
  ┌────v────┐  Exponential backoff: 1s, 2s, 4s. Max 3 retries.
  │  RETRY  │  Only on 429, 503, 504. NOT on 400, 401, 404.
  └────┬────┘
       │
  ┌────v────┐  Full audit trail. Request/response (sanitized).
  │   LOG   │  Latency, status code, tenant, connector, operation.
  └────┬────┘
       │
       v
  Response to AI Agent
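
One way to compose such a chain is shown below as a minimal sketch. Each middleware receives the request and the next handler, so it can short-circuit (as the rate limiter does) or modify the request before passing it on. All names are hypothetical, the Vault lookup and HTTP call are stubbed, and only three of the six stages are shown.

```python
from typing import Any, Callable, Dict, List

Request = Dict[str, Any]
Response = Dict[str, Any]
Handler = Callable[[Request], Response]
Middleware = Callable[[Request, Handler], Response]


def _bind(middleware: Middleware, nxt: Handler) -> Handler:
    def handler(request: Request) -> Response:
        return middleware(request, nxt)
    return handler


def build_chain(middlewares: List[Middleware], terminal: Handler) -> Handler:
    """Wrap the terminal handler so middlewares run in list order."""
    handler = terminal
    for middleware in reversed(middlewares):
        handler = _bind(middleware, handler)
    return handler


def auth(request: Request, nxt: Handler) -> Response:
    # Stub: a real implementation would fetch the credential from Vault.
    headers = dict(request.get("headers", {}))
    headers["Authorization"] = "Bearer <token-from-vault>"
    return nxt({**request, "headers": headers})


def rate_limit(request: Request, nxt: Handler) -> Response:
    # Stub: "over_limit" stands in for the sliding-window check.
    if request.get("over_limit"):
        return {"status": 429, "headers": {"Retry-After": "1"}}
    return nxt(request)


def transform(request: Request, nxt: Handler) -> Response:
    # field_mappings are stored as external -> internal, so invert for requests.
    inverse = {v: k for k, v in request.get("field_mappings", {}).items()}
    body = {inverse.get(k, k): v for k, v in request.get("body", {}).items()}
    return nxt({**request, "body": body})


def execute(request: Request) -> Response:
    # Stub for the HTTP call: echo what would have been sent.
    return {"status": 200, "sent_body": request["body"],
            "sent_headers": request["headers"]}


chain = build_chain([auth, rate_limit, transform], execute)
ok = chain({"body": {"title": "VPN down"},
            "field_mappings": {"short_description": "title"}})
limited = chain({"body": {}, "over_limit": True})
```

In this onion-style composition a wrapper such as Retry or Log would sit outside the stage it wraps, since it needs to observe or re-invoke that stage; that is a property of this sketch rather than a statement about the diagram's ordering.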

Middleware Details

Auth Middleware: Reads credential_ref from instance config. Fetches from Vault. For OAuth: checks token expiry, refreshes if <5 min remaining. Injects Authorization header. For API Key: injects header or query param per template spec.
Rate Limit Middleware: Sliding window algorithm (see Deep Dive 3). Per (tenant_id, connector_instance_id). Respects both our internal limits AND the external system's API limits. Returns 429 with Retry-After header.
Transform Middleware: Applies field_mappings from instance config. Maps internal canonical schema to external API format. Example: internal "title" → ServiceNow "short_description". Handles type coercion (string → int, date formatting).
Execute Middleware: HTTP client with connection pooling (keep-alive). Per-instance circuit breaker: opens after 5 consecutive failures, half-open after 30s. Configurable timeout (default 30s).
Retry Middleware: Exponential backoff: 1s, 2s, 4s. Max 3 retries. Only retries on transient errors (429, 503, 504). Never retries other client errors (400, 401, 404), because they indicate a problem that repeating the request will not fix.
Log Middleware: Structured JSON logs. Every request/response recorded with: tenant_id, instance_id, operation, latency_ms, status_code, request_id. Credentials redacted. Audit compliance.
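
The retry and circuit-breaker rules above can be sketched as follows. This is a simplified illustration under stated assumptions: `send` is a hypothetical callable returning a dict with a `status` key, the clock and sleep functions are injectable so the logic can be exercised without waiting, and the half-open state here allows any call through rather than a single probe.

```python
import time
from typing import Callable, Dict, Optional

RETRYABLE_STATUS = {429, 503, 504}
BACKOFF_SECONDS = (1, 2, 4)  # one entry per retry, so at most 3 retries


def call_with_retry(send: Callable[[], Dict],
                    sleep: Callable[[float], None] = time.sleep) -> Dict:
    """Call send(); retry with backoff only on transient status codes."""
    response = send()
    for delay in BACKOFF_SECONDS:
        if response["status"] not in RETRYABLE_STATUS:
            break
        sleep(delay)
        response = send()
    return response


class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold: int = 5, reset_after: float = 30.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.consecutive_failures = 0
        self.opened_at: Optional[float] = None

    def allow(self) -> bool:
        if self.opened_at is None:
            return True  # closed
        # Half-open once the cooldown has elapsed; otherwise still open.
        return self.clock() - self.opened_at >= self.reset_after

    def record(self, success: bool) -> None:
        if success:
            self.consecutive_failures = 0
            self.opened_at = None
            return
        self.consecutive_failures += 1
        if self.consecutive_failures >= self.failure_threshold:
            self.opened_at = self.clock()  # open, or re-open after a failed trial
```

A caller would check `allow()` before `call_with_retry(...)` and pass the outcome to `record(...)`, keeping one breaker per connector instance.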

8 Deep Dive 3: Sliding Window Rate Limiter

Redis ZSET Algorithm

  SLIDING WINDOW RATE LIMITER (Redis ZSET)
  ═══════════════════════════════════════════════════════

  Key: rate_limit:{tenant_id}:{connector_id}
  Score: timestamp (Unix ms)
  Member: unique request ID

  ALGORITHM (per request, run atomically, e.g. in a Lua script):
  ─────────────────────────────────────────────────────
  1. ZREMRANGEBYSCORE key 0 {now_ms - 60000} // Evict requests older than 60s
  2. count = ZCARD key                       // Count requests still in window
  3. IF count >= limit: REJECT (429)         // Window full; request is NOT recorded
  4. ELSE: ZADD key {now_ms} {request_id}    // Record this request, then ALLOW
  5. EXPIRE key 120                          // TTL cleanup safety net

  Rejected requests are not added to the ZSET. If they were, a client that
  keeps retrying while over the limit would hold the window full indefinitely.

  EXAMPLE (limit = 5 requests/minute):
  ─────────────────────────────────────────────────────
  Time    Action          ZSET Size    Result
  00:00   Request A       1            ALLOWED
  00:15   Request B       2            ALLOWED
  00:30   Request C       3            ALLOWED
  00:45   Request D       4            ALLOWED
  00:50   Request E       5            ALLOWED
  00:55   Request F       5            REJECTED (429, not recorded)
  01:05   Request G       5            ALLOWED (A expired at 01:00)
  01:20   Request H       5            ALLOWED (B expired at 01:15)
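
The example can be replayed with a small in-memory stand-in for the Redis ZSET. This sketch is an assumption-laden simplification: a per-key deque of timestamps replaces the sorted set, the class name is hypothetical, and the count is checked before the request is recorded so that rejected requests do not occupy a slot. A production version would run the equivalent Redis commands atomically.

```python
from collections import deque
from typing import Deque, Dict


class SlidingWindowLimiter:
    """In-memory stand-in for the ZSET limiter, keyed per (tenant, connector)."""

    def __init__(self, limit: int, window_ms: int = 60_000) -> None:
        self.limit = limit
        self.window_ms = window_ms
        self._timestamps: Dict[str, Deque[int]] = {}

    def allow(self, key: str, now_ms: int) -> bool:
        window = self._timestamps.setdefault(key, deque())
        cutoff = now_ms - self.window_ms
        # Equivalent of ZREMRANGEBYSCORE key 0 cutoff: evict expired entries.
        while window and window[0] <= cutoff:
            window.popleft()
        # Equivalent of ZCARD: reject without recording if the window is full.
        if len(window) >= self.limit:
            return False
        # Equivalent of ZADD: record the allowed request.
        window.append(now_ms)
        return True


limiter = SlidingWindowLimiter(limit=5)
key = "rate_limit:acme-corp:inst-acme-snow-001"
times_s = [0, 15, 30, 45, 50, 55, 65, 80]  # requests A..H from the table
results = [limiter.allow(key, t * 1000) for t in times_s]
print(results)
```

Requests A through E are allowed, F is rejected, and G and H are allowed once A and B have aged out of the window.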

Why Sliding Window over Fixed Window?

Fixed window problem: Limit is 100/min. User sends 100 requests at 0:59, then 100 more at 1:01. That's 200 requests in 2 seconds — the external API sees a burst and throttles us.

Sliding window: Always counts the last 60 seconds exactly. No burst at window boundaries. External APIs stay happy.
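
The boundary effect can be demonstrated by running the same burst through both strategies. The sketch below is illustrative only: both functions are hypothetical, timestamps are in seconds, and each simply counts how many requests it would let through.

```python
from collections import deque
from typing import Dict, List


def fixed_window_allowed(timestamps_s: List[int], limit: int,
                         window_s: int = 60) -> int:
    """Count allowed requests when the counter resets at each window boundary."""
    counts: Dict[int, int] = {}
    allowed = 0
    for t in timestamps_s:
        bucket = t // window_s
        if counts.get(bucket, 0) < limit:
            counts[bucket] = counts.get(bucket, 0) + 1
            allowed += 1
    return allowed


def sliding_window_allowed(timestamps_s: List[int], limit: int,
                           window_s: int = 60) -> int:
    """Count allowed requests when only the trailing window is considered."""
    recent: deque = deque()
    allowed = 0
    for t in timestamps_s:
        while recent and recent[0] <= t - window_s:
            recent.popleft()
        if len(recent) < limit:
            recent.append(t)
            allowed += 1
    return allowed


# 100 requests at 0:59 followed by 100 more at 1:01, limit 100/min.
burst = [59] * 100 + [61] * 100
fixed = fixed_window_allowed(burst, limit=100)
sliding = sliding_window_allowed(burst, limit=100)
print(fixed, sliding)
```

The fixed window admits all 200 requests because the two bursts fall into different buckets, while the sliding window admits only the first 100.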

9 Deep Dive 4: Credential Management

Vault-Based Architecture

HashiCorp Vault: All credentials stored in Vault, never in PostgreSQL or config files. Each tenant gets their own Vault namespace for isolation.
Auto-rotation: OAuth tokens auto-refreshed before expiry. API keys rotated on configurable schedule (e.g., every 90 days). Rotation triggers health check to verify new credentials work.
Per-tenant isolation: Vault policies ensure Tenant A can never access Tenant B's credentials. Service accounts scoped to specific tenants.
Zero-trust: The runtime engine never caches credentials in memory beyond a single request. Credentials are fetched from Vault per request (with a short Vault token TTL).
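
As a defense-in-depth illustration, the runtime could also verify that a credential reference belongs to the requesting tenant before contacting Vault. The sketch below assumes the `vault://{tenant}/{connector}/{kind}` layout seen in the instance example; both function names are hypothetical, and Vault policies remain the actual enforcement point.

```python
from typing import Tuple
from urllib.parse import urlparse


def parse_credential_ref(ref: str) -> Tuple[str, str]:
    """Split 'vault://tenant/path' into (tenant, path)."""
    parsed = urlparse(ref)
    if parsed.scheme != "vault" or not parsed.netloc:
        raise ValueError(f"not a vault reference: {ref!r}")
    return parsed.netloc, parsed.path.lstrip("/")


def authorize_credential_access(request_tenant_id: str, credential_ref: str) -> str:
    """Return the secret path only if the reference belongs to the caller's tenant."""
    tenant, path = parse_credential_ref(credential_ref)
    if tenant != request_tenant_id:
        raise PermissionError("credential reference belongs to another tenant")
    return path


path = authorize_credential_access("acme-corp", "vault://acme-corp/servicenow/oauth")
```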

OAuth 2.0 Flow Detail

  OAUTH TOKEN LIFECYCLE
  ═══════════════════════════════════════════

  1. Customer configures connector in UI
     → Redirects to external system's OAuth consent screen
     → Receives authorization code

  2. Backend exchanges code for access_token + refresh_token
     → Stores both in Vault: vault://acme/servicenow/oauth
     → Sets token_expiry metadata

  3. At runtime (each API call):
     → Auth middleware reads token from Vault
     → If expires_at < now + 5min:
        → Use refresh_token to get new access_token
        → Store new token in Vault
        → Use new token for request
     → Inject Authorization: Bearer {token}

  4. If refresh fails (token revoked):
     → Mark instance as "auth_failed"
     → Notify customer: "Please re-authenticate ServiceNow"
     → Stop processing requests (don't leak errors to users)
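
Steps 3 and 4 of the lifecycle can be sketched as a single function. This is a simplified illustration: `vault` is a plain dict standing in for Vault, `refresh` is a hypothetical callable that exchanges a refresh token for new token fields, and concurrent refreshes are not handled.

```python
import time
from typing import Callable, Dict

REFRESH_MARGIN_SECONDS = 5 * 60  # refresh when under 5 minutes remain


class RefreshFailed(Exception):
    """Raised by the refresh callable when the refresh token is rejected."""


def get_access_token(instance: Dict, vault: Dict[str, Dict],
                     refresh: Callable[[str], Dict],
                     now: Callable[[], float] = time.time) -> str:
    """Return a usable access token, refreshing it if it is close to expiry."""
    ref = instance["credential_ref"]
    cred = vault[ref]
    if cred["expires_at"] < now() + REFRESH_MARGIN_SECONDS:
        try:
            refreshed = refresh(cred["refresh_token"])
        except RefreshFailed:
            # Step 4: mark the instance so no further requests are attempted.
            instance["status"] = "auth_failed"
            raise
        cred = {**cred, **refreshed}
        vault[ref] = cred  # store the new token before using it
    return cred["access_token"]


def authorization_header(token: str) -> Dict[str, str]:
    return {"Authorization": f"Bearer {token}"}
```

Merging the refreshed fields into the stored credential also covers providers that return a new refresh_token on each refresh, since that field is overwritten when present.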

10 Cheat Sheet

Plugin/Connector Platform — Key Numbers

3,500 connector instances (350 customers x 10 avg)
350K API calls/day, 1K credential rotations/day
Template vs Instance = Docker Image vs Container
Middleware chain: Auth → RateLimit → Transform → Execute → Retry → Log
Sliding window rate limiter: Redis ZSET, per (tenant, connector)
Credentials in Vault only, never in DB or config
OAuth auto-refresh when <5 min remaining
Circuit breaker: opens after 5 consecutive failures, half-open at 30s
Retry: exponential backoff 1s/2s/4s, only on 429/503/504
Declarative field mappings in JSON (no code needed per customer)
