Sandboxed Code Execution Environment

"Design a system where an AI agent generates and executes code (Python, SQL, shell scripts). Must be fully sandboxed, time-limited, resource-constrained, and auditable."

Table of Contents

  1. Clarifying Questions & Scope
  2. Back-of-Envelope Estimation
  3. High-Level Architecture
  4. Deep Dive 1: Code Validator
  5. Deep Dive 2: Sandbox Pool
  6. Deep Dive 3: Data Access Layer
  7. Scaling & ML
  8. Cheat Sheet

1 Clarifying Questions & Scope

Dimension         Clarification                Assumption
────────────────  ───────────────────────────  ────────────────────────────────────────────────────────────────────
Languages         Which languages to support?  Python, SQL, shell scripts (Bash). Extensible to JS/R.
Use Cases         What kinds of code?          Data queries, report generation, automation scripts, data transforms
Data Access       What data can code access?   Tenant's own data only, via controlled data access proxy
Execution Limits  Time and resource limits?    30s max execution, 1 CPU, 512MB RAM, 100MB disk
Audit             What needs to be logged?     Every execution: code, input, output, duration, user, status

2 Back-of-Envelope Estimation

Scale Numbers

  • 50K code executions/day across all tenants
  • Peak: 200 concurrent executions
  • Sandbox spin-up: <2 seconds (warm pool)
  • Average execution time: 5 seconds
  • Sandbox pool size: 300 warm instances (200 peak + 50% buffer)
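
The pool is sized for the peak, not the average. A quick check of the arithmetic, as a sketch: average concurrency comes from Little's law (arrival rate × average duration) and assumes load spread evenly over 24 hours, which real traffic will not be. The constant names are illustrative.

```python
import math

EXECUTIONS_PER_DAY = 50_000
AVG_EXECUTION_SECONDS = 5
PEAK_CONCURRENT = 200
BUFFER_FRACTION = 0.5
SECONDS_PER_DAY = 24 * 60 * 60

# Little's law: average in-flight executions = arrival rate * average duration.
avg_rate_per_second = EXECUTIONS_PER_DAY / SECONDS_PER_DAY
avg_concurrent = avg_rate_per_second * AVG_EXECUTION_SECONDS

# Pool size = peak concurrency plus the 50% buffer.
pool_size = math.ceil(PEAK_CONCURRENT * (1 + BUFFER_FRACTION))

print(f"average rate: {avg_rate_per_second:.2f} executions/s")  # 0.58
print(f"average concurrency: {avg_concurrent:.1f}")             # 2.9
print(f"warm pool size: {pool_size}")                           # 300
```

Average concurrency is under 3, so the 200-concurrent peak (roughly 70× the average) is what drives the pool size.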

3 High-Level Architecture

  SANDBOXED CODE EXECUTION PIPELINE
  ═══════════════════════════════════════════════════════════════════

  AI Agent generates code
       │
  ┌────v──────────┐     ┌──────────────┐     ┌──────────────────┐
  │  CODE         │────>│  SANDBOX     │────>│   EXECUTE        │
  │  VALIDATOR    │     │  POOL        │     │                  │
  │               │     │              │     │ Resource limits: │
  │ • Static      │     │ • gVisor /   │     │ • 1 CPU          │
  │   analysis    │     │   Firecracker│     │ • 512MB RAM      │
  │ • Whitelist   │     │ • Fresh per  │     │ • 100MB disk     │
  │   libs        │     │   execution  │     │ • 30s timeout    │
  │ • SQL inject  │     │ • Warm pool  │     │ • SIGKILL on     │
  │   prevention  │     │   (300)      │     │   timeout        │
  │ • LLM safety  │     │ • tmpfs      │     │                  │
  │   review      │     │   filesystem │     │                  │
  └───────────────┘     └──────────────┘     └────────┬─────────┘
                                                      │
       ┌──────────────────────────────────────────────┘
       │
  ┌────v──────────┐     ┌──────────────┐
  │  OUTPUT       │────>│  AUDIT LOG   │
  │  SANITIZER    │     │              │
  │               │     │ • Code       │
  │ • Truncate    │     │ • Input      │
  │   large output│     │ • Output     │
  │ • Redact PII  │     │ • Duration   │
  │ • Format      │     │ • User       │
  │   results     │     │ • Status     │
  └───────────────┘     └──────────────┘
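
A minimal sketch of the Output Sanitizer stage from the diagram above (redact, then truncate). The 10,000-character cap and the two regex patterns are assumptions for illustration; the source specifies neither, and real PII detection needs far more than two patterns.

```python
import re

MAX_OUTPUT_CHARS = 10_000  # assumed cap; not specified in the design

# Deliberately simple patterns, for illustration only.
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
SSN_RE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")


def sanitize_output(raw: str, max_chars: int = MAX_OUTPUT_CHARS) -> str:
    """Redact PII first, then truncate, so a cut never splits a match."""
    redacted = EMAIL_RE.sub("[REDACTED_EMAIL]", raw)
    redacted = SSN_RE.sub("[REDACTED_SSN]", redacted)
    if len(redacted) > max_chars:
        dropped = len(redacted) - max_chars
        redacted = redacted[:max_chars] + f"\n[truncated {dropped} characters]"
    return redacted


print(sanitize_output("contact jane@acme.com"))  # contact [REDACTED_EMAIL]
```

Redacting before truncating matters: truncating first could cut an email address in half and leave a fragment the pattern no longer matches.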

4 Deep Dive 1: Code Validator

Multi-Layer Validation

Before any code touches a sandbox, it passes through 4 layers of validation:

  1. Static Analysis (AST Parsing)
  2. Library Whitelist
  3. SQL Injection Prevention
  4. LLM Safety Review (for complex cases)
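
Layers 1 and 2 can be sketched together with Python's standard `ast` module: parse the code without running it, then walk the tree looking for disallowed imports and calls. The allowlist and banned-call sets below are hypothetical; the source does not say which libraries are approved.

```python
import ast

# Hypothetical policy; the design does not list the approved libraries.
ALLOWED_MODULES = {"math", "json", "datetime", "statistics", "pandas", "db_client"}
BANNED_CALLS = {"eval", "exec", "compile", "__import__", "open"}


def validate_python(source: str) -> list[str]:
    """Return a list of policy violations; an empty list means the code passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg} (line {exc.lineno})"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            modules = [alias.name for alias in node.names]
        elif isinstance(node, ast.ImportFrom):
            modules = [node.module or ""]  # relative imports have module=None
        else:
            modules = []
        for name in modules:
            root = name.split(".")[0]
            if root not in ALLOWED_MODULES:
                violations.append(f"line {node.lineno}: import of '{name}' is not allowed")
        if (isinstance(node, ast.Call) and isinstance(node.func, ast.Name)
                and node.func.id in BANNED_CALLS):
            violations.append(f"line {node.lineno}: call to '{node.func.id}' is not allowed")
    return violations


print(validate_python("import pandas as pd"))  # []
print(validate_python("import os"))            # ["line 1: import of 'os' is not allowed"]
```

A check like this only catches direct names. Indirect access (for example through `getattr` or aliased builtins) slips past it, so it cannot be the only barrier.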

Defense in Depth: Validation is the FIRST line of defense, not the ONLY one. Even if validation misses something, the sandbox itself prevents real damage (no network, no persistent filesystem, resource limits, SIGKILL timeout).

5 Deep Dive 2: Sandbox Pool

Sandbox Technology

Technology        Isolation Level                           Spin-Up Time           Best For
────────────────  ────────────────────────────────────────  ─────────────────────  ────────────────────────────────────────────────────
gVisor (runsc)    User-space kernel (syscall interception)  <500ms (warm)          Most use cases. Good balance of security + speed.
Firecracker       Full microVM isolation                    ~125ms (microVM boot)  Highest security needs. AWS Lambda uses this.
Docker + seccomp  Container-level                           <1s                    Development/testing. Not recommended for production.

Warm Pool Architecture

  SANDBOX POOL MANAGEMENT
  ═══════════════════════════════════════════════════════

  WARM POOL (300 pre-created sandboxes)
  ┌─────────────────────────────────────────────────┐
  │  [sandbox-001] IDLE  │ Python 3.11 + libs loaded │
  │  [sandbox-002] IDLE  │ Python 3.11 + libs loaded │
  │  [sandbox-003] IN USE│ Running user code...      │
  │  [sandbox-004] IDLE  │ Python 3.11 + libs loaded │
  │  ...                                              │
  │  [sandbox-300] IDLE  │ Python 3.11 + libs loaded │
  └─────────────────────────────────────────────────┘

  LIFECYCLE:
  ─────────────────────────────────────────────────────
  1. IDLE → Checkout (assign to execution request)
  2. IN USE → Code runs inside sandbox
  3. COMPLETE → Sandbox DESTROYED (never reused)
  4. REPLENISH → New sandbox created to maintain pool size

  WHY DESTROY?
  A previous execution might have left state (variables,
  temp files, modified env). Fresh sandbox = zero leakage.
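
The checkout / destroy / replenish lifecycle above can be modeled in a few lines. This is an in-memory sketch only: `Sandbox`, `WarmPool`, and their methods are hypothetical names, and `_create` stands in for whatever call actually boots a gVisor or Firecracker instance.

```python
import itertools
import queue
from dataclasses import dataclass


@dataclass
class Sandbox:
    sandbox_id: str
    destroyed: bool = False


class WarmPool:
    """In-memory model of the pool; _create stands in for a real sandbox boot."""

    def __init__(self, size: int):
        self._ids = itertools.count(1)
        self._idle = queue.Queue()
        for _ in range(size):
            self._idle.put(self._create())

    def _create(self) -> Sandbox:
        return Sandbox(sandbox_id=f"sandbox-{next(self._ids):03d}")

    def checkout(self, timeout_s: float = 2.0) -> Sandbox:
        # IDLE -> IN USE. Raises queue.Empty if no sandbox frees up in time.
        return self._idle.get(timeout=timeout_s)

    def release(self, sandbox: Sandbox) -> None:
        # COMPLETE -> DESTROYED, then REPLENISH with a fresh instance.
        sandbox.destroyed = True
        self._idle.put(self._create())

    def idle_count(self) -> int:
        return self._idle.qsize()


pool = WarmPool(size=3)
sb = pool.checkout()
pool.release(sb)
print(sb.destroyed, pool.idle_count())  # True 3
```

In a real pool the replenish step would run asynchronously, so a finishing execution does not wait for the next sandbox to boot.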

Resource Limits (per sandbox)
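
A process-level sketch of the limits from the scope table (512MB RAM, 30s timeout, SIGKILL on timeout), using only the standard library on Linux. This illustrates the idea and is not how gVisor or Firecracker enforce limits; those would normally apply CPU and memory caps through cgroups or VM configuration, and the 100MB disk cap through the tmpfs mount size. `run_limited` and `_cap` are hypothetical helpers.

```python
import resource
import subprocess
import sys

CPU_SECONDS = 30                   # CPU-time cap; wall-clock timeout is separate
MEMORY_BYTES = 512 * 1024 * 1024   # address-space cap
FILE_BYTES = 100 * 1024 * 1024     # per-file write cap (not a total disk quota)


def _cap(kind: int, value: int) -> None:
    # Never ask for more than the hard limit this process already has.
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def _apply_limits() -> None:
    # Runs in the child between fork and exec (POSIX only).
    _cap(resource.RLIMIT_CPU, CPU_SECONDS)
    _cap(resource.RLIMIT_AS, MEMORY_BYTES)
    _cap(resource.RLIMIT_FSIZE, FILE_BYTES)


def run_limited(code: str, timeout_s: float = 30.0) -> dict:
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True, text=True,
            timeout=timeout_s, preexec_fn=_apply_limits,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run has already killed the child (SIGKILL on POSIX).
        return {"status": "timeout", "stdout": "", "stderr": ""}
    status = "success" if proc.returncode == 0 else "error"
    return {"status": status, "stdout": proc.stdout, "stderr": proc.stderr}


print(run_limited("print(2 + 2)")["stdout"])  # 4
```

The wall-clock timeout and the CPU-time limit are separate on purpose: a script that sleeps or blocks on I/O uses almost no CPU time but still has to be killed at 30 seconds.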

6 Deep Dive 3: Data Access Layer

Data Access Proxy

Code inside the sandbox cannot directly access databases. Instead, it talks to a Data Access Proxy that enforces permissions:

  DATA ACCESS ARCHITECTURE
  ═══════════════════════════════════════════════════════

  ┌──────────┐     ┌──────────────────┐     ┌──────────────┐
  │ Sandbox  │────>│  Data Access     │────>│  Read-Only   │
  │ (Code)   │     │  Proxy           │     │  Replica     │
  │          │     │                  │     │  (Database)  │
  │ import   │     │ • Validates SQL  │     │              │
  │ db_client│     │ • Checks perms   │     │ SELECT only  │
  │          │     │ • Enforces row   │     │ No writes    │
  │ result = │     │   limits (10K)   │     │              │
  │ db.query(│     │ • No raw creds   │     │              │
  │  "SELECT │     │ • Query timeout  │     │              │
  │   ...")  │     │   (10s)          │     │              │
  └──────────┘     └──────────────────┘     └──────────────┘

Proxy Enforcement Rules
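
A sketch of the checks the proxy could run before forwarding a query: single statement, SELECT only, table allowlist, and the 10K row cap. The regex matching is an assumption made to keep the example short. It misses subqueries, CTEs, and comma joins, so a real proxy would use a proper SQL parser. `enforce` and `QueryRejected` are hypothetical names.

```python
import re

MAX_ROWS = 10_000


class QueryRejected(Exception):
    pass


# Naive table extraction; a real proxy would use a SQL parser instead.
TABLE_RE = re.compile(r"\b(?:FROM|JOIN)\s+([A-Za-z_][A-Za-z0-9_]*)", re.IGNORECASE)
LIMIT_RE = re.compile(r"\bLIMIT\s+(\d+)\s*$", re.IGNORECASE)


def enforce(sql: str, allowed_tables: set[str]) -> str:
    """Return the query to run on the replica, or raise QueryRejected."""
    stmt = sql.strip().rstrip(";").strip()
    if ";" in stmt:
        raise QueryRejected("multiple statements are not allowed")
    if not re.match(r"SELECT\b", stmt, re.IGNORECASE):
        raise QueryRejected("only SELECT statements are allowed")
    tables = {t.lower() for t in TABLE_RE.findall(stmt)}
    denied = tables - {t.lower() for t in allowed_tables}
    if denied:
        raise QueryRejected(f"tables not permitted: {sorted(denied)}")
    match = LIMIT_RE.search(stmt)
    if match is None:
        return f"{stmt} LIMIT {MAX_ROWS}"          # no limit given: add the cap
    if int(match.group(1)) > MAX_ROWS:
        return LIMIT_RE.sub(f"LIMIT {MAX_ROWS}", stmt)  # clamp oversized limit
    return stmt


print(enforce("SELECT * FROM tickets", {"tickets", "users"}))
# SELECT * FROM tickets LIMIT 10000
```

The 10s query timeout and the database credentials are not shown. They would sit in the proxy's own connection to the replica, which is what keeps raw credentials out of the sandbox.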

Full Audit Log

Field              Example
─────────────────  ──────────────────────────────────────────────────
execution_id       exec-2026031510-abc123
tenant_id          acme-corp
user_id            jane@acme.com
language           python
code (sanitized)   import pandas as pd; df = db.query("SELECT...")...
duration_ms        3,450
status             success
output_size_bytes  12,480
data_accessed      ["tickets", "users"] (tables queried)
rows_returned      247
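
The audit fields above map naturally onto an immutable record serialized as one JSON object per line. A minimal sketch: the field names come from the table, while `AuditRecord`, `to_log_line`, and the JSON-lines format are assumptions.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class AuditRecord:
    execution_id: str
    tenant_id: str
    user_id: str
    language: str
    code: str               # sanitized before logging
    duration_ms: int
    status: str
    output_size_bytes: int
    data_accessed: list[str] = field(default_factory=list)
    rows_returned: int = 0


def to_log_line(record: AuditRecord) -> str:
    """Serialize as a single JSON line with sorted keys."""
    return json.dumps(asdict(record), sort_keys=True)


record = AuditRecord(
    execution_id="exec-2026031510-abc123",
    tenant_id="acme-corp",
    user_id="jane@acme.com",
    language="python",
    code='import pandas as pd; df = db.query("SELECT...")...',
    duration_ms=3450,
    status="success",
    output_size_bytes=12480,
    data_accessed=["tickets", "users"],
    rows_returned=247,
)
print(to_log_line(record))
```

`frozen=True` prevents field reassignment after the record is created, which fits an append-only audit trail.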

7 Scaling & ML

Scaling Strategies

ML Enhancements

8 Cheat Sheet

Sandboxed Code Execution — Key Numbers

  • 50K executions/day, 200 peak concurrent
  • <2s spin-up with warm pool (300 instances)
  • 4-layer validation: static analysis, whitelist, SQL prevention, LLM review
  • gVisor/Firecracker for kernel-level isolation
  • Resource limits: 1 CPU, 512MB RAM, 100MB tmpfs, 30s timeout
  • SIGKILL on timeout — no graceful shutdown
  • Fresh sandbox per execution — never reused
  • No network access inside sandbox
  • Data Access Proxy: read-only replica, 10K row limit, table allowlist
  • Full audit log: code, input, output, duration, user, status, tables accessed