Sandboxed Code Execution Environment

1 Clarifying Questions & Scope

Dimension	Clarification	Assumption
Languages	Which languages to support?	Python, SQL, shell scripts (Bash). Extensible to JS/R.
Use Cases	What kinds of code?	Data queries, report generation, automation scripts, data transforms
Data Access	What data can code access?	Tenant's own data only, via controlled data access proxy
Execution Limits	Time and resource limits?	30s max execution, 1 CPU, 512MB RAM, 100MB disk
Audit	What needs to be logged?	Every execution: code, input, output, duration, user, status

2 Back-of-Envelope Estimation

Scale Numbers

50K code executions/day across all tenants
Peak: 200 concurrent executions
Sandbox spin-up: <2 seconds (warm pool)
Average execution time: 5 seconds
Sandbox pool size: 300 warm instances (200 peak + 50% buffer)
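
As a sanity check on the numbers above, the arithmetic can be sketched in a few lines. The variable names are illustrative, and Little's law is applied here only as a rough estimate of average load.

```python
# Back-of-envelope check of the scale numbers above.
executions_per_day = 50_000
avg_execution_s = 5
peak_concurrent = 200
buffer_fraction = 0.5

# Little's law: average concurrency = arrival rate x average time in system.
arrival_rate_per_s = executions_per_day / 86_400
avg_concurrent = arrival_rate_per_s * avg_execution_s

# The warm pool is sized for the peak plus headroom, not for the average.
pool_size = int(peak_concurrent * (1 + buffer_fraction))

print(f"average arrival rate: {arrival_rate_per_s:.2f} executions/s")
print(f"average concurrency:  {avg_concurrent:.1f} sandboxes")
print(f"warm pool size:       {pool_size} sandboxes")
```

Average concurrency comes out near 3 sandboxes while the peak is 200, so the pool is sized for bursts rather than steady-state load.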

      

3 High-Level Architecture

  SANDBOXED CODE EXECUTION PIPELINE
  ═══════════════════════════════════════════════════════════════════

  AI Agent generates code
       │
  ┌────v──────────┐     ┌──────────────┐     ┌──────────────────┐
  │  CODE         │────>│  SANDBOX     │────>│   EXECUTE        │
  │  VALIDATOR    │     │  POOL        │     │                  │
  │               │     │              │     │  Resource limits: │
  │ • Static      │     │ • gVisor /   │     │  • 1 CPU          │
  │   analysis    │     │   Firecracker│     │  • 512MB RAM      │
  │ • Whitelist   │     │ • Fresh per  │     │  • 100MB disk     │
  │   libs        │     │   execution  │     │  • 30s timeout    │
  │ • SQL inject  │     │ • Warm pool  │     │  • SIGKILL on     │
  │   prevention  │     │   (300)      │     │    timeout        │
  │ • LLM safety  │     │ • tmpfs      │     │                  │
  │   review      │     │   filesystem │     │                  │
  └───────────────┘     └──────────────┘     └────────┬─────────┘
                                                      │
       ┌──────────────────────────────────────────────┘
       │
  ┌────v──────────┐     ┌──────────────┐
  │  OUTPUT       │────>│  AUDIT LOG   │
  │  SANITIZER    │     │              │
  │               │     │ • Code       │
  │ • Truncate    │     │ • Input      │
  │   large output│     │ • Output     │
  │ • Redact PII  │     │ • Duration   │
  │ • Format      │     │ • User       │
  │   results     │     │ • Status     │
  └───────────────┘     └──────────────┘
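
The output sanitizer stage in the diagram could be sketched as follows. This is a minimal illustration, not a complete PII detector: the 10,000-character cap, the e-mail-only redaction pattern, and the name `sanitize_output` are all assumptions.

```python
import re

MAX_OUTPUT_CHARS = 10_000  # assumed cap; the design above does not specify one
EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")


def sanitize_output(raw: str, max_chars: int = MAX_OUTPUT_CHARS) -> str:
    """Redact e-mail addresses, then truncate oversized output."""
    # Redact first so truncation can never leave half an address behind.
    redacted = EMAIL_RE.sub("[REDACTED_EMAIL]", raw)
    if len(redacted) > max_chars:
        omitted = len(redacted) - max_chars
        redacted = redacted[:max_chars] + f"\n[truncated {omitted} characters]"
    return redacted
```

A production sanitizer would likely need more PII patterns (phone numbers, account IDs) than this single regex covers.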

4 Deep Dive 1: Code Validator

Multi-Layer Validation

Before any code touches a sandbox, it passes through 4 layers of validation:

1 Static Analysis (AST Parsing)

Python: Parse with the `ast` module. Walk the AST to detect forbidden patterns.
Blocked operations: `os.system()`, `subprocess.*`, `eval()`, `exec()`, `__import__()`, file system writes outside tmpfs, network socket creation, process spawning.
Blocked modules: `os`, `subprocess`, `socket`, `ctypes`, `pickle` (deserialization attacks), `importlib`.
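
A minimal sketch of this layer, assuming the blocked lists above. The function name `find_violations` is hypothetical. Attribute calls such as `os.system()` are caught indirectly here, because the import of `os` is rejected.

```python
import ast

BLOCKED_MODULES = {"os", "subprocess", "socket", "ctypes", "pickle", "importlib"}
BLOCKED_CALLS = {"eval", "exec", "__import__"}


def find_violations(source: str) -> list[str]:
    """Return reasons to reject `source`; an empty list means it passed."""
    try:
        tree = ast.parse(source)
    except SyntaxError as exc:
        return [f"syntax error: {exc.msg}"]

    violations = []
    for node in ast.walk(tree):
        if isinstance(node, ast.Import):
            for alias in node.names:
                # "os.path" and "os" share the same top-level module.
                if alias.name.split(".")[0] in BLOCKED_MODULES:
                    violations.append(f"line {node.lineno}: import of {alias.name}")
        elif isinstance(node, ast.ImportFrom):
            if (node.module or "").split(".")[0] in BLOCKED_MODULES:
                violations.append(f"line {node.lineno}: import from {node.module}")
        elif isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
            if node.func.id in BLOCKED_CALLS:
                violations.append(f"line {node.lineno}: call to {node.func.id}()")
    return violations
```

Name-based checks like this are easy to evade through aliasing or dynamic attribute access, which is why later layers and the sandbox itself still matter.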

2 Library Whitelist

Allowed libraries: pandas, numpy, datetime, json, csv, math, statistics, re, collections, itertools.
NOT allowed: requests (no HTTP), boto3 (no AWS), paramiko (no SSH), any library that enables network or file system access.
Custom libraries: Tenant-configurable whitelist for their specific needs (e.g., allow `openpyxl` for Excel processing).
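
The allowlist check with a tenant-specific extension could look roughly like this. `disallowed_imports` and its `tenant_extra` parameter are illustrative names, not part of any existing API.

```python
import ast

BASE_ALLOWED = frozenset({
    "pandas", "numpy", "datetime", "json", "csv", "math",
    "statistics", "re", "collections", "itertools",
})


def disallowed_imports(source: str, tenant_extra: frozenset = frozenset()) -> set:
    """Return the top-level modules imported by `source` that are not allowed."""
    allowed = BASE_ALLOWED | tenant_extra
    imported = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            imported.update(alias.name.split(".")[0] for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            # Relative imports have no place in a single submitted script.
            imported.add(node.module.split(".")[0] if node.level == 0 else "<relative>")
    return imported - allowed
```

For example, a tenant that needs Excel processing would pass `frozenset({"openpyxl"})` as `tenant_extra`, and `import openpyxl` would then return an empty set.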

3 SQL Injection Prevention

Parameterized queries only: All SQL must use parameterized queries. Raw string concatenation with user input is rejected.
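Statement whitelist: Only SELECT and WITH (CTEs) are allowed. No INSERT, UPDATE, DELETE, DROP, ALTER, CREATE, TRUNCATE. Because some databases (PostgreSQL, for example) permit data-modifying statements inside a WITH clause, the forbidden keywords must be rejected anywhere in the statement, not only at the start.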
Table whitelist: SQL can only reference tables the tenant has been granted access to.
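
A crude, standard-library-only sketch of the statement whitelist follows. It is an assumption-laden illustration: `check_sql` is a hypothetical name, and a real validator would use a proper SQL parser, which is also what the table whitelist needs. String literals and comments are blanked out first so keywords inside them do not trigger false positives.

```python
import re

FORBIDDEN = {"INSERT", "UPDATE", "DELETE", "DROP", "ALTER", "CREATE", "TRUNCATE"}
# One pass: string literals are matched before comments, so "--" inside a
# literal is not mistaken for a comment (and vice versa).
_LITERAL_OR_COMMENT = re.compile(r"'(?:[^']|'')*'|--[^\n]*|/\*.*?\*/", re.S)


def check_sql(sql: str) -> list[str]:
    """Return reasons to reject `sql`; an empty list means it passed."""
    stripped = _LITERAL_OR_COMMENT.sub(
        lambda m: "''" if m.group().startswith("'") else " ", sql
    )
    problems = []
    statements = [s for s in stripped.split(";") if s.strip()]
    if len(statements) != 1:
        problems.append("exactly one statement is allowed")
    tokens = re.findall(r"[A-Za-z_][A-Za-z0-9_]*", stripped.upper())
    if not tokens or tokens[0] not in {"SELECT", "WITH"}:
        problems.append("statement must start with SELECT or WITH")
    # Checked anywhere in the statement, which also covers CTE bodies.
    for keyword in sorted(FORBIDDEN.intersection(tokens)):
        problems.append(f"forbidden keyword: {keyword}")
    return problems
```

Whole-token matching means a column such as `update_time` or `created` is not flagged, while a `DELETE` hidden inside a CTE is.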

4 LLM Safety Review (for complex cases)

When triggered: If static analysis can't determine safety (e.g., dynamic attribute access, metaprogramming patterns).
Fast LLM check: Send code to a fast model with prompt: "Does this code attempt to access the filesystem, network, or execute arbitrary commands? Respond YES or NO with explanation."
Latency: ~200ms. Only triggered for ~5% of code submissions.
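
One way to decide when to escalate, sketched under assumptions: the list of "dynamic" builtins is a guess at what static analysis cannot resolve, and `ask_model` is a hypothetical callable standing in for whatever fast-model client is used. The review fails closed, so any answer other than a clear NO is treated as unsafe.

```python
import ast
from typing import Callable

# Assumed triggers: builtins that make attribute or name lookup dynamic.
DYNAMIC_BUILTINS = {"getattr", "setattr", "delattr", "globals", "locals", "vars"}

REVIEW_PROMPT = (
    "Does this code attempt to access the filesystem, network, or execute "
    "arbitrary commands? Respond YES or NO with explanation."
)


def needs_llm_review(source: str) -> bool:
    """True when the code uses patterns static analysis cannot resolve."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Name) and node.id in DYNAMIC_BUILTINS:
            return True
        # Dunder attribute access (e.g. obj.__class__) is a metaprogramming hint.
        if (isinstance(node, ast.Attribute)
                and node.attr.startswith("__") and node.attr.endswith("__")):
            return True
    return False


def llm_says_safe(source: str, ask_model: Callable[[str], str]) -> bool:
    """Ask the model; only an answer beginning with NO counts as safe."""
    answer = ask_model(f"{REVIEW_PROMPT}\n\n{source}")
    return answer.strip().upper().startswith("NO")
```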

Defense in Depth: Validation is the FIRST line of defense, not the ONLY one. Even if validation misses something, the sandbox itself prevents real damage (no network, no persistent filesystem, resource limits, SIGKILL timeout).

5 Deep Dive 2: Sandbox Pool

Sandbox Technology

Technology	Isolation Level	Spin-Up Time	Best For
gVisor (runsc)	User-space kernel that intercepts syscalls	<500ms (warm)	Most use cases. Good balance of security + speed.
Firecracker	Full microVM isolation	<125ms (warm)	Highest security needs. AWS Lambda uses this.
Docker + seccomp	Container-level	<1s	Development/testing. Not recommended for production.

Warm Pool Architecture

  SANDBOX POOL MANAGEMENT
  ═══════════════════════════════════════════════════════

  WARM POOL (300 pre-created sandboxes)
  ┌─────────────────────────────────────────────────┐
  │  [sandbox-001] IDLE  │ Python 3.11 + libs loaded │
  │  [sandbox-002] IDLE  │ Python 3.11 + libs loaded │
  │  [sandbox-003] IN USE│ Running user code...      │
  │  [sandbox-004] IDLE  │ Python 3.11 + libs loaded │
  │  ...                                              │
  │  [sandbox-300] IDLE  │ Python 3.11 + libs loaded │
  └─────────────────────────────────────────────────┘

  LIFECYCLE:
  ─────────────────────────────────────────────────────
  1. IDLE → Checkout (assign to execution request)
  2. IN USE → Code runs inside sandbox
  3. COMPLETE → Sandbox DESTROYED (never reused)
  4. REPLENISH → New sandbox created to maintain pool size

  WHY DESTROY?
  A previous execution might have left state (variables,
  temp files, modified env). Fresh sandbox = zero leakage.
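
The lifecycle above can be sketched as a small in-memory model. This is illustrative only: `WarmPool` and `Sandbox` are hypothetical names, `_create` is a placeholder for booting a real gVisor or Firecracker sandbox, and the class is not thread-safe.

```python
import itertools
from collections import deque
from dataclasses import dataclass


@dataclass
class Sandbox:
    sandbox_id: str
    state: str = "IDLE"  # IDLE -> IN_USE -> DESTROYED


class WarmPool:
    def __init__(self, target_size: int):
        self._ids = itertools.count(1)
        self._target = target_size
        self._in_use = 0
        self._idle = deque()
        self._replenish()

    def _create(self) -> Sandbox:
        # Placeholder: a real pool would boot and pre-load a sandbox here.
        return Sandbox(f"sandbox-{next(self._ids):03d}")

    def _replenish(self) -> None:
        # Keep idle + in-use sandboxes at the target pool size.
        while len(self._idle) + self._in_use < self._target:
            self._idle.append(self._create())

    @property
    def idle_count(self) -> int:
        return len(self._idle)

    def checkout(self) -> Sandbox:
        if not self._idle:
            raise RuntimeError("pool exhausted")
        sandbox = self._idle.popleft()
        sandbox.state = "IN_USE"
        self._in_use += 1
        return sandbox

    def release(self, sandbox: Sandbox) -> None:
        # Never returned to the idle queue: destroy, then create a fresh one.
        sandbox.state = "DESTROYED"
        self._in_use -= 1
        self._replenish()
```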

Resource Limits (per sandbox)

CPU: 1 vCPU. Cgroup CPU quota prevents monopolizing host.
Memory: 512 MB hard limit. If exceeded, the OOM killer terminates the process and the execution is reported as failed.
Disk: 100 MB tmpfs (in-memory filesystem). No persistent disk. Destroyed with sandbox.
Time: 30 seconds hard timeout. SIGKILL sent at timeout — no graceful shutdown, immediate termination.
Network: NONE. Network namespace with no interfaces. Code cannot make HTTP calls, DNS lookups, or any other outbound connection; the Data Access Proxy channel described in Section 6 is the only path out of the sandbox.
Processes: Max 10 PIDs. Prevents fork bombs.
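
A process-level approximation of these limits, using only the Python standard library on a POSIX host. This is a sketch, not the cgroup and namespace setup described above: rlimits stand in for cgroup quotas, `run_limited` is a hypothetical helper, and the PID cap is left out because `RLIMIT_NPROC` counts processes per user rather than per sandbox.

```python
import resource
import subprocess
import sys
from dataclasses import dataclass

MEMORY_BYTES = 512 * 1024 * 1024   # 512 MB
DISK_BYTES = 100 * 1024 * 1024     # 100 MB


@dataclass
class RunResult:
    status: str   # "success", "error", or "timeout"
    stdout: str


def _cap(kind: int, value: int) -> None:
    """Lower soft and hard limits, never above the inherited hard limit."""
    _, hard = resource.getrlimit(kind)
    if hard != resource.RLIM_INFINITY:
        value = min(value, hard)
    resource.setrlimit(kind, (value, value))


def _apply_limits() -> None:
    # Runs in the child process between fork and exec.
    _cap(resource.RLIMIT_AS, MEMORY_BYTES)    # address space, approximates the RAM cap
    _cap(resource.RLIMIT_FSIZE, DISK_BYTES)   # largest file the child may write


def run_limited(code: str, timeout_s: float = 30.0) -> RunResult:
    try:
        proc = subprocess.run(
            [sys.executable, "-I", "-c", code],
            capture_output=True, text=True,
            timeout=timeout_s, preexec_fn=_apply_limits,
        )
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child (SIGKILL on POSIX) before raising.
        return RunResult("timeout", "")
    return RunResult("success" if proc.returncode == 0 else "error", proc.stdout)
```

For example, `run_limited("while True: pass", timeout_s=1)` returns a result with status `"timeout"` after the child has been killed.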

6 Deep Dive 3: Data Access Layer

Data Access Proxy

Code inside the sandbox cannot directly access databases. Instead, it talks to a Data Access Proxy that enforces permissions:

  DATA ACCESS ARCHITECTURE
  ═══════════════════════════════════════════════════════

  ┌──────────┐     ┌──────────────────┐     ┌──────────────┐
  │ Sandbox  │────>│  Data Access     │────>│  Read-Only   │
  │ (Code)   │     │  Proxy           │     │  Replica     │
  │          │     │                  │     │  (Database)  │
  │ import   │     │ • Validates SQL  │     │              │
  │ db_client│     │ • Checks perms   │     │ SELECT only  │
  │          │     │ • Enforces row   │     │ No writes    │
  │ result = │     │   limits (10K)   │     │              │
  │ db.query(│     │ • No raw creds   │     │              │
  │  "SELECT │     │ • Query timeout  │     │              │
  │   ...")  │     │   (10s)          │     │              │
  └──────────┘     └──────────────────┘     └──────────────┘

Proxy Enforcement Rules

Read-only replica: Proxy connects to read-only database replica. Even if SQL injection slips through, it can't modify data.
Row limits: Max 10,000 rows per query. Limits bulk data exfiltration via large SELECT * queries.
Table allowlist: Code can only query tables explicitly granted to this tenant. No system tables, no cross-tenant data.
No raw credentials: Code uses `db_client.query()` — the proxy injects database credentials. Code never sees connection strings, passwords, or tokens.
Query timeout: 10-second timeout on queries. Long-running queries killed by the proxy before they impact database performance.
Result size limit: Max 5 MB response payload. Prevents memory exhaustion in sandbox from huge result sets.
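
Some of these rules can be sketched with an in-memory SQLite database standing in for the read-only replica. The `DataAccessProxy` class below is hypothetical; it relies on `sqlite3`'s `set_authorizer` hook to enforce read-only access and the table allowlist, and it truncates at the row limit (whether to truncate or reject is an assumption, since the rules above do not say). Query timeouts and the 5 MB payload cap are not shown.

```python
import sqlite3


class DataAccessProxy:
    """Owns the connection; sandboxed code only ever calls query()."""

    def __init__(self, conn: sqlite3.Connection, allowed_tables, max_rows: int = 10_000):
        self._conn = conn
        self._allowed = frozenset(allowed_tables)
        self._max_rows = max_rows
        # SQLite consults the authorizer while compiling each statement.
        conn.set_authorizer(self._authorize)

    def _authorize(self, action, arg1, arg2, db_name, source):
        if action == sqlite3.SQLITE_SELECT:
            return sqlite3.SQLITE_OK
        # For SQLITE_READ, arg1 is the table name and arg2 the column name.
        if action == sqlite3.SQLITE_READ and arg1 in self._allowed:
            return sqlite3.SQLITE_OK
        # Everything else (writes, DDL, other tables, SQL functions) is denied.
        return sqlite3.SQLITE_DENY

    def query(self, sql: str, params=()):
        """Return (rows, truncated); raises sqlite3.DatabaseError when denied."""
        cursor = self._conn.execute(sql, params)
        rows = cursor.fetchmany(self._max_rows + 1)
        truncated = len(rows) > self._max_rows
        return rows[: self._max_rows], truncated
```

The caller populates or connects the database before wrapping it, because once the authorizer is installed every write and every read of a non-granted table raises `sqlite3.DatabaseError`.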

Full Audit Log

Field	Example
execution_id	exec-2026031510-abc123
tenant_id	acme-corp
user_id	jane@acme.com
language	python
code (sanitized)	import pandas as pd; df = db.query("SELECT...")...
duration_ms	3,450
status	success
output_size_bytes	12,480
data_accessed	["tickets", "users"] (tables queried)
rows_returned	247
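
The audit fields above map naturally onto an immutable record. The sketch below is one possible shape: the `AuditRecord` class and JSON serialization are assumptions, and the sample values are the examples from the table.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    execution_id: str
    tenant_id: str
    user_id: str
    language: str
    code: str                 # sanitized before logging
    duration_ms: int
    status: str
    output_size_bytes: int
    data_accessed: tuple      # names of the tables queried
    rows_returned: int

    def to_json(self) -> str:
        """Serialize as one JSON line, suitable for an append-only log."""
        return json.dumps(asdict(self), sort_keys=True)


record = AuditRecord(
    execution_id="exec-2026031510-abc123",
    tenant_id="acme-corp",
    user_id="jane@acme.com",
    language="python",
    code='import pandas as pd; df = db.query("SELECT...")...',
    duration_ms=3450,
    status="success",
    output_size_bytes=12480,
    data_accessed=("tickets", "users"),
    rows_returned=247,
)
print(record.to_json())
```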

7 Scaling & ML

Scaling Strategies

Auto-scaling pool: Monitor pool utilization. Scale up when >70% in use. Scale down when <30% in use. Min 100, max 500 sandboxes.
Multi-region: Deploy sandbox pools in each region (US, EU, APAC). Route executions to nearest region for lowest latency.
Queue overflow: When pool is exhausted, queue requests with timeout. Users see "Execution queued, estimated wait: 5 seconds".
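
The auto-scaling rule can be expressed as a small pure function. The thresholds and bounds come from the text; the step of 50 sandboxes per adjustment and the name `desired_pool_size` are assumptions.

```python
def desired_pool_size(
    current_size: int,
    in_use: int,
    step: int = 50,            # assumed adjustment size
    min_size: int = 100,
    max_size: int = 500,
    scale_up_at: float = 0.70,
    scale_down_at: float = 0.30,
) -> int:
    """Return the pool size to converge on, given current utilization."""
    if current_size <= 0:
        raise ValueError("current_size must be positive")
    utilization = in_use / current_size
    if utilization > scale_up_at:
        target = current_size + step
    elif utilization < scale_down_at:
        target = current_size - step
    else:
        target = current_size
    # Clamp to the configured bounds.
    return max(min_size, min(max_size, target))
```

With a pool of 300 and 220 in use (about 73%), this returns 350; at 150 in use it stays at 300. A real controller would also add a cooldown between adjustments to avoid oscillation.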

ML Enhancements

Code quality prediction: Before execution, predict if code will succeed or fail. Catch obvious bugs (undefined variables, type mismatches) before wasting a sandbox.
Resource estimation: Predict CPU and memory needs from code analysis. Allocate right-sized sandbox (some code needs 256MB, some needs 1GB).
Anomaly detection: Detect unusual patterns — user suddenly running 100x more executions, code attempting new patterns. Alert security team.
Code optimization: Suggest more efficient code. "Your query scans full table; add WHERE clause to reduce rows by 90%."
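
Before any model is trained, the volume side of anomaly detection can start as a simple rule-based baseline. The sketch below is exactly that, not an ML model: it flags a user whose executions today exceed a multiple of their own median history. The function name, the median baseline, and the floor of 1 are assumptions; the 100x factor comes from the example above.

```python
from statistics import median


def is_volume_anomaly(
    history: list,
    today: int,
    factor: float = 100.0,
    min_baseline: float = 1.0,
) -> bool:
    """Flag `today` when it exceeds `factor` times the user's median daily count."""
    # The floor keeps brand-new users (no history) from dividing the baseline to zero.
    baseline = max(median(history), min_baseline) if history else min_baseline
    return today > factor * baseline
```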

8 Cheat Sheet

Sandboxed Code Execution — Key Numbers

50K executions/day, 200 peak concurrent
<2s spin-up with warm pool (300 instances)
4-layer validation: static analysis, library whitelist, SQL injection prevention, LLM review
gVisor (user-space kernel) or Firecracker (microVM) for sandbox isolation
Resource limits: 1 CPU, 512MB RAM, 100MB tmpfs, 30s timeout
SIGKILL on timeout — no graceful shutdown
Fresh sandbox per execution — never reused
No network access inside sandbox
Data Access Proxy: read-only replica, 10K row limit, table allowlist
Full audit log: code, input, output, duration, user, status, tables accessed
