Conversation Memory & Context System

"Design a memory system with three layers: short-term (current conversation), working (intermediate computation results), and long-term (user preferences, learned patterns across sessions)."

Table of Contents

  1. Requirements
  2. Back-of-Envelope Estimation
  3. High-Level Architecture
  4. Deep Dive 1: Short-Term Memory
  5. Deep Dive 2: Working Memory
  6. Deep Dive 3: Long-Term Memory
  7. Scaling & ML
  8. Cheat Sheet

1 Requirements

Functional Requirements

  • Store and retrieve ordered conversation messages within a session
  • Hold intermediate results and plan state during multi-step tasks
  • Persist user preferences and learned patterns across sessions
  • Assemble the optimal context for each LLM call from all three layers

Non-Functional Requirements

  • Sustain 15M messages/day with a 520 msg/sec peak
  • Low-latency context assembly on the hot path of every LLM call
  • Configurable retention (90 days to forever) per memory category
  • Privacy: per-user encryption and right-to-delete across all layers

2 Back-of-Envelope Estimation

Scale Numbers

  • 15M messages/day across all conversations
  • Peak: 520 messages/second
  • Redis (short-term + working): 24.8 GB
  • PostgreSQL (long-term structured): 0.93 TB
  • Vector storage (long-term semantic): 2.85 TB
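
The headline numbers above follow from simple arithmetic; a quick sanity check in Python (the ~3x peak-to-average ratio is an assumption, not stated elsewhere in the doc):

```python
# Back-of-envelope sanity check for the scale numbers above.
MESSAGES_PER_DAY = 15_000_000
SECONDS_PER_DAY = 86_400

avg_msg_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY   # ~173.6 msg/s average
peak_msg_per_sec = avg_msg_per_sec * 3                 # assumed ~3x peak factor -> ~520 msg/s

# Storage totals: Redis holds short-term + working; long-term is structured + vectors.
redis_gb = 8 + 16.8            # short-term (~8 GB) + working (~16.8 GB) = 24.8 GB
long_term_tb = 0.93 + 2.85     # PostgreSQL rows + vector embeddings = ~3.78 TB
```
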

  Memory Layer  Storage        TTL                          Access Pattern
  ────────────  ─────────────  ───────────────────────────  ─────────────────────────────────────────────────
  Short-term    Redis          24 hours                     Sequential reads, append-only during conversation
  Working       Redis          1 hour                       Random reads/writes during task execution
  Long-term     PG + pgvector  Configurable (90d-forever)   Semantic search, structured queries

3 High-Level Architecture

  3-LAYER MEMORY ARCHITECTURE
  ═══════════════════════════════════════════════════════════════════

  ┌─────────────────────────────────────────────────────────────┐
  │                   CONTEXT WINDOW MANAGER                    │
  │  Assembles the optimal context for each LLM call:           │
  │  1. System prompt (fixed)                                   │
  │  2. Relevant long-term memories (semantic search)           │
  │  3. Working memory (current task state)                     │
  │  4. Recent conversation (last N turns)                      │
  │  5. Current user message                                    │
  └───────┬─────────────────┬───────────────────┬───────────────┘
          │                 │                   │
  ┌───────v──────┐  ┌───────v──────┐  ┌─────────v──────────┐
  │ SHORT-TERM   │  │ WORKING      │  │ LONG-TERM          │
  │              │  │              │  │                    │
  │ Redis        │  │ Redis        │  │ PostgreSQL         │
  │ TTL: 24hr    │  │ TTL: 1hr     │  │ + pgvector         │
  │              │  │              │  │                    │
  │ Conversation │  │ Intermediate │  │ User preferences   │
  │ messages     │  │ results      │  │ Interaction history│
  │ (ordered)    │  │ (key-value)  │  │ Learned patterns   │
  │              │  │              │  │ Semantic search    │
  │ ~8 GB        │  │ ~16.8 GB     │  │ ~3.78 TB           │
  └──────────────┘  └──────────────┘  └────────────────────┘

Context Window Manager — 5 Steps
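
The five assembly steps can be sketched as a single function. This is a minimal sketch: the 4-chars-per-token heuristic, the tag names, and the drop-oldest-turns-first policy are illustrative assumptions, not a prescribed implementation.

```python
def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 chars per token

def assemble_context(system_prompt, memories, working_state, recent_turns,
                     user_message, budget=8000):
    """Build the LLM context in the fixed order: system prompt -> long-term
    memories -> working memory -> recent conversation -> current message."""
    fixed = [system_prompt, working_state, user_message] + memories
    spent = sum(rough_tokens(t) for t in fixed if t)

    # Fill the remaining budget with recent turns, dropping the oldest first.
    kept = []
    for turn in reversed(recent_turns):          # walk newest-first
        cost = rough_tokens(turn)
        if spent + cost > budget:
            break
        kept.append(turn)
        spent += cost
    kept.reverse()                               # restore chronological order

    context = [("system", system_prompt)]
    context += [("memory", m) for m in memories]
    if working_state:
        context.append(("task_state", working_state))
    context += [("turn", t) for t in kept]
    context.append(("user", user_message))
    return context
```

The fixed parts (steps 1, 3, 5, plus retrieved memories) are budgeted first; only the conversation tail is trimmed, which matches the summarization policy in Deep Dive 1.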

4 Deep Dive 1: Short-Term Memory

Redis Data Structure

  SHORT-TERM MEMORY (Redis)
  ═══════════════════════════════════════════

  Key: conv:{conversation_id}
  Type: Redis List (ordered messages)
  TTL: 24 hours

  LPUSH conv:abc123 {
    "role": "user",
    "content": "I need VPN access",
    "timestamp": "2026-03-15T10:00:00Z",
    "tokens": 8
  }

  LPUSH conv:abc123 {
    "role": "assistant",
    "content": "I'll help you with VPN access. Which office?",
    "timestamp": "2026-03-15T10:00:02Z",
    "tokens": 14,
    "tool_calls": []
  }

  LRANGE conv:abc123 0 -1  → Full history (newest first, since LPUSH prepends)
  LRANGE conv:abc123 0 9   → Last 10 messages
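
The list semantics above can be modeled with a small in-memory stand-in. A real deployment would use a Redis client (e.g. redis-py) against the `conv:{conversation_id}` keys; this class only mirrors the LPUSH/LRANGE behavior, and refreshing the TTL on every write is an assumption (plain Redis would need an explicit EXPIRE after each push).

```python
import json, time

class ShortTermMemory:
    """In-memory stand-in for the Redis list at conv:{conversation_id}."""
    TTL_SECONDS = 24 * 3600  # 24-hour TTL, as in the layer table

    def __init__(self):
        self._lists = {}  # key -> (expires_at, messages stored newest-first)

    def lpush(self, conv_id: str, message: dict):
        key = f"conv:{conv_id}"
        expires_at = time.time() + self.TTL_SECONDS   # refresh TTL on write
        _, msgs = self._lists.get(key, (None, []))
        msgs.insert(0, json.dumps(message))           # LPUSH prepends
        self._lists[key] = (expires_at, msgs)

    def lrange(self, conv_id: str, start: int, stop: int) -> list:
        key = f"conv:{conv_id}"
        entry = self._lists.get(key)
        if entry is None or entry[0] < time.time():   # expired or missing
            return []
        msgs = entry[1]
        end = len(msgs) if stop == -1 else stop + 1   # -1 means "to the end"
        return [json.loads(m) for m in msgs[start:end]]

mem = ShortTermMemory()
mem.lpush("abc123", {"role": "user", "content": "I need VPN access"})
mem.lpush("abc123", {"role": "assistant", "content": "Which office?"})
history = mem.lrange("abc123", 0, -1)  # newest first, like LPUSH + LRANGE
```
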

Context Window Summarization

When a conversation exceeds roughly 80% of the context token budget, older messages are summarized into a single compact turn; the most recent turns stay verbatim.
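
A minimal sketch of that trigger logic, assuming the 80% threshold from the cheat sheet; the `summarize` callback and the keep-6-recent-turns default are illustrative, standing in for an LLM summarization call:

```python
def maybe_compact(messages, budget_tokens, summarize, threshold=0.8, keep_recent=6):
    """Compress older messages into one summary turn when the conversation
    exceeds `threshold` of the token budget. `messages` is oldest-first,
    each with a precomputed "tokens" field like the stored records above."""
    total = sum(m["tokens"] for m in messages)
    if total <= threshold * budget_tokens or len(messages) <= keep_recent:
        return messages                          # still within budget
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_text = summarize(old)                # e.g. one LLM call over old turns
    summary = {"role": "system",
               "content": f"[summary of earlier conversation] {summary_text}",
               "tokens": max(1, len(summary_text) // 4)}
    return [summary] + recent
```
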

5 Deep Dive 2: Working Memory

Purpose: Intermediate Results During Multi-Step Tasks

When an agent executes a multi-step workflow, it needs to remember intermediate results. Working memory stores these temporarily.

Example: "Who is my manager and do they have an open approval for me?"

Step 1: Look up user in HR system → Working memory stores: { "manager": "Sarah Chen", "manager_id": "SC-4521" }
Step 2: Query approval system with manager_id → Working memory stores: { "pending_approvals": [{ "id": "APR-892", "type": "VPN Access", "status": "pending" }] }
Step 3: Compose answer using both results from working memory: "Your manager is Sarah Chen. She has one pending approval for you: VPN Access request (APR-892)."
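
The three steps above can be sketched end-to-end with a dict standing in for the Redis hash; the HR and approval lookups are stubbed lambdas here, where a real agent would call Workday and the approval system:

```python
import json

def run_task(working: dict, lookup_manager, fetch_approvals) -> str:
    # Step 1: resolve the manager and cache the result in working memory.
    manager = lookup_manager()                      # e.g. Workday API call
    working["step_1_result"] = json.dumps(manager)
    # Step 2: use the cached manager_id; cache the approvals too.
    approvals = fetch_approvals(manager["manager_id"])
    working["step_2_result"] = json.dumps(approvals)
    # Step 3: compose the answer purely from working memory.
    m = json.loads(working["step_1_result"])
    a = json.loads(working["step_2_result"])
    pending = ", ".join(f'{p["type"]} ({p["id"]})' for p in a["pending_approvals"])
    return f'Your manager is {m["manager"]}. Pending approval(s) for you: {pending}.'

# The dict stands in for HSET/HGET on work:abc123:task-001 with a 1h TTL.
working_memory = {}
answer = run_task(
    working_memory,
    lookup_manager=lambda: {"manager": "Sarah Chen", "manager_id": "SC-4521"},
    fetch_approvals=lambda mid: {"pending_approvals": [
        {"id": "APR-892", "type": "VPN Access", "status": "pending"}]},
)
```
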

Redis Data Structure

  WORKING MEMORY (Redis Hash)
  ═══════════════════════════════════════════

  Key: work:{conversation_id}:{task_id}
  Type: Redis Hash
  TTL: 1 hour

  HSET work:abc123:task-001 "step_1_result" '{
    "manager": "Sarah Chen",
    "manager_id": "SC-4521",
    "source": "workday_api",
    "retrieved_at": "2026-03-15T10:00:01Z"
  }'

  HSET work:abc123:task-001 "step_2_result" '{
    "pending_approvals": [...],
    "source": "approval_system",
    "retrieved_at": "2026-03-15T10:00:03Z"
  }'

  HSET work:abc123:task-001 "plan" '{
    "steps": ["lookup_manager", "check_approvals", "compose_answer"],
    "current_step": 2,
    "status": "in_progress"
  }'

Re-Planning Support
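
The `plan` entry above is what makes re-planning possible: if a step fails, the agent can rewrite the remaining steps without discarding completed `step_N_result` entries. A sketch of that update, assuming 1-indexed `current_step` as in the hash above (the alternative step names are illustrative):

```python
import json

def replan_on_failure(working: dict, alternative_steps: list) -> dict:
    """Rewrite the remaining plan after a step fails, keeping the completed
    steps and their cached step_N_result entries intact."""
    plan = json.loads(working["plan"])
    done = plan["steps"][:plan["current_step"] - 1]   # steps already completed
    plan["steps"] = done + alternative_steps          # swap in the new tail
    plan["current_step"] = len(done) + 1              # resume at first new step
    plan["status"] = "replanning"
    working["plan"] = json.dumps(plan)                # HSET work:...:task "plan"
    return plan

# Example: step 2 (check_approvals) failed; try an alternative route.
working = {"plan": json.dumps({
    "steps": ["lookup_manager", "check_approvals", "compose_answer"],
    "current_step": 2, "status": "in_progress"})}
new_plan = replan_on_failure(working, ["check_approvals_via_email", "compose_answer"])
```
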

6 Deep Dive 3: Long-Term Memory

Dual Storage: Structured + Semantic

Long-term memory splits across PostgreSQL for structured records (preferences and interaction history queried by exact fields) and pgvector embeddings for semantic search over stored memory summaries.

What Gets Stored Long-Term

  • User preferences
  • Interaction history (as summaries, not raw transcripts)
  • Learned patterns (recurring issues and the fixes that resolved them)

Example: Pattern Detection

Observation: User asked about VPN 3 times in the last month (March 1, March 8, March 14).
Stored as: "User has recurring VPN connectivity issues. Previous solutions: certificate renewal (March 1), DNS cache flush (March 8)."
Used when: Next time user mentions VPN, agent proactively says: "I see you've had VPN issues before. Last time, flushing the DNS cache resolved it. Would you like to try that first?"

IMPORTANT: Store summaries, not raw conversations. Raw messages contain too much noise and PII. Summarization extracts the useful signal: preferences, patterns, resolved issues, key decisions. This also dramatically reduces storage costs.
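
Retrieval from long-term memory combines similarity with the recency decay from the cheat sheet, score * exp(-lambda * days). A small sketch with made-up similarity scores and an assumed lambda; a real system would compute the similarities via a pgvector query:

```python
import math

def decayed_score(similarity: float, age_days: float, lam: float = 0.01) -> float:
    """Recency-weighted relevance: score * exp(-lambda * days)."""
    return similarity * math.exp(-lam * age_days)

def rank_memories(memories, lam=0.01, top_k=3):
    """memories: [(summary_text, cosine_similarity, age_days), ...]"""
    scored = [(decayed_score(sim, age, lam), text) for text, sim, age in memories]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

# A year-old memory loses to a fresh one even at higher raw similarity.
ranked = rank_memories([
    ("User prefers step-by-step instructions", 0.70, 5),
    ("VPN fixed by DNS cache flush",           0.90, 7),
    ("One-off printer question",               0.85, 300),
])
```
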

Privacy & Compliance

  • Per-user encryption for stored memories
  • Right-to-delete: purge a user's data from all three layers on request
  • Configurable retention per memory category (90 days to forever)

7 Scaling & ML

Scaling Strategies

ML Enhancements

8 Cheat Sheet

Conversation Memory — Key Numbers

  • 3 layers: Short-term (Redis 24hr), Working (Redis 1hr), Long-term (PG + pgvector)
  • 15M messages/day, 520 peak msg/sec
  • Redis: 24.8 GB, PG: 0.93 TB, Vectors: 2.85 TB
  • Context Window Manager: system → long-term → working → conversation → current
  • Summarize older conversation turns when exceeding 80% budget
  • Working memory: intermediate results for multi-step tasks
  • Long-term: store summaries, NOT raw conversations
  • Semantic search with memory decay: score * exp(-lambda * days)
  • Privacy: per-user encryption, right-to-delete, configurable retention
  • "You asked about VPN 3 times" — proactive pattern surfacing