Conversation Memory & Context System

"Design a memory system with three layers: short-term (current conversation), working (intermediate computation results), and long-term (user preferences, learned patterns across sessions)."

Table of Contents

  1. Requirements
  2. Back-of-Envelope Estimation
  3. High-Level Architecture
  4. Deep Dive 1: Short-Term Memory
  5. Deep Dive 2: Working Memory
  6. Deep Dive 3: Long-Term Memory
  7. Scaling & ML
  8. Cheat Sheet

1 Requirements

Functional Requirements

  • Store and retrieve ordered conversation messages within a session
  • Hold intermediate results and plan state during multi-step tasks
  • Persist user preferences and learned patterns across sessions
  • Assemble the optimal context for each LLM call from all three layers

Non-Functional Requirements

  • Sustain 15M messages/day with a 520 msg/sec peak
  • Low-latency context assembly on the hot path of every LLM call
  • Configurable retention (90 days to forever) per memory category
  • Privacy: per-user encryption and right-to-delete across all layers

2 Back-of-Envelope Estimation

Scale Numbers

  • 15M messages/day across all conversations
  • Peak: 520 messages/second
  • Redis (short-term + working): 24.8 GB
  • PostgreSQL (long-term structured): 0.93 TB
  • Vector storage (long-term semantic): 2.85 TB
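
The headline numbers above follow from simple arithmetic; a quick sanity check in Python (the ~3x peak-to-average ratio is an assumption, not stated elsewhere in the doc):

```python
# Back-of-envelope sanity check for the scale numbers above.
MESSAGES_PER_DAY = 15_000_000
SECONDS_PER_DAY = 86_400

avg_msg_per_sec = MESSAGES_PER_DAY / SECONDS_PER_DAY   # ~173.6 msg/s average
peak_msg_per_sec = avg_msg_per_sec * 3                 # assumed ~3x peak factor -> ~520 msg/s

# Storage totals: Redis holds short-term + working; long-term is structured + vectors.
redis_gb = 8 + 16.8            # short-term (~8 GB) + working (~16.8 GB) = 24.8 GB
long_term_tb = 0.93 + 2.85     # PostgreSQL rows + vector embeddings = ~3.78 TB
```
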

  Memory Layer  Storage        TTL                          Access Pattern
  ────────────  ─────────────  ───────────────────────────  ─────────────────────────────────────────────────
  Short-term    Redis          24 hours                     Sequential reads, append-only during conversation
  Working       Redis          1 hour                       Random reads/writes during task execution
  Long-term     PG + pgvector  Configurable (90d-forever)   Semantic search, structured queries

3 High-Level Architecture

  3-LAYER MEMORY ARCHITECTURE
  ═══════════════════════════════════════════════════════════════════

  ┌─────────────────────────────────────────────────────────────┐
  │                   CONTEXT WINDOW MANAGER                    │
  │  Assembles the optimal context for each LLM call:           │
  │  1. System prompt (fixed)                                   │
  │  2. Relevant long-term memories (semantic search)           │
  │  3. Working memory (current task state)                     │
  │  4. Recent conversation (last N turns)                      │
  │  5. Current user message                                    │
  └───────┬─────────────────┬───────────────────┬───────────────┘
          │                 │                   │
  ┌───────v──────┐  ┌───────v──────┐  ┌─────────v──────────┐
  │ SHORT-TERM   │  │ WORKING      │  │ LONG-TERM          │
  │              │  │              │  │                    │
  │ Redis        │  │ Redis        │  │ PostgreSQL         │
  │ TTL: 24hr    │  │ TTL: 1hr     │  │ + pgvector         │
  │              │  │              │  │                    │
  │ Conversation │  │ Intermediate │  │ User preferences   │
  │ messages     │  │ results      │  │ Interaction history│
  │ (ordered)    │  │ (key-value)  │  │ Learned patterns   │
  │              │  │              │  │ Semantic search    │
  │ ~8 GB        │  │ ~16.8 GB     │  │ ~3.78 TB           │
  └──────────────┘  └──────────────┘  └────────────────────┘

Context Window Manager — 5 Steps
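
The five assembly steps can be sketched as a single function. This is a minimal sketch: the 4-chars-per-token heuristic, the tag names, and the drop-oldest-turns-first policy are illustrative assumptions, not a prescribed implementation.

```python
def rough_tokens(text: str) -> int:
    return max(1, len(text) // 4)  # rough heuristic: ~4 chars per token

def assemble_context(system_prompt, memories, working_state, recent_turns,
                     user_message, budget=8000):
    """Build the LLM context in the fixed order: system prompt -> long-term
    memories -> working memory -> recent conversation -> current message."""
    fixed = [system_prompt, working_state, user_message] + memories
    spent = sum(rough_tokens(t) for t in fixed if t)

    # Fill the remaining budget with recent turns, dropping the oldest first.
    kept = []
    for turn in reversed(recent_turns):          # walk newest-first
        cost = rough_tokens(turn)
        if spent + cost > budget:
            break
        kept.append(turn)
        spent += cost
    kept.reverse()                               # restore chronological order

    context = [("system", system_prompt)]
    context += [("memory", m) for m in memories]
    if working_state:
        context.append(("task_state", working_state))
    context += [("turn", t) for t in kept]
    context.append(("user", user_message))
    return context
```

The fixed parts (steps 1, 3, 5, plus retrieved memories) are budgeted first; only the conversation tail is trimmed, which matches the summarization policy in Deep Dive 1.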

4 Deep Dive 1: Short-Term Memory

Redis Data Structure

  SHORT-TERM MEMORY (Redis)
  ═══════════════════════════════════════════

  Key: conv:{conversation_id}
  Type: Redis List (ordered messages)
  TTL: 24 hours

  LPUSH conv:abc123 {
    "role": "user",
    "content": "I need VPN access",
    "timestamp": "2026-03-15T10:00:00Z",
    "tokens": 8
  }

  LPUSH conv:abc123 {
    "role": "assistant",
    "content": "I'll help you with VPN access. Which office?",
    "timestamp": "2026-03-15T10:00:02Z",
    "tokens": 14,
    "tool_calls": []
  }

  LRANGE conv:abc123 0 -1  → Full history (newest first, since LPUSH prepends)
  LRANGE conv:abc123 0 9   → Last 10 messages
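
The list semantics above can be modeled with a small in-memory stand-in. A real deployment would use a Redis client (e.g. redis-py) against the `conv:{conversation_id}` keys; this class only mirrors the LPUSH/LRANGE behavior, and refreshing the TTL on every write is an assumption (plain Redis would need an explicit EXPIRE after each push).

```python
import json, time

class ShortTermMemory:
    """In-memory stand-in for the Redis list at conv:{conversation_id}."""
    TTL_SECONDS = 24 * 3600  # 24-hour TTL, as in the layer table

    def __init__(self):
        self._lists = {}  # key -> (expires_at, messages stored newest-first)

    def lpush(self, conv_id: str, message: dict):
        key = f"conv:{conv_id}"
        expires_at = time.time() + self.TTL_SECONDS   # refresh TTL on write
        _, msgs = self._lists.get(key, (None, []))
        msgs.insert(0, json.dumps(message))           # LPUSH prepends
        self._lists[key] = (expires_at, msgs)

    def lrange(self, conv_id: str, start: int, stop: int) -> list:
        key = f"conv:{conv_id}"
        entry = self._lists.get(key)
        if entry is None or entry[0] < time.time():   # expired or missing
            return []
        msgs = entry[1]
        end = len(msgs) if stop == -1 else stop + 1   # -1 means "to the end"
        return [json.loads(m) for m in msgs[start:end]]

mem = ShortTermMemory()
mem.lpush("abc123", {"role": "user", "content": "I need VPN access"})
mem.lpush("abc123", {"role": "assistant", "content": "Which office?"})
history = mem.lrange("abc123", 0, -1)  # newest first, like LPUSH + LRANGE
```
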

Context Window Summarization

When a conversation exceeds roughly 80% of the context token budget, older messages are summarized into a single compact turn; the most recent turns stay verbatim.
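
A minimal sketch of that trigger logic, assuming the 80% threshold from the cheat sheet; the `summarize` callback and the keep-6-recent-turns default are illustrative, standing in for an LLM summarization call:

```python
def maybe_compact(messages, budget_tokens, summarize, threshold=0.8, keep_recent=6):
    """Compress older messages into one summary turn when the conversation
    exceeds `threshold` of the token budget. `messages` is oldest-first,
    each with a precomputed "tokens" field like the stored records above."""
    total = sum(m["tokens"] for m in messages)
    if total <= threshold * budget_tokens or len(messages) <= keep_recent:
        return messages                          # still within budget
    old, recent = messages[:-keep_recent], messages[-keep_recent:]
    summary_text = summarize(old)                # e.g. one LLM call over old turns
    summary = {"role": "system",
               "content": f"[summary of earlier conversation] {summary_text}",
               "tokens": max(1, len(summary_text) // 4)}
    return [summary] + recent
```
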

5 Deep Dive 2: Working Memory

Purpose: Intermediate Results During Multi-Step Tasks

When an agent executes a multi-step workflow, it needs to remember intermediate results. Working memory stores these temporarily.

Example: "Who is my manager and do they have an open approval for me?"

Step 1: Look up user in HR system → Working memory stores: { "manager": "Sarah Chen", "manager_id": "SC-4521" }
Step 2: Query approval system with manager_id → Working memory stores: { "pending_approvals": [{ "id": "APR-892", "type": "VPN Access", "status": "pending" }] }
Step 3: Compose answer using both results from working memory: "Your manager is Sarah Chen. She has one pending approval for you: VPN Access request (APR-892)."
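
The three steps above can be sketched end-to-end with a dict standing in for the Redis hash; the HR and approval lookups are stubbed lambdas here, where a real agent would call Workday and the approval system:

```python
import json

def run_task(working: dict, lookup_manager, fetch_approvals) -> str:
    # Step 1: resolve the manager and cache the result in working memory.
    manager = lookup_manager()                      # e.g. Workday API call
    working["step_1_result"] = json.dumps(manager)
    # Step 2: use the cached manager_id; cache the approvals too.
    approvals = fetch_approvals(manager["manager_id"])
    working["step_2_result"] = json.dumps(approvals)
    # Step 3: compose the answer purely from working memory.
    m = json.loads(working["step_1_result"])
    a = json.loads(working["step_2_result"])
    pending = ", ".join(f'{p["type"]} ({p["id"]})' for p in a["pending_approvals"])
    return f'Your manager is {m["manager"]}. Pending approval(s) for you: {pending}.'

# The dict stands in for HSET/HGET on work:abc123:task-001 with a 1h TTL.
working_memory = {}
answer = run_task(
    working_memory,
    lookup_manager=lambda: {"manager": "Sarah Chen", "manager_id": "SC-4521"},
    fetch_approvals=lambda mid: {"pending_approvals": [
        {"id": "APR-892", "type": "VPN Access", "status": "pending"}]},
)
```
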

Redis Data Structure

  WORKING MEMORY (Redis Hash)
  ═══════════════════════════════════════════

  Key: work:{conversation_id}:{task_id}
  Type: Redis Hash
  TTL: 1 hour

  HSET work:abc123:task-001 "step_1_result" '{
    "manager": "Sarah Chen",
    "manager_id": "SC-4521",
    "source": "workday_api",
    "retrieved_at": "2026-03-15T10:00:01Z"
  }'

  HSET work:abc123:task-001 "step_2_result" '{
    "pending_approvals": [...],
    "source": "approval_system",
    "retrieved_at": "2026-03-15T10:00:03Z"
  }'

  HSET work:abc123:task-001 "plan" '{
    "steps": ["lookup_manager", "check_approvals", "compose_answer"],
    "current_step": 2,
    "status": "in_progress"
  }'

Re-Planning Support
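
The `plan` entry above is what makes re-planning possible: if a step fails, the agent can rewrite the remaining steps without discarding completed `step_N_result` entries. A sketch of that update, assuming 1-indexed `current_step` as in the hash above (the alternative step names are illustrative):

```python
import json

def replan_on_failure(working: dict, alternative_steps: list) -> dict:
    """Rewrite the remaining plan after a step fails, keeping the completed
    steps and their cached step_N_result entries intact."""
    plan = json.loads(working["plan"])
    done = plan["steps"][:plan["current_step"] - 1]   # steps already completed
    plan["steps"] = done + alternative_steps          # swap in the new tail
    plan["current_step"] = len(done) + 1              # resume at first new step
    plan["status"] = "replanning"
    working["plan"] = json.dumps(plan)                # HSET work:...:task "plan"
    return plan

# Example: step 2 (check_approvals) failed; try an alternative route.
working = {"plan": json.dumps({
    "steps": ["lookup_manager", "check_approvals", "compose_answer"],
    "current_step": 2, "status": "in_progress"})}
new_plan = replan_on_failure(working, ["check_approvals_via_email", "compose_answer"])
```
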

6 Deep Dive 3: Long-Term Memory

Dual Storage: Structured + Semantic

Long-term memory splits across PostgreSQL for structured records (preferences and interaction history queried by exact fields) and pgvector embeddings for semantic search over stored memory summaries.

What Gets Stored Long-Term

  • User preferences
  • Interaction history (as summaries, not raw transcripts)
  • Learned patterns (recurring issues and the fixes that resolved them)

Example: Pattern Detection

Observation: User asked about VPN 3 times in the last month (March 1, March 8, March 14).
Stored as: "User has recurring VPN connectivity issues. Previous solutions: certificate renewal (March 1), DNS cache flush (March 8)."
Used when: Next time user mentions VPN, agent proactively says: "I see you've had VPN issues before. Last time, flushing the DNS cache resolved it. Would you like to try that first?"

IMPORTANT: Store summaries, not raw conversations. Raw messages contain too much noise and PII. Summarization extracts the useful signal: preferences, patterns, resolved issues, key decisions. This also dramatically reduces storage costs.
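
Retrieval from long-term memory combines similarity with the recency decay from the cheat sheet, score * exp(-lambda * days). A small sketch with made-up similarity scores and an assumed lambda; a real system would compute the similarities via a pgvector query:

```python
import math

def decayed_score(similarity: float, age_days: float, lam: float = 0.01) -> float:
    """Recency-weighted relevance: score * exp(-lambda * days)."""
    return similarity * math.exp(-lam * age_days)

def rank_memories(memories, lam=0.01, top_k=3):
    """memories: [(summary_text, cosine_similarity, age_days), ...]"""
    scored = [(decayed_score(sim, age, lam), text) for text, sim, age in memories]
    scored.sort(reverse=True)
    return [text for _, text in scored[:top_k]]

# A year-old memory loses to a fresh one even at higher raw similarity.
ranked = rank_memories([
    ("User prefers step-by-step instructions", 0.70, 5),
    ("VPN fixed by DNS cache flush",           0.90, 7),
    ("One-off printer question",               0.85, 300),
])
```
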

Privacy & Compliance

  • Per-user encryption for stored memories
  • Right-to-delete: purge a user's data from all three layers on request
  • Configurable retention per memory category (90 days to forever)

7 Scaling & ML

Scaling Strategies

ML Enhancements

8 Cheat Sheet

Conversation Memory — Key Numbers

  • 3 layers: Short-term (Redis 24hr), Working (Redis 1hr), Long-term (PG + pgvector)
  • 15M messages/day, 520 peak msg/sec
  • Redis: 24.8 GB, PG: 0.93 TB, Vectors: 2.85 TB
  • Context Window Manager: system → long-term → working → conversation → current
  • Summarize older conversation turns when exceeding 80% budget
  • Working memory: intermediate results for multi-step tasks
  • Long-term: store summaries, NOT raw conversations
  • Semantic search with memory decay: score * exp(-lambda * days)
  • Privacy: per-user encryption, right-to-delete, configurable retention
  • "You asked about VPN 3 times" — proactive pattern surfacing