๐Ÿš€ The Visual Guide to System Design

Master System Design
One Concept at a Time

Visual explanations of distributed systems, scalability patterns, and real-world architectures. Everything you need to ace your system design interview.

25+ Core Concepts
10+ Real Designs
50+ Diagrams
๐Ÿ“ Foundation

System Design Fundamentals

Core concepts every engineer must understand before designing large-scale systems.

๐Ÿ“ˆ

Scalability

How systems handle growing amounts of work by adding resources.

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚         VERTICAL SCALING (Scale Up)      โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”          โ”‚
โ”‚  โ”‚ 4 CPUโ”‚ โ”€โ”€โ†’ โ”‚  32 CPU      โ”‚          โ”‚
โ”‚  โ”‚ 8 GB โ”‚     โ”‚  256 GB RAM  โ”‚          โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜          โ”‚
โ”‚  Simpler but has hardware limits        โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚        HORIZONTAL SCALING (Scale Out)    โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚Serverโ”‚ โ”‚Serverโ”‚ โ”‚Serverโ”‚ โ”‚Serverโ”‚   โ”‚
โ”‚  โ”‚  1   โ”‚ โ”‚  2   โ”‚ โ”‚  3   โ”‚ โ”‚  N   โ”‚   โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜   โ”‚
โ”‚  Complex but virtually unlimited        โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Concepts

  • Vertical Scaling: Add more power (CPU, RAM) to existing machine. Simple but limited.
  • Horizontal Scaling: Add more machines. Complex (need load balancing, data sync) but unlimited.
  • Elasticity: Auto-scale based on demand (AWS Auto Scaling, K8s HPA).
  • Rule of thumb: Design for 10x current load, plan for 100x.
Core
โฑ๏ธ

Latency vs Throughput

Understanding time-per-request vs requests-per-second trade-offs.

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          LATENCY NUMBERS (2024)          โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚  L1 cache ref .............. 1 ns       โ”‚
โ”‚  L2 cache ref .............. 4 ns       โ”‚
โ”‚  Main memory ref .......... 100 ns      โ”‚
โ”‚  SSD random read .......... 16 ฮผs       โ”‚
โ”‚  HDD random read .......... 2 ms        โ”‚
โ”‚  Same datacenter RTT ...... 0.5 ms      โ”‚
โ”‚  Cross-continent RTT ...... 150 ms      โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚  THROUGHPUT = Requests / Second          โ”‚
โ”‚                                          โ”‚
โ”‚  Single server:    ~1K-10K RPS          โ”‚
โ”‚  With caching:     ~50K-100K RPS        โ”‚
โ”‚  CDN edge:         ~1M+ RPS            โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Concepts

  • Latency: Time to complete one operation. Measured in ms/ฮผs.
  • Throughput: Number of operations per unit time (QPS/RPS).
  • P99 Latency: 99th percentile โ€” worst 1% of requests. More important than average.
  • Trade-off: Batching increases throughput but adds latency.
Core
๐Ÿ”บ

CAP Theorem

In a distributed system, you can only guarantee two of three properties.

            Consistency (C)
                 โ•ฑโ•ฒ
                โ•ฑ  โ•ฒ
               โ•ฑ    โ•ฒ
          CA   ╱  Pick ╲   CP
             โ•ฑ   Two!   โ•ฒ
            โ•ฑ            โ•ฒ
           โ•ฑโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ•ฒ
   Availability (A) โ”€โ”€ Partition
                        Tolerance (P)
                   AP

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚    CP    โ”‚ MongoDB, HBase, Redis    โ”‚
โ”‚          โ”‚ Consistent but may be    โ”‚
โ”‚          โ”‚ unavailable during split โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚    AP    โ”‚ Cassandra, DynamoDB,     โ”‚
โ”‚          โ”‚ CouchDB โ€” Always         โ”‚
โ”‚          โ”‚ available, eventually    โ”‚
โ”‚          โ”‚ consistent               โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚    CA    โ”‚ Traditional RDBMS        โ”‚
โ”‚          โ”‚ (single node only)       โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Concepts

  • Consistency: Every read receives the most recent write.
  • Availability: Every request receives a response (no errors).
  • Partition Tolerance: System works despite network failures between nodes.
  • Reality: Network partitions DO happen, so you really choose between CP and AP.
Core
๐Ÿ”„

Consistency Patterns

Strong, eventual, and causal consistency โ€” when to use which.

STRONG CONSISTENCY (Linearizable):
  Client โ”€โ”€write(x=5)โ”€โ”€โ†’ DB โ”€โ”€โ†’ All replicas
  Client โ”€โ”€read(x)โ”€โ”€โ”€โ”€โ†’ DB โ”€โ”€โ†’ Always returns 5
  โœ… Banking, inventory  โš ๏ธ High latency

EVENTUAL CONSISTENCY:
  Client โ”€โ”€write(x=5)โ”€โ”€โ†’ Primary
  Client โ”€โ”€read(x)โ”€โ”€โ”€โ”€โ†’ Replica โ”€โ”€โ†’ May return old value
  Eventually all replicas converge
  โœ… Social feeds, DNS  โšก Low latency

CAUSAL CONSISTENCY:
  If A causes B, everyone sees A before B
  But concurrent events can be in any order
  โœ… Chat apps, collaborative editing

When to Use

  • Strong: Financial transactions, inventory counts, booking systems.
  • Eventual: Social media feeds, analytics, DNS updates.
  • Causal: Chat messages, document collaboration.
  • Read-your-writes: User always sees their own updates immediately.
Core
๐Ÿงฎ

Back-of-Envelope Estimation

Quick calculations to estimate system capacity and requirements.

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚         POWER OF TWO CHEAT SHEET        โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
│  2^10 = 1 Thousand  (1 KB)              │
โ”‚  2^20 = 1 Million   (1 MB)              โ”‚
โ”‚  2^30 = 1 Billion   (1 GB)              โ”‚
โ”‚  2^40 = 1 Trillion  (1 TB)              โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚         TIME CONVERSIONS                โ”‚
โ”‚  1 day  = 86,400 sec โ‰ˆ 100K sec        โ”‚
โ”‚  1 month = 2.6M sec  โ‰ˆ 2.5M sec        โ”‚
โ”‚  1 year  = 31.5M sec โ‰ˆ 30M sec         โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚         COMMON ESTIMATES                โ”‚
โ”‚  Daily active users โ†’ QPS:              โ”‚
โ”‚  QPS = DAU ร— queries/day / 86400       โ”‚
โ”‚  Peak QPS = QPS ร— 2-3                  โ”‚
โ”‚                                          โ”‚
โ”‚  Storage:                               โ”‚
โ”‚  = users ร— data_per_user ร— retention   โ”‚
โ”‚                                          โ”‚
โ”‚  Bandwidth:                             โ”‚
โ”‚  = QPS ร— avg_request_size              โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Estimation Framework

  • Step 1: Estimate DAU from total users (typically 10-30% DAU ratio).
  • Step 2: QPS = DAU ร— avg actions / 86,400 seconds.
  • Step 3: Storage = QPS ร— data per request ร— retention period.
  • Step 4: Peak QPS = Average QPS ร— 2-3x for burst traffic.
Interview
๐Ÿ›ก๏ธ

Availability & Reliability

Nines of availability and how to achieve high uptime.

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Level   โ”‚  Uptime %  โ”‚  Downtime/yr  โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ 2 nines  โ”‚  99%       โ”‚  3.65 days    โ”‚
โ”‚ 3 nines  โ”‚  99.9%     โ”‚  8.77 hours   โ”‚
โ”‚ 4 nines  โ”‚  99.99%    โ”‚  52.6 min     โ”‚
โ”‚ 5 nines  โ”‚  99.999%   โ”‚  5.26 min     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

ACHIEVING HIGH AVAILABILITY:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚  Redundancy    โ†’ No single point of     โ”‚
โ”‚                  failure (SPOF)          โ”‚
โ”‚  Replication   โ†’ Data across regions    โ”‚
โ”‚  Failover      โ†’ Auto switch on failure โ”‚
โ”‚  Health Checks โ†’ Detect failures fast   โ”‚
โ”‚  Graceful      โ†’ Degrade, don't crash   โ”‚
โ”‚  Degradation                             โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Strategies

  • Active-Passive: Standby takes over on primary failure (simple, slight downtime).
  • Active-Active: All nodes serve traffic (no downtime, complex sync).
  • SLA: Service Level Agreement โ€” contractual uptime guarantee.
  • MTTR: Mean Time To Recovery โ€” how fast you can fix failures.
Core
๐Ÿงฑ Components

Building Blocks of System Design

The essential components used to build any large-scale distributed system.

โš–๏ธ

Load Balancer

Distributes incoming traffic across multiple servers for reliability and performance.

         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ”‚  Clients โ”‚
         โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”˜
              โ”‚
       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”
       โ”‚    LOAD      โ”‚
       โ”‚  BALANCER    โ”‚
       โ”‚  (L4 / L7)  โ”‚
       โ””โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”ฌโ”€โ”€โ”˜
          โ”‚   โ”‚   โ”‚
     โ”Œโ”€โ”€โ”€โ”€โ–ผโ” โ”Œโ–ผโ”€โ”€โ” โ”Œโ–ผโ”€โ”€โ”€โ”€โ”
     โ”‚ Srv โ”‚ โ”‚Srvโ”‚ โ”‚ Srv โ”‚
     โ”‚  1  โ”‚ โ”‚ 2 โ”‚ โ”‚  3  โ”‚
     โ””โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”˜

ALGORITHMS:
  Round Robin โ”€โ”€โ”€ Equal distribution
  Weighted RR โ”€โ”€โ”€ Based on server capacity
  Least Conn โ”€โ”€โ”€ Route to least busy
  IP Hash โ”€โ”€โ”€โ”€โ”€โ”€ Session affinity
  Consistent โ”€โ”€โ”€ Minimize remapping
    Hash           on server changes

Key Concepts

  • L4 (Transport): Routes based on IP/port. Fast, no content inspection. (HAProxy, NLB)
  • L7 (Application): Routes based on URL, headers, cookies. Smart but slower. (Nginx, ALB)
  • Health Checks: Periodically ping servers, remove unhealthy ones from pool.
  • Global LB: DNS-based, routes to nearest datacenter (Cloudflare, Route53).
Infrastructure
๐Ÿ’พ

Caching

Store frequently accessed data in fast storage to reduce latency and database load.

CACHE-ASIDE (Lazy Loading):
  App โ”€โ”€readโ”€โ”€โ†’ Cache โ”€โ”€missโ”€โ”€โ†’ DB
                  โ”‚               โ”‚
                  โ”‚    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                  โ”‚    โ”‚ load into cache
                  โ—€โ”€โ”€โ”€โ”€โ”˜

WRITE-THROUGH:
  App โ”€โ”€writeโ”€โ”€โ†’ Cache โ”€โ”€writeโ”€โ”€โ†’ DB
  (Consistent but higher write latency)

WRITE-BEHIND (Write-Back):
  App โ”€โ”€writeโ”€โ”€โ†’ Cache โ”€โ”€asyncโ”€โ”€โ†’ DB
  (Fast writes but risk of data loss)

CACHE HIERARCHY:
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ” < 1ms  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚ L1: App โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’โ”‚Browserโ”‚
  โ”‚  Memory โ”‚         โ”‚ Cache โ”‚
  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
       โ”‚ 1-5ms
  โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚L2: Redisโ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’โ”‚  CDN  โ”‚
  โ”‚Memcachedโ”‚         โ”‚ Edge  โ”‚
  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
       โ”‚ 5-50ms
  โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”
  โ”‚L3: DB   โ”‚
  โ”‚  Query  โ”‚
  โ”‚  Cache  โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Concepts

  • TTL: Time-to-live โ€” auto-expire stale data. Balance freshness vs hit rate.
  • Eviction: LRU (Least Recently Used), LFU (Least Frequently Used), FIFO.
  • Cache Stampede: Many requests hit DB when cache expires. Fix: lock or stagger TTL.
  • Invalidation: Hardest problem! Event-driven invalidation > TTL-based.
Performance
๐ŸŒ

Content Delivery Network (CDN)

Geographically distributed servers that cache content close to end users.

Without CDN:
  User (Tokyo) โ”€โ”€โ”€โ”€โ”€โ”€โ”€500msโ”€โ”€โ”€โ”€โ”€โ”€โ†’ Origin (US)

With CDN:
  User (Tokyo) โ”€โ”€20msโ”€โ”€โ†’ CDN Edge (Tokyo)
                              โ”‚ cache miss
                              โ–ผ
                         Origin (US)
                              โ”‚
                    CDN caches response
                              โ”‚
  Next User โ”€โ”€โ”€โ”€20msโ”€โ”€โ”€โ†’ CDN Edge โœ“ (cached!)

PUSH CDN vs PULL CDN:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚   PUSH   โ”‚ You upload to CDN upfront    โ”‚
│          │ For static, known content    │
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚   PULL   โ”‚ CDN fetches on first request โ”‚
โ”‚          โ”‚ Best for dynamic content     โ”‚
โ”‚          โ”‚ Risk: slow first request     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Concepts

  • Edge Locations: 200+ PoPs worldwide (Cloudflare, CloudFront, Akamai).
  • Static Content: Images, CSS, JS, videos โ€” perfect for CDN.
  • Dynamic Content: API responses with short TTL or edge computing.
  • Cache Invalidation: Purge by URL, tag, or wildcard when content updates.
Infrastructure
๐Ÿ—„๏ธ

Database Selection

Choosing the right database for your use case โ€” SQL vs NoSQL and beyond.

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          DATABASE DECISION TREE           โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚                                           โ”‚
โ”‚  Need ACID transactions?                 โ”‚
โ”‚    YES โ†’ SQL (PostgreSQL, MySQL)         โ”‚
โ”‚    NO  โ†“                                 โ”‚
โ”‚                                           โ”‚
โ”‚  Need flexible schema?                   โ”‚
โ”‚    YES โ†’ Document DB (MongoDB)           โ”‚
โ”‚    NO  โ†“                                 โ”‚
โ”‚                                           โ”‚
โ”‚  Key-value lookups at massive scale?     โ”‚
โ”‚    YES โ†’ DynamoDB, Redis, Cassandra      โ”‚
โ”‚    NO  โ†“                                 โ”‚
โ”‚                                           โ”‚
โ”‚  Complex relationships / graph queries?  โ”‚
โ”‚    YES โ†’ Neo4j, Neptune                  โ”‚
โ”‚    NO  โ†“                                 โ”‚
โ”‚                                           โ”‚
โ”‚  Time-series / IoT data?                 โ”‚
โ”‚    YES โ†’ InfluxDB, TimescaleDB           โ”‚
โ”‚    NO  โ†“                                 โ”‚
โ”‚                                           โ”‚
โ”‚  Full-text search?                       โ”‚
โ”‚    YES โ†’ Elasticsearch, OpenSearch       โ”‚
โ”‚                                           โ”‚
โ”‚  Vector similarity (AI/ML)?              โ”‚
โ”‚    YES โ†’ Pinecone, pgvector, Milvus     โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Concepts

  • SQL: Strong consistency, ACID, complex queries. Scale with read replicas + sharding.
  • NoSQL: Flexible schema, horizontal scaling, eventual consistency. Types: document, key-value, wide-column, graph.
  • NewSQL: Best of both โ€” CockroachDB, Spanner, TiDB. Distributed SQL.
  • Polyglot Persistence: Use different DBs for different parts of the system.
Data
๐Ÿ“Š

Database Sharding & Replication

Scaling databases horizontally through partitioning and replication strategies.

REPLICATION (Read Scaling):
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   sync/async   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚Primary โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’  โ”‚ Replica โ”‚ โ† reads
  โ”‚(writes)โ”‚โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’  โ”‚ Replica โ”‚ โ† reads
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                 โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

SHARDING (Write Scaling):
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚        Shard Router            โ”‚
  โ””โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜
      โ”‚        โ”‚        โ”‚
  โ”Œโ”€โ”€โ”€โ–ผโ”€โ”€โ” โ”Œโ”€โ”€โ–ผโ”€โ”€โ”€โ” โ”Œโ”€โ”€โ–ผโ”€โ”€โ”€โ”
  โ”‚Shard 1โ”‚ โ”‚Shard 2โ”‚ โ”‚Shard 3โ”‚
  โ”‚ A-H   โ”‚ โ”‚ I-P   โ”‚ โ”‚ Q-Z   โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

SHARDING STRATEGIES:
  Range-based  โ†’ user_id 1-1M, 1M-2M
  Hash-based   โ†’ hash(user_id) % N
  Geo-based    โ†’ by region/country
  Directory    โ†’ lookup table maps keyโ†’shard

Key Concepts

  • Hotspot: One shard gets disproportionate traffic. Fix: better partition key or consistent hashing.
  • Cross-shard queries: Expensive! Design schema to avoid them.
  • Rebalancing: Adding/removing shards. Consistent hashing minimizes data movement.
  • Vitess: MySQL sharding middleware (used by YouTube, Slack).
Data
๐Ÿ“จ

Message Queues & Event Streaming

Decouple services with asynchronous communication for reliability and scalability.

MESSAGE QUEUE (Point-to-Point):
  Producer ──→ ┌────────────┐ ──→ Consumer
               │   Queue    │
  Producer ──→ │ m3  m2  m1 │ ──→ Consumer
               └────────────┘

EVENT STREAMING (Pub/Sub):
  Publisher ──→ ┌──────────────┐
                │    Topic     │ ──→ Consumer Group A
                │  e3  e2  e1  │ ──→ Consumer Group B
                └──────────────┘ ──→ Consumer Group C

WHEN TO USE WHAT:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ RabbitMQ    โ”‚ Task queues, RPC, routingโ”‚
โ”‚ Kafka       โ”‚ Event streaming, logs,   โ”‚
โ”‚             โ”‚ CDC, high throughput     โ”‚
โ”‚ SQS         โ”‚ Simple cloud queues      โ”‚
โ”‚ Redis Pub/  โ”‚ Real-time notifications  โ”‚
โ”‚   Sub       โ”‚ (no persistence)         โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Concepts

  • At-least-once: Messages may be delivered multiple times. Consumers must be idempotent.
  • Exactly-once: Hard to achieve. Kafka supports with transactions.
  • Dead Letter Queue: Failed messages go to DLQ for investigation.
  • Backpressure: Slow consumer? Queue grows. Solution: scale consumers or rate limit producers.
Infrastructure
๐Ÿ”Œ

API Design

REST, GraphQL, gRPC โ€” choosing the right API paradigm for your system.

โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚          โ”‚   REST   โ”‚ GraphQL  โ”‚  gRPC    โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Protocol โ”‚  HTTP    โ”‚  HTTP    โ”‚ HTTP/2   โ”‚
โ”‚ Format   โ”‚  JSON    โ”‚  JSON    โ”‚ Protobuf โ”‚
โ”‚ Contract โ”‚  Loose   โ”‚  Schema  โ”‚ Strict   โ”‚
โ”‚ Caching  โ”‚  Easy    โ”‚  Hard    โ”‚  Hard    โ”‚
โ”‚ Learning โ”‚  Easy    โ”‚  Medium  โ”‚  Hard    โ”‚
โ”‚ Speed    โ”‚  Good    โ”‚  Good    โ”‚  Fastest โ”‚
โ”œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ค
โ”‚ Best For โ”‚ Public   โ”‚ Flexible โ”‚ Internal โ”‚
โ”‚          โ”‚ APIs,    โ”‚ frontend โ”‚ micro-   โ”‚
โ”‚          โ”‚ CRUD     โ”‚ queries  โ”‚ services โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

REST Example:
  GET  /api/users/123
  POST /api/users     { "name": "Alex" }
  PUT  /api/users/123 { "name": "Alex B" }

gRPC: typically 5-10x faster than REST/JSON for service-to-service calls (binary Protobuf, HTTP/2 multiplexing)

Key Concepts

  • Versioning: URL (/v2/users) or header (Accept: v2). Never break clients.
  • Rate Limiting: Token bucket or sliding window. Return 429 when exceeded.
  • Pagination: Cursor-based > offset-based for large datasets.
  • Idempotency: PUT/DELETE should be idempotent. Use idempotency keys for POST.
Design
๐Ÿ”—

Consistent Hashing

Distribute data across nodes with minimal redistribution when nodes change.

HASH RING:
              Node A
               โ•ฑโ•ฒ
              โ•ฑ  โ•ฒ
         โ—‹  โ•ฑ    โ•ฒ  โ—‹ โ† keys
           โ•ฑ      โ•ฒ
    Node D โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ Node B
           โ•ฒ      โ•ฑ
         โ—‹  โ•ฒ    โ•ฑ  โ—‹
              โ•ฒ  โ•ฑ
               โ•ฒโ•ฑ
             Node C

Traditional Hashing:
  hash(key) % N โ†’ Add/remove node =
  ALL keys remapped! โŒ

Consistent Hashing:
  Keys mapped to ring position
  Only K/N keys remapped when
  nodes change โœ…

VIRTUAL NODES:
  Each physical node โ†’ 100-200 virtual nodes
  Ensures even distribution
  Node A: A-1, A-2, ... A-150 on ring

Key Concepts

  • Problem: hash(key) % N breaks when N changes โ€” all data reshuffled.
  • Solution: Hash ring where only neighbors are affected by node changes.
  • Virtual Nodes: Each server gets multiple points on ring for better balance.
  • Used by: DynamoDB, Cassandra, Memcached, CDN routing.
Algorithm
๐Ÿ”€

Proxy & API Gateway

Forward proxy, reverse proxy, and API gateway patterns for routing and security.

FORWARD PROXY (client-side):
  Client โ”€โ”€โ†’ Proxy โ”€โ”€โ†’ Internet โ”€โ”€โ†’ Server
  (VPN, anonymity, content filtering)

REVERSE PROXY (server-side):
  Client โ”€โ”€โ†’ Reverse โ”€โ”€โ†’ Server 1
              Proxy  โ”€โ”€โ†’ Server 2
                     โ”€โ”€โ†’ Server 3
  (Load balancing, SSL, caching, security)

API GATEWAY:
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚         API GATEWAY             โ”‚
  โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
  โ”‚  โ”‚Auth โ”‚Rate  โ”‚Route โ”‚Trans-โ”‚  โ”‚
  โ”‚  โ”‚     โ”‚Limit โ”‚ing   โ”‚form  โ”‚  โ”‚
  โ”‚  โ””โ”€โ”€โ”ฌโ”€โ”€โ”ดโ”€โ”€โ”ฌโ”€โ”€โ”€โ”ดโ”€โ”€โ”ฌโ”€โ”€โ”€โ”ดโ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜  โ”‚
  โ”‚     โ”‚     โ”‚      โ”‚      โ”‚      โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”˜
        โ–ผ     โ–ผ      โ–ผ      โ–ผ
     User  Order  Payment  Notif
     Svc    Svc    Svc     Svc

Key Concepts

  • Nginx: Most popular reverse proxy. Also serves static files, SSL termination.
  • API Gateway: Kong, AWS API Gateway โ€” auth, rate limiting, monitoring in one place.
  • Service Mesh: Istio/Envoy โ€” proxy sidecar per service for observability.
  • BFF: Backend-for-Frontend โ€” separate gateway per client type (web, mobile).
Infrastructure
๐Ÿ—๏ธ Architecture

Design Patterns & Architectures

Battle-tested patterns for building reliable, scalable distributed systems.

๐Ÿ”ฒ

Microservices Architecture

Breaking a monolith into independently deployable services.

MONOLITH:                  MICROSERVICES:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”
โ”‚   All Code   โ”‚     โ”‚User โ”‚ โ”‚Orderโ”‚ โ”‚Pay  โ”‚
โ”‚   One Deploy โ”‚ โ”€โ”€โ†’ โ”‚ Svc โ”‚ โ”‚ Svc โ”‚ โ”‚ Svc โ”‚
โ”‚   One DB     โ”‚     โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜ โ””โ”€โ”€โ”ฌโ”€โ”€โ”˜
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜        โ”‚      โ”‚      โ”‚
                     โ”Œโ”€โ”€โ–ผโ”€โ”€โ”โ”Œโ”€โ”€โ–ผโ”€โ”€โ”โ”Œโ”€โ”€โ–ผโ”€โ”€โ”
                     โ”‚Usersโ”‚โ”‚Orderโ”‚โ”‚Pay  โ”‚
                     โ”‚ DB  โ”‚โ”‚ DB  โ”‚โ”‚ DB  โ”‚
                     โ””โ”€โ”€โ”€โ”€โ”€โ”˜โ””โ”€โ”€โ”€โ”€โ”€โ”˜โ””โ”€โ”€โ”€โ”€โ”€โ”˜

COMMUNICATION:
  Synchronous:  REST / gRPC (request-response)
  Asynchronous: Kafka / RabbitMQ (events)

SERVICE DISCOVERY:
  Client โ”€โ”€โ†’ Service Registry โ”€โ”€โ†’ Instance
  (Consul, Eureka, K8s DNS)

Key Principles

  • Single Responsibility: Each service owns one business capability.
  • Database per Service: No shared databases โ€” communicate via APIs/events.
  • Independent Deploy: Change one service without affecting others.
  • Trade-off: Operational complexity (networking, monitoring, debugging) increases significantly.
Architecture
โšก

Event-Driven Architecture

Services communicate through events for loose coupling and real-time processing.

EVENT SOURCING:
  Commands โ”€โ”€โ†’ Event Store โ”€โ”€โ†’ Projections
                  โ”‚
  OrderCreated    โ”‚โ”€โ”€โ†’ Read Model (SQL)
  ItemAdded       โ”‚โ”€โ”€โ†’ Analytics
  OrderPaid       โ”‚โ”€โ”€โ†’ Notifications
  OrderShipped    โ”‚โ”€โ”€โ†’ Search Index

CQRS (Command Query Responsibility Segregation):
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚ COMMAND โ”‚โ”€โ”€writeโ”€โ”€โ†’โ”‚ Write Model โ”‚
  โ”‚  API    โ”‚         โ”‚ (Event Store)โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                             โ”‚ events
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚  QUERY  โ”‚โ—€โ”€โ”€readโ”€โ”€โ”‚ Read Model  โ”‚
  โ”‚  API    โ”‚         โ”‚(Materialized)โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜         โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

SAGA PATTERN (Distributed Transactions):
  Order โ†’ Payment โ†’ Inventory โ†’ Shipping
    โ”‚        โ”‚          โ”‚
    โ—€โ”€โ”€ Compensate โ—€โ”€โ”€ Rollback (on failure)

Key Concepts

  • Event Sourcing: Store events, not state. Replay to rebuild any point in time.
  • CQRS: Separate read and write models. Optimize each independently.
  • Saga: Coordinate distributed transactions with compensating actions.
  • Choreography vs Orchestration: Events react (loose) vs central coordinator (controlled).
Architecture
๐Ÿšฆ

Rate Limiting

Protect your system from abuse and ensure fair usage across clients.

TOKEN BUCKET:
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚  Bucket (max: 10)    โ”‚
  โ”‚  โ—โ—โ—โ—โ—โ—โ—โ—‹โ—‹โ—‹          โ”‚ โ† refill 1/sec
  โ”‚                      โ”‚
  โ”‚  Request arrives:    โ”‚
  โ”‚    Token available?  โ”‚
  โ”‚    YES โ†’ Process โœ…  โ”‚
  โ”‚    NO  โ†’ Reject 429 โŒโ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

SLIDING WINDOW LOG:
  |โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€ 1 minute window โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€|
  |  req  req  req  req  req  req   |
  |  โ†‘                          โ†‘   |
  timestamp              timestamp  |
  Count requests in window.
  Reject if count > limit.

DISTRIBUTED RATE LIMITING:
  Client โ”€โ”€โ†’ API Gateway โ”€โ”€โ†’ Redis
              โ”‚                โ”‚
              โ”‚  INCR key     โ”‚
              โ”‚  EXPIRE 60s   โ”‚
              โ”‚  if count > N โ”‚
              โ”‚    โ†’ 429      โ”‚

Algorithms

  • Token Bucket: Allows bursts, smooth rate. Used by AWS, Stripe.
  • Sliding Window: Precise counting, memory intensive for high traffic.
  • Fixed Window: Simple but allows 2x burst at window boundaries.
  • Leaky Bucket: Processes at fixed rate, queues excess. Good for smoothing.
Reliability
๐Ÿ”Œ

Circuit Breaker & Resilience

Prevent cascading failures across services with fault tolerance patterns.

CIRCUIT BREAKER STATES:
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  failures > threshold  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚ CLOSED โ”‚ โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ†’ โ”‚ OPEN โ”‚
  โ”‚(normal)โ”‚                        โ”‚(fail โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                        โ”‚fast) โ”‚
       โ–ฒ                            โ””โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜
       โ”‚                               โ”‚
       โ”‚    success                    โ”‚ timeout
       โ”‚                               โ”‚
  โ”Œโ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”                    โ”Œโ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚  CLOSED  โ”‚ โ—€โ”€โ”€ success โ”€โ”€โ”€โ”€โ”€ โ”‚HALF-OPEN โ”‚
  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜                    โ”‚(test one)โ”‚
                   failure โ”€โ”€โ†’    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
                   back to OPEN

RESILIENCE PATTERNS:
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚ Retry          โ”‚ Exp backoff + jitter โ”‚
โ”‚ Timeout        โ”‚ Don't wait forever   โ”‚
โ”‚ Bulkhead       โ”‚ Isolate failures     โ”‚
โ”‚ Fallback       โ”‚ Cached/default value โ”‚
โ”‚ Circuit Breakerโ”‚ Fail fast on errors  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ดโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Concepts

  • Cascading Failure: Service A โ†’ B โ†’ C. If C is slow, A and B pile up and crash.
  • Bulkhead: Isolate thread pools per dependency. One failure doesn't exhaust all resources.
  • Retry with Jitter: Exponential backoff + random jitter prevents thundering herd.
  • Libraries: Resilience4j (Java), Polly (.NET), Hystrix (deprecated).
Reliability
๐ŸŒ

Distributed Consensus

How distributed nodes agree on state โ€” Paxos, Raft, and leader election.

RAFT CONSENSUS:
  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”   vote request   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
  โ”‚Followerโ”‚ โ—€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€  โ”‚Candidateโ”‚
  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜                  โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”˜
       โ”‚                           โ”‚
       โ”‚   majority votes          โ”‚
       โ”‚                      โ”Œโ”€โ”€โ”€โ”€โ–ผโ”€โ”€โ”€โ”
       โ”‚  โ—€โ”€โ”€ heartbeat โ”€โ”€โ”€โ”€โ”€ โ”‚ Leader โ”‚
       โ”‚                      โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
       โ”‚
  Leader sends log entries to followers
  Committed when majority acknowledge

LEADER ELECTION:
  1. Follower timeout โ†’ becomes Candidate
  2. Requests votes from all nodes
  3. Majority votes โ†’ becomes Leader
  4. Sends heartbeats to maintain leadership

SPLIT BRAIN:
  Network split โ†’ two leaders!
  Solution: Quorum (majority) required
  5 nodes โ†’ need 3 to agree (survives 2 failures)

Key Concepts

  • Raft: Understandable consensus. Used in etcd, CockroachDB, TiKV.
  • Quorum: Majority agreement. W + R > N for strong consistency.
  • Vector Clocks: Track causality across distributed nodes.
  • Gossip Protocol: Nodes share state via random peer communication (Cassandra).
Distributed
๐ŸŽฏ

Probabilistic Data Structures

Space-efficient structures for approximate membership, counting, and cardinality.

BLOOM FILTER:
  "Is X in the set?"
  ┌─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┬─┐
  │0│1│0│1│0│0│1│0│1│0│0│1│  bit array
  └─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┴─┘
    ↑   ↑       ↑       ↑
    h1  h2      h3      h1 (hash functions)

  "Maybe in set" → could be false positive
  "Definitely not" → NEVER false negative

  Use case: Check if a username is taken
  before hitting the database. 10M users in 12MB!

COUNT-MIN SKETCH:
  "Approximately how many times did X occur?"
  Used for: top-K, frequency estimation

HYPERLOGLOG:
  "How many unique items?" (cardinality)
  Count 1 BILLION unique items in 12 KB!
  Used by: Redis PFCOUNT, analytics

Key Concepts

  • Bloom Filter: Fast membership test. Used in databases (skip disk reads), CDNs, spam filters.
  • HyperLogLog: Count unique visitors with 0.81% error in 12KB. Redis built-in.
  • Count-Min Sketch: Frequency estimation in streaming data. Overestimates, never under.
  • Trade-off: Accuracy for space/time. Perfect when approximate is good enough.
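A minimal Bloom filter makes the bullets concrete. This is an illustrative sketch with an arbitrary bit-array size; real deployments size m and k from the expected item count and the target false-positive rate.

```python
import hashlib

class BloomFilter:
    """Minimal Bloom filter: k hash positions per item in an m-bit array.

    "Maybe present" can be a false positive; "absent" is always correct.
    """
    def __init__(self, m=1024, k=3):
        self.m, self.k, self.bits = m, k, 0

    def _positions(self, item):
        # Derive k independent positions by salting one hash function.
        for i in range(self.k):
            digest = hashlib.sha256(f"{i}:{item}".encode()).hexdigest()
            yield int(digest, 16) % self.m

    def add(self, item):
        for pos in self._positions(item):
            self.bits |= 1 << pos

    def might_contain(self, item):
        return all(self.bits >> pos & 1 for pos in self._positions(item))

bf = BloomFilter()
bf.add("alice")
print(bf.might_contain("alice"))  # → True (added items are always found)
print(bf.might_contain("bob"))    # almost certainly False at this fill level
```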
Algorithm
๐Ÿข Real World

Real-World System Designs

End-to-end designs of popular systems you'll encounter in interviews.

🔗

Design URL Shortener

Like bit.ly — convert long URLs to short ones with analytics.

┌──────┐    POST /shorten     ┌───────────┐
│Client│ ──────────────────→  │ API Server│
└──────┘  {"url":"long..."}   └─────┬─────┘
                                    │
    1. Generate short ID (Base62)   │
    2. Store mapping                │
                              ┌─────▼──────┐
                              │ Database   │
                              │ short→long │
                              │ + metadata │
                              └─────┬──────┘
                                    │
┌──────┐    GET /abc123       ┌─────▼──────┐
│Client│ ──────────────────→  │ Cache      │
└──────┘    301 Redirect      │ (Redis)    │
                              └────────────┘

ID GENERATION:
  Base62: [a-zA-Z0-9] → 62^7 = 3.5 trillion
  Snowflake ID → timestamp + machine + seq
  MD5/SHA → hash + take first 7 chars
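The Base62 step can be sketched directly (the alphabet ordering here is one arbitrary choice; any fixed permutation of [a-zA-Z0-9] works):

```python
import string

# 62 symbols: 0-9, a-z, A-Z (ordering is an arbitrary but fixed choice).
ALPHABET = string.digits + string.ascii_lowercase + string.ascii_uppercase

def base62_encode(n):
    """Encode a numeric ID (e.g. auto-increment or Snowflake) as Base62."""
    if n == 0:
        return ALPHABET[0]
    out = []
    while n:
        n, rem = divmod(n, 62)
        out.append(ALPHABET[rem])
    return "".join(reversed(out))

print(base62_encode(125))             # → 21
print(len(base62_encode(62**7 - 1)))  # → 7 (7 chars cover ~3.5 trillion IDs)
```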

Key Decisions

  • 301 vs 302: 301 (permanent) = cached by browser. 302 (temp) = always hits server (better for analytics).
  • Read-heavy: 100:1 read/write ratio. Cache aggressively in Redis.
  • Custom aliases: Check uniqueness, maintain a reserved-word blacklist.
  • Analytics: Log click events to Kafka → aggregate in analytics pipeline.
Interview Classic
💬

Design Chat System

Like WhatsApp/Slack — real-time messaging with groups, media, and presence.

┌──────┐  WebSocket  ┌──────────────────┐
│User A│ ◀─────────→ │  Chat Server     │
└──────┘             │  (WS Gateway)    │
                     └────────┬─────────┘
┌──────┐  WebSocket          │
│User B│ ◀─────────→ ┌───────▼───────┐
└──────┘             │ Message Queue │
                     │   (Kafka)     │
                     └───────┬───────┘
                             │
              ┌──────────────┼──────────┐
              ▼              ▼          ▼
        ┌──────────┐  ┌──────────┐ ┌─────────┐
        │ Message  │  │ Presence │ │  Push   │
        │   DB     │  │  Service │ │Notific. │
        │(Cassandra│  │ (Redis)  │ │ (FCM/   │
        │  / HBase)│  └──────────┘ │  APNS)  │
        └──────────┘               └─────────┘

MESSAGE FLOW:
  1. User A sends via WebSocket
  2. Server publishes to Kafka topic
  3. Recipient's chat server consumes
  4. If online → deliver via WebSocket
  5. If offline → push notification + store

Key Decisions

  • WebSocket: Full-duplex, persistent connection for real-time. Fallback: long polling.
  • Message ordering: Sequence IDs per conversation. Cassandra: partition by chat_id, cluster by timestamp.
  • Group chat: Fan-out on write (small groups) vs fan-out on read (large channels).
  • E2E Encryption: Signal Protocol — keys on devices, server can't read messages.
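The message-ordering bullet can be sketched with per-conversation sequence IDs. This is a toy in-memory stand-in; a real deployment would use an atomic counter such as Redis INCR keyed by chat ID.

```python
from collections import defaultdict

class SequenceAllocator:
    """Per-conversation monotonically increasing sequence numbers.

    In production this would be an atomic counter (e.g. Redis INCR on a
    key like "seq:<chat_id>"); a dict stands in for it here.
    """
    def __init__(self):
        self._next = defaultdict(int)

    def allocate(self, chat_id):
        self._next[chat_id] += 1
        return self._next[chat_id]

seq = SequenceAllocator()
msgs = [(seq.allocate("chat42"), text) for text in ["hi", "hello", "bye"]]
print(msgs)  # → [(1, 'hi'), (2, 'hello'), (3, 'bye')]
# Receivers sort by sequence number and drop duplicates they have seen.
```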
Interview Classic
📱

Design News Feed

Like Twitter/Instagram — generate and serve personalized content feeds.

FAN-OUT ON WRITE (Push Model):
  User posts ──→ Write to all followers' feeds
  ┌──────┐
  │Post  │──→ Feed cache (follower 1)
  │      │──→ Feed cache (follower 2)
  │      │──→ Feed cache (follower N)
  └──────┘
  ✅ Fast reads  ❌ Celebrity problem (millions)

FAN-OUT ON READ (Pull Model):
  User opens app ──→ Query all followed users
  ┌──────┐
  │Reader│──→ Get posts from user A, B, C...
  │      │──→ Merge + Sort + Return top N
  └──────┘
  ✅ No celebrity problem  ❌ Slow reads

HYBRID (Twitter's approach):
  Regular users → fan-out on write
  Celebrities (>10K followers) → fan-out on read
  Merge both at read time
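The hybrid read path is a merge of two sorted streams (a toy sketch; the post tuples and limit are invented for illustration):

```python
import heapq

def read_feed(precomputed, celebrity_posts, limit=3):
    """Hybrid fan-out: merge the user's precomputed feed (fan-out on
    write) with celebrity posts pulled at read time (fan-out on read).

    Posts are (timestamp, post_id) tuples, returned newest first.
    """
    merged = heapq.merge(
        sorted(precomputed, reverse=True),
        sorted(celebrity_posts, reverse=True),
        reverse=True,  # both inputs are sorted descending
    )
    return list(merged)[:limit]

feed = read_feed(
    precomputed=[(100, "friend_post"), (90, "friend_post_2")],
    celebrity_posts=[(95, "celebrity_post")],
)
print(feed)  # → [(100, 'friend_post'), (95, 'celebrity_post'), (90, 'friend_post_2')]
```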

Key Decisions

  • Feed Storage: Pre-computed in Redis (user_id → list of post_ids). Limit to 800 items.
  • Ranking: ML model scores posts by relevance, recency, engagement, relationship.
  • Cache: Feed cache + content cache + social graph cache.
  • Pagination: Cursor-based (last_seen_id) to handle real-time insertions.
Interview Classic
🔔

Design Notification System

Multi-channel notifications — push, SMS, email at scale with preferences.

┌──────────┐     ┌───────────────────────┐
│ Services │────→│  Notification Service │
│(triggers)│     └───────────┬───────────┘
└──────────┘                 │
                    ┌────────▼────────┐
                    │  Message Queue  │
                    │    (Kafka)      │
                    └──┬─────┬─────┬──┘
                       │     │     │
                  ┌────▼─┐ ┌─▼──┐ ┌▼────┐
                  │Push  │ │SMS │ │Email│
                  │Worker│ │Wrkr│ │Wrkr │
                  └──┬───┘ └─┬──┘ └──┬──┘
                     │       │      │
                  ┌──▼──┐  ┌─▼──┐ ┌─▼─────┐
                  │APNS/│  │Twil│ │SES/   │
                  │FCM  │  │io  │ │Sendgrd│
                  └─────┘  └────┘ └───────┘

FEATURES:
  ✓ User preferences (channel, frequency)
  ✓ Rate limiting (max 3 push/hour)
  ✓ Template engine (personalization)
  ✓ Analytics (delivered, opened, clicked)
  ✓ Retry with exponential backoff

Key Decisions

  • Priority Queue: Urgent (OTP, security) → high priority. Marketing → low priority.
  • Deduplication: Idempotency key prevents duplicate notifications.
  • User Preferences: Channel preferences, quiet hours, frequency caps.
  • Delivery Tracking: Sent → Delivered → Opened → Clicked funnel.
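The idempotency-key bullet can be sketched with an in-memory seen-set (in production this would typically be a Redis key with a TTL so the set doesn't grow forever):

```python
class NotificationDeduper:
    """Drop notifications whose idempotency key was already processed.

    A plain set stands in for what would be shared storage (e.g. Redis)
    in a real multi-worker deployment.
    """
    def __init__(self):
        self._seen = set()

    def should_send(self, idempotency_key):
        if idempotency_key in self._seen:
            return False  # duplicate trigger: suppress
        self._seen.add(idempotency_key)
        return True

dedupe = NotificationDeduper()
print(dedupe.should_send("order-123:shipped"))  # → True  (first delivery)
print(dedupe.should_send("order-123:shipped"))  # → False (duplicate suppressed)
```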
Interview Classic
🔍

Design Search System

Full-text search with autocomplete, ranking, and real-time indexing.

INDEXING PIPELINE:
  Data Source ──→ Crawler/CDC ──→ Processor
                                     │
                              ┌──────▼──────┐
                              │  Tokenize   │
                              │  Normalize  │
                              │  Stem/Lemma │
                              └──────┬──────┘
                                     │
                              ┌──────▼──────┐
                              │ Inverted    │
                              │ Index       │
                              │ (Elastic-   │
                              │  search)    │
                              └─────────────┘

INVERTED INDEX:
  "distributed" → [doc1, doc5, doc9]
  "system"      → [doc1, doc3, doc5]
  "design"      → [doc1, doc2, doc7]

  Search "distributed system":
  → Intersection: [doc1, doc5] ← results!
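The intersection query above is a few lines of code (toy documents chosen to reproduce the example):

```python
from collections import defaultdict

def build_inverted_index(docs):
    """Map each term to the set of document IDs containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index

docs = {
    "doc1": "distributed system design",
    "doc3": "system architecture",
    "doc5": "distributed system",
}
index = build_inverted_index(docs)

# AND query = intersection of the two posting lists.
result = index["distributed"] & index["system"]
print(sorted(result))  # → ['doc1', 'doc5']
```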

SEARCH FLOW:
  Query → Parse → Search Index → Rank → Return
           │                       │
        Spell     TF-IDF, BM25, PageRank,
        correct   personalization, freshness

Key Concepts

  • Inverted Index: Maps terms to documents. Core of all search engines.
  • BM25: Industry-standard ranking algorithm. Considers term frequency and document length.
  • Autocomplete: Trie data structure + top-K queries by frequency.
  • Typeahead: Prefix search on pre-computed suggestions, cached aggressively.
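The trie-based autocomplete bullet, as a toy sketch (frequencies and query strings are invented; production systems precompute and cache the top-k per prefix rather than walking the trie on every request):

```python
class TrieNode:
    def __init__(self):
        self.children = {}
        self.freq = 0  # > 0 marks the end of a stored query string

class Autocomplete:
    """Trie-based typeahead: walk to the prefix node, collect the
    completions below it, and return the top-k by query frequency."""
    def __init__(self):
        self.root = TrieNode()

    def add(self, word, freq=1):
        node = self.root
        for ch in word:
            node = node.children.setdefault(ch, TrieNode())
        node.freq += freq

    def top_k(self, prefix, k=2):
        node = self.root
        for ch in prefix:          # walk down to the prefix node
            if ch not in node.children:
                return []
            node = node.children[ch]
        found = []
        def walk(n, word):         # collect every completion in the subtree
            if n.freq:
                found.append((n.freq, word))
            for ch, child in n.children.items():
                walk(child, word + ch)
        walk(node, prefix)
        return [w for _, w in sorted(found, reverse=True)[:k]]

ac = Autocomplete()
for term, f in [("system design", 50), ("system", 30), ("sysadmin", 5)]:
    ac.add(term, f)
print(ac.top_k("sys"))  # → ['system design', 'system']
```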
Interview Classic
🤖

Design AI Agent Platform

Multi-tenant AI agent orchestration — reasoning, planning, and tool execution.

┌─────────────┐
│ Slack/Teams │──→ API Gateway (Auth, Rate Limit)
└──────┬──────┘          │
       │          Session Manager (Redis)
       │                 │
       │       ┌─────────▼───────────┐
       │       │  REASONING ENGINE   │
       │       │  ┌───────────────┐  │
       │       │  │ Planning (LLM)│  │→ Decompose
       │       │  │ Execution Eng │  │→ Run tools
       │       │  │ Observation   │  │→ Evaluate
       │       │  └───────────────┘  │
       │       └────┬──────────┬─────┘
       │            │          │
       │   ┌────────▼────┐  ┌──▼──────────┐
       │   │Tool Registry│  │State Manager│
       │   │(per tenant) │  │(Redis + PG) │
       │   └────┬────────┘  └─────────────┘
       │        │
       │  ┌─────┼──────┬──────────┐
       │  ▼     ▼      ▼          ▼
       │ SNOW  Jira  Salesforce  Okta
       └────────────────────────────────┘

Key Decisions

  • ReAct Pattern: Reason → Act → Observe → loop until done.
  • Plugin System: Each tenant configures their own connectors with isolated credentials.
  • Model Routing: Fast model for simple tasks, powerful model for complex reasoning.
  • Guardrails: Permission enforcement, output validation, cost controls per tenant.
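The ReAct loop can be sketched with stubbed planning and tool calls. Everything below, including the `jira_lookup` tool name, is a placeholder standing in for the real LLM and connector calls.

```python
def react_loop(goal, plan_step, run_tool, is_done, max_steps=5):
    """Reason → Act → Observe until the goal is met or steps run out.

    `plan_step` is the LLM call that picks the next tool and arguments,
    `run_tool` executes it, and `is_done` evaluates the latest observation.
    """
    history = []
    for _ in range(max_steps):
        thought, tool, args = plan_step(goal, history)   # Reason
        observation = run_tool(tool, args)               # Act
        history.append((thought, tool, observation))     # Observe
        if is_done(goal, history):
            break
    return history

# Stub "LLM" and tool registry for illustration only.
def plan_step(goal, history):
    return ("look up the ticket", "jira_lookup", {"id": "OPS-1"})

def run_tool(tool, args):
    return {"jira_lookup": "ticket OPS-1 is resolved"}[tool]

def is_done(goal, history):
    return "resolved" in history[-1][-1]

steps = react_loop("check ticket status", plan_step, run_tool, is_done)
print(len(steps))  # → 1 (the stubbed goal is met after one tool call)
```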
Hot Topic 🔥
📋 Quick Reference

System Design Cheat Sheet

Everything you need at a glance — the ultimate quick reference.

🎯 System Design Interview Framework (RESHADED)

1

Requirements

Functional & non-functional. Ask clarifying questions. Scope the problem. Don't assume.

2

Estimation

DAU, QPS, storage, bandwidth. Back-of-envelope math. Know your powers of 2.
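A worked back-of-envelope example (the traffic numbers are assumptions, picked only to show the arithmetic):

```python
# Assumed workload: 10M daily active users, 20 requests per user per day.
dau = 10_000_000
requests_per_user = 20
seconds_per_day = 86_400

avg_qps = dau * requests_per_user / seconds_per_day
peak_qps = avg_qps * 2  # common rule of thumb: peak is roughly 2x average

print(round(avg_qps))   # → 2315
print(round(peak_qps))  # → 4630
```

At ~2.3K average QPS a handful of app servers suffices; the point of the exercise is the order of magnitude, not the exact figure.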

3

Storage Schema

Data model, SQL vs NoSQL choice, key entities and relationships.

4

High-Level Design

Draw the architecture. Main components, data flow, APIs. Keep it simple first.

5

API Design

Define key API endpoints. REST/gRPC. Request/response schemas.

6

Deep Dive

Pick 2-3 interesting components. Discuss trade-offs, alternatives, edge cases.

7

Evaluate

Bottlenecks, single points of failure, scaling strategies, monitoring.

8

Discuss Trade-offs

Every decision has trade-offs. Show you understand the "why" behind choices.

🗄️ Database Quick Pick

  Need                   Choose               Example
  ACID Transactions      PostgreSQL           Payments, Orders
  High Write Throughput  Cassandra            IoT, Time-series
  Flexible Schema        MongoDB              CMS, Catalogs
  Cache / Sessions       Redis                Sessions, Leaderboards
  Full-Text Search       Elasticsearch        Product Search
  Graph Queries          Neo4j                Social Networks
  Global Scale SQL       Spanner/CockroachDB  Multi-region apps
  Vector Search (AI)     Pinecone/pgvector    RAG, Similarity

⚡ Scaling Strategies

  Problem                  Solution
  Too many reads           Cache (Redis) + CDN + Read replicas
  Too many writes          Sharding + Message queue + Batch writes
  Single point of failure  Redundancy + Failover + Multi-AZ
  Slow API responses       Async processing + Pagination + Compression
  Data growing too fast    Data partitioning + Archival + TTL
  Cross-region latency     Multi-region deploy + CDN + Edge computing
  Thundering herd          Rate limiting + Circuit breaker + Backpressure

🔢 Numbers Every Engineer Should Know

~1ms Redis GET
~5ms SQL Simple Query
~0.1ms SSD Random Read
~150ms Cross-continent RTT
~1-3s LLM Response
1M+ Redis Ops/sec
10K Single Server RPS
86,400 Seconds in a Day