InkdownInkdown
Start writing

Study

59 files·8 subfolders

Shared Workspace

Study
core

11-cap-theorem

Shared from "Study" on Inkdown

11 - CAP Theorem & Distributed Systems

Understanding CAP Theorem

The CAP Theorem states that a distributed data system can only guarantee two of the following three properties:

Plain text
┌─────────────────────────────────────────────────────────────┐
│                    CAP Theorem                               │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│              ┌─────────────┐                                 │
│             /    C         \                                │
│            /  Consistency   \                               │
│           /   (All nodes see    \                            │
│          /     same data at      \                           │
│         /       same time)        \                          │
│        /                           \                         │
│       └─────────────┬───────────────┘                        │
│                     / \                                      │
│                    /   \                                     │
│                   /     \                                    │
│                  /       \                                   │
│                 /    A   \                                  │
│                / Available\                                 │
│               /  (Every request  \                           │
│              /   gets a response, \                         │
│             /    even if not latest) \                        │
│            /                           \                     │
│           └─────────────┬───────────────┘                    │
│                        /                                     │
│                       /                                      │
│                      /                                       │
│                     / P                                      │
│                    / Partition Tolerance                  │
│                   /  (System works despite                  │
│                  /    network failures)                     │
│                 /                                           │
│                                                              │
│  In practice, you must choose CP or AP                      │
│  (You can't sacrifice partition tolerance in real systems)   │
│                                                              │
└─────────────────────────────────────────────────────────────┘
programming-language-concepts.md
zero-language-explanation.md
DB
01-introduction.md
02-relational-databases.md
03-database-design.md
04-indexing.md
05-transactions-acid.md
06-nosql-databases.md
07-query-optimization.md
08-replication-ha.md
09-sharding-partitioning.md
10-caching-strategies.md
11-cap-theorem.md
12-connection-pooling.md
13-backup-recovery.md
14-monitoring.md
15-database-selection.md
README.md
JS
Event loop
Merlin Backend
01-Orchestration.md
02-DeepResearch.md
03-Search.md
04-Scraping.md
05-Streaming.md
06-MultiProviderLLM.md
07-MemoryAndContext.md
08-ErrorHandling.md
09-RateLimiting.md
10-TaskQueue.md
11-SecurityAndAuth.md
Orchestration-2nd-draft
OpenAI Agents Python
00_OVERVIEW.md
01_AGENT_SYSTEM.md
02_RUNNER_SYSTEM.md
03_TOOL_SYSTEM.md
04_ITEMS_SYSTEM.md
05_GUARDRAILS.md
06_HANDOFFS.md
07_MEMORY_SESSIONS.md
08_MODEL_PROVIDERS.md
09_SANDBOX_SYSTEM.md
10_TRACING.md
11_RUN_STATE.md
12_CONTEXT.md
13_LIFECYCLE_HOOKS.md
14_CONFIGURATION.md
15_ERROR_HANDLING.md
16_STREAMING.md
17_EXTENSIONS.md
18_MCP_INTEGRATION.md
19_BEST_PRACTICES.md
20_ARCHITECTURE_PATTERNS.md
opencode-study
context-handling
core
Python
Alembic
Basics
sqlalchemy - fastapi
SQLAlchemy overview
tweets
system_design_for_agentic_apps.md
Breaking Down the Properties

Consistency (C):

  • All nodes see the same data at the same time
  • Read returns the most recent write
  • Linearizability

Availability (A):

  • Every request receives a response
  • No guarantee the data is the latest
  • System remains operational

Partition Tolerance (P):

  • System continues to operate despite network partitions
  • Messages may be lost between nodes
  • This is not optional in distributed systems

CP vs AP Systems

CP (Consistency + Partition Tolerance)
Plain text
┌─────────────────────────────────────────────────────────────┐
│              CP Systems                                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  When partition occurs:                                      │
│  - Sacrifice availability                                  │
│  - Return error or timeout instead of stale data            │
│                                                              │
│  Examples:                                                   │
│  - PostgreSQL (with sync replication)                         │
│  - MongoDB (configured for majority)                        │
│  - Redis Cluster (with WAIT)                                │
│  - etcd, ZooKeeper, Consul                                  │
│                                                              │
│  Use when:                                                   │
│  - Data consistency is critical                           │
│  - Financial transactions                                    │
│  - Inventory management                                      │
│  - User authentication                                       │
│                                                              │
│  Trade-off: System may become unavailable                    │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Scenario:
┌─────────┐         Partition         ┌─────────┐
│ Node A  │◄───────X───────►│ Node B  │
│ (Primary)│                      │ (Replica)│
└─────────┘                      └─────────┘

Write to A:
- A attempts to replicate to B
- Network partition prevents it
- CP Choice: Return error to client
  (rather than risk inconsistency)
- System is unavailable for writes
AP (Availability + Partition Tolerance)
Plain text
┌─────────────────────────────────────────────────────────────┐
│              AP Systems                                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  When partition occurs:                                      │
│  - Sacrifice consistency                                     │
│  - Continue serving requests with potentially stale data    │
│                                                              │
│  Examples:                                                   │
│  - Cassandra                                                │
│  - DynamoDB (default)                                        │
│  - Couchbase                                                │
│  - Riak                                                     │
│                                                              │
│  Use when:                                                   │
│  - Availability is paramount                                │
│  - Eventual consistency is acceptable                         │
│  - Social media, analytics, recommendations                   │
│  - Caching, session stores                                   │
│                                                              │
│  Trade-off: May serve stale data                             │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Scenario:
┌─────────┐         Partition         ┌─────────┐
│ Node A  │◄───────X───────►│ Node B  │
│ (Replica)│                      │ (Replica)│
└─────────┘                      └─────────┘

Write to A:
- A accepts write
- Can't replicate to B (partitioned)
- AP Choice: Return success to client
- B still serves reads (may be stale)

Later: When partition heals
- Reconcile differences (last-write-wins,
  vector clocks, application logic)

PACELC Theorem

An extension of CAP for when there's no partition:

Plain text
If there is a Partition (P), 
   how does the system trade off between 
   Availability (A) and Consistency (C)?

Else (E) when system is running normally,
   how does it trade off between 
   Latency (L) and Consistency (C)?

Examples:
- DynamoDB: PA/EL (sacrifice consistency for availability OR latency)
- MongoDB: PC/EC (sacrifice availability for consistency, 
                   or latency for consistency)
- Cassandra: PA/EL
- BigTable: PC/EC

Eventual Consistency

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Eventual Consistency                            │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Definition:                                                 │
│  If no new updates are made, eventually all replicas          │
│  will converge to the same value.                            │
│                                                              │
│  Timeline:                                                   │
│                                                              │
│  T1: Write X=10 to Node A                                    │
│      Node A: X=10                                            │
│      Node B: X=5  (stale)                                    │
│                                                              │
│  T2: Read from Node B returns X=5 (stale)                  │
│                                                              │
│  T3: Replication completes                                   │
│      Node A: X=10                                            │
│      Node B: X=10  (now consistent)                         │
│                                                              │
│  T4: All reads return X=10                                   │
│                                                              │
│  The "inconsistency window" = replication lag               │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Eventual Consistency Patterns

Read Repair:

Plain text
Client reads from 3 replicas:
┌─────────┐
│Replica 1│─── Value = 10
├─────────┤
│Replica 2│─── Value = 10
├─────────┤
│Replica 3│─── Value = 5   (stale!)
└─────────┘

Client detects inconsistency (quorum read)
Sends write to Replica 3 to update to 10

Hinted Handoff:

Plain text
If node B is down when A tries to replicate:
1. Node C (coordinator) stores "hint"
2. When B comes back, C sends the hinted write
3. B catches up

Anti-Entropy:

Plain text
Background process compares replicas
Uses Merkle trees to efficiently find differences
Repairs inconsistencies automatically

Consistency Levels

Cassandra Example
Sql
-- Write consistency levels
CONSISTENCY ONE;      -- Wait for 1 replica acknowledgment
CONSISTENCY QUORUM;  -- Wait for majority (RF/2 + 1)
CONSISTENCY ALL;     -- Wait for all replicas
CONSISTENCY ANY;     -- Wait for any node (even hint)

-- Read consistency levels
CONSISTENCY ONE;      -- Read from 1 replica (fastest)
CONSISTENCY QUORUM;  -- Read from majority
CONSISTENCY ALL;     -- Read from all (slowest, consistent)

-- Strong consistency (R + W > RF)
-- QUORUM read + QUORUM write with RF=3
-- (2 + 2 > 3) = true, so strongly consistent
DynamoDB Consistency
Python
# DynamoDB read consistency

# Eventually consistent (default)
# May return stale data
response = table.get_item(
    Key={'id': '123'},
    ConsistentRead=False
)

# Strongly consistent
# Returns most recent data
response = table.get_item(
    Key={'id': '123'},
    ConsistentRead=True  # 2x read capacity, higher latency
)

Conflict Resolution

Last-Write-Wins (LWW)
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Last-Write-Wins                                 │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Strategy: Keep the write with the latest timestamp        │
│                                                              │
│  Node A: X = 10, timestamp = T1                              │
│  Node B: X = 20, timestamp = T2 (T2 > T1)                    │
│                                                              │
│  Result after reconciliation: X = 20                        │
│                                                              │
│  Problem: Clock skew!                                        │
│  - If Node B's clock is fast, it always wins                │
│  - If clocks differ by more than network latency,          │
│    valid writes can be lost                                 │
│                                                              │
│  Mitigation: Logical clocks (Lamport timestamps)           │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Vector Clocks
Plain text
Track which nodes have seen which versions:

Version 1: [A:1]  (Node A, version 1)
Version 2: [A:1, B:1]  (B sees A:1, adds its own)
Version 3: [A:2, B:1]  (A updates again)

Conflict detected: [A:2, B:1] vs [A:1, B:2]
Neither dominates the other = concurrent updates

Resolution:
1. Siblings: Return both versions to application
2. Application decides (merge function)
3. Or: CRDTs (Conflict-free Replicated Data Types)
CRDTs (Conflict-free Replicated Data Types)
Plain text
Data structures that automatically merge:

G-Counter (Grow-only Counter):
┌─────────────────────────────────────────┐
│  Node A: {A:5, B:3, C:2} = 10          │
│  Node B: {A:5, B:4, C:2} = 11          │
│                                         │
│  Merge: {A:5, B:MAX(3,4), C:2}          │
│       = {A:5, B:4, C:2} = 11            │
└─────────────────────────────────────────┘

Other CRDTs:
- PN-Counter (increment and decrement)
- G-Set (grow-only set)
- LWW-Register (last-write-wins register)
- OR-Set (observed-remove set)

Distributed Transactions

Two-Phase Commit (2PC)
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Two-Phase Commit                                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Coordinator                    Participants                 │
│  (Transaction Manager)          (Databases A, B, C)          │
│                                                              │
│  Phase 1: PREPARE                                            │
│  ─────────────────                                           │
│      ┌─────────────────────────►                              │
│      │ "Can you commit?"                                     │
│      │                          Each participant:           │
│      │                          - Write to WAL              │
│      │                          - Acquire locks            │
│      │                          - Reply YES or NO            │
│      ◄─────────────────────────                               │
│                                                              │
│  Phase 2: COMMIT or ABORT                                    │
│  ─────────────────────────                                   │
│  If all YES:                                                 │
│      ┌─────────────────────────►                              │
│      │ "COMMIT"                                              │
│      │                          Each participant:           │
│      │                          - Make changes permanent     │
│      │                          - Release locks              │
│      │                          - Reply ACK                │
│      ◄─────────────────────────                               │
│                                                              │
│  If any NO:                                                  │
│      ┌─────────────────────────►                              │
│      │ "ABORT"                                               │
│      │                          Each participant:           │
│      │                          - Rollback                   │
│      │                          - Release locks              │
│      ◄─────────────────────────                               │
│                                                              │
│  Problems:                                                   │
│  - Blocking: If coordinator fails during commit,           │
│    participants hold locks until coordinator recovers       │
│  - Not partition tolerant                                    │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Saga Pattern
Plain text
For long-running transactions across services:

Sequence of local transactions:

┌─────────┐    ┌─────────┐    ┌─────────┐    ┌─────────┐
│ Service │───►│ Service │───►│ Service │───►│ Service │
│    A    │    │    B    │    │    C    │    │    D    │
└─────────┘    └─────────┘    └─────────┘    └─────────┘

If C fails:
┌─────────┐    ┌─────────┐    ╔═════════╗    ┌─────────┐
│ Service │───►│ Service │───►║ Service ║    │ Service │
│    A    │    │    B    │◄───╚═════════╝    │    D    │
└─────────┘    └─────────┘    Compensate    └─────────┘
     ▲              │
     └──────────────┘
   Compensate A

Each service must implement:
- Transaction: Do the work
- Compensation: Undo the work (idempotent)

Types:
1. Choreography: Events trigger next step
2. Orchestration: Central coordinator manages flow

Consensus Algorithms

Raft (Simpler than Paxos)
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Raft Consensus                                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Roles:                                                      │
│  - Leader: Handles all client requests                      │
│  - Follower: Replicates leader's log                        │
│  - Candidate: Temporary state during election               │
│                                                              │
│  Leader Election:                                            │
│  1. If follower doesn't hear from leader, becomes candidate │
│  2. Requests votes from all nodes                           │
│  3. If majority votes, becomes leader                        │
│  4. New leader handles requests                              │
│                                                              │
│  Log Replication:                                            │
│  1. Client sends request to leader                          │
│  2. Leader appends to its log                               │
│  3. Leader sends to followers (AppendEntries)               │
│  4. When majority acknowledge, leader commits                │
│  5. Leader notifies followers of commit                     │
│  6. Response to client                                       │
│                                                              │
│  Used by: etcd, Consul, TiKV, CockroachDB                    │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Paxos
Plain text
More complex but proven correct consensus algorithm.

Roles:
- Proposers: Suggest values
- Acceptors: Vote on values
- Learners: Record chosen values

Phases:
1. Prepare: Proposer asks acceptors to promise not to accept lower proposals
2. Promise: Acceptors respond with highest accepted value (or none)
3. Accept: Proposer sends value, acceptors accept if promise kept
4. Accepted: Value is chosen when majority accept

Used by: Chubby (Google), ZooKeeper (ZAB similar)

Practical System Design

Choosing Consistency Model
Plain text
┌─────────────────────────────────────────────────────────────┐
│  Strong Consistency (CP)                                     │
│  ├─ Financial transactions                                 │
│  ├─ Inventory management                                     │
│  ├─ User authentication                                      │
│  ├─ Reservation systems                                    │
│  └─ Configuration data                                       │
├─────────────────────────────────────────────────────────────┤
│  Eventual Consistency (AP)                                   │
│  ├─ Social media feeds                                       │
│  ├─ Product catalogs                                         │
│  ├─ Analytics data                                           │
│  ├─ Session stores                                           │
│  ├─ Caching                                                  │
│  └─ Recommendation engines                                   │
├─────────────────────────────────────────────────────────────┤
│  Causal Consistency (middle ground)                          │
│  ├─ Comments on posts (see your own immediately)             │
│  ├─ Collaborative editing                                    │
│  └─ Shopping carts                                           │
└─────────────────────────────────────────────────────────────┘
Design Patterns
Plain text
1. CQRS (Command Query Responsibility Segregation)
   ┌─────────┐         ┌─────────┐
   │ Command │────────►│ Write DB│ (strong consistency)
   │ Handler │         │ (SQL)   │
   └─────────┘         └────┬────┘
                            │ Async replication
                            ▼
                       ┌─────────┐
   ┌─────────┐         │ Read DB │ (eventual consistency)
   │  Query  │────────►│ (NoSQL) │
   │ Handler │         └─────────┘
   └─────────┘

2. Event Sourcing
   - Store events instead of state
   - Replay events to reconstruct state
   - Natural fit for distributed systems
   
   UserCreated ──► EmailChanged ──► PasswordChanged
   Current state = foldl applyEvent events initialState

3. Lambda Architecture
   Speed layer: Process in real-time (AP, fast)
   Batch layer: Reprocess for accuracy (CP, correct)
   Serving layer: Merge results

Next: Connection Pooling & Performance