InkdownInkdown
Start writing

Study

59 files·8 subfolders

Shared Workspace

Study
core

09-sharding-partitioning

Shared from "Study" on Inkdown

09 - Sharding & Partitioning

The Scaling Problem

As data grows, a single database server eventually hits limits:

  • CPU: Query processing, connection handling
  • Memory: Cache, working set size
  • Disk: Storage capacity, IOPS
  • Network: Bandwidth for replication
Plain text
Single Server Limits:
┌─────────────────────────────────────────────────────────────┐
│  CPU      │ ~64 cores, but can't parallelize single query   │
│  Memory   │ ~1TB RAM for cache                              │
│  Disk     │ ~20TB NVMe, ~100K IOPS                         │
│  Network  │ ~10 Gbps                                        │
│  Connections│ ~1000 concurrent (more = slower)               │
└─────────────────────────────────────────────────────────────┘

When you hit these limits, you need to scale out (shard/partition)
programming-language-concepts.md
zero-language-explanation.md
DB
01-introduction.md
02-relational-databases.md
03-database-design.md
04-indexing.md
05-transactions-acid.md
06-nosql-databases.md
07-query-optimization.md
08-replication-ha.md
09-sharding-partitioning.md
10-caching-strategies.md
11-cap-theorem.md
12-connection-pooling.md
13-backup-recovery.md
14-monitoring.md
15-database-selection.md
README.md
JS
Event loop
Merlin Backend
01-Orchestration.md
02-DeepResearch.md
03-Search.md
04-Scraping.md
05-Streaming.md
06-MultiProviderLLM.md
07-MemoryAndContext.md
08-ErrorHandling.md
09-RateLimiting.md
10-TaskQueue.md
11-SecurityAndAuth.md
Orchestration-2nd-draft
OpenAI Agents Python
00_OVERVIEW.md
01_AGENT_SYSTEM.md
02_RUNNER_SYSTEM.md
03_TOOL_SYSTEM.md
04_ITEMS_SYSTEM.md
05_GUARDRAILS.md
06_HANDOFFS.md
07_MEMORY_SESSIONS.md
08_MODEL_PROVIDERS.md
09_SANDBOX_SYSTEM.md
10_TRACING.md
11_RUN_STATE.md
12_CONTEXT.md
13_LIFECYCLE_HOOKS.md
14_CONFIGURATION.md
15_ERROR_HANDLING.md
16_STREAMING.md
17_EXTENSIONS.md
18_MCP_INTEGRATION.md
19_BEST_PRACTICES.md
20_ARCHITECTURE_PATTERNS.md
opencode-study
context-handling
core
Python
Alembic
Basics
sqlalchemy - fastapi
SQLAlchemy overview
tweets
system_design_for_agentic_apps.md

Partitioning vs Sharding

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Partitioning vs Sharding                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  PARTITIONING (Single Node)                                  │
│  ┌─────────────────────────────────────────────────────┐   │
│  │           Single Database Server                      │   │
│  │  ┌─────────┐ ┌─────────┐ ┌─────────┐ ┌─────────┐   │   │
│  │  │Part 2022│ │Part 2023│ │Part 2024│ │Part 2025│   │   │
│  │  └─────────┘ └─────────┘ └─────────┘ └─────────┘   │   │
│  │                                                      │   │
│  │  Split large table into smaller pieces               │   │
│  │  Still one server, but better query performance      │   │
│  │  and easier maintenance (drop old partitions)          │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
│  SHARDING (Multiple Nodes)                                   │
│  ┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐         │
│  │Shard A  │  │Shard B  │  │Shard C  │  │Shard D  │         │
│  │Users A-C│  │Users D-F│  │Users G-I│  │Users J-L│         │
│  └─────────┘  └─────────┘  └─────────┘  └─────────┘         │
│       │            │            │            │               │
│       └────────────┴────────────┴────────────┘               │
│                   Router/Application                           │
│                                                              │
│  Data distributed across multiple servers                   │
│  Each shard is an independent database                        │
│  Enables horizontal scaling                                  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Table Partitioning

Range Partitioning
Sql
-- Partition by date range
CREATE TABLE events (
    id BIGSERIAL,
    event_time TIMESTAMPTZ NOT NULL,
    user_id INTEGER,
    event_type VARCHAR(50),
    data JSONB
) PARTITION BY RANGE (event_time);

-- Create partitions
CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

CREATE TABLE events_2024_02 PARTITION OF events
    FOR VALUES FROM ('2024-02-01') TO ('2024-03-01');

CREATE TABLE events_2024_03 PARTITION OF events
    FOR VALUES FROM ('2024-03-01') TO ('2024-04-01');

-- Default partition (catches anything else)
CREATE TABLE events_default PARTITION OF events DEFAULT;

-- Insert works transparently
INSERT INTO events (event_time, user_id, event_type)
VALUES ('2024-01-15 10:30:00', 1, 'login');
-- Automatically goes to events_2024_01

-- Query pruning: Only scans relevant partitions
SELECT * FROM events 
WHERE event_time >= '2024-01-01' 
  AND event_time < '2024-02-01';
-- Only scans events_2024_01!
List Partitioning
Sql
-- Partition by region
CREATE TABLE sales (
    id SERIAL,
    region VARCHAR(20) NOT NULL,
    amount DECIMAL(10,2),
    sale_date DATE
) PARTITION BY LIST (region);

CREATE TABLE sales_north PARTITION OF sales
    FOR VALUES IN ('NYC', 'BOS', 'PHI');

CREATE TABLE sales_south PARTITION OF sales
    FOR VALUES IN ('ATL', 'MIA', 'HOU');

CREATE TABLE sales_west PARTITION OF sales
    FOR VALUES IN ('LAX', 'SFO', 'SEA');
Hash Partitioning
Sql
-- Distribute evenly across partitions
CREATE TABLE measurements (
    id BIGSERIAL,
    sensor_id INTEGER,
    reading DECIMAL(10,4),
    measured_at TIMESTAMPTZ
) PARTITION BY HASH (sensor_id);

CREATE TABLE measurements_p0 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);

CREATE TABLE measurements_p1 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 1);

CREATE TABLE measurements_p2 PARTITION OF elections
    FOR VALUES WITH (MODULUS 4, REMAINDER 2);

CREATE TABLE measurements_p3 PARTITION OF measurements
    FOR VALUES WITH (MODULUS 4, REMAINDER 3);

-- Data distributed based on hash(sensor_id) % 4
Sub-partitioning
Sql
-- Partition by range, then by hash
CREATE TABLE events (
    id BIGSERIAL,
    event_time TIMESTAMPTZ,
    user_id INTEGER
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2024_01 PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01')
    PARTITION BY HASH (user_id);

CREATE TABLE events_2024_01_p0 PARTITION OF events_2024_01
    FOR VALUES WITH (MODULUS 4, REMAINDER 0);
-- etc...
Partition Maintenance
Sql
-- Add new partition
CREATE TABLE events_2024_04 PARTITION OF events
    FOR VALUES FROM ('2024-04-01') TO ('2024-05-01');

-- Detach old partition (fast)
ALTER TABLE events DETACH PARTITION events_2024_01;

-- Drop old data instantly (vs DELETE which is slow)
DROP TABLE events_2024_01;

-- Attach existing table as partition
ALTER TABLE events ATTACH PARTITION events_archived_2023
    FOR VALUES FROM ('2023-01-01') TO ('2024-01-01');

Database Sharding

Sharding Strategies
1. Hash Sharding
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Hash Sharding                                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  shard = hash(user_id) % number_of_shards                    │
│                                                              │
│  Example:                                                    │
│  hash(12345) % 4 = 2 ──► Shard 2                            │
│  hash(67890) % 4 = 1 ──► Shard 1                            │
│                                                              │
│  Pros:                                                       │
│  - Even distribution (if good hash function)                  │
│  - Simple to implement                                       │
│                                                              │
│  Cons:                                                       │
│  - Can't do range queries across shards                      │
│  - Resharding is expensive (need to move data)             │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Python
# Application-level sharding
import hashlib

def get_shard(user_id, num_shards=4):
    """Determine which shard for user_id"""
    hash_val = int(hashlib.md5(str(user_id).encode()).hexdigest(), 16)
    return hash_val % num_shards

def get_db_connection(user_id):
    shard = get_shard(user_id)
    return db_pools[shard].get_connection()

# Usage
conn = get_db_connection(user_id=12345)
cursor = conn.cursor()
cursor.execute("SELECT * FROM users WHERE id = %s", (user_id,))
2. Range Sharding
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Range Sharding                                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Shard 1: User IDs 1 - 1,000,000                            │
│  Shard 2: User IDs 1,000,001 - 2,000,000                    │
│  Shard 3: User IDs 2,000,001 - 3,000,000                    │
│                                                              │
│  Pros:                                                       │
│  - Efficient range queries                                   │
│  - Easy to add new shard (just extend range)                │
│  - Can optimize shards for data characteristics              │
│                                                              │
│  Cons:                                                       │
│  - Hot spots (Shard 1 if new users are sequential)          │
│  - Uneven distribution if IDs not uniform                    │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Python
# Range-based routing
SHARD_RANGES = [
    (0, 1000000, 'shard1'),
    (1000001, 2000000, 'shard2'),
    (2000001, 3000000, 'shard3'),
]

def get_shard_by_range(user_id):
    for min_id, max_id, shard in SHARD_RANGES:
        if min_id <= user_id <= max_id:
            return shard
    return 'shard_overflow'  # For new range
3. Directory-Based (Lookup) Sharding
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Directory Sharding                              │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────────────────────────┐                    │
│  │         Shard Directory             │                    │
│  ├─────────────────────────────────────┤                    │
│  │  user_id │ shard │ reason           │                    │
│  ├──────────┼───────┼──────────────────┤                    │
│  │  12345   │   1   │ default          │                    │
│  │  67890   │   2   │ moved (hot spot) │                    │
│  │  11111   │   3   │ VIP customer     │                    │
│  └──────────┴───────┴──────────────────┘                    │
│                                                              │
│  Pros:                                                       │
│  - Flexible: can move individual users                       │
│  - Can handle hot spots by migration                         │
│  - Can keep related data together                            │
│                                                              │
│  Cons:                                                       │
│  - Extra lookup on every query                               │
│  - Directory becomes bottleneck/single point                 │
│  - Directory needs to be highly available                    │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Sharding Architecture
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Sharded Database Architecture                   │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│                     ┌─────────────┐                        │
│                     │   Client    │                        │
│                     └──────┬──────┘                        │
│                            │                                │
│                     ┌──────▼──────┐                         │
│                     │  Router     │                         │
│                     │ (PgBouncer,  │  Determines shard        │
│                     │  App layer, │  from shard key         │
│                     │  ProxySQL)  │                         │
│                     └──────┬──────┘                         │
│                            │                                │
│         ┌──────────────────┼──────────────────┐            │
│         │                  │                  │            │
│    ┌────▼────┐       ┌────▼────┐       ┌────▼────┐       │
│    │Shard 1  │       │Shard 2  │       │Shard 3  │       │
│    │Primary  │       │Primary  │       │Primary  │       │
│    │ + Replica│       │ + Replica│       │ + Replica│       │
│    └─────────┘       └─────────┘       └─────────┘       │
│                                                              │
│  Each shard:                                                 │
│  - Independent database                                      │
│  - Has its own primary and replicas                          │
│  - Contains subset of data                                   │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Cross-Shard Operations
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Cross-Shard Challenges                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Problem 1: JOINs across shards                              │
│  ┌─────────┐              ┌─────────┐                       │
│  │Shard 1  │              │Shard 2  │                       │
│  │User:123 │              │User:456 │                       │
│  │         │              │Order:1  │                       │
│  │         │              │Order:2  │                       │
│  └─────────┘              └─────────┘                       │
│                                                              │
│  Can't do: SELECT * FROM users JOIN orders ...              │
│  Because users and orders are on different shards!         │
│                                                              │
│  Solutions:                                                  │
│  1. Denormalize (copy user info to orders shard)             │
│  2. Application-level join (fetch from both, combine)        │
│  3. Avoid needing the query (different data model)          │
│                                                              │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Problem 2: Transactions across shards                     │
│  Can't do ACID transaction that spans multiple shards       │
│                                                              │
│  Solutions:                                                  │
│  1. Saga pattern (compensating transactions)                 │
│  2. Two-phase commit (slow, complex)                        │
│  3. Design to avoid cross-shard transactions                 │
│     (same shard key for related data)                       │
│                                                              │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Problem 3: Auto-increment IDs                               │
│  Can't use SERIAL across shards (would collide)             │
│                                                              │
│  Solutions:                                                  │
│  1. UUID (recommended)                                       │
│  2. Snowflake IDs (timestamp + shard + sequence)            │
│  3. Separate ID service                                      │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Consistent Hashing
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Consistent Hashing                              │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Problem: When adding shard with simple hash,              │
│  almost all data needs to move!                             │
│                                                              │
│  hash(key) % 3 = shard                                      │
│  hash(key) % 4 = ?  ← Almost all keys get new shard!       │
│                                                              │
│  Solution: Consistent Hashing                               │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐    │
│  │              Hash Ring (0 to 2^32-1)               │    │
│  │                                                      │    │
│  │     Shard A          Shard B         Shard C       │    │
│  │        ●                ●                ●           │    │
│  │       100             2000            4000         │    │
│  │        │                │                │           │    │
│  │   ┌────┴────┐      ┌────┴────┐      ┌────┴────┐     │    │
│  │   │ Keys    │      │ Keys    │      │ Keys    │     │    │
│  │   │0-100   │      │100-2000 │      │2000-4000│     │    │
│  │   │4000-max│      │         │      │         │     │    │
│  │   └─────────┘      └─────────┘      └─────────┘     │    │
│  │                                                      │    │
│  │  When adding Shard D at 3000:                        │    │
│  │  - Only keys between 2000-3000 move from B to D       │    │
│  │  - Everything else stays!                             │    │
│  └─────────────────────────────────────────────────────┘    │
│                                                              │
│  Benefits:                                                   │
│  - Minimal data movement when scaling                        │
│  - Virtual nodes reduce imbalance                            │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Re-sharding (Adding Shards)
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Re-sharding Process                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Phase 1: Setup                                              │
│  - Add new empty shard                                       │
│  - Update routing table (new shard gets no traffic yet)     │
│                                                              │
│  Phase 2: Data Migration                                     │
│  - Copy data from existing shards to new shard               │
│  - Track changes (CDC - Change Data Capture)                │
│  - Keep copying until caught up                             │
│                                                              │
│  Phase 3: Cutover                                            │
│  - Brief maintenance window or dual-write                  │
│  - Update routing: new keys go to new shard                 │
│  - Migrated keys now read from new shard                    │
│                                                              │
│  Phase 4: Cleanup                                            │
│  - Remove migrated data from old shards                     │
│  - Verify consistency                                        │
│                                                              │
│  Tools: pg_dump, logical replication, custom ETL             │
│                                                              │
└─────────────────────────────────────────────────────────────┘

When to Shard

Plain text
Don't shard too early! Start with:
┌─────────────────────────────────────────────────────────────┐
│  1. Vertical scaling (bigger server)                        │
│  2. Read replicas (offload reads)                          │
│  3. Partitioning (split large tables)                      │
│  4. Query optimization                                        │
│  5. Caching (Redis/Memcached)                               │
└─────────────────────────────────────────────────────────────┘

Shard when:
┌─────────────────────────────────────────────────────────────┐
│  - Single server can't handle write load                     │
│  - Data size exceeds single server storage                    │
│  - Need geographic data distribution                         │
│  - Already optimized everything else                         │
└─────────────────────────────────────────────────────────────┘

Next: Caching Strategies