InkdownInkdown
Start writing

Study

59 files·8 subfolders

Shared Workspace

Study
core

06-nosql-databases

Shared from "Study" on Inkdown

06 - NoSQL Databases

Why NoSQL?

While relational databases excel at structured data and complex relationships, they face challenges with:

  • Massive scale (billions of rows)
  • Flexible schemas (data structure varies)
  • High velocity writes (IoT, logs, events)
  • Geographic distribution (low latency worldwide)
  • Unstructured/semi-structured data (documents, graphs)

NoSQL Categories

Plain text
┌─────────────────────────────────────────────────────────────┐
│                     NoSQL Landscape                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐       │
│  │  Document    │  │    Key-Value │  │   Wide-Column│       │
│  │  MongoDB     │  │    Redis     │  │   Cassandra  │       │
│  │  Couchbase   │  │    DynamoDB  │  │   HBase      │       │
│  │  PostgreSQL  │  │    Riak      │  │   Bigtable   │       │
│  │  (JSONB)     │  │              │  │              │       │
│  └──────────────┘  └──────────────┘  └──────────────┘       │
│         │                 │                  │              │
│         │    ┌────────────┴────────────┐     │              │
│         │    │                         │     │              │
│         └────► Flexible schemas        │     │              │
│              │ JSON documents          │     │              │
│              └─────────────────────────┘     │              │
│                                              │              │
│  ┌──────────────┐  ┌──────────────┐         │              │
│  │    Graph     │  │  Time-Series │  ┌───────┘              │
│  │   Neo4j      │  │ InfluxDB     │  │ Massive scale        │
│  │  Neptune     │  │ TimescaleDB  │  │ High throughput       │
│  │  ArangoDB    │  │   Prometheus │  │ Distributed           │
│  └──────────────┘  └──────────────┘  └──────────────────────┘
│       Relationships   Time-ordered data                      │
│                                                              │
└─────────────────────────────────────────────────────────────┘
programming-language-concepts.md
zero-language-explanation.md
DB
01-introduction.md
02-relational-databases.md
03-database-design.md
04-indexing.md
05-transactions-acid.md
06-nosql-databases.md
07-query-optimization.md
08-replication-ha.md
09-sharding-partitioning.md
10-caching-strategies.md
11-cap-theorem.md
12-connection-pooling.md
13-backup-recovery.md
14-monitoring.md
15-database-selection.md
README.md
JS
Event loop
Merlin Backend
01-Orchestration.md
02-DeepResearch.md
03-Search.md
04-Scraping.md
05-Streaming.md
06-MultiProviderLLM.md
07-MemoryAndContext.md
08-ErrorHandling.md
09-RateLimiting.md
10-TaskQueue.md
11-SecurityAndAuth.md
Orchestration-2nd-draft
OpenAI Agents Python
00_OVERVIEW.md
01_AGENT_SYSTEM.md
02_RUNNER_SYSTEM.md
03_TOOL_SYSTEM.md
04_ITEMS_SYSTEM.md
05_GUARDRAILS.md
06_HANDOFFS.md
07_MEMORY_SESSIONS.md
08_MODEL_PROVIDERS.md
09_SANDBOX_SYSTEM.md
10_TRACING.md
11_RUN_STATE.md
12_CONTEXT.md
13_LIFECYCLE_HOOKS.md
14_CONFIGURATION.md
15_ERROR_HANDLING.md
16_STREAMING.md
17_EXTENSIONS.md
18_MCP_INTEGRATION.md
19_BEST_PRACTICES.md
20_ARCHITECTURE_PATTERNS.md
opencode-study
context-handling
core
Python
Alembic
Basics
sqlalchemy - fastapi
SQLAlchemy overview
tweets
system_design_for_agentic_apps.md

1. Document Databases

Concept

Store data as self-contained documents (usually JSON), with flexible schemas.

JavaScript
// MongoDB Document Example
{
    "_id": ObjectId("..."),
    "email": "john@example.com",
    "name": "John Doe",
    "address": {
        "street": "123 Main St",
        "city": "NYC",
        "zip": "10001"
    },
    "orders": [
        { "id": "ord1", "total": 150, "items": [...] },
        { "id": "ord2", "total": 75, "items": [...] }
    ],
    "tags": ["premium", "verified"],
    "metadata": {
        "signup_source": "google",
        "referrer": "friend"
    },
    "created_at": ISODate("2024-01-15")
}
When to Use
  • Data has nested structure
  • Schema evolves frequently
  • Read-heavy with few joins
  • Content management, catalogs, user profiles
MongoDB Query Examples
JavaScript
// Find documents
db.users.find({ "email": "john@example.com" })

// Nested query
db.users.find({ "address.city": "NYC" })

// Array query
db.users.find({ "tags": "premium" })
db.users.find({ "orders.total": { $gt: 100 } })

// Aggregation pipeline
db.orders.aggregate([
    { $match: { status: "completed" } },
    { $group: { 
        _id: "$user_id", 
        total: { $sum: "$amount" } 
    }},
    { $sort: { total: -1 } },
    { $limit: 10 }
])

// Update with nested fields
db.users.updateOne(
    { "_id": ObjectId("...") },
    { 
        $set: { "address.city": "Boston" },
        $push: { "tags": "loyal" }
    }
)
Schema Design Patterns
JavaScript
// Pattern 1: Embedding (Denormalized)
// Good: Read together, limited size
{
    "user_id": 1,
    "name": "John",
    "addresses": [
        { "type": "home", "city": "NYC" },
        { "type": "work", "city": "LA" }
    ]
}

// Pattern 2: Referencing (Normalized)
// Good: Large arrays, frequent independent updates
{
    "_id": 1,
    "name": "John",
    // Store only IDs, fetch separately
    "order_ids": [101, 102, 103]
}

// Pattern 3: Hybrid
// Frequently accessed data embedded
// Less frequently accessed referenced
{
    "_id": 1,
    "name": "John",
    "recent_orders": [  // Embedded (last 5)
        { "id": 103, "total": 150, "date": "..." }
    ],
    "total_order_count": 25,  // Computed
    "all_order_ids": [101, 102, 103, ...]  // All orders referenced
}

// Pattern 4: Bucketing (Time-series)
// One document per time period
{
    "sensor_id": "temp-1",
    "date": "2024-01-15",
    "readings": [
        { "hour": 0, "temp": 20.5 },
        { "hour": 1, "temp": 20.3 },
        // ... 24 readings
    ]
}

2. Key-Value Stores

Concept

Simplest data model: key → value. Lightning fast, highly scalable.

Plain text
┌─────────────────────────────────────────┐
│           Key-Value Store               │
├─────────────────────────────────────────┤
│                                         │
│  Key              │ Value               │
│  ─────────────────┼────────────────     │
│  user:100        │ {name: "John", ...}  │
│  session:abc123  │ {user_id: 100, ...}  │
│  cache:page:home │ <html>...</html>     │
│  counter:visits  │ 15243                │
│  queue:tasks     │ [task1, task2, ...]  │
│                                         │
└─────────────────────────────────────────┘
Redis Examples
Bash
# Strings
SET user:100:name "John"
GET user:100:name
INCR counter:visits
SETEX session:token 3600 "user_data"  # Expires in 1 hour

# Hashes (like objects)
HSET user:100 name "John" email "john@example.com" age 30
HGET user:100 name
HGETALL user:100

# Lists (ordered, duplicates allowed)
LPUSH queue:tasks "send_email"
RPUSH queue:tasks "process_payment"
LRANGE queue:tasks 0 10  # Get first 10
LPOP queue:tasks  # Remove and return first

# Sets (unordered, unique)
SADD user:100:tags "premium" "verified"
SMEMBERS user:100:tags
SISMEMBER user:100:tags "premium"

# Sorted Sets (ordered by score)
ZADD leaderboard 1000 "player1"
ZADD leaderboard 1500 "player2"
ZREVRANGE leaderboard 0 10 WITHSCORES  # Top 10

# Pub/Sub
SUBSCRIBE notifications
PUBLISH notifications "New order received"
Use Cases
  • Caching: Session storage, page cache, query results
  • Rate Limiting: Sliding window counters
  • Real-time: Leaderboards, live updates
  • Queues: Task queues, message passing
  • Pub/Sub: Real-time notifications

3. Wide-Column Stores (Column-Family)

Concept

Store data in columns rather than rows. Excellent for massive scale and high throughput.

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Wide-Column Structure                          │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Row Key │ Column Family: "profile" │ Column Family: "posts"│
│  ────────┼──────────────────────────┼───────────────────────│
│  user:1  │ name: "John"            │ post:1: "Hello"       │
│          │ email: "john@..."       │ post:2: "World"       │
│          │ age: 30                 │                       │
│  ────────┼──────────────────────────┼───────────────────────│
│  user:2  │ name: "Jane"            │ post:1: "First post"  │
│          │ city: "NYC"             │                       │
│          │                         │                       │
│                                                              │
│  Each row can have different columns!                        │
│  Column families stored separately (efficient)               │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Cassandra Data Model
Sql
-- Keyspace (like database)
CREATE KEYSPACE ecommerce 
WITH replication = {
    'class': 'SimpleStrategy', 
    'replication_factor': 3
};

-- Table design: Query-first approach
-- Design tables for specific queries, not entities

-- Query: Get orders by user, ordered by date
CREATE TABLE orders_by_user (
    user_id UUID,
    order_date TIMESTAMP,
    order_id UUID,
    total DECIMAL,
    status TEXT,
    PRIMARY KEY (user_id, order_date, order_id)
) WITH CLUSTERING ORDER BY (order_date DESC);

-- Query it
SELECT * FROM orders_by_user 
WHERE user_id = ? 
ORDER BY order_date DESC 
LIMIT 10;

-- Query: Get order by ID (different table!)
-- In Cassandra, denormalize rather than join
CREATE TABLE orders_by_id (
    order_id UUID PRIMARY KEY,
    user_id UUID,
    order_date TIMESTAMP,
    total DECIMAL,
    items LIST<FROZEN<map<TEXT, TEXT>>>
);
Cassandra Principles
  1. Query-first design: Create tables for each query pattern
  2. Denormalize: Data duplication is OK for read performance
  3. Partition key: Determines which node stores data
  4. Clustering columns: Sort order within partition
  5. No JOINs: Application handles relationships

4. Graph Databases

Concept

Store entities (nodes) and relationships (edges) as first-class citizens.

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Graph Structure                                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│          (Person: Alice)                                    │
│                │                                              │
│                │ KNOWS {since: "2020"}                       │
│                ▼                                              │
│          (Person: Bob) ──WORKS_AT──► (Company: TechCorp)    │
│                │                               ▲              │
│                │ KNOWS                          │              │
│                ▼                               │              │
│          (Person: Carol) ──WORKS_AT───────────┘              │
│                                                              │
│  Nodes: People, Companies, Products                         │
│  Edges: KNOWS, WORKS_AT, BOUGHT, LOCATED_IN                 │
│  Both can have properties                                   │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Neo4j Cypher Query Language
Cypher
// Create nodes and relationship
CREATE (a:Person {name: "Alice", age: 30})
CREATE (b:Person {name: "Bob", age: 25})
CREATE (c:Company {name: "TechCorp"})
CREATE (a)-[:KNOWS {since: "2020"}]->(b)
CREATE (b)-[:WORKS_AT {role: "Engineer"}]->(c)

// Find Alice's friends
MATCH (alice:Person {name: "Alice"})-[:KNOWS]->(friend)
RETURN friend.name

// Find friends of friends
MATCH (alice:Person {name: "Alice"})-[:KNOWS*2]->(fof)
RETURN fof.name

// Find colleagues (people working at same company)
MATCH (p1:Person)-[:WORKS_AT]->(c:Company)<-[:WORKS_AT]-(p2:Person)
WHERE p1.name = "Bob" AND p1 <> p2
RETURN p2.name

// Shortest path between two people
MATCH path = shortestPath(
    (a:Person {name: "Alice"})-[:KNOWS*]-(b:Person {name: "Carol"})
)
RETURN path
When to Use Graph DB
  • Social networks: Friend recommendations, influence analysis
  • Fraud detection: Transaction patterns, ring detection
  • Recommendation engines: "People who bought X also bought Y"
  • Network/IT operations: Dependency analysis, impact analysis
  • Knowledge graphs: Entity relationships, semantic search

5. Time-Series Databases

Concept

Optimized for time-stamped data: metrics, events, IoT sensor readings.

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Time-Series Data                               │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Timestamp           │ Metric          │ Value    │ Tags   │
│  ────────────────────┼─────────────────┼──────────┼────────│
│  2024-01-15 10:00  │ cpu.usage       │ 45.2     │ host:A │
│  2024-01-15 10:01  │ cpu.usage       │ 47.1     │ host:A │
│  2024-01-15 10:02  │ cpu.usage       │ 44.8     │ host:A │
│  2024-01-15 10:00  │ memory.usage    │ 78.5     │ host:A │
│  2024-01-15 10:00  │ cpu.usage       │ 52.3     │ host:B │
│                                                              │
│  Characteristics:                                            │
│  - Massive write volume (millions of points/second)         │
│  - Mostly append-only (recent data)                         │
│  - Queries over time ranges                                 │
│  - Automatic downsampling/retention                         │
│                                                              │
└─────────────────────────────────────────────────────────────┘
InfluxDB Example
Sql
-- Write data
INSERT cpu_usage,host=serverA,region=us-west value=45.2 1642233600000000000
INSERT cpu_usage,host=serverA,region=us-west value=47.1 1642233660000000000

-- Query with time range
SELECT mean(value) FROM cpu_usage 
WHERE host = 'serverA' 
AND time > now() - 1h 
GROUP BY time(5m)

-- Continuous query (auto-aggregate)
CREATE CONTINUOUS QUERY "cq_5m" ON "mydb"
BEGIN
    SELECT mean(value) INTO "cpu_usage_5m" 
    FROM "cpu_usage" 
    GROUP BY time(5m), host
END

-- Retention policy (auto-delete old data)
CREATE RETENTION POLICY "one_week" 
ON "mydb" 
DURATION 7d 
REPLICATION 1 
DEFAULT

NoSQL vs SQL: When to Choose

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Choose SQL When:                               │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ✓ ACID compliance is critical (financial transactions)     │
│  ✓ Complex relationships and joins                         │
│  ✓ Schema is stable and well-defined                       │
│  ✓ Need complex aggregations and reporting                   │
│  ✓ Data integrity constraints are important                  │
│  ✓ Team is familiar with SQL                               │
│                                                              │
├─────────────────────────────────────────────────────────────┤
│              Choose NoSQL When:                               │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ✓ Massive scale (billions of records)                       │
│  ✓ Schema is fluid or evolving                             │
│  ✓ Need horizontal scalability                             │
│  ✓ High write throughput (events, logs)                    │
│  ✓ Geographic distribution (low latency)                   │
│  ✓ Specific data model fits use case (graph, time-series)  │
│  ✓ Rapid prototyping with flexible schema                  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Polyglot Persistence

Modern architectures often use multiple databases:

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Typical Architecture                           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│   ┌──────────┐   ┌──────────┐   ┌──────────┐                │
│   │  Redis   │   │PostgreSQL│   │   S3     │                │
│   │  (Cache) │   │ (Primary)│   │ (Files)  │                │
│   └────┬─────┘   └────┬─────┘   └──────────┘                │
│        │              │                                      │
│        └──────────────┘                                      │
│               │                                              │
│               ▼                                              │
│        ┌──────────────┐                                      │
│        │  Application │                                      │
│        └──────┬───────┘                                      │
│               │                                              │
│       ┌───────┴───────┐                                      │
│       ▼               ▼                                      │
│   ┌─────────┐   ┌──────────┐                                │
│   │Elasticsearch│ │ ClickHouse │                                │
│   │ (Search)  │   │ (Analytics)│                                │
│   └─────────┘   └──────────┘                                │
│                                                              │
│   Also common:                                               │
│   - MongoDB for content/catalog                              │
│   - InfluxDB/Prometheus for metrics                          │
│   - Neo4j for recommendations                                │
│   - Kafka for event streaming                                │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Next: Query Optimization