InkdownInkdown
Start writing

Study

59 files·8 subfolders

Shared Workspace

Study
core

15-database-selection

Shared from "Study" on Inkdown

15 - Choosing the Right Database

The Database Selection Process

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Database Selection Framework                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Understand Requirements                                  │
│     ├── Data structure (structured, semi-structured, blob) │
│     ├── Query patterns (OLTP, OLAP, search, time-series)     │
│     ├── Scale (data size, QPS, growth rate)                │
│     ├── Consistency needs (ACID vs eventual)               │
│     ├── Latency requirements (ms vs seconds)               │
│     ├── Geographic distribution (single region vs global)  │
│     └── Team expertise                                       │
│                                                              │
│  2. Evaluate Options                                         │
│     ├── Fit for use case                                     │
│     ├── Operational complexity                               │
│     ├── Ecosystem and tooling                                │
│     ├── Cost (license, hardware, managed service)          │
│     └── Community/support                                     │
│                                                              │
│  3. Prototype & Validate                                     │
│     ├── Load testing with realistic data                   │
│     ├── Failover testing                                     │
│     ├── Backup/restore testing                               │
│     └── Migration path validation                           │
│                                                              │
│  4. Decision & Documentation                                 │
│     ├── Document decision rationale                          │
│     ├── Define success metrics                               │
│     └── Plan for future re-evaluation                       │
│                                                              │
└─────────────────────────────────────────────────────────────┘
programming-language-concepts.md
zero-language-explanation.md
DB
01-introduction.md
02-relational-databases.md
03-database-design.md
04-indexing.md
05-transactions-acid.md
06-nosql-databases.md
07-query-optimization.md
08-replication-ha.md
09-sharding-partitioning.md
10-caching-strategies.md
11-cap-theorem.md
12-connection-pooling.md
13-backup-recovery.md
14-monitoring.md
15-database-selection.md
README.md
JS
Event loop
Merlin Backend
01-Orchestration.md
02-DeepResearch.md
03-Search.md
04-Scraping.md
05-Streaming.md
06-MultiProviderLLM.md
07-MemoryAndContext.md
08-ErrorHandling.md
09-RateLimiting.md
10-TaskQueue.md
11-SecurityAndAuth.md
Orchestration-2nd-draft
OpenAI Agents Python
00_OVERVIEW.md
01_AGENT_SYSTEM.md
02_RUNNER_SYSTEM.md
03_TOOL_SYSTEM.md
04_ITEMS_SYSTEM.md
05_GUARDRAILS.md
06_HANDOFFS.md
07_MEMORY_SESSIONS.md
08_MODEL_PROVIDERS.md
09_SANDBOX_SYSTEM.md
10_TRACING.md
11_RUN_STATE.md
12_CONTEXT.md
13_LIFECYCLE_HOOKS.md
14_CONFIGURATION.md
15_ERROR_HANDLING.md
16_STREAMING.md
17_EXTENSIONS.md
18_MCP_INTEGRATION.md
19_BEST_PRACTICES.md
20_ARCHITECTURE_PATTERNS.md
opencode-study
context-handling
core
Python
Alembic
Basics
sqlalchemy - fastapi
SQLAlchemy overview
tweets
system_design_for_agentic_apps.md

Decision Tree

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Quick Decision Guide                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Q1: Is your data structure well-defined and relational?   │
│                                                              │
│  ├─ YES ──► Q2: Do you need complex transactions & joins?   │
│  │          │                                                │
│  │          ├─ YES ──► Q3: Need horizontal scaling?          │
│  │          │          │                                     │
│  │          │          ├─ NO ──► PostgreSQL / MySQL         │
│  │          │          │          (Standard choice)          │
│  │          │          │                                     │
│  │          │          └─ YES ──► CockroachDB / YugabyteDB  │
│  │          │                     (Distributed SQL)          │
│  │          │                                                │
│  │          └─ NO ──► Q4: Is it simple key-value access?    │
│  │                     │                                     │
│  │                     ├─ YES ──► Redis / DynamoDB         │
│  │                     │                                     │
│  │                     └─ NO ──► Continue below...           │
│  │                                                           │
│  └─ NO ──► Q5: What is the primary data pattern?             │
│             │                                                │
│             ├─ Documents with varying schema ──► MongoDB    │
│             │                                                │
│             ├─ Time-series data ──► TimescaleDB / InfluxDB  │
│             │                                                │
│             ├─ Graph relationships ──► Neo4j / Neptune       │
│             │                                                │
│             ├─ Search/indexing ──► Elasticsearch / OpenSearch│
│             │                                                │
│             ├─ Wide-column, massive scale ──► Cassandra     │
│             │                                                │
│             └─ Cache/sessions/rate limiting ──► Redis       │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Database Comparison

Relational Databases
Plain text
┌─────────────────────────────────────────────────────────────┐
│              PostgreSQL                                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Best for:                                                   │
│  - Complex queries and analytics                             │
│  - Full ACID compliance                                      │
│  - Geographic data (PostGIS)                                 │
│  - JSON support with indexing (JSONB)                        │
│  - Advanced features (window functions, CTEs)                │
│                                                              │
│  Strengths:                                                   │
│  - Most feature-rich open-source RDBMS                       │
│  - Excellent extensibility (extensions)                     │
│  - Strong consistency and reliability                         │
│  - Great documentation and community                          │
│                                                              │
│  Considerations:                                              │
│  - Single-node writes (use read replicas for scale)          │
│  - Learning curve for advanced features                       │
│                                                              │
│  When to choose: Your default choice for relational data     │
│                                                              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              MySQL / MariaDB                                 │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Best for:                                                   │
│  - Web applications (LAMP stack)                             │
│  - Read-heavy workloads                                      │
│  - Simple replication scenarios                              │
│                                                              │
│  Strengths:                                                   │
│  - Wide hosting support                                       │
│  - Simple to get started                                      │
│  - Good read scaling with replication                         │
│  - Large ecosystem (WordPress, etc.)                          │
│                                                              │
│  Considerations:                                              │
│  - MySQL 8.0 has improved significantly                     │
│  - Choose InnoDB storage engine                               │
│                                                              │
│  When to choose: Legacy systems, web hosting, team familiar  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              CockroachDB / YugabyteDB                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Best for:                                                   │
│  - Globally distributed SQL                                  │
│  - Horizontal write scaling                                  │
│  - Multi-region deployments                                   │
│                                                              │
│  Strengths:                                                   │
│  - PostgreSQL-compatible protocol                             │
│  - Automatic sharding and replication                         │
│  - Survive node/region failures                               │
│  - Consistent distributed transactions                         │
│                                                              │
│  Considerations:                                              │
│  - Higher write latency (consensus)                          │
│  - Complex to operate self-hosted                            │
│  - CockroachDB: Managed (CockroachCloud) recommended         │
│                                                              │
│  When to choose: Need SQL + global scale + availability      │
│                                                              │
└─────────────────────────────────────────────────────────────┘
NoSQL Databases
Plain text
┌─────────────────────────────────────────────────────────────┐
│              MongoDB                                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Best for:                                                   │
│  - Rapidly evolving schemas                                  │
│  - Document-oriented data                                     │
│  - Content management, catalogs, user profiles               │
│  - JavaScript/JSON-native stack                              │
│                                                              │
│  Strengths:                                                   │
│  - Flexible document model                                     │
│  - Strong query capabilities (aggregation pipeline)          │
│  - Horizontal scaling with sharding                          │
│  - Rich indexing options                                      │
│                                                              │
│  Considerations:                                              │
│  - Transactions available but have limitations               │
│  - Schema design requires careful thought                      │
│  - Memory-mapped files can be memory-intensive               │
│                                                              │
│  When to choose: Flexible schema, document data, JS stack    │
│                                                              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              Redis                                           │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Best for:                                                   │
│  - Caching (primary use case)                                 │
│  - Session storage                                            │
│  - Real-time leaderboards/rankings                           │
│  - Rate limiting                                              │
│  - Pub/sub messaging                                          │
│  - Temporary data                                             │
│                                                              │
│  Strengths:                                                   │
│  - In-memory = extremely fast (microseconds)                 │
│  - Rich data structures (strings, hashes, lists, sets, etc.) │
│  - Pub/sub built-in                                          │
│  - Atomic operations                                          │
│                                                              │
│  Considerations:                                              │
│  - Data fits in memory (or use Redis on Flash)              │
│  - Persistence options (RDB snapshots, AOF)                  │
│  - Not for primary data store (unless using Redis Enterprise)│
│                                                              │
│  When to choose: Caching, real-time features, temporary data │
│                                                              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              Cassandra                                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Best for:                                                   │
│  - Write-heavy workloads at massive scale                    │
│  - Time-series data                                            │
│  - Globally distributed data                                   │
│  - High availability (AP system)                             │
│                                                              │
│  Strengths:                                                   │
│  - Linear scalability (add nodes = add capacity)            │
│  - No single point of failure                                 │
│  - Tunable consistency                                         │
│  - Excellent cross-datacenter replication                     │
│                                                              │
│  Considerations:                                              │
│  - Eventual consistency model                                  │
│  - Query-first data modeling (not entity-first)               │
│  - No joins, limited transactions                              │
│  - Operational complexity                                      │
│                                                              │
│  When to choose: Massive write scale, global distribution,   │
│  high availability is paramount                              │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Specialized Databases
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Elasticsearch / OpenSearch                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Best for:                                                   │
│  - Full-text search                                            │
│  - Log aggregation and analysis                                │
│  - Faceted search (e-commerce)                               │
│  - Real-time analytics                                         │
│                                                              │
│  Strengths:                                                   │
│  - Powerful full-text search                                   │
│  - Distributed by design                                       │
│  - Schema-free (dynamic mapping)                               │
│  - Aggregation queries                                         │
│                                                              │
│  Considerations:                                              │
│  - Not ACID-compliant (not a primary store)                  │
│  - Memory intensive                                            │
│  - Split-brain issues in older versions                       │
│                                                              │
│  When to choose: Search, logging, analytics on text data      │
│                                                              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              Time-Series Databases                           │
│              (TimescaleDB, InfluxDB, ClickHouse)             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Best for:                                                   │
│  - IoT sensor data                                             │
│  - Metrics and monitoring data                                 │
│  - Financial tick data                                         │
│  - Event tracking                                              │
│                                                              │
│  TimescaleDB (PostgreSQL extension):                         │
│  - SQL interface                                               │
│  - Automatic partitioning                                       │
│  - Continuous aggregation                                       │
│  - Data retention policies                                      │
│                                                              │
│  InfluxDB:                                                     │
│  - Purpose-built for time-series                               │
│  - InfluxQL or Flux query languages                             │
│  - Built-in retention and downsampling                          │
│                                                              │
│  ClickHouse:                                                   │
│  - Columnar storage (OLAP)                                     │
│  - Extremely fast aggregations                                 │
│  - Good for analytics workloads                                │
│                                                              │
│  When to choose: Time-ordered data, metrics, analytics        │
│                                                              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              Graph Databases (Neo4j)                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Best for:                                                   │
│  - Relationship-heavy data                                     │
│  - Recommendation engines                                      │
│  - Fraud detection                                             │
│  - Social networks                                             │
│  - Knowledge graphs                                            │
│                                                              │
│  Strengths:                                                   │
│  - Native graph storage and processing                         │
│  - Efficient traversal of relationships                        │
│  - Pattern matching queries (Cypher)                           │
│                                                              │
│  Considerations:                                              │
│  - Specialized use case                                        │
│  - Not a general-purpose database                               │
│                                                              │
│  When to choose: Data is relationships, path finding needed  │
│                                                              │
└─────────────────────────────────────────────────────────────┘

When to Use Multiple Databases

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Polyglot Persistence Example                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  E-commerce Application:                                     │
│                                                              │
│  ┌─────────────┐                                              │
│  │PostgreSQL   │  Primary data: users, orders, inventory      │
│  │             │  ACID transactions, complex queries           │
│  └──────┬──────┘                                              │
│         │                                                     │
│  ┌──────┴──────┐                                              │
│  │    Redis    │  Sessions, shopping cart (temporary)         │
│  │             │  Hot product cache                            │
│  └─────────────┘                                              │
│                                                              │
│  ┌─────────────┐                                              │
│  │Elasticsearch│  Product search, autocomplete                  │
│  │             │  Faceted filtering                            │
│  └─────────────┘                                              │
│                                                              │
│  ┌─────────────┐                                              │
│  │ ClickHouse  │  Analytics: sales reports, user behavior       │
│  │             │  Aggregated from PostgreSQL                   │
│  └─────────────┘                                              │
│                                                              │
│  Data Flow:                                                   │
│  - Writes go to PostgreSQL                                    │
│  - Redis caches hot data                                      │
│  - Elasticsearch indexes products                             │
│  - CDC (Change Data Capture) to ClickHouse for analytics      │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Migration Considerations

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Database Migration Checklist                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Before Migration:                                            │
│  □ Document current pain points                             │
│  □ Define success criteria                                    │
│  □ Benchmark target database with production-like data       │
│  □ Test failover and backup procedures                        │
│  □ Train team on new database                                 │
│  □ Plan rollback strategy                                     │
│                                                              │
│  During Migration:                                            │
│  □ Dual-write period (old + new)                              │
│  □ Data consistency verification                              │
│  □ Performance monitoring                                     │
│  □ Gradual traffic shifting                                   │
│  □ Quick rollback capability                                    │
│                                                              │
│  After Migration:                                             │
│  □ Monitor for edge cases                                     │
│  □ Optimize based on real workload                            │
│  □ Document lessons learned                                   │
│  □ Retire old system (after confidence period)                │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Default Recommendations

Plain text
┌─────────────────────────────────────────────────────────────┐
│              "I need a database for..."                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  General web application:                                     │
│  → PostgreSQL (default choice)                                │
│                                                              │
│  High write volume, global scale:                             │
│  → Cassandra or CockroachDB                                     │
│                                                              │
│  Cache/session store:                                         │
│  → Redis                                                      │
│                                                              │
│  Search functionality:                                        │
│  → Elasticsearch                                              │
│                                                              │
│  Time-series/metrics:                                         │
│  → TimescaleDB or InfluxDB                                    │
│                                                              │
│  Document store with flexible schema:                         │
│  → MongoDB                                                    │
│                                                              │
│  Graph relationships:                                         │
│  → Neo4j                                                      │
│                                                              │
│  Not sure / prototyping:                                      │
│  → PostgreSQL (can handle most needs initially)              │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Final Advice

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Golden Rules                                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Start simple                                               │
│     - PostgreSQL can handle most use cases initially         │
│     - Add specialized databases when you hit real limits       │
│                                                              │
│  2. Optimize before scaling                                    │
│     - Indexes, query optimization, caching first            │
│     - Then read replicas                                       │
│     - Then partitioning                                          │
│     - Then sharding/specialized DBs                            │
│                                                              │
│  3. Operational expertise matters                              │
│     - Can you run it in production?                           │
│     - Managed services reduce operational burden               │
│                                                              │
│  4. Avoid premature optimization                               │
│     - Don't shard on day 1                                      │
│     - Don't use 5 different databases for simple app          │
│                                                              │
│  5. Plan for the future                                        │
│     - Migration path to scale                                   │
│     - Data export capabilities                                  │
│     - Avoid vendor lock-in when possible                        │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Summary: End-to-End Database Knowledge

You now have a comprehensive understanding of databases from the ground up:

  1. Foundations - What databases are, types, ACID
  2. SQL - Querying, design, normalization
  3. Performance - Indexing, optimization, caching
  4. Scale - Replication, sharding, partitioning
  5. Operations - Backups, monitoring, selection

Remember: The best database is the one that fits your specific needs while your team can operate it effectively.

Keep learning, keep experimenting, and happy data managing!