InkdownInkdown
Start writing

Study

59 filesยท8 subfolders

Shared Workspace

Study
core

01-introduction

Shared from "Study" on Inkdown

01 - Introduction to Databases

What is a Database?

At its core, a database is an organized collection of data that can be easily accessed, managed, and updated. Think of it as a digital filing system that's smarter, faster, and more reliable than storing data in files.

Why Not Just Use Files?
Python
# Without database - storing in files
# Problems:
# 1. Data duplication (John's email in 10 places)
# 2. No concurrent access (two people can't write simultaneously)
# 3. Searching is slow (read entire file)
# 4. No data validation (can insert "invalid" data)
# 5. No relationships between data

users.txt:
---
id|name|email|department_id
1|John|john@email.com|1
2|Jane|jane@email.com|2
3|Bob|bob@email.com|1

# To find John's department name:
# 1. Read users.txt to get department_id=1
# 2. Read departments.txt to find id=1
# 3. This is slow and error-prone
programming-language-concepts.md
zero-language-explanation.md
DB
01-introduction.md
02-relational-databases.md
03-database-design.md
04-indexing.md
05-transactions-acid.md
06-nosql-databases.md
07-query-optimization.md
08-replication-ha.md
09-sharding-partitioning.md
10-caching-strategies.md
11-cap-theorem.md
12-connection-pooling.md
13-backup-recovery.md
14-monitoring.md
15-database-selection.md
README.md
JS
Event loop
Merlin Backend
01-Orchestration.md
02-DeepResearch.md
03-Search.md
04-Scraping.md
05-Streaming.md
06-MultiProviderLLM.md
07-MemoryAndContext.md
08-ErrorHandling.md
09-RateLimiting.md
10-TaskQueue.md
11-SecurityAndAuth.md
Orchestration-2nd-draft
OpenAI Agents Python
00_OVERVIEW.md
01_AGENT_SYSTEM.md
02_RUNNER_SYSTEM.md
03_TOOL_SYSTEM.md
04_ITEMS_SYSTEM.md
05_GUARDRAILS.md
06_HANDOFFS.md
07_MEMORY_SESSIONS.md
08_MODEL_PROVIDERS.md
09_SANDBOX_SYSTEM.md
10_TRACING.md
11_RUN_STATE.md
12_CONTEXT.md
13_LIFECYCLE_HOOKS.md
14_CONFIGURATION.md
15_ERROR_HANDLING.md
16_STREAMING.md
17_EXTENSIONS.md
18_MCP_INTEGRATION.md
19_BEST_PRACTICES.md
20_ARCHITECTURE_PATTERNS.md
opencode-study
context-handling
core
Python
Alembic
Basics
sqlalchemy - fastapi
SQLAlchemy overview
tweets
system_design_for_agentic_apps.md

Types of Data Storage

1. Flat Files (CSV, JSON, XML)
  • Good for: Small datasets, simple configurations
  • Bad for: Concurrent access, complex queries, relationships
2. Relational Databases (SQL)
  • Good for: Structured data, complex relationships, ACID transactions
  • Examples: PostgreSQL, MySQL, Oracle, SQL Server
3. NoSQL Databases
  • Good for: Unstructured/semi-structured data, massive scale, rapid iteration
  • Examples: MongoDB, Cassandra, Redis, DynamoDB
4. Specialized Databases
  • Time-series: InfluxDB, TimescaleDB (metrics, IoT)
  • Graph: Neo4j, Amazon Neptune (relationships)
  • Search: Elasticsearch, OpenSearch (full-text search)
  • Vector: Pinecone, Weaviate (AI embeddings)

The Database Management System (DBMS)

A DBMS is software that interacts with end-users, applications, and the database itself to capture and analyze data. It provides:

Plain text
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚           Application               โ”‚
โ”‚      (Your code / Backend)          โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
              โ”‚ SQL/Query
              โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚      Database Management System     โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚   Query Parser & Optimizer  โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚   Execution Engine            โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚   Storage Engine              โ”‚  โ”‚
โ”‚  โ”‚   (B-trees, LSM-trees, etc.)  โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ”‚  โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”  โ”‚
โ”‚  โ”‚   Transaction Manager         โ”‚  โ”‚
โ”‚  โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
              โ”‚
              โ–ผ
โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
โ”‚         Physical Storage            โ”‚
โ”‚    (Disk / SSD / Memory / Network)  โ”‚
โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜

Key Components Explained

1. Storage Engine

The heart of the database - decides HOW data is stored and retrieved.

B-Tree Engine (InnoDB, PostgreSQL)

Plain text
         [50]
        /    \
    [20,30]  [60,70,80]
    /   |     /   |   \
 [10] [25] [55] [65] [75] [85]
 
# Balanced tree structure
# Good for: Read-heavy, range queries
# Every operation: O(log n)

LSM-Tree Engine (Cassandra, RocksDB)

Plain text
# Write goes to in-memory first (memtable)
# Then flushed to disk as immutable SSTables
# Good for: Write-heavy workloads
# Trade-off: Reads might need to check multiple files
2. Query Processor
Sql
-- Your SQL query
SELECT * FROM users WHERE age > 25 AND city = 'NYC';

-- Steps:
-- 1. Parse: Convert to internal representation (AST)
-- 2. Validate: Do tables/columns exist? Permissions?
-- 3. Optimize: What's the best way to execute?
-- 4. Execute: Run the plan
3. Transaction Manager

Ensures data integrity when multiple operations happen together.

Database Architecture Patterns

1. Single Node (Monolith)
Plain text
App โ”€โ”€โ–บ Single Database Server
  • Simplest setup
  • Good for: Startups, small applications (< 1000 users)
  • Limits: CPU, memory, disk on one machine
2. Primary-Replica (Read Replicas)
Plain text
        โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
        โ”‚   Primary    โ”‚โ—„โ”€โ”€โ”€โ”€ Writes
        โ”‚   (Master)   โ”‚
        โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
               โ”‚ Replication
       โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
       โ–ผ       โ–ผ       โ–ผ
   โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ” โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”
   โ”‚Replicaโ”‚ โ”‚Replicaโ”‚ โ”‚Replicaโ”‚โ—„โ”€โ”€โ”€โ”€ Reads
   โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜ โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
  • Primary handles writes
  • Replicas handle reads
  • Good for: Read-heavy applications
3. Sharded Architecture
Plain text
                    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
                    โ”‚ Router  โ”‚
                    โ””โ”€โ”€โ”€โ”€โ”ฌโ”€โ”€โ”€โ”€โ”˜
         โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”ผโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
         โ–ผ               โ–ผ               โ–ผ
    โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”     โ”Œโ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”
    โ”‚ Shard 1 โ”‚     โ”‚ Shard 2 โ”‚     โ”‚ Shard 3 โ”‚
    โ”‚Users A-Gโ”‚     โ”‚Users H-Pโ”‚     โ”‚Users Q-Zโ”‚
    โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜     โ””โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”€โ”˜
  • Data partitioned across multiple servers
  • Each shard holds a subset of data
  • Good for: Very large datasets, high write throughput

Database Selection Flowchart

Plain text
Is your data structure known upfront?
โ”‚
โ”œโ”€ YES โ”€โ”€โ–บ Need complex relationships & joins?
โ”‚          โ”‚
โ”‚          โ”œโ”€ YES โ”€โ”€โ–บ PostgreSQL / MySQL (Relational)
โ”‚          โ”‚
โ”‚          โ””โ”€ NO โ”€โ”€โ–บ Redis / DynamoDB (Key-Value)
โ”‚
โ””โ”€ NO โ”€โ”€โ–บ Need to store massive unstructured data?
          โ”‚
          โ”œโ”€ YES โ”€โ”€โ–บ MongoDB / Couchbase (Document)
          โ”‚
          โ””โ”€ NO โ”€โ”€โ–บ Need to handle time-series data?
                    โ”‚
                    โ”œโ”€ YES โ”€โ”€โ–บ InfluxDB / TimescaleDB
                    โ”‚
                    โ””โ”€ NO โ”€โ”€โ–บ Need graph relationships?
                              โ”‚
                              โ”œโ”€ YES โ”€โ”€โ–บ Neo4j
                              โ”‚
                              โ””โ”€ NO โ”€โ”€โ–บ Start with PostgreSQL

Important Terminology

TermDefinitionAnalogy
SchemaStructure/blueprint of databaseHouse floor plan
Table/CollectionGroup of related dataExcel sheet
Row/DocumentSingle recordOne row in Excel
Column/FieldAttribute of dataColumn in Excel
IndexData structure for fast lookupBook index
Primary KeyUnique identifier for each rowSocial Security Number
Foreign KeyReference to another table's PKPointer to related data
QueryRequest for dataAsking a question
TransactionGroup of operations treated as oneBank transfer (debit + credit)

Performance Metrics to Know

  1. Throughput: Queries per second (QPS) / Transactions per second (TPS)
  2. Latency: Time for a single query (p50, p95, p99 percentiles)
  3. Concurrency: Number of simultaneous connections
  4. Storage: Disk space used, I/O operations per second (IOPS)

Next Steps

Now let's dive deeper into each topic:

  • 02 - Relational Databases & SQL
  • 03 - Database Design & Normalization
  • 04 - Indexing Deep Dive
  • 05 - Transactions & ACID
  • 06 - NoSQL Databases
  • 07 - Query Optimization
  • 08 - Replication & High Availability
  • 09 - Sharding & Partitioning
  • 10 - Caching Strategies
  • 11 - CAP Theorem & Distributed Systems
  • 12 - Connection Pooling & Performance
  • 13 - Backup & Recovery
  • 14 - Monitoring & Observability
  • 15 - Choosing the Right Database