InkdownInkdown
Start writing

Study

59 files·8 subfolders

Shared Workspace

Study
core

04-indexing

Shared from "Study" on Inkdown

04 - Indexing Deep Dive

What is an Index?

An index is a data structure that improves the speed of data retrieval operations on a database table at the cost of additional storage and slower writes.

Analogy: Book Index vs Database Index
Plain text
Without Book Index:
┌─────────────────────────────────────┐
│  Chapter 1: Introduction............│  ▲
│  Chapter 2: Setup...................│  │
│  ... (300 pages) ...................│  │
│  Topic X on page 285................│  │ Must scan
│  ... (more pages) ..................│  │ everything
│  Chapter 10: Conclusion.............│  ▼
└─────────────────────────────────────┘

With Book Index:
┌─────────────────────────────────────┐
│  Index:                             │
│  Topic A ............. page 12     │  ┌─────────────┐
│  Topic B ............. page 45     │──►│ Page 285    │
│  ...                               │  │ Topic X     │
│  Topic X ............. page 285 ◄────│  └─────────────┘
│  Topic Y ............. page 300    │
└─────────────────────────────────────┘
programming-language-concepts.md
zero-language-explanation.md
DB
01-introduction.md
02-relational-databases.md
03-database-design.md
04-indexing.md
05-transactions-acid.md
06-nosql-databases.md
07-query-optimization.md
08-replication-ha.md
09-sharding-partitioning.md
10-caching-strategies.md
11-cap-theorem.md
12-connection-pooling.md
13-backup-recovery.md
14-monitoring.md
15-database-selection.md
README.md
JS
Event loop
Merlin Backend
01-Orchestration.md
02-DeepResearch.md
03-Search.md
04-Scraping.md
05-Streaming.md
06-MultiProviderLLM.md
07-MemoryAndContext.md
08-ErrorHandling.md
09-RateLimiting.md
10-TaskQueue.md
11-SecurityAndAuth.md
Orchestration-2nd-draft
OpenAI Agents Python
00_OVERVIEW.md
01_AGENT_SYSTEM.md
02_RUNNER_SYSTEM.md
03_TOOL_SYSTEM.md
04_ITEMS_SYSTEM.md
05_GUARDRAILS.md
06_HANDOFFS.md
07_MEMORY_SESSIONS.md
08_MODEL_PROVIDERS.md
09_SANDBOX_SYSTEM.md
10_TRACING.md
11_RUN_STATE.md
12_CONTEXT.md
13_LIFECYCLE_HOOKS.md
14_CONFIGURATION.md
15_ERROR_HANDLING.md
16_STREAMING.md
17_EXTENSIONS.md
18_MCP_INTEGRATION.md
19_BEST_PRACTICES.md
20_ARCHITECTURE_PATTERNS.md
opencode-study
context-handling
core
Python
Alembic
Basics
sqlalchemy - fastapi
SQLAlchemy overview
tweets
system_design_for_agentic_apps.md

Why Indexes Matter

Sql
-- Table: users (10 million rows)
-- Scenario: Find user by email

-- Without index: Full Table Scan
-- Database reads EVERY row
-- Cost: O(n) - 10 million comparisons
-- Time: ~30 seconds

SELECT * FROM users WHERE email = 'john@example.com';

-- With index on email: Index Seek + Lookup
-- Database uses B-tree to find row
-- Cost: O(log n) - ~24 comparisons
-- Time: ~5 milliseconds

-- Speedup: 6000x faster!

How Indexes Work: B-Trees

Most database indexes use B-Tree (Balanced Tree) structure:

Plain text
                    [50]
                   /    \
              [20,30]    [60,70,80]
              /   |      /   |   \
          [10] [25]  [55] [65] [75] [85]
          
Properties:
- Balanced: All leaf nodes at same depth
- Sorted: Left < Parent < Right
- Multi-way: Each node can have multiple keys
- Self-maintaining: Insert/Delete rebalances automatically

Search for 65:
1. Start at root [50], 65 > 50, go right
2. At [60,70,80], 60 < 65 < 70, go middle
3. Found 65! (3 comparisons vs scanning)
B-Tree Node Structure
Plain text
┌─────────────────────────────────────────┐
│           Internal Node                │
├─────────────────────────────────────────┤
│ Key 1 │ Key 2 │ Key 3 │ ... │ Key N    │
├───────┼───────┼───────┼───┼───────────┤
│ Ptr 0 │ Ptr 1 │ Ptr 2 │...│ Ptr N     │
└───────┴───────┴───────┴───┴───────────┘

Ptr i points to child where: Key i < value < Key i+1

┌─────────────────────────────────────────┐
│             Leaf Node                  │
├─────────────────────────────────────────┤
│ Key 1 │ Key 2 │ Key 3 │ ... │ Key N    │
├───────┼───────┼───────┼───┼───────────┤
│ RID 1 │ RID 2 │ RID 3 │...│ RID N     │
└───────┴───────┴───────┴───┴───────────┘

RID = Row ID (pointer to actual data row)
Leaf nodes often linked for range scans

Types of Indexes

1. Single-Column Index
Sql
-- Basic index on one column
CREATE INDEX idx_users_email ON users(email);

-- Queries that use it:
SELECT * FROM users WHERE email = 'x';           -- Equality
SELECT * FROM users WHERE email LIKE 'a%';       -- Prefix match
SELECT * FROM users WHERE email > 'a' AND email < 'm'; -- Range
2. Composite (Multi-Column) Index
Sql
-- Index on multiple columns (order matters!)
CREATE INDEX idx_orders_user_status ON orders(user_id, status);

-- Query patterns that use it:
WHERE user_id = 1;                          -- ✓ Uses index
WHERE user_id = 1 AND status = 'pending'; -- ✓ Uses index (best)
WHERE status = 'pending';                   -- ✗ Doesn't use (wrong order)

-- Rule: Columns used in equality first, then range
CREATE INDEX idx_optimal ON table(a, b, c);
-- Best for: WHERE a = ? AND b = ? AND c > ?
3. Unique Index
Sql
-- Enforces uniqueness, automatically created for PK/Unique constraints
CREATE UNIQUE INDEX idx_unique_email ON users(email);

-- Benefits:
-- 1. Fast lookup (same as regular index)
-- 2. Prevents duplicates
-- 3. Allows upsert operations
4. Partial Index
Sql
-- Index only subset of rows (smaller, faster)
CREATE INDEX idx_active_users ON users(email) 
WHERE status = 'active';

-- Perfect for queries that always filter
SELECT * FROM users 
WHERE status = 'active' AND email = 'x';

-- Size: Only ~20% of rows (if 80% inactive)
-- Speed: Much faster than full index
5. Expression Index
Sql
-- Index on computed expression
CREATE INDEX idx_lower_email ON users(LOWER(email));

-- Case-insensitive search now fast
SELECT * FROM users WHERE LOWER(email) = LOWER('John@Example.COM');

-- Other examples
CREATE INDEX idx_name_length ON users(LENGTH(name));
CREATE INDEX idx_date_trunc ON orders(DATE(created_at));
6. Covering Index (Index Only Scan)
Sql
-- Include extra columns in index for faster queries
CREATE INDEX idx_orders_covering ON orders(user_id, status) 
INCLUDE (total, created_at);  -- PostgreSQL 11+

-- Query can be satisfied entirely from index
SELECT status, total, created_at 
FROM orders 
WHERE user_id = 1;

-- No need to access table (heap) at all!
-- Called "Index Only Scan" in EXPLAIN output
7. Full-Text Index
Sql
-- PostgreSQL
CREATE INDEX idx_search ON articles 
USING GIN (to_tsvector('english', content));

-- Query
SELECT * FROM articles 
WHERE to_tsvector('english', content) @@ to_tsquery('database & indexing');

-- MySQL
CREATE FULLTEXT INDEX idx_content ON articles(content);
SELECT * FROM articles WHERE MATCH(content) AGAINST('database indexing');
8. GiST / GIN / SP-GiST / BRIN (PostgreSQL Special)
Sql
-- GIN: Generalized Inverted Index (arrays, JSONB, full-text)
CREATE INDEX idx_tags ON posts USING GIN (tags);
-- tags is an array column
SELECT * FROM posts WHERE tags && ARRAY['postgres', 'database'];

-- GiST: Generalized Search Tree (geospatial, ranges)
CREATE INDEX idx_location ON places USING GiST (location);
-- Find places near point
SELECT * FROM places WHERE location <-> point '(0,0)' < 1000;

-- BRIN: Block Range Index (very large, naturally ordered tables)
CREATE INDEX idx_timestamp ON events USING BRIN (created_at);
-- Good for time-series data (billions of rows, timestamp always increasing)

Index Selectivity

Selectivity = (Number of unique values) / (Total rows)

Higher selectivity = better index performance

Plain text
Perfect Selectivity (1.0):
┌────────────────────────┐
│ PK ID: 1, 2, 3, 4...   │
│ Each lookup finds 1 row│
└────────────────────────┘

Good Selectivity (>0.1):
┌────────────────────────┐
│ Email: unique mostly   │
│ Status: active/inactive│ ◄── Don't index low cardinality
│ 95% active, 5% inactive│
│ Selectivity: 0.02 (bad)│
└────────────────────────┘

Bad Selectivity (<0.01):
┌────────────────────────┐
│ Gender: M/F/O          │
│ Boolean flags          │
│ Selectivity: ~0.0001   │
│ Full scan often faster │
└────────────────────────┘

Index Strategies by Query Type

Equality Queries
Sql
WHERE email = 'john@example.com'
-- → B-tree index on email
Range Queries
Sql
WHERE created_at BETWEEN '2024-01-01' AND '2024-01-31'
-- → B-tree index on created_at
-- Composite: (user_id, created_at) for user-specific ranges
Pattern Matching
Sql
WHERE email LIKE 'john%'   -- Prefix: ✓ Index works
WHERE email LIKE '%john%'  -- Contains: ✗ Index doesn't help
WHERE email LIKE '%@gmail.com' -- Suffix: ✗ Full scan

-- For suffix/containment, consider:
-- 1. Reverse stored column + index
-- 2. Full-text search
-- 3. Trigram indexes (PostgreSQL pg_trgm)
CREATE INDEX idx_email_trigram ON users USING GIN (email gin_trgm_ops);
Sorting & Pagination
Sql
ORDER BY created_at DESC LIMIT 10
-- → Index on created_at DESC (avoids sort operation)

ORDER BY user_id, created_at DESC
-- → Composite index: (user_id, created_at DESC)
Joins
Sql
-- Foreign keys should always be indexed
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- JOIN queries need indexes on join columns
SELECT * FROM users 
JOIN orders ON users.id = orders.user_id;
-- Needs: PK on users.id, index on orders.user_id

How to Create Good Indexes

Step-by-Step Process
Sql
-- 1. Identify slow queries from logs
-- Look for queries with high total_time * calls

-- 2. Check query plan
EXPLAIN (ANALYZE, BUFFERS) 
SELECT * FROM orders WHERE user_id = 123 AND status = 'pending';

-- Output might show:
-- Seq Scan on orders (cost=0.00..18406.00 rows=1 width=244)
--   Filter: ((user_id = 123) AND (status = 'pending'::text))
--   Rows Removed by Filter: 184055
-- This means: scanned 184K rows to find 1!

-- 3. Create appropriate index
CREATE INDEX idx_orders_user_status ON orders(user_id, status);

-- 4. Verify improvement
EXPLAIN (ANALYZE, BUFFERS)
SELECT * FROM orders WHERE user_id = 123 AND status = 'pending';

-- Should now show:
-- Index Scan using idx_orders_user_status
--   Index Cond: ((user_id = 123) AND (status = 'pending'::text))
EXPLAIN Output Decoded
Plain text
┌─────────────────────────────────────────────────────────────┐
│  Seq Scan                                                   │
│  └── Sequential read of ALL rows                            │
│  └── Bad for large tables (use index instead)              │
├─────────────────────────────────────────────────────────────┤
│  Index Scan                                                 │
│  ├── Index lookup to find row IDs                           │
│  └── Then fetch rows from table                            │
│  └── Good for selective queries                            │
├─────────────────────────────────────────────────────────────┤
│  Index Only Scan                                            │
│  └── All data in index, no table access                     │
│  └── Fastest possible!                                     │
├─────────────────────────────────────────────────────────────┤
│  Bitmap Index Scan + Bitmap Heap Scan                       │
│  └── For queries returning many rows                        │
│  └── Efficient bulk retrieval                              │
└─────────────────────────────────────────────────────────────┘

Index Maintenance

1. Fragmentation
Sql
-- PostgreSQL: Check bloat
SELECT 
    schemaname, tablename, attname as column,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stats 
WHERE tablename = 'orders';

-- Rebuild index (PostgreSQL)
REINDEX INDEX CONCURRENTLY idx_orders_user_id;

-- MySQL
OPTIMIZE TABLE orders;
2. Unused Indexes
Sql
-- PostgreSQL: Find unused indexes
SELECT 
    schemaname || '.' || relname AS table,
    indexrelname AS index,
    pg_size_pretty(pg_relation_size(indexrelid)) AS index_size,
    idx_scan AS index_scans
FROM pg_stat_user_indexes
WHERE idx_scan = 0 
AND indexrelname NOT LIKE 'pg_toast%'
ORDER BY pg_relation_size(indexrelid) DESC;

-- Consider dropping if truly unused
DROP INDEX CONCURRENTLY idx_unused;
3. Duplicate/Redundant Indexes
Sql
-- These are redundant:
CREATE INDEX idx_a ON table(a);
CREATE INDEX idx_a_b ON table(a, b);  -- First column already indexed above

-- PostgreSQL: Find duplicates
SELECT 
    indrelid::regclass AS table,
    array_agg(indexrelid::regclass) AS indexes
FROM pg_index
GROUP BY indrelid, indkey
HAVING COUNT(*) > 1;

Index Trade-offs

Storage Cost
Sql
-- Indexes use significant disk space
-- Typical: 20-50% of table size

-- Check sizes
SELECT 
    relname AS table,
    pg_size_pretty(pg_relation_size(oid)) AS table_size,
    pg_size_pretty(pg_indexes_size(oid)) AS indexes_size
FROM pg_class 
WHERE relname = 'orders';

-- Result might show:
-- orders | 500 MB | 200 MB (40% overhead)
Write Performance
Plain text
Without Index:
INSERT: Write to table only
        Cost: 1 write

With 3 Indexes:
INSERT: Write to table + 3 index updates
        Cost: 4 writes
        Slower by ~4x for writes
Locking
Sql
-- Creating indexes blocks writes (in most databases)
-- Solution: CONCURRENTLY (PostgreSQL) or ONLINE (MySQL/Oracle)

CREATE INDEX CONCURRENTLY idx_new ON table(column);
-- Slower creation, but doesn't lock table

Special Indexing Scenarios

1. Indexing for JSON/JSONB
Sql
CREATE TABLE events (
    id SERIAL PRIMARY KEY,
    data JSONB
);

-- Index specific JSON path
CREATE INDEX idx_events_user ON events((data->>'user_id'));

-- GIN index for containment queries
CREATE INDEX idx_events_data ON events USING GIN (data);

-- Query examples
SELECT * FROM events WHERE data @> '{"type": "click"}';
SELECT * FROM events WHERE data->>'user_id' = '123';
2. Indexing Arrays
Sql
CREATE TABLE articles (
    id SERIAL PRIMARY KEY,
    tags TEXT[]
);

-- GIN index for array operations
CREATE INDEX idx_article_tags ON articles USING GIN (tags);

-- Fast queries
SELECT * FROM articles WHERE tags && ARRAY['database', 'sql'];
SELECT * FROM articles WHERE tags @> ARRAY['postgres'];
3. Partial Index for Soft Deletes
Sql
-- Only index non-deleted rows (much smaller!)
CREATE INDEX idx_active_orders ON orders(created_at) 
WHERE deleted_at IS NULL;

-- Queries on active orders are fast
SELECT * FROM orders 
WHERE deleted_at IS NULL AND created_at > '2024-01-01';
4. Index for Ordering
Sql
-- For frequent "latest items" queries
CREATE INDEX idx_posts_latest ON posts(created_at DESC, id DESC);

-- Fast pagination
SELECT * FROM posts 
ORDER BY created_at DESC, id DESC 
LIMIT 10;

-- Keyset pagination (more efficient than OFFSET)
SELECT * FROM posts 
WHERE (created_at, id) < (last_date, last_id)
ORDER BY created_at DESC, id DESC 
LIMIT 10;

Index Checklist

  • Primary keys always indexed (automatic)
  • Foreign keys indexed
  • Columns in WHERE clauses indexed
  • ORDER BY columns indexed
  • JOIN columns indexed
  • High-selectivity columns preferred
  • Composite indexes ordered correctly (equality first)
  • Covering indexes for common queries
  • Partial indexes for filtered queries
  • Expression indexes for computed WHERE
  • Regular review of unused indexes
  • Index size monitored

Common Mistakes

  1. Indexing low-cardinality columns (boolean, status with few values)
  2. Too many single-column indexes instead of composite
  3. Wrong column order in composite indexes
  4. Not analyzing (updating statistics) after bulk loads
  5. Indexing everything (wastes space, slows writes)
  6. Forgetting to index foreign keys
  7. Using indexes for small tables (sequential scan is faster)

Next: Transactions & ACID