InkdownInkdown
Start writing

Study

59 files·8 subfolders

Shared Workspace

Study
core

07-query-optimization

Shared from "Study" on Inkdown

07 - Query Optimization

The Query Lifecycle

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Query Processing Pipeline                       │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Parse    ──► SQL text → Query Tree (AST)               │
│                                                              │
│  2. Rewrite  ──► Transform query (views, rules)            │
│                                                              │
│  3. Plan     ──► Generate possible execution plans         │
│       └── Cost-based optimizer estimates each plan           │
│                                                              │
│  4. Optimize ──► Choose lowest cost plan                     │
│                                                              │
│  5. Execute  ──► Run the plan, return results                │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Understanding EXPLAIN

programming-language-concepts.md
zero-language-explanation.md
DB
01-introduction.md
02-relational-databases.md
03-database-design.md
04-indexing.md
05-transactions-acid.md
06-nosql-databases.md
07-query-optimization.md
08-replication-ha.md
09-sharding-partitioning.md
10-caching-strategies.md
11-cap-theorem.md
12-connection-pooling.md
13-backup-recovery.md
14-monitoring.md
15-database-selection.md
README.md
JS
Event loop
Merlin Backend
01-Orchestration.md
02-DeepResearch.md
03-Search.md
04-Scraping.md
05-Streaming.md
06-MultiProviderLLM.md
07-MemoryAndContext.md
08-ErrorHandling.md
09-RateLimiting.md
10-TaskQueue.md
11-SecurityAndAuth.md
Orchestration-2nd-draft
OpenAI Agents Python
00_OVERVIEW.md
01_AGENT_SYSTEM.md
02_RUNNER_SYSTEM.md
03_TOOL_SYSTEM.md
04_ITEMS_SYSTEM.md
05_GUARDRAILS.md
06_HANDOFFS.md
07_MEMORY_SESSIONS.md
08_MODEL_PROVIDERS.md
09_SANDBOX_SYSTEM.md
10_TRACING.md
11_RUN_STATE.md
12_CONTEXT.md
13_LIFECYCLE_HOOKS.md
14_CONFIGURATION.md
15_ERROR_HANDLING.md
16_STREAMING.md
17_EXTENSIONS.md
18_MCP_INTEGRATION.md
19_BEST_PRACTICES.md
20_ARCHITECTURE_PATTERNS.md
opencode-study
context-handling
core
Python
Alembic
Basics
sqlalchemy - fastapi
SQLAlchemy overview
tweets
system_design_for_agentic_apps.md
PostgreSQL EXPLAIN
Sql
-- Basic explain
EXPLAIN SELECT * FROM users WHERE email = 'john@example.com';

-- With actual execution stats
EXPLAIN (ANALYZE, BUFFERS, FORMAT JSON)
SELECT * FROM users WHERE email = 'john@example.com';
Reading EXPLAIN Output
Plain text
┌─────────────────────────────────────────────────────────────┐
│              EXPLAIN Output Anatomy                         │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Seq Scan on users (cost=0.00..18406.00 rows=1 width=244)  │
│    Filter: (email = 'john@example.com'::text)              │
│    Rows Removed by Filter: 184055                          │
│                                                              │
│  Components:                                                 │
│  ───────────────────────────────────────────────────────────│
│  Operation     │ Seq Scan (table scan, no index)            │
│  Table         │ users                                        │
│  Startup Cost  │ 0.00 (cost to return first row)           │
│  Total Cost    │ 18406.00 (relative units, higher = worse) │
│  Estimated Rows│ 1 (expected rows)                           │
│  Width         │ 244 bytes per row                             │
│                                                              │
│  Red Flags:                                                  │
│  - Seq Scan on large tables                                  │
│  - High "Rows Removed by Filter"                             │
│  - High actual time vs estimated                              │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Execution Plan Nodes
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Common Plan Nodes                               │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Seq Scan                                           │   │
│  │  └── Read ALL rows, filter in memory                 │   │
│  │  └── Bad for: Large tables                           │   │
│  │  └── OK for: Small tables (< 1000 rows)              │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Index Scan                                         │   │
│  │  ├── Use index to find rows                         │   │
│  │  └── Then fetch actual rows from table              │   │
│  │  └── Good for: Selective queries (few rows)         │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Index Only Scan                                    │   │
│  │  └── All data in index, no table access             │   │
│  │  └── Fastest! Needs covering index                  │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Bitmap Index Scan + Bitmap Heap Scan               │   │
│  │  └── For many rows: efficient bulk retrieval        │   │
│  │  └── Better than index scan for > few % of table   │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Nested Loop Join                                   │   │
│  │  └── For each outer row, scan inner                 │   │
│  │  └── Good for: Small outer, indexed inner            │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Hash Join                                          │   │
│  │  ├── Build hash table from smaller table            │   │
│  │  └── Probe with larger table                        │   │
│  │  └── Good for: Large joins, no index needed         │   │
│  │  └── Needs: Enough memory for hash table           │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
│  ┌─────────────────────────────────────────────────────┐   │
│  │  Merge Join                                         │   │
│  │  └── Both inputs sorted, merge like zipper          │   │
│  │  └── Good for: Large sorted data, equi-joins        │   │
│  └─────────────────────────────────────────────────────┘   │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Query Optimization Techniques

1. Index Optimization
Sql
-- Problem: Sequential scan
EXPLAIN SELECT * FROM orders WHERE user_id = 123;
-- Seq Scan on orders (cost=0.00..18406.00 rows=10 width=244)

-- Solution: Create index
CREATE INDEX idx_orders_user ON orders(user_id);

-- Result: Index Scan
-- Index Scan using idx_orders_user (cost=0.29..8.31 rows=10 width=244)
2. Covering Indexes
Sql
-- Problem: Index scan + table lookup
SELECT status, total FROM orders WHERE user_id = 123;

-- Index only has user_id, needs to fetch status, total from table
-- Index Scan + Heap Fetches

-- Solution: Covering index
CREATE INDEX idx_orders_user_covering ON orders(user_id) 
INCLUDE (status, total);

-- Result: Index Only Scan
-- All columns in index, no table access!
3. Composite Index Column Order
Sql
-- Query pattern: WHERE user_id = ? AND status = ? AND created_at > ?

-- Bad order (status first)
CREATE INDEX idx_bad ON orders(status, user_id, created_at);
-- Status has low cardinality (few values)
-- Index not selective

-- Good order (high selectivity first)
CREATE INDEX idx_good ON orders(user_id, status, created_at);
-- user_id is most selective
-- Then filter by status within user's orders
-- Then range scan on date

-- Rule: Equality columns first (most selective),
--        then range columns,
--        then columns for ordering
4. Avoiding Functions on Indexed Columns
Sql
-- Problem: Function prevents index use
SELECT * FROM users WHERE LOWER(email) = 'john@example.com';
-- Seq Scan (function on email)

-- Solution 1: Expression index
CREATE INDEX idx_email_lower ON users(LOWER(email));

-- Solution 2: Store normalized data
ALTER TABLE users ADD COLUMN email_lower VARCHAR(255);
UPDATE users SET email_lower = LOWER(email);
CREATE INDEX idx_email_lower ON users(email_lower);
-- Update email_lower in triggers/application

-- Problem: Date function
SELECT * FROM orders WHERE DATE(created_at) = '2024-01-15';
-- Can't use index on created_at

-- Solution: Range instead
SELECT * FROM orders 
WHERE created_at >= '2024-01-15' 
  AND created_at < '2024-01-16';
-- Can use index on created_at
5. LIMIT Optimization
Sql
-- Problem: Sorting all rows then taking few
SELECT * FROM orders ORDER BY created_at DESC LIMIT 10;
-- Without index: Sorts entire table!

-- Solution: Index on sort column
CREATE INDEX idx_orders_created ON orders(created_at DESC);

-- Query becomes: Read first 10 from index, done!
-- Index Scan using idx_orders_created (limit 10)
6. Pagination Optimization
Sql
-- Problem: OFFSET gets slower as you go deeper
SELECT * FROM orders ORDER BY id LIMIT 10 OFFSET 10000;
-- Must scan and discard 10000 rows!

-- Solution 1: Keyset Pagination (Cursor-based)
-- Remember last seen value
SELECT * FROM orders 
WHERE id > 10000  -- Last seen ID
ORDER BY id 
LIMIT 10;
-- O(log n) regardless of page number!

-- Solution 2: For complex sorts
SELECT * FROM orders 
WHERE (created_at, id) > ('2024-01-15', 12345)
ORDER BY created_at, id 
LIMIT 10;

-- Solution 3: Materialized pagination (rarely changing data)
-- Pre-compute page positions
7. Join Optimization
Sql
-- Ensure foreign keys are indexed
CREATE INDEX idx_orders_user_id ON orders(user_id);

-- Choose right join type
-- Small × Large with index: Nested Loop
-- Large × Large: Hash Join or Merge Join

-- Avoid joining on calculated columns
-- Join on raw columns with indexes

-- Filter early (push down predicates)
-- Bad: Join then filter
SELECT u.*, o.total
FROM users u
JOIN orders o ON u.id = o.user_id
WHERE o.created_at > '2024-01-01';

-- Good: Filter subquery, then join
SELECT u.*, o.total
FROM users u
JOIN (
    SELECT user_id, total 
    FROM orders 
    WHERE created_at > '2024-01-01'
) o ON u.id = o.user_id;
8. Subquery Optimization
Sql
-- Correlated subquery (runs once per row)
SELECT name,
    (SELECT COUNT(*) FROM orders WHERE user_id = users.id) as cnt
FROM users;
-- Slow for many users

-- Solution: JOIN with GROUP BY
SELECT u.name, COUNT(o.id) as cnt
FROM users u
LEFT JOIN orders o ON u.id = o.user_id
GROUP BY u.id, u.name;
-- Much faster!

-- EXISTS vs IN
-- Use EXISTS for subqueries
SELECT * FROM users 
WHERE id IN (SELECT user_id FROM orders WHERE total > 1000);

-- Better:
SELECT * FROM users u
WHERE EXISTS (
    SELECT 1 FROM orders o 
    WHERE o.user_id = u.id AND o.total > 1000
);
-- EXISTS stops at first match
9. SELECT * Considered Harmful
Sql
-- Bad: Fetches all columns
SELECT * FROM users WHERE id = 1;
-- Forces table access even with covering index

-- Good: Select only needed columns
SELECT id, name, email FROM users WHERE id = 1;
-- Can use Index Only Scan if index covers these

-- Also: TOAST columns (PostgreSQL large columns)
-- SELECT * fetches even unused large columns
10. Batch Operations
Sql
-- Problem: Individual inserts in loop
-- 1000 INSERT statements = 1000 round trips, 1000 transactions

-- Solution 1: Multi-row insert
INSERT INTO users (name, email) VALUES 
    ('A', 'a@ex.com'),
    ('B', 'b@ex.com'),
    ('C', 'c@ex.com');

-- Solution 2: COPY (PostgreSQL)
COPY users (name, email) FROM '/path/to/data.csv';

-- Solution 3: Prepared statements
PREPARE insert_user (text, text) AS
    INSERT INTO users (name, email) VALUES ($1, $2);

EXECUTE insert_user('A', 'a@ex.com');
EXECUTE insert_user('B', 'b@ex.com');
DEALLOCATE insert_user;

Statistics and the Query Planner

How the Planner Works
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Cost-Based Optimization                        │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Statistics Collection                                    │
│     └── ANALYZE updates table stats                        │
│     └── Row counts, distinct values, distribution            │
│                                                              │
│  2. Cost Estimation                                          │
│     └── seq_page_cost = 1.0 (reading sequential page)      │
│     └── random_page_cost = 4.0 (random page read)            │
│     └── cpu_tuple_cost = 0.01 (processing one row)         │
│     └── cpu_index_tuple_cost = 0.005 (index processing)    │
│                                                              │
│  3. Plan Generation                                          │
│     └── Generate candidate plans                             │
│     └── Estimate cost for each                               │
│     └── Pick lowest cost                                     │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Maintaining Statistics
Sql
-- Update statistics after bulk operations
ANALYZE users;
ANALYZE orders;

-- Check statistics
SELECT 
    attname,
    n_distinct,
    most_common_vals,
    most_common_freqs
FROM pg_stats 
WHERE tablename = 'users';

-- Auto-vacuum settings
SHOW autovacuum;
SHOW autovacuum_analyze_threshold;

-- For frequently updated tables, may need manual ANALYZE

Advanced Optimization

Partitioning for Query Performance
Sql
-- Partition by date for time-series queries
CREATE TABLE events (
    id BIGSERIAL,
    event_time TIMESTAMPTZ,
    data JSONB
) PARTITION BY RANGE (event_time);

CREATE TABLE events_2024_01 
    PARTITION OF events
    FOR VALUES FROM ('2024-01-01') TO ('2024-02-01');

-- Query planner prunes partitions!
SELECT * FROM events 
WHERE event_time >= '2024-01-15' 
  AND event_time < '2024-01-16';
-- Only scans events_2024_01 partition
Materialized Views for Complex Queries
Sql
-- Complex aggregation query
CREATE MATERIALIZED VIEW daily_stats AS
SELECT 
    DATE(created_at) as day,
    COUNT(*) as orders,
    SUM(total) as revenue,
    COUNT(DISTINCT user_id) as unique_customers
FROM orders
GROUP BY DATE(created_at);

CREATE INDEX idx_daily_stats_day ON daily_stats(day);

-- Fast query on pre-computed data
SELECT * FROM daily_stats 
WHERE day BETWEEN '2024-01-01' AND '2024-01-31';

-- Refresh periodically
REFRESH MATERIALIZED VIEW CONCURRENTLY daily_stats;
Query Plan Caching (Prepared Statements)
Sql
-- First execution plans, subsequent reuse
PREPARE get_user_orders (INT) AS
SELECT * FROM orders WHERE user_id = $1;

-- Uses cached plan
EXECUTE get_user_orders(123);
EXECUTE get_user_orders(456);

-- Some databases do this automatically
-- But parameter values can affect plan choice!
-- planner may use generic plan or custom plan

Optimization Checklist

  • Run EXPLAIN (ANALYZE) on slow queries
  • Check for Seq Scans on large tables
  • Verify index usage
  • Check row estimates vs actual rows
  • Look for high "Rows Removed by Filter"
  • Ensure statistics are up to date (ANALYZE)
  • Review JOIN order and types
  • Check for functions on indexed columns
  • Verify LIMIT/OFFSET usage
  • Consider covering indexes for common queries
  • Review SELECT * usage
  • Check for N+1 query patterns
  • Consider materialized views for reports
  • Review connection pooling settings

Next: Replication & High Availability