InkdownInkdown
Start writing

Study

59 files·8 subfolders

Shared Workspace

Study
core

14-monitoring

Shared from "Study" on Inkdown

14 - Monitoring & Observability

Why Monitor Databases?

Plain text
┌─────────────────────────────────────────────────────────────┐
│              Monitoring Goals                                │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Availability: Is the database up?                         │
│                                                              │
│  2. Performance: Are queries fast enough?                    │
│     - Response time percentiles (p50, p95, p99)              │
│     - Throughput (QPS)                                        │
│                                                              │
│  3. Capacity Planning: When will we run out?               │
│     - Storage growth                                          │
│     - Connection usage                                        │
│                                                              │
│  4. Early Warning: Catch problems before outage              │
│     - Slow query increase                                       │
│     - Lock contention                                           │
│     - Replication lag                                             │
│                                                              │
│  5. Debugging: What went wrong?                             │
│     - Query patterns                                            │
│     - Error logs                                                │
│     - Wait events                                                 │
│                                                              │
└─────────────────────────────────────────────────────────────┘
programming-language-concepts.md
zero-language-explanation.md
DB
01-introduction.md
02-relational-databases.md
03-database-design.md
04-indexing.md
05-transactions-acid.md
06-nosql-databases.md
07-query-optimization.md
08-replication-ha.md
09-sharding-partitioning.md
10-caching-strategies.md
11-cap-theorem.md
12-connection-pooling.md
13-backup-recovery.md
14-monitoring.md
15-database-selection.md
README.md
JS
Event loop
Merlin Backend
01-Orchestration.md
02-DeepResearch.md
03-Search.md
04-Scraping.md
05-Streaming.md
06-MultiProviderLLM.md
07-MemoryAndContext.md
08-ErrorHandling.md
09-RateLimiting.md
10-TaskQueue.md
11-SecurityAndAuth.md
Orchestration-2nd-draft
OpenAI Agents Python
00_OVERVIEW.md
01_AGENT_SYSTEM.md
02_RUNNER_SYSTEM.md
03_TOOL_SYSTEM.md
04_ITEMS_SYSTEM.md
05_GUARDRAILS.md
06_HANDOFFS.md
07_MEMORY_SESSIONS.md
08_MODEL_PROVIDERS.md
09_SANDBOX_SYSTEM.md
10_TRACING.md
11_RUN_STATE.md
12_CONTEXT.md
13_LIFECYCLE_HOOKS.md
14_CONFIGURATION.md
15_ERROR_HANDLING.md
16_STREAMING.md
17_EXTENSIONS.md
18_MCP_INTEGRATION.md
19_BEST_PRACTICES.md
20_ARCHITECTURE_PATTERNS.md
opencode-study
context-handling
core
Python
Alembic
Basics
sqlalchemy - fastapi
SQLAlchemy overview
tweets
system_design_for_agentic_apps.md

Key Metrics

The Four Golden Signals
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Four Golden Signals (Google SRE)                │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  1. Latency                                                   │
│     - Time to service a request                              │
│     - Measure: query execution time, transaction time        │
│     - Track: p50, p95, p99 percentiles                       │
│                                                              │
│  2. Traffic                                                   │
│     - Demand on the system                                    │
│     - Measure: queries per second, transactions per second     │
│     - Track: trends over time                                 │
│                                                              │
│  3. Errors                                                    │
│     - Rate of failed requests                                 │
│     - Measure: connection failures, query errors, timeouts     │
│     - Track: error rate percentage                            │
│                                                              │
│  4. Saturation                                                │
│     - How "full" the service is                               │
│     - Measure: CPU, memory, disk, connections                  │
│     - Track: utilization percentages                          │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Database-Specific Metrics
Sql
-- PostgreSQL: Key metrics queries

-- 1. Connection count
SELECT 
    count(*),
    state,
    wait_event_type
FROM pg_stat_activity
GROUP BY state, wait_event_type;

-- 2. Replication lag
SELECT 
    client_addr,
    state,
    sent_lsn, write_lsn, flush_lsn, replay_lsn,
    write_lag, flush_lag, replay_lag
FROM pg_stat_replication;

-- 3. Table statistics
SELECT 
    schemaname,
    relname,
    seq_scan,
    seq_tup_read,
    idx_scan,
    idx_tup_fetch,
    n_tup_ins,
    n_tup_upd,
    n_tup_del,
    n_live_tup,
    n_dead_tup
FROM pg_stat_user_tables
ORDER BY seq_scan DESC;

-- 4. Index usage
SELECT 
    schemaname,
    relname,
    indexrelname,
    idx_scan,
    idx_tup_read,
    idx_tup_fetch,
    pg_size_pretty(pg_relation_size(indexrelid)) as index_size
FROM pg_stat_user_indexes
ORDER BY idx_scan DESC;

-- 5. Slow queries (requires pg_stat_statements)
SELECT 
    query,
    calls,
    total_exec_time,
    mean_exec_time,
    rows,
    100.0 * shared_blks_hit / nullif(shared_blks_hit + shared_blks_read, 0) AS hit_percent
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;

-- 6. Lock contention
SELECT 
    blocked_locks.pid AS blocked_pid,
    blocked_activity.usename AS blocked_user,
    blocking_locks.pid AS blocking_pid,
    blocking_activity.usename AS blocking_user,
    blocked_activity.query AS blocked_statement,
    blocking_activity.query AS blocking_statement
FROM pg_catalog.pg_locks blocked_locks
JOIN pg_catalog.pg_stat_activity blocked_activity 
    ON blocked_activity.pid = blocked_locks.pid
JOIN pg_catalog.pg_locks blocking_locks 
    ON blocking_locks.locktype = blocked_locks.locktype
    AND blocking_locks.relation = blocked_locks.relation
    AND blocking_locks.pid != blocked_locks.pid
JOIN pg_catalog.pg_stat_activity blocking_activity 
    ON blocking_activity.pid = blocking_locks.pid
WHERE NOT blocked_locks.granted;

-- 7. Vacuum statistics
SELECT 
    schemaname,
    relname,
    last_vacuum,
    last_autovacuum,
    last_analyze,
    vacuum_count,
    autovacuum_count,
    n_dead_tup,
    n_live_tup,
    round(n_dead_tup::numeric/nullif(n_live_tup,0)*100, 2) as dead_pct
FROM pg_stat_user_tables
WHERE n_dead_tup > 1000
ORDER BY n_dead_tup DESC;

Monitoring Tools

Built-in Tools
Bash
# PostgreSQL
# Activity monitoring
watch -n 1 'psql -c "SELECT pid, usename, application_name, state, query FROM pg_stat_activity;"'

# Real-time query monitoring
psql -c "SELECT * FROM pg_stat_statements ORDER BY total_time DESC LIMIT 10;"

# pg_stat_statements (must be enabled in shared_preload_libraries)
# In postgresql.conf:
shared_preload_libraries = 'pg_stat_statements'
pg_stat_statements.track = all

# MySQL
# Process list
mysql -e "SHOW FULL PROCESSLIST;"

# InnoDB status
mysql -e "SHOW ENGINE INNODB STATUS;"

# Performance schema queries
mysql -e "SELECT * FROM performance_schema.events_statements_summary_by_digest ORDER BY SUM_TIMER_WAIT DESC LIMIT 10;"
External Monitoring Tools
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Monitoring Stack                                  │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Metrics Collection:                                          │
│  - Prometheus (metrics scraping)                            │
│  - InfluxDB (time-series database)                          │
│  - StatsD (metric aggregation)                              │
│                                                              │
│  Visualization:                                               │
│  - Grafana (dashboards)                                       │
│  - Datadog (SaaS)                                             │
│  - New Relic (SaaS)                                           │
│                                                              │
│  Alerting:                                                    │
│  - Prometheus Alertmanager                                    │
│  - PagerDuty/Opsgenie                                         │
│  - Slack/Email                                                │
│                                                              │
│  Log Aggregation:                                             │
│  - ELK Stack (Elasticsearch, Logstash, Kibana)              │
│  - Loki (Grafana's lightweight solution)                    │
│  - Splunk (Enterprise)                                        │
│                                                              │
│  APM (Application Performance Monitoring):                   │
│  - Jaeger (distributed tracing)                               │
│  - Zipkin (distributed tracing)                               │
│  - Datadog APM                                                │
│                                                              │
└─────────────────────────────────────────────────────────────┘
Prometheus + Grafana Setup
YAML
# prometheus.yml - scrape database exporters
scrape_configs:
  - job_name: 'postgresql'
    static_configs:
      - targets: ['postgres-exporter:9187']
    
  - job_name: 'mysql'
    static_configs:
      - targets: ['mysql-exporter:9104']
      
  - job_name: 'redis'
    static_configs:
      - targets: ['redis-exporter:9121']
Bash
# Run postgres_exporter
docker run -d \
  -e DATA_SOURCE_NAME="postgresql://monitoring:password@localhost:5432/postgres?sslmode=disable" \
  -p 9187:9187 \
  prometheuscommunity/postgres-exporter

Dashboard Design

Key Dashboards
Plain text
┌─────────────────────────────────────────────────────────────┐
│              Dashboard 1: Overview                             │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Row 1: Health                                               │
│  ├── Database Status (Up/Down)                                │
│  ├── Replication Lag (seconds)                              │
│  └── Connection Utilization %                                 │
│                                                              │
│  Row 2: Performance                                          │
│  ├── QPS (queries per second)                               │
│  ├── Query Latency (p50, p95, p99)                         │
│  └── Slow Query Count                                        │
│                                                              │
│  Row 3: Resources                                            │
│  ├── CPU %                                                   │
│  ├── Memory %                                                 │
│  ├── Disk I/O (IOPS)                                         │
│  └── Disk Usage %                                            │
│                                                              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              Dashboard 2: Query Performance                    │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Row 1: Top Slow Queries                                     │
│  ├── Query text (truncated)                                  │
│  ├── Avg execution time                                       │
│  ├── Total execution time                                     │
│  ├── Call count                                               │
│  └── Hit ratio %                                             │
│                                                              │
│  Row 2: Query Patterns                                       │
│  ├── Queries by type (SELECT, INSERT, UPDATE, DELETE)        │
│  ├── Queries by table                                         │
│  └── Queries by application                                   │
│                                                              │
│  Row 3: Lock Wait Events                                     │
│  ├── Lock wait time by type                                   │
│  ├── Blocked queries count                                    │
│  └── Blocking queries details                                 │
│                                                              │
└─────────────────────────────────────────────────────────────┘

┌─────────────────────────────────────────────────────────────┐
│              Dashboard 3: Storage & Growth                     │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Row 1: Database Size                                        │
│  ├── Total size over time                                     │
│  ├── Size by table                                            │
│  └── WAL size                                                 │
│                                                              │
│  Row 2: Growth Rate                                          │
│  ├── Daily growth (GB/day)                                    │
│  ├── Projected fill time                                      │
│  └── Largest tables growth                                    │
│                                                              │
│  Row 3: Bloat                                                │
│  ├── Dead tuple ratio by table                               │
│  ├── Table bloat %                                           │
│  └── Index bloat %                                           │
│                                                              │
└─────────────────────────────────────────────────────────────┘

Alerting

Critical Alerts
YAML
# Prometheus alerting rules
groups:
  - name: database
    rules:
      # Database down
      - alert: DatabaseDown
        expr: pg_up == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "PostgreSQL down on {{ $labels.instance }}"
          description: "Database has been down for more than 1 minute"
      
      # High connection usage
      - alert: HighConnectionUsage
        expr: |
          pg_stat_activity_count / pg_settings_max_connections * 100 > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High connection usage on {{ $labels.instance }}"
          description: "Connection usage is above 80%"
      
      # Replication lag
      - alert: ReplicationLagHigh
        expr: |
          pg_stat_replication_pg_wal_lsn_diff / 1024 / 1024 > 1000
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Replication lag high on {{ $labels.instance }}"
          description: "Replication lag is above 1GB"
      
      # Dead tuples high
      - alert: HighDeadTuples
        expr: |
          pg_stat_user_tables_n_dead_tup > 1000000
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "High dead tuples on {{ $labels.relname }}"
          description: "Table has over 1M dead tuples"
      
      # Slow queries
      - alert: SlowQueriesIncreasing
        expr: |
          rate(pg_stat_statements_total_time_sum[5m]) > 10000
        for: 10m
        labels:
          severity: warning
        annotations:
          summary: "Slow queries increasing"
          description: "Query execution time is trending up"
      
      # Disk space
      - alert: DiskSpaceLow
        expr: |
          pg_database_size_bytes / pg_settings_data_directory_size * 100 > 85
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Disk space low on {{ $labels.instance }}"
          description: "Disk usage is above 85%"

Log Analysis

PostgreSQL Log Configuration
Ini
# postgresql.conf - log settings
logging_collector = on
log_directory = 'log'
log_filename = 'postgresql-%Y-%m-%d_%H%M%S.log'
log_rotation_age = 1d
log_rotation_size = 100MB

# What to log
log_min_duration_statement = 1000  # Log queries > 1s
log_checkpoints = on
log_connections = on
log_disconnections = on
log_lock_waits = on
log_temp_files = 0  # Log all temp files
log_autovacuum_min_duration = 0

# Log format
log_line_prefix = '%t [%p]: [%l-1] user=%u,db=%d,app=%a,client=%h '
log_statement = 'ddl'  # Log DDL statements
Common Log Patterns
Bash
# Find slow queries
grep "duration:" postgresql-*.log | grep -v "statement:"

# Find connection issues
grep "could not connect" postgresql-*.log

# Find deadlocks
grep "deadlock detected" postgresql-*.log

# Find autovacuum activity
grep "autovacuum:" postgresql-*.log

# Find checkpoint issues
grep "checkpoint" postgresql-*.log

Performance Insights

Wait Events Analysis
Sql
-- PostgreSQL: What is the database waiting on?
SELECT 
    wait_event_type,
    wait_event,
    count(*)
FROM pg_stat_activity
WHERE state = 'active'
GROUP BY wait_event_type, wait_event
ORDER BY count(*) DESC;

-- Common wait events:
-- IO: DataFileRead, DataFileWrite
-- Lock: relation, tuple, transactionid
-- Client: ClientRead, ClientWrite
-- IPC: ProcArrayGroupUpdate
Query Plan Analysis
Sql
-- Enable auto_explain (requires extension)
-- In postgresql.conf:
shared_preload_libraries = 'auto_explain'
auto_explain.log_min_duration = '1s'
auto_explain.log_analyze = true
auto_explain.log_buffers = true
auto_explain.log_timing = true

-- Find queries doing seq scans on large tables
-- (from logs or pg_stat_statements)

Next: Choosing the Right Database