Chapter 7 — Real World Production Problems
Chapter 7 — Real World Production Problems
Section titled “Chapter 7 — Real World Production Problems”7.1 Slow Queries
Section titled “7.1 Slow Queries”The Scenario
Section titled “The Scenario”Your API endpoint that used to respond in 50ms is now taking 5 seconds. Users are complaining. Your monitoring shows p99 latency has spiked.
How Experts Debug This
Section titled “How Experts Debug This” STEP 1: Identify the slow query ═══════════════════════════════════════════════════════
-- PostgreSQL: Find the slowest queries SELECT calls, mean_exec_time::numeric(10,2) as avg_ms, max_exec_time::numeric(10,2) as max_ms, total_exec_time::numeric(10,2) as total_ms, rows, query FROM pg_stat_statements ORDER BY mean_exec_time DESC LIMIT 10;
-- Or check the slow query log: -- postgresql.conf: log_min_duration_statement = 1000 (log queries > 1s)
STEP 2: Analyze the execution plan ═══════════════════════════════════════════════════════
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT) SELECT o.*, u.name FROM orders o JOIN users u ON u.id = o.user_id WHERE o.status = 'pending' AND o.created_at > NOW() - INTERVAL '30 days' ORDER BY o.created_at DESC LIMIT 20;
Red flags to look for: ┌──────────────────────────────────────────────────────────┐ │ ⚠ Seq Scan on large table (should be Index Scan) │ │ ⚠ actual rows >> estimated rows (stale statistics) │ │ ⚠ Nested Loop with loops=100000 (missing index) │ │ ⚠ Sort Method: external merge Disk (work_mem too low) │ │ ⚠ Buffers: read=50000 (too much disk I/O) │ │ ⚠ Filter: Rows Removed by Filter: 999000 (poor index) │ └──────────────────────────────────────────────────────────┘
STEP 3: Common fixes ═══════════════════════════════════════════════════════
Fix 1: Missing index CREATE INDEX CONCURRENTLY idx_orders_status_created ON orders (status, created_at DESC); -- CONCURRENTLY: doesn't lock the table during creation
Fix 2: Stale statistics ANALYZE orders;
Fix 3: Query rewrite -- Before (slow): SELECT * with OR conditions SELECT * FROM orders WHERE status = 'pending' OR status = 'processing';
-- After (fast): UNION ALL with index-friendly conditions SELECT * FROM orders WHERE status = 'pending' UNION ALL SELECT * FROM orders WHERE status = 'processing';
Fix 4: Increase work_mem for sorting SET work_mem = '256MB'; -- Per-operation, not per-connection -- Only for the specific session/query, don't set globally too high7.2 Database Overload (Connection Exhaustion)
Section titled “7.2 Database Overload (Connection Exhaustion)”The Scenario
Section titled “The Scenario”Application starts throwing “too many connections” errors. New requests fail with connection timeout. Database CPU at 100%.
DIAGNOSIS ═══════════════════════════════════════════════════════
-- Check current connections SELECT count(*), state, wait_event_type FROM pg_stat_activity GROUP BY state, wait_event_type ORDER BY count DESC;
Typical output during overload: count | state | wait_event_type ------+--------------------+------------------ 180 | active | Lock ← Blocked on locks! 45 | idle in transaction| Client ← App holding txns open 20 | idle | Client ← Actually available 5 | active | IO ← Doing real work
Root causes: ┌──────────────────────────────────────────────────────────┐ │ 1. No connection pooling │ │ → Each request creates a new connection │ │ → 500 concurrent requests = 500 connections │ │ → max_connections default is 100 │ │ Fix: Add PgBouncer, set pool size to 20-50 │ │ │ │ 2. "Idle in transaction" connections │ │ → Application opens transaction but doesn't commit │ │ → Often: exception in code skips COMMIT/ROLLBACK │ │ → Connection is held hostage │ │ Fix: Set idle_in_transaction_session_timeout = '30s' │ │ Fix: Use context managers (with db.transaction():) │ │ │ │ 3. Long-running queries holding locks │ │ → One slow query locks a row/table │ │ → 200 other queries wait for the lock │ │ → Connection pool fills up with waiting queries │ │ Fix: Set statement_timeout = '30s' │ │ Fix: Optimize the slow query │ │ │ │ 4. Connection leak in application │ │ → Connections checked out from pool but never returned│ │ → Pool empties over time │ │ Fix: Use connection pool leak detection │ │ Fix: Always return connections in finally block │ └──────────────────────────────────────────────────────────┘
EMERGENCY RESPONSE ═══════════════════════════════════════════════════════
-- Kill idle-in-transaction connections older than 5 minutes SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'idle in transaction' AND state_change < NOW() - INTERVAL '5 minutes';
-- Kill long-running queries SELECT pg_terminate_backend(pid) FROM pg_stat_activity WHERE state = 'active' AND query_start < NOW() - INTERVAL '2 minutes' AND query NOT LIKE '%pg_stat_activity%';7.3 Memory Leaks
Section titled “7.3 Memory Leaks”The Scenario
Section titled “The Scenario”Application memory usage grows steadily over days. Eventually OOM killer terminates the process. Restarting temporarily fixes it.
COMMON CAUSES IN BACKEND APPLICATIONS ═══════════════════════════════════════════════════════
1. Unbounded in-memory cache ───────────────────────────── # BAD: Cache grows forever cache = {} def get_user(id): if id not in cache: cache[id] = db.query(f"SELECT * FROM users WHERE id = {id}") return cache[id] # After 1M unique users: cache consumes GBs of RAM
# FIX: Use bounded cache with eviction from functools import lru_cache @lru_cache(maxsize=10000) # Max 10K entries, LRU eviction def get_user(id): return db.query("SELECT * FROM users WHERE id = %s", (id,))
2. Unclosed database connections / cursors ───────────────────────────────────────── # BAD: Connection never returned to pool def get_data(): conn = pool.getconn() cursor = conn.cursor() cursor.execute("SELECT * FROM big_table") results = cursor.fetchall() return results # Connection leaked! Never returned to pool
# FIX: Use context manager def get_data(): with pool.getconn() as conn: # Auto-returned on exit with conn.cursor() as cursor: # Auto-closed cursor.execute("SELECT * FROM big_table") return cursor.fetchall()
3. Event listeners / callbacks not cleaned up ────────────────────────────────────────────── // Node.js: Adding listeners without removing them app.get('/stream', (req, res) => { const listener = (data) => res.write(data); eventEmitter.on('data', listener); // If client disconnects, listener is never removed! // Each request adds another listener → memory leak
// FIX: req.on('close', () => { eventEmitter.removeListener('data', listener); }); });
4. Large ORM object graphs ───────────────────────── # Django: Loading entire related object tree into memory orders = Order.objects.select_related('user', 'product', 'product__category', 'product__manufacturer').all() # With 1M orders: loads millions of objects into memory
# FIX: Use pagination and only load needed fields orders = Order.objects.values('id', 'total', 'user__name')\ .filter(status='pending')[:100]
DEBUGGING MEMORY LEAKS ═══════════════════════════════════════════════════════
Python: - tracemalloc: Track memory allocations - objgraph: Visualize object reference graphs - memory_profiler: Line-by-line memory usage
Node.js: - --inspect flag + Chrome DevTools heap snapshot - heapdump module - clinic.js (doctor, bubbleprof, flame)
Go: - pprof: Built-in profiling (go tool pprof) - runtime.ReadMemStats()
Java: - VisualVM / JConsole - jmap -heap <pid> - Eclipse MAT for heap dump analysis7.4 Lock Contention
Section titled “7.4 Lock Contention”The Scenario
Section titled “The Scenario”Under high write load, throughput plateaus and latency spikes. Most database connections show wait_event_type = Lock.
DIAGNOSIS ═══════════════════════════════════════════════════════
-- Find blocking and blocked queries SELECT blocked.pid AS blocked_pid, blocked.query AS blocked_query, blocking.pid AS blocking_pid, blocking.query AS blocking_query, NOW() - blocked.query_start AS blocked_duration FROM pg_stat_activity blocked JOIN pg_locks bl ON bl.pid = blocked.pid JOIN pg_locks kl ON kl.locktype = bl.locktype AND kl.database IS NOT DISTINCT FROM bl.database AND kl.relation IS NOT DISTINCT FROM bl.relation AND kl.page IS NOT DISTINCT FROM bl.page AND kl.tuple IS NOT DISTINCT FROM bl.tuple AND kl.pid != bl.pid JOIN pg_stat_activity blocking ON kl.pid = blocking.pid WHERE NOT bl.granted;
COMMON LOCK CONTENTION PATTERNS:
Pattern 1: Hot row updates ────────────────────────── -- 1000 concurrent requests all updating the same counter: UPDATE product_stats SET view_count = view_count + 1 WHERE product_id = 42; -- Each must wait for the previous to commit
Fix: Batch updates -- Increment in Redis, flush to DB periodically INCR product:42:views -- Redis, lock-free, O(1) -- Background job every 30s: UPDATE product_stats SET view_count = view_count + $1 WHERE product_id = $2;
Pattern 2: Long transactions holding locks ────────────────────────────────────────── BEGIN; SELECT * FROM orders WHERE id = 1 FOR UPDATE; -- Locks row -- ... expensive business logic takes 30 seconds ... UPDATE orders SET status = 'processed' WHERE id = 1; COMMIT; -- Row locked for 30 seconds!
Fix: Minimize lock duration -- Do expensive work BEFORE taking the lock data = expensive_calculation() BEGIN; UPDATE orders SET status = 'processed', data = $1 WHERE id = 1; COMMIT; -- Lock held for milliseconds
Pattern 3: DDL operations on busy tables ───────────────────────────────────────── ALTER TABLE orders ADD COLUMN discount DECIMAL; -- Takes AccessExclusiveLock → blocks ALL reads and writes! -- On a table with millions of rows, this can take minutes.
Fix: Use CONCURRENTLY where possible CREATE INDEX CONCURRENTLY idx_... ON orders (...); -- For column additions in PostgreSQL: -- Adding a column with a default is fast since PG 11 -- (stored in catalog, not rewritten) ALTER TABLE orders ADD COLUMN discount DECIMAL DEFAULT 0;7.5 Race Conditions
Section titled “7.5 Race Conditions”The Scenario
Section titled “The Scenario”Two users simultaneously purchase the last item in stock. Both succeed. You now have -1 inventory.
THE RACE CONDITION ═══════════════════════════════════════════════════════
Time User A User B ──── ────── ────── T1 SELECT stock FROM products WHERE id = 42 → stock = 1 T2 SELECT stock FROM products WHERE id = 42 → stock = 1 T3 if stock > 0: proceed T4 if stock > 0: proceed T5 UPDATE products SET stock = stock - 1 WHERE id = 42 → stock = 0 T6 UPDATE products SET stock = stock - 1 WHERE id = 42 → stock = -1 !!
SOLUTIONS ═══════════════════════════════════════════════════════
Solution 1: Atomic UPDATE with condition ───────────────────────────────────────── UPDATE products SET stock = stock - 1 WHERE id = 42 AND stock > 0 RETURNING stock; -- If 0 rows affected → out of stock (no separate SELECT needed) -- The WHERE clause and UPDATE are atomic within a single statement
Solution 2: SELECT FOR UPDATE (pessimistic lock) ───────────────────────────────────────────────── BEGIN; SELECT stock FROM products WHERE id = 42 FOR UPDATE; -- Locks the row. Other transactions wait here. IF stock > 0: UPDATE products SET stock = stock - 1 WHERE id = 42; COMMIT;
Solution 3: Optimistic locking (version column) ───────────────────────────────────────────────── -- Read current state SELECT stock, version FROM products WHERE id = 42; → stock=1, version=5
-- Update only if version hasn't changed UPDATE products SET stock = stock - 1, version = version + 1 WHERE id = 42 AND version = 5; -- If 0 rows affected → someone else modified it → retry
Solution 4: Distributed lock (Redis) ───────────────────────────────────── -- Acquire lock SET lock:product:42 unique_id NX EX 10 -- NX=only if not exists, EX=10s TTL -- If OK → proceed with purchase -- If nil → someone else has lock → wait/retry -- After operation: DEL lock:product:42 -- Release (verify unique_id first!)7.6 Security Breaches — Incident Response
Section titled “7.6 Security Breaches — Incident Response”The Scenario
Section titled “The Scenario”Your monitoring alerts on unusual database activity. A large volume of data is being read from the users table at 3 AM. No deployments are scheduled.
INCIDENT RESPONSE PLAYBOOK ═══════════════════════════════════════════════════════
PHASE 1: DETECT & CONTAIN (First 15 minutes) ───────────────────────────────────────────── 1. Verify the alert is real (not false positive) - Check audit logs - Check if it's an expected batch job - Check who/what is making the queries
2. Identify the compromised component - Which server IP is making the queries? - Which database user/role? - Which application service?
3. Contain the breach - Revoke the compromised credentials immediately ALTER USER compromised_user WITH NOLOGIN; - Block the source IP at the firewall - If application is compromised: isolate the server - Do NOT shut down servers (preserve evidence)
PHASE 2: ASSESS (Next 1-2 hours) ────────────────────────────────── 4. Determine scope of access - What data was accessed? (audit logs) - How long did the attacker have access? - What privileges did the compromised account have? - Were other systems accessed? (lateral movement)
5. Determine attack vector - SQL injection? (check application logs for payloads) - Stolen credentials? (check for credential reuse, phishing) - SSRF? (check for internal requests to metadata endpoints) - Supply chain? (check for recent dependency updates)
PHASE 3: ERADICATE & RECOVER (Hours to days) ────────────────────────────────────────────── 6. Remove attacker access - Rotate ALL credentials (not just compromised ones) - Revoke all active sessions/tokens - Patch the vulnerability that was exploited - Review and remove any backdoors (new users, cron jobs, SSH keys)
7. Restore from clean state - If data was modified: restore from pre-breach backup - Verify data integrity - Re-enable services one by one with monitoring
PHASE 4: POST-MORTEM (Within 1 week) ───────────────────────────────────── 8. Document everything - Timeline of events - Root cause analysis - What worked and didn't work in response - Action items to prevent recurrence - No blame — focus on systemic improvements7.7 Cache Stampede
Section titled “7.7 Cache Stampede”The Scenario
Section titled “The Scenario”A popular cache key expires. Thousands of requests simultaneously miss the cache and hit the database. Database CPU spikes to 100%. Latency goes from 5ms to 30 seconds.
THE STAMPEDE ═══════════════════════════════════════════════════════
Time 0:00 — Cache key "homepage_products" expires (TTL was 5 minutes)
Time 0:01 — 500 requests arrive simultaneously All get cache MISS All query database: SELECT * FROM products JOIN categories ON ... WHERE featured = true ORDER BY popularity DESC LIMIT 50; (This query takes 200ms)
Time 0:01 — Database receives 500 identical heavy queries Time 0:05 — Database CPU at 100%, all queries slowing down Time 0:10 — Connection pool exhausted Time 0:15 — Application returns 503 errors Time 0:20 — First query completes, populates cache Time 0:21 — Remaining 499 queries still running Time 1:00 — System recovers
PREVENTION STRATEGIES ═══════════════════════════════════════════════════════
Strategy 1: Locking / Single-flight ┌──────────────────────────────────────────────────────┐ │ Request 1: MISS → acquire lock → query DB → set cache → release lock │ │ Request 2: MISS → lock taken → WAIT → cache now populated → return │ │ Request 3: MISS → lock taken → WAIT → cache now populated → return │ │ │ │ Only 1 request hits the database. │ │ │ │ Redis implementation: │ │ SET cache_lock:homepage_products unique_id NX EX 10 │ │ If acquired: query DB, set cache, release lock │ │ If not acquired: wait 50ms, retry GET cache key │ └──────────────────────────────────────────────────────┘
Strategy 2: Stale-while-revalidate ┌──────────────────────────────────────────────────────┐ │ Cache has two TTLs: │ │ - Soft TTL (5 min): triggers background refresh │ │ - Hard TTL (1 hour): actual expiration │ │ │ │ Request arrives, key is past soft TTL but not hard: │ │ → Return stale cached data immediately │ │ → Trigger async background refresh │ │ → Next request gets fresh data │ │ │ │ Users always get a fast response (stale or fresh). │ │ Database never gets stampeded. │ └──────────────────────────────────────────────────────┘
Strategy 3: Probabilistic early expiration ┌──────────────────────────────────────────────────────┐ │ Before cache expires, each request has a small │ │ probability of refreshing: │ │ │ │ should_refresh = (random() < exp(-ttl_remaining / │ │ beta)) │ │ │ │ As TTL approaches 0, probability → 1 │ │ One request will "volunteer" to refresh before │ │ actual expiry, preventing mass stampede. │ └──────────────────────────────────────────────────────┘7.8 DDoS Attacks
Section titled “7.8 DDoS Attacks”The Scenario
Section titled “The Scenario”Your API is receiving 10x normal traffic from distributed IPs. Legitimate users can’t access the service.
DDOS TYPES & DEFENSES ═══════════════════════════════════════════════════════
TYPE 1: Volumetric (L3/L4) ── UDP flood, SYN flood, amplification attacks ── Saturates network bandwidth ── Defense: CDN/DDoS provider (Cloudflare, AWS Shield) ── You cannot handle this at application level
TYPE 2: Protocol (L4) ── SYN flood, Slowloris ── Exhausts server connection tables ── Defense: - SYN cookies (kernel-level) - Connection timeouts (keep-alive limits) - Load balancer connection limits
TYPE 3: Application (L7) ── HTTP flood, expensive API calls, login brute force ── Looks like legitimate traffic ── Most dangerous — hardest to distinguish from real users ── Defense: ┌──────────────────────────────────────────────────┐ │ Layer 1: WAF rules │ │ - Block known bad patterns │ │ - Challenge suspicious IPs (CAPTCHA) │ │ │ │ Layer 2: Rate limiting │ │ - Per IP: 100 req/min │ │ - Per user: 1000 req/min │ │ - Per endpoint: /login = 5 req/min │ │ │ │ Layer 3: Adaptive throttling │ │ - Monitor overall request rate │ │ - When above threshold, increase strictness │ │ - Prioritize authenticated users │ │ - Serve cached responses for GET requests │ │ │ │ Layer 4: Resource protection │ │ - Limit request body size │ │ - Timeout slow requests │ │ - Queue expensive operations │ │ - Circuit breaker on overloaded services │ └──────────────────────────────────────────────────┘7.9 Hot Partitions
Section titled “7.9 Hot Partitions”The Scenario
Section titled “The Scenario”You sharded your database by user_id. A celebrity user with 50M followers causes their shard to be overwhelmed while other shards are idle.
THE PROBLEM ═══════════════════════════════════════════════════════
Shard 0: 5% of traffic (normal users) Shard 1: 5% of traffic (normal users) Shard 2: 80% of traffic (celebrity user + followers) ← HOT! Shard 3: 10% of traffic (normal users)
SOLUTIONS ═══════════════════════════════════════════════════════
Solution 1: Sub-sharding / Salting ────────────────────────────────── Instead of shard_key = user_id Use shard_key = user_id + random(0..N)
Celebrity user's data spread across N sub-shards: user_42_0 → Shard 0 user_42_1 → Shard 1 user_42_2 → Shard 2 user_42_3 → Shard 3
Reads must query all N sub-shards and merge. Tradeoff: Write distribution vs read complexity.
Solution 2: Caching layer absorbs hot reads ──────────────────────────────────────────── Celebrity's profile and posts cached with longer TTL. Cache handles 99% of reads, DB only handles writes.
Solution 3: Read replicas per hot shard ─────────────────────────────────────── Shard 2 gets 3 read replicas, other shards get 1. Scale the hot shard independently.
Solution 4: Separate hot data into dedicated infrastructure ────────────────────────────────────────────────────────── Detect "hot" entities automatically. Route them to a dedicated, scaled-up instance. Route normal entities to normal shards.7.10 Token Leaks
Section titled “7.10 Token Leaks”The Scenario
Section titled “The Scenario”An engineer accidentally commits an API key or JWT secret to a public GitHub repository. The key is harvested by automated bots within minutes.
TIMELINE OF A TOKEN LEAK ═══════════════════════════════════════════════════════
T+0s: Developer commits .env file with API keys T+30s: Pushes to public GitHub repository T+60s: GitHub secret scanning alerts (if enabled) T+90s: Automated bots scrape GitHub for leaked secrets T+2min: Attacker uses AWS keys to spin up crypto miners T+5min: $10,000 in cloud charges already accruing T+1hr: AWS billing alert fires T+2hr: Team notices and revokes keys
Cost: $10,000+ and a weekend of incident response
PREVENTION ═══════════════════════════════════════════════════════
Pre-commit: ✓ git-secrets or truffleHog as pre-commit hook ✓ .gitignore includes .env, *.pem, *.key, credentials.* ✓ IDE plugins that warn on secret patterns
Repository: ✓ GitHub secret scanning (enabled by default for public repos) ✓ Branch protection rules (require review) ✓ Git history scanning (secrets in old commits are still exposed!)
Runtime: ✓ Use environment variables or secret managers, never hardcoded ✓ Short-lived tokens (expire in minutes/hours, not months/years) ✓ Rotate secrets on every deploy
Response plan: ✓ Immediately revoke the leaked credential (don't just remove from code) ✓ Audit usage (check CloudTrail/audit logs for unauthorized use) ✓ Force-rotate all secrets of the same type ✓ Scrub from git history: git filter-branch or BFG Repo-Cleaner (But assume the secret is already compromised!)7.11 Debugging Checklist (Senior Engineer’s Playbook)
Section titled “7.11 Debugging Checklist (Senior Engineer’s Playbook)” ┌──────────────────────────────────────────────────────────────┐ │ PRODUCTION DEBUGGING PLAYBOOK │ │ │ │ 1. DON'T PANIC. Take a breath. Assess. │ │ │ │ 2. Is this a NEW problem or a RECURRING one? │ │ → Check for recent deployments (git log, deploy history) │ │ → Check for infrastructure changes │ │ → Check for data volume changes │ │ │ │ 3. SCOPE the impact: │ │ → Which users are affected? All? Some? One? │ │ → Which endpoints? All? Specific ones? │ │ → Which regions? All? Specific DCs? │ │ │ │ 4. CHECK the usual suspects (in this order): │ │ a. Recent deploys → rollback if obvious │ │ b. Database (slow queries, locks, connections) │ │ c. External dependencies (third-party APIs down?) │ │ d. Infrastructure (disk full, network issues) │ │ e. Traffic spike (DDoS, viral content, bot) │ │ │ │ 5. USE observability tools: │ │ → Metrics dashboard (Grafana): overall health │ │ → Distributed traces (Jaeger): specific request path │ │ → Logs (ELK): error details │ │ → Database monitoring: query performance │ │ │ │ 6. MITIGATE before fixing: │ │ → Can we cache it? (temporary cache for hot endpoint) │ │ → Can we rate limit it? (protect downstream) │ │ → Can we kill it? (disable non-critical feature) │ │ → Can we scale it? (add instances) │ │ │ │ 7. FIX the root cause: │ │ → Fix in development, test, then deploy │ │ → Don't hotfix production directly unless critical │ │ │ │ 8. POST-MORTEM: │ │ → What happened? Timeline. │ │ → Why? Root cause. │ │ → How do we prevent it? Action items. │ │ → Blameless. Focus on systems, not people. │ └──────────────────────────────────────────────────────────────┘