Chapter 9 — Questions for Deep Understanding
Chapter 9 — Questions for Deep Understanding
Section titled “Chapter 9 — Questions for Deep Understanding”These questions are designed to test whether you truly understand the concepts, not just memorized definitions. For each question, think about the WHY before answering.
Section A: Database Internals [E]
Section titled “Section A: Database Internals [E]”A1. Index Selection
Section titled “A1. Index Selection”You have a table orders with 50M rows and this query:
SELECT order_id, total FROM ordersWHERE user_id = 42 AND status = 'shipped'AND created_at BETWEEN '2024-01-01' AND '2024-06-30'ORDER BY created_at DESCLIMIT 20;Design the optimal index. Explain:
- Which columns should be in the index and in what order? Why that order?
- Should any columns be INCLUDE’d rather than part of the key?
- Would a partial index help here? Under what conditions?
- How would the optimal index change if the
LIMIT 20were removed?
A2. MVCC and Bloat
Section titled “A2. MVCC and Bloat”A PostgreSQL table receives 10,000 updates per second to a hot set of 1,000 rows. After a week, the table has grown from 100MB to 10GB even though the actual live data is still 100MB.
- Why did this happen? Explain the MVCC mechanism that causes this.
- What is the role of VACUUM? Why doesn’t autovacuum prevent this?
- What are the performance implications of this bloat?
- How would you fix this without downtime?
- How would you prevent this from happening again?
A3. Isolation Level Puzzle
Section titled “A3. Isolation Level Puzzle”Consider this scenario with two concurrent transactions at READ COMMITTED isolation:
T1: BEGIN;T1: SELECT balance FROM accounts WHERE id = 1; -- Returns 1000T2: BEGIN;T2: UPDATE accounts SET balance = balance - 200 WHERE id = 1;T2: COMMIT;T1: UPDATE accounts SET balance = balance - 500 WHERE id = 1;T1: COMMIT;- What is the final balance? Is it 300 or 500? Explain why.
- Would the answer change under
REPEATABLE READ? - Would the answer change under
SERIALIZABLE? - How does PostgreSQL’s
READ COMMITTEDhandle the UPDATE in T1 after T2 has committed?
A4. Join Performance
Section titled “A4. Join Performance”You have:
userstable: 1M rows,idis primary keyorderstable: 50M rows,user_idhas a B-tree index- Query:
SELECT u.name, COUNT(*) FROM users u JOIN orders o ON u.id = o.user_id GROUP BY u.name
- What join algorithm will the planner likely choose? Why?
- If you add
WHERE u.country = 'US'(reducing users to 300K), does the join strategy change? - If you add
WHERE o.created_at > '2024-01-01'(reducing orders to 5M), does it change? - What happens if statistics are stale and the planner thinks
u.country = 'US'matches only 100 rows?
Section B: Security [A/E]
Section titled “Section B: Security [A/E]”B1. JWT Vulnerability Analysis
Section titled “B1. JWT Vulnerability Analysis”A developer implements JWT authentication with this code:
const token = jwt.sign({ userId: user.id, role: user.role }, SECRET, { expiresIn: '30d' });And verification:
const decoded = jwt.verify(token, SECRET);req.user = decoded;List at least 5 security issues with this implementation and how to fix each one.
B2. SQL Injection Bypass
Section titled “B2. SQL Injection Bypass”An application uses this “security” measure:
def sanitize(input): return input.replace("'", "''").replace("--", "").replace(";", "")- Can you still perform SQL injection? How?
- What are at least 3 different bypass techniques for this type of sanitization?
- Why is this approach fundamentally flawed regardless of how many characters you blacklist?
B3. OAuth Flow Attack
Section titled “B3. OAuth Flow Attack”A web application implements OAuth 2.0 Authorization Code flow but makes these mistakes:
- Does not use the
stateparameter - Stores tokens in localStorage
- Uses the Implicit flow for the SPA instead of Authorization Code + PKCE
For each mistake:
- Describe the specific attack that becomes possible
- Describe the attack flow step by step
- Describe the correct implementation
B4. Race Condition Exploit
Section titled “B4. Race Condition Exploit”A money transfer API:
@app.route('/transfer', methods=['POST'])def transfer(): amount = request.json['amount'] from_account = Account.query.get(request.json['from']) to_account = Account.query.get(request.json['to']) if from_account.balance >= amount: from_account.balance -= amount to_account.balance += amount db.session.commit() return {'status': 'success'} return {'status': 'insufficient funds'}, 400- How can an attacker exploit this to transfer more money than they have?
- Write the exact sequence of concurrent requests that demonstrates the exploit.
- Provide 3 different approaches to fix this, with tradeoffs for each.
Section C: Architecture & System Design [A/E]
Section titled “Section C: Architecture & System Design [A/E]”C1. Caching Strategy Design
Section titled “C1. Caching Strategy Design”You’re building a social media timeline feature. Users see posts from people they follow, ordered by recency. Requirements:
- 100M users, average 200 followers each
- 1M new posts per hour
- Timeline must load in < 200ms
- Posts should appear within 30 seconds of creation
Design the caching strategy:
- What do you cache? (user timelines? individual posts? both?)
- When do you invalidate? (push vs. pull model)
- How do you handle celebrity users with 50M followers?
- What happens during a cache failure? (graceful degradation)
- How do you handle the cold start problem? (new user, empty cache)
C2. Scaling Puzzle
Section titled “C2. Scaling Puzzle”Your application has grown to 100K requests per second. The database is the bottleneck:
- 80% reads, 20% writes
- Average read query: 3ms
- Average write query: 15ms
- Current: single PostgreSQL instance, 32 cores, 256GB RAM
Design a scaling strategy:
- What is the theoretical maximum throughput of the current setup?
- What changes would you make first? Second? Third?
- At what point would you consider sharding? What would you shard by?
- How do you handle cross-shard queries?
- What monitoring would you add to validate each scaling step?
C3. Incident Design
Section titled “C3. Incident Design”Design the monitoring and alerting for a payment processing system:
- What metrics would you track? (list at least 10)
- What are the alerting thresholds?
- What’s the difference between a warning and a critical alert?
- How would you detect a subtle data integrity issue (e.g., double-charging customers 0.1% of the time)?
- Design the on-call runbook for the top 3 most likely incidents.
C4. Consistency Challenge
Section titled “C4. Consistency Challenge”You’re building an e-commerce checkout system across microservices:
- Inventory Service (decrements stock)
- Payment Service (charges credit card)
- Order Service (creates order record)
- Notification Service (sends email)
- What happens if Payment succeeds but Inventory fails?
- Design a saga pattern for this checkout flow.
- How do you handle the compensation transaction for each step?
- How do you ensure idempotency (user double-clicks checkout)?
- What isolation level and locking strategy would you use for the inventory check?
Section D: Performance Deep Dive [E]
Section titled “Section D: Performance Deep Dive [E]”D1. Query Optimization Puzzle
Section titled “D1. Query Optimization Puzzle”This query runs in 30 seconds:
SELECT u.name, u.email, (SELECT COUNT(*) FROM orders o WHERE o.user_id = u.id) as order_count, (SELECT MAX(o.created_at) FROM orders o WHERE o.user_id = u.id) as last_orderFROM users uWHERE u.status = 'active'ORDER BY order_count DESCLIMIT 50;- Why is it slow? What is the execution pattern?
- Rewrite it to be fast. Show at least 2 different approaches.
- What indexes support each rewrite?
- Which approach is better for 1M users? For 100M users?
D2. Connection Pool Sizing
Section titled “D2. Connection Pool Sizing”Your application has 10 app server instances, each with a connection pool of 50 connections, hitting a single PostgreSQL database with max_connections = 200.
- What’s wrong with this configuration?
- What will happen under high load?
- What is the optimal pool size per instance?
- If you add PgBouncer, how does the math change?
- How do you handle connection pool exhaustion gracefully?
D3. Cache Coherency
Section titled “D3. Cache Coherency”You have two application instances reading from and writing to the same PostgreSQL database, with a shared Redis cache.
Instance A updates a user’s email:
- Writes new email to PostgreSQL
- Deletes cache key for that user
Between steps 1 and 2, Instance B reads the user:
-
Checks Redis cache → HIT (stale data with old email)
-
Returns old email to the client
-
Is this a real problem? Under what circumstances?
-
How would you fix this?
-
What are the tradeoffs between delete-then-write vs. write-then-delete?
-
Can you design a solution with zero-inconsistency window? What’s the cost?
Section E: Thought Experiments [E]
Section titled “Section E: Thought Experiments [E]”E1. Zero-Downtime Migration
Section titled “E1. Zero-Downtime Migration”You need to change a column type from VARCHAR(100) to TEXT on a table with 500M rows in PostgreSQL, with zero downtime.
- What happens if you just run
ALTER TABLE ... ALTER COLUMN ... TYPE TEXT? - Design a migration strategy that doesn’t lock the table.
- What if the column change was
INTEGERtoBIGINT? - How do you verify data integrity after the migration?
E2. The Impossible Request
Section titled “E2. The Impossible Request”Your VP asks: “Make the API respond in under 10ms for all requests, globally.”
Your system: US-East database, users worldwide. Physics says the speed of light from US-East to Singapore is ~120ms round-trip.
- Explain why this is physically impossible with a centralized database.
- Design an architecture that gets as close as possible to the requirement.
- What consistency tradeoffs are you making?
- What data can be served from the edge vs. what must hit the origin?
E3. Security Audit
Section titled “E3. Security Audit”You inherit a backend codebase. You have 1 week to do a security audit.
- What do you check first? (Priority order)
- What automated tools do you run?
- What manual checks are most important?
- How do you prioritize fixing the issues you find?
- What quick wins can you implement in week 1?
Answer Key Hints
Section titled “Answer Key Hints”These are not full answers, but hints to check your thinking:
- A1: Think about the leftmost prefix rule. Equality columns before range columns. The ORDER BY can use the index if it matches.
- A3: PostgreSQL re-evaluates the WHERE clause of the UPDATE after the conflicting transaction commits. The final balance depends on whether the UPDATE uses the old snapshot or the new committed value.
- B2: Think about encoding bypasses (Unicode, URL encoding), numeric context (no quotes needed), second-order injection, and encoding-specific attacks (GBK multibyte).
- C2: Don’t jump to sharding. Read replicas + caching can handle most read-heavy workloads. Calculate: 100K req/s × 80% reads ÷ 3ms = need ~240 concurrent read capacity.
- D1: The correlated subqueries execute once per row in the outer query. With 1M active users, that’s 2M subquery executions.
- E1: In PostgreSQL, changing VARCHAR to TEXT is a metadata-only change (no table rewrite). But INTEGER to BIGINT requires rewriting every row.