Complex SQL Queries
This learning is based on a real production incident that affected customers.
Let’s examine how to ensure design and code reviews properly consider scalability and downstream product impact, using a real-world example where an un-optimized SQL query and an underscaled database caused hundreds of failed wallet transactions in the Paymaster Gas Station (PGS) service.
The Scenario
A single customer generated ~14,000 transactions within minutes. Each transaction triggered a complex 24-hour policy-check query. The writer Aurora Postgres instance (db.t4g.medium, 2 vCPU/4 GiB) could not sustain the read-heavy sequential scans, causing CPU saturation, elevated latencies, and 942 transaction failures. Tech-Ops mitigated by manually scaling the instance to db.r6i.xlarge (4 vCPU/32 GiB), after which service health recovered.
-- Simplified version of the expensive policy-check query
SELECT sender,
SUM(network_fee) AS daily_fee
FROM user_operations_transactions
WHERE paymaster_id = $1
AND created_at >= NOW() - INTERVAL '24 hours'
GROUP BY sender
HAVING SUM(network_fee) > $2;The incident revealed that the design team had noted scalability risk and even spiked a Redis-based replacement, but the work was de-prioritized in favour of new features. Periodic load testing covered average traffic but not worst-case daily-growth scenarios, and query/DB scaling recommendations were never executed.
PR Comment
Choose the comment that you think is the most constructive and helpful.
Click here to learn more
Key Lessons
1. Understanding Product Impact
- Consider how changes affect the entire system
- Map out dependencies and downstream effects
- Verify changes don’t break existing functionality
- Document product impact in PR descriptions
2. Testing Strategy
- Don’t just update unit tests without understanding why
- Include end-to-end testing for critical flows
- Test integration points between systems
- Verify data transformations maintain required formats
3. Code Review Best Practices
- Reviewers should understand the product context
- Ask questions about system-wide impact
- Verify test coverage matches the change scope
- Consider both technical and product implications
Tips for Reviewers
1. Ask Product-Focused Questions
- How does this change affect the user experience?
- What systems depend on this data?
- Are there any downstream effects we should test?
- Example: “How does this affect the payment confirmation flow?“
2. Verify Testing Strategy
- Are the tests comprehensive enough?
- Do they cover all affected systems?
- Is there end-to-end testing for critical flows?
- Example: “Let’s add integration tests for the email system”
3. Document Dependencies
- List systems that might be affected
- Note any data format requirements
- Document integration points
- Example: “Payment emails expect user data in format X”
Common Pitfalls to Avoid
1. Focusing Only on Code
- ❌ “The code looks good, let’s merge.”
- ✅ “The code looks good, but let’s verify the product impact.”
2. Assuming Tests Are Sufficient
- ❌ “The tests pass, so it’s good.”
- ✅ “The tests pass, but let’s verify the end-to-end flow.”
3. Missing Integration Points
- ❌ “This is just a data change.”
- ✅ “This data change affects multiple systems, let’s verify them all.”
Remember: A good code review considers both technical correctness and product impact. Understanding how changes affect the entire system helps prevent issues before they reach production!