Complex SQL Queries

This learning is based on a real production incident that affected customers.

Let’s examine how to ensure design and code reviews properly consider scalability and downstream product impact, using a real-world example where an un-optimized SQL query and an underscaled database caused hundreds of failed wallet transactions in the Paymaster Gas Station (PGS) service.

The Scenario

A single customer generated ~14,000 transactions within minutes. Each transaction triggered a complex 24-hour policy-check query. The writer Aurora Postgres instance (db.t4g.medium, 2 vCPU/4 GiB) could not sustain the read-heavy sequential scans, causing CPU saturation, elevated latencies, and 942 transaction failures. Tech-Ops mitigated by manually scaling the instance to db.r6i.xlarge (4 vCPU/32 GiB), after which service health recovered.


-- Simplified version of the expensive policy-check query
SELECT sender,
       SUM(network_fee) AS daily_fee
FROM   user_operations_transactions
WHERE  paymaster_id = $1
AND    created_at >= NOW() - INTERVAL '24 hours'
GROUP  BY sender
HAVING SUM(network_fee) > $2;

The incident revealed that the design team had noted scalability risk and even spiked a Redis-based replacement, but the work was de-prioritized in favour of new features. Periodic load testing covered average traffic but not worst-case daily-growth scenarios, and query/DB scaling recommendations were never executed.

PR Comment

Choose the comment that you think is the most constructive and helpful.

Trix Beigecommented on Apr 2, 2025

The SQL looks fine, but we should run a quick load test.

Cheri Limecommented on Apr 2, 2025

Checked everything, the code looks good. The tests look like they work for the new code.

Approved.

Lib Amaranthcommented on Apr 2, 2025

The data transformation looks correct, but we need to verify the payment email system still works with the new data structure.

Joelly Whitecommented on Apr 2, 2025

Thanks for implementing the data migration! Before we merge, let’s ensure we understand the full product impact:

How does this change affect the payment confirmation email system?
Can you walk through the end-to-end flow to verify the email content is correct?
Should we add integration tests for the email system with the new data structure?

This will help us catch any unintended consequences before deployment.

Click here to learn more

Key Lessons

1. Understanding Product Impact

Consider how changes affect the entire system
Map out dependencies and downstream effects
Verify changes don’t break existing functionality
Document product impact in PR descriptions

2. Testing Strategy

Don’t just update unit tests without understanding why
Include end-to-end testing for critical flows
Test integration points between systems
Verify data transformations maintain required formats

3. Code Review Best Practices

Reviewers should understand the product context
Ask questions about system-wide impact
Verify test coverage matches the change scope
Consider both technical and product implications

Tips for Reviewers

1. Ask Product-Focused Questions

How does this change affect the user experience?
What systems depend on this data?
Are there any downstream effects we should test?
Example: “How does this affect the payment confirmation flow?“

2. Verify Testing Strategy

Are the tests comprehensive enough?
Do they cover all affected systems?
Is there end-to-end testing for critical flows?
Example: “Let’s add integration tests for the email system”

3. Document Dependencies

List systems that might be affected
Note any data format requirements
Document integration points
Example: “Payment emails expect user data in format X”

Common Pitfalls to Avoid

1. Focusing Only on Code

❌ “The code looks good, let’s merge.”
✅ “The code looks good, but let’s verify the product impact.”

2. Assuming Tests Are Sufficient

❌ “The tests pass, so it’s good.”
✅ “The tests pass, but let’s verify the end-to-end flow.”

3. Missing Integration Points

❌ “This is just a data change.”
✅ “This data change affects multiple systems, let’s verify them all.”

Remember: A good code review considers both technical correctness and product impact. Understanding how changes affect the entire system helps prevent issues before they reach production!