Cross-Service Impact
This lesson is based on a real production incident that affected multiple products and their users, including Circle Mint customers and Web3 developers.
Let’s examine how to properly review changes that affect critical user permissions and access controls, using a real-world example where a data corruption incident blocked developers from upgrading to mainnet and customers from completing onboarding.
The Scenario
A data corruption incident in the identity service affected the authorized representative (AR) flags, which led to:
- Mint users unable to continue onboarding after login.
- Developers unable to upgrade to mainnet.
- Diameter users not seeing AR flags.
- Periodic review submission data potentially impacted.
- Extended resolution time due to complex data dependencies.
While the immediate trigger was an untracked PR that accidentally removed an SQL WHERE clause, the broader root causes were:
- Lack of proper error handling and fallbacks in dependent services.
- No validation of affected row counts in SQL updates.
- Missing monitoring for permission state changes.
- Insufficient safeguards against unintended data modifications.
- No automated checks for critical permission changes.
The incident started with a seemingly simple PR to fix a button in Diameter for re-designating ARs, but cascaded into a system-wide issue due to these underlying gaps.
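One of the gaps listed above, the missing validation of affected row counts, can be sketched as a small guard around a write. This is a minimal sketch under stated assumptions, not the identity service's actual code: the `Tx` interface, the `guardedUpdate` function, and the `maxExpectedRows` parameter are hypothetical names, and the interface is synchronous only to keep the example short. With a promise-based database client the same calls would be awaited.

```typescript
// Hypothetical transaction interface, kept synchronous for brevity.
// Real clients are usually promise-based and their method names will differ.
interface Tx {
  execute(sql: string, params: Record<string, unknown>): number; // returns affected row count
  commit(): void;
  rollback(): void;
}

// Runs one write and commits only if the affected row count stays within the expected bound.
function guardedUpdate(
  tx: Tx,
  sql: string,
  params: Record<string, unknown>,
  maxExpectedRows: number,
): number {
  let affected: number;
  try {
    affected = tx.execute(sql, params);
  } catch (err) {
    // The statement itself failed, so nothing should be committed.
    tx.rollback();
    throw err;
  }
  if (affected > maxExpectedRows) {
    // Too many rows were touched: undo the write instead of committing it.
    tx.rollback();
    throw new Error(
      `Refusing to commit: statement touched ${affected} rows, expected at most ${maxExpectedRows}`,
    );
  }
  tx.commit();
  return affected;
}
```

For an AR re-designation, the caller would pass a bound close to the number of users in a single entity, so an update that suddenly touches every row in the table is rolled back rather than committed.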
The Code Change
Before
-- Original query with proper WHERE clause
UPDATE user_records
SET is_authorized_representative = false
WHERE entity_id = :entityId
AND email != :newArEmail;
Impact on Services
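The lack of proper error handling and fallbacks meant that when AR flags were corrupted, dependent services treated the bad data as a legitimate permission denial. The mainnet upgrade button, for example, disables itself whenever the permission is falsy, so missing or corrupted data looks the same as a user who genuinely lacks access: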
Before
const MainnetButton = () => {
  const { data } = useQuery(USER_PERMISSIONS_QUERY);
  if (!data?.canUpgradeToMainnet) {
    return (
      <Tooltip content="Only the owner can upgrade this account to a production environment">
        <Button disabled>Upgrade to Mainnet</Button>
      </Tooltip>
    );
  }
  return <Button>Upgrade to Mainnet</Button>;
};
PR Comment
Choose the comment that you think is the most constructive and helpful.
Why Wasn’t This Caught?
Several factors allowed this issue to reach production:
- Untracked Change: The PR was created outside the normal process, bypassing initial review gates.
- Missing Safety Checks: No automated validation of SQL changes that affect all records.
- Limited Error Handling: Services assumed the AR flags would always be valid.
- No Monitoring: No alerts for unusual permission changes or AR flag modifications.
- Testing Gaps: Test data didn’t represent production scale.
- Review Focus: Code reviews focused on the UI change, not the underlying data modification.
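The "Limited Error Handling" point can be illustrated with a small helper that separates "permission denied" from "permission state unknown". This is a sketch under assumptions: the `PermissionsResult` shape, the `UpgradeState` type, and the `resolveUpgradeState` function are hypothetical, since the real query schema is not shown here. It only covers missing or failed data; a flag that was wrongly set to false would still read as a denial, which is why server-side integrity checks are also needed.

```typescript
// Hypothetical shape of a permissions query result; the real schema may differ.
interface PermissionsResult {
  loading: boolean;
  error?: Error;
  data?: { canUpgradeToMainnet?: boolean | null };
}

type UpgradeState =
  | { kind: "allowed" }
  | { kind: "denied"; message: string }
  | { kind: "unknown"; message: string };

// Maps a query result to an explicit state so the UI never
// presents missing or failed data as a permission denial.
function resolveUpgradeState(result: PermissionsResult): UpgradeState {
  if (result.loading) {
    return { kind: "unknown", message: "Checking your permissions..." };
  }
  const flag = result.data?.canUpgradeToMainnet;
  if (result.error || typeof flag !== "boolean") {
    return {
      kind: "unknown",
      message: "We could not verify your permissions. Please retry or contact support.",
    };
  }
  return flag
    ? { kind: "allowed" }
    : {
        kind: "denied",
        message: "Only the owner can upgrade this account to a production environment",
      };
}
```

A component could then render a retry or support prompt for the `unknown` state and keep the owner-only tooltip for `denied`.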
Key Lessons
1. Access Control Fundamentals
- Validate data integrity constraints
- Add comprehensive monitoring
- Consider cross-service impact
- Document service dependencies
2. Testing Strategy
- Test all permission scenarios
- Verify data integrity
- Include cross-service tests
- Monitor data consistency
3. Review Best Practices
- Require multiple reviewers
- Use automated code analysis
- Consider data safety
- Plan for service disruptions
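As one illustration of "Use automated code analysis", a pre-merge check could flag UPDATE or DELETE statements that have no WHERE clause. The sketch below is deliberately naive and rests on assumptions: `findUnscopedWrites` is a hypothetical name, and the function uses regular expressions rather than a SQL parser, so string literals containing `;` or the word `where` would confuse it. A production check would more likely rely on a real SQL linter.

```typescript
interface Finding {
  statement: string;
  reason: string;
}

// Naive check: flags UPDATE/DELETE statements that contain no WHERE keyword.
function findUnscopedWrites(sql: string): Finding[] {
  // Drop single-line SQL comments so they cannot hide or fake a WHERE clause.
  const withoutComments = sql.replace(/--.*$/gm, "");
  return withoutComments
    .split(";")
    .map((s) => s.trim())
    .filter((s) => s.length > 0)
    .filter((s) => /^(update|delete)\b/i.test(s) && !/\bwhere\b/i.test(s))
    .map((s) => ({
      statement: s,
      reason: "UPDATE/DELETE without a WHERE clause affects every row",
    }));
}

// Illustrative input only: the exact statement from the incident is not shown in this lesson.
const findings = findUnscopedWrites(
  "UPDATE user_records SET is_authorized_representative = false;",
);
console.log(findings.length);
```

Run against the original scoped query from this lesson, the same function reports no findings, because the statement contains a WHERE clause.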
Tips for Reviewers
1. Ask Data-Focused Questions
- Is data integrity maintained?
- How are constraints enforced?
- What services are affected?
- Example: “How do we ensure exactly one AR per entity?”
2. Verify Testing Approach
- Are all scenarios tested?
- Is data validation verified?
- Are there integration tests?
- Example: “Can we test AR flag consistency?”
3. Document Requirements
- List service dependencies
- Note data constraints
- Document recovery procedures
- Example: “Document AR flag requirements”
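The reviewer question "How do we ensure exactly one AR per entity?" can be turned into a consistency check that runs in tests or as a periodic job. The sketch below assumes a hypothetical `UserRecord` shape loosely based on the columns in the query earlier in this lesson; `findArViolations` and `ArViolation` are illustrative names, and entities with no user rows at all are out of scope.

```typescript
// Hypothetical row shape; field names are adapted from the query in this lesson.
interface UserRecord {
  entityId: string;
  email: string;
  isAuthorizedRepresentative: boolean;
}

interface ArViolation {
  entityId: string;
  arCount: number;
}

// Returns every entity whose number of flagged ARs is not exactly one.
function findArViolations(records: UserRecord[]): ArViolation[] {
  const counts: Record<string, number> = {};
  for (const r of records) {
    counts[r.entityId] =
      (counts[r.entityId] ?? 0) + (r.isAuthorizedRepresentative ? 1 : 0);
  }
  const violations: ArViolation[] = [];
  for (const entityId of Object.keys(counts)) {
    const arCount = counts[entityId] ?? 0;
    if (arCount !== 1) {
      violations.push({ entityId, arCount });
    }
  }
  return violations;
}
```

After an update that clears the flag for every row, this check would report every entity with an `arCount` of 0, which makes the corruption visible before dependent services start failing.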
Common Pitfalls to Avoid
1. Focusing Only on Happy Path
- ❌ “The access check works for valid cases.”
- ✅ “The access check maintains data integrity and handles errors.”
2. Insufficient Validation
- ❌ “The flag is updated correctly.”
- ✅ “The flag update maintains the one-AR-per-entity constraint.”
3. Missing Cross-Service Impact
- ❌ “The change works in our service.”
- ✅ “The change is verified across all dependent services.”
Remember: A good access control review considers data integrity, cross-service impact, and proper validation. Understanding the full scope of permission changes and implementing proper safeguards help prevent widespread issues!