Component Upgrades
This learning is based on a real production incident that affected customers.
Let’s examine how to properly manage component upgrades, using a real-world example where an upgrade caused service inaccessibility.
The Scenario
A version upgrade to nginx caused a subset of requests to fail with a 504 error, resulting in signup errors in the developer console and other services.
- The upgrade caused a subset of requests to fail with a 504 error, leading to signup errors.
- The unanticipated security change invalidated certain annotations used by services
- Manual rollbacks were not well-coordinated, causing confusion about the system’s state.
- The breaking change was reintroduced due to a lack of immediate commit reverts.
- Testing in non-production environments was not exhaustive
cluster.tf
module "cluster" {
- source = "git@github.com:crcl-main/terraform-aws-eks-cluster?ref=34.20.1"
+ source = "git@github.com:crcl-main/terraform-aws-eks-cluster?ref=34.22.1"
}PR Comment
Choose the comment that you think is the most constructive and helpful.
Click here to learn more
Key Lessons
1. Importance of Thorough Testing and Validation
- Exhaustive Testing: Thoroughly test all changes in non-production environments
- Automated Monitoring: Implement automated monitoring to catch errors early
- Strict PR Reviews: Enforce strict PR reviews for critical changes
- Post-Deployment Checks: Include checks to verify changes post-deployment
2. Effective Communication and Coordination
- Clear Rollback Communication: Coordinate and communicate rollbacks clearly
- On-Call Handoff: Improve communication during on-call handoffs
- Incident Escalation: Establish clear escalation protocols
- Change Documentation: Document all changes and rollbacks thoroughly
3. Risk Management and Mitigation
- Risk Assessment: Conduct risk assessments before upgrades
- Rrr on the side of caution: Treat all versions as breaking until changes can be verified
- Fallback Plans: Develop and document fallback plans
- Audit Annotations: Regularly audit critical annotations
- Proactive Monitoring: Implement proactive monitoring to detect issues early
4. Continuous Improvement and Learning
- Root Cause Analysis: Perform detailed root cause analysis for all incidents
- Actionable Insights: Translate lessons learned from incidents into actionable insights and improvements
- Training and Awareness: Provide ongoing training and awareness programs for engineers
- Feedback Loop: Establish a feedback loop to continuously improve incident response and prevention strategies
Tips for Reviewers
1. Confirm Compatibility
- Are there any breaking changes in the new package version?
- Will the upgrade affect other dependencies?
- Are there any deprecations that need addressing?
- Example: “Does this package upgrade introduce any breaking changes that could affect our current setup?“
2. Validate Testing Coverage
- Are there tests covering the new package version?
- Have all existing tests been run and passed?
- Are there any new tests needed for the upgraded package?
- Example: “Have you run all tests to ensure compatibility with the new package version?“
3. Assess Documentation and Changelog
- Is the changelog for the new package version reviewed?
- Are there any new features or changes that need documentation updates?
- Is there a clear migration path provided by the package maintainers?
- Example: “Have you reviewed the changelog and updated our documentation accordingly?”
Common Pitfalls to Avoid
1. Relying on Semantic Versioning
- ❌ “It’s just a minor bump, so should be ok”
- ✅ “Even though it’s just a minor bump, we should verify the actual changes”
2. Environment Parity
- ❌ “It works on my local, so it should be good to be released to production”
- ✅ “It works on my local. We’ll need to verify on staging before releasing to production”
3. Skipping explicit verification of changes
- ❌ “The upgrade doesn’t seem to have broken our existing tests, so we’re good to go”
- ✅ “Let’s make sure all of the changes in the upgrade have corresponding passing tests”
Last updated on