Component Upgrades

This lesson is based on a real production incident that affected customers.

Let’s examine how to manage component upgrades properly, using a real-world example in which an upgrade made services inaccessible.

The Scenario

A version upgrade to nginx caused a subset of requests to fail with a 504 error, resulting in signup errors in the developer console and other services.

  1. The upgrade caused a subset of requests to fail with a 504 error, leading to signup errors.
  2. An unanticipated security change in the upgrade invalidated certain annotations used by services.
  3. Manual rollbacks were not well coordinated, causing confusion about the system’s state.
  4. The breaking change was reintroduced because the offending commit was not reverted immediately.
  5. Testing in non-production environments was not exhaustive.
cluster.tf

  module "cluster" {
-   source = "git@github.com:crcl-main/terraform-aws-eks-cluster?ref=34.20.1"
+   source = "git@github.com:crcl-main/terraform-aws-eks-cluster?ref=34.22.1"
  }

Key Lessons

1. Importance of Thorough Testing and Validation

  • Exhaustive Testing: Thoroughly test all changes in non-production environments
  • Automated Monitoring: Implement automated monitoring to catch errors early
  • Strict PR Reviews: Enforce strict PR reviews for critical changes
  • Post-Deployment Checks: Include checks to verify changes post-deployment
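
A post-deployment check can be as simple as probing the service after the change and looking at the share of server errors. The sketch below is a minimal illustration, not the check used in the incident: the function name `evaluate_probe_results` and the `max_error_rate` threshold are assumptions made for this example, and the status codes are a made-up sample.

```python
from collections import Counter


def evaluate_probe_results(status_codes, max_error_rate=0.01):
    """Summarize probe results and decide whether the deploy looks healthy.

    status_codes: HTTP status codes collected from post-deployment probes.
    max_error_rate: hypothetical threshold for the share of 5xx responses.
    """
    if not status_codes:
        # No data is not evidence of health, so fail closed.
        return {"healthy": False, "error_rate": None, "by_status": {}}

    counts = Counter(status_codes)
    server_errors = sum(n for code, n in counts.items() if 500 <= code <= 599)
    error_rate = server_errors / len(status_codes)
    return {
        "healthy": error_rate <= max_error_rate,
        "error_rate": error_rate,
        "by_status": dict(counts),
    }


if __name__ == "__main__":
    # Made-up sample: most probes succeed, a small subset returns 504.
    sample = [200] * 97 + [504] * 3
    print(evaluate_probe_results(sample))
```

Failing closed on an empty result set is a deliberate choice here: a check that collected nothing should not be read as a passing check.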

2. Effective Communication and Coordination

  • Clear Rollback Communication: Coordinate and communicate rollbacks clearly
  • On-Call Handoff: Improve communication during on-call handoffs
  • Incident Escalation: Establish clear escalation protocols
  • Change Documentation: Document all changes and rollbacks thoroughly

3. Risk Management and Mitigation

  • Risk Assessment: Conduct risk assessments before upgrades
  • Err on the Side of Caution: Treat all versions as breaking until changes can be verified
  • Fallback Plans: Develop and document fallback plans
  • Audit Annotations: Regularly audit critical annotations
  • Proactive Monitoring: Implement proactive monitoring to detect issues early
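
One way to audit annotations is to scan each service's annotations against a list that the team has flagged for review before an upgrade. The sketch below assumes the annotations have already been exported into a plain dictionary. The two flagged names are real ingress-nginx annotations used only as placeholders; the incident write-up does not say which annotations were invalidated, and `audit_annotations` is a hypothetical helper.

```python
# Hypothetical review list. These are real ingress-nginx annotation names,
# but they are placeholders: the source does not name the affected ones.
ANNOTATIONS_TO_REVIEW = {
    "nginx.ingress.kubernetes.io/configuration-snippet",
    "nginx.ingress.kubernetes.io/server-snippet",
}


def audit_annotations(services, flagged=ANNOTATIONS_TO_REVIEW):
    """Map each service name to the sorted flagged annotations it uses.

    services: dict of service name -> dict of annotation key -> value.
    Services that use no flagged annotation are left out of the result.
    """
    findings = {}
    for name, annotations in services.items():
        hits = sorted(set(annotations) & set(flagged))
        if hits:
            findings[name] = hits
    return findings


if __name__ == "__main__":
    # Made-up inventory for illustration.
    inventory = {
        "signup": {
            "nginx.ingress.kubernetes.io/server-snippet": "...",
            "nginx.ingress.kubernetes.io/rewrite-target": "/",
        },
        "docs": {"nginx.ingress.kubernetes.io/rewrite-target": "/"},
    }
    for service, hits in audit_annotations(inventory).items():
        print(service, "uses annotations that need review:", hits)
```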

4. Continuous Improvement and Learning

  • Root Cause Analysis: Perform detailed root cause analysis for all incidents
  • Actionable Insights: Translate lessons learned from incidents into actionable insights and improvements
  • Training and Awareness: Provide ongoing training and awareness programs for engineers
  • Feedback Loop: Establish a feedback loop to continuously improve incident response and prevention strategies

Tips for Reviewers

1. Confirm Compatibility

  • Are there any breaking changes in the new package version?
  • Will the upgrade affect other dependencies?
  • Are there any deprecations that need addressing?
  • Example: “Does this package upgrade introduce any breaking changes that could affect our current setup?”

2. Validate Testing Coverage

  • Are there tests covering the new package version?
  • Have all existing tests been run and passed?
  • Are there any new tests needed for the upgraded package?
  • Example: “Have you run all tests to ensure compatibility with the new package version?”

3. Assess Documentation and Changelog

  • Has the changelog for the new package version been reviewed?
  • Are there any new features or changes that need documentation updates?
  • Is there a clear migration path provided by the package maintainers?
  • Example: “Have you reviewed the changelog and updated our documentation accordingly?”

Common Pitfalls to Avoid

1. Relying on Semantic Versioning

  • ❌ “It’s just a minor bump, so should be ok”
  • ✅ “Even though it’s just a minor bump, we should verify the actual changes”
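
To make this concrete, the sketch below classifies the bump between two module `source` strings like the ones in cluster.tf and then requires a changelog review for any version change, however small. The helper names (`parse_ref`, `classify_bump`, `needs_changelog_review`) are hypothetical, and the code assumes the ref is a plain `X.Y.Z` version at the end of the string.

```python
import re


def parse_ref(source):
    """Extract (major, minor, patch) from a '...?ref=X.Y.Z' source string."""
    match = re.search(r"[?&]ref=v?(\d+)\.(\d+)\.(\d+)$", source)
    if match is None:
        raise ValueError(f"no semantic version ref found in {source!r}")
    return tuple(int(part) for part in match.groups())


def classify_bump(old_source, new_source):
    """Label the change between two refs by the highest component that moved."""
    old, new = parse_ref(old_source), parse_ref(new_source)
    if new == old:
        return "none"
    if new < old:
        return "downgrade"
    if new[0] != old[0]:
        return "major"
    if new[1] != old[1]:
        return "minor"
    return "patch"


def needs_changelog_review(old_source, new_source):
    """Any actual version change needs review, whatever its size."""
    return classify_bump(old_source, new_source) != "none"


if __name__ == "__main__":
    old = "git@github.com:crcl-main/terraform-aws-eks-cluster?ref=34.20.1"
    new = "git@github.com:crcl-main/terraform-aws-eks-cluster?ref=34.22.1"
    print(classify_bump(old, new), needs_changelog_review(old, new))
```

The label is informational only. The review requirement does not depend on it, which is the point of the pitfall above.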

2. Assuming Environment Parity

  • ❌ “It works on my local machine, so it should be good to release to production”
  • ✅ “It works on my local machine. We’ll need to verify on staging before releasing to production”

3. Skipping Explicit Verification of Changes

  • ❌ “The upgrade doesn’t seem to have broken our existing tests, so we’re good to go”
  • ✅ “Let’s make sure all of the changes in the upgrade have corresponding passing tests”
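
One lightweight way to apply this is to list the changelog entries for the upgrade and record which passing tests exercise each one; anything left without a test still needs verification. The sketch below is only an illustration: `unverified_changes` is a hypothetical helper, and the changelog entries and test names are made up.

```python
def unverified_changes(changelog_entries, verified_by):
    """Return changelog entries that no passing test claims to verify.

    changelog_entries: change descriptions taken from the upgrade's changelog.
    verified_by: dict mapping a change description to names of passing tests.
    """
    return [entry for entry in changelog_entries if not verified_by.get(entry)]


if __name__ == "__main__":
    # Made-up entries and test names for illustration.
    entries = ["bump ingress controller", "tighten annotation validation"]
    coverage = {"bump ingress controller": ["test_ingress_smoke"]}
    for entry in unverified_changes(entries, coverage):
        print("No passing test covers:", entry)
```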