This incident interrupted our editorial tools and impacted our editors twice: from 7:15 AM to 7:45 AM EST, and from 8:35 AM to 9:40 AM EST.
Incident Summary:
The incident during the Redis upgrade stemmed from two primary issues:
- Failover Switch Configuration: During the switch to the new Redis version, most of the instances did not automatically transition to the master role and required manual intervention. This behavior was unexpected, as the same procedure had succeeded in our staging environment, and the failover errors did not follow a reproducible pattern. Our DevOps team quickly identified the problem and resolved it by promoting the affected instances manually (see the sketch after this list).
- Backend Application Restart: The second outage occurred because our backend application did not restart correctly and continued to use the old Redis endpoints. The root cause was our deployment script, which skipped the application restart when no code changes were deployed. Identifying the root cause took some time, but once found, the fix was quick.
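To illustrate the manual intervention for the failover issue, here is a minimal sketch of the kind of promotion our DevOps team performed, using the redis-py client; the host names are hypothetical:

```python
import redis

# Hypothetical instances that failed to transition automatically.
REPLICAS = [("redis-01.internal", 6379), ("redis-02.internal", 6379)]

for host, port in REPLICAS:
    r = redis.Redis(host=host, port=port)
    role = r.info("replication")["role"]
    if role != "master":
        # slaveof() with no arguments issues SLAVEOF NO ONE,
        # promoting the replica to master.
        r.slaveof()
        print(f"{host}:{port} promoted to master")
    else:
        print(f"{host}:{port} is already master")
```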
Impact:
The incident made our editorial tools unavailable. Periodic tasks such as feeds, newsletters, and post scheduling were delayed because the Celery processes that run them were paused.
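For context, these periodic tasks are the kind typically driven by a Celery beat schedule with Redis as the broker; while the workers were paused, due tasks simply accumulated. A minimal sketch, with hypothetical task names, intervals, and broker URL:

```python
from celery import Celery

# Redis acts as the Celery broker; the URL is hypothetical.
app = Celery("editorial", broker="redis://redis.internal:6379/0")

# Hypothetical periodic tasks of the kind that were delayed.
app.conf.beat_schedule = {
    "generate-feeds": {"task": "tasks.generate_feeds", "schedule": 300.0},
    "send-newsletters": {"task": "tasks.send_newsletters", "schedule": 3600.0},
    "publish-scheduled-posts": {"task": "tasks.publish_scheduled_posts", "schedule": 60.0},
}
```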
Resolution and Mitigation:
To prevent such incidents in the future, we are taking the following measures:
- Deployment Procedure Enhancement: Our deployment procedure will be improved to ensure that the backend application is always restarted after a deployment, even when no changes have been made (a sketch follows).
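As a sketch of the enhanced step, the restart could run unconditionally instead of only when the release changed; the service name and the use of systemd are assumptions:

```python
import subprocess

def deploy(changed: bool) -> None:
    if changed:
        # ... sync code, run migrations, etc. ...
        pass
    # Restart unconditionally so the application always re-reads its
    # configuration (including Redis endpoints), even on a no-op deploy.
    # Previously, this step ran only when `changed` was True.
    subprocess.run(
        ["systemctl", "restart", "editorial-backend.service"],
        check=True,
    )
```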