This incident interrupted our editorial tools and impacted our editors twice: from 7:15 AM to 7:45 AM EST, and from 8:35 AM to 9:40 AM EST.
Incident Summary:
The incident during the Redis upgrade stemmed from two primary issues:
- Failover Switch Configuration: During the switch to the new Redis version, most of the instances did not automatically transition to the master role and required manual intervention. This behavior was unexpected, as the same procedure had succeeded in our staging environment, and the failover errors did not follow a reproducible pattern. Our DevOps team quickly identified the problem and resolved it by promoting the affected instances manually (see the sketch after this list).
- Backend Application Restart: The second outage occurred because our backend application did not restart correctly and continued to use the old Redis endpoints. The root cause was our deployment script, which skipped the application restart when no code changes were deployed. Identifying the root cause took some time, but once found, the fix was quick.
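To illustrate the manual intervention for the failover issue, here is a minimal sketch of the kind of promotion our DevOps team performed, using the redis-py client; the host names are hypothetical:

```python
import redis

# Hypothetical instances that failed to transition automatically.
REPLICAS = [("redis-01.internal", 6379), ("redis-02.internal", 6379)]

for host, port in REPLICAS:
    r = redis.Redis(host=host, port=port)
    role = r.info("replication")["role"]
    if role != "master":
        # slaveof() with no arguments issues SLAVEOF NO ONE,
        # promoting the replica to master.
        r.slaveof()
        print(f"{host}:{port} promoted to master")
    else:
        print(f"{host}:{port} is already master")
```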
Impact:
The incident made our editorial tools unavailable. Periodic tasks such as feeds, newsletters, and post scheduling were delayed because the Celery processes that run them were paused.
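For context, these periodic tasks are the kind typically driven by a Celery beat schedule with Redis as the broker; while the workers were paused, due tasks simply accumulated. A minimal sketch, with hypothetical task names, intervals, and broker URL:

```python
from celery import Celery

# Redis acts as the Celery broker; the URL is hypothetical.
app = Celery("editorial", broker="redis://redis.internal:6379/0")

# Hypothetical periodic tasks of the kind that were delayed.
app.conf.beat_schedule = {
    "generate-feeds": {"task": "tasks.generate_feeds", "schedule": 300.0},
    "send-newsletters": {"task": "tasks.send_newsletters", "schedule": 3600.0},
    "publish-scheduled-posts": {"task": "tasks.publish_scheduled_posts", "schedule": 60.0},
}
```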
Resolution and Mitigation:
To prevent such incidents in the future, we are taking the following measures:
- Deployment Procedure Enhancement: Our deployment procedure will be improved to ensure that the backend application is always restarted after a deployment, even when no changes have been made (a sketch follows).
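As a sketch of the enhanced step, the restart could run unconditionally instead of only when the release changed; the service name and the use of systemd are assumptions:

```python
import subprocess

def deploy(changed: bool) -> None:
    if changed:
        # ... sync code, run migrations, etc. ...
        pass
    # Restart unconditionally so the application always re-reads its
    # configuration (including Redis endpoints), even on a no-op deploy.
    # Previously, this step ran only when `changed` was True.
    subprocess.run(
        ["systemctl", "restart", "editorial-backend.service"],
        check=True,
    )
```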