Summary:
On September 1, 2023, the Celery Beat service was down from 12:01 PM to 1:22 PM EST after the instance hosting it failed. Periodic tasks accumulated in the queue unprocessed until a replacement instance was launched with an adjusted configuration.
Incident Details:
On September 1, 2023, from 12:01 PM to 1:22 PM EST, our Celery Beat service experienced an outage, and periodic tasks were not processed as expected. These tasks accumulated in the queue, disrupting dependent operations. The primary cause of the incident was the unexpected failure and subsequent termination of the instance hosting the Celery Beat service.
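For context, Celery Beat is the scheduler process that enqueues periodic tasks at configured intervals; while it is down, nothing new is enqueued. A minimal sketch of how such a schedule is declared (the task name and interval below are hypothetical illustrations, not our actual configuration):

```python
from datetime import timedelta

# Hypothetical beat schedule entry. In a real application this dict is
# assigned to app.conf.beat_schedule on the Celery application object,
# and Beat enqueues each task at its configured interval.
beat_schedule = {
    "sync-orders-every-5-min": {
        "task": "tasks.sync_orders",       # dotted path to the task function
        "schedule": timedelta(minutes=5),  # enqueue every 5 minutes
    },
}
```

When the Beat process dies, these entries simply stop firing, which is why the visible symptom was a growing backlog rather than task errors.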
Actions Taken:
Upon detecting the issue, our incident response team promptly initiated recovery procedures. We first attempted to launch a new instance with the same configuration as the failed one, but these attempts failed due to unanticipated complications.
To expedite service restoration and mitigate the risk of recurrence, we made several critical adjustments to the instance configuration.
Resolution:
With the aforementioned adjustments in place, we successfully launched a new instance, and the Celery Beat service was fully restored.
Root Cause Analysis:
The initial failure was difficult to predict or prevent because it stemmed from a hardware fault in the instance hosting the Celery Beat service. Hardware failures of this kind are inherently unpredictable and fall outside the scope of traditional preventive measures.