Editorial Tools Outage
Incident Report for RebelMouse
Postmortem

On November 7, 2023, from 2:43 PM EST to 2:55 PM EST, our editorial tools experienced an outage.

The incident was caused by a combination of two unexpected factors:

  1. Celery Queue Name Update: A recent software release unintentionally changed the name of one of our Celery queues. As a result, the queue was no longer processed by its dedicated instance but by the default instance, which also processes several other queues, and the renamed queue fell outside the rules of our monitoring tool. Incoming tasks exceeded the processing capacity, so the backlog of unprocessed tasks grew rapidly. Because these tasks were stored in Redis memory, Redis exhausted its memory and began using swap.
  2. Script Overload: At the same time, one of our developers ran a routine development script that used the same Redis infrastructure. This added load to the already strained Redis service and made the problem worse.
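
The first factor above is a form of configuration drift: the queue name in the release no longer matched the name known to the monitoring rules and to the dedicated instance. A minimal sketch of a pre-release check that would flag such a mismatch is shown here. It is an illustration only, not our actual tooling; the function name `audit_queues` and the queue names `editor_tasks` and `editor_tasks_v2` are hypothetical.

```python
def audit_queues(declared, monitored, dedicated):
    """Compare queue names across three sources of truth.

    declared:  queue names the release will route tasks to
    monitored: queue names covered by monitoring rules
    dedicated: queue names consumed by dedicated worker instances
    All names and inputs are hypothetical, for illustration only.
    """
    declared, monitored, dedicated = set(declared), set(monitored), set(dedicated)
    return {
        # Queues that would receive tasks but trigger no alerts.
        "unmonitored": sorted(declared - monitored),
        # Rules that watch a queue the release no longer uses.
        "stale_rules": sorted(monitored - declared),
        # Dedicated workers listening on a queue that no longer exists.
        "orphaned_workers": sorted(dedicated - declared),
    }


def is_consistent(report):
    """True when none of the three mismatch lists has entries."""
    return not any(report.values())


# A rename from "editor_tasks" to "editor_tasks_v2" shows up in all three lists.
report = audit_queues(
    declared=["default", "editor_tasks_v2"],
    monitored=["default", "editor_tasks"],
    dedicated=["editor_tasks"],
)
print(report)
print("consistent:", is_consistent(report))
```

A check of this kind could run in CI before a release; a rename then appears as one unmonitored queue paired with one stale rule, which is easy to spot in review.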

In response to the incident, our team took the following immediate actions:

  • Stopped the developer's script to reduce the load on Redis.
  • Deployed several powerful instances to handle the backlog of unprocessed tasks.
  • Restored the Celery queue to its correct name to prevent a recurrence of this issue.
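
The effect of adding processing capacity can be reasoned about with simple arithmetic: a backlog shrinks only while the processing rate exceeds the arrival rate. The sketch below illustrates that relationship. The function names and all figures are hypothetical and are not measurements from this incident.

```python
def backlog_after(backlog, arrival_rate, processing_rate, seconds):
    """Backlog size after `seconds`, given rates in tasks per second.

    The result is clamped at zero because a queue cannot hold fewer than
    zero tasks.
    """
    return max(0.0, backlog + (arrival_rate - processing_rate) * seconds)


def seconds_to_drain(backlog, arrival_rate, processing_rate):
    """Time needed to clear the backlog, or None if it never clears."""
    if backlog <= 0:
        return 0.0
    net_rate = processing_rate - arrival_rate
    if net_rate <= 0:
        # Workers are not keeping up, so the backlog holds steady or grows.
        return None
    return backlog / net_rate


# Hypothetical figures: 12,000 queued tasks, 100 tasks/s arriving.
print(seconds_to_drain(12_000, 100, 100))  # capacity equals arrivals: never drains
print(seconds_to_drain(12_000, 100, 300))  # extra capacity: drains in finite time
```

With these assumed numbers, matching the arrival rate only stops the growth; the backlog starts to fall once capacity is above it, which is why extra instances were needed in addition to stopping the script.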
Posted Nov 08, 2023 - 07:31 EST

Resolved
The issue is fully resolved.
Posted Nov 08, 2023 - 15:00 EST