Performance degradation for logged-in experience
Incident Report for RebelMouse
Postmortem

Chronology of the incident (UTC)

  • Jul 25, 2024 9:56 AM Update for mitigation of message broker (Amazon MQ for RabbitMQ) connectivity issues was deployed to all applications.
  • Jul 27, 2024 1:10 AM RabbitMQ instances experienced a sudden 40% increase in CPU utilization due to an unexpected number of external requests.
  • Jul 27, 2024 1:12 AM Applications started to observe increased latencies and errors due to RabbitMQ degradation. In response, the newly deployed connectivity monitoring automatically restarted the applications.
  • Jul 27, 2024 1:12 AM Platform began experiencing periodic service unavailability for logged-in users, caused by the application restarts.
  • Jul 27, 2024 2:20 AM Rollback of updates related to RabbitMQ connectivity was initiated to stabilize the platform.
  • Jul 27, 2024 2:25 AM Platform recovered after the rollback of updates related to RabbitMQ connectivity.

The impact of the incident

  • Periodic service unavailability affecting the logged-in user experience of the RebelMouse application.
  • Data processing delays occurred during the incident.

The underlying cause if known

The root cause of the incident was the newly deployed RabbitMQ connectivity update. The update introduced real-time RabbitMQ connectivity monitoring and an automatic restart mechanism intended to improve system reliability. However, when an unexpected increase in external requests put significant load on the RabbitMQ instances, the resulting degradation triggered the restart mechanism, and the repeated application restarts caused periodic unavailability for logged-in users.
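
For illustration only, one common way to keep a connectivity-triggered restart mechanism from turning broker degradation into a restart loop is to cap the number of restarts allowed within a sliding time window. The sketch below is not the RebelMouse implementation; the class name `RestartGuard`, its parameters, and the default limits are all hypothetical.

```python
import time
from collections import deque


class RestartGuard:
    """Hypothetical sketch: limits how often a connectivity monitor may restart an app.

    At most `max_restarts` restarts are allowed per `window_seconds`. Once the
    budget is used up, the caller is expected to keep the process running and
    retry the broker connection instead of restarting again.
    """

    def __init__(self, max_restarts=3, window_seconds=600.0, clock=time.monotonic):
        self.max_restarts = max_restarts      # assumed limit, not from the report
        self.window_seconds = window_seconds  # assumed window, not from the report
        self.clock = clock                    # injectable so the guard can be tested
        self._restarts = deque()              # timestamps of recent restarts

    def allow_restart(self):
        now = self.clock()
        # Forget restarts that have fallen out of the sliding window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if len(self._restarts) >= self.max_restarts:
            return False  # budget exhausted: do not restart
        self._restarts.append(now)
        return True
```

A monitor would call `allow_restart()` each time it detects a failed connectivity check and restart only when it returns `True`, so a broker that stays degraded for several minutes produces a bounded number of restarts rather than a continuous cycle.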

Actions taken & Preventive Measures

  • Alerts have been set up to notify the team if RabbitMQ CPU consumption exceeds a certain threshold.
  • Changes to connectivity monitoring will be re-evaluated to prevent similar incidents in the future.
  • Consider replacing RabbitMQ with an AWS-managed solution to improve stability.
Posted Jul 30, 2024 - 12:10 EDT

Resolved
The issue is resolved. We will prepare a full explanation and send it in a postmortem message.
Posted Jul 26, 2024 - 23:00 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 26, 2024 - 22:40 EDT
Update
We are turning off non-essential services to relieve the overloaded production environment.
Posted Jul 26, 2024 - 21:53 EDT
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 26, 2024 - 21:46 EDT
Monitoring
We have identified and resolved the issue.
Posted Jul 26, 2024 - 21:35 EDT
Investigating
We are experiencing performance degradation in the logged-in experience and editorial tools. We are investigating the source of the issue.
Posted Jul 26, 2024 - 21:27 EDT
This incident affected: AWS ec2-us-east-1.