Performance degradation for logged-in experience

Incident Report for RebelMouse

Postmortem

Chronology of the incident (UTC)

Jul 25, 2024 9:56 AM Update for mitigation of message broker (Amazon MQ for RabbitMQ) connectivity issues was deployed to all applications.
Jul 27, 2024 1:10 AM RabbitMQ instances experienced a sudden 40% increase in CPU utilization due to an unexpectedly high number of external requests.
Jul 27, 2024 1:12 AM Applications began to see increased latency and errors due to RabbitMQ degradation. To mitigate the issue, the newly deployed connectivity monitoring automatically restarted the applications.
Jul 27, 2024 1:12 AM Platform began experiencing periodic service unavailability for logged-in users, caused by the application restarts.
Jul 27, 2024 2:20 AM Rollback of updates related to RabbitMQ connectivity was initiated to stabilize the platform.
Jul 27, 2024 2:25 AM Platform recovered after the rollback of updates related to RabbitMQ connectivity.

The impact of the incident

Periodic service unavailability affecting the logged-in user experience of the RebelMouse application.
Data processing delays occurred during the incident.

The underlying cause if known

The root cause of the incident was the newly deployed RabbitMQ connectivity update. The update introduced a real-time RabbitMQ connectivity monitoring and restart mechanism intended to improve system reliability. However, the update coincided with an unexpected increase in external requests that placed significant load on the RabbitMQ instances. As RabbitMQ degraded, the new mechanism restarted the applications, and those restarts caused the periodic unavailability for logged-in users.
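To illustrate how a restart mechanism of this kind can be bounded, the following is a minimal sketch of a restart guard that permits a restart on a failed broker health check only while a restart budget remains within a sliding time window. It is not RebelMouse's implementation: the class name `RestartGuard`, its parameters, and the example limits are all hypothetical and chosen only for illustration.

```python
from collections import deque


class RestartGuard:
    """Hypothetical guard that caps restarts within a sliding time window."""

    def __init__(self, max_restarts: int, window_seconds: float) -> None:
        self.max_restarts = max_restarts
        self.window_seconds = window_seconds
        self._restarts = deque()  # timestamps (seconds) of recent restarts

    def should_restart(self, broker_healthy: bool, now: float) -> bool:
        # Forget restarts that have fallen out of the sliding window.
        while self._restarts and now - self._restarts[0] >= self.window_seconds:
            self._restarts.popleft()
        if broker_healthy:
            return False
        if len(self._restarts) >= self.max_restarts:
            # Budget exhausted: keep the application running instead of
            # restarting again while the broker is still degraded.
            return False
        self._restarts.append(now)
        return True


# Example with assumed limits: at most 2 restarts per 60 seconds.
guard = RestartGuard(max_restarts=2, window_seconds=60)
decisions = [guard.should_restart(False, t) for t in (0.0, 10.0, 20.0)]
```

With these assumed limits, the third consecutive failure inside the window does not trigger another restart, so a degraded broker cannot keep the applications in a continuous restart cycle.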

Actions taken & Preventive Measures

Alerts have been set up to notify the team if RabbitMQ CPU consumption exceeds a certain threshold.
Changes to connectivity monitoring will be re-evaluated to prevent similar incidents in the future.
Replacing RabbitMQ with an AWS-managed solution is being considered to improve stability.
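The CPU alert described above can be sketched as a simple threshold check. The report does not state the actual threshold or evaluation period, so the function name `cpu_alert_triggered`, the 80% threshold, and the three-sample requirement below are assumptions for illustration only.

```python
from typing import Sequence


def cpu_alert_triggered(
    samples: Sequence[float], threshold_percent: float, consecutive: int
) -> bool:
    """Return True when the most recent `consecutive` samples all exceed the threshold."""
    if consecutive <= 0 or len(samples) < consecutive:
        return False
    return all(s > threshold_percent for s in samples[-consecutive:])


# Hypothetical CPU utilization samples (percent), oldest first.
recent_cpu = [30.0, 85.0, 90.0, 95.0]
alert = cpu_alert_triggered(recent_cpu, threshold_percent=80.0, consecutive=3)
```

Requiring several consecutive samples above the threshold avoids alerting on a single short spike. In a managed setup, this logic would more likely live in the monitoring service's alarm configuration than in application code.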

Posted Jul 30, 2024 - 12:10 EDT

Resolved

The issue is resolved. We will prepare a full explanation and send it in a postmortem message.

Posted Jul 26, 2024 - 23:00 EDT

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Jul 26, 2024 - 22:40 EDT

Update

We are turning off non-essential services to relieve the overloaded production environment.

Posted Jul 26, 2024 - 21:53 EDT

Identified

The issue has been identified and a fix is being implemented.

Posted Jul 26, 2024 - 21:46 EDT

Monitoring

We have identified and resolved the issue.

Posted Jul 26, 2024 - 21:35 EDT

Investigating

We are experiencing performance degradation in the logged-in experience and editorial tools. We are investigating the source of the issue.

Posted Jul 26, 2024 - 21:27 EDT

This incident affected: AWS ec2-us-east-1.