Jul 25, 2024 9:56 AM An update to mitigate message broker (Amazon MQ for RabbitMQ) connectivity issues was deployed to all applications.
Jul 27, 2024 1:10 AM RabbitMQ instances experienced a sudden 40% increase in CPU utilization due to an unexpected number of external requests.
Jul 27, 2024 1:12 AM Applications started to observe increased latencies and errors due to RabbitMQ degradation. To mitigate the issue, applications were automatically restarted based on the new connectivity monitoring update.
Jul 27, 2024 1:12 AM Platform began experiencing periodic service unavailability for logged-in users, caused by the application restarts.
Jul 27, 2024 2:20 AM Rollback of updates related to RabbitMQ connectivity was initiated to stabilize the platform.
Jul 27, 2024 2:25 AM Platform recovered after the rollback of updates related to RabbitMQ connectivity.
The impact of the incident
Periodic service unavailability affecting the logged-in user experience of the RebelMouse application.
Data processing delays occurred during the incident.
The underlying cause if known
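The root cause of the incident was the newly deployed RabbitMQ connectivity update. The update introduced real-time RabbitMQ connectivity monitoring and an automatic restart mechanism intended to improve system reliability. However, the deployment coincided with an unexpected increase in external requests that placed significant load on the RabbitMQ instances. When RabbitMQ degraded under that load, the new mechanism restarted applications automatically, and those restarts caused the periodic service unavailability.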
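To illustrate how a connectivity-triggered restart mechanism can be kept from amplifying a broker slowdown, here is a minimal Python sketch. It is not the mechanism that was deployed. The class name `RestartGuard`, the consecutive-failure count, and the cooldown value are all hypothetical, chosen only to show the idea of restarting after repeated failed checks and then rate-limiting further restarts.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class RestartGuard:
    """Hypothetical guard deciding whether a failed broker check should restart the app."""

    failures_needed: int = 3          # consecutive failed checks before acting (assumed value)
    cooldown_seconds: float = 600.0   # minimum gap between restarts (assumed value)
    _consecutive_failures: int = 0
    _last_restart_at: Optional[float] = None

    def record_check(self, healthy: bool, now: float) -> bool:
        """Record one health-check result; return True if a restart should be triggered."""
        if healthy:
            # Any successful check resets the failure streak.
            self._consecutive_failures = 0
            return False

        self._consecutive_failures += 1
        if self._consecutive_failures < self.failures_needed:
            return False

        # Suppress restarts that would follow the previous one too closely.
        in_cooldown = (
            self._last_restart_at is not None
            and now - self._last_restart_at < self.cooldown_seconds
        )
        if in_cooldown:
            return False

        self._last_restart_at = now
        self._consecutive_failures = 0
        return True
```

With a guard like this, a broker that stays degraded for an extended period leads to at most one restart per cooldown window instead of a restart on every failed check.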
Actions taken & Preventive Measures
Alerts have been set up to notify the team if RabbitMQ CPU consumption exceeds a certain threshold.
Changes to connectivity monitoring will be re-evaluated to prevent similar incidents in the future.
Replacing RabbitMQ with an AWS managed solution is being considered to improve stability.
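As an illustration of the CPU alert listed above, the following Python sketch fires only when utilization stays above a threshold for several consecutive samples, so that a single short spike does not page the team. The class name `CpuAlert`, the 80% threshold, and the window size are assumptions; the actual threshold is not stated in this report, and in practice such a rule would more likely be configured in the monitoring system than in application code.

```python
from collections import deque


class CpuAlert:
    """Hypothetical sustained-threshold alert over the most recent CPU samples."""

    def __init__(self, threshold_percent: float = 80.0, window: int = 5) -> None:
        self.threshold_percent = threshold_percent  # assumed value
        self.window = window                        # samples that must all breach (assumed)
        self._samples = deque(maxlen=window)        # keeps only the latest `window` samples

    def observe(self, cpu_percent: float) -> bool:
        """Record a sample; return True if every sample in a full window exceeds the threshold."""
        self._samples.append(cpu_percent)
        return len(self._samples) == self.window and all(
            sample > self.threshold_percent for sample in self._samples
        )
```

For example, with `window=3`, three consecutive readings above the threshold trigger the alert, and a single reading below it clears the condition until the window fills with high readings again.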
Posted Jul 30, 2024 - 12:10 EDT
Resolved
The issue is resolved. We will prepare a full explanation and send it in a postmortem message.
Posted Jul 26, 2024 - 23:00 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Jul 26, 2024 - 22:40 EDT
Update
We are turning off non-essential services to relieve the overloaded production environment.
Posted Jul 26, 2024 - 21:53 EDT
Identified
The issue has been identified and a fix is being implemented.
Posted Jul 26, 2024 - 21:46 EDT
Monitoring
We have identified and resolved the issue.
Posted Jul 26, 2024 - 21:35 EDT
Investigating
We are experiencing performance degradation in the logged-in experience and editorial tools. We are investigating the source of the issue.