Chronology of the incident:
- Mar 27, 2024, 01:02 PM UTC
RebelMouse received an alert from internal monitoring systems about a slightly increased error rate
- Mar 27, 2024, 01:07 PM UTC
The DevOps team checked the systems and noticed a short load spike; the error rate had already returned to normal
- Mar 27, 2024, 02:12 PM UTC
RebelMouse received an alert from internal monitoring systems about a slightly increased error rate
- Mar 27, 2024, 02:14 PM UTC
RebelMouse team members observed a degradation in performance across certain services and promptly reported an incident. The degradation was intermittent rather than constant.
- Mar 27, 2024, 02:22 PM UTC
A dedicated incident resolution team was assembled, initiating an investigation
- Mar 27, 2024, 02:44 PM UTC
Significant traffic anomalies were identified, prompting the allocation of additional resources to the cluster handling that traffic.
- Mar 27, 2024, 02:53 PM UTC
The incident resolution team transitioned into monitoring mode.
- Mar 27, 2024, 04:27 PM UTC
RebelMouse received an alert from internal monitoring systems about a significantly increased error rate
- Mar 27, 2024, 04:30 PM UTC
The incident resolution team decided to fully reroute the suspicious traffic to an independent cluster
- Mar 27, 2024, 04:57 PM UTC
The suspicious traffic was isolated in the independent cluster
- Mar 27, 2024, 05:03 PM UTC
RebelMouse published a message on the status portal
- Mar 27, 2024, 05:04 PM UTC
The incident resolution team shifted into monitoring mode and concurrently began exploring potential enhancements in case of any recurrence of the issue.
- Mar 27, 2024, 05:48 PM UTC
RebelMouse received reports from clients about the performance degradation, as well as alerts from monitoring systems.
- Mar 27, 2024, 05:58 PM UTC
The root cause of the problem was identified
- Mar 27, 2024, 06:05 PM UTC
The fix was implemented
- Mar 28, 2024
An independent cluster was established specifically for editorial traffic to safeguard its functionality from potential disruptions caused by other services
The impact of the incident
The incident resulted in intermittent performance degradation, leading to periods of unavailability for editorial tools.
The underlying cause, if known
The Broken Links service shared endpoints with critical editorial tools such as the Entry Editor and Posts Dashboard. Periodically, this service generated long-running requests, causing health checks to fail and Kubernetes to deem the pods unhealthy. Consequently, Kubernetes terminated these pods and initiated their recreation, which resulted in temporary unavailability of the affected services during the restart.
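To illustrate the failure mode, here is a minimal sketch using the Python Kubernetes client. The probe values and the /healthz path are assumptions for illustration, not the actual deployment settings: when long-running requests tie up all workers, the health check exceeds its timeout, the failure threshold is reached, and Kubernetes restarts the pod, briefly taking the shared editorial endpoints offline.

```python
from kubernetes import client

# Hypothetical liveness probe; values are illustrative only.
# If long-running requests occupy every worker, the /healthz call cannot
# answer within timeout_seconds. After failure_threshold consecutive misses
# (roughly 30 seconds with these settings), Kubernetes terminates the pod
# and recreates it, making the endpoints it serves temporarily unavailable.
liveness_probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
    initial_delay_seconds=10,
    period_seconds=10,
    timeout_seconds=2,
    failure_threshold=3,
)

print(liveness_probe)
```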
Actions taken & Preventive Measures
An independent cluster was established specifically for editorial traffic to safeguard its functionality from potential disruptions caused by other services
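As a rough sketch of what this kind of isolation can look like (the host, path, and service names below are assumptions for illustration, not the actual RebelMouse configuration), editorial traffic can be routed to its own backend through a dedicated Ingress rule, so pod restarts triggered by other services no longer affect the editorial tools:

```python
from kubernetes import client

# Hypothetical Ingress rule sending editorial paths to a service that fronts
# the dedicated editorial cluster; all names and paths are illustrative only.
editorial_ingress = client.V1Ingress(
    api_version="networking.k8s.io/v1",
    kind="Ingress",
    metadata=client.V1ObjectMeta(name="editorial-routing"),
    spec=client.V1IngressSpec(
        rules=[
            client.V1IngressRule(
                host="editorial.example.com",
                http=client.V1HTTPIngressRuleValue(
                    paths=[
                        client.V1HTTPIngressPath(
                            path="/core",  # e.g. Entry Editor / Posts Dashboard routes
                            path_type="Prefix",
                            backend=client.V1IngressBackend(
                                service=client.V1IngressServiceBackend(
                                    name="editorial-web",
                                    port=client.V1ServiceBackendPort(number=80),
                                )
                            ),
                        )
                    ]
                ),
            )
        ]
    ),
)

print(editorial_ingress.metadata.name)
```

The design intent is that the editorial backend runs on its own cluster with its own health checks, so long-running requests from unrelated services cannot cause Kubernetes to recycle the pods serving editorial tools.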