Chronology of the incident
- Apr 25, 2024, 05:12 PM UTC: RebelMouse received an alert from internal monitoring systems about a significantly increased error rate.
- Apr 25, 2024, 05:12 PM UTC: The DevOps team began investigating the affected systems.
- Apr 25, 2024, 05:23 PM UTC: RebelMouse posted a notice about performance degradation on the status portal.
- Apr 25, 2024, 05:26 PM UTC: The problem was identified as an overload of Talaria (Smart Cache Service).
- Apr 25, 2024, 05:42 PM UTC: Traffic was rerouted to bypass Talaria, which restored performance for end users.
- Apr 25, 2024, 06:00 PM UTC: Configuration changes were applied to increase the resources allocated to Talaria.
- Apr 25, 2024, 06:09 PM UTC: Talaria was re-enabled.
- Apr 26, 2024, 01:06 PM UTC: The incident was marked as resolved.
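The intervals implied by the chronology above can be worked out with a short script. This is only an illustrative sketch: the timestamps are copied from the list, while the helper names (`parse_utc`, `minutes_between`) and the choice of which intervals to report are assumptions made for this example, not part of any RebelMouse tooling.

```python
from datetime import datetime, timezone

# Timestamp layout used in the chronology, e.g. "Apr 25, 2024, 05:12 PM".
# Note: "%b" expects English month abbreviations (default C locale).
FMT = "%b %d, %Y, %I:%M %p"


def parse_utc(stamp: str) -> datetime:
    """Parse a chronology timestamp and mark it as UTC (hypothetical helper)."""
    return datetime.strptime(stamp, FMT).replace(tzinfo=timezone.utc)


def minutes_between(start: datetime, end: datetime) -> int:
    """Whole minutes elapsed from start to end (hypothetical helper)."""
    return int((end - start).total_seconds() // 60)


# Events taken from the chronology above.
alert = parse_utc("Apr 25, 2024, 05:12 PM")
identified = parse_utc("Apr 25, 2024, 05:26 PM")
bypassed = parse_utc("Apr 25, 2024, 05:42 PM")
re_enabled = parse_utc("Apr 25, 2024, 06:09 PM")
resolved = parse_utc("Apr 26, 2024, 01:06 PM")

print("alert -> cause identified:", minutes_between(alert, identified), "min")
print("alert -> traffic bypassed:", minutes_between(alert, bypassed), "min")
print("alert -> Talaria re-enabled:", minutes_between(alert, re_enabled), "min")
print("alert -> marked resolved:", minutes_between(alert, resolved), "min")
```

Under this reading, end users saw performance restored about 30 minutes after the first alert, and Talaria was back in service after 57 minutes.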
The impact of the incident
The incident caused performance degradation, leading to periods of unavailability for public pages or delays in publishing content.
The underlying cause
An increase in traffic overloaded Talaria.
Actions taken & Preventive Measures
We reviewed the configuration of the Talaria service, allocated additional resources to it, and optimized its autoscaling rules.
Our autoscaling system operates on preset rules designed to accommodate anticipated loads. However, as traffic patterns shift over time, it's essential to periodically review and adjust these rules accordingly.
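To illustrate why preset rules need periodic review, here is a minimal sketch of a threshold-style scaling rule. It is not Talaria's actual configuration: the `ScalingRule` class, its fields, and all numbers are hypothetical and chosen only to show how a rule sized for anticipated load stops helping once traffic exceeds its ceiling.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    """Hypothetical preset rule; field names and values are assumptions."""
    target_rps_per_replica: float  # load each replica is expected to handle
    min_replicas: int
    max_replicas: int


def desired_replicas(rule: ScalingRule, observed_rps: float) -> int:
    """Replica count the rule asks for, clamped to its preset bounds."""
    needed = math.ceil(observed_rps / rule.target_rps_per_replica)
    return max(rule.min_replicas, min(rule.max_replicas, needed))


def is_saturated(rule: ScalingRule, observed_rps: float) -> bool:
    """True when the load needs more replicas than the rule allows."""
    needed = math.ceil(observed_rps / rule.target_rps_per_replica)
    return needed > rule.max_replicas


# A rule sized for an anticipated peak of about 1,000 requests per second.
rule = ScalingRule(target_rps_per_replica=100, min_replicas=2, max_replicas=10)

for rps in (50, 450, 1000, 5000):
    print(rps, desired_replicas(rule, rps), is_saturated(rule, rps))
```

In this sketch the rule scales cleanly up to its ceiling, but at 5,000 requests per second it is pinned at 10 replicas and reports saturation. A periodic review would compare recent peak traffic against `max_replicas` and `target_rps_per_replica` and raise the limits before that point is reached.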