Performance degradation
Incident Report for RebelMouse
Postmortem

Chronology of the incident 

  • Apr 25, 2024, 05:12 PM UTC: RebelMouse received an alert from internal monitoring systems about a significantly increased error rate.
  • Apr 25, 2024, 05:12 PM UTC: The DevOps team began investigating the affected systems.
  • Apr 25, 2024, 05:23 PM UTC: RebelMouse posted a notice on the status portal about the performance degradation.
  • Apr 25, 2024, 05:26 PM UTC: The problem was identified as an overload of Talaria (Smart Cache Service). 
  • Apr 25, 2024, 05:42 PM UTC: Traffic was rerouted to bypass Talaria. This restored performance for end users.
  • Apr 25, 2024, 06:00 PM UTC: Configuration changes were applied to increase the resources allocated to Talaria.
  • Apr 25, 2024, 06:09 PM UTC: Talaria was re-enabled.
  • Apr 26, 2024, 01:06 AM UTC: The incident was marked as resolved.

The impact of the incident

The incident degraded platform performance, causing periods of unavailability for public pages and delays in publishing content.

The underlying cause

An increase in traffic overloaded Talaria.

Actions taken & Preventive Measures

We've reviewed the configuration of the Talaria service, allocated additional resources to it, and optimized the autoscaling rules.

Our autoscaling system operates on preset rules designed to accommodate anticipated loads. However, as traffic patterns shift over time, it's essential to periodically review and adjust these rules accordingly.
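
To illustrate why preset rules need periodic review, here is a minimal sketch of a threshold-based scaling rule. It is not Talaria's actual configuration: the `ScalingRule` type, the function names, and every number below are hypothetical and chosen only for illustration. The point it shows is that a rule sized for anticipated load stops adding capacity once traffic exceeds its preset ceiling.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingRule:
    """A preset autoscaling rule (all values are hypothetical)."""
    target_rps_per_replica: float  # load one replica is expected to handle
    min_replicas: int              # floor kept running at all times
    max_replicas: int              # ceiling sized for the anticipated peak


def desired_replicas(rule: ScalingRule, observed_rps: float) -> int:
    """Return the replica count the rule asks for at the observed load."""
    needed = math.ceil(observed_rps / rule.target_rps_per_replica)
    # Clamp to the preset bounds; the rule never scales past max_replicas.
    return max(rule.min_replicas, min(rule.max_replicas, needed))


def is_saturated(rule: ScalingRule, observed_rps: float) -> bool:
    """True when traffic exceeds what the rule can serve at its ceiling."""
    return observed_rps > rule.max_replicas * rule.target_rps_per_replica


# Hypothetical rule sized for an anticipated peak of 5,000 requests/second.
rule = ScalingRule(target_rps_per_replica=500.0, min_replicas=2, max_replicas=10)

print(desired_replicas(rule, 3000))  # within the anticipated range
print(desired_replicas(rule, 8000))  # clamped at the preset ceiling
print(is_saturated(rule, 8000))      # the rule needs to be revisited
```

In this sketch, a check like `is_saturated` run against recent peak traffic is one way to flag that the preset bounds no longer match real traffic patterns.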

Posted May 02, 2024 - 11:24 EDT

Resolved
This incident has been resolved.
Posted Apr 25, 2024 - 21:06 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Apr 25, 2024 - 14:06 EDT
Investigating
We are currently investigating this issue.
Posted Apr 25, 2024 - 13:23 EDT
This incident affected: Full Platform.