Performance degradation
Incident Report for RebelMouse
Postmortem

Chronology of the incident:

  • Mar 27, 2024, 01:02 PM UTC
    RebelMouse received an alert from internal monitoring systems about a slightly increased error rate
  • Mar 27, 2024, 01:07 PM UTC
    The DevOps team checked the systems, noticed a short load spike, and confirmed the error rate had already returned to normal
  • Mar 27, 2024, 02:12 PM UTC
    RebelMouse received an alert from internal monitoring systems about a slightly increased error rate
  • Mar 27, 2024, 02:14 PM UTC
    RebelMouse team members observed a degradation in performance across certain services and promptly reported an incident. The degradation was intermittent rather than continuous.
  • Mar 27, 2024, 02:22 PM UTC
    A dedicated incident resolution team was assembled, initiating an investigation
  • Mar 27, 2024, 02:44 PM UTC
    Significant traffic anomalies were identified, and additional resources were allocated to the cluster handling that traffic.
  • Mar 27, 2024, 02:53 PM UTC
    The incident resolution team transitioned into monitoring mode.
  • Mar 27, 2024, 04:27 PM UTC
    RebelMouse received an alert from internal monitoring systems about a significantly increased error rate
  • Mar 27, 2024, 04:30 PM UTC
    The incident resolution team decided to fully reroute the suspicious traffic to an independent cluster 
  • Mar 27, 2024, 04:57 PM UTC
    The suspicious traffic was isolated in the independent cluster 
  • Mar 27, 2024, 05:03 PM UTC
    RebelMouse published a message on the status portal
  • Mar 27, 2024, 05:04 PM UTC
    The incident resolution team shifted into monitoring mode and began exploring potential improvements in case the issue recurred.
  • Mar 27, 2024, 05:48 PM UTC
    RebelMouse received reports from clients about the performance degradation, along with alerts from internal monitoring systems.
  • Mar 27, 2024, 05:58 PM UTC
    The root cause of the problem was identified
  • Mar 27, 2024, 06:05 PM UTC
    The fix was implemented
  • Mar 28, 2024
    An independent cluster was established specifically for editorial traffic to safeguard its functionality from potential disruptions caused by other services

The impact of the incident

The incident resulted in intermittent performance degradation, leading to periods of unavailability for editorial tools.

The underlying cause

The Broken Links service shared endpoints with critical editorial tools such as the Entry Editor or Posts Dashboard. Periodically, this service generated long-running requests, causing health checks to fail and Kubernetes to deem the pods unhealthy. Consequently, Kubernetes terminated these pods and initiated their recreation. This process resulted in temporary unavailability of the affected services during the restart.
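
As an illustration of this failure mode, the minimal sketch below (hypothetical code; the framework, endpoint names, and timings are assumptions, not RebelMouse's actual services) shows how a single long-running request can block a shared health-check endpoint in a single-threaded worker, causing Kubernetes liveness probes to time out and the pod to be restarted.

```python
# Hypothetical sketch of the failure mode described above; endpoint names,
# framework, and timings are assumptions, not RebelMouse's actual code.
import time

from flask import Flask

app = Flask(__name__)


@app.route("/healthz")
def healthz():
    # Kubernetes liveness/readiness probes hit this endpoint and expect a
    # response within the probe timeout (typically a few seconds).
    return "ok", 200


@app.route("/broken-links-scan")
def broken_links_scan():
    # Stands in for a long-running Broken Links request that occupies the
    # only worker for far longer than the probe timeout.
    time.sleep(120)
    return "done", 200


if __name__ == "__main__":
    # With a single synchronous worker, /healthz cannot be served while
    # /broken-links-scan is sleeping, so probes fail and Kubernetes
    # terminates and recreates the pod, briefly taking down every endpoint
    # it serves.
    app.run(host="0.0.0.0", port=8080, threaded=False)
```

Serving editorial traffic from a cluster that the Broken Links service cannot saturate removes this coupling, which is the rationale for the preventive measure described below.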

Actions taken & Preventive Measures

An independent cluster was established specifically for editorial traffic to safeguard its functionality from potential disruptions caused by other services.

Posted Mar 29, 2024 - 05:58 EDT

Resolved
This incident has been resolved.
Posted Mar 27, 2024 - 15:51 EDT
Monitoring
A fix has been implemented and we are monitoring the results.
Posted Mar 27, 2024 - 14:08 EDT
Investigating
We are currently investigating this issue.
Posted Mar 27, 2024 - 13:50 EDT
Monitoring
There was an isolated surge of highly suspicious traffic. While we aren't certain of its origin, we have isolated it away from the production clusters into its own environment. This means all production systems should have returned to normal, and we believe the problem is under control. We don't fully understand the root cause yet, though, so we will be updating this with more details soon.
Posted Mar 27, 2024 - 13:31 EDT
Investigating
We are experiencing performance degradation for logged-in users.
Posted Mar 27, 2024 - 13:02 EDT
This incident affected: Logged In Users.