Chronology of the incident:
- Mar 27, 2024, 01:02 PM UTC
RebelMouse received an alert from internal monitoring systems about a slightly increased error rate
- Mar 27, 2024, 01:07 PM UTC
The DevOps team checked the systems and noticed a short load spike; the error rate had already returned to normal
- Mar 27, 2024, 02:12 PM UTC
RebelMouse received an alert from internal monitoring systems about a slightly increased error rate
- Mar 27, 2024, 02:14 PM UTC
RebelMouse team members observed a degradation in performance across certain services and promptly reported an incident. The degradation was intermittent rather than constant.
- Mar 27, 2024, 02:22 PM UTC
A dedicated incident resolution team was assembled, initiating an investigation
- Mar 27, 2024, 02:44 PM UTC
Significant traffic anomalies were identified, prompting the allocation of additional resources to the cluster handling that traffic.
- Mar 27, 2024, 02:53 PM UTC
The incident resolution team transitioned into monitoring mode.
- Mar 27, 2024, 04:27 PM UTC
RebelMouse received an alert from internal monitoring systems about a significantly increased error rate
- Mar 27, 2024, 04:30 PM UTC
The incident resolution team decided to fully reroute the suspicious traffic to an independent cluster
- Mar 27, 2024, 04:57 PM UTC
The suspicious traffic was isolated in the independent cluster
- Mar 27, 2024, 05:03 PM UTC
RebelMouse published a message on the status portal
- Mar 27, 2024, 05:04 PM UTC
The incident resolution team shifted into monitoring mode and concurrently began exploring potential enhancements in case of any recurrence of the issue.
- Mar 27, 2024, 05:48 PM UTC
RebelMouse received reports from clients about the performance degradation, as well as alerts from monitoring systems.
- Mar 27, 2024, 05:58 PM UTC
The root cause of the problem was identified
- Mar 27, 2024, 06:05 PM UTC
The fix was implemented
- Mar 28, 2024
An independent cluster was established specifically for editorial traffic to safeguard its functionality from potential disruptions caused by other services
The impact of the incident
The incident resulted in intermittent performance degradation, leading to periods of unavailability for editorial tools.
The underlying cause, if known
The Broken Links service shared endpoints with critical editorial tools such as the Entry Editor and Posts Dashboard. Periodically, this service generated long-running requests, causing health checks to fail and Kubernetes to deem the pods unhealthy. Consequently, Kubernetes terminated these pods and initiated their recreation, which resulted in temporary unavailability of the affected services during the restart.
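To illustrate the failure mode, here is a minimal sketch using the Python Kubernetes client. The probe values and the /healthz path are assumptions for illustration, not the actual deployment settings: when long-running requests tie up all workers, the health check exceeds its timeout, the failure threshold is reached, and Kubernetes restarts the pod, briefly taking the shared editorial endpoints offline.

```python
from kubernetes import client

# Hypothetical liveness probe; values are illustrative only.
# If long-running requests occupy every worker, the /healthz call cannot
# answer within timeout_seconds. After failure_threshold consecutive misses
# (roughly 30 seconds with these settings), Kubernetes terminates the pod
# and recreates it, making the endpoints it serves temporarily unavailable.
liveness_probe = client.V1Probe(
    http_get=client.V1HTTPGetAction(path="/healthz", port=8080),
    initial_delay_seconds=10,
    period_seconds=10,
    timeout_seconds=2,
    failure_threshold=3,
)

print(liveness_probe)
```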
Actions taken & Preventive Measures
An independent cluster was established specifically for editorial traffic to safeguard its functionality from potential disruptions caused by other services
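As a rough sketch of what this kind of isolation can look like (the host, path, and service names below are assumptions for illustration, not the actual RebelMouse configuration), editorial traffic can be routed to its own backend through a dedicated Ingress rule, so pod restarts triggered by other services no longer affect the editorial tools:

```python
from kubernetes import client

# Hypothetical Ingress rule sending editorial paths to a service that fronts
# the dedicated editorial cluster; all names and paths are illustrative only.
editorial_ingress = client.V1Ingress(
    api_version="networking.k8s.io/v1",
    kind="Ingress",
    metadata=client.V1ObjectMeta(name="editorial-routing"),
    spec=client.V1IngressSpec(
        rules=[
            client.V1IngressRule(
                host="editorial.example.com",
                http=client.V1HTTPIngressRuleValue(
                    paths=[
                        client.V1HTTPIngressPath(
                            path="/core",  # e.g. Entry Editor / Posts Dashboard routes
                            path_type="Prefix",
                            backend=client.V1IngressBackend(
                                service=client.V1IngressServiceBackend(
                                    name="editorial-web",
                                    port=client.V1ServiceBackendPort(number=80),
                                )
                            ),
                        )
                    ]
                ),
            )
        ]
    ),
)

print(editorial_ingress.metadata.name)
```

The design intent is that the editorial backend runs on its own cluster with its own health checks, so long-running requests from unrelated services cannot cause Kubernetes to recycle the pods serving editorial tools.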