Performance issues
Incident Report for RebelMouse
Postmortem

Chronology of the incident

Feb 8, 2024, 4:20 PM EST – An increase in error rate was observed.
Feb 8, 2024, 4:25 PM EST – Monitoring systems detected anomalies, prompting the RebelMouse team to initiate an investigation.
Feb 8, 2024, 5:00 PM EST – Error rates surged significantly.
Feb 8, 2024, 5:16 PM EST – The RebelMouse team officially categorized the incident as Major and communicated it through the Status Portal.
Feb 8, 2024, 5:30 PM EST – The root cause was pinpointed: new instances could not be launched within the EKS cluster.
Feb 8, 2024, 6:00 PM EST – The RebelMouse team rectified the issue by updating the network configuration and manually launching the required instances to restore system performance (an illustrative sketch follows this timeline).
Feb 8, 2024, 8:51 PM EST – RebelMouse initiated a support request with AWS regarding the service outage.
Feb 8, 2024, 9:10 PM EST – Systems reconfiguration was completed, and the team entered monitoring mode.
Feb 8, 2024, 10:10 PM EST – The incident was officially resolved.
Feb 10, 2024, 2:30 AM EST – AWS confirmed an issue with the EKS service in the us-east-1 region during the specified period and that services had been restored.
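For reference, the sketch below shows one way such a manual capacity bump can be performed against an EKS managed node group using the AWS boto3 API. The cluster and node group names are placeholders, and these are not the exact steps taken during the incident; it is only an illustration of the technique, assuming managed node groups are in use.

# Illustrative only: check node group health and raise its desired size so
# that EKS attempts to launch replacement instances. Names are placeholders.
import boto3

eks = boto3.client("eks", region_name="us-east-1")
CLUSTER = "production-cluster"   # placeholder
NODEGROUP = "general-purpose"    # placeholder

nodegroup = eks.describe_nodegroup(
    clusterName=CLUSTER, nodegroupName=NODEGROUP
)["nodegroup"]

# EKS surfaces instance launch failures as health issues on the node group.
for issue in nodegroup["health"].get("issues", []):
    print(issue["code"], issue["message"])

# Bump the desired size; EKS will try to launch the additional instances.
scaling = nodegroup["scalingConfig"]
eks.update_nodegroup_config(
    clusterName=CLUSTER,
    nodegroupName=NODEGROUP,
    scalingConfig={
        "minSize": scaling["minSize"],
        "maxSize": max(scaling["maxSize"], scaling["desiredSize"] + 2),
        "desiredSize": scaling["desiredSize"] + 2,
    },
)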

The impact of the incident

Multiple key RebelMouse services hosted in the AWS us-east-1 region were impacted, leading to partial unavailability.

The underlying cause

The root cause of this incident was identified as a networking issue within AWS, specifically affecting the EKS service in the us-east-1 region. AWS acknowledged the issue and worked to resolve it on their side.
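As an aside, when a regional AWS service problem is suspected, it can be confirmed programmatically through the AWS Health API (available on Business and Enterprise support plans). The filter values below are illustrative, not taken from the incident.

# Illustrative check for open AWS Health events affecting EKS in us-east-1.
# Requires an AWS support plan that includes access to the Health API.
import boto3

health = boto3.client("health", region_name="us-east-1")
events = health.describe_events(
    filter={
        "services": ["EKS"],
        "regions": ["us-east-1"],
        "eventStatusCodes": ["open", "upcoming"],
    }
)["events"]
for event in events:
    print(event["eventTypeCode"], event["startTime"], event["statusCode"])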

Actions taken

RebelMouse engineering teams were engaged as soon as the problem was identified. They worked diligently to resolve the issue as quickly as possible while keeping customers updated on the situation.

Preventive Measures

We recognize the need to strengthen how we handle networking issues of this kind. Going forward, we will mitigate such outages by expanding our caching systems and increasing our redundant caching capacity.
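As a rough illustration of the direction we have in mind, the sketch below shows a cache that keeps entries past their normal expiry and falls back to the stale copy when the origin cannot be reached. It is a simplified, hypothetical example, not a description of our production caching stack.

# Hypothetical sketch: serve stale cached content when the origin is unavailable.
import time

class StaleOnErrorCache:
    def __init__(self, ttl_seconds=60):
        self.ttl = ttl_seconds
        self.store = {}  # key -> (value, fetched_at)

    def get(self, key, fetch):
        entry = self.store.get(key)
        if entry is not None and time.time() - entry[1] < self.ttl:
            return entry[0]            # entry is still fresh
        try:
            value = fetch(key)         # call the origin
        except Exception:
            if entry is not None:
                return entry[0]        # origin down: fall back to the stale copy
            raise                      # nothing cached: propagate the error
        self.store[key] = (value, time.time())
        return value

In production this role is typically played by a CDN or reverse-proxy layer configured to serve stale content on origin errors rather than application code like the above.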

Posted Feb 12, 2024 - 14:49 EST

Resolved
This incident has been resolved.
Posted Feb 08, 2024 - 22:10 EST
Monitoring
We identified the root cause and deployed a fix, and we are now monitoring application performance.
Posted Feb 08, 2024 - 21:12 EST
Update
We have replaced the last servers and expect performance to return to normal within a couple of minutes.
Posted Feb 08, 2024 - 18:47 EST
Update
Newly added servers are functioning correctly and we are seeing an improvement in performance. We are continuing to add new servers and manually remove the old ones that have issues.
Posted Feb 08, 2024 - 18:05 EST
Update
We are manually adding new servers to increase capacity and resolve the performance degradation.
Posted Feb 08, 2024 - 17:52 EST
Identified
We have identified that the issue is caused by the Kubernetes cluster not being able to launch new instances. We are working on a fix right now.
Posted Feb 08, 2024 - 17:30 EST
Investigating
We are experiencing performance degradation and are investigating the root cause right now.
Posted Feb 08, 2024 - 17:16 EST
This incident affected: AWS ec2-us-east-1.