System slowdown

Incident Report for RebelMouse

Postmortem

Oct 19, 2024 4:17 PM RebelMouse received an alert about a significantly increased error rate.
Oct 19, 2024 4:23 PM RebelMouse DevOps team started to investigate the issue.
Oct 19, 2024 4:25 PM The incident was categorized as Major and the Major Incident Resolution Team was formed.
Oct 19, 2024 4:30 PM The Status Portal was updated with information about the incident.
Oct 19, 2024 4:33 PM The problem was identified: Talaria (caching service) was unavailable.
Oct 19, 2024 4:36 PM The routing rules were updated to forward traffic directly to the origins, bypassing Talaria. This action restored the functionality of the platform.
Oct 19, 2024 5:40 PM Talaria was recovered and the routing rules were reverted to their initial state.
Oct 19, 2024 5:48 PM The incident was marked as resolved on Status Portal.

During the incident, users experienced increased error rates on public pages.

The root cause of this incident was identified as an outage of internal caching services, caused by insufficient resource allocation.

Increased Cache Capacity: We increased the cache capacity by 50% to handle the demand.
Updated Monitoring and Alerting: We implemented new monitoring and alerting systems to detect resource-intensive processes faster and prevent similar incidents in the future.
Improve Automatic Failover: We are working on improving our automatic failover mechanisms to ensure seamless service continuity in case of similar incidents in the future.
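
For illustration only, the manual mitigation in the timeline (routing traffic directly to the origins while the caching service was unavailable) could be automated along the following lines. This is a minimal sketch and not RebelMouse's actual implementation: the class name FailoverRouter, the two fetch callables, the failure threshold, and the retry window are all hypothetical.

```python
import time
from typing import Callable, Optional


class FailoverRouter:
    """Hypothetical router: prefer the cache, bypass it after repeated failures."""

    def __init__(
        self,
        fetch_from_cache: Callable[[str], str],
        fetch_from_origin: Callable[[str], str],
        failure_threshold: int = 3,          # assumed value, tune per service
        retry_after_seconds: float = 30.0,   # assumed value, tune per service
        clock: Callable[[], float] = time.monotonic,
    ) -> None:
        self._cache = fetch_from_cache
        self._origin = fetch_from_origin
        self._threshold = failure_threshold
        self._retry_after = retry_after_seconds
        self._clock = clock
        self._failures = 0
        self._bypass_until: Optional[float] = None

    def fetch(self, path: str) -> str:
        now = self._clock()
        # While the bypass window is open, skip the cache entirely.
        if self._bypass_until is not None and now < self._bypass_until:
            return self._origin(path)
        try:
            response = self._cache(path)
        except Exception:
            # Any cache error counts as a failure in this sketch; a real
            # system would likely distinguish timeouts from other errors.
            self._failures += 1
            if self._failures >= self._threshold:
                self._bypass_until = now + self._retry_after
                self._failures = 0
            return self._origin(path)
        # The cache answered, so clear any failure state.
        self._failures = 0
        self._bypass_until = None
        return response
```

In this sketch every request is still served from the origin when the cache fails, and after the threshold is reached the cache is not contacted again until the retry window expires. Whether the origins can absorb the uncached load is a separate capacity question that the sketch does not address.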

Posted Oct 21, 2024 - 14:15 EDT

This incident has been resolved.

Posted Oct 19, 2024 - 17:48 EDT

We have pushed an update and the system is recovering now.

Posted Oct 19, 2024 - 16:41 EDT

We have identified the issue and are looking for a solution.

Posted Oct 19, 2024 - 16:36 EDT

We are experiencing partial downtime. We are looking for the cause.

Posted Oct 19, 2024 - 16:30 EDT

This incident affected: AWS ec2-us-east-1.