System slowdown
Incident Report for RebelMouse
Postmortem

Chronology of the incident (EDT)

  • Oct 19, 2024 4:17 PM RebelMouse received an alert about a significantly increased error rate.
  • Oct 19, 2024 4:23 PM RebelMouse DevOps team started to investigate the issue.
  • Oct 19, 2024 4:25 PM The incident was categorized as Major and the Major Incident Resolution Team was formed.
  • Oct 19, 2024 4:30 PM The Status Portal was updated with information about the incident.
  • Oct 19, 2024 4:33 PM The problem was identified: Talaria (caching service) was unavailable.
  • Oct 19, 2024 4:36 PM The routing rules were updated to forward traffic directly to the origins, bypassing Talaria. This action restored the functionality of the platform.
  • Oct 19, 2024 5:40 PM Talaria was recovered and the routing rules were reverted to their initial state.
  • Oct 19, 2024 5:48 PM The incident was marked as resolved on the Status Portal.

The impact of the incident

During the incident, users experienced increased error rates on public pages.

The underlying cause

The root cause of this incident was an outage of the internal caching service (Talaria), caused by insufficient resource allocation.

Actions taken & Preventive Measures

  • Increased Cache Capacity: We increased the cache capacity by 50% to handle the demand.
  • Updated Monitoring and Alerting: We implemented new monitoring and alerting systems to detect resource-intensive processes faster and prevent similar incidents in the future.
  • Improve Automatic Failover: We are working on improving our automatic failover mechanisms to ensure seamless service continuity in case of similar incidents in the future.
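
As an illustration of the failover idea only, the manual mitigation used during this incident (routing traffic directly to the origins while the cache is unavailable) could be automated along the following lines. This is a minimal sketch under stated assumptions, not a description of RebelMouse's actual routing layer: the names FailoverRouter, CacheUnavailable, fetch_from_cache, fetch_from_origin and failure_threshold are all hypothetical.

```python
from typing import Callable


class CacheUnavailable(Exception):
    """Raised by the (hypothetical) cache client when the cache cannot be reached."""


class FailoverRouter:
    """Serve through the cache, but bypass it after repeated failures.

    All names here are illustrative assumptions, not part of any real system.
    """

    def __init__(
        self,
        fetch_from_cache: Callable[[str], str],
        fetch_from_origin: Callable[[str], str],
        failure_threshold: int = 3,
    ) -> None:
        self.fetch_from_cache = fetch_from_cache
        self.fetch_from_origin = fetch_from_origin
        self.failure_threshold = failure_threshold
        self.consecutive_failures = 0
        self.bypass = False  # True means traffic goes straight to the origin

    def fetch(self, path: str) -> str:
        if not self.bypass:
            try:
                response = self.fetch_from_cache(path)
                self.consecutive_failures = 0  # a success resets the counter
                return response
            except CacheUnavailable:
                self.consecutive_failures += 1
                if self.consecutive_failures >= self.failure_threshold:
                    # Stop trying the cache until it is explicitly restored.
                    self.bypass = True
        # Either the cache call just failed or bypass mode is active.
        return self.fetch_from_origin(path)

    def restore(self) -> None:
        """Route through the cache again once it has recovered."""
        self.bypass = False
        self.consecutive_failures = 0
```

In this sketch a failed cache call is always answered from the origin, so an individual request does not fail because of the cache; the threshold only decides when to stop trying the cache at all. Calling restore() corresponds to moving the routing rules back to their initial state.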
Posted Oct 21, 2024 - 14:15 EDT

Resolved
This incident has been resolved.
Posted Oct 19, 2024 - 17:48 EDT
Update
We have pushed an update and the system is now recovering.
Posted Oct 19, 2024 - 16:41 EDT
Identified
We have identified the issue and are working on a solution.
Posted Oct 19, 2024 - 16:36 EDT
Investigating
We are experiencing partial downtime. We are investigating the cause.
Posted Oct 19, 2024 - 16:30 EDT
This incident affected: AWS ec2-us-east-1.