Performance Degradation
Incident Report for RebelMouse
Postmortem

Incident Summary:
On August 24, 2023, from 11:21 EDT to 11:25 EDT, a subset of our users had difficulty accessing our services. We take this matter seriously and immediately began an investigation to identify the root cause and mitigate the impact.

Root Cause:
The incident was traced to a failure in our AWS VPC internal DNS resolver, which caused connectivity issues within our MongoDB cluster. The secondary instances in the cluster were unable to establish connections with the primary instance, so all traffic, both reads and writes, was directed to the primary instance alone. This sudden increase in load overwhelmed the primary instance and resulted in the service disruption.
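To illustrate why losing the secondaries can overwhelm a primary, here is a minimal sketch of the load shift under a simplified model. It assumes writes always go to the primary and reads are spread evenly across reachable secondaries, falling back to the primary when none are reachable. The function name and the traffic figures are hypothetical and are not measurements from this incident.

```python
from typing import Tuple


def per_node_load(reads_per_sec: float,
                  writes_per_sec: float,
                  reachable_secondaries: int) -> Tuple[float, float]:
    """Return (primary_load, load_per_secondary) in operations per second.

    Simplified model (an assumption, not our production routing logic):
    writes always hit the primary; reads are split evenly across
    reachable secondaries, or all land on the primary if there are none.
    """
    if reachable_secondaries < 0:
        raise ValueError("reachable_secondaries must be non-negative")
    if reachable_secondaries == 0:
        # No secondary can serve reads, so the primary takes everything.
        return (reads_per_sec + writes_per_sec, 0.0)
    return (writes_per_sec, reads_per_sec / reachable_secondaries)


# Hypothetical traffic: 9,000 reads/s and 1,000 writes/s.
normal = per_node_load(9000, 1000, reachable_secondaries=2)
degraded = per_node_load(9000, 1000, reachable_secondaries=0)
print("normal:", normal)
print("degraded:", degraded)
```

With these illustrative figures, the primary goes from handling 1,000 operations per second to 10,000 once no secondary is reachable, a tenfold increase with no change in user traffic.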

Mitigation Steps:
Once the issue was identified, our emergency response team was promptly assembled to restore services to normal operation. The team worked to resolve the DNS resolver problem and restore proper connectivity within the MongoDB cluster. We understand the importance of maintaining a resilient and reliable service environment.

Preventive Measures:
We are committed to preventing similar incidents from occurring in the future. To this end, we are implementing the following measures:

Redundancy and Failover: We will enhance the redundancy and failover mechanisms within our AWS infrastructure to ensure that connectivity disruptions are quickly mitigated without impacting the user experience. We are also reviewing the MongoDB cluster setup to ensure the most efficient configuration is being used.
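As one illustration of what a failover mechanism for DNS disruptions could look like, the sketch below caches the last successful lookup for a hostname and serves it for a limited time when the resolver fails. This is a minimal example under stated assumptions, not a description of the changes we are deploying: the class name FallbackResolver, the 300-second staleness limit, and the injected resolve and clock parameters are all hypothetical.

```python
import socket
import time
from typing import Callable, Dict, List, Tuple


def system_resolve(hostname: str) -> List[str]:
    """Resolve a hostname to IP addresses using the operating system resolver."""
    infos = socket.getaddrinfo(hostname, None, proto=socket.IPPROTO_TCP)
    # Each entry is (family, type, proto, canonname, sockaddr); sockaddr[0] is the IP.
    return sorted({info[4][0] for info in infos})


class FallbackResolver:
    """Hypothetical helper: serve the last known good answer when DNS fails."""

    def __init__(self,
                 resolve: Callable[[str], List[str]] = system_resolve,
                 max_stale_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._resolve = resolve
        self._max_stale = max_stale_seconds
        self._clock = clock
        # hostname -> (time the answer was stored, addresses)
        self._cache: Dict[str, Tuple[float, List[str]]] = {}

    def lookup(self, hostname: str) -> List[str]:
        try:
            addresses = self._resolve(hostname)
        except OSError:
            # socket.gaierror is a subclass of OSError, so resolver failures land here.
            cached = self._cache.get(hostname)
            if cached is None:
                raise  # Nothing to fall back to.
            stored_at, stale_addresses = cached
            if self._clock() - stored_at > self._max_stale:
                raise  # Cached answer is too old to trust.
            return stale_addresses
        # Successful lookup: refresh the cache and return the fresh answer.
        self._cache[hostname] = (self._clock(), addresses)
        return addresses
```

Bounding how long a stale answer may be served is a deliberate trade-off: it lets cluster members keep finding each other through a short resolver outage while limiting the risk of connecting to an address that has since been reassigned.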

Additionally, we have contacted AWS to request further details about the DNS failure. We believe this will help us fortify our systems against similar issues in the future.

Posted Aug 25, 2023 - 04:42 EDT

Resolved
This incident has been resolved.
Posted Aug 24, 2023 - 12:00 EDT

Monitoring
A fix has been implemented and we are monitoring the results.
Posted Aug 24, 2023 - 11:38 EDT

Investigating
We are currently investigating this issue.
Posted Aug 24, 2023 - 11:32 EDT

This incident affected: Mongo Cluster.