May 19, 2025 1:15 PM RebelMouse initiated a regular reroll of EKS nodes to apply security patches.
May 19, 2025 1:42 PM RebelMouse monitoring tools detected an increased error rate and degraded service performance.
May 19, 2025 1:44 PM The RebelMouse tech team started investigating the issue.
May 19, 2025 1:50 PM Core service functionality was restored.
May 19, 2025 2:13 PM The incident was resolved.
CoreDNS experienced significant performance degradation, which increased DNS resolution latency and the rate of timeouts. The incident affected multiple services and applications, disrupting service availability and user experience. Because it occurred during peak usage hours, the impact was especially broad, with widespread reports of platform unavailability.
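The two symptoms described above, slower DNS resolution and more timeouts, can be observed from a client's point of view with a small probe. The following is a minimal sketch in Python and not the tooling used by RebelMouse; the function name `probe_dns`, the `slow_threshold_s` parameter, and the one-second default are assumptions chosen for illustration.

```python
import socket
import time
from statistics import median


def probe_dns(hostnames, resolve=socket.gethostbyname, slow_threshold_s=1.0):
    """Resolve each hostname once and summarise latency and failures.

    `resolve` is injectable so the probe can be exercised without a network.
    `slow_threshold_s` is an illustrative cut-off, not a value from the report.
    """
    names = list(hostnames)
    latencies = []
    failures = 0
    for name in names:
        start = time.monotonic()
        try:
            resolve(name)
        except OSError:
            # socket.gaierror and socket.timeout are both OSError subclasses.
            failures += 1
            continue
        latencies.append(time.monotonic() - start)
    slow = sum(1 for t in latencies if t > slow_threshold_s)
    return {
        "total": len(names),
        "failures": failures,
        "slow": slow,
        "median_latency_s": median(latencies) if latencies else None,
    }
```

Run inside a pod against a few in-cluster service names, a probe like this would show a rising `failures` or `slow` count when the cluster DNS is struggling. Which names to probe and how often are left to the operator.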
CoreDNS acts as the primary DNS server for services within Kubernetes clusters. The incident was triggered by the termination of one of the CoreDNS pods (caused by an automatic, planned EC2 instance termination), which led to failures across the DNS resolution process. The remaining CoreDNS pods were unable to handle the increased load, resulting in timeouts. Root cause analysis revealed that although CoreDNS was deployed with excess capacity (provisioned CPU and memory resources are at least 3 times the average load), the sudden load spike caused by the pod termination created a bottleneck in DNS resolution on the remaining pods.
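One way to see how headroom measured against average load can still be exhausted is a back-of-the-envelope utilisation estimate. The sketch below is illustrative only: the pod count, the peak-to-average ratio, and the retry amplification factor are hypothetical values, not figures from this incident, and the report does not state that this model matches the actual failure mechanism.

```python
def per_pod_utilisation(total_load, pod_capacity, pods, lost_pods=0, retry_factor=1.0):
    """Estimate utilisation of each remaining pod as a fraction of its capacity.

    All inputs are in the same arbitrary unit (for example, queries per second).
    `retry_factor` models clients re-sending queries that timed out; it is an
    assumption introduced for this illustration.
    """
    remaining = pods - lost_pods
    if remaining <= 0:
        raise ValueError("at least one pod must remain")
    return total_load * retry_factor / (remaining * pod_capacity)


# Hypothetical numbers: average load 100, three pods of capacity 100 each,
# so total capacity is 3x the average load.
average = per_pod_utilisation(100, 100, pods=3)
# Peak traffic at twice the average, with one pod terminated.
peak_one_lost = per_pod_utilisation(200, 100, pods=3, lost_pods=1)
# Same situation, with timed-out clients retrying (factor 1.5).
peak_with_retries = per_pod_utilisation(200, 100, pods=3, lost_pods=1, retry_factor=1.5)
```

With these made-up inputs, each pod sits at about a third of its capacity on average, reaches full utilisation at peak once one pod is lost, and exceeds its capacity once retries are added. The point is only that a 3x margin over the average does not by itself guarantee a margin over peak load with reduced replicas.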
After identifying the root cause, the following steps were taken to resolve the incident:
Future improvements to prevent similar incidents include: