CoreDNS Performance Degradation

Incident Report for RebelMouse

Postmortem

Chronology of the incident (UTC)

May 19, 2025 1:15 PM RebelMouse initiated regular reroll of EKS nodes to apply security patches.

May 19, 2025 1:42 PM RebelMouse monitoring tools detected an increased error rate and services performance degradation.

May 19, 2025 1:44 PM RebelMouse tech team started to investigate the issue.

May 19, 2025 1:50 PM The core services functionality was restored.

May 19, 2025 2:13 PM The incident was resolved.

The impact of the incident

CoreDNS experienced a significant performance degradation, leading to increased latency in DNS resolution and a higher rate of timeouts. This incident affected multiple services and applications causing disruptions in service availability and user experience. The incident was particularly impactful during peak usage hours, leading to widespread reports of platform unavailability.

The root cause

CoreDNS acts as a primary DNS server for services within Kubernetes clusters. The incident was triggered by a termination of one of the CoreDNS pods (caused by automatic planned EC2 instance termination), which led to a failure across the DNS resolution process. The remaining CoreDNS pods were unable to handle the increased load, resulting in timeouts. Root cause analysis revealed that despite the CoreDNS was deployed with excess capacity (provided CPU and memory resources are at least 3 times the average load), the sudden spike in load due to the termination of a pod led to a bottleneck in DNS resolution on the remaining pods.

Actions taken & Preventive Measures

After identifying the root cause, the following steps were taken to resolve the incident:

The affected CoreDNS pod was rescheduled to a new EC2 instance.
The CoreDNS deployment was scaled up to increase the number of replicas, ensuring that the load was distributed evenly across all available pods.

Future improvements to prevent similar incidents include:

Implementing scale-in protection mechanisms to prevent unexpected terminations of CoreDNS pods during peak load times.
Review CoreDNS configuration to ensure successful vertical scaling of the pods.
Conducting regular capacity reviews to ensure that the CoreDNS deployment is provisioned to handle peak load.

Posted May 20, 2025 - 10:55 EDT

Resolved

We are currently investigating this issue.

Posted May 19, 2025 - 07:00 EDT