Incident start: February 3, 2026 09:18:00 UTC
Incident detected: February 3, 2026 09:19:30 UTC
Incident mitigated: February 3, 2026 09:46:30 UTC
Incident resolved: February 3, 2026 09:49:00 UTC
During the incident, the platform served 3,357,990 requests, of which 436,753 (13.0%) were served with errors.
The root cause was identified as unexpected behavior of the AWS ALB Controller when multiple security groups are attached to the network interfaces (ENIs) associated with the EKS cluster. The controller was unable to determine the correct security group to use, leading to failures in provisioning and managing load balancers for services within the cluster.
Due to SOC2 compliance requirements, critical infrastructure such as Kubernetes nodes and the underlying operating systems is rotated regularly. Amazon Linux 2, which the nodes were running, reaches end of life on 2026-06-30. To ensure continued security and compliance, a decision was made to upgrade the EKS cluster nodes to Amazon Linux 2023 (AL2023). The upgrade involved creating new nodes with the updated OS and migrating workloads from the old nodes to the new ones.
The node rotation process is intentionally slow and usually takes 3 to 5 days. It implements a rolling update strategy to minimize disruption to services running on the cluster: during the rotation, old and new nodes coexist in the cluster, and workloads are gradually shifted to the new nodes.
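For illustration, a minimal sketch of one rotation step, assuming the official kubernetes Python client and credentials from the local kubeconfig; node names and pacing values are placeholders, not values from the actual runbook:

```python
import time
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

def cordon(node_name: str) -> None:
    # Mark the node unschedulable so no new pods land on it.
    v1.patch_node(node_name, {"spec": {"unschedulable": True}})

def evict_pods(node_name: str) -> None:
    # Evict every non-DaemonSet pod on the node; the Eviction API
    # respects PodDisruptionBudgets, which keeps the rotation safe.
    pods = v1.list_pod_for_all_namespaces(
        field_selector=f"spec.nodeName={node_name}").items
    for pod in pods:
        owners = pod.metadata.owner_references or []
        if any(o.kind == "DaemonSet" for o in owners):
            continue
        eviction = client.V1Eviction(metadata=client.V1ObjectMeta(
            name=pod.metadata.name, namespace=pod.metadata.namespace))
        v1.create_namespaced_pod_eviction(
            pod.metadata.name, pod.metadata.namespace, eviction)

# Rotate one old node at a time, pausing between nodes so the
# rollout stays slow and observable (the real process runs 3-5 days).
for node in ["old-node-1", "old-node-2"]:  # hypothetical node names
    cordon(node)
    evict_pods(node)
    time.sleep(3600)  # illustrative pacing interval
```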
Our process involves testing new nodes in the staging environment before rolling them out to production. The new nodes were tested in staging with no known issues.
The rollout of the new nodes was scheduled for the first week of the month. It happens in phases: each phase rolls out a subset of nodes, monitors their performance, and confirms stability before the next phase proceeds. The new nodes were gradually introduced into the production cluster over several days, initially in the us-east-1c and us-east-1d availability zones. No issues were detected during the initial phases of the rollout. During the next phase (rolling out the remaining new nodes in us-east-1a), the Istio ingress pods were gradually re-deployed onto new nodes.
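A quick way to confirm where the ingress pods land during a phase is to map each pod to its node's zone and OS image. A sketch assuming the kubernetes Python client and the conventional istio-system namespace and app=istio-ingressgateway label (assumptions, not confirmed details of our setup):

```python
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()

pods = v1.list_namespaced_pod(
    "istio-system", label_selector="app=istio-ingressgateway").items
for pod in pods:
    node = v1.read_node(pod.spec.node_name)
    zone = node.metadata.labels.get("topology.kubernetes.io/zone", "?")
    os_image = node.status.node_info.os_image  # e.g. Amazon Linux 2 vs AL2023
    print(f"{pod.metadata.name}: node={pod.spec.node_name} "
          f"zone={zone} os={os_image}")
```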
The incident was detected by monitoring systems that flagged elevated error rates at 09:19:30 UTC. Initial investigation revealed health check failures in load balancers managed by the AWS ALB Controller. The immediate mitigation step was to roll back to the previous nodes running Amazon Linux 2 in us-east-1a; this was not effective, however, because the ingress pods were running on Amazon Linux 2023 nodes in us-east-1c and us-east-1d. A misconfiguration was then identified in the AWS ALB target group configuration: targets were marked as unhealthy, but the underlying cause was that they had never been properly registered due to AWS ALB Controller errors. The controller logs were reviewed to identify the specific registration errors, which indicated that multiple security groups were attached to the ENI, leaving the controller unable to choose between them.
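The distinction between "registered but unhealthy" and "never registered" can be checked directly against the ELBv2 API. A minimal diagnostic sketch, assuming boto3 credentials; the target group ARN is a placeholder:

```python
import boto3

elbv2 = boto3.client("elbv2", region_name="us-east-1")

# Placeholder ARN of the Istio ingress target group.
TG_ARN = ("arn:aws:elasticloadbalancing:us-east-1:123456789012:"
          "targetgroup/istio-ingress/0123456789abcdef")

descriptions = elbv2.describe_target_health(
    TargetGroupArn=TG_ARN)["TargetHealthDescriptions"]

if not descriptions:
    # No targets at all: the controller never registered the pod IPs.
    print("no targets registered - inspect AWS ALB Controller logs")
else:
    for d in descriptions:
        target, health = d["Target"], d["TargetHealth"]
        print(target["Id"], health["State"],
              health.get("Reason", ""), health.get("Description", ""))
```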
This misconfiguration was not detected in the staging environment because staging uses instance IDs as targets for the AWS ALB target groups, while the Istio ingress load balancer in production uses IP addresses as targets. The difference in AWS ALB Controller behavior between these two target types was unexpected and not covered by the existing test suite. In addition, AWS ALB target group health checks were not properly monitored, which delayed detection of the issue.
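A sketch of the kind of alarm that was missing, using boto3's CloudWatch API; the alarm name and dimension values are placeholders. UnHealthyHostCount needs both the TargetGroup and LoadBalancer dimensions in their ARN-suffix form:

```python
import boto3

cw = boto3.client("cloudwatch", region_name="us-east-1")

cw.put_metric_alarm(
    AlarmName="istio-ingress-unhealthy-targets",  # placeholder
    Namespace="AWS/ApplicationELB",
    MetricName="UnHealthyHostCount",
    Dimensions=[
        # ARN-suffix forms; both values are placeholders.
        {"Name": "TargetGroup", "Value": "targetgroup/istio-ingress/0123456789abcdef"},
        {"Name": "LoadBalancer", "Value": "app/istio-ingress/fedcba9876543210"},
    ],
    Statistic="Maximum",
    Period=60,
    EvaluationPeriods=3,
    Threshold=0,
    ComparisonOperator="GreaterThanThreshold",
    # Treat missing datapoints as breaching so gaps in the metric
    # (e.g. nothing registered at all) also raise the alarm.
    TreatMissingData="breaching",
    AlarmActions=[],  # wire to the real SNS topic / paging integration
)
```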
The additional security group attached to the Amazon Linux 2023 nodes was the security group managed by AWS EKS, which is attached to nodes as required by EKS for proper cluster operation. The old nodes had only the custom security group attached to their ENIs; the new AL2023 nodes had both the custom security group and the EKS-managed security group. The ALB Controller expects exactly one security group tagged with kubernetes.io/cluster/<name>, and failed when it found two.
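A minimal audit sketch for this invariant, assuming boto3 and that node instances carry the kubernetes.io/cluster/<name> tag (standard for EKS-managed node groups); the cluster name is a placeholder. It flags every node ENI that does not have exactly one tagged security group:

```python
import boto3

CLUSTER = "prod-cluster"  # placeholder cluster name
TAG_KEY = f"kubernetes.io/cluster/{CLUSTER}"

ec2 = boto3.client("ec2", region_name="us-east-1")

# Security groups carrying the cluster tag the ALB Controller looks for.
tagged_sgs = {
    sg["GroupId"]
    for sg in ec2.describe_security_groups(
        Filters=[{"Name": "tag-key", "Values": [TAG_KEY]}])["SecurityGroups"]
}

# Walk the cluster's instances and check each attached ENI.
reservations = ec2.describe_instances(
    Filters=[{"Name": "tag-key", "Values": [TAG_KEY]}])["Reservations"]
for reservation in reservations:
    for instance in reservation["Instances"]:
        for eni in instance["NetworkInterfaces"]:
            hits = [g["GroupId"] for g in eni["Groups"]
                    if g["GroupId"] in tagged_sgs]
            if len(hits) != 1:
                print(f"{instance['InstanceId']} {eni['NetworkInterfaceId']}: "
                      f"{len(hits)} tagged security groups: {hits}")
```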
To mitigate the issue, the following steps were taken:
- The kubernetes.io/cluster/<name> tag was removed from one of the two security groups attached to the new nodes' ENIs, leaving exactly one tagged security group so the AWS ALB Controller could resolve it and register targets again.
- Target registration and health in the affected AWS ALB target groups were verified before declaring the incident resolved.
To prevent similar incidents in the future, the following actions will be taken:
- Add automated validation that exactly one security group tagged kubernetes.io/cluster/<name> is attached to each node's ENI (see the audit sketch above).
- Monitor AWS ALB target group health checks and alert on unhealthy or unregistered targets.
- Extend the staging environment and test suite to cover IP-address targets in addition to instance-ID targets.