Performance Degradation

Incident Report for RebelMouse

Postmortem

Incident Description

Incident start: February 3, 2026 09:18:00 UTC

Incident detected: February 3, 2026 09:19:30 UTC

Incident mitigated: February 3, 2026 09:46:30 UTC

Incident resolved: February 3, 2026 09:49:00 UTC

During the incident period:

Logged in experience was unavailable.
Logged out experience was partially degraded.

During incident timeframe platform served 3357990 requests. 436753 served with errors (13.0%).

Incident Details

Root cause of the issue identified as AWS ALB Controller unexpected behavior when multiple security groups are attached to network interfaces (ENI) associated with the EKS cluster. The controller was unable to determine the correct security group to use, leading to failures in provisioning and managing load balancers for services within the cluster.

Historical Context

Due to SOC2 compliance requirements critical infrastructure such as Kubernetes nodes and underlying Operating Systems are regularly rotated. Amazon Linux 2 (which was used) EOL is scheduled for 2026-06-30. To ensure continued security and compliance, a decision was made to upgrade the EKS cluster nodes to Amazon Linux 2023. This upgrade process involved creating new nodes with the updated OS and migrating workloads from the old nodes to the new ones.

Process of node rotation intentionally made slow and usually takes 3 to 5 days. This process implements a rolling update strategy to minimize disruption to services running on the cluster. During this period, both old and new nodes coexist in the cluster, and workloads are gradually shifted to the new nodes.

Our process involves testing new nodes in staging environment before rolling them out to production. New nodes were tested in staging environment without known issues.

New nodes rollout was scheduled during the first week of the month. It happens in phases, where each phase involves rolling out a subset of nodes, monitoring their performance, and ensuring stability before proceeding to the next phase. New nodes were gradually introduced into production cluster over several days. New nodes were allocated in us-east-1c and us-east-1d availability zones. No issues were detected during initial phases of the rollout. During next phase (rolling out remaining new nodes in us-east-1a), Istio ingress pods were gradually re-deployed onto new nodes.

Detection and Mitigation

The incident was detected through monitoring systems that detected escalated error rates at 09:19:30 UTC. Initial investigation revealed health check failures in load balancers managed by AWS ALB Controller. Immediate mitigation step was to roll back to previous nodes running Amazon Linux 2 in us-east-1a, however this was not effective because ingress pods were deployed onto nodes based on Amazon Linux 2023 in us-east-1c and us-east-1d availability zones. Misconfiguration was identified in AWS ALB targets group configuration, where targets were marked as unhealthy however the root cause was targets were not properly registered due to AWS ALB Controller errors. AWS ALB Controller logs were reviewed to identify specific errors related to target registration. The error indicated that multiple security groups were attached to the ENI, causing confusion for the controller.

This kind of misconfiguration was not detected in staging environment because staging environment used instance IDs as targets for AWS ALB target groups, while specifically Istio Ingress load balancer in production used IP addresses as targets. AWS ALB Controller behavior difference between these two target types was unexpected and not covered by existing tests suite. AWS ALB target groups healthchecks were not properly monitored, which delayed detection of the issue.

Additional security group that was attached to Amazon Linux 2023 nodes was security group managed by AWS EKS. It was attached to nodes as required by AWS EKS for proper cluster operation. Old nodes had only the custom security group attached to their ENIs. New AL2023 nodes had both the custom security group and the EKS-managed security group attached. The ALB Controller expects exactly one security group tagged with kubernetes.io/cluster/<name>, and failed when it found two.

To mitigate the issue, the following steps were taken:

AWS EKS managed security group was reviewed and determined to be subset of custom security group.
AWS EKS managed security group was detached from nodes running Amazon Linux 2023.
AWS ALB Controller was restarted to force reconciliation of target group bindings.
Amazon Linux 2023 nodes deployment was updated to include only one security group.

Post-Incident Actions

To prevent similar incidents in the future, the following actions will be taken:

Improve monitoring and alerting for ALB target group health to detect similar issues faster.
Align staging environment configuration with production, specifically using IP addresses as targets for AWS ALB target groups in staging as well.
Update Amazon Linux 2023 node provisioning process to ensure only one security group tagged with kubernetes.io/cluster/<name> is attached to each node's ENI.
Review and update testing suite to cover scenarios where targets are IPs and nodes have multiple security groups attached to their ENIs.
Document AWS ALB Controller behavior differences between instance ID targets and IP address targets to ensure better understanding of potential issues in the future.

Posted Feb 03, 2026 - 08:45 EST

Resolved

This incident has been resolved.

Posted Feb 03, 2026 - 05:56 EST

Monitoring

A fix has been implemented and we are monitoring the results.

Posted Feb 03, 2026 - 04:56 EST

Identified

The issue has been identified and a fix is being implemented.

Posted Feb 03, 2026 - 04:53 EST

Investigating

We're currently experiencing issues with our backend cluster. Our DevOps team is currently investigating the issue.

Posted Feb 03, 2026 - 04:53 EST

This incident affected: EKS Cluster.