Early on July 17, 2020, during unplanned but required maintenance on a service management host, the service provider for the Fluke Networks User Authentication Service suffered an outage that affected all customers hosted on one of its AWS us-east clusters. The affected cluster included several production deployments, so customers were unable to log in to the LinkWare Live website.
The provider's engineers spent several hours attempting to stabilize individual hosts, but eventually needed to restart every host and bring them back online one by one. Services were restored for most customers by Wednesday evening. The recovery took this long because of the role of the management service, the size of the cluster, the number of capsules on the hosts, and the engineers' care not to trigger additional network issues.
The service provider has created a different plan of action that would bring services back online more quickly if the same issue occurs again. Its operations team is continuing root cause analysis and corrective action, and is proceeding cautiously with all maintenance so as not to trigger additional issues.
With regard to redundancy, the service provider for the Fluke Networks User Authentication Service maintains redundancy across three availability zones within a region. In this unfortunate case, all zones in the region were affected for this one cluster.
Because all systems have been stable since the resolution, the provider marked this incident as "resolved".