This post focuses on the details of the service disruption that occurred in the Northern Virginia (US-EAST-1) Region on November 25th, 2020.
Introduction
Disclaimer: This article is not related to my employer. While writing this, I spoke to no one at AWS; I only read the public postmortem. I write summaries like this because they help me understand things I’ve read; I am sharing it on the off chance that it helps you learn something too.
The subject of this outage, Amazon Kinesis, is a distributed cloud-based service used to process streaming data. Streaming data is data generated continuously by different data sources, e.g. trading prices from stock exchanges around the world or logs from large systems.
A personal overview of Kinesis' architecture:
A few takeaways from the architecture:
The frontend server fleet sits in front of backend clusters and performs a few functions; the most important one to us right now is request routing. The fleet owns a sharding mechanism that determines which backend server (or shard) within which backend cluster will process incoming data streams. Sharding is a valuable technique for scaling systems.
Groups of backend clusters process the streaming data. In this architecture, each server (or shard) in a cluster is equally capable of operating on the streaming data it receives.
In this system, the sharding mechanism uses information from three sources to determine which shard will process any incoming data: a membership microservice, config details from a DB, and information from all the frontend fleet servers.
As you can imagine, calling these three system components each time streaming data is received would be extremely expensive. So, the sharding mechanism builds and maintains a shard-map cache on each frontend fleet server, enabling fast lookup to determine which shard(s) will process incoming data.
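To make the idea concrete, here is a minimal sketch of how a frontend server might build and consult such a cache. The function names, data shapes, and hash-based lookup are assumptions for illustration only, not actual Kinesis internals.

```python
import hashlib

# Hypothetical sketch of a frontend server's shard-map cache. The three
# inputs mirror the sources described above (membership service, config
# from a DB, fleet-wide health info); names and shapes are illustrative.

def build_shard_map(membership, db_config, fleet_health):
    """Flatten the three sources into a list of routable shards."""
    shard_map = []
    for cluster in membership:                    # e.g. ["cluster-a", "cluster-b"]
        for shard in db_config.get(cluster, []):  # e.g. {"cluster-a": ["a-1", "a-2"]}
            if fleet_health.get(shard, False):    # e.g. {"a-1": True, ...}
                shard_map.append(shard)
    return shard_map

def route(stream_key, shard_map):
    """Deterministically pick a shard for an incoming record."""
    digest = int(hashlib.sha256(stream_key.encode()).hexdigest(), 16)
    return shard_map[digest % len(shard_map)]

# Toy usage:
membership = ["cluster-a", "cluster-b"]
db_config = {"cluster-a": ["a-1", "a-2"], "cluster-b": ["b-1"]}
fleet_health = {"a-1": True, "a-2": True, "b-1": True}

cache = build_shard_map(membership, db_config, fleet_health)
print(route("stock-ticker-xyz", cache))
```

The cache is built once (and refreshed periodically), so the expensive calls to the three sources happen rarely, while per-record routing is just a hash and an index lookup.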
Trigger
The outage began after more servers were added to the frontend fleet. Adding servers is routine work (it could simply have been to handle higher request load), so the trigger by itself tells us little about what actually went wrong.
Sometimes, in engineering, the trigger is not the root cause. So the AWS team had to dig deeper.
Root Cause
Remember that one of the components that the sharding mechanism uses to build the shard-map cache is “information” from all the frontend fleet servers. One type of information the mechanism can collect from the fleet is the most recent response times for all the backend cluster servers. The frontend server fleet can use this information during request routing to send incoming data to the backend cluster servers in ascending order of response times, for example.
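As a toy illustration of that latency-aware routing idea (the backend names and numbers below are made up):

```python
# Recently observed response times per backend server, in milliseconds.
recent_response_ms = {"backend-1": 42.0, "backend-2": 8.5, "backend-3": 17.3}

# Order candidate backends by ascending response time, fastest first.
preferred_order = sorted(recent_response_ms, key=recent_response_ms.get)
print(preferred_order)  # ['backend-2', 'backend-3', 'backend-1']
```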
A server in the frontend fleet communicates with other servers using OS threads; each server maintains an OS thread for every other server within the fleet. For example, this means that if there are 5 servers within the fleet, each server will maintain 4 threads (that connect to all the other servers in the fleet), totalling 20 threads within the entire fleet.
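In other words, the per-server thread count grows linearly with fleet size, but the fleet-wide total grows roughly quadratically. A quick calculation makes that clear:

```python
# Every server keeps one thread per peer, so a fleet of N servers maintains
# N - 1 threads per server and N * (N - 1) threads in total.

def fleet_thread_totals(n_servers):
    per_server = n_servers - 1
    fleet_total = n_servers * per_server
    return per_server, fleet_total

for n in (5, 100, 1000):
    per_server, total = fleet_thread_totals(n)
    print(f"{n} servers -> {per_server} threads each, {total} fleet-wide")
# 5 servers -> 4 threads each, 20 fleet-wide
# 100 servers -> 99 threads each, 9900 fleet-wide
# 1000 servers -> 999 threads each, 999000 fleet-wide
```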
Any time a new server is added to the fleet, each existing server needs to create a new connection to it, and vice versa. According to the article, this propagation currently takes up to an hour.
This incident’s root cause was an operating-system configuration limit. When the new servers were added to the frontend fleet, each existing server began creating threads to connect to them (and vice versa), which pushed the number of threads on each server past the maximum allowed by its operating system configuration. As this limit was exceeded, cache construction failed to complete, and frontend servers ended up with useless shard-maps that left them unable to route requests to backend clusters.
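The postmortem doesn’t describe how such a limit would be checked, but as a rough sketch (Unix-specific, using Python’s resource module; the headroom policy and helper names are assumptions), a service could guard thread-per-peer growth against the OS limit like this:

```python
import resource

def max_threads_allowed():
    # On Linux, RLIMIT_NPROC caps the number of processes/threads this user
    # may run; RLIM_INFINITY means no explicit limit is configured.
    soft, _hard = resource.getrlimit(resource.RLIMIT_NPROC)
    return None if soft == resource.RLIM_INFINITY else soft

def can_add_peers(current_thread_count, new_peers, headroom=0.8):
    """Refuse to grow the peer mesh past a safety margin of the OS limit."""
    limit = max_threads_allowed()
    if limit is None:
        return True
    return (current_thread_count + new_peers) < limit * headroom

# Hypothetical numbers: 3000 threads already running, 500 new peers joining.
print(can_add_peers(current_thread_count=3000, new_peers=500))
```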
Fixing the Problem
Since adding capacity to the frontend fleet triggered the event, the first line of action was to remove this additional capacity. This highlights an important initial step in incident management: stop the bleeding (i.e. undo the trigger).
The fleet was restarted slowly after the additional capacity was removed.
Improvements
The following are a collection of actions taken that aim to prevent this problem from happening in the future:
The frontend fleet was moved to larger servers with more compute (CPU and memory). Since each server is more powerful, fewer servers are needed overall, and hence there are fewer OS threads across the fleet.
An alarm was added to monitor thread consumption within the fleet: a threshold is set, and if it is exceeded, the alarm fires (see the sketch after this list).
The AWS team planned tests to determine a safe upper bound for thread counts; this work has probably been completed by now.
Cellularization of the frontend server fleet will be accelerated. According to the article, cellularization is an approach used to isolate the effects of failure within a service and keep the service’s components operating within a previously tested and operated range. (Cellularization is akin to fire doors, which limit the spread of fire and smoke between separate compartments of a structure.)
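Regarding the thread-consumption alarm mentioned above, here is a minimal sketch of such a check. The threshold, the per-host counts, and the reporting agent are all hypothetical; AWS has not published how its monitoring works.

```python
# Hypothetical per-host thread counts, as they might be reported by a
# monitoring agent running on each frontend server.
THREAD_ALARM_THRESHOLD = 5000  # assumed value; a real threshold would come
                               # from the safe upper bound determined by testing

def hosts_breaching(observed_counts, threshold=THREAD_ALARM_THRESHOLD):
    """Return the hosts whose reported thread count exceeds the threshold."""
    return {host: count for host, count in observed_counts.items() if count > threshold}

reported = {"frontend-01": 4200, "frontend-02": 5600, "frontend-03": 3900}
breaches = hosts_breaching(reported)
if breaches:
    print(f"ALARM: thread consumption too high on {sorted(breaches)}")
```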
One other thing to note:
Typically, during operational events like this, customers are alerted via a public Service Health Dashboard and their Personal Health Dashboard. Unfortunately, updates to the Service Health Dashboard initially failed because the update process used Amazon Cognito, which was affected during this incident. While there was an alternative way to update the dashboard, it was very slow because the support staff weren’t as familiar with the process. One commitment from AWS is to make sure that this alternative way to update the dashboard is part of support training. It is an essential lesson for us all: the alternative ways to run, debug, or update critical systems within an organization should be well understood and practised.
Conclusion
A problem common to all cloud providers is that their service issues cascade to their users and applications. A perfect system doesn’t exist, but outages such as this one allow providers to keep building more robust systems. For end-users, it is crucial to think about development patterns (e.g. having backup providers or caching locally) that allow for handling some provider failures. Each user will need to weigh these options against their budget.
Belated HugOps to the team at AWS!