Now that the smoke has cleared and the hours-long Amazon Web Services incident that took large parts of the internet offline is over, Amazon is offering an explanation for the chaos. According to the company, what took web sites and services such as Slack, Medium and Giphy down was a typo.
Amazon explained that its S3 billing system was running more slowly than expected, so a team member tried to fix the problem by taking offline a few of the servers the billing process relies on. Unfortunately, one of the inputs to the command was entered incorrectly, and more servers were taken offline than intended.
The mistake took down two subsystems that all S3 objects in the US-EAST-1 region depend on. US-EAST-1 is a large data center location and also Amazon's oldest. Both subsystems required a full restart, and the company explains "the process of restarting these services and running the necessary safety checks to validate the integrity of the metadata took longer than expected."
Amazon has since apologized and is making several changes due to the incident, including measures to prevent an incorrect input from causing the same problems again. The company explained:
"While removal of capacity is a key operational practice, in this instance, the tool used allowed too much capacity to be removed too quickly. We have modified this tool to remove capacity more slowly and added safeguards to prevent capacity from being removed when it will take any subsystem below its minimum required capacity level."
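The two safeguards Amazon describes, slower removal and a minimum-capacity floor, can be illustrated with a short sketch. This is not Amazon's tool: the `Subsystem` class, the `remove_capacity` function, the `min_required` field and the `max_per_step` batch size are all hypothetical names chosen for illustration.

```python
from dataclasses import dataclass


class CapacityError(Exception):
    """Raised when a removal request would breach the capacity floor."""


@dataclass
class Subsystem:
    # All fields are hypothetical; they are not taken from Amazon's tooling.
    name: str
    active_servers: int
    min_required: int  # floor below which the subsystem cannot operate


def remove_capacity(subsystem: Subsystem, count: int, max_per_step: int = 2) -> int:
    """Remove `count` servers in small batches and return the number of batches.

    The whole request is rejected up front if it would leave the subsystem
    below its minimum required capacity.
    """
    if count <= 0:
        raise ValueError("count must be positive")
    if subsystem.active_servers - count < subsystem.min_required:
        raise CapacityError(
            f"removing {count} servers would take {subsystem.name} "
            f"below its minimum of {subsystem.min_required}"
        )

    steps = 0
    remaining = count
    while remaining > 0:
        # Remove at most `max_per_step` servers at a time. A real tool would
        # presumably pause and check subsystem health between batches.
        batch = min(max_per_step, remaining)
        subsystem.active_servers -= batch
        remaining -= batch
        steps += 1
    return steps
```

With a check like this, a mistyped input such as 50 servers instead of 5 would be refused before any server is touched, rather than being carried out in full.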
In addition, Amazon is changing the administration console for the AWS Service Health Dashboard so it runs across multiple regions. The typo that caused the outage also knocked out the dashboard, so Amazon had to use Twitter to keep customers up to date on the problem.
Source: Amazon via ZDNet | Image via Shutterstock