Yesterday's massive internet outage, which affected major websites and services such as CNN, Spotify, Shopify, Reddit and Stripe, was caused by a software bug that was triggered when a single customer, not a Fastly administrator, made a specific change to their settings. The customer has not been named and will hopefully remain so, since Fastly is taking full responsibility for the outage in its blog post and is being open and transparent about the issue.
In it, they even provided a timeline of events (all times are in UTC):
09:47 Initial onset of global disruption
09:48 Global disruption identified by Fastly monitoring
09:58 Status post is published
10:27 Fastly Engineering identified the customer configuration
10:36 Impacted services began to recover
11:00 Majority of services recovered
12:35 Incident mitigated
12:44 Status post resolved
17:25 Bug fix deployment began
This is quite the revelation. It suggests that relying on a single provider such as Fastly, Cloudflare, Amazon or Microsoft may create big new risks, even as those providers help solve the problems of running the web at scale. There have been no reports of major customers leaving Fastly for its competitors, and in fact its shares jumped 11% after news of the outage began to spread as major news sites came back online.
Fastly was able to end the outage quickly, saying, "Within 49 minutes, 95% of our network was operating as normal." The company has also committed to a post-mortem of the event, which looks to be significantly more interesting now that we know a customer action was the trigger.
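As a quick sanity check, the 49-minute figure matches the gap between the 09:47 onset and the 10:36 entry in the timeline above. The short Python sketch below works out how long after onset each entry occurred. The `TIMELINE` list and the `minutes_since_onset` helper are names made up for this illustration, and the sketch assumes every entry falls on the same UTC day.

```python
from datetime import datetime

# Timeline entries copied from the post (all times UTC).
# Assumption: every entry is on the same calendar day.
TIMELINE = [
    ("09:47", "Initial onset of global disruption"),
    ("09:48", "Global disruption identified by Fastly monitoring"),
    ("09:58", "Status post is published"),
    ("10:27", "Fastly Engineering identified the customer configuration"),
    ("10:36", "Impacted services began to recover"),
    ("11:00", "Majority of services recovered"),
    ("12:35", "Incident mitigated"),
    ("12:44", "Status post resolved"),
    ("17:25", "Bug fix deployment began"),
]


def minutes_since_onset(timeline):
    """Return (time, minutes after the first entry, event) for each entry."""
    start = datetime.strptime(timeline[0][0], "%H:%M")
    rows = []
    for hhmm, event in timeline:
        delta = datetime.strptime(hhmm, "%H:%M") - start
        rows.append((hhmm, int(delta.total_seconds() // 60), event))
    return rows


ELAPSED = minutes_since_onset(TIMELINE)

for hhmm, minutes, event in ELAPSED:
    print(f"{hhmm}  +{minutes:>3} min  {event}")
```

Read this way, detection came one minute after onset, the triggering configuration was identified at 40 minutes, and recovery began at 49 minutes.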