Lightning causes major Amazon EC2 and Microsoft BPOS outages

Amazon's wildly popular "Elastic Compute" service has had its fair share of problems: an outage earlier this year left customers without their instances for up to four days, causing major issues for sites such as Reddit that rely on EC2 for their hosting. Now the company has been hit by another major outage.

According to ZDNet, all instances in the European datacenter went down at around 10:41 AM PDT, and the company was able to recover a handful of instances by 1:47 PM PDT. However, at least a quarter of instances are still unavailable 12 hours later, and it may take up to 48 hours to restore them.

Over on the Amazon EC2 status page, the company stated:

“We understand at this point that a lightning strike hit a transformer from a utility provider to one of our Availability Zones in Dublin, sparking an explosion and fire. Normally, upon dropping the utility power provided by the transformer, electrical load would be seamlessly picked up by backup generators. The transient electric deviation caused by the explosion was large enough that it propagated to a portion of the phase control system that synchronizes the backup generator plant, disabling some of them. Power sources must be phase-synchronized before they can be brought online to load. Bringing these generators online required manual synchronization. We've now restored power to the Availability Zone and are bringing EC2 instances up. We'll be carefully reviewing the isolation that exists between the control system and other components. The event began at 10:41 AM PDT with instances beginning to recover at 1:47 PM PDT.”

As with the outage earlier this year, Amazon is citing EBS (Elastic Block Store) as the issue. The company says it needs to restore services manually, and that some customers will be contacted before their volumes are returned to service so they can verify the integrity of their disks:

“Due to the scale of the power disruption, a large number of EBS servers lost power and require manual operations before volumes can be restored. Restoring these volumes requires that we make an extra copy of all data, which has consumed most spare capacity and slowed our recovery process. We’ve been able to restore EC2 instances without attached EBS volumes, as well as some EC2 instances with attached EBS volumes. We are in the process of installing additional capacity in order to support this process both by adding available capacity currently onsite and by moving capacity from other availability zones to the affected zone. While many volumes will be restored over the next several hours, we anticipate that it will take 24-48 hours until the process is completed. In some cases EC2 instances or EBS servers lost power before writes to their volumes were completely consistent. Because of this, in some cases we will provide customers with a recovery snapshot instead of restoring their volume so they can validate the health of their volumes before returning them to service. We will contact those customers with information about their recovery snapshot.”

While Amazon struggles to recover, Microsoft was also affected by what appears to be the same storm, but got back on its feet much faster. PC Mag reports that Microsoft tweeted that BPOS in the European zone was affected by a power issue, but services were restored within four hours.

"Europe data center power issue affects access to #bpos. Please see the Service Health Dashboard for latest updates: https://t.co/0YJeU6b" (Microsoft Online, @msonline)

ZDNet pointed out that, as usual, disgruntled developers are airing their thoughts on Amazon's forum, saying that the company took too long to acknowledge the issue. One user said "Absolute disaster as we are mid deploy" and another said "[this is] a complete disaster! Your information policy is a catastrophy," with many users upset that the issues are taking so long to resolve.
