On September 8, users of Microsoft's Hotmail and SkyDrive online services were unable to access them for a few hours. At the time, Microsoft offered little explanation for the outage, but this week the Windows Live blog provided a more detailed account. According to Arthur de Haan, Microsoft's Vice President of Windows Live Test and Service Engineering, "A tool that helps balance network traffic was being updated and the update did not work correctly. As a result, configuration settings were corrupted, which caused a service disruption."
De Haan said the corruption affected Microsoft's DNS service, where two conditions that occurred at the same time combined to corrupt a file. He wrote:
The first condition is related to how the load balancing devices in the DNS service respond to a malformed input string (i.e., the software was unable to parse an incorrectly constructed line in the configuration file). The second condition was related to how the configuration is synchronized across the DNS service to ensure all client requests return the same response regardless of the connection location of the client. Each of these conditions was tracked to the networking device firmware used in the Microsoft DNS service.
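To illustrate the kind of failure de Haan describes, here is a minimal Python sketch of one common safeguard: checking every line of a configuration file before it is synchronized to other servers. This is not Microsoft's code or the firmware's actual behavior. The record format (`<hostname> <IPv4 address>`), the function names `validate_config` and `sync_config`, and the dictionary-based replicas are all hypothetical and chosen only for illustration.

```python
import re

# Hypothetical record format: "<hostname> <IPv4 address>".
# This is an assumption for illustration, not the real DNS device format.
LINE_PATTERN = re.compile(r"^[A-Za-z0-9.-]+ \d{1,3}(\.\d{1,3}){3}$")


def validate_config(lines):
    """Return (line_number, text) pairs for every line that fails to parse."""
    return [
        (number, line)
        for number, line in enumerate(lines, start=1)
        if not LINE_PATTERN.match(line)
    ]


def sync_config(lines, replicas):
    """Copy the config to every replica only if every line parses.

    Each replica is modeled as a dict with a "config" key (hypothetical).
    """
    errors = validate_config(lines)
    if errors:
        # Refuse to propagate: replicas keep their last known-good config.
        return False, errors
    for replica in replicas:
        replica["config"] = list(lines)
    return True, []
```

In this sketch, a malformed line stops the update before it reaches any replica, so one bad input cannot be copied across the whole service.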
De Haan said that Microsoft has taken several steps to prevent such an outage from happening again, including "further hardening the DNS service to improve its overall redundancy and fail-over capability" and "developing an additional recovery process that will allow a specific property the ability to fail over to restore service and then fail back when the DNS service is restored."