On May 2, Microsoft Azure suffered a global outage that began at 19:43 UTC and lasted nearly two hours. Services were partially recovered by 21:30 UTC, but the problem was not fully resolved until 22:35 UTC. The connectivity issue primarily affected Microsoft's services, including Microsoft 365, Dynamics, and DevOps, and impacted customers around the globe.
Now, the Redmond giant has published an update discussing why the issue occurred.
According to Microsoft, the preliminary root cause for the widespread problem appears to be a "nameserver delegation" issue. The firm stated:
"Engineers identified the underlying root cause as a nameserver delegation change affecting DNS resolution and resulting in downstream impact to Compute, Storage, App Service, AAD, and SQL Database services. During the migration of a legacy DNS system to Azure DNS, some domains for Microsoft services were incorrectly updated. No customer DNS records were impacted during this incident, and the availability of Azure DNS remained at 100% throughout the incident. The problem impacted only records for Microsoft services."
As per the company's statement, although engineers identified and resolved the issue within a couple of hours, some systems that accessed domains with incorrect configurations stored that information, resulting in longer recovery times until the faulty cache expired.
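To illustrate why cached records can outlive a fix, here is a minimal Python sketch of a resolver-style cache that holds an answer until its time-to-live (TTL) runs out. It is only a toy model under stated assumptions: the `TtlCache` class, the `service.example` name, the nameserver labels, and the one-hour TTL are all hypothetical and are not taken from Microsoft's statement.

```python
class TtlCache:
    """Toy resolver cache: keeps an answer until its TTL elapses."""

    def __init__(self):
        # Maps a name to (answer, expiry time in seconds).
        self._entries = {}

    def resolve(self, name, now, lookup):
        entry = self._entries.get(name)
        if entry is not None and now < entry[1]:
            # Still within the TTL: serve the cached answer, right or wrong.
            return entry[0]
        # Missing or expired: ask the authoritative source and cache the result.
        answer, ttl = lookup(name)
        self._entries[name] = (answer, now + ttl)
        return answer


def simulate():
    # Hypothetical authoritative data: a bad delegation with a one-hour TTL.
    zone = {"service.example": ("bad-nameserver", 3600)}

    def lookup(name):
        return zone[name]

    cache = TtlCache()
    results = []

    # t=0: the client resolves during the incident and caches the bad answer.
    results.append(cache.resolve("service.example", 0, lookup))

    # The record is corrected at the source shortly afterwards.
    zone["service.example"] = ("good-nameserver", 3600)

    # t=600: the fix is live, but the client still serves its cached copy.
    results.append(cache.resolve("service.example", 600, lookup))

    # t=3600: the cached entry has expired, so the corrected record is fetched.
    results.append(cache.resolve("service.example", 3600, lookup))
    return results
```

In this sketch the client keeps returning the faulty answer until the hour is up, even though the source was corrected much earlier, which is the kind of lag the statement describes.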
That said, Microsoft is still looking into the problem, and has promised that a detailed root cause analysis (RCA) will be published within 72 hours. This is not the first time Azure has been hit by a global outage; in 2016, a worldwide DNS outage impacted a number of Azure-based services, including SQL Database, App Service / Web Apps, API Management, Service Bus, HDInsight, Media Services, and Visual Studio Team Services. Earlier this year, many Office 365 users weren't able to access their mailboxes due to a similar outage as well.