Earlier today many of you probably noticed, and perhaps even posted about it here on the forums, that the social networking giant Facebook was down. The outage, which according to Robert Johnson of Facebook lasted approximately 2.5 hours, hit globally. Feedback on the forums generally described one of three scenarios: Facebook was completely down, Facebook was extremely slow, or Facebook was relatively normal. The first two seemed to be the most widespread.
Johnson writes that the cause was, ironically, an error in the handling of an error condition. Facebook has an automated system that verifies configuration values: it checks for invalid entries in the cache and replaces them with updated values from the site's persistent store. Unfortunately, when Facebook modified the persistent copy of one configuration value, the automated system perceived the new value as "invalid." Each time a user tried to access Facebook, a query was sent to the databases in an attempt to fix the "invalid" value. That alone wouldn't have been such an issue, except that even after the problem had been fixed, the large number of clients trying to reach the database all still attempted to fix the value. Each of them sent a query to restore it, but once restored, the value was still judged invalid, because the version in the persistent store was itself invalid.
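To make the failure mode easier to picture, here is a minimal Python sketch of the pattern described above. It is a toy model only, not Facebook's code: the names `is_valid`, `PersistentStore`, `client_request`, and `simulate`, the key name "limit", and the rule that a valid value is a positive integer are all assumptions made up for illustration.

```python
# Hypothetical toy model of the repair loop; no name here comes from Facebook.

def is_valid(value):
    # Assumed rule for illustration: a config value must be a positive integer.
    return isinstance(value, int) and value > 0


class PersistentStore:
    """Stands in for the database cluster holding the persistent copy."""

    def __init__(self, value):
        self.value = value
        self.queries = 0  # how many times clients have queried the store

    def read(self):
        self.queries += 1
        return self.value


def client_request(cache, store, key="limit"):
    """One client reads the config and 'repairs' the cache if it looks invalid."""
    value = cache.get(key)
    if not is_valid(value):
        # Repair path: refetch from the persistent store and overwrite the cache.
        value = store.read()
        cache[key] = value
    return value


def simulate(store_value, n_clients):
    """Return how many store queries n_clients requests generate."""
    store = PersistentStore(store_value)
    cache = {}
    for _ in range(n_clients):
        client_request(cache, store)
    return store.queries


if __name__ == "__main__":
    print(simulate(10, 1000))   # valid persistent value
    print(simulate(-1, 1000))   # persistent value that fails validation
```

In this sketch, a valid persistent value means only the first request reaches the store, because the repaired cache entry passes validation afterwards. When the persistent value itself fails validation, every request takes the repair path, so the query count grows with the number of clients, which is the kind of load the article describes.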
This created a feedback loop in which the databases could not actually be fixed: anything that was repaired was very quickly replaced with the invalid value again. That forced a major decision. Facebook had no choice but to turn the site off so that the databases could recover. Once they had recovered, the social networking site slowly let users back onto the service, and Facebook has since turned off the automated system that verifies these configuration values. The company is currently looking into new designs and models to prevent an outage like this one in the future.
Facebook said that this is the worst outage it has had in over four years. The company wishes to apologize for the issue and wants you to know that it is very serious about the performance and reliability of the social network.