What should have been routine maintenance turned into something else.
Following Monday’s massive outage that took down all of its services, Facebook has published a blog post detailing what happened. According to Santosh Janardhan, the company’s vice president of infrastructure, the outage started with what should have been routine maintenance. At some point yesterday, a command was issued that was supposed to assess the availability of the backbone network that connects all of Facebook’s disparate computing facilities. Instead, the command unintentionally took those connections down. Janardhan says a bug in the company’s internal audit system meant it failed to stop the command from executing.
That issue caused a secondary problem that ultimately turned yesterday’s outage into the international incident it became. When Facebook’s DNS servers couldn’t connect to the company’s primary data centers, they stopped advertising the Border Gateway Protocol (BGP) routing information that the rest of the internet needs to reach them.
“The end result was that our DNS servers became unreachable even though they were still operational,” said Janardhan. “This made it impossible for the rest of the internet to find our servers.”
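The behavior Janardhan describes, a healthy server pulling its own routes because it cannot see the data centers behind it, can be illustrated with a minimal Python sketch. It is not Facebook's code: the `DnsNode` class, the `evaluate` method, the `reachable_prefixes` helper and the address prefixes (taken from the ranges reserved for documentation) are all hypothetical, and a real deployment would talk to an actual BGP daemon rather than flip a flag.

```python
from dataclasses import dataclass, field


@dataclass
class DnsNode:
    """Hypothetical DNS server that announces its own prefix via BGP."""
    prefix: str
    advertised: bool = True
    log: list = field(default_factory=list)

    def evaluate(self, datacenter_reachable: bool) -> None:
        # Withdraw the route when the node cannot reach the data centers,
        # and announce it again once connectivity returns.
        if not datacenter_reachable and self.advertised:
            self.advertised = False
            self.log.append(f"withdraw {self.prefix}")
        elif datacenter_reachable and not self.advertised:
            self.advertised = True
            self.log.append(f"announce {self.prefix}")


def reachable_prefixes(nodes):
    """Prefixes the rest of the internet can still route to."""
    return [node.prefix for node in nodes if node.advertised]


# Illustrative prefixes from the documentation ranges, not real ones.
nodes = [DnsNode("192.0.2.0/24"), DnsNode("198.51.100.0/24")]

# The backbone goes down, so every node fails its check at the same time.
for node in nodes:
    node.evaluate(datacenter_reachable=False)

print(reachable_prefixes(nodes))  # nothing is left for outsiders to route to
```

In this toy model each node does the sensible thing on its own. The trouble only shows up when every node loses the backbone at once, because then every route disappears together even though the servers themselves are still running.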
As we learned partway through yesterday, what made an already difficult situation worse was that the outage made it impossible for Facebook engineers to connect to the servers they needed to fix. Moreover, the loss of DNS functionality meant they couldn’t use many of the internal tools they depend on to investigate and resolve networking issues in normal circumstances. That meant the company had to physically send personnel to its data centers, a task that was complicated by the physical safeguards it had in place at those locations.
“They’re hard to get into, and once you’re inside, the hardware and routers are designed to be difficult to modify even when you have physical access to them,” according to Janardhan. Once it could restore its backbone network, Facebook was careful not to turn everything back on at once, since the surge in power and computing demand could have led to more crashes.
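A staged restart of that kind can be sketched in a few lines of Python. The `ramp_schedule` function and the numbers passed to it are made up for illustration; Facebook has not published the schedule it actually followed.

```python
def ramp_schedule(total_units: int, steps: int) -> list:
    """Return how many units are switched on in total after each step."""
    if steps < 1:
        raise ValueError("steps must be at least 1")
    # Spread the load evenly so that no single step brings everything back.
    return [total_units * i // steps for i in range(1, steps + 1)]


# Hypothetical example: 100 service units restored in four stages.
print(ramp_schedule(100, 4))  # [25, 50, 75, 100]
```

Between steps, an operator would normally wait and check power draw and error rates before moving on, which this sketch leaves out.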
“Every failure like this is an opportunity to learn and get better, and there’s plenty for us to learn from this one,” said Janardhan. “After every issue, small and large, we do an extensive review process to understand how we can make our systems more resilient. That process is already underway.”