Facebook Fabric Aggregator, a system that manages data traffic between its data centers. (Photo: Rich Miller)
Facebook said a configuration error severed the connection to its main network backbone, disconnecting all of its data centers from the Internet and leaving its DNS servers unreachable.
The unusual chain of errors disrupted operations at Facebook, Instagram and WhatsApp in a massive global outage that lasted more than five hours. In effect, Facebook said, a single errant command took down web services used by billions of accounts worldwide.
Early external analyses of the outage focused on Facebook’s domain name system (DNS) servers and routing changes in the Border Gateway Protocol (BGP), problems that were clearly visible in Internet measurements. These turned out to be symptoms, rather than causes, of Facebook’s backbone disruption.
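To see why outside observers noticed the DNS problem first: when a site’s authoritative name servers disappear, lookups for its domain simply fail. A minimal sketch of such a check (the function name is invented for illustration; `localhost` is used below only because it resolves without network access):

```python
import socket

def resolvable(hostname: str) -> bool:
    """Return True if the local resolver can map hostname to an IP address."""
    try:
        socket.gethostbyname(hostname)
        return True
    except socket.gaierror:
        # Resolution failed: no such name, or no name servers reachable.
        return False

# During the outage, checks like this against facebook.com failed worldwide,
# because the authoritative name servers had vanished from the routing table.
print(resolvable("localhost"))  # → True on a normally configured machine
```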
During planned network maintenance, “a command was issued with the intention of assessing the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally,” according to a blog post by Facebook’s Vice President of Infrastructure, Santosh Janardan.
Normally, a bad command like this would be caught by an audit tool, but “a bug in that audit tool prevented it from properly stopping the command,” Facebook said.
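The audit tool described here acts as a pre-flight check on risky commands. The sketch below is a hypothetical illustration of how such a check can fail (none of these names or rules come from Facebook’s actual tooling): a validator meant to block any command that drains every backbone link, undermined by an off-by-one bug of the general kind the post describes.

```python
# Hypothetical sketch of a command-audit check. All names and logic are
# illustrative only; they are not Facebook's actual tooling.

def audit_command(command: str, links_affected: int, total_links: int) -> bool:
    """Return True if the command is approved to run.

    Intent: reject any maintenance command that would take down
    every backbone link at once.
    """
    if command.startswith("drain"):
        # BUG (illustrative): '>' should be '>=', so a command draining
        # exactly all links slips past the safety check.
        if links_affected > total_links:
            return False
    return True

# A capacity-assessment command draining all 4 of 4 backbone links is
# wrongly approved because of the faulty comparison.
print(audit_command("drain --all-backbone-links", links_affected=4, total_links=4))  # → True
```

The point of the sketch is that the audit layer is itself software: a single logic error in the guard removes the protection it was meant to provide.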
Technical overview of the Facebook outage
Here is the section of the blog post explaining the issue and the resulting outage, which is worth reading in full:
Data traffic between all these computing facilities is managed by routers, which figure out where to send all incoming and outgoing data. And in the extensive day-to-day work of maintaining this infrastructure, our engineers often need to take part of the backbone offline for maintenance, perhaps to repair a fiber line, add more capacity, or update the software on the router itself.
This was the source of yesterday’s outage. During one of these routine maintenance jobs, a command was issued with the intention of assessing the availability of global backbone capacity, which unintentionally took down all the connections in our backbone network, effectively disconnecting Facebook data centers globally. Our systems are designed to audit commands like these to prevent mistakes like this, but a bug in that audit tool prevented it from properly stopping the command.
This change caused a complete disconnection of our server connections between our data centers and the Internet. And that total loss of connection caused a second issue that made things worse.
One of the jobs performed by our smaller facilities is responding to DNS queries. DNS is the address book of the Internet, enabling the simple web names we type into browsers to be translated into the IP addresses of specific servers. Those translation queries are answered by our authoritative name servers, which occupy well known IP addresses that are in turn advertised to the rest of the Internet via another protocol called the Border Gateway Protocol (BGP).
To ensure reliable operation, our DNS servers disable those BGP advertisements if they themselves cannot speak to our data centers, since this is an indication of an unhealthy network connection. In the recent outage, the entire backbone was removed from operation, making these locations declare themselves unhealthy and withdraw those BGP advertisements. The end result was that our DNS servers became unreachable even though they were still operational. This made it impossible for the rest of the Internet to find our servers.
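The fail-safe behavior described in the excerpt can be sketched as a simple health-check decision. This is a simplified illustration only: the function names and the `announce`/`withdraw` outcomes are invented for the example, and real BGP withdrawal is performed by routing daemons (such as BIRD or FRRouting), not application code.

```python
# Simplified sketch of the DNS-server health check described above.
# All names are illustrative; this is not Facebook's implementation.

def backbone_reachable(datacenters: list, reachable: set) -> bool:
    """A DNS edge node considers itself healthy only if it can reach
    at least one data center over the backbone."""
    return any(dc in reachable for dc in datacenters)

def bgp_action(datacenters: list, reachable: set) -> str:
    # Healthy: keep announcing the DNS server's IP prefix to the Internet.
    # Unhealthy: withdraw it so traffic is steered to other locations.
    if backbone_reachable(datacenters, reachable):
        return "announce"
    return "withdraw"

dcs = ["dc-a", "dc-b", "dc-c"]  # hypothetical data center names

# Normal operation: at least one data center is reachable.
print(bgp_action(dcs, reachable={"dc-a"}))  # → announce

# Backbone severed: nothing is reachable, so the prefix is withdrawn,
# and the still-running DNS servers become unreachable from outside.
print(bgp_action(dcs, reachable=set()))  # → withdraw
```

The design choice is sound in isolation: a DNS node that cannot reach any data center should stop attracting traffic. The failure mode arises when every node loses backbone connectivity at once, so every advertisement is withdrawn simultaneously.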
Manual restarts prolonged the outage
Recovery was made more difficult because not all Facebook data centers were accessible, and the DNS outage hampered many of the network tools that would normally be essential in diagnosing and fixing problems.
With remote management tools unavailable, affected systems had to be debugged and restarted manually by technicians in the data centers. “It took extra time to activate the secure access protocols needed to get people onsite and able to work on the servers. Only then could we confirm the issue and bring our backbone back online,” Janardan said.
The final challenge was restarting Facebook’s massive global network of data centers and managing the immediate surge in traffic, a problem that extends beyond networking to data center hardware and power systems.
“Individual data centers were reporting dips in power usage in the range of tens of megawatts, and suddenly reversing such a dip in power consumption could put everything from electrical systems to caches at risk,” Janardan said.
The data center industry exists to eliminate downtime in IT equipment by ensuring that power and network connectivity are always available. A key principle is eliminating single points of failure, and Monday’s outage illustrates how the massive networks that serve global audiences can also enable failures at unprecedented scale.
Now that the details of the outage are known, Facebook’s engineering team will assess what went wrong and work to prevent a similar issue from recurring.
“Every failure like this is an opportunity to learn and get better, and there is plenty for us to learn from this one,” Janardan said. “After every issue, small and large, we go through an extensive review process to understand how we can make our systems more resilient. That process is already underway. … From here on out, our job is to strengthen our testing, drills, and overall resilience to make sure events like this happen as rarely as possible.”