The 2010 film The Social Network was almost entirely made out of whole cloth, but one point it makes early on continues to ring true. Facebook has long had an obsessive focus on keeping the site up no matter what strain its servers were under, out of sheer competitive fear that users who couldn't access the site would turn elsewhere and never return.
That focus on keeping the site up helped Facebook to be competitive in its early years when rivals like Twitter were routinely sidelined by melting servers. But this year, well into Facebook’s decade of dominance, things have taken a turn for the worse. In July, the company experienced a day-long outage across Facebook, Instagram, and WhatsApp. That followed the company’s worst-ever outage in March, which lasted more than 24 hours.
Outages like these are becoming more serious. As Facebook increasingly positions itself as a core part of the world’s communications infrastructure, a day-long outage can have serious consequences — especially if one were to take place during a catastrophe.
So what’s going on?
In July, Facebook's official explanation for that outage was that routine maintenance had "triggered an issue." The full story is more interesting, and Mark Zuckerberg shared it with employees in the leaked audio that we began publishing here last week. So today, here's the fuller explanation of why Facebook keeps going down. Of note: the basic answer is that Facebook's massive size means that even small changes have hugely unpredictable effects, and can bring down the entire network.
So here’s Zuckerberg’s answer in full. He’s joined on stage partway through by Santosh Janardhan, vice president of engineering. The answer is highly technical — and involves terms like “storm testing,” “traffic drains,” and “slope testing” for which Google offers little explanation. But the basic answer is clear: Facebook ran some tests, and the tests knocked the system over. (The transcript has been lightly edited for clarity.)
Question: We had several major outages this half. Is our reliability becoming a problem? What is the overarching root cause and how can we fix that?
Mark Zuckerberg: I’m glad that this is the top question because it’s something I’ve been thinking about a bunch. We’ve had more downtime this year than the last few years combined. And it is an issue, and especially as we move towards more services in the private social platform area around messaging, that’s such a core utility to people that it’s really important that these services are reliable. Even from just a competition standpoint, what we see is that when we have downtimes in WhatsApp or Instagram Direct, there are people who just don’t come back. They may move their messaging behavior over to iMessage or Telegram or whatever the service is and that’s kind of it.
And then it takes months to fight and earn back people's trust and usage of our services. So yes, it's a big deal. We're doing worse on this now than we were before. We need to focus more on this … There have been a few different outages recently. But at a high level, they come from different areas. So it's not that there's one technical thing, except that the complexity of the systems is growing. So things that previously would have just been a blip are now things that are causing systems to fall over, and we're going to need to change the way that we react to that and focus a little bit more on reliability in the systems that we're engineering. So this is going to be more of a focus. We have to get this right. It's not that it's currently in a very bad place, but it's certainly trending worse than it should be. And we need to make sure we do better on this.
Santosh Janardhan: One of the risks that we run when we run these tests is that we risk pushing our system just a little over the edge so that it fails in ways that we didn't anticipate or plan for. Now this is exactly what happened last week. We were running a load test on … one of our biggest data centers. And we just pushed it over the edge, in our storage system, which is where we store our photos, our videos, our Messenger attachments, your stickers, things of that nature. And it went into a series of cascading failures.
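To make "cascading failure" concrete, here is a toy sketch of the dynamic Janardhan describes: a load test nudges a shared storage tier just past its capacity, failed requests turn into retries, and the retries deepen the overload. The capacities, traffic numbers, and retry behavior below are assumptions for illustration, not details from the transcript.

```python
# A toy model of a cascading failure, with made-up numbers -- not Facebook's
# actual systems. A shared storage tier has fixed capacity; a load test adds
# synthetic traffic on top of organic traffic. Once the tier saturates,
# failed requests come back as retries, and the overload feeds itself.

STORAGE_CAPACITY = 1000   # requests/sec the storage tier can absorb (assumed)
ORGANIC_TRAFFIC = 900     # real user load during the test (assumed)
RETRY_MULTIPLIER = 2      # retries generated per failed request (assumed)

def simulate(load_test_traffic: int, steps: int = 5) -> None:
    offered = ORGANIC_TRAFFIC + load_test_traffic
    for step in range(steps):
        served = min(offered, STORAGE_CAPACITY)
        failed = offered - served
        print(f"step {step}: offered={offered} served={served} failed={failed}")
        if failed == 0:
            print("system stable")
            return
        # Retries from failed requests stack on top of the next step's load,
        # so a small overshoot grows into a cascade.
        offered = ORGANIC_TRAFFIC + load_test_traffic + failed * RETRY_MULTIPLIER

simulate(load_test_traffic=150)   # nudges the tier "just a little over the edge"
```

Each step, the failed column grows: the overload is self-reinforcing, which is why simply waiting does not fix it.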
Now when this happens, the recovery becomes long and complex, and the mitigation takes time, which is what ended up happening. I want to touch a little bit on the reliability theme that Mark was alluding to. We will do the short-term things here, do the logs, errors, graphs, put together an error or two and fix the technical issues in the short term. What we are grappling with at this point is, are we coming to an inflection point in complexity that … We're still thinking through how to approach things.
For example, in the outage that happened last week, some of our tools and monitors that are designed to help us deal with exactly this actually failed us. They prolonged the outage. … So we are dealing with a bit of a different beast at this point. So what are we going to do about this? Two different workstreams.
One is that we're going to do something to literally tackle complexity. We're going to create new tools and augment existing ones. We're going to do failure testing so that we identify the dependency graphs and surface a bunch of bugs. Second is, actually, we want our teams to focus more on what I call fast and graceful recovery. This is something that we have not focused on before. And the last thing here is that this is going to take a little bit of time. If you look across our family of apps, we are arguably running the busiest online destination on the planet right now. And we have to tackle that complexity while at the same time keeping the site humming along. This is going to take some orchestration. We'll get there. Just bear with us.
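Janardhan does not spell out what "fast and graceful recovery" looks like in practice. One common pattern that fits the description is a circuit breaker: after repeated failures, callers stop hammering a struggling dependency and serve a degraded fallback until it recovers, rather than piling on retry load. The sketch below is illustrative only; every class and function name in it is hypothetical, and nothing here reflects Facebook's actual tooling.

```python
# Minimal circuit-breaker sketch: one common way to get "graceful recovery".
# All names are illustrative; the transcript doesn't describe Facebook's tools.

import time

class CircuitBreaker:
    def __init__(self, failure_threshold: int = 3, reset_after: float = 30.0):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.failures = 0
        self.opened_at = None   # None means the circuit is closed (healthy)

    def call(self, fn, fallback):
        # While the circuit is open, skip the struggling dependency entirely
        # and serve a degraded response instead of adding retry load.
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.reset_after:
                return fallback()
            self.opened_at = None   # cooldown elapsed: probe the dependency again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = time.time()   # trip the breaker
            return fallback()

# Example: fetching a sticker from a (hypothetical) overloaded storage service.
breaker = CircuitBreaker()

def fetch_sticker():
    raise TimeoutError("storage tier overloaded")   # simulate the outage

def cached_placeholder():
    return "placeholder sticker"

for _ in range(5):
    print(breaker.call(fetch_sticker, cached_placeholder))
```

The tradeoff is that users briefly see degraded results (a cached placeholder instead of the real sticker), but the overloaded dependency gets room to recover instead of being driven deeper into failure by retries.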