Facebook Outage: The Bigger Resilience Picture
Everything is down, my whole business is down*
Facebook, Instagram and WhatsApp are all back online after a six-hour global outage that started at around 4pm GMT yesterday (4th October). A single minute of downtime can cost Facebook over $200,000, so this outage looks to have cost the company over $70 million in six hours (and that’s before we account for the reputational damage an outage of this scale can cause). Downtime is a significant cost to businesses, especially when it’s unplanned; IDC estimates that Tier 1 banks waste around $2.5 billion a year on critical application failure.
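As a rough sanity check on that headline figure, the arithmetic can be sketched in a few lines of Python. This is a back-of-envelope estimate only: it assumes a flat cost of $200,000 per minute for the full six hours, and the function and constant names are purely illustrative.

```python
# Back-of-envelope downtime cost using the figures quoted above.
# Assumption: the per-minute cost is constant for the whole outage.
COST_PER_MINUTE_USD = 200_000  # "over $200,000" per minute of downtime
OUTAGE_HOURS = 6               # approximate length of the outage


def downtime_cost(hours, cost_per_minute):
    """Return the estimated cost of an outage lasting `hours` hours."""
    minutes = hours * 60
    return minutes * cost_per_minute


estimate = downtime_cost(OUTAGE_HOURS, COST_PER_MINUTE_USD)
print(f"Estimated cost: ${estimate:,.0f}")  # prints "Estimated cost: $72,000,000"
```

Six hours is 360 minutes, so at $200,000 a minute the estimate comes to $72 million, in line with the "over $70 million" figure.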
More than 3.5 billion people across the world use Facebook, Instagram, WhatsApp and Facebook’s own Messenger service, and not just to share memes and watch dog videos. These apps are hardwired into how we communicate, how businesses operate, how we access other services (using social sign-on) and in some parts of the world, for example India and many parts of South East Asia, Facebook is synonymous with the internet.
The domino effect
The outage has thrown into sharp relief the complex network of functions and services reliant on the availability and resilience of a single service provider. According to the New York Times, users reported being unable to access internet-connected smart devices like smart TVs and thermostats - not provided by Facebook, but accessed via Facebook credentials. Facebook and Instagram especially are part of the economic fabric too; businesses around the world, reliant on Facebook platforms to drive orders, essentially ceased to trade whilst the platforms were offline.
Nor is Facebook uniquely complex. Most enterprises of any scale operate on hybrid architectures that have grown organically over time, resulting in this kind of complexity-driven vulnerability. Enterprises now run tens of thousands of applications across thousands of workloads, which makes identifying, resolving and preventing issues incredibly difficult.
Facebook claims the outage was caused by configuration changes which affected traffic between its data centers, and the effects were not limited to users. Troublingly, the outage also affected Facebook’s internal systems, taking out internal communications platforms, locking staff out of systems and, most alarmingly, hindering engineers from physically accessing servers because their security credentials were blocked.
No-one’s too big to fail
Yesterday’s outage shows how easy it is for enterprises like Facebook to fail on a global scale, with wide-reaching ripple effects. These are the kinds of outages that trouble regulators and are the impetus behind new regulations governing Digital Operational Resilience, initially in the financial sector.
Earlier this year the Financial Conduct Authority (FCA) published its final guidance on operational resilience in the financial services sector, which comes into force in March 2022. The FCA guidance aligns with the EU's Digital Operational Resilience Act (DORA), which is currently under consultation and set to be enacted from 2023. In addition to identifying any vulnerabilities in their operational resilience, firms are expected to have:
- identified their important business services;
- set impact tolerances for the maximum tolerable disruption; and
- carried out mapping and testing to a level of sophistication necessary to do so.
The chaos wrought on Facebook’s internal systems, and the way it hindered the restoration of services, will no doubt be playing on the minds of many involved in writing these regulations and of those set to be affected by them.
Visit Cloudsoft’s Resilience Resource Centre for practical guides to Digital Operational Resilience.