Resilience lessons from the Fastly outage
Yesterday's '#internetshutdown', caused by an outage at CDN provider Fastly, demonstrated the importance of planning for failure, thinking about application reliability from a top-down perspective and setting a resilience strategy to combat the fragility of complex IT estates.
Thousands of sites were affected, including Amazon, Netflix, the BBC, the Guardian and Spotify. Whilst not being able to access your favourite show, news source or album for just shy of an hour might be only a mild inconvenience, it is more worrying that the UK Government site gov.uk was also out of action. Ultimately the issue was resolved within 45 minutes, demonstrating the importance of observability and a fast RTO, but if that fix hadn't been identified so quickly it could have caused significant issues for the UK population - the vast majority of whom now rely on being able to access Government services online and on demand.
New dependencies, new vulnerabilities
The Fastly network outage reveals the new dependencies and new vulnerabilities that are emerging from the complexity of modern technology landscapes. Yet whilst individual organisations have more complex tech stacks than ever, the vendor landscape is becoming more homogeneous - meaning a single outage has an even bigger impact on end users.
"Technology that starts out as nice to have is rapidly becoming fundamental to the way we operate. But too often resilience is an afterthought."
- Paddy McGuinness, Deputy Director of National Security until 2018.
Interoperability and regulating resilience
The drive towards regulating resilience in the Financial Services sector seeks to prevent precisely this issue affecting core banking services and global financial markets. If several large banks use the same third-party provider for a service, and that provider fails, then what? The implications of such an outage for global financial markets are huge.
The Financial Conduct Authority recently published its final guidance on operational resilience in the Financial Services sector, which comes into force in March 2022. The FCA guidance aligns with the EU's Digital Operational Resilience Act (DORA), which is currently under consultation. In addition to identifying any vulnerabilities in their operational resilience, firms are expected to have:
- identified their important business services;
- set impact tolerances for the maximum tolerable disruption; and
- carried out mapping and testing to a level of sophistication necessary to do so.
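As a rough illustration of how the three expectations above fit together, the sketch below maps some important business services to the third-party providers they depend on, gives each an impact tolerance, and tests a simulated provider outage against those tolerances. The service names, provider labels and tolerance values are hypothetical and are not taken from the FCA guidance; only the 45-minute duration echoes the Fastly incident.

```python
from dataclasses import dataclass
from datetime import timedelta
from typing import FrozenSet, List


@dataclass(frozen=True)
class BusinessService:
    """An important business service (all values here are hypothetical)."""
    name: str
    impact_tolerance: timedelta   # maximum tolerable disruption
    depends_on: FrozenSet[str]    # third-party providers found during mapping


def breaches(services: List[BusinessService],
             failed_provider: str,
             outage: timedelta) -> List[str]:
    """Return the services that rely on the failed provider and whose
    impact tolerance is shorter than the simulated outage."""
    return [
        s.name
        for s in services
        if failed_provider in s.depends_on and outage > s.impact_tolerance
    ]


# Hypothetical mapping of services to providers and tolerances.
services = [
    BusinessService("online payments", timedelta(minutes=30), frozenset({"cdn", "cloud"})),
    BusinessService("statement downloads", timedelta(hours=4), frozenset({"cdn"})),
    BusinessService("branch locator", timedelta(minutes=15), frozenset({"maps"})),
]

# Test scenario: the shared CDN provider is unavailable for 45 minutes.
breached = breaches(services, "cdn", timedelta(minutes=45))
print(breached)
```

In this made-up scenario only the service with a 30-minute tolerance is breached; a real exercise would draw the mapping and tolerances from the firm's own analysis rather than hard-coded values.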
Towards continuous resilience
Complexity breeds fragility - resilient systems need to be, by definition, more elastic.
One way to achieve this elasticity is orchestration. This approach cuts through the complexity of the landscape instead of adding to it. Gartner call this category of tooling the ‘Digital Platform Conductor’ - an innovative new breed of tool that gives technology leaders visibility of their hybrid digital infrastructure so they can ensure it delivers value. Cloudsoft AMP is a representative vendor of this type of tooling, and you can find out more about Digital Platform Conductors
With the regulatory deadlines less than 12 months away, doing nothing is no longer an option. Speak to us today about how to cut through complexity, orchestrate your estate and be compliantly, continuously resilient.