Watch us at PlatformCon! 🚀 Failure heal thyself. Is digital immunity the holy grail of platform reliability?
PlatformCon 23 is live! I'm delighted to be one of 150(ish) speakers at this year's PlatformCon, which is being watched by 20,000 people all over the world!
My talk, "Failure heal thyself. Is digital immunity the holy grail of platform reliability?" is now available to stream on YouTube... tune in below!
And if you'd like to browse our recommended talks, check out this blog!
Welcome to Failure Heal Thyself, a talk about creating a culture of digital immunity.
I'm Charlotte, and I am super excited to be delivering my first talk here at PlatformCon; it's actually my first-ever conference talk too.
I'm based in Edinburgh in the UK with my dog Bobby Dazzler (who you can see there), where I organize the UK's brand new Site Reliability Engineering Meetup. We're also building a community at SREhub.io so do come check us out!
In my day job I'm a Product Marketing Manager at Cloudsoft, where I spend a lot of my time thinking about digital reliability and resilience from both a technical and a cultural perspective. Which is why I wanted to talk today about creating a culture of digital immunity.
But what is digital immunity? What do we actually mean by the term?
Well, if we ask industry analysts Gartner, they would tell us that digital immunity is a combination of practices and technologies for software design, development, operations and analytics which mitigate business risks.
So far so Gartner. But I think we can de-Gartner this definition just a little bit. With humans and with animals, we administer vaccines to help our immune systems learn how to heal before we get too sick. And this is important because vaccines don't stop us from getting sick; they train our immune systems to respond to threats and to reduce their impact. So we might still get sick, but we get far less sick than we otherwise would have, and we recover much more quickly. And we can apply this theory to our digital environments too.
So, to go back to our definition: a culture of digital immunity can help prevent our tech ecosystems from getting too sick and, if they do get sick, help them to recover much more quickly.
So when we're talking about digital immunity we're really talking about reliability.
But much like in people, protection from illness doesn't just come from interventions like vaccines. It also comes from daily habits and hygiene practices. Thanks to the work of people like Florence Nightingale, we know that poor hygiene leads to illness in humans. But we had to learn this, and we had to adopt new practices to absorb that knowledge into our lives. And this is an ongoing process: anyone listening from the UK will remember being told at the outset of the pandemic that we should sing Happy Birthday three times whilst we washed our hands to try and avoid infection.
So, in other words, hygiene could be described as a set of cultural practices. And just like in humans, a culture of poor digital hygiene can cause serious problems in our digital systems too. If we were to take a swab of our digital systems and smear it on a petri dish, here are some of the things I think would pop up: poor governance; poor or no automation; a lot of toil (manual work which could be automated but often isn't); a lack of observability into the system; poor change control; and silos, which result in bad communication and ultimately a lack of trust between the people trying to deliver all of this. Together, these things make our digital systems weaker and a lot less resilient.
And the result of this is frequent outages, a high cognitive load for your development teams, and slow recovery from failures when they inevitably happen. And this downtime has a massive impact on both the organization and its customers. You may remember earlier this year when an outage at the Federal Aviation Administration grounded all flights in the US, or AWS, which had an outage late last year where power loss took out a major Availability Zone, affecting thousands of businesses.
But downtime isn't just inconvenient; it's also incredibly expensive. In fact, a report from the Uptime Institute found that in 2022, 25% of outages cost in excess of a million dollars, up from just 11% three years previously in 2019. And Gartner finds that digital immunity can reduce downtime by up to 80 percent.
So improving your immunity, and your reliability, can save you tens or hundreds of thousands of dollars every year in lost revenue, reputational damage with your customers, and potential regulatory fines if you happen to work within a regulated industry.
So how do we go about creating the culture that allows this kind of immunity to flourish?
Let's remind ourselves of the definition we reached earlier: a culture of digital immunity can help prevent tech ecosystems from getting too sick, and help them recover more quickly when they do get sick. And, as we touched on, culture is ultimately a set of shared practices.
So here are my recommended practices for creating your culture of digital immunity.
So there are six practices here:
- Auto-remediation: using automation to resolve issues.
- Chaos engineering: injecting failures into production or test environments to force your system to fail, which helps you identify issues with both your processes and your technologies before they cause too much of a problem.
- Site Reliability Engineering (SRE): an agreed set of practices, often aligned to a group of people within an organization, for responding to and learning from failures.
- Observability: the ability to measure your system's current state based on data like logs, metrics and traces.
- Test automation: (funnily enough) running tests automatically.
- Toil reduction: automating away repetitive, manual, low-value tasks.
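To make chaos engineering a little more concrete, here's a minimal sketch in Python. It's my own illustration, not something from the talk: a wrapper that randomly injects failures into a dependency, plus a retry loop that has to survive them. All the names here (`inject_failure`, `call_with_retries`) are hypothetical.

```python
import random

def inject_failure(fn, failure_rate=0.2, rng=None):
    """Wrap a function so it sometimes raises, simulating an unreliable dependency."""
    rng = rng or random.Random()
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise RuntimeError("chaos: injected failure in " + fn.__name__)
        return fn(*args, **kwargs)
    return wrapper

def call_with_retries(fn, attempts=5):
    """A caller that must tolerate failure: retry up to a fixed budget."""
    for attempt in range(attempts):
        try:
            return fn()
        except RuntimeError:
            continue
    raise RuntimeError("service unavailable after %d attempts" % attempts)

# Seeded RNG so the experiment is repeatable: the first call fails, the retry succeeds.
flaky_lookup = inject_failure(lambda: "user-42", failure_rate=0.5,
                              rng=random.Random(1))
print(call_with_retries(flaky_lookup))  # → user-42
```

The point of the experiment isn't the wrapper itself; it's discovering whether your callers (and your people) actually cope when the dependency misbehaves.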
Taken together, these practices underpin your immunity culture, but what they look like and how they are delivered is up to you and what's right for your organization. For example, you could have a standalone SRE function which serves different areas of the business, or you could have SRE embedded into your product or platform teams, which personally I think is the way to go.
So I've got time to touch on a couple of these in a little more detail, so I'm going to focus on SRE and auto-remediation.
First we have SRE. SRE is not new; in fact, Google coined the term around 20 years ago and then literally wrote the book on the subject. But it is having something of a renaissance as reliability climbs up the agenda once more.
SREs use automation to deliver high-quality services with maximum uptime, minimal disruption and, therefore, a higher-quality customer experience. And these skills are in high demand because they are essential for the successful operation of increasingly complex digital systems. But as SRE has slowly matured, its role is transforming: SRE is shifting left to have more influence at the design stage of the development process, rather than just the production phase. By moving its influence earlier and earlier, SRE is no longer just the medicine you take to resolve a failure. It's a critical part of your digital hygiene.
In fact you could say it's preventative medicine.
And this shift left means SREs aren't just resolving failures in production; they're also stopping faulty services from getting into production in the first place. This is where SRE aligns beautifully with platform engineering. With SRE on board, your golden path to production can be built on a foundation of reliability. SRE can inform the development process through blueprinting and automation, codifying best practices and guardrails into the golden path to production and, crucially, not adding reliability to developers' cognitive load.
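As one illustration of what codifying guardrails into a golden path might look like, here's a hedged Python sketch: a pre-deployment check that refuses to ship a service whose manifest is missing an owner, an SLO or a runbook. The manifest schema and every field name here are my own assumptions for illustration, not any real platform's API.

```python
# Hypothetical manifest fields; a real golden path would define its own schema.
REQUIRED_FIELDS = {"owner", "slo_availability", "runbook_url", "alerting_channel"}

def guardrail_check(manifest):
    """Return a list of guardrail violations; an empty list means the service may ship."""
    problems = ["missing required field: " + f
                for f in sorted(REQUIRED_FIELDS - manifest.keys())]
    slo = manifest.get("slo_availability")
    if slo is not None and not (0.0 < slo <= 1.0):
        problems.append("slo_availability must be a fraction in (0, 1]")
    return problems

manifest = {"owner": "team-payments", "slo_availability": 0.999,
            "runbook_url": "https://example.internal/runbooks/payments",
            "alerting_channel": "#payments-oncall"}
print(guardrail_check(manifest))  # → []
```

A check like this would typically run in CI, so reliability requirements are enforced by the path itself rather than remembered by every developer.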
SREs are also the ones who can help enable our next practice, which is auto-remediation. Auto-remediation is what powers your digital immune system; it's the heart of it. It's about sensing issues and automatically fixing them, and it saves you time, saves you money and reduces the impact of inevitable failures in complex systems.
Successful auto-remediation requires connecting monitoring and alerting tools with blueprints of what a service should look like, and policies to enact should that service drift or miss its defined service level objectives. And all of this is held together with automation: you can describe your remediation policy for a particular event and automatically run it as many times, or as often, as you need without human intervention. And you can update that policy with your learnings from each failure, so you may never need to run it again, because you've stopped the failure from happening again.
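The talk doesn't prescribe any particular tooling, but the sense-compare-fix loop can be sketched in a few lines of Python. Everything here (the `BLUEPRINT`, `observe`, `remediate`) is a hypothetical stand-in for real monitoring and orchestration systems.

```python
# Hypothetical blueprint: what the service *should* look like.
BLUEPRINT = {"replicas": 3, "max_error_rate": 0.01}

def observe(service):
    """Stand-in for real monitoring: report the service's current state."""
    return {"replicas": service["replicas"], "error_rate": service["error_rate"]}

def remediate(service, state):
    """Remediation policy: restore the blueprinted state, no human required."""
    actions = []
    if state["replicas"] < BLUEPRINT["replicas"]:
        service["replicas"] = BLUEPRINT["replicas"]
        actions.append("scaled replicas back to blueprint")
    if state["error_rate"] > BLUEPRINT["max_error_rate"]:
        service["error_rate"] = 0.0  # stand-in for recycling the failing instance
        actions.append("recycled failing instance")
    return actions

# A service that has drifted from its blueprint after a failure.
service = {"replicas": 1, "error_rate": 0.2}
print(remediate(service, observe(service)))
# → ['scaled replicas back to blueprint', 'recycled failing instance']
```

In a real system, `observe` would be fed by your alerting pipeline and `remediate` would call your orchestrator; the structure (blueprint, comparison, policy) is what carries over.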
So let's look at a couple of incident timelines to show you how powerful this is, and I'm sure many of you will recognize the first timeline.
So this is an example of a system where auto-remediation is not in place. The service is running really nicely in production at 3:37am. But at 3:39 there's a failure: the service is unavailable, and the monitoring system detects it and alerts the on-call engineer. This poor soul potentially has to drag themselves out of bed and spend a few minutes getting their bearings (and getting some coffee) before they can even begin to identify and understand the failure in front of them. They then have to find the right runbook to resolve the failure, and test and implement a workable solution, because there may have been drift since the runbook was produced. This means it can be a whole two and a half hours from failure to recovery which, if this is a common scenario within an organization, adds up to a really high mean time to recover.
Contrast this with an example in which auto-remediation is in place, and you can see it can take just three minutes for a service to go from unavailable to available. This lowers the mean time to recover massively and, most importantly, nobody had to get out of bed at 3am.
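If you want to put numbers on those two timelines: mean time to recover is just the average gap between failure and recovery across your incidents. A tiny sketch, using the talk's 3:39am failure time (the date and the recovery timestamps are illustrative assumptions):

```python
from datetime import datetime

def mttr_minutes(incidents):
    """Mean time to recover, in minutes, over a list of (start, end) pairs."""
    total = sum((end - start).total_seconds() for start, end in incidents)
    return total / len(incidents) / 60

# Manual response: failure at 3:39, recovery two and a half hours later.
manual = [(datetime(2023, 6, 1, 3, 39), datetime(2023, 6, 1, 6, 9))]
# Auto-remediated: failure at 3:39, recovery three minutes later.
auto = [(datetime(2023, 6, 1, 3, 39), datetime(2023, 6, 1, 3, 42))]
print(mttr_minutes(manual), mttr_minutes(auto))  # → 150.0 3.0
```

Tracking this metric over a rolling window is a simple way to show whether your auto-remediation investment is paying off.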
So you can see how auto-remediation reduces recovery time, reduces developer toil and, over time, improves reliability. These things come together to boost digital immunity.
So to wrap up I'll leave you with my five steps to start building your own digital immune system.
- First off, it's really important that you get executive buy-in. This helps you get going, gets you funding, and hopefully opens up any doors that might be closed to you!
- You should decide what digital immunity looks like for you, what the vision is for it and then roadmap how and when you're going to adopt different immunity practices.
- You should also get buy-in from your developer, platform and SRE communities. Let them know what's happening, educate them on why you want to make changes and the benefits of those changes, and bring them along for the ride; after all, they're the ones who will be responsible for implementing and maintaining your digital immune system.
- It's really tempting to try and fix everything all at once, but do start small! Identify a first project where establishing digital immunity can make a real impact without being too resource-intensive; you want to show value, and you want to show it quickly.
- Then you can celebrate your success and scale it from there.
So it just leaves me to say thank you very much for watching my talk; I hope you enjoyed it and found it useful, and Bobby Dazzler also says thank you for watching. If you'd like any more resources, you can find them at SREhub.io, and if you have thoughts, comments or questions, please do drop them in the comments of the YouTube video; I'm more than happy to answer them. Or you can get in touch with me via LinkedIn, email, or the Platform Engineering community Slack. Thank you very much.