What is digital immunity?
Digital immunity takes the drama out of downtime. Establishing digital immunity practices can help to improve operational resilience, de-risk innovation and boost customer experience.
It can help us to balance speed with quality, by making sure that resilience and reliability are embedded into software development practices and by delivering products that are better able to recover quickly from inevitable failures.
6 Practices to build Digital Immunity
Digital Immunity, fully realised, is about building self-healing applications and ensuring failures don’t spiral out of control.
According to Gartner, there are six components to a Digital Immune System:
1. Observability
Can your engineers ask their systems how they’re doing?
Observability gives you the ability to ask your systems how they’re performing, in real time. By surfacing that information, you are better able to head off service-impacting incidents before the blast radius grows too large, and those who need to know can measure the system against its KPIs and SLOs.
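To make "measured against SLOs" a little more concrete, here is a minimal sketch in Python of comparing observed request counts with an availability target. The names (`SloReport`, `evaluate_slo`) and the 99.9% target in the usage note are illustrative assumptions, not part of any particular observability tool.

```python
from dataclasses import dataclass


@dataclass
class SloReport:
    availability: float       # fraction of successful requests observed
    error_budget_used: float  # share of the allowed failures already consumed
    breached: bool            # True once the error budget is fully spent


def evaluate_slo(total_requests: int, failed_requests: int, slo_target: float) -> SloReport:
    """Compare observed availability with an SLO target such as 0.999."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1")

    availability = 1 - failed_requests / total_requests
    # The error budget is the number of failures the SLO tolerates.
    allowed_failures = (1 - slo_target) * total_requests
    budget_used = failed_requests / allowed_failures
    return SloReport(availability, budget_used, budget_used >= 1.0)
```

For example, `evaluate_slo(100_000, 50, 0.999)` reports that roughly half of the error budget has been used, which is the kind of early signal that lets a team act before an incident grows.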
2. AI-Augmented Testing
Test automation is an important element of balancing speed and quality. As delivery cadences become faster, automating testing can ensure that testing doesn’t become a bottleneck.
AI-augmented testing can take this a few steps further.
Augmenting testing with AI can help to:
- Predict risk and prioritise tests
- Optimise tests and test environments
- Reduce test costs
- Generate meaningful synthetic test data with NLP
- Maintain test environments
3. Chaos engineering
“The last strand that breaks is not the cause of failure.”
Chaos engineering is the intentional injection of faults and failures to see how applications behave under duress. Do they call automation that helps them to heal? Does the failure set off a chain reaction of error messages?
In complex environments, chaos engineering is a powerful way to spot weak points you might not have anticipated, and to enhance your digital immunity.
Chaos engineering also helps to foster a reliability mindset, and to create psychological safety for teams to see the value in failure and use it as a learning opportunity.
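As a rough illustration of fault injection, the Python sketch below wraps a call so that a chosen fraction of invocations fail on purpose, then measures how often a simple retry policy still succeeds. Everything here (`with_chaos`, `call_with_retry`, `fetch_price`, the 30% failure rate) is a hypothetical stand-in; real chaos experiments usually inject faults at the infrastructure or network level rather than in-process.

```python
import random


class FaultInjected(Exception):
    """Raised in place of a real dependency failure."""


def with_chaos(func, failure_rate, rng):
    """Wrap func so that roughly failure_rate of calls fail on purpose."""
    def wrapper(*args, **kwargs):
        if rng.random() < failure_rate:
            raise FaultInjected(f"injected failure in {func.__name__}")
        return func(*args, **kwargs)
    return wrapper


def call_with_retry(func, attempts=3):
    """The resilience behaviour under test: retry a few times, then give up."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except FaultInjected as error:
            last_error = error
    raise last_error


def fetch_price():
    """Hypothetical downstream call."""
    return 42


def run_experiment(calls=1000, failure_rate=0.3, seed=7):
    """Return the fraction of calls that succeeded despite injected faults."""
    rng = random.Random(seed)  # seeded so the experiment is repeatable
    flaky = with_chaos(fetch_price, failure_rate, rng)
    survived = 0
    for _ in range(calls):
        try:
            call_with_retry(flaky)
            survived += 1
        except FaultInjected:
            pass  # the retry policy was not enough for this call
    return survived / calls
```

Comparing the result of `run_experiment()` with and without retries (`attempts=1`) gives a simple, measurable view of how much a resilience mechanism actually helps.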
4. Auto-remediation
Auto-remediation is what powers your digital immune system and limits the blast radius of failures.
Auto-remediation is about sensing issues and automatically fixing them, saving you time and money and reducing the impact of inevitable failures.
Successful auto-remediation requires connecting monitoring and alerting tools with blueprints of what a service should look like, and with policies to apply should the service drift or fail to meet its defined service level objectives. All, of course, held together by automation.
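The relationship between monitoring data, a blueprint and remediation policies can be sketched as a small reconciliation function. This is a simplified Python illustration under assumed names (`ServiceState`, `Blueprint`, `plan_remediation`) and assumed actions; a real setup would trigger automation rather than return strings.

```python
from dataclasses import dataclass


@dataclass
class ServiceState:
    """What monitoring currently observes."""
    replicas: int
    error_rate: float  # fraction of failed requests in the last window


@dataclass
class Blueprint:
    """What the service should look like."""
    replicas: int          # how many instances should be running
    max_error_rate: float  # service level objective on errors


def plan_remediation(observed: ServiceState, desired: Blueprint) -> list[str]:
    """Return the policy actions to run for any drift that was detected."""
    actions = []
    if observed.replicas < desired.replicas:
        missing = desired.replicas - observed.replicas
        actions.append(f"start {missing} replica(s)")
    if observed.error_rate > desired.max_error_rate:
        actions.append("roll back last release")
    return actions
```

A healthy service yields an empty list, so the same function can run on every monitoring cycle and only produce work when the service has drifted from its blueprint.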
5. Site Reliability Engineering
Site Reliability Engineering (SRE) is about balancing the need for velocity with the need to mitigate risk. SREs do this by working to the principles of:
- continuous monitoring
- implementing small, frequent changes
- codifying and automating best practice
Taken together, the above Digital Immunity practices are the building blocks of an SRE function.
6. Supply chain security
Security is a big part of reliability; after all, attacks can take out services even if they aren’t DoS attacks!
Modern software is built from components, libraries, tools and processes which come together into a ‘software supply chain’. Some of these components might be third-party or open source, and so you aren’t entirely responsible for their security.
Software teams can mitigate this, and contribute to digital immunity, by enforcing strong version-control policies, maintaining libraries of trusted content and managing vendor risk throughout the delivery cycle.
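One simple way to think about a "library of trusted content" is as a list of approved artifacts pinned to known digests. The Python sketch below assumes a hypothetical in-memory allow-list and a made-up artifact name; real pipelines would typically rely on lock files, signed packages or a private registry instead.

```python
import hashlib


def sha256_hex(data: bytes) -> str:
    """Return the SHA-256 digest of data as a hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_artifact(name: str, data: bytes, trusted: dict[str, str]) -> bool:
    """Accept an artifact only if it is listed and its digest matches."""
    expected = trusted.get(name)
    if expected is None:
        return False  # not in the trusted library: reject by default
    return sha256_hex(data) == expected


# Hypothetical allow-list: artifact name -> pinned SHA-256 digest.
TRUSTED = {"example-lib-1.0.tar.gz": sha256_hex(b"known good contents")}
```

With this in place, `verify_artifact("example-lib-1.0.tar.gz", downloaded_bytes, TRUSTED)` only returns `True` when the downloaded bytes match what was originally approved, so a tampered or unknown dependency is rejected before it reaches a build.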
Why digital immunity is more important than ever
Simply put: our digital systems are more complicated than ever before, and they’re only getting more complicated.
This digital complexity, coupled with faster and faster release cycles, results in a lot of poorly understood dynamics and dependencies. And we don’t always own the infrastructure, tools and code our systems run on.
Digital immunity is a strategy to improve the reliability of those systems and a mindset which accepts that sometimes they will fail, but we need to develop ways to ensure they recover quickly.
Hear our take on Digital Immunity at PlatformCon23!
On 08-09 June, I’ll be taking to the (virtual) stage at PlatformCon23 to talk about building a culture of digital immunity. Tune in to hear me, and a tonne of incredible speakers, talk about all things Platform!