Building digital immunity with Chaos Engineering

Chaos Engineering is the practice of intentionally injecting failures to see how your complex systems *really* perform under stress. It is part of a framework for digital immunity, which can reduce downtime by 80%.

This is helpful for Site Reliability Engineers (SREs) because it lets them proactively identify potential failures and weaknesses in their systems. The upshot is more resilient systems, reduced downtime, and improved reliability.

A brief history of Chaos Engineering

Chaos engineering practices are derived from Chaos Theory, which studies how complex and changing systems behave in response to seemingly random events. In complex, distributed systems, a data centre glitch or a missed bug can spiral into a huge and costly outage. Remember a couple of years ago when a customer configuration change took down 85% of Fastly’s network?

But the goal of chaos engineering is not chaos; it is improved reliability.

That is why engineers now run tightly controlled, hypothesis-driven chaos experiments. Controlling the experiments in this way helps to collect useful data to improve the system and design future experiments.

This differs from testing: testing seeks to validate expected behaviour, whilst chaos engineering aims to uncover unexpected behaviour, both in similarly controlled environments and in production, where it can surface real-world issues that might not rear their heads in testing.
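The shape of a controlled, hypothesis-driven experiment can be sketched in a few lines of Python. This is a minimal illustration, not a real chaos tool: `ToyService`, `run_experiment`, `kill_replica`, `kill_all` and `restore_all` are hypothetical names invented for the example, and the "system" is an in-memory stand-in for three replicas behind a load balancer. The hypothesis is that the success rate stays at or above a threshold while a failure is injected.

```python
from dataclasses import dataclass, field


@dataclass
class ToyService:
    """Hypothetical stand-in for a real system: three replicas behind a load balancer."""
    replicas: dict = field(default_factory=lambda: {"a": True, "b": True, "c": True})

    def handle_request(self) -> bool:
        # A request succeeds if at least one replica is still healthy.
        return any(self.replicas.values())

    def success_rate(self, requests: int = 100) -> float:
        ok = sum(self.handle_request() for _ in range(requests))
        return ok / requests


def run_experiment(service, inject, rollback, threshold=0.99):
    """Run one experiment: check steady state, inject a failure, observe, roll back."""
    # 1. Confirm the steady state first; abort if the system is already unhealthy.
    baseline = service.success_rate()
    if baseline < threshold:
        return {"status": "aborted", "baseline": baseline}
    # 2. Inject the failure and measure the same metric again.
    inject(service)
    try:
        observed = service.success_rate()
    finally:
        # 3. Always roll back, so the experiment stays controlled.
        rollback(service)
    # 4. The hypothesis holds if the metric stayed above the threshold.
    status = "passed" if observed >= threshold else "failed"
    return {"status": status, "baseline": baseline, "observed": observed}


def kill_replica(name):
    """Build an injection that takes a single named replica offline."""
    def inject(service):
        service.replicas[name] = False
    return inject


def kill_all(service):
    """Injection that takes every replica offline."""
    for name in service.replicas:
        service.replicas[name] = False


def restore_all(service):
    """Rollback that brings every replica back online."""
    for name in service.replicas:
        service.replicas[name] = True


service = ToyService()
report = run_experiment(service, kill_replica("a"), restore_all)
print(report)
```

Losing one replica leaves the hypothesis intact, whereas passing `kill_all` as the injection makes the experiment report `failed`. A failed hypothesis is the useful outcome here: it is the data that feeds system improvements and the design of the next experiment.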

Chaos Engineering shines the light of reliability and resiliency on engineers' assumptions and educated guesses, exposing actual weaknesses before they are a career- or business-ending catastrophe.

- Myra Haubrich, Senior SRE, Adobe Experience Platform

Why Chaos Engineering helps to build digital immunity

Digital Immunity is a set of practices for reliability and resilience. 

These practices are:

  • auto-remediation
  • chaos engineering
  • site reliability engineering
  • observability
  • test automation
  • toil reduction

If we compare digital immunity to human immunity, Chaos Engineering could be seen in the same light as a vaccine; it’s about exposing both the digital and human elements of our systems to a controlled threat or failure so we can build up the knowledge and technical requirements to recover from worse situations in the future. 

Chaos Engineering experiments can expose where auto-remediation is needed and where human intervention can be automated away.
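As a rough sketch of how experiments can surface auto-remediation gaps, the Python below injects a series of faults into a toy system and records which ones it could not recover from on its own. Everything here is hypothetical and invented for illustration: `ToySystem`, `restart_process`, `find_remediation_gaps` and the fault names do not come from any real tool.

```python
class ToySystem:
    """Hypothetical system that tracks active faults and any known automated fixes."""

    def __init__(self, remediations):
        self.faults = set()
        # Maps a fault name to a function that repairs it automatically.
        self.remediations = remediations

    def inject(self, fault):
        self.faults.add(fault)

    def auto_remediate(self):
        # Apply every automated fix that exists for the currently active faults.
        for fault in list(self.faults):
            fix = self.remediations.get(fault)
            if fix is not None:
                fix(self)

    def healthy(self):
        return not self.faults


def restart_process(system):
    """Automated fix for a crashed process (assumed to always succeed here)."""
    system.faults.discard("process_crash")


def find_remediation_gaps(system, faults):
    """Inject each fault in turn and collect those the system cannot fix by itself."""
    gaps = []
    for fault in faults:
        system.inject(fault)
        system.auto_remediate()
        if not system.healthy():
            gaps.append(fault)
            # Stands in for a manual fix by an operator before the next experiment.
            system.faults.clear()
    return gaps


system = ToySystem({"process_crash": restart_process})
gaps = find_remediation_gaps(
    system, ["process_crash", "disk_full", "expired_certificate"]
)
print(gaps)
```

Each fault left in `gaps` is one that still needed a human, which makes it a candidate for a new automated remediation.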

Get started with Chaos Engineering to boost digital immunity

If you want to build digital immunity and reliability by implementing chaos engineering in concert with other digital immunity practices, such as auto-remediation and toil reduction, then Cloudsoft AMP is a great place to start.

AMP can work as a chaos agent, coordinating various tools to inject failures, and validating your system’s responses. 

