SRE Automation: the good, the bad (and the toily)
Toil is work that tends to be manual, repetitive, tactical and devoid of long-term value: software rollouts and rollbacks, configuration changes, restarts and so on. It also tends to grow in proportion to the service, meaning you need more people to tackle the same value-draining problems.
Automation can be a panacea for toil, but with 68% of SREs surveyed saying that they spend most of their time building and maintaining one-off automation code, it’s important to distinguish between Good Automation and Bad Automation. The last thing automation should be doing is adding to toil!
Good automation vs bad automation
Bad Automation is not repeatable, is resource-intensive to maintain and is almost impossible to scale quickly. It also creates knowledge silos, which lead to key-person dependency and create more risk and toil.
Good Automation is built on the principles of composability. It should be well documented, accessible, easily maintainable and repeatable across different contexts; after all, it’s there to help you scale! SREs can use good automation to improve reliability, introduce auto-remediation and scale these best practices across their organisation.
But to see where good automation can make the biggest impact, we first need to identify the top sources of manual toil.
These will vary depending on your organisation, but toil tasks fall into three broad categories:
1. Event response
The dream is obviously to have services which run perfectly all the time, but the reality is that our services aren’t perfect and will fail from time to time.
Event response is therefore a major focus for SREs, whose goal is to restore service as quickly as possible before identifying a root cause that can be more widely addressed.
Where Good Automation can help with event response:
- Auto-remediation. Use automation to ingest and stitch together data from your monitoring, alerting and ITSM tools and automate the response to quickly resolve the event.
The benefits of Good Automation here are reducing toil and keeping MTTR nice and low.
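As a rough illustration of the auto-remediation idea above, the sketch below maps alert symptoms to automated actions and escalates anything it does not recognise. The alert fields, symptom names and actions are all hypothetical; a real setup would receive alerts from your monitoring and alerting tools and call your own orchestration or ITSM tooling instead of returning strings.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass(frozen=True)
class Alert:
    service: str
    symptom: str  # hypothetical values, e.g. "unresponsive" or "disk_full"


# Hypothetical remediation actions. Here they only describe what they
# would do; in practice they would trigger real tooling.
def restart_service(alert: Alert) -> str:
    return f"restart {alert.service}"


def clear_temp_files(alert: Alert) -> str:
    return f"clear temp files on {alert.service}"


# The runbook is plain data, so it can be reviewed, versioned and reused.
RUNBOOK: Dict[str, Callable[[Alert], str]] = {
    "unresponsive": restart_service,
    "disk_full": clear_temp_files,
}


def remediate(alert: Alert) -> str:
    """Run the matching runbook action, or escalate to a human."""
    action = RUNBOOK.get(alert.symptom)
    if action is None:
        return f"escalate {alert.service} to on-call"
    return action(alert)
```

Keeping the runbook as a shared lookup table, rather than burying it in one-off scripts, is one way to make the response repeatable across services.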
2. Environment maintenance
Where possible, you want to automate environment maintenance to reduce developer toil.
Environment maintenance includes tasks like:
- manual activities around deployments, upgrades, and test environments.
- building and maintaining ad-hoc automation for immediate fixes.
Environment maintenance also starts at the design stage: SREs should shift left and influence architectural design decisions to ensure reliability and scale.
This is where SRE and Platform Ops can overlap. SREs can build reliability and automation into reusable application blueprints, shared via developer platforms and forming a golden path to production.
This not only reduces toil for SREs, but also for developers, who can build, test, deploy and upgrade their applications with confidence.
3. Continuous compliance
Compliance with both internal and external policies is a constant challenge for SREs. Service reliability is increasingly regulated (e.g. the EU’s DORA and the UK’s FCA regulations) and downtime is less and less tolerable.
Sources of toil when it comes to compliance can include:
- checking compliance against best practices
- instrumenting applications to make them observable
- improving service design in response to a breach of compliance.
But by combining blueprinting, composability and contextual observability, Good Automation enables continuous compliance and reduces toil. How? It aligns monitoring to your architecture and can provide greater consistency across complex hybrid estates.
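To make the idea of a continuous compliance check concrete, here is a minimal sketch that compares each service's declared settings against a policy and reports the gaps. The policy keys and service names are hypothetical, and the check is deliberately simple; it is an illustration of the pattern rather than any particular product's method.

```python
from typing import Dict, List

# Hypothetical policy: settings every service is required to declare.
POLICY: Dict[str, bool] = {
    "monitoring_enabled": True,
    "backup_enabled": True,
}


def find_violations(services: Dict[str, Dict[str, bool]]) -> List[str]:
    """Return one message per setting that is missing or differs from policy."""
    violations = []
    for name in sorted(services):
        for key, required in POLICY.items():
            if services[name].get(key) != required:
                violations.append(f"{name}: {key} must be {required}")
    return violations
```

Run on a schedule or on every change, a check like this turns a manual audit into a repeatable step.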
How to implement Good Automation?
Your technology environment spans applications, infrastructure, configuration, policies and core services. By extending Infrastructure-as-Code so that every entity in your environment is declared as code, you can manage your environments in their entirety.
When everything is expressed as code, it becomes much easier to add automation.
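One reason this holds is that a declared environment can be compared with the running one mechanically. The sketch below, which assumes a hypothetical and much simplified model where each entity is a name mapped to its settings, works out which entities need to be created, updated or deleted to match the declaration.

```python
from typing import Any, Dict, List


def plan(desired: Dict[str, Any], actual: Dict[str, Any]) -> List[str]:
    """List the steps needed to bring the actual state in line with the desired one."""
    steps = []
    for name in sorted(desired.keys() | actual.keys()):
        if name not in actual:
            steps.append(f"create {name}")
        elif name not in desired:
            steps.append(f"delete {name}")
        elif desired[name] != actual[name]:
            steps.append(f"update {name}")
    return steps
```

Once the difference is computed rather than discovered by hand, applying it can itself be automated.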
Good Automation with Cloudsoft AMP
AMP is a platform-based software solution which combines powerful automation with everything-as-code capabilities to make your complex tech ecosystems manageable, scalable and reliable.
A Tier 1 Bank implemented AMP and achieved:
- 95% reduction in MTTR
- 100% hands-off environment maintenance
- 75% reduction in toil