Reducing toil with automation
Site Reliability Engineering (SRE), Platform and Operations teams often find themselves heavily involved in tasks that are manual and highly repetitive. This type of work is known as toil.
Toil: work that is manual, repetitive, automatable and reactive, and that lacks enduring value.
Examples of toil are:
- Scheduling system backups
- Software upgrades and system configuration changes
- Responding to events (e.g., out-of-memory, high CPU usage, scaling requests)
- Responding to incidents, such as distributed denial of service (DDoS) attacks, data breaches, or major business-impacting incidents
For many teams, toil can be all-consuming, with some spending 90% of their time or more on it. These teams struggle to find the time to improve productivity and accuracy, let alone scale SRE operations by automating these often highly automatable tasks.
Toil therefore reduces the impact and value of SRE. Calling a task toil does not mean that it is unimportant or should not take place. It means that the work does not add enduring engineering value.
The human impact of toil
Teams with high levels of toil are typically unhappy teams, as toil can seriously damage morale.
Teams with a high toil-to-engineering work ratio often lack job satisfaction and are less likely to succeed as a result. High levels of toil typically also lead to higher-than-normal staff turnover.
Reducing toil
Ultimately, automation is the key to reducing toil. Whilst it can be tempting for organisations to aim for zero toil, this is not practical or even possible to achieve.
The goal should be to rightsize toil to a manageable level. What that level is will depend on the individual organisation's size and growth rate. Industry analysts Gartner, for example, recommend that no more than 50% of a site reliability engineer’s time be spent on toil.
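To make a ceiling like the 50% figure concrete, a team first needs to measure its own toil ratio. The sketch below is one possible way to do that from a simple time log. The `TimeEntry` fields, the `TOIL_CAP` constant and the sample tasks are all hypothetical, and deciding which entries count as toil remains a judgment call for the team.

```python
from dataclasses import dataclass

# Assumed cap, taken from the 50% guideline quoted in the text.
TOIL_CAP = 0.50


@dataclass
class TimeEntry:
    """One logged block of work; the fields are illustrative."""
    task: str
    hours: float
    is_toil: bool


def toil_ratio(entries: list[TimeEntry]) -> float:
    """Return the share of logged hours spent on toil (0.0 to 1.0)."""
    total = sum(e.hours for e in entries)
    if total == 0:
        return 0.0  # nothing logged, so no toil to report
    toil = sum(e.hours for e in entries if e.is_toil)
    return toil / total


# Made-up sample week for a single engineer.
week = [
    TimeEntry("restart stuck workers", 12.0, True),
    TimeEntry("manual config rollout", 10.0, True),
    TimeEntry("build self-service deploy pipeline", 18.0, False),
]

ratio = toil_ratio(week)
print(f"toil ratio: {ratio:.0%}, over cap: {ratio > TOIL_CAP}")
```

Tracking this number over several weeks, rather than a single snapshot, gives a better sense of whether automation efforts are actually moving it.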
Benefits of reducing toil
- Scalability: Supports growing operational work without any corresponding increase in resource count.
- Improved morale: Teams can focus on higher-value, cognition-based tasks, which leads to higher job satisfaction, productivity and success, and in turn reduces staff turnover.
- Increased reliability: Reducing toil helps create resilient, performant, predictable and durable systems.
- Improved user or customer experience: Automating toil reduces the time to detect and resolve issues, so they have less impact on your customers.
Where do I start?
SRE teams will have many processes and workflows that can and should be automated, so where should you start?
- Identify high-impact use cases.
- Prioritise the automation efforts that will deliver the greatest benefit by identifying the biggest constraint in the workloads.
- Then identify the next biggest benefit and constraint, and so on.
- Look at undertaking a proof of value around one of your high-impact use cases with a tool like Cloudsoft AMP that will reduce your toil, and fast.
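The prioritisation steps above can be sketched as a simple ranking exercise. This example assumes that hours spent per month is a reasonable proxy for the "biggest constraint"; your own measure might instead be error rate or customer impact. The `Candidate` class, its fields and the figures are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Candidate:
    """A manual task being considered for automation (illustrative fields)."""
    name: str
    runs_per_month: int
    minutes_per_run: float

    @property
    def hours_per_month(self) -> float:
        # Total monthly time cost of doing this task by hand.
        return self.runs_per_month * self.minutes_per_run / 60


def rank_by_impact(candidates: list[Candidate]) -> list[Candidate]:
    """Order candidates so the biggest monthly time sink comes first."""
    return sorted(candidates, key=lambda c: c.hours_per_month, reverse=True)


# Made-up backlog based on the kinds of toil listed earlier.
backlog = [
    Candidate("scale out on high CPU", runs_per_month=40, minutes_per_run=15),
    Candidate("schedule system backups", runs_per_month=30, minutes_per_run=10),
    Candidate("apply configuration change", runs_per_month=8, minutes_per_run=90),
]

for c in rank_by_impact(backlog):
    print(f"{c.name}: {c.hours_per_month:.1f} h/month")
```

The first item in the ranked list is a natural candidate for a proof of value. Once it is automated, re-run the ranking to find the next constraint.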
From automation to autonomous systems
SRE best practice advocates the reduction of toil by using innovative tools and technologies to automate repetitive or error-prone tasks.
However, as your SRE function matures, the longer-term, high-value goal should be to create not just automated systems but autonomous systems that require minimal human intervention to make decisions.
Doing this moves organisations towards autonomous operations and optimised high-value engineering models.