Reduce SRE toil: 4 steps to success
Why is toil such a big problem?
Toil is repetitive, manual work that adds little lasting value and that grows linearly as a service grows. It can be time-consuming and tedious, taking engineers away from more valuable tasks such as building new features or improving existing ones. This hinders product development and can affect revenue.
Additionally, toil can lead to burnout and decreased job satisfaction, as engineers may feel that their skills and expertise are being wasted on menial tasks. By minimising toil, you let your engineers focus on the work that they find most fulfilling and valuable, and you retain the skills and expertise of your team.
What tasks are toil?
Toil tasks are simple to perform, but do not add much value. They do not require an engineer's expertise or critical thinking. Instead, toil hinders engineers from advancing with the much more valuable (and interesting) work of product and service development.
Some examples of toil include:
- Copying and pasting commands from a playbook
- Repaving large environments
- Scheduling system backups
- Manual system configuration changes
- Spinning up/down test or staging environments
- Manual response to repeated incidents or events
What is toil costing my team?
The impact of toil is huge, and it has a real cost attached to it. If we take Gartner's example that 75% of IT Operations team time is spent on toil and apply it to a 100-person IT Operations organisation with an average salary of £60,000, the cost of that toil is £4.5 million per year in lost productivity alone (100 × £60,000 × 0.75).
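To run the same arithmetic for your own team, here is a minimal sketch. The function name and parameters are illustrative assumptions, and it only counts salary, not the knock-on cost of delayed product work.

```python
def annual_toil_cost(headcount: int, average_salary: float, toil_fraction: float) -> float:
    """Salary cost of the share of team time spent on toil (illustrative helper)."""
    if not 0.0 <= toil_fraction <= 1.0:
        raise ValueError("toil_fraction must be between 0 and 1")
    return headcount * average_salary * toil_fraction


# The figures from the example above: 100 people, £60,000 average salary, 75% toil.
cost = annual_toil_cost(headcount=100, average_salary=60_000, toil_fraction=0.75)
print(f"£{cost:,.0f} per year")
```

Swapping in your own headcount, salary and an honest estimate of the toil share gives a rough first figure to put in front of budget holders.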
4 steps to reduce toil (and its costs)
Before you can take any of the recommended steps, it's essential that you understand which operational activities and processes are most heavily laden with low-value, manual, repetitive tasks.
This exercise should be completed regularly, perhaps quarterly, to identify where progress has been made and where new toil tasks are encroaching on engineering time.
Once you understand where toil is coming from, you will be better able to identify which tasks are ripe for your toil reduction programme.
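One simple way to run this exercise is to have engineers log manual work for a quarter and then total it by task. The sketch below assumes a hypothetical log of (task, hours) pairs; the task names and hours are made up for illustration.

```python
from collections import defaultdict

# Hypothetical log of manual work for one quarter: (task, hours spent).
work_log = [
    ("restart stuck job", 1.5),
    ("spin up staging environment", 3.0),
    ("restart stuck job", 2.0),
    ("manual config change", 1.0),
    ("spin up staging environment", 2.5),
    ("restart stuck job", 1.5),
]


def rank_toil(entries):
    """Total hours per task and return (task, hours, occurrences), largest first."""
    hours = defaultdict(float)
    counts = defaultdict(int)
    for task, spent in entries:
        hours[task] += spent
        counts[task] += 1
    return sorted(
        ((task, hours[task], counts[task]) for task in hours),
        key=lambda row: row[1],
        reverse=True,
    )


for task, total, count in rank_toil(work_log):
    print(f"{task}: {total:.1f} h across {count} occurrences")
```

Tasks that sit at the top of the list quarter after quarter are the most obvious candidates for your toil reduction programme.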
By automating toil tasks, your engineers are freed up to focus on more complex and creative work, ultimately increasing productivity and efficiency. Additionally, automation can reduce the risk of errors and improve consistency in the output. This can lead to improved quality and customer satisfaction, and reduced toil from incident response.
Although automation is a solution to toil, done badly it can actually add to it. A recent survey found that 60% of Site Reliability Engineers spend most of their time building and maintaining automation code. Some of this work is undoubtedly valuable, but when it stems from one-off automation code, it's toil.
Much of this one-off code is a result of the growing complexity of tech infrastructure and systems. One way to manage this complexity (and reduce toil) is offered by platform engineering. By setting guardrails (automated checks and controls) and creating self-service blueprints (predefined templates that can be used to quickly spin up new environments/provision infrastructure) you can enforce better standardisation, making it easier to automate across your environments.
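To make guardrails and blueprints concrete, here is a rough sketch in Python. Everything in it is a hypothetical illustration: the rule names, regions, tags and blueprint fields are assumptions, and it is not the format of any particular platform engineering tool.

```python
# Hypothetical guardrail rules: permitted regions and tags every environment must carry.
ALLOWED_REGIONS = {"eu-west-1", "eu-west-2"}
REQUIRED_TAGS = {"owner", "cost-centre"}

# A hypothetical self-service blueprint: a predefined template for a staging environment.
STAGING_BLUEPRINT = {
    "region": "eu-west-1",
    "instance_count": 2,
    "tags": {"cost-centre": "platform"},
}


def from_blueprint(blueprint: dict, owner: str) -> dict:
    """Create an environment request from a blueprint, filling in only the owner."""
    return {**blueprint, "tags": {**blueprint["tags"], "owner": owner}}


def check_guardrails(environment: dict) -> list[str]:
    """Return guardrail violations; an empty list means the request passes."""
    violations = []
    region = environment.get("region")
    if region not in ALLOWED_REGIONS:
        violations.append(f"region {region!r} is not allowed")
    missing = REQUIRED_TAGS - set(environment.get("tags", {}))
    for tag in sorted(missing):
        violations.append(f"missing required tag {tag!r}")
    return violations


print(check_guardrails(from_blueprint(STAGING_BLUEPRINT, owner="alice")))
print(check_guardrails({"region": "us-east-1", "tags": {}}))
```

Because every request starts from the same template and passes through the same checks, environments end up looking alike, which is what makes them easier to automate against.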
Composability is the ability to reuse and combine different components to create new systems. By using composable components, developers can build new systems faster and with greater flexibility, since they can mix and match pre-approved components as needed.
This also makes managing your tech ecosystem much easier, by reducing the amount of one-off code which can be tricky to troubleshoot and which can slow down incident response.
Combined with standardisation and automation, composability further reduces the toil involved in configuring new environments, testing, and remediating compliance issues. It also improves reliability in production, which reduces the toil of incident management.
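As a rough illustration of composability, the sketch below builds two environments from the same small set of pre-approved components. The `Component` class, the `compose` function and the component names are all hypothetical, chosen only to show the mix-and-match idea.

```python
from dataclasses import dataclass, field


@dataclass
class Component:
    """A pre-approved building block (names and settings are hypothetical)."""
    name: str
    settings: dict = field(default_factory=dict)


def compose(name: str, *components: Component) -> dict:
    """Combine components into one environment definition, rejecting duplicates."""
    seen = set()
    for component in components:
        if component.name in seen:
            raise ValueError(f"component {component.name!r} used twice")
        seen.add(component.name)
    return {
        "name": name,
        # Copy the settings so one environment cannot mutate another's component.
        "components": {c.name: dict(c.settings) for c in components},
    }


LOAD_BALANCER = Component("load-balancer", {"tls": True})
WEB_TIER = Component("web-tier", {"replicas": 3})
DATABASE = Component("database", {"backups": "daily"})

# The same components are reused; only the combination differs.
staging = compose("staging", LOAD_BALANCER, WEB_TIER)
production = compose("production", LOAD_BALANCER, WEB_TIER, DATABASE)
print(list(production["components"]))
```

Since both environments share the same building blocks, a fix to one component applies everywhere it is used, rather than being repeated in several pieces of one-off code.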
Can we eliminate toil?
Short answer: no.
Google, which coined the term, recommends that its Site Reliability Engineering teams spend less than 50% of their time on toil, and some teams report spending as little as 33%.
There will always be some element of toil in any tech operation, but the important thing is to recognise when a task is becoming toil and ask: 'Can I automate this?' If the answer is yes, then get automating!
Reduce toil by up to 50%
Cloudsoft AMP combines automation, blueprinting and composability to reduce developer toil and improve system reliability.