Must-Know Site Reliability Engineering (SRE) Terminology
Familiarise yourself with the key terms in Site Reliability Engineering and test yourself with our wordsearch!
Toil
Toil is manual, repetitive work (for example, creating user accounts, handling access requests or clearing browser caches) that is necessary but of low value. The SRE approach is to automate as much toil as possible, freeing up time for higher-value work.
Auto-remediation
The ability to use advanced automation to detect issues, apply fixes and restore service, reducing toil.
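As a rough illustration of the detect, fix and restore cycle described above, here is a minimal Python sketch. The names `auto_remediate`, `FakeService`, `is_healthy` and `restart` are hypothetical and stand in for whatever health checks and remediation actions a real system exposes.

```python
from typing import Callable


def auto_remediate(
    is_healthy: Callable[[], bool],
    remediate: Callable[[], None],
    max_attempts: int = 3,
) -> bool:
    """Check health and apply the remediation until healthy or out of attempts."""
    for _ in range(max_attempts):
        if is_healthy():
            return True   # service is fine, nothing (more) to do
        remediate()       # apply the automated fix, then re-check
    return is_healthy()   # final check after the last attempt


class FakeService:
    """Hypothetical stand-in for a real service, used only for this demo."""

    def __init__(self) -> None:
        self.healthy = False
        self.restarts = 0

    def is_healthy(self) -> bool:
        return self.healthy

    def restart(self) -> None:
        self.restarts += 1
        self.healthy = True  # assume a restart fixes the problem


service = FakeService()
restored = auto_remediate(service.is_healthy, service.restart)
print("Service restored:", restored, "after", service.restarts, "restart(s)")
```

Capping the number of attempts is a deliberate choice in this sketch: if the automated fix does not work, the issue should be escalated to a person rather than retried forever.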
SLO
Service-level objective: A target value or range of values for a service level that is measured by an SLI.
SLI
Service-level indicators (SLIs) are measures of performance that allow SREs to understand whether they are meeting the SLOs for the system. For example, an SLI can be the uptime metric for a particular service.
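To make the relationship between an SLI and an SLO concrete, here is a minimal Python sketch. It assumes a request-based availability SLI (successful requests divided by total requests) and a 99.9% SLO target; the function names and the request counts are illustrative only.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return the fraction of requests that succeeded (the SLI)."""
    if total_requests == 0:
        return 1.0  # assumption: no traffic counts as fully available
    return successful_requests / total_requests


def meets_slo(sli: float, slo_target: float) -> bool:
    """The SLO is met when the measured SLI is at or above the target."""
    return sli >= slo_target


# Illustrative numbers, not real measurements.
sli = availability_sli(successful_requests=999_500, total_requests=1_000_000)
print(f"SLI: {sli:.2%}")
print("SLO met:", meets_slo(sli, slo_target=0.999))
```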
Mean Time to Detect (MTTD)
The average time between the start of an incident and the moment it is detected.
Mean Time to Recover (MTTR)
The average time it takes to recover from an incident and get the service back up and running. A critical metric for SREs.
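Both MTTD and MTTR are simple averages over incident records. The Python sketch below assumes each incident has a start, detection and resolution timestamp, and measures recovery time from the start of the incident (some teams measure it from detection instead). The `Incident` class and the sample timestamps are hypothetical.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Incident:
    started: datetime    # when the problem began
    detected: datetime   # when monitoring or a person noticed it
    resolved: datetime   # when service was restored


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average of (detected - started) across all incidents."""
    total = sum((i.detected - i.started for i in incidents), timedelta())
    return total / len(incidents)


def mean_time_to_recover(incidents: list[Incident]) -> timedelta:
    """Average of (resolved - started) across all incidents."""
    total = sum((i.resolved - i.started for i in incidents), timedelta())
    return total / len(incidents)


# Illustrative data only.
incidents = [
    Incident(datetime(2024, 1, 5, 10, 0), datetime(2024, 1, 5, 10, 5),
             datetime(2024, 1, 5, 10, 45)),
    Incident(datetime(2024, 2, 9, 14, 0), datetime(2024, 2, 9, 14, 15),
             datetime(2024, 2, 9, 15, 15)),
]
print("MTTD:", mean_time_to_detect(incidents))
print("MTTR:", mean_time_to_recover(incidents))
```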
Incident response automation (IRA)
The ability to streamline the incident response process through automation.
Mean Time Between Failures (MTBF)
The average time elapsed between consecutive incidents, measured across a series of incidents.
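As a minimal Python sketch, MTBF can be computed by averaging the gaps between consecutive incidents. This version assumes the gap is measured from the start of one failure to the start of the next; other definitions count only the uptime between recovery and the next failure. The function name and dates are illustrative.

```python
from datetime import datetime, timedelta


def mean_time_between_failures(failure_times: list[datetime]) -> timedelta:
    """Average gap between consecutive failure start times."""
    ordered = sorted(failure_times)
    if len(ordered) < 2:
        raise ValueError("need at least two failures to measure a gap")
    # Pair each failure with the next one and take the difference.
    gaps = [later - earlier for earlier, later in zip(ordered, ordered[1:])]
    return sum(gaps, timedelta()) / len(gaps)


# Illustrative data only: gaps of 10 and 20 days.
failures = [
    datetime(2024, 1, 1),
    datetime(2024, 1, 11),
    datetime(2024, 1, 31),
]
print("MTBF:", mean_time_between_failures(failures))
```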
Error budget
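The amount of unreliability, such as downtime, that is acceptable for a particular service over a given period. It is the gap between 100% and the SLO: for example, a 99.9% availability SLO leaves an error budget of 0.1%.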
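The arithmetic behind an error budget is short enough to sketch in Python. This example assumes a time-based availability SLO of 99.9% over a 30-day window, which works out to roughly 43 minutes of allowed downtime; the function names and the 20 minutes of downtime already spent are illustrative.

```python
from datetime import timedelta


def error_budget(slo_target: float, window: timedelta) -> timedelta:
    """Allowed downtime: the share of the window not covered by the SLO."""
    return window * (1 - slo_target)


def remaining_budget(budget: timedelta, downtime_so_far: timedelta) -> timedelta:
    """Budget left after subtracting downtime already incurred."""
    return budget - downtime_so_far


budget = error_budget(slo_target=0.999, window=timedelta(days=30))
left = remaining_budget(budget, downtime_so_far=timedelta(minutes=20))
print("Error budget:", budget)
print("Remaining:", left)
```

A negative remaining budget means the SLO has been breached for the window.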
Can you find all these words, and more, in our SRE wordsearch?