Demonstrating the Value of SRE
SREs have been described as Swiss aRmy Engineers because SRE is a discipline that takes aspects of software engineering and applies them to infrastructure and operations problems to create scalable, more reliable systems. Over the last 10 years or so, it has become a crucial part of many organisations because it bridges the gap between shipping code and making it run well in the real world.
The term "SRE" (Site Reliability Engineering) was coined in the early 2000s by Google engineer Ben Treynor Sloss, who defined SRE as "what happens when a software engineer is tasked with what used to be called operations." My, how it has evolved from there!
In this online event, we had some fantastic panelists:
- Ashley Sawatsky (Senior Reliability & Incident Response Advocate, Rootly)
- Andy McGuigan (Engineering Manager - SRE, Axon)
- Mandi Walls (DevOps Advocate, PagerDuty)
- Alex Heneveld (CTO, Cloudsoft)
Watch the on-demand recording below:
Why is SRE so important?
SRE is important to businesses running in the cloud because it focuses on automating operational work such as deployments, monitoring and IaaS management. The goal is greater reliability and resilience, and ultimately a lighter workload for operations teams while system uptime is maintained. We’ve often said that SRE, as a discipline, is like the CIA: our failures are known, our successes are not! One of the most effective tactics to share back to your business counterparts is the number of days without a meltdown… or in earnestness, Mean-Time-To-Dopamine.
DevOps and Platform Engineering: Distant Cousins?
SRE is the first cousin (once removed) of DevOps. It shares some grandparent values and principles somewhere, but there are some pretty big distinctions. DevOps is cultural. It’s a movement. It’s meant to bridge the gap between development and operations, aiming to reduce the time between deployments, go faster, and make software delivery more reliable. SRE might be interpreted as an implementation of DevOps, a way of putting principles into action, but with a much stronger focus on automation and system resilience.
Platform Engineering, on the other hand, often deals with building and maintaining the underlying platforms that host the applications. It focuses on the scalability, reliability, and efficiency of these platforms. SRE complements platform engineering beautifully by ensuring that the applications running on these platforms are reliable and meet their service level objectives (SLOs). There are some wonderful overlaps on KPIs around service availability, resilience, mean-time-to-X (detection, downtime, recovery, etc.) and much more.
Top Trends - What Cloudsoft Sees
- More Automation: As systems grow in complexity, the need for automation in monitoring, alerting, and response becomes more critical. SREs are adopting more sophisticated tools and technologies to automate routine tasks and responses to common incidents.
- SLO Down to Go Faster: The use of SLOs to define and measure reliability is becoming more prevalent. SREs are focusing on setting more realistic (read: measurable) objectives for system performance and reliability, which helps in making informed decisions about where to invest in reliability improvements.
- Deep Observability: Moving beyond traditional monitoring, observability encompasses a deeper insight into the system's internal state, based on external outputs. This trend focuses on understanding the “why” behind system behaviors, enabling more effective troubleshooting and system improvement.
- Blameless Postmortems: SRE burnout is something we’ve talked about publicly before. Blameless postmortems cultivate a learning culture in which failures are seen as opportunities to improve system reliability, not as reasons for blame. This approach encourages open sharing of mistakes and learnings, contributing to overall system resilience.
- Chaos Engineering: The practice of intentionally introducing failures to test the resilience of systems. This proactive approach helps identify weaknesses before they become critical issues, promoting system robustness. If a butterfly flaps its wings in US-EAST-1, why does a spunky team of 40 engineers in Edinburgh go onto high alert?
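To make the SLO trend above a little more concrete, here is a minimal sketch of how an availability SLO turns into an error budget, i.e. the amount of downtime you can "spend" in a window before the objective is missed. The function names, the 30-day window and the sample figures are our own assumptions for illustration, not something taken from the panel discussion or from any particular tool.

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Allowed downtime, in minutes, for an availability SLO over a window.

    slo_target is a fraction, e.g. 0.999 for "three nines" (assumed convention).
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    window_minutes = window_days * 24 * 60
    return (1.0 - slo_target) * window_minutes


def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget_minutes(slo_target, window_days)
    return (budget - downtime_minutes) / budget


if __name__ == "__main__":
    # Hypothetical service: 99.9% SLO, 21.6 minutes of downtime so far.
    print(round(error_budget_minutes(0.999), 1))    # 43.2
    print(round(budget_remaining(0.999, 21.6), 2))  # 0.5
```

A number like "half the budget left this month" tends to be easier to discuss with business counterparts than a raw availability percentage.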
How do we demonstrate the value of SRE?
SRE teams are not natural limelight chasers, which is why the “advocacy role” is becoming increasingly popular. We need to do a better job of highlighting SRE’s impact on the business in terms of the stability, reliability, and efficiency of technology services, which directly correlates with improved user satisfaction and business outcomes. After this online event, we thought about some practical advice to help the SRE community surface its value:
Performance Metrics Improvement
- Before-and-After Analysis: Conduct a comparative analysis of key performance indicators (KPIs) before and after the implementation of SRE practices. Metrics such as system uptime, error rates, and incident response times are tangible indicators of SRE's impact.
- Service Level Objectives (SLOs) Achievement: Demonstrate how SRE has contributed to meeting or exceeding established SLOs, showcasing the improvement in reliability and quality of services.
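A before-and-after comparison like the one described above can be as simple as the sketch below. The `PeriodStats` type, its field names and all the sample numbers are hypothetical; in practice the inputs would come from whatever monitoring system you already run.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PeriodStats:
    """KPIs for one reporting period (hypothetical shape, for illustration)."""
    total_requests: int
    failed_requests: int
    downtime_minutes: float
    period_minutes: float

    @property
    def error_rate(self) -> float:
        return self.failed_requests / self.total_requests

    @property
    def availability(self) -> float:
        return 1.0 - self.downtime_minutes / self.period_minutes


def relative_change(before: float, after: float) -> float:
    """Relative change from before to after; negative means a decrease."""
    if before == 0:
        raise ValueError("before must be non-zero")
    return (after - before) / before


def meets_slo(stats: PeriodStats, slo_target: float) -> bool:
    """True if the period's availability met the SLO target (a fraction)."""
    return stats.availability >= slo_target


if __name__ == "__main__":
    # Made-up figures for two 30-day periods (43,200 minutes each).
    before = PeriodStats(1_000_000, 5_000, 120.0, 43_200.0)
    after = PeriodStats(1_000_000, 1_000, 30.0, 43_200.0)
    change = relative_change(before.error_rate, after.error_rate)
    print(f"error rate change: {change:.0%}")             # error rate change: -80%
    print(meets_slo(before, 0.999), meets_slo(after, 0.999))  # False True
```

Keeping the two periods the same length, and saying so in the report, avoids the most common objection to this kind of comparison.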
Incident Management Efficiency
- Reduced Incident Response Time: Show how SRE practices have streamlined the incident response process, reducing the time to detect and resolve issues.
- Post-Incident Reviews and Learnings: Highlight how SRE facilitates a culture of learning and continuous improvement through blameless postmortems, leading to fewer recurring incidents and systemic improvements.
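Time to detect and time to resolve are straightforward to compute once incidents are recorded with timestamps. The sketch below assumes a hypothetical `Incident` record with three timestamps, and it measures resolution time from the start of the incident; some teams measure from detection instead, so treat the definitions here as one option, not the standard.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import mean


@dataclass(frozen=True)
class Incident:
    """One incident's key timestamps (hypothetical record shape)."""
    started: datetime
    detected: datetime
    resolved: datetime


def mean_time_to_detect(incidents: list[Incident]) -> timedelta:
    """Average gap between an incident starting and being detected."""
    seconds = [(i.detected - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


def mean_time_to_resolve(incidents: list[Incident]) -> timedelta:
    """Average gap between an incident starting and being resolved."""
    seconds = [(i.resolved - i.started).total_seconds() for i in incidents]
    return timedelta(seconds=mean(seconds))


if __name__ == "__main__":
    # Two made-up incidents.
    incidents = [
        Incident(datetime(2024, 1, 5, 9, 0), datetime(2024, 1, 5, 9, 10),
                 datetime(2024, 1, 5, 10, 0)),
        Incident(datetime(2024, 2, 1, 14, 0), datetime(2024, 2, 1, 14, 20),
                 datetime(2024, 2, 1, 16, 0)),
    ]
    print(mean_time_to_detect(incidents))   # 0:15:00
    print(mean_time_to_resolve(incidents))  # 1:30:00
```

Running the same calculation per quarter gives a simple trend line for showing whether response is actually getting faster.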
Operational Cost Savings
- Infrastructure Cost Optimisation: Illustrate how SRE has optimised resource usage and infrastructure costs through efficient scaling and automation.
- Reduction in Downtime Costs: Quantify the financial impact of reduced system downtime on revenue and brand reputation, underscoring the cost-saving aspect of SRE.
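For the downtime point above, a back-of-the-envelope estimate is often enough to start the conversation. This sketch assumes revenue is lost at a flat rate per minute of downtime, which is a deliberate simplification: it ignores peak versus off-peak traffic and does not capture brand reputation at all. The function names and figures are hypothetical.

```python
def downtime_cost(downtime_minutes: float, revenue_per_minute: float) -> float:
    """Estimated revenue lost to downtime, assuming a flat per-minute rate."""
    if downtime_minutes < 0 or revenue_per_minute < 0:
        raise ValueError("inputs must be non-negative")
    return downtime_minutes * revenue_per_minute


def downtime_savings(before_minutes: float, after_minutes: float,
                     revenue_per_minute: float) -> float:
    """Estimated saving from reduced downtime (negative if downtime grew)."""
    return (downtime_cost(before_minutes, revenue_per_minute)
            - downtime_cost(after_minutes, revenue_per_minute))


if __name__ == "__main__":
    # Made-up example: downtime fell from 120 to 30 minutes per month,
    # for a service assumed to earn 500 per minute.
    print(downtime_savings(120, 30, 500.0))  # 45000.0
```

Stating the assumed per-minute rate alongside the result keeps the estimate honest and lets finance colleagues substitute their own figure.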
Enhanced Developer Productivity
- Automation of Toil: Provide examples of how automation of repetitive operational tasks (toil) has freed up development resources, allowing teams to focus on innovation and value-added activities.
- Faster Time to Market: Discuss how SRE practices, such as CI/CD and automation, have accelerated the development lifecycle, enabling quicker release of features and products.
Customer Satisfaction and Business Continuity
- Improved User Experience: Link improvements in system reliability and performance directly to enhanced user satisfaction and engagement metrics.
- Risk Mitigation and Business Continuity: Emphasise how SRE strengthens disaster recovery and business continuity planning, reducing business risk.
Cultural Transformation
- Shift towards a Reliability Culture: Describe the cultural shift towards prioritising reliability and shared ownership of production systems between development and operations teams.
- Cross-Functional Collaboration: Showcase instances where SRE practices have fostered collaboration and knowledge sharing across different functional teams.
Key Takeaway
Site Reliability Engineering is an essential discipline in the modern enterprise landscape, focusing on reliability, automation, and operational efficiency. Systems are becoming brutally complex, yet despite the need for fast and reliable service delivery, expanding SRE headcount isn’t always a planned or preventative measure. When we surface the value we deliver back to the business (in terms of happy customers and the continued flow of money through working pipes), we’ll be heroes for more than an hour.
Got some feedback?
As ever, we would be interested to hear about your experience with SRE. Please email info@cloudsoft.io with your comments and suggestions or book a free session with one of our cloud experts.