Why Site Reliability Engineering is one of the most in-demand skills in 2023
LinkedIn’s annual Jobs On The Rise report looks for the fastest growing job titles in the UK, providing insights into where the workforce is headed.
The report covers the entire labour market, and unsurprisingly a number of technology roles make the cut: Cloud Engineer (8), Data Engineer (13), Security Operations Analyst (14), Machine Learning Engineer (16) and Software Engineer (19).
But it is the rise of the Site Reliability Engineer which I want to delve into. SRE has appeared in the LinkedIn Jobs on the Rise report since 2020, but the acceleration of digital transformation and our increasing dependence on highly available digital technologies for everyday life means that demand for SRE skills has exploded.
There are currently over 10,000 SRE related jobs being advertised in the UK on LinkedIn, and with Gartner predicting that “by 2027 75% of enterprises will use SRE practices organisation wide to optimise their operations”, up from just 10% in 2022, that number is only likely to increase.
What does a Site Reliability Engineer do?
A Site Reliability Engineer (SRE) is responsible for ensuring the reliability, performance, and availability of digital systems. They are responsible for designing, building, and monitoring systems to maximise system uptime and efficiency for the best possible end-user experience.
SREs work closely with developers to ensure applications are designed and built with reliability in mind, as well as with operations teams to ensure that the necessary infrastructure is in place to support the application. They are also tasked with identifying and resolving potential outages and performance issues before they become a problem.
Why are SRE skills in demand?
Site Reliability Engineering (SRE) skills are in high demand because they are essential for the successful operation of increasingly complex digital systems.
Products are increasingly digital, and increasingly intertwined with our everyday lives. Reliability is a competitive factor - would you bank with a bank whose app was frequently offline or stream from a media company that disconnected at peak times?
By leveraging automation and cutting-edge technologies, SREs are able to deliver high quality services with maximum uptime and minimal disruption, and therefore deliver a high-quality customer experience.
The principles of Site Reliability Engineering
1) Continuous Monitoring
One of the tenets of SRE is that errors and failures are inevitable, especially as deployment environments become more and more complex. Perfection is not possible, so instead SRE teams monitor applications in production environments in line with performance metrics.
2) Small but often change implementation
SRE lives and breathes the CI/CD principles of DevOps to maintain optimal system reliability. SRE teams use automation tools to execute repeatable processes consistently, reduce change risk and increase the speed and efficiency of change management.
3) Automation and blueprints
By blueprinting best practices 'as code' and enabling these templates to be used by developers via a service catalog, and automation of testing SRE teams can ensure reliability practices are embedded into each stage of the delivery pipeline; from architecture choices through to testing and production.
Scale SRE with PlatformOps
Because a large part of SRE is ensuring that applications are designed and built with reliability and resilience in mind, SRE skills are an essential component of Platform teams.
Platform teams collaborate on creating, sharing and reviewing services which form a Golden Path for developers to follow as they build their applications. This embeds SRE best practices into the developer workflow from the very beginning, improving the resilience and reliability of applications under development before problems occur in production. Teams also learn from previous incidents, and roll those learnings into new SRE guardrails.
But, how can you do this at scale?
One option is to hire lots of Site Reliability Engineers to scale out best practices. However in a tight hiring environment where demand outstrips supply, this isn't always possible.
Another way to achieve this is to implement tooling that can support SRE principles and blueprint and catalog services, architectures. With Cloudsoft AMP, you can use environment-as-code to define, govern and manage your environments and services:
Powered by environment-as-code, Cloudsoft AMP's easy to use unified control plane provides monitoring of all your critical systems. And because AMP's sensors can integrate with your entire tech-stack, you never need to worry that an incident will go undetected.
Detecting incidents is only part of the resilience equation - you also need to be able to recover quickly. AMP's automation engine enables self-healing, zero-touch operations. Major Tier 1 banks have even slashed their downtime by 95%. Read the case study.
Identifying and fixing issues helps with resilience, but SRE is also about capturing and codifying architecture, best practices, policies, processes and runbooks. AMP enables you to create composable design-time and run-time models in reusable blueprints which can be wrapped into your service catalog for self-service.