What is Site Reliability Engineering?
Site Reliability Engineering (SRE) is a set of principles and practices that applies application development approaches to infrastructure and operations.
SRE’s main objective is to efficiently run and operate highly scalable web-based or SaaS software products.
SRE as a discipline was created at Google in 2003 under the direction of Ben Treynor Sloss. When Sloss joined Google, the concept of DevOps did not yet exist, and the company's pace of growth created massive infrastructure scaling needs. This resulted in an urgent need to approach scaling and operating applications in a new, significantly more efficient and automated way.
People often confuse DevOps and SRE. Although they may be two sides of the same coin, their focus is slightly different. DevOps is more focused on optimizing the development pipeline, from when an engineer checks in code through to when that code reaches customers in production (usually referred to as Continuous Integration and Continuous Deployment, or CI/CD). SRE focuses more on the operation of code in production, including configuration, scalability, capacity management, change management, monitoring, reliability and resiliency.
Key principles of SRE help bring clarity to what SRE professionals are responsible for on a daily basis:
- Embracing and systematically assessing the risk of operating the platform based on business objectives.
- Driving the definition of Service Level Objectives (SLOs) for a product. SLOs are objectives for uptime and/or other performance indicators; they are normally published internally, although they can also be shared externally. Missing an SLO can have internal consequences, such as stopping the introduction of new features until the platform is stabilized. SLOs are aspirational goals and are rarely committed to in external contracts with remuneration. Service Level Agreements (SLAs), on the other hand, are contractual, almost always externally (customer or user) facing, and may include financial penalties when they are not achieved.
- Eliminating Toil on an ongoing basis. Toil is defined as mundane, repetitive operational work providing no enduring value, which scales linearly with service growth. This objective clearly aligns with optimizing the level of automation to successfully operate a platform. While one-off tasks may not need to be automated, the majority of tasks required to operate a platform can be automated to reduce toil, as well as the chance of human error.
- Optimizing and automating monitoring of the platform. Although ‘fully monitoring’ seems like a straightforward objective for any platform, it can be challenging to optimize in practice. For example, many failures result from a hung process or transaction that may just block new requests but not immediately show as hard failures of transactions. It is important that monitoring be part of the engineering design process including a heavy focus on monitoring for customer/business outcomes. Too often we see clients who monitor system level metrics but not customer transaction outcomes, resulting in customers regularly finding issues well before the product monitoring does.
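To make the link between an SLO and its internal consequences concrete, the sketch below turns an SLO target into an "error budget" (the number of failures the objective tolerates in a window) and checks whether that budget has been used up. It is a minimal illustration, not a prescribed policy: the names `SloWindow`, `budget_remaining` and `should_freeze_features`, the request-based way of counting, and the rule "freeze once the budget is fully spent" are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class SloWindow:
    """Request counts for one SLO measurement window (hypothetical structure)."""
    slo_target: float      # e.g. 0.999 means 99.9% of requests should succeed
    total_requests: int
    failed_requests: int


def error_budget(window: SloWindow) -> float:
    """Number of failed requests the SLO tolerates in this window."""
    if not 0.0 < window.slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    return (1.0 - window.slo_target) * window.total_requests


def budget_remaining(window: SloWindow) -> float:
    """Fraction of the error budget still unspent (negative if overspent)."""
    budget = error_budget(window)
    if window.total_requests == 0:
        return 1.0  # no traffic yet, so nothing has been spent
    return 1.0 - window.failed_requests / budget


def should_freeze_features(window: SloWindow) -> bool:
    """Assumed policy: pause feature releases once the budget is exhausted."""
    return budget_remaining(window) <= 0.0


# Example: a 99.9% objective over one million requests tolerates about
# 1,000 failures, so 250 failures leave roughly three quarters of the budget.
healthy_month = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=250)
rough_month = SloWindow(slo_target=0.999, total_requests=1_000_000, failed_requests=1_200)
```

With these assumed numbers, `should_freeze_features(healthy_month)` is `False` and `should_freeze_features(rough_month)` is `True`. In practice the window length, the indicator being counted and the consequence of overspending are business decisions.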
In summary, the definition and evolution of SRE is a useful framework for thinking about and optimizing how companies operate, monitor and scale their products in production. The SRE discipline incorporates constantly evolving approaches and tools to improve the efficiency of this work and the quality of the outcome (i.e. products that minimize customer incidents and operate efficiently in production).
Why is SRE important?
Very simply: mature SRE capabilities improve the time to market (TTM), quality and relative cost of any software product.
Time to Market can be improved when monitoring and error detection are fully and heavily automated such that introducing change is inherently less risky. For example, if a change is rolled out and error rates increase, it can be rolled back automatically with no intervention and minimal impact. Although SRE capabilities don’t directly decrease the time to code a change, SRE well executed makes it less risky to introduce change overall, thus improving TTM.
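The automatic rollback described above can be reduced to a small decision rule. The sketch below compares the error rate before and after a change and signals a rollback when the increase is too large. The function names, the minimum traffic of 500 requests and the one-percentage-point threshold are hypothetical values chosen for illustration; a real deployment system would supply its own metrics and limits.

```python
def error_rate(errors: int, requests: int) -> float:
    """Share of requests that failed; 0.0 when there was no traffic."""
    return errors / requests if requests else 0.0


def should_roll_back(
    rate_before: float,
    rate_after: float,
    requests_after: int,
    min_requests: int = 500,               # assumed: wait for enough traffic to judge
    max_absolute_increase: float = 0.01,   # assumed: tolerate up to +1 percentage point
) -> bool:
    """Return True when the change made the error rate noticeably worse."""
    if requests_after < min_requests:
        return False  # too little data since the rollout to decide either way
    return rate_after - rate_before > max_absolute_increase


# Example: 0.2% errors before the change, 3% errors after it.
before = error_rate(errors=20, requests=10_000)
after = error_rate(errors=90, requests=3_000)
roll_back = should_roll_back(before, after, requests_after=3_000)
```

Here `roll_back` is `True`, so an automated pipeline would revert the change without human intervention. The same check returns `False` when too few requests have been observed since the rollout, which avoids reverting on noise.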
Similarly, end user Quality improves when SRE is done well as the level of automation reduces the opportunity for human error. In addition, thorough automated monitoring, especially monitoring aligned to the end user experience (Business monitoring that simulates actual user behavior, in addition to system monitoring), makes it more likely that issues are caught early before customers are impacted.
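As a rough illustration of monitoring that simulates user behavior, the sketch below runs a scripted "user journey" and classifies the result as ok, failed or hung. A journey that never returns is reported explicitly instead of going unnoticed. The function `run_synthetic_check` and the two sample journeys are hypothetical stand-ins; a real check would exercise the product's actual login, search or checkout flow.

```python
import time
from concurrent.futures import ThreadPoolExecutor
from concurrent.futures import TimeoutError as FutureTimeout


def run_synthetic_check(journey, timeout_seconds: float) -> str:
    """Run one simulated user journey and classify the outcome.

    Returns "ok" if the journey returns a truthy value in time, "failed" if
    it raises or returns a falsy value, and "hung" if it does not finish
    within timeout_seconds.
    """
    pool = ThreadPoolExecutor(max_workers=1)
    future = pool.submit(journey)
    try:
        succeeded = future.result(timeout=timeout_seconds)
    except FutureTimeout:
        return "hung"
    except Exception:
        return "failed"
    finally:
        # Do not block the monitor while a hung journey is still running.
        pool.shutdown(wait=False)
    return "ok" if succeeded else "failed"


def fast_checkout() -> bool:
    """Hypothetical journey that completes successfully."""
    return True


def stalled_checkout() -> bool:
    """Hypothetical journey that stalls, standing in for a blocked transaction."""
    time.sleep(0.3)
    return True
```

Calling `run_synthetic_check(fast_checkout, timeout_seconds=1.0)` yields `"ok"`, while `run_synthetic_check(stalled_checkout, timeout_seconds=0.05)` yields `"hung"`. System-level metrics alone might show nothing wrong in the second case, which is the point of checking the customer outcome directly.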
Building SRE capabilities that are embedded with, or at a minimum very closely aligned with, engineering teams also improves collaboration and team cohesion, while enabling more cross-pollination of skills.
Key points for CEOs and Board members
- SRE should be integrated into the Engineering teams – As briefly discussed above, it’s important to embed or very closely align Site Reliability Engineers directly with engineering teams (at the Scrum or Squad level), as opposed to having a pool of Site Reliability Engineers that takes requests from any and all teams. This is critical to ensuring that each SRE team member truly understands the area they are focused on and is aligned and incentivized to the same business outcomes (e.g. TTM, quality, etc.) as the rest of the team. In some cases, one Site Reliability Engineer can be shared across two teams based on needs and capacity, but should always be consistently aligned to the same one or small number of scrum teams. Ideally the SRE reports to the same engineering leader as the rest of the Engineering team and can have a dotted-line (matrixed) reporting relationship to a small central SRE function.
- SRE is a very close cousin to DevOps but has a different focus – DevOps focuses on optimizing the flow of code from Engineers’ hands on keyboards to check-in, through (automated) quality checks and deployment to production. Deployment, as part of the DevOps purview, most certainly overlaps with SRE’s roles and responsibilities. However, DevOps is usually less focused than SRE on monitoring, configuring and operating in production. Given the similar skillsets, some engineers are capable in both areas, and in smaller companies these functions may be performed by the same team or even the same people. Normally, as a company scales, skillsets will diverge and there will be a need for team members who are more focused on one area or the other.
- SRE also helps address a gap between the engineering capability and mindset of ‘a typical software developer’ and those of an Operations-focused engineer. Many software engineers are so focused on making things work that they don't think through how things will BREAK. SREs are very operationally minded software developers who focus on non-functional rather than functional requirements. Bringing these two capabilities together helps ensure that the product actually runs optimally in production, in addition to passing the functional quality checks before it reaches customers.
- SRE professionals are focused on automation of repetitive tasks (eliminating toil). In contrast, if you hand a software person a repetitive task, they will automate it - but only if and when they have time. SREs focus a significant amount of their time on automation as part of their key principles and often they will even measure their split of time/effort to ensure adequate focus on toil elimination and automation on an ongoing basis.
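Measuring the split of time between toil and automation can be as simple as the sketch below, which computes the share of logged hours spent on toil. The category names, the hours and the 50% cap are made-up values for illustration; each team would define its own categories and target.

```python
def toil_share(hours_by_category: dict[str, float], toil_categories: set[str]) -> float:
    """Fraction of all logged hours that went to categories counted as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours for name, hours in hours_by_category.items() if name in toil_categories)
    return toil / total


def exceeds_toil_cap(share: float, cap: float = 0.5) -> bool:
    """Assumed rule: flag the period when toil takes more than `cap` of the time."""
    return share > cap


# Hypothetical week for one engineer, in hours.
week = {
    "manual deploys": 6.0,
    "ticket handling": 6.0,
    "automation projects": 20.0,
    "design reviews": 8.0,
}
share = toil_share(week, toil_categories={"manual deploys", "ticket handling"})
```

In this made-up week, 12 of 40 hours (30%) went to toil, so `exceeds_toil_cap(share)` is `False`. Tracking the figure over several periods shows whether automation work is actually reducing the repetitive load.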
Common CEO questions about Site Reliability Engineering
At what point in my growth do I need to have SRE experts?
As with many scaling questions, the answer is “it depends.” A small company with 10-20 engineers may not need additional SRE capacity if its engineering team is Public Cloud savvy and/or some team members have basic capabilities in both DevOps and SRE. However, in our experience with clients, software development engineers on average are not deeply skilled in, passionate about or interested in the operational aspects of their product. Especially in startups, the focus is heavily on getting new feature functionality out to customers, and operational practices are an afterthought at best. Therefore, we recommend bringing in SRE capabilities relatively early on, and definitely as soon as the product has meaningful traction in the marketplace (a growing customer base).
Can I hire people that can do both DevOps and SRE?
As previously described, there is a great deal of overlap in focus and skills between DevOps and SRE. In our experience, DevOps professionals more commonly come from an Engineering background, whereas SREs come from an Operations background but have solid scripting and ‘lite coding’ capabilities. In smaller companies that are heavily using Public Cloud, it’s more likely that a combined DevOps/SRE role will be workable, since the Cloud provides more plug-and-play tools. In short, despite the overlap, DevOps and SRE professionals often come from slightly different backgrounds, and these related skillsets will diverge further as the company scales.
What is SRE not?
SRE does NOT prevent all customer impacting incidents
While putting the principles of SRE into practice will significantly and positively impact uptime and overall reliability by improving monitoring and reducing the opportunity for human error, as long as you are in business, there will be issues that impact customers. Solid SRE capabilities will help reduce the number and duration of incidents.
SRE is not a “substitute” for Incident and Problem Management
Incident management is the process followed during a customer impacting (or potentially customer impacting) incident to resolve it as quickly as possible. Problem management deals with analyzing the true root causes of the incident and methodically tracking and driving each and every contributing root cause to closure. We see many companies that are heavily automated from an SRE perspective (monitoring, corrective actions, etc.) but then fail to address incidents and their full resolution through problem management in a disciplined way. For example, they may assemble team members on a Slack channel to work on the issue but fail to open a bridge or a war room where people can communicate more directly in real time. It is also common for teams to conduct post-mortems but fail to push for ALL contributing root causes and fail to have a systematic way to track and close out action items for each root cause.
SRE is not an excuse for application development focused engineers to disengage from the smooth running in production of what they build
Even with the best SRE team members embedded or closely aligned to Engineering scrum teams, application developers still need to understand how their application is deployed, monitored and operated in production. This is critical to their ownership mindset (as opposed to ‘throwing it over the wall’ for someone else to operate in production) and therefore their ability to effectively contribute during troubleshooting when issues arise. We see all too often that Engineers are not involved in operating their platforms which results in:
- Slower incident resolution
- Team frustration and conflict between Software and Site Reliability engineers
- A sense of general unfairness across the team as SRE needs to ‘clean up’ sometimes poor or broken code for the software engineers.
This frustration of SRE taking on poor code can also be managed with a data-driven approach that looks at measurements such as error budgets (the amount of unreliability that an SLO allows over a given period). If the error budget for a given month or quarter is exceeded, then the team must halt or slow new feature releases until the platform is stabilized. Regardless of the SRE policies at a given company, software developers must own the health of their code end-to-end.
In summary, although SRE has been a growing and maturing discipline for many years, we still see many companies that neglect this capability, especially when compared with the focus on pumping out new features. It’s obvious how this could happen, but the results run counter to the business objective of having a stable, scalable product that customers can reliably use, in addition to feature innovation. We help many companies comprehensively scale their products; call us and we can help!