Site Reliability Engineering or SRE, is the next big thing after DevOps. In this episode I cover what SRE is, some of its principles and some of the challenges along the way.

Sources

https://www.oreilly.com/content/site-reliability-engineering-sre-a-simple-overview/

https://landing.google.com/sre/books/

Transcript

 

Site Reliability Engineering - Is it just traditional siloed Ops in disguised or is it a functioning DevOps organizational structure that works. SRE comes from the trenches of Google production environments, so many organizations look to SRE after having established DevOps teams. Site Reliability Engineering has some powerful concepts and tools, but it all comes at a price. If an organization has the scale, and the willingness to invest, and change their ways of working, SRE can help you go planet-scale. But there are many obstacles on the road to Site Reliability Engineering paradise. This is the DevOpsDojo #5, I am Johan Abildskov, join me in the dojo to learn.

 

Site Reliability engineering is the fabled world of planet scaled operations, with autonomous DevOps teams. This seems like an oxymoron but is a functional DevOps organizational structure.

 Sloss, who coined the term Site Reliability Engineering, states - "Site Reliability Engineering is what happens when you ask a software engineer to design an operations team". SRE is how Good is able to run their production environments in large scale. Both in terms of how many applications and services they run and how many engineers are required to develop and maintain those products, but also in terms of the global scale of their applications. 

 

Site Reliability Engineering is about scaling services to world-class availability, but it is also about scaling the engineering organization such that teams can continue to be productive as the business grows.

 

For me the core tenants of Site Reliability Engineering are:

Minimizing Toil, Shared Ownership and the ability to say no. 

 

Engineering is a key part of SRE, and toil is the opposite. Toil is the work that does not, in the long run, add value, or that does not require engineering. This means that if we do too much work on toil, we will not be able to scale superlinearly with the number of engineers. Examples of toil are manual deployment procedures and following complex processes. Google claims to be able to have their teams work with thirty to fifty per cent toil. With fifty per cent being a hard limit. A key point here is that if a team violates this, they will either shed responsibilities or get resources added. This requires a hard buy-in from management in supporting this. This will likely be impossible in a project funded organization. So this way of thinking of eliminating toil, and give the space in the teams to do real engineering work, is the foundation of Site Reliability Engineering.

 

Shared Ownership is a weird construct when the point is to run the applications that someone else is building. But it is a very important pillar of SRE. This shared ownership comes from SRE teams providing valuable metrics from production environments to product teams. It also comes from a shared agreement that if the application itself is too unstable, developers will join the SREs on-call rotation until the service has been restored to the quality the SREs can operate.

 

SREs also helps product teams with production readiness reviews and checklists. All in all SRE enables the product teams to run applications at planet-scale without having to maintain all the competencies that this necessarily requires inside the team. This again requires commitment from management in enforcing the requirements from the SRE teams, so we do not revert back to throwing applications over the wall of confusion.

 

A powerful realisation is that we should consider downtime a resource, and spent it deliberately and respectfully. With this realisation comes the underlying acceptance that absolute 100% availability is not the right target. As we approach 100% uptime, it becomes more and more expensive to increase uptime, and less and less likely that the problem will be on our end, rather than at the client-side or at an intermediate stage. As such, 100% availability as the goal, is not a good business decision.

What the right goal is a big topic in itself, and we will not cover that here.

 

So let us say that we have a business goal of 95%, we communicate this to our customers as our Service Level Agreement or SLA. SLAs typically also comes with pre-agreed repercussions in case of violations.

SRE teams then have Service Level Objects or SLOs. These are the internal metrics that the teams use to guide their decisions. These SLOs are somewhere between the SLA and the theoretical max. The reasoning being that if we have a more aggressive requirement internally it is unlikely that we break the externally facing SLA. If we look over a period of three months, we then have some amount of downtime that we can spend doing different stuff. This is the difference between the SLO and the currently measured availability. If we, for instance, have a SLO of 99%, each month, we can have 7.3 hours of downtime while still complying with the SLO. We call this the error budget. This means that each month we can spend 7.3 hours of downtime, and the question is then how do we get the most benefit from that?

Unfortunately in most organizations, this is spent accidentally, simply by having a flakey application, or not having invested in redundancy and resiliency. This makes it hard to spend the error budget, in healthy ways. 

 

One way of spending the error budget is that we can allow a higher, and thus more risky rate of innovation. We can also do disaster recovery training, or upgrade our infrastructure. Google even proclaims that if they fail to spend their error budget they will take down their services, such that no one becomes dependent on them having a higher uptime than is promised.

 

Error budgets provide a framework of reasoning around availability that avoids the usual absolute statement that downtime is bad.

The last concrete SRE technique that I will mention in brief is the Blameless postmortem.

It is an incident analysis technique that focuses on exposing systemic errors, without trying to assign blame. Blameless postmortems strive to turn individual or team learning in to organizational learning. 

Postmortems can be presented company-wide and celebrated turning the messengers of bad news into the heroes of our organizations.

 

Site Reliability Engineering allows ambitious companies to scale their engineering organizations and their service portfolio. But it requires the buy-in and active support to create the space to allow the SRE organization to do its work as it is supposed to be, primarily engineering updated. Unfortunately, without this company-wide belief and commitment to SRE, it can quickly become a modernization theatre without any real benefit. 

 

This has been the DevOpsDojo on Site Reliability Engineering. You can follow me on twitter @randomsort. If you have any questions, feedback or just want to reach out and suggest a topic, do not hesitate. You can find show notes with transcripts, links and more at dojo.fm. Support the show by leaving a review, sharing this episode with a friend or colleague or subscribing to the DevOpsDojo on your favourite podcast platform. Thank you for listening, keep learning.

Share | Download(Loading)
Podbean App

Play this podcast on Podbean App