[TBD] Design the role of SRE (Site Reliability Engineer)

Introduction

Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort.

Layman’s explanation

Software systems are inherently dynamic and unstable; a system can be perfectly stable only if nothing in or around it changes. If we stop changing the codebase, we stop introducing bugs. If the underlying hardware or libraries never change, neither of these components will introduce bugs. If we freeze the current user base, we’ll never have to scale the system. In practice, none of these conditions holds for long.

For the majority of production software systems, we want a balanced mix of stability and agility. At the end of the day, an SRE’s job is to keep agility and stability in balance.

Technical explanation

Estimates suggest that 40–90% of the total costs of a system are incurred after birth. Seen through this lens, if software engineering tends to focus on designing and building software systems, there must be another discipline that focuses on the whole lifecycle of software objects: inception, deployment and operation, refinement, and eventual peaceful decommissioning. This discipline uses, and needs to use, a wide range of skills, but its concerns are separate from those of other kinds of engineers. Today, our answer is the discipline Google calls Site Reliability Engineering.

Before SRE

Historically, companies have employed systems administrators to run complex computing systems.

The sysadmin approach and the accompanying development/ops split have a number of disadvantages and pitfalls.

Development and sysadmin teams differ considerably in background, skill set, and incentives. They use different vocabulary to describe situations; they carry different assumptions about both risk and the possibilities for technical solutions; they have different assumptions about the target level of product stability. The split between the groups can easily become one of not just incentives, but also communication, goals, and eventually, trust and respect.

SRE role

SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.

In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).

Ops-focused team vs SREs

A traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load grows with traffic. That means hiring more people to do the same tasks over and over again. To avoid this fate, the SRE team tasked with managing a service needs to code or it will drown.

Because SREs directly modify code in their pursuit of making Google’s systems run themselves, SRE teams are characterized by both rapid innovation and a high acceptance of change. Such teams are relatively inexpensive: supporting the same service with an ops-oriented team would require a significantly larger number of people. Instead, the number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system.

DevOps vs SRE

One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.

Useful questions for product reliability

What level of availability will the users be happy with, given how they use the product?
What alternatives are available to users who are dissatisfied with the product’s availability?
What happens to users’ usage of the product at different availability levels?
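
To make these questions concrete, it can help to translate an availability target into the downtime it permits. The following is a minimal sketch that is not part of the source material: the function name `allowed_downtime_minutes` and the example targets are illustrative assumptions, and it assumes a 365-day year with availability measured purely as uptime.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # assumption: a non-leap year


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted by an availability target."""
    if not 0 <= availability_percent <= 100:
        raise ValueError("availability must be between 0 and 100 percent")
    unavailable_fraction = (100 - availability_percent) / 100
    return MINUTES_PER_YEAR * unavailable_fraction


if __name__ == "__main__":
    # Hypothetical targets chosen only to show how quickly the budget shrinks.
    for target in (99.0, 99.9, 99.99):
        minutes = allowed_downtime_minutes(target)
        print(f"{target}% -> {minutes:.1f} minutes of downtime per year")
```

Under these assumptions, a 99.9% target leaves roughly 8.76 hours of downtime per year, which gives the three questions above a number to argue about.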

SRE tools

Monitoring

A classic and common approach to monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs.

However, monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.

Monitoring relies on real-time quantitative data.

White-box monitoring

Monitoring based on metrics exposed by the internals of the system.

Black-box monitoring

Testing externally visible behavior as a user would see it.
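
The difference between the two styles can be sketched with a toy in-process service. Everything here is hypothetical and for illustration only: `ToyService`, its counters, and the two check functions are assumptions, not an API from any real monitoring system.

```python
class ToyService:
    """Hypothetical service that tracks its own request and error counters."""

    def __init__(self) -> None:
        self.requests = 0
        self.errors = 0
        self.healthy = True

    def handle(self, path: str) -> int:
        """Serve a request and return an HTTP-style status code."""
        self.requests += 1
        if not self.healthy:
            self.errors += 1
            return 500
        return 200


def white_box_error_ratio(service: ToyService) -> float:
    """White-box check: read counters exposed by the service's internals."""
    if service.requests == 0:
        return 0.0
    return service.errors / service.requests


def black_box_probe(service: ToyService, path: str = "/") -> bool:
    """Black-box check: send a request as a user would and judge only the result."""
    return service.handle(path) == 200


if __name__ == "__main__":
    service = ToyService()
    print("probe while healthy:", black_box_probe(service))
    service.healthy = False
    print("probe while broken:", black_box_probe(service))
    print("internal error ratio:", white_box_error_ratio(service))
```

In this sketch the black-box probe only learns that a request failed, while the white-box check can report how large a share of requests is failing. Note that the probe itself counts as traffic in the internal counters.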

Purpose of monitoring

Analyzing long-term trends
Comparing over time or across experiment groups
- For example, checking whether the website is slower than it was previously
Alerting if something is broken
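
A comparison over time, such as the "is the website slower" check, might look like the sketch below. The function name `is_slower`, the use of the median, and the 10% tolerance are all assumptions made for illustration; a real system would pick its own statistic and threshold.

```python
from statistics import median


def is_slower(previous_ms, current_ms, tolerance=0.10):
    """Return True if the current median latency exceeds the previous median
    by more than `tolerance` (a fraction, so 0.10 means 10%)."""
    if not previous_ms or not current_ms:
        raise ValueError("both latency samples must be non-empty")
    return median(current_ms) > median(previous_ms) * (1 + tolerance)


if __name__ == "__main__":
    # Hypothetical latency samples in milliseconds.
    last_period = [100, 110, 120]
    this_period = [150, 160, 170]
    print("slower than before:", is_slower(last_period, this_period))
```

The median is used here so that a single slow request does not dominate the comparison; a high percentile would be a reasonable alternative when tail latency matters more.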

Alerts - Notifications intended to be read by a human. They signify that a human needs to take action immediately.

Tickets - Signify that a human needs to take action, but not immediately.

Logging - No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
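
The three kinds of output above amount to a simple routing decision that software, not a person, should make. Here is a minimal sketch; the `Output` enum and the `route` function are hypothetical names introduced only for this illustration.

```python
from enum import Enum


class Output(Enum):
    ALERT = "alert"    # a human must act immediately
    TICKET = "ticket"  # a human must act, but not immediately
    LOG = "log"        # recorded for diagnostics; nobody is notified


def route(needs_human_action: bool, immediate: bool) -> Output:
    """Choose the monitoring output for an event based on the action it requires."""
    if needs_human_action and immediate:
        return Output.ALERT
    if needs_human_action:
        return Output.TICKET
    return Output.LOG


if __name__ == "__main__":
    print(route(needs_human_action=True, immediate=True).value)
    print(route(needs_human_action=True, immediate=False).value)
    print(route(needs_human_action=False, immediate=False).value)
```

An event that needs no human action is logged even if it is flagged as immediate, which matches the idea that humans are notified only when they have something to do.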

Capacity planning approach

An accurate organic demand forecast, which extends beyond the lead time required for acquiring capacity
An accurate incorporation of inorganic demand sources into the demand forecast
Regular load testing of the system to correlate raw capacity (servers, disks, and so on) to service capacity
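
The three inputs above can be combined into a rough sizing calculation. This is a sketch under stated assumptions: demand is expressed in queries per second, `qps_per_server` stands for the figure obtained from load testing, and the `headroom` margin is an assumption added here rather than something the source prescribes. All names are hypothetical.

```python
import math


def servers_needed(organic_qps, inorganic_qps, qps_per_server, headroom=0.25):
    """Estimate the number of servers required for the forecast demand.

    organic_qps    -- forecast demand from natural product growth
    inorganic_qps  -- forecast demand from events such as launches or campaigns
    qps_per_server -- service capacity of one server, measured by load testing
    headroom       -- assumed spare-capacity fraction (0.25 means 25% extra)
    """
    if qps_per_server <= 0:
        raise ValueError("qps_per_server must be positive")
    total_demand = organic_qps + inorganic_qps
    return math.ceil(total_demand * (1 + headroom) / qps_per_server)


if __name__ == "__main__":
    # Hypothetical forecast: 8000 QPS organic, 2000 QPS from a planned launch,
    # and a load test showing that one server sustains 500 QPS.
    print("servers needed:", servers_needed(8000, 2000, 500))
```

The forecast has to reach beyond the lead time for acquiring capacity, so in practice the demand figures would be the ones projected for the date the new servers can actually be in service.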

What lies beyond reliability

SREs are focused on finding ways to improve the design and operation of systems to make them more scalable, more reliable, and more efficient. However, we expend effort in this direction only up to a point: when systems are “reliable enough,” we instead invest our efforts in adding features or building new products.

Benefit to developers

Experience in SRE has shown that reliable processes tend to increase developer agility: rapid, reliable production rollouts make changes in production easier to see. As a result, once a bug surfaces, it takes less time to find and fix it. Building reliability into development lets developers focus on what they really care about, namely the functionality and performance of their software and systems.

Reference

https://landing.google.com/sre/sre-book/chapters/introduction/