Software engineering has this in common with having children: the labor before the birth is painful and difficult, but the labor after the birth is where you actually spend most of your effort.
Software systems are inherently dynamic and unstable. If we stop changing the codebase, we stop introducing bugs. If the underlying hardware or libraries never change, neither of these components will introduce bugs. If we freeze the current user base, we’ll never have to scale the system.
For the majority of production software systems, we want a balanced mix of stability and agility. At the end of the day, SRE job is to keep agility and stability in balance in the system.
Estimates tell that 40–90% of the total costs of a system are incurred after birth. Through this lens, then, we see that if software engineering tends to focus on designing and building software systems, there must be another discipline that focuses on the whole lifecycle of software objects, from inception, through deployment and operation, refinement, and eventual peaceful decommissioning. This discipline uses—and needs to use—a wide range of skills, but has separate concerns from other kinds of engineers. Today, our answer is the discipline Google calls Site Reliability Engineering.
Historically, companies have employed systems administrators to run complex computing systems.
The sysadmin approach and the accompanying development/ops split has a number of disadvantages and pitfalls.
Development and sys admin teams are quite different in background, skill set, and incentives. They use different vocabulary to describe situations; they carry different assumptions about both risk and possibilities for technical solutions; they have different assumptions about the target level of product stability. The split between the groups can easily become one of not just incentives, but also communication, goals, and eventually, trust and respect.
SRE is fundamentally doing work that has historically been done by an operations team, but using engineers with software expertise, and banking on the fact that these engineers are inherently both predisposed to, and have the ability to, design and implement automation with software to replace human labor.
In general, an SRE team is responsible for the availability, latency, performance, efficiency, change management, monitoring, emergency response, and capacity planning of their service(s).
Eventually, a traditional ops-focused group scales linearly with service size: if the products supported by the service succeed, the operational load will grow with traffic. That means hiring more people to do the same tasks over and over again.
To avoid this fate, the SRE team tasked with managing a service needs to code or it will drown.
Because SREs are directly modifying code in their pursuit of making Google’s systems run themselves, SRE teams are characterized by both rapid innovation and a large acceptance of change. Such teams are relatively inexpensive—supporting the same service with an ops-oriented team would require a significantly larger number of people. Instead, the number of SREs needed to run, maintain, and improve a system scales sublinearly with the size of the system.
One could view DevOps as a generalization of several core SRE principles to a wider range of organizations, management structures, and personnel. One could equivalently view SRE as a specific implementation of DevOps with some idiosyncratic extensions.
What level of availability will the users be happy with, given how they use the product?
What alternatives are available to users who are dissatisfied with the product’s availability?
What happens to users’ usage of the product at different availability levels?
A classic and common approach to monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs.
However, monitoring should never require a human to interpret any part of the alerting domain. Instead, software should do the interpreting, and humans should be notified only when they need to take action.
It uses real-time quantitive data for monitoring
White-box monitoring
Monitoring based on metrics exposed by the internals of the system
Black-box monitoring
Testing externally visible behavior as a user would see it.
Purpose of monitoring
Analyzing long-term trends
Comparing over time or experiment groups
For example, to check if website is slower compared to last time
Alerting if something is broken
Alerts - A notification intended to be read by a human Signify that a human needs to take action immediately
Tickets - Signify that a human needs to take action, but not immediately.
Logging - No one needs to look at this information, but it is recorded for diagnostic or forensic purposes. The expectation is that no one reads logs unless something else prompts them to do so.
An accurate organic demand forecast, which extends beyond the lead time required for acquiring capacity
An accurate incorporation of inorganic demand sources into the demand forecast
Regular load testing of the system to correlate raw capacity(servers, disks, and so on) to service capacity
SREs are focused on finding ways to improve the design and operation of systems to make them more scalable, more reliable, and more efficient. However, we expend effort in this direction only up to a point: when systems are “reliable enough,” we instead invest our efforts in adding features or building new products.
SRE’s experience has found that reliable processes tend to actually increase developer agility: rapid, reliable production rollouts make changes in production easier to see. As a result, once a bug surfaces, it takes less time to find and fix that bug. Building reliability into development allows developers to focus their attention on what we really do care about—the functionality and performance of their software and systems.
https://landing.google.com/sre/sre-book/chapters/introduction/