Did you know that most of a product's cost is incurred after delivery? Estimates suggest that 40–90% of the total cost of a system is incurred after deployment.
Software systems are inherently dynamic and unstable. If we stop changing the codebase, we stop introducing bugs.
If the underlying hardware or libraries never change, neither of these components will introduce bugs.
If we freeze the current user base, we’ll never have to scale the system.
For the majority of production software systems, we want a balanced mix of stability and agility.
Ultimately, the SRE's job is to keep agility and stability in balance in the system.
The production team is responsible for the following aspects of its service(s):
Availability
Latency
Performance
Efficiency
Change management
Observability/monitoring
Emergency response
Capacity planning
Security of the entire system
Availability via
Geo-distribution of apps
Multi-node distribution of the app (using K8s features)
The desired-state feature of K8s Deployments
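As a rough illustration of the last two points, the sketch below builds a Deployment manifest as a Python dict. It assumes that "multi-node distribution" is achieved with `topologySpreadConstraints`, which is only one of several K8s options; the app name, image, and replica count are hypothetical placeholders.

```python
import json


def build_deployment(name, image, replicas=3):
    """Return a Deployment manifest (as a dict) that spreads replicas across nodes."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name},
        "spec": {
            # Desired state: the Deployment controller keeps this many pods running.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    # Assumption: spread pods so node pod counts differ by at most 1.
                    "topologySpreadConstraints": [{
                        "maxSkew": 1,
                        "topologyKey": "kubernetes.io/hostname",
                        "whenUnsatisfiable": "ScheduleAnyway",
                        "labelSelector": {"matchLabels": labels},
                    }],
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


# Hypothetical app name and image, printed as JSON.
print(json.dumps(build_deployment("web", "example.com/web:1.0"), indent=2))
```

The printed JSON could be saved to a file and applied with `kubectl apply -f`, which accepts JSON as well as YAML.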
Latency via
Geo-distribution of apps
Co-locating east-west (E-W) apps (using K8s features)
Their own deployment optimisation
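The text does not say which K8s feature is meant for co-location. One plausible candidate is pod affinity, so the sketch below assumes that choice and builds an `affinity` stanza that prefers nodes already running a peer app. The `app` label key and the peer name are hypothetical.

```python
def colocate_with(peer_app, weight=100):
    """Return an `affinity` stanza preferring nodes that run pods labelled app=peer_app."""
    return {
        "podAffinity": {
            # "Preferred" is a soft rule: the scheduler tries, but still places the pod otherwise.
            "preferredDuringSchedulingIgnoredDuringExecution": [{
                "weight": weight,
                "podAffinityTerm": {
                    "labelSelector": {"matchLabels": {"app": peer_app}},
                    # Same node; a zone-level key would give a looser form of co-location.
                    "topologyKey": "kubernetes.io/hostname",
                },
            }]
        }
    }


# Goes under spec.template.spec.affinity of the calling app's Deployment.
affinity = colocate_with("orders-db")
```

A soft preference is used here so that the pod can still be scheduled when the peer's node is full; a hard requirement would trade availability for latency.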
Performance/Scale via
Scaling services/nodes via infrastructure as code
A monitoring channel to understand performance/scale needs
Efficiency via
Relying on K8s node resource optimisation
Setting maximum and minimum resource limits per app/pod
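One way to express per-container minimum and maximum resources is a K8s `LimitRange` object. The sketch below assumes that mechanism; the object name, namespace, and quantities are hypothetical.

```python
import json


def build_limit_range(namespace, min_cpu, max_cpu, min_mem, max_mem):
    """Return a LimitRange manifest bounding CPU and memory for every container in a namespace."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "per-container-bounds", "namespace": namespace},
        "spec": {
            "limits": [{
                "type": "Container",
                "min": {"cpu": min_cpu, "memory": min_mem},
                "max": {"cpu": max_cpu, "memory": max_mem},
            }]
        },
    }


# Hypothetical bounds: 100m to 2 CPUs, 64Mi to 1Gi of memory.
print(json.dumps(build_limit_range("shop", "100m", "2", "64Mi", "1Gi"), indent=2))
```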
Change management via CI/CD, which helps with
Gaining confidence that a new version is known to be good before it reaches the production environment
Keeping the production environment reliable when a change fails (via rollback)
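The rollback idea can be sketched in a few lines. This is a simplified model, not a real pipeline: `apply` and `is_healthy` are hypothetical hooks that a CI/CD system would supply (for example, a deploy command and a post-deploy health check).

```python
def deploy_with_rollback(current_version, new_version, apply, is_healthy):
    """Apply new_version; if the health check fails, re-apply current_version.

    apply(version) and is_healthy() are hypothetical caller-supplied hooks.
    Returns the version that is left running.
    """
    apply(new_version)
    if is_healthy():
        return new_version
    # The change failed: restore the last known-good version.
    apply(current_version)
    return current_version


# Usage with stand-in hooks: the health check fails, so v1 is restored.
history = []
running = deploy_with_rollback("v1", "v2", history.append, lambda: False)
print(running, history)
```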
Observability/monitoring is used for
Analysing long-term trends
Comparing over time or across experiment groups
For example, checking whether the website is slower than it was last time
Alerting if something is broken
It is done via white-box and black-box methods (discussed in detail below)
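The "is the website slower than last time" comparison can be sketched as a percentile check over two sets of latency samples. The choice of the 95th percentile and a 10% tolerance are assumptions made for illustration, not values from the text.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of numbers (pct in 1..100)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def is_slower(previous_ms, current_ms, pct=95, tolerance=0.10):
    """True if the current percentile latency exceeds the previous one by more than tolerance."""
    return percentile(current_ms, pct) > percentile(previous_ms, pct) * (1 + tolerance)


# Hypothetical latency samples (milliseconds) from last week and this week.
last_week = [100] * 20
this_week = [100] * 10 + [150] * 10
print(is_slower(last_week, this_week))
```

Comparing a high percentile rather than the mean keeps a slow tail from being hidden by a large number of fast requests.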
Emergency response via
Ticketing
Incident response system
Capacity planning via
Analytics on historical traffic data
Regular load testing
Collecting capacity forecasts for inorganic demand sources
For example, New Year's Day, Black Friday, Big Billion Days
Machine learning
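A minimal sketch of how these inputs might combine: a linear trend fitted to historical traffic gives the organic forecast, and a multiplier accounts for an inorganic event. The linear model, the multiplier, and the 20% headroom are all assumptions for illustration; real planning would validate them against load tests.

```python
def linear_forecast(history, steps_ahead):
    """Fit a least-squares line to equally spaced samples and extrapolate steps_ahead.

    Requires at least two samples.
    """
    n = len(history)
    mean_x = (n - 1) / 2
    mean_y = sum(history) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(history))
    var = sum((x - mean_x) ** 2 for x in range(n))
    slope = cov / var
    intercept = mean_y - slope * mean_x
    return intercept + slope * (n - 1 + steps_ahead)


def required_capacity(history, steps_ahead, event_multiplier=1.0, headroom=0.2):
    """Organic forecast, scaled for an inorganic event, plus safety headroom."""
    organic = linear_forecast(history, steps_ahead)
    return organic * event_multiplier * (1 + headroom)


# Hypothetical weekly peak requests per second, with a sale expected to triple traffic.
weekly_peak_rps = [100, 110, 120, 130]
print(required_capacity(weekly_peak_rps, steps_ahead=2, event_multiplier=3.0))
```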
Security via
Security best practices
Runtime security monitoring tools like Aquasec
A classic approach for monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs.
In the modern approach,
Monitoring doesn't require a human to interpret any part of the alerting domain.
Software should be able to do the interpreting, and
humans should be notified only when they need to take action.
Monitoring uses real-time quantitative data
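To make "software does the interpreting" concrete, the sketch below pages a human only when a metric stays above its threshold for several consecutive samples, so a brief spike is not treated as actionable. Using a sustained breach as the signal, and the specific numbers, are assumptions for illustration.

```python
def should_page(samples, threshold, sustained=3):
    """Return True only if the last `sustained` samples all exceed threshold."""
    if len(samples) < sustained:
        return False
    return all(value > threshold for value in samples[-sustained:])


# Hypothetical error-rate samples: a single spike does not page, a sustained breach does.
print(should_page([0.1, 0.9, 0.1], threshold=0.5))
print(should_page([0.1, 0.9, 0.9, 0.9], threshold=0.5))
```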
In white-box monitoring, measurable data comes from inside the system. For example,
Citrix CNN products send data such as
Metrics (Counters)
Syslog and audit logs
Events
K8s object Status
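As an illustration of counters exposed from inside a system, the sketch below renders a dictionary of counters in the Prometheus text exposition format. Nothing here is taken from a Citrix product: the format choice, the metric names, and the help text are assumptions.

```python
def render_counters(counters, help_text="hypothetical counter"):
    """Render {name: value} as Prometheus text exposition lines."""
    lines = []
    for name, value in sorted(counters.items()):
        lines.append(f"# HELP {name} {help_text}")
        lines.append(f"# TYPE {name} counter")
        lines.append(f"{name} {value}")
    return "\n".join(lines) + "\n"


# Hypothetical counters that an app might serve on a /metrics endpoint.
print(render_counters({"http_requests_total": 1027, "http_errors_total": 3}))
```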
Black-box monitoring is based on testing externally visible behaviour. For example,
Request/response latency
Externally visible failures, such as HTTP errors
Login service going down
Security posture monitoring
System memory/CPU etc
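A black-box check of latency and HTTP errors can be sketched as a small probe that treats the service purely from the outside. The URL, the one-second latency budget, and the injectable `fetch` parameter (used so the probe can be exercised without a network) are assumptions for illustration.

```python
import time
import urllib.error
import urllib.request


def http_fetch(url, timeout=5.0):
    """Return the HTTP status code for a GET request to url."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # 4xx/5xx responses still carry a status code


def probe(url, fetch=http_fetch, max_latency_s=1.0):
    """Probe url from the outside and report status, latency, and overall health."""
    started = time.monotonic()
    try:
        status = fetch(url)
    except OSError as err:  # connection refused, timeout, DNS failure
        return {"ok": False, "status": None,
                "latency_s": time.monotonic() - started, "error": str(err)}
    latency = time.monotonic() - started
    ok = 200 <= status < 400 and latency <= max_latency_s
    return {"ok": ok, "status": status, "latency_s": latency, "error": None}


# Usage with a stand-in fetch; against a real service, call probe(url) with the default fetch.
print(probe("http://login.example.invalid/healthz", fetch=lambda url: 503))
```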
Tools commonly used for monitoring include
Prometheus
Splunk
Fluentd
Webhooks (such as the K8s audit webhook) and Logstash
Zipkin for tracing
Graphite
Observability is a newer concept in container environments. The term was needed because monitoring had to be extended with visibility into how apps/containers are interlinked. It is done via
Istio Service graph
K8s OpenTracing
Based on the above understanding, try analysing the picture below yourself. It will help you absorb the overall information covered above.
Any CNC product that wants to be a first-class citizen must address the above needs of the production environment. Such offerings will be a helping hand for SREs in meeting those needs.
https://kubernetes.io/docs/tasks/debug-application-cluster/audit/#log-collector-examples
https://landing.google.com/sre/sre-book/
https://www.weave.works/technologies/monitoring-kubernetes-with-prometheus/
https://opensource.com/article/19/10/open-source-observability-kubernetes
https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c