Is your container ready for a container-native environment?

Introduction

Did you know that a large share of a product's cost comes after it is delivered? Estimates suggest that 40–90% of the total costs of a system are incurred after deployment.

Layman's explanation

Software systems are inherently dynamic and unstable. A system could be perfectly stable only if nothing about it ever changed:
- If we stop changing the codebase, we stop introducing bugs.
- If the underlying hardware or libraries never change, neither of these components will introduce bugs.
- If we freeze the current user base, we'll never have to scale the system.

None of these is realistic, so for the majority of production software systems we want a balanced mix of stability and agility.

Role of SRE/DevOps

At the end of the day, the SRE's job is to keep agility and stability in balance in the system.
The production team is responsible for the following aspects of its service(s):
- Availability
- Latency
- Performance
- Efficiency
- Change management
- Observability/monitoring
- Emergency response
- Capacity planning
- Security of the entire system

How does the production team achieve this?

Availability via
- Geo-distribution of apps
- Multi-node distribution of the app (using K8s features)
- The K8s Deployment desired-state feature
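The multi-node distribution and desired-state points above can be sketched as a Deployment manifest. This is a minimal illustration rather than a production configuration: the app name `web`, the image `example.com/web:1.0`, and the replica count are hypothetical. The manifest is built as a Python dictionary and printed as JSON, which `kubectl apply -f` accepts as well as YAML.

```python
import json


def build_deployment(name: str, image: str, replicas: int) -> dict:
    """Return an apps/v1 Deployment that spreads its Pods across nodes."""
    labels = {"app": name}
    return {
        "apiVersion": "apps/v1",
        "kind": "Deployment",
        "metadata": {"name": name, "labels": labels},
        "spec": {
            # Desired state: Kubernetes works to keep this many Pods running.
            "replicas": replicas,
            "selector": {"matchLabels": labels},
            "template": {
                "metadata": {"labels": labels},
                "spec": {
                    "affinity": {
                        "podAntiAffinity": {
                            # Do not schedule two Pods of this app on one node.
                            "requiredDuringSchedulingIgnoredDuringExecution": [
                                {
                                    "labelSelector": {"matchLabels": labels},
                                    "topologyKey": "kubernetes.io/hostname",
                                }
                            ]
                        }
                    },
                    "containers": [{"name": name, "image": image}],
                },
            },
        },
    }


if __name__ == "__main__":
    # Hypothetical app name and image, for illustration only.
    manifest = build_deployment("web", "example.com/web:1.0", 3)
    print(json.dumps(manifest, indent=2))
```

With a required anti-affinity rule, Pods beyond the number of available nodes stay pending, so the softer `preferredDuringSchedulingIgnoredDuringExecution` form may suit small clusters better.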
Latency via
- Geo-distribution of apps
- Co-locating east-west (E-W) apps (using K8s features)
- Their own deployment optimisation
Performance/scale via
- Scaling services/nodes via infrastructure as code
- Monitoring channels to understand performance/scale needs
Efficiency via
- Relying on K8s node resource optimisation
- Setting per-app/Pod maximum and minimum resources
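In Kubernetes, the per-Pod minimum and maximum map to a container's `resources.requests` and `resources.limits`. The sketch below builds that stanza and checks that each request does not exceed its limit. The helper names are hypothetical, and the parsers handle only a small subset of the Kubernetes quantity syntax (CPU as cores or millicores, memory in bytes or with Ki/Mi/Gi suffixes).

```python
def cpu_millicores(quantity: str) -> int:
    """Parse a CPU quantity such as '250m' or '0.5' into millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return round(float(quantity) * 1000)


def memory_bytes(quantity: str) -> int:
    """Parse a memory quantity such as '128Mi' into bytes."""
    units = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}
    for suffix, factor in units.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: plain byte count


def build_resources(cpu_min: str, cpu_max: str, mem_min: str, mem_max: str) -> dict:
    """Return a container `resources` stanza, rejecting request > limit."""
    if cpu_millicores(cpu_min) > cpu_millicores(cpu_max):
        raise ValueError("CPU request exceeds CPU limit")
    if memory_bytes(mem_min) > memory_bytes(mem_max):
        raise ValueError("memory request exceeds memory limit")
    return {
        # requests: what the scheduler reserves for the container (minimum)
        "requests": {"cpu": cpu_min, "memory": mem_min},
        # limits: the most the container is allowed to use (maximum)
        "limits": {"cpu": cpu_max, "memory": mem_max},
    }


if __name__ == "__main__":
    # Hypothetical values, for illustration only.
    print(build_resources("250m", "1", "128Mi", "512Mi"))
```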
Change management via CI/CD, which helps in
- Having confidence that a new version is known to be good in the production environment
- Ensuring production environment reliability when a change fails (via rollback)
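One way a pipeline might decide that a change has failed is to compare the new version's error rate with that of the version it replaces. The sketch below is an assumption about how such a gate could look, not a description of any particular CI/CD tool; the names and thresholds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class VersionStats:
    requests: int
    errors: int

    @property
    def error_rate(self) -> float:
        """Fraction of requests that failed (0.0 when there is no traffic)."""
        return self.errors / self.requests if self.requests else 0.0


def should_roll_back(
    baseline: VersionStats,
    candidate: VersionStats,
    max_increase: float = 0.01,
    min_requests: int = 100,
) -> bool:
    """Return True if the candidate is clearly worse than the baseline."""
    if candidate.requests < min_requests:
        return False  # not enough traffic yet to judge the new version
    return candidate.error_rate > baseline.error_rate + max_increase


if __name__ == "__main__":
    old = VersionStats(requests=1000, errors=5)
    new = VersionStats(requests=500, errors=20)
    print(should_roll_back(old, new))
```

When the gate returns true for a Kubernetes Deployment, the rollback itself can be done with `kubectl rollout undo deployment/<name>`.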
Observability/monitoring is used for
- Analysing long-term trends
- Comparing over time or across experiment groups
  - For example, to check whether the website is slower than it was last time
- Alerting if something is broken
- It is done via white-box and black-box methods (discussed in detail below)
Emergency response via
- Ticketing
- Incident response system
Capacity planning via
- Analytics on historical traffic data
- Regular load testing
- Collecting capacity forecasts from inorganic sources
  - For example, New Year's Day, Black Friday, Big Billion Day
  - Via machine learning
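The capacity planning inputs above combine into simple arithmetic: take the historical peak, apply organic growth and an event multiplier, then divide by the per-node throughput measured in load tests. The sketch below assumes this simple model; the function names, the 30% headroom, and all figures are hypothetical.

```python
import math


def forecast_peak_rps(
    historical_peak_rps: float,
    organic_growth: float,
    event_multiplier: float = 1.0,
) -> float:
    """Scale the historical peak by organic growth and an event multiplier."""
    return historical_peak_rps * (1 + organic_growth) * event_multiplier


def nodes_needed(peak_rps: float, rps_per_node: float, headroom: float = 0.3) -> int:
    """Nodes required to serve the peak with spare capacity on top."""
    return math.ceil(peak_rps * (1 + headroom) / rps_per_node)


if __name__ == "__main__":
    # Hypothetical figures: 1000 req/s historical peak, 20% organic growth,
    # and a sale day expected to triple traffic.
    peak = forecast_peak_rps(1000, organic_growth=0.2, event_multiplier=3.0)
    # 500 req/s per node is assumed to come from regular load testing.
    print(nodes_needed(peak, rps_per_node=500))
```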

Security via
- Security best practices
- Runtime security monitoring tools like Aquasec

Monitoring in detail

A classic approach for monitoring is to watch for a specific value or condition, and then to trigger an email alert when that value is exceeded or that condition occurs.
In the modern approach:
- Monitoring doesn't require a human to interpret any part of the alerting domain.
- Software should be able to do the interpreting.
- Humans should be notified only when they need to take action.
- Monitoring uses real-time quantitative data.
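The idea that software interprets the data and humans are notified only when action is needed can be sketched as a rule that ignores short spikes and pages only on a sustained breach. The function name and thresholds below are hypothetical; Prometheus alerting rules express a similar idea with their `for` clause.

```python
from typing import Sequence


def should_page(samples: Sequence[float], threshold: float, sustained: int) -> bool:
    """Page only if the last `sustained` samples all exceed `threshold`."""
    if sustained <= 0 or len(samples) < sustained:
        return False  # not enough data to call the breach sustained
    return all(value > threshold for value in samples[-sustained:])


if __name__ == "__main__":
    # Hypothetical error ratios sampled once per minute.
    spike = [0.01, 0.90, 0.02, 0.01]
    outage = [0.01, 0.60, 0.70, 0.90]
    print(should_page(spike, threshold=0.5, sustained=3))
    print(should_page(outage, threshold=0.5, sustained=3))
```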

Monitoring types

White-box monitoring

In this type, the measured data comes from inside the system. For example, Citrix CNN products send data such as
- Metrics (counters)
- Syslog and audit logs
- Events
- K8s object status
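As an illustration of a white-box counter metric, the sketch below keeps a counter inside the application and renders it in the Prometheus text exposition format. It uses only the standard library and is not the API of any Prometheus client; the class and the metric name `http_requests_total` are hypothetical, and label values are not escaped.

```python
import threading


class Counter:
    """A monotonically increasing metric, optionally split by labels."""

    def __init__(self, name: str, help_text: str) -> None:
        self.name = name
        self.help_text = help_text
        self._values = {}  # sorted (label, value) tuple -> running total
        self._lock = threading.Lock()

    def inc(self, amount: float = 1.0, **labels: str) -> None:
        if amount < 0:
            raise ValueError("counters can only increase")
        key = tuple(sorted((k, str(v)) for k, v in labels.items()))
        with self._lock:
            self._values[key] = self._values.get(key, 0.0) + amount

    def render(self) -> str:
        """Return the counter in Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} counter",
        ]
        with self._lock:
            for key, value in sorted(self._values.items()):
                label_text = ",".join(f'{k}="{v}"' for k, v in key)
                suffix = f"{{{label_text}}}" if label_text else ""
                lines.append(f"{self.name}{suffix} {value}")
        return "\n".join(lines) + "\n"


if __name__ == "__main__":
    requests_total = Counter("http_requests_total", "Total HTTP requests.")
    requests_total.inc(code="200")
    requests_total.inc(code="500")
    print(requests_total.render(), end="")
```

A scraper such as Prometheus would read this text from an HTTP endpoint exposed by the application.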

Black-box monitoring

It is based on testing externally visible behaviour. For example:

- Latency in request/response
- Failures which are externally visible, like HTTP errors
- Login service going down
- Security posture monitoring
- System memory/CPU etc.
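A black-box check for the first three items can be sketched as a probe that calls the service from outside and records the status and latency. This is a minimal sketch using only the standard library; the names `probe` and `http_status`, the URL, and the one-second latency budget are hypothetical.

```python
import time
import urllib.error
import urllib.request
from typing import Callable, NamedTuple


class ProbeResult(NamedTuple):
    ok: bool
    status: int  # 0 when no HTTP response was received
    latency_s: float


def http_status(url: str, timeout: float = 5.0) -> int:
    """Return the HTTP status code of a GET request to `url`."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code  # urllib raises on 4xx/5xx; keep the status code


def probe(
    url: str,
    fetch: Callable[[str], int] = http_status,
    max_latency_s: float = 1.0,
) -> ProbeResult:
    """Check a service from the outside: did it answer, and fast enough?"""
    start = time.monotonic()
    try:
        status = fetch(url)
    except OSError:
        status = 0  # connection refused, DNS failure, timeout, ...
    latency = time.monotonic() - start
    ok = 200 <= status < 400 and latency <= max_latency_s
    return ProbeResult(ok, status, latency)


if __name__ == "__main__":
    # Hypothetical endpoint; replace with a real login URL to try it.
    print(probe("http://localhost:8080/login"))
```

Passing `fetch` as a parameter keeps the probe testable without a network and lets the same logic wrap other checks.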

Tools useful for monitoring

- Prometheus
- Splunk
- Fluentd
- Webhooks (like the audit webhook) and Logstash
- Zipkin for tracing
- Graphite

Observability

Observability is a newer concept in container environments. The term was needed to bring visibility into the links between apps/containers into the monitoring context. It is done via:

- Istio service graph
- K8s OpenTracing

Based on the above understanding, analyse the picture below yourself. It will help you absorb the overall material covered above.

Summary

Any CNC product that wants to be a first-class citizen must take care of the above needs of the production environment. Such offerings will be a helping hand for SREs in meeting those needs.

Reference

https://kubernetes.io/docs/tasks/debug-application-cluster/audit/#log-collector-examples

https://landing.google.com/sre/sre-book/

https://opentracing.io

https://www.weave.works/technologies/monitoring-kubernetes-with-prometheus/

https://opensource.com/article/19/10/open-source-observability-kubernetes

https://medium.com/@copyconstruct/monitoring-and-observability-8417d1952e1c
