Kubernetes Fundamentals

Kubernetes has two types of machines. The first, and most important, is the head node: the brain of Kubernetes. What does the brain of Kubernetes run? It runs an API server; a scheduler that places the containers where they need to go; a controller manager that makes sure the state of the system is what it should be; and a data store, etcd, that stores the state of the system. And sometimes, to be able to manage all of this, you also have a process called the kubelet, and of course a container engine, usually Docker (it could be something else, but most of the time it is Docker). That's what you find on the head node, the brain of Kubernetes: essentially four types of processes, the API server, the scheduler, the controller manager, and etcd.

Samples: https://github.com/sebgoa/oreilly-kubernetes/tree/master/manifests

Head Node

    • API Server

    • Scheduler

    • Controller Manager

    • etcd

    • Sometimes:

      • kubelet

      • Docker

Worker Node

The other type of node is the worker node. You can have a lot of them, depending on the size of your Kubernetes cluster. What runs on the worker node? Simply the kubelet. That's the Kubernetes agent that runs on all the Kubernetes cluster nodes. The kubelet talks to the Kubernetes API server and then talks to the local Docker daemon to manage the Docker containers running on the node. And then you also have a kube-proxy. In itself, it's not really a proxy these days, but the name remained. What is the kube-proxy? It's a system that manages the iptables rules on that node so that the traffic between the pods and the nodes is what it should be. So here you have it:

    • kubelet

    • kube-proxy (manages the node's iptables rules)

    • docker

Demystifying the Kubernetes architecture. Two types of nodes, the head node and the worker node, each running a very well-defined set of processes.

When you bootstrap a cluster with kubeadm and check systemctl status kubelet, even on the head node the kubelet will be the key agent running.

If you check the service configuration of the kubelet, you can find several environment variables, like KUBECONFIG_ARGS, used to configure certificates and so on. The SYSTEM_PODS_ARGS variable points to a manifest directory, /etc/kubernetes/manifests/, where the manifest files are located.
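
The exact contents vary between kubeadm versions, but the systemd drop-in for the kubelet looks roughly like the sketch below (treat the path and flags as an illustration, not the exact file):

# /etc/systemd/system/kubelet.service.d/10-kubeadm.conf (approximate)
[Service]
Environment="KUBELET_KUBECONFIG_ARGS=--kubeconfig=/etc/kubernetes/kubelet.conf"
Environment="KUBELET_SYSTEM_PODS_ARGS=--pod-manifest-path=/etc/kubernetes/manifests"
ExecStart=
ExecStart=/usr/bin/kubelet $KUBELET_KUBECONFIG_ARGS $KUBELET_SYSTEM_PODS_ARGS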

What happens is that systemd makes sure the kubelet is always running. The kubelet starts, looks in that directory, /etc/kubernetes/manifests, and figures out what needs to be running. It's a little bit of a nested system: the kubelet itself starts the Kubernetes components.

In the manifest directory we'd see four YAML files: one for etcd, one for the kube-apiserver, one for the kube-controller-manager, and one for the kube-scheduler.
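
On a kubeadm-bootstrapped head node, listing that directory typically looks something like this (file names can differ slightly between versions):

$ ls /etc/kubernetes/manifests
etcd.yaml  kube-apiserver.yaml  kube-controller-manager.yaml  kube-scheduler.yaml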

    • The Kubernetes API server, which exposes the API.

    • The scheduler, which decides where to put the containers.

    • You have the controller manager, which is this control loop that keeps the state of the cluster where it should be.

    • And etcd, which actually stores the state of the cluster. Note that etcd could live outside the head node, on a separate machine.

If you bootstrap a cluster with kubeadm, you're going to see etcd on the head node. Now we have systemd running the kubelet all the time. The kubelet looks into the manifest directory, sees the YAML files, and then starts those components.

Checking the etcd.yml

You see kind: it's a Pod. You see some metadata. You see a spec. What's very important in the spec is the containers. So what is this pod? The pod is the lowest compute unit in Kubernetes. And what happens is that the kubelet is always going to make sure that that pod is running; it's always going to look after that etcd pod. And what is in that etcd pod? One container that runs etcd, the key-value store.

You see that the image comes from gcr.io, the Google Container Registry (google_containers/etcd). That's the Docker image used to run etcd. And then you see some startup options to start the key-value store. With this manifest in that directory, the kubelet is always going to make sure that this is running.
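
Trimmed down, the static pod manifest for etcd looks roughly like this (the image tag and flags are illustrative; the real file is longer):

apiVersion: v1
kind: Pod
metadata:
  name: etcd
  namespace: kube-system
spec:
  hostNetwork: true
  containers:
  - name: etcd
    image: gcr.io/google_containers/etcd-amd64:3.1.10   # illustrative tag
    command:
    - etcd
    - --data-dir=/var/lib/etcd
    - --listen-client-urls=http://127.0.0.1:2379
    - --advertise-client-urls=http://127.0.0.1:2379
    volumeMounts:
    - name: etcd-data
      mountPath: /var/lib/etcd
  volumes:
  - name: etcd-data
    hostPath:
      path: /var/lib/etcd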

If you check the other manifest files, for the API server, the controller manager, and the scheduler, they are all pod specifications as well. The kubelet will start these pods and keep them running all the time. You can check it by listing them with kubectl get pods --all-namespaces. Even with docker ps you can find all the containers running.

Systemd is running, and it's only running a single unit: the kubelet, the Kubernetes agent. The kubelet looks into a directory, /etc/kubernetes/manifests, and sees the manifests to run all the components required for the head node: the scheduler, the API server, the controller manager, and etcd. So if one of them crashes, the kubelet will notice and restart it. If you change some of the options for one of the components, the kubelet will notice and restart things. But the one thing that's always running is the kubelet.

WORKER NODE

If you check the kubelet on a worker node, it looks a lot like the head node. The only difference is that the manifest directory will be empty.

Looking at the running containers on the worker, we see an image from the Google Container Registry that runs the kube-proxy. Then we see the Weaveworks images: weave-npc, the network policy controller, and weave-kube, the Weave Net agent. And then we have the pause container. We have weave-npc and weave-kube because a cluster networking add-on is installed, in this case Weave Net. We could install other networks, like Calico, Romana, or kube-router, and we would then see a different set of Docker containers implementing the network. Weave Net was chosen here because it gives us a way to have network policies.

So when I look at my running Docker containers, I see the weave network policy controller, and I see weave-kube, which uses the Kubernetes CNI, the Container Network Interface, to set up the networking properly on that worker node. That way, when containers are started as pods, they can attach to a bridge, and that bridge gives them a path to the other nodes. We're not going to go deep into the networking, but it's enabled thanks to Weave and the Kubernetes CNI plug-in.

What's interesting is the kube-proxy, the other component needed on the worker node. Remember, the kubelet runs via systemd. Here we have the proxy running as a container; it turns out it's actually managed by Kubernetes as a DaemonSet. But at the end of the day, you have the kube-proxy running as a container.

Then you have this pause container, because that's how pods are made. A pod actually uses this pause container, which sleeps all the time. The only thing the pause container does is allocate a network namespace to get one IP; the other containers in that pod then attach to that same network namespace. It's a little bit advanced, but don't be surprised if you see a pause container: you're going to see one for every pod.

And now you have the kube-proxy container. You see that it's running /usr/local/bin/kube-proxy. Interestingly, if you do a ps on the host and grep for proxy, you see that it's also visible locally as /usr/local/bin/kube-proxy. What it does is manage the iptables rules on the host, so that network traffic can be routed to the right pod.

When you run a pod and check the iptables rules on the worker node, you can see what this looks like in practice. On the worker node we have the kubelet managed by systemd, and then we have a kube-proxy that runs as a container but ends up executing /usr/local/bin/kube-proxy on the machine. This kube-proxy manages iptables, so that when you create pods, change pods, and expose them as Kubernetes services (whatever that networking abstraction is), the iptables rules magically appear on the node.
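
For example, you can peek at the NAT table on a node; the exact rules depend on your services, but the KUBE-SERVICES chain that kube-proxy maintains is usually there:

$ sudo iptables -t nat -L KUBE-SERVICES -n | head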

INSTALLATION

Whichever method you use, it is a good idea to install kubectl, the Kubernetes client.

Minikube can be used for local kubernetes - https://kubernetes.io/docs/tasks/tools/install-minikube/

kubeadm can be used to install a Kubernetes cluster. After bootstrapping with kubeadm, you will also need to install a pod network add-on.
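
A minimal bootstrap sequence looks roughly like this (the Weave Net URL is the one documented at the time; substitute the network add-on of your choice):

$ sudo kubeadm init
$ mkdir -p $HOME/.kube
$ sudo cp /etc/kubernetes/admin.conf $HOME/.kube/config
$ sudo chown $(id -u):$(id -g) $HOME/.kube/config
$ kubectl apply -f "https://cloud.weave.works/k8s/net?k8s-version=$(kubectl version | base64 | tr -d '\n')"
# on each worker, run the kubeadm join command printed by kubeadm init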

Kubernetes API

With kubectl we can do all the CRUD operations on pods.

If you execute kubectl -v=9 get pods, it will print verbose output (9 is the log level).

Under the hood, kubectl issues an HTTP call to the API endpoint to get the pods; the verbose output even shows the equivalent curl command.

kubectl is essentially just an HTTP client that talks to the API endpoint.

Run a proxy with kubectl and you will be able to access the API server endpoint using curl.

The command below starts the proxy. By default it listens on 8001, which can be changed with the --port parameter.

$ kubectl proxy

Starting to serve on 127.0.0.1:8001

Now issuing the command below will return the pod details in JSON format:

curl localhost:8001/api/v1/namespaces/default/pods

Use -X POST to create a pod:

curl -H "Content-Type: application/json" --data @busybox.json -X POST http://localhost:8001/api/v1/namespaces/default/pods
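
busybox.json is assumed to contain a minimal pod definition along these lines (the file is not shown in the notes, so this is an illustrative sketch):

{
  "apiVersion": "v1",
  "kind": "Pod",
  "metadata": {
    "name": "busybox"
  },
  "spec": {
    "containers": [
      {
        "name": "busybox",
        "image": "busybox",
        "command": ["sleep", "3600"]
      }
    ]
  }
}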

Use -X DELETE to delete the pod:

curl -X DELETE localhost:8001/api/v1/namespaces/default/pods/busybox

Running kubectl proxy allows you to hit the Kubernetes API server from your local machine, and it's very handy because you don't have to worry about TLS and authentication. What that means is that you can use curl to talk to your secured API endpoint.

NAMESPACES

A namespace is a way to create virtual clusters, so you can have what appear to be separate clusters within the same system.

To check resources in namespace

# In a namespace

kubectl api-resources --namespaced=true

# Not in a namespace

kubectl api-resources --namespaced=false

If you try to create two pods with the same name, it's going to conflict, but if you try to create two pods with the same name in two different namespaces, it's going to work. That's what's meant by virtual clusters. You could use a namespace to create a virtual cluster for user Alice, and another namespace to create a virtual cluster for user Bob, and each of these users can create pods with the same names without conflicts. So namespaces provide a scope for names: they give you name uniqueness within that scope and avoid name collisions.

kubectl get namespaces - lists the namespaces

curl -s localhost:8001/api/v1/namespaces | jq -r .items[].metadata.name (via curl)

kubectl create ns <name> - creates namespace
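
A quick way to see the name scoping in action (the namespace names alice and bob are just examples):

$ kubectl create ns alice
$ kubectl create ns bob
$ kubectl run busybox --image=busybox -n alice -- sleep 3600
$ kubectl run busybox --image=busybox -n bob -- sleep 3600    # same name, no conflict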

Quota -

You can limit the number of pods, the CPU, the memory, etc. by using a quota.

kubectl create quota foobar --hard pods=2 - Creates a quota that allows only 2 pods. With this quota in place you cannot spin up more than 2 pods.

kubectl get resourcequotas - to list the quotas

kubectl get resourcequotas foobar -o yaml - to view the full definition of the quota

Resource Quota - myquota.yml

apiVersion: v1
kind: ResourceQuota
metadata:
  name: mem-cpu-demo
spec:
  hard:
    requests.cpu: "1"
    requests.memory: 1Gi
    limits.cpu: "2"
    limits.memory: 2Gi
    pods: 30

First create a namespace and apply the quota to the namespace

kubectl create namespace mynamespace

kubectl apply -f myquota.yml --namespace mynamespace

(kubectl create quota test --hard pods=1 - creates a pod limit of 1 in the default namespace.)
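
To see how much of a quota is used versus its hard limits, describe it in the namespace (the names match the examples above):

$ kubectl describe resourcequota mem-cpu-demo --namespace mynamespace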

Labels:

Every resource in Kubernetes can be labeled. Those labels are then used by the Kubernetes system to run queries and direct traffic to specific pods, and also to count the number of pods that you have in your system, so that Kubernetes can scale your application up and down.

You can list labels

kubectl get pods --show-labels - To show the pods and labels

Set the label for pods

kubectl label pods busybox foo=bar

Labels can be used on any resource: pod, service, deployment, volume, etc. A label is part of the metadata of the resource and can be set on any resource.

To inspect the pod use, kubectl get pods busybox -o yaml

Query pods by label: kubectl get pods -l foo=bar (or use -L <key> to show that label as a column)

DaemonSet

A DaemonSet is a controller that ensures that a single pod runs on each node in the cluster. If a new node is added, the DaemonSet will add a pod to it automatically. Rolling out a DaemonSet works like creating a deployment; just set kind: DaemonSet in the application YAML, as in the sketch below.
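
A minimal sketch of a DaemonSet manifest (the name and image are placeholders; the apiVersion depends on your cluster version):

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: node-agent
spec:
  selector:
    matchLabels:
      app: node-agent
  template:
    metadata:
      labels:
        app: node-agent
    spec:
      containers:
      - name: agent
        image: busybox
        command: ["sleep", "3600"]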

API

View all API versions - kubectl api-versions

If we curl localhost:8001/api/v1 we can find all the resources exposed via the v1 API.

Not all objects are namespaced; in that output we can also find details such as whether a resource is namespaced, its short names, etc.

Basic requirements

    1. apiVersion

    2. metadata (at least a name at the minimum; labels go here)

    3. kind (like Pod, ReplicationController, etc.)

    4. spec (the specification of the resource; for pods, it is an array of containers, volumes, ...)

    5. status (runtime properties, filled in by the system)

apiVersion is the version of the API that serves this specific type. Then we have kind, the type of object: that can be a Deployment, a Pod, or anything else that exists as an object in a Kubernetes environment. Then there is metadata: the object's metadata, such as its name, namespace, and labels; exactly what goes in there depends on the type of object you are working with. Then there is spec, the specification: it contains the parameters that define the desired state of this specific object. And finally there is status, which gives current status information about the object. So status is your runtime information. It is not part of the configuration that you created as an administrator, it's not in your YAML; it's the runtime information that tells you how the object is currently doing. A minimal example tying these fields together is sketched below.
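
For instance, a minimal pod manifest showing the first four fields (status is added by the system at runtime):

apiVersion: v1              # version of the API that serves this type
kind: Pod                   # the type of object
metadata:
  name: busybox             # at least a name; labels would also go here
spec:
  containers:               # for a pod, the spec is mostly a list of containers
  - name: busybox
    image: busybox
    command: ["sleep", "3600"]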

Generator:

kubectl run ghost --image=ghost --port 2368 --expose=true

Will generate a deployment, a service, and a replica set automatically.
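
You can check what was generated; the objects take the name you gave to kubectl run:

$ kubectl get deployment ghost
$ kubectl get service ghost
$ kubectl get replicasets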

Networking

Kubernetes networking happens at three different levels. There is networking between the containers within a pod, there is networking between the pods, and apart from that, there is the external exposure of services. We need to connect pods to other pods across nodes, that's what we call east-west traffic. There is a need to do service discovery and load balancing as well, and there is a need to expose services for external clients, and that's what we call north-south traffic

All nodes running pods can communicate with all other nodes running pods without NAT; the exception is traffic coming into a pod from outside the cluster. There is no IP masquerading: the IP that a pod sees for itself is the same IP that other pods see.

Within a pod, containers communicate directly with one another. This internal communication is handled by Docker, since Kubernetes uses the pod as its lowest-level entity. Docker offers host-private networking, where a virtual bridge, docker0, is created. For each container that Docker creates, a virtual Ethernet device is created and attached to the bridge; that virtual Ethernet device really is the glue outside the container that connects the container to the bridge, like a virtual Ethernet cable. The eth0 inside the container gets an IP address from the docker0 bridge address range. As a result, Docker containers can communicate with one another only if they're on the same host. And since containers within a pod really are processes running on the same host, they can find each other via localhost.

Pod-to-Pod: to organize pod-to-pod networking, there is CNI, the Container Network Interface. CNI provides plugins that implement pod-to-pod networking, and these plugins can work at three different levels to implement three different types of networking:

    • Layer2 (switching)

    • Layer3 (routing)

    • Overlay networking - Like software defined networking

If the hosts are on the same network, the communication can happen using ARP (Address Resolution Protocol), which is a general protocol used in all IP-based networks.

In layer two-based networking, you need a bridge plugin. In the example below we have the name kubenet, the type bridge, the bridge kube-bridge, the default gateway set to true, and then the ipam configuration set to host-local, with a subnet as well. You just need to make sure this plugin configuration exists on all of the physical Kubernetes hosts to create a layer two networking setup. Bridge plugin example:

{
  "name": "kubenet",
  "type": "bridge",
  "bridge": "kube-bridge",
  "isDefaultGateway": true,
  "ipam": {
    "type": "host-local",
    "subnet": "10.1.0.0/16"
  }
}

A layer two configuration is not scalable; for scalability, use routing rather than switching.

Layer three networking is significantly more complicated, but you will need it because a layer two configuration is not scalable. If you want scalability, you should consider routing. You cannot use NAT (network address translation) in Kubernetes pod-to-pod networking, but there is no restriction telling you that you cannot use routing, and that's exactly what CNI layer three is about. To implement it, different plugins are available; the Flannel plugin is one of the most common, but others exist as well.

Overlay solutions can be used to define the network completely in software and use encapsulation, which looks a lot like a VPN, to send packets from one pod to another. To use encapsulation, a tunnel interface is needed. Common encapsulation mechanisms such as VXLAN and GRE can be used to do so.

brctl show - to show the bridges and the interfaces attached to them

for i in $(docker ps -q); do docker inspect -f '{{ .NetworkSettings.IPAddress }}' $i; done - Get the IP address of every container on the host.

We can see that some containers do not have an IP. The reason is that for those containers the IP is managed at the pod level rather than the container level. If you list all the pod IPs with kubectl get pods -o wide, you will find that many of the container IPs show up as the pod IP.

Port-Forwarding: Administrators can use port-forwarding to expose a port of a pod on the host machine:

kubectl port-forward pod/mypod 8888:5000 - Here 8888 is the local port and 5000 is the pod port. When the pod is restarted, the forwarding has to be run again. Port-forwarding works for HTTP, SSH, and other TCP-based protocols.
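
Assuming the pod serves HTTP on port 5000, you can then reach it locally while the forward is running:

$ kubectl port-forward pod/mypod 8888:5000 &
$ curl http://localhost:8888/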

MY UNDERSTANDING

Pods are the smallest unit of deployment. When you create a bare pod, it runs independently, and when it goes down it is not recovered. To have a mechanism that restarts pods automatically and maintains a desired number of pods, we use a ReplicaSet (the older equivalent is the ReplicationController, which does not support set-based selectors). With a ReplicaSet, we spin up pods and keep them at the desired count. A Service stands apart from the ReplicaSet and the pods; its only function is to expose a cluster IP based on a selector, i.e. it keeps watching the selector (say app=web) and requests are served once matching pods are available. The Service's primary job is to manage the IP, DNS name, and port, so the Service runs independently of the pods, just exposing the service IP/cluster IP. (Further, a Service can also run without a selector, in which case you need to create a kind: Endpoints object and provide the external IP and port.)
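
A minimal sketch of a Service selecting pods labeled app=web (names and ports are placeholders):

apiVersion: v1
kind: Service
metadata:
  name: web
spec:
  selector:
    app: web
  ports:
  - port: 80           # the cluster IP port
    targetPort: 8080   # the port the pods listen on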

Normally we don't create ReplicaSets directly; instead we create Deployments, which create the ReplicaSet for us. The layering is pod spec --> ReplicaSet --> Deployment: the Deployment creates the ReplicaSet, and the ReplicaSet holds the details about the pods, so the Deployment ends up holding all the details about both. Because we work at the Deployment level, we can manage blue/green or canary deployments.

Deployment - when you edit a Deployment (say moving from 1.1 to 1.2), it creates a new ReplicaSet, scales the new ReplicaSet up, and reduces the number of replicas in the previous ReplicaSet. This makes it easy to roll back to the previous version, just by scaling the old ReplicaSet up and the new one down. After editing a Deployment, check kubectl get rs, which will show the details. This gives us the ability to do rolling updates. Deployments, in effect, are used to create and automate ReplicaSets.
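
In practice you rarely scale the ReplicaSets by hand; the Deployment exposes rollout commands for this (myapp and the container name web are placeholders):

$ kubectl set image deployment/myapp web=myimage:1.2    # triggers a new ReplicaSet
$ kubectl rollout status deployment/myapp
$ kubectl rollout history deployment/myapp
$ kubectl rollout undo deployment/myapp                 # roll back to the previous ReplicaSet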

HELM - Package manager for Kubernetes

With Helm, you can browse repositories of packages and install them on your cluster with a single command.

With Helm, we have two things. We have a client, it's Helm, that you install on your local machine. And there is a server side which is called Tiller. Tiller gets installed inside your Kubernetes cluster thanks to the client. There's a command that's Helm init. When you execute that command, the client creates a deployment inside your cluster. And that deployment installs Tiller, the server side of Helm. With the client, you can browse packages in a repository and you can install them. The packages that you install with Helm are called charts, again, a reference to the nautical theme of containers (https://github.com/helm/charts)

Charts are actually bundles of templatized manifests. When you install a package, values are substituted into the templates and the resulting manifests are created; all of this is done by Tiller, the server side of Helm. You can browse repos, add repos, and install packages, and then all your applications are deployed on your cluster via the Helm package manager.
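
A typical Helm 2 session looks roughly like this (the chart name is an example from the stable repository of that era):

$ helm init                   # installs Tiller into the cluster
$ helm repo update
$ helm search mysql           # browse charts
$ helm install stable/mysql   # install a chart as a release
$ helm list                   # list installed releases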

Extending K8S API - Custom Resource Definition (CRD)

The Kubernetes API has a solid core. We've seen some of the primitives, like pods, services, and deployments. But the magic of this API is that we can extend it, on the fly, dynamically, without recompiling the code and without restarting the API server. It's really mind boggling. This used to be called a ThirdPartyResource and has been renamed CustomResourceDefinition, or CRD.

apiVersion: apiextensions.k8s.io/v1beta1
kind: CustomResourceDefinition
metadata:
  name: databases.foo.bar
spec:
  group: foo.bar
  version: v1
  scope: Namespaced
  names:
    plural: databases
    singular: database
    kind: DataBase
    shortNames:
    - db

kubectl create -f crd.yml - Create CRD

kubectl get crd - Get the CRD

Create a manifest for a custom resource of the new type

apiVersion: foo.bar/v1
kind: DataBase
metadata:
  name: my-db
spec:
  type: mysql

kubectl create -f my-crd.yml - Create the new object using CRD

kubectl get db - Get the newly created db

kubectl get db my-db -o yaml - Will get you the full manifest of the db

We have extended the Kubernetes API on the fly by defining a new resource type with a Kubernetes manifest. Of course, nothing is going to happen yet: the MySQL database that I want created is not going to appear magically. But I now have an API endpoint, managed by Kubernetes, with all the create, read, update, delete operations, all the CRUD operations, for that endpoint. I can now write an operator, or controller, that watches this new endpoint, and when I create a database like my-db, my controller should actually instantiate a MySQL database. So you also need to write a controller to actually do the work.