When I started to learn Kubernetes, my first question was how to open the cluster to outside users. I added a Kubeflow application to the cluster and wanted to expose it to them. I found that the Service is provided for exactly this purpose, but I could not figure out how to use it. The application could be reached through a 192.168.x.x IP from every machine in the cluster, but that IP did not work outside the cluster.
The first thing I tried was to make that 192.168.x.x address reachable from outside the cluster. Since it only works from machines inside the cluster, I set up an nginx proxy to forward all incoming traffic to the 192.168.x.x IP, and it worked: nginx receives traffic on the public IP address and proxies it to the IP on the tunl0 interface.
However, this raises a question: why does k8s have no native way to meet such an obvious requirement?
Then I found that the Service has a field named type, which can be ClusterIP/NodePort/LoadBalancer. My first reaction was to change the type to LoadBalancer, since loadBalancerIP was always empty. Even after I set the IP to the address of my NAT router, it still did not work.
Therefore, I went back to the documentation, which says that I need an external load balancer (or a load-balancer controller) that integrates with Kubernetes. I honestly have no idea whether my router can satisfy this requirement. Then I tried NodePort and it worked. However, I was very confused by its settings: each service has port, nodePort, and targetPort, and I could not tell the difference between them. I made lots of mistakes before I figured out a working setup.
Even after I got it working, I did not understand why it worked. In this article I will try my best to explain everything in detail, for people who are not familiar with Kubernetes.
Kubernetes defines everything as resources. Traditionally, we would collect all the settings of a system in an ini file and create multiple sections inside it for the different parts of the system. Sometimes, if the system is complex, we may have multiple configuration files, but we would still put all of them under the same directory.
In Kubernetes, however, those configuration objects do not live directly in the filesystem. They are hosted by the etcd service. etcd is a distributed key-value store that runs on the control-plane nodes of the cluster. All of these servers maintain a copy of the configuration, so the system keeps working even if some of them crash; the remaining servers continue to serve the configuration for every application in the cluster.
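If your cluster was set up with kubeadm, you can see the etcd members running as static pods on the control-plane nodes (a quick check, assuming the standard component=etcd label):

kubectl -n kube-system get pods -l component=etcd -o wide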
Therefore, if we want to change the system, we change the corresponding resource objects. We can use
kubectl get all -A
to list the most common resources in every namespace of the cluster. You should see thousands of objects in a working system.
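If you want the catalog of resource types themselves, including any custom resource definitions (CRDs) installed by add-ons such as Kubeflow, you can run:

kubectl api-resources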
The Service is one of these resource types. According to the k8s documentation,
the Service can expose an application running inside the cluster. This is exactly what we want. However, when I opened my Service object, every IP inside it started with 10.x.x.x, which are private cluster IP addresses.
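You can dump a complete Service object yourself with kubectl. For example, for the testssh service that is used as the running example in the rest of this article:

kubectl -n d000018238 get service testssh-ssh-service -o yaml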
In addition, look at the port definition:
ports:
- name: tcp-ssh
  nodePort: 32696
  port: 2222
  protocol: TCP
  targetPort: 22
We have multiple ports here. It is easy to guess that targetPort is the port of the container, but what is the difference between port and nodePort?
The port is the port of the Service itself. Each Service has its own IP: although a Service is not attached to any particular container, it still has a unique IP address. Inside any pod in the cluster, we can resolve the domain name servicename.namespace.svc.cluster.local, and it will return the clusterIP of the Service.
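You can verify this from inside any pod, assuming the container image ships a DNS tool such as nslookup; with the testssh pod used later in this article it would look like:

kubectl -n d000018238 exec testssh-0 -- nslookup testssh-ssh-service.d000018238.svc.cluster.local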
In the beginning, I thought the clusterIP of a Service pointed to the IP of the pod associated with the Service. Only after using k8s for almost two years did I understand why this is wrong. The reason is simple: each Service may be backed by multiple pods, and each pod has its own IP address, so we cannot use any single one of them directly. The Service works like a virtual server on a NAT router: we can add the IPs of multiple pods as destinations of the virtual server, and the "router" distributes the incoming traffic to one of those IPs for each session.
Since the Service has its own IP, it also has its own set of ports. Because every Service gets the complete range of 65536 ports, when we use NodePort the port on the node is generally not the same as the port of the Service. NodePorts are allocated from a reserved range on the physical machine, 30000-32767 by default. For each service we allocate one port from this range on the node, and each of them must be unique.
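Putting the three ports together, a minimal NodePort Service for the testssh example could look like the sketch below (the selector label is my assumption; the rest matches the values that appear in this article):

apiVersion: v1
kind: Service
metadata:
  name: testssh-ssh-service
  namespace: d000018238
spec:
  type: NodePort
  selector:
    app: testssh        # assumption: the label carried by the testssh-0 pod
  ports:
  - name: tcp-ssh
    port: 2222          # port on the Service's clusterIP
    targetPort: 22      # port of sshd inside the container
    nodePort: 32696     # port opened on every node, from the 30000-32767 range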
Therefore, we need to set up rules in the "router" that perform DNAT:
nodeIP:nodePort ----> serviceIP:2222 ----> podIP:targetPort
This is usually implemented with a set of iptables rules. In the remaining part of this article, we will explain how.
For each Service, k8s automatically generates an EndpointSlice. However, when you check its content, you may find that there is nothing inside it. This is normal: the EndpointSlice holds information about the pods backing the Service, so if you have not started any pods for the Service, it will be empty. Once the pods are running, you should see something like this:
- addresses:
  - 192.168.191.22
  conditions:
    ready: true
    serving: true
    terminating: false
  nodeName: gpunode054
  targetRef:
    kind: Pod
    name: testssh-0
    namespace: d000018238
    uid: e02cbb43-22b3-4678-83e2-b3dbf0bc5356
ports:
- name: tcp-ssh
  port: 22
  protocol: TCP
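The fragment above comes from dumping the EndpointSlice with kubectl. The slice name is auto-generated, so list the slices first and then dump the one that belongs to the service:

kubectl -n d000018238 get endpointslices
kubectl -n d000018238 get endpointslice <slice-name> -o yaml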
The IP 192.168.191.22 is not a 10.x.x.x cluster IP like the ones we saw in the Service. It is the pod's IP, randomly allocated from the 192.168.x.x range. Why? The clusterIP is a virtual address that only works through the service layer inside the cluster, while the EndpointSlice has to describe endpoints that traffic can actually be routed to between physical machines, so it records a routable address. The 192.168.x.x range belongs to the internal tunnel (pod) network. You can use route -n to see it:
/home/wycc# route -n
Kernel IP routing table
Destination Gateway Genmask Flags Metric Ref Use Iface
0.0.0.0 10.100.4.254 0.0.0.0 UG 0 0 0 bond0
10.100.4.0 0.0.0.0 255.255.255.0 U 0 0 0 bond0
172.17.0.0 0.0.0.0 255.255.0.0 U 0 0 0 docker0
192.168.63.64 10.100.4.50 255.255.255.192 UG 0 0 0 tunl0
192.168.63.128 10.100.4.50 255.255.255.192 UG 0 0 0 tunl0
192.168.94.128 10.100.4.53 255.255.255.192 UG 0 0 0 tunl0
192.168.94.192 10.100.4.53 255.255.255.192 UG 0 0 0 tunl0
192.168.95.192 10.100.4.52 255.255.255.192 UG 0 0 0 tunl0
192.168.96.0 10.100.4.52 255.255.255.192 UG 0 0 0 tunl0
192.168.148.128 10.100.4.51 255.255.255.192 UG 0 0 0 tunl0
192.168.148.192 10.100.4.51 255.255.255.192 UG 0 0 0 tunl0
192.168.191.0 10.100.4.54 255.255.255.192 UG 0 0 0 tunl0
192.168.191.64 10.100.4.54 255.255.255.192 UG 0 0 0 tunl0
192.168.207.128 0.0.0.0 255.255.255.192 U 0 0 0 *
192.168.207.164 0.0.0.0 255.255.255.255 UH 0 0 0 cali1b57fcc86ad
192.168.207.165 0.0.0.0 255.255.255.255 UH 0 0 0 cali450446c4e93
192.168.207.166 0.0.0.0 255.255.255.255 UH 0 0 0 calicfaaba7036b
192.168.207.167 0.0.0.0 255.255.255.255 UH 0 0 0 caliacf538c0138
192.168.207.168 0.0.0.0 255.255.255.255 UH 0 0 0 cali5c1de35af6f
Each 192.168.x.x subnet has an entry telling us which node can route it, along with the associated gateway. Therefore, we know how to route this traffic.
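You can cross-check this against the pod itself: kubectl prints the pod's IP and the node it runs on, e.g.

kubectl -n d000018238 get pod testssh-0 -o wide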
However, since we have physical addresses, tunnel addresses, and cluster addresses, each with their own ports, we need a NAT router to translate between them. When we check the routing table, we only see the 192.168.x.x and physical IPs; we never see a clusterIP. Yet the clusterIP is what is used inside the cluster. Therefore, we need a set of iptables rules that translate (DNAT and masquerade) the cluster IP into either a 192.168.x.x address or a physical address.
This is actually the job of the kube-proxy daemon. It watches the Service and EndpointSlice objects and adjusts the iptables rules so that packets are DNATed and masqueraded correctly.
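On a kubeadm-style cluster, kube-proxy runs as a DaemonSet in kube-system, and its mode (iptables or ipvs) is recorded in its ConfigMap. A quick way to check, assuming the standard names:

kubectl -n kube-system get pods -l k8s-app=kube-proxy -o wide
kubectl -n kube-system get configmap kube-proxy -o yaml | grep mode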
iptables is the core of the Linux firewall, and it is also what turns a Linux machine into a NAT router: it provides everything we need for that job. kube-proxy uses it to route and masquerade packets between the different kinds of IP addresses. The following command lists all the NAT rules:
# iptables -t nat -nvL
We can see lots of rules. Let's follow the rules related to a NodePort service; you should be able to work out the function of the others yourself after reading this article.
All incoming packets first enter the PREROUTING chain of the nat table. If you are not familiar with iptables, please read an introduction to it before continuing with the rest of this article.
/home/wycc# iptables -t nat -nvL PREROUTING
Chain PREROUTING (policy ACCEPT 49M packets, 3725M bytes)
pkts bytes target prot opt in out source destination
49M 3741M cali-PREROUTING 0 -- * * 0.0.0.0/0 0.0.0.0/0 /* cali:6gwbT8clXdHdC1b1 */
49M 3741M KUBE-SERVICES 0 -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service portals */
21M 1224M DOCKER 0 -- * * 0.0.0.0/0 0.0.0.0/0 ADDRTYPE match dst-type LOCAL
The KUBE-SERVICES chain holds all the service-related rules.
/home/wycc# iptables -t nat -nvL KUBE-SERVICES
Chain KUBE-SERVICES (2 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-SVC-OVUML44H2TQBLJIO 6 -- * * 0.0.0.0/0 10.102.250.198 /* dm1261010/ppimages:http-ppimages cluster IP */ tcp dpt:80
0 0 KUBE-SVC-UA5UY3X4RW2BMASX 6 -- * * 0.0.0.0/0 10.99.127.71 /* kubeflow/notebook-controller-service cluster IP */ tcp dpt:443
0 0 KUBE-SVC-3HNW2AGXVPRAVLFB 6 -- * * 0.0.0.0/0 10.110.153.173 /* cgu/node-resource-monitor:metrics cluster IP */ tcp dpt:8080
0 0 KUBE-SVC-CV3TQLPNH6GDKAYR 6 -- * * 0.0.0.0/0 10.100.173.39 /* b1144150/vit-model-predictor-00002:http cluster IP */ tcp dpt:80
.....
.....
3844 231K KUBE-NODEPORTS 0 -- * * 0.0.0.0/0 0.0.0.0/0 /* kubernetes service nodeports; NOTE: this must be the last rule in this chain */ ADDRTYPE match dst-type LOCAL
We can see lots of KUBE-SVC-XXXX rules, each associated with one cluster IP address. These rules handle internal cluster traffic that uses the clusterIP addresses directly, as well as external traffic that has already been DNATed to a clusterIP address.
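As a side note, you can confirm that these clusterIP rules work by hitting a cluster IP directly from any node, for example the ppimages service from the dump above (assuming it really serves plain HTTP on port 80):

curl http://10.102.250.198/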
NodePort traffic does not match any of the rules above, so it falls through to the last rule and enters the KUBE-NODEPORTS chain.
/home/wycc# iptables -t nat -nvL KUBE-NODEPORTS
Chain KUBE-NODEPORTS (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-EXT-3HNW2AGXVPRAVLFB 6 -- * * 0.0.0.0/0 0.0.0.0/0 /* cgu/node-resource-monitor:metrics */ tcp dpt:30135
....
....
0 0 KUBE-EXT-G26QRBIMEQ5RWSUM 6 -- * * 0.0.0.0/0 0.0.0.0/0 /* d000018238/testssh-ssh-service:tcp-ssh */ tcp dpt:32696
0 0 KUBE-EXT-6V3AP55S45OB74OG 6 -- * * 0.0.0.0/0 0.0.0.0/0 /* prometheus/kube-prom-stack-kube-prome-prometheus:reloader-web */ tcp dpt:30738
Inside this chain, the traffic is dispatched to a KUBE-EXT-XXXX chain for each service, selected by the nodePort. External traffic has to connect to the nodePort: for example, to reach the testssh-ssh-service, a client connects to the IP of any node on port 32696.
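For instance, from a machine outside the cluster you could reach the SSH server behind this service with something like the following (10.100.4.54 is one of the node IPs from the routing table above, but any node IP works; the user name depends on the container):

ssh -p 32696 someuser@10.100.4.54

Such a packet then enters the KUBE-EXT chain for this service: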
/home/wycc# iptables -t nat -nvL KUBE-EXT-G26QRBIMEQ5RWSUM
Chain KUBE-EXT-G26QRBIMEQ5RWSUM (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ 0 -- * * 0.0.0.0/0 0.0.0.0/0 /* masquerade traffic for d000018238/testssh-ssh-service:tcp-ssh external destinations */
0 0 KUBE-SVC-G26QRBIMEQ5RWSUM 0 -- * * 0.0.0.0/0 0.0.0.0/0
In this chain, we mark the traffic so that it can be masqueraded later.
/home/wycc# iptables -t nat -nvL KUBE-SVC-G26QRBIMEQ5RWSUM
Chain KUBE-SVC-G26QRBIMEQ5RWSUM (2 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ 6 -- * * !192.168.0.0/16 10.101.164.84 /* d000018238/testssh-ssh-service:tcp-ssh cluster IP */ tcp dpt:2222
0 0 KUBE-SEP-VGOZ5JN35BD2HCPH 0 -- * * 0.0.0.0/0 0.0.0.0/0 /* d000018238/testssh-ssh-service:tcp-ssh -> 192.168.191.22:22 */
In this chain, the first rule marks traffic that reaches the cluster IP from outside the pod network (source not in 192.168.0.0/16) so that it will be masqueraded. The second rule sends the packet on to the endpoint chain, which picks the backing pod.
/home/wycc# iptables -t nat -nvL KUBE-SEP-VGOZ5JN35BD2HCPH
Chain KUBE-SEP-VGOZ5JN35BD2HCPH (1 references)
pkts bytes target prot opt in out source destination
0 0 KUBE-MARK-MASQ 0 -- * * 192.168.191.22 0.0.0.0/0 /* d000018238/testssh-ssh-service:tcp-ssh */
0 0 DNAT 6 -- * * 0.0.0.0/0 0.0.0.0/0 /* d000018238/testssh-ssh-service:tcp-ssh */ tcp to:192.168.191.22:22
Finally, we see the DNAT rule. Incoming traffic is DNATed to 192.168.191.22, the pod's IP on the tunnel network, which belongs to the address block of the node hosting the pod.
# route -n
192.168.191.0 10.100.4.54 255.255.255.192 UG 0 0 0 tunl0
The last question is where 192.168.191.22 actually lives, and the answer is in the node's routing table. From the entry above we can tell that this address block is routed via 10.100.4.54, so after the DNAT rule rewrites the destination to 192.168.191.22, the packet is forwarded to 10.100.4.54.
Then it is the responsibility of 10.100.4.54 to route the packet to the pod that hosts the service. End of the story.
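If you want to watch this happen, you can capture the DNATed packets on the tunnel interface, assuming tcpdump is installed on the node:

tcpdump -i tunl0 -nn tcp port 22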