Do you think Docker and rkt are magic? A container behaves like a lightweight VM and seems to do a lot of magical things to get there. In reality, it is a user-friendly wrapper over Linux kernel features.
Technical explanation
Often thought of as cheap VMs, containers are just isolated groups of processes running on a single host. That isolation leverages several underlying technologies built into the Linux kernel: namespaces, cgroups, chroots and lots of terms you've probably heard before.
Building a container by hand comes down to a handful of steps:
Get the base file system (known as the base image)
Use chroot to isolate the file system from the host
Use unshare for namespace isolation
Use mount to mount files or directories into the container as volumes
Use cgroups to set resource quotas for the container
Use setcap and capsh to set Linux capabilities for the container
Container images, the things you download from the internet, are literally just tarballs (or tarballs in tarballs if you're fancy). The least magic part of a container is the files you interact with.
The tarball below holds something that looks like a Debian file system and will be our playground for isolating processes.
Base file system for creating container
deepak@containerd121:~/deepak/owncontaier$ wget https://github.com/ericchiang/containers-from-scratch/releases/download/v0.1.0/rootfs.tar.gz
deepak@containerd121:~/deepak/owncontaier$ sha256sum rootfs.tar.gz
c79bfb46b9cf842055761a49161831aee8f4e667ad9e84ab57ab324a49bc828c rootfs.tar.gz
First, explode the tarball and poke around.
File system in the downloaded tarball
deepak@containerd121:~/deepak/owncontaier$ tar -zxvf rootfs.tar.gz
deepak@containerd121:~/deepak/owncontaier$ ls rootfs
bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var
The resulting directory looks an awful lot like a Linux system. There's a bin directory with executables, an etc with system configuration, a lib with shared libraries, and so on.
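To make the "images are just tarballs" point concrete, here is a minimal Python 3 sketch that checksums an archive and lists the directories directly under its wrapping folder. The function names and the tiny in-memory demo image are hypothetical; they only stand in for the real rootfs.tar.gz, which is not downloaded here.

```python
import hashlib
import io
import tarfile


def sha256_hex(data: bytes) -> str:
    """Return the hex SHA-256 digest, comparable to `sha256sum` output."""
    return hashlib.sha256(data).hexdigest()


def top_level_entries(tar_bytes: bytes) -> list:
    """List the distinct names directly under the archive's wrapping directory."""
    names = set()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:*") as archive:
        for member in archive.getmembers():
            parts = member.name.strip("/").split("/")
            # parts[0] is the wrapping directory (e.g. "rootfs"); skip it.
            if len(parts) >= 2:
                names.add(parts[1])
    return sorted(names)


def make_demo_image() -> bytes:
    """Build a tiny, hypothetical gzip-compressed image in memory."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in ("rootfs/bin/sh", "rootfs/etc/hostname", "rootfs/usr/bin/env"):
            payload = b"demo\n"
            info = tarfile.TarInfo(name=path)
            info.size = len(payload)
            archive.addfile(info, io.BytesIO(payload))
    return buffer.getvalue()
```

Calling top_level_entries on the bytes of the real tarball should give a listing comparable to the ls rootfs output above, assuming the archive wraps everything in a single rootfs directory.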
The chroot command allows us to restrict a process' view of the file system. In this case, we'll restrict our process to the “rootfs” directory and then exec a shell.
In the command below, we set rootfs as the root directory and /bin/bash as the shell.
chroot to set root directory
deepak@containerd121:~/deepak/owncontaier$ ls
rootfs rootfs.tar.gz
deepak@containerd121:~/deepak/owncontaier$ ls -ltrh /usr/local/bin/python
ls: cannot access '/usr/local/bin/python': No such file or directory
deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs /bin/bash
root@containerd121:/#
root@containerd121:/# ls -ltrh /usr/local/bin/python
lrwxrwxrwx 1 1001 1001 7 Sep 24 2016 /usr/local/bin/python -> python3
root@containerd121:/# which python
/usr/local/bin/python
root@containerd121:/# /usr/bin/python -c 'print "Hello, container world!"'
Hello, container world!
root@containerd121:/# exit
deepak@containerd121:~/deepak/owncontaier$
deepak@containerd121:~/deepak/owncontaier$ which python
/usr/bin/python
It's worth noting that the above works because of everything baked into the tarball. When we execute the Python interpreter, we're executing rootfs/usr/bin/python, not the host's Python. That interpreter depends on shared libraries and device files that have been intentionally included in the archive.
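The path translation that chroot performs can be sketched in a few lines of Python 3. The helper names are hypothetical, and host_path deliberately ignores symlinks, so treat it as an illustration of the idea rather than what the kernel does internally. The enter_chroot function uses the standard os.chroot call and needs root, so it is shown but not run here.

```python
import os


def host_path(rootfs: str, inner_path: str) -> str:
    """Map an absolute path seen inside the chroot to its location on the host.

    Simplification: symlinks inside the root are not resolved.
    """
    # Normalizing against "/" means ".." can never climb above the new root.
    normalized = os.path.normpath("/" + inner_path.lstrip("/"))
    return os.path.join(rootfs, normalized.lstrip("/"))


def enter_chroot(rootfs: str, argv: list) -> None:
    """Replace this process with argv, confined to rootfs (requires root)."""
    os.chroot(rootfs)  # from here on, "/" refers to rootfs
    os.chdir("/")      # do not keep a working directory outside the new root
    os.execv(argv[0], argv)
```

For example, host_path("rootfs", "/usr/bin/python") yields rootfs/usr/bin/python, which is the file that actually runs when the chrooted shell invokes /usr/bin/python.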
Instead of a shell, we can also run an application directly in our chroot, which is loosely comparable to running a command with docker exec.
Executing a program within chroot
deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs python -m SimpleHTTPServer
Fatal Python error: Failed to open /dev/urandom
Aborted (core dumped)
The command aborts here because the interpreter cannot open /dev/urandom inside the chroot. A likely cause is that the tarball was extracted without sudo, so the device nodes under rootfs/dev were never created; extracting it with sudo should avoid this.
The chroot above doesn't provide PID namespace isolation. In other words, from within the chroot we can still see host processes.
The example below shows that a cat process started on the host can be seen, and killed, from within rootfs.
Host processes visible within the rootfs
deepak@containerd121:~/deepak/owncontaier$ cat &
[1] 130840
deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs /bin/bash
root@containerd121:/# ps ax | grep 130840
1747 ? S+ 0:00 grep 130840
130840 ? T 0:00 cat
root@containerd121:/# exit
deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs /bin/bash
root@containerd121:/# kill -9 130840
root@containerd121:/# ps ax | grep 130840
3162 ? S+ 0:00 grep 130840
So, we need namespace isolation here.
Creating a namespace is super easy: it takes a single syscall, unshare, with one argument. The unshare command line tool gives us a nice wrapper around this syscall and lets us set up namespaces manually. In this case, we'll create a PID namespace for the shell, then execute the chroot as in the last example.
Namespace isolation of the chroot container
deepak@containerd121:~/deepak/owncontaier$ cat &
[1] 9208
deepak@containerd121:~/deepak/owncontaier$ sudo unshare -p -f --mount-proc=$PWD/rootfs/proc chroot rootfs /bin/bash
root@containerd121:/# ps ax
PID TTY STAT TIME COMMAND
1 ? S 0:00 /bin/bash
2 ? R+ 0:00 ps ax
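The reason /proc has to be remounted is that tools such as ps build their listing from the numeric directories under /proc. The Python 3 sketch below mimics that lookup; list_pids and its proc_root parameter are hypothetical names, with proc_root exposed only so the function can be pointed at a test directory.

```python
import os


def list_pids(proc_root: str = "/proc") -> list:
    """Return the PIDs visible under proc_root, i.e. its all-digit directories."""
    pids = []
    for name in os.listdir(proc_root):
        # Entries such as "self" or "uptime" are not processes; skip them.
        if name.isdigit() and os.path.isdir(os.path.join(proc_root, name)):
            pids.append(int(name))
    return sorted(pids)
```

Run inside the unshared shell above, list_pids() would be expected to return only the handful of PIDs that live in the new namespace, starting with 1.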
Our chroot container still sees the host's networking, as the example below shows: the same 10 addresses are listed on the host and inside the container.
Host networking is visible to the chroot container
deepak@containerd121:~/deepak/owncontaier$ ip addr show | grep inet
inet 127.0.0.1/8 scope host lo
inet6 ::1/128 scope host
inet 10.106.175.121/24 brd 10.106.175.255 scope global eth0
inet6 fe80::5866:f7ff:fe3b:75b1/64 scope link
inet 10.88.0.1/16 scope global cni0
inet6 fe80::84da:c4ff:fefc:6316/64 scope link
inet6 fe80::f426:c6ff:fe36:9e6b/64 scope link
inet6 fe80::f86b:9aff:fe65:678c/64 scope link
inet 192.168.170.0/32 brd 192.168.170.0 scope global tunl0
inet6 fe80::ecee:eeff:feee:eeee/64 scope link
deepak@containerd121:~/deepak/owncontaier$ ip addr show | grep inet | wc -l
10
deepak@containerd121:~/deepak/owncontaier$ sudo unshare -p -f --mount-proc=$PWD/rootfs/proc chroot rootfs /bin/bash
root@containerd121:/# ip addr show | grep inet | wc -l
10
root@containerd121:/# exit
exit
So, we may need to create our own network and link it to our chroot container. In some cases, it's fine for a container to share the host's networking.
A powerful aspect of namespaces is their composability.
Processes may choose to separate some namespaces but share others.
For instance, it may be useful for two programs to have isolated PID namespaces but share a network namespace (e.g. Kubernetes pods).
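Composability falls out of how the unshare syscall takes its single argument: a bitmask with one flag per namespace type. The Python 3 sketch below combines those flags. The numeric values are the CLONE_NEW* constants from the Linux headers; the dictionary keys and the unshare_flags helper are hypothetical names chosen for this illustration.

```python
# Namespace flags as defined in <linux/sched.h>.
CLONE_FLAGS = {
    "mount": 0x00020000,   # CLONE_NEWNS
    "cgroup": 0x02000000,  # CLONE_NEWCGROUP
    "uts": 0x04000000,     # CLONE_NEWUTS
    "ipc": 0x08000000,     # CLONE_NEWIPC
    "user": 0x10000000,    # CLONE_NEWUSER
    "pid": 0x20000000,     # CLONE_NEWPID
    "net": 0x40000000,     # CLONE_NEWNET
}


def unshare_flags(namespaces) -> int:
    """OR together the flags for every namespace that should be separated."""
    flags = 0
    for name in namespaces:
        flags |= CLONE_FLAGS[name]
    return flags
```

Any namespace left out of the mask stays shared with the parent, which is how two processes can have separate PID namespaces while still sharing one network namespace.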
Run the chroot container in one shell and, from another shell on the host machine, look up the PID of the container's bash process.
Finding the PID of the chroot container
root@containerd121:~# ps aux | grep /bin/bash | grep root
root 28892 0.0 0.0 51416 3708 pts/0 S 03:51 0:00 sudo unshare -p -f --mount-proc=/home/deepak/deepak/owncontaier/rootfs/proc chroot rootfs /bin/bash
root 28893 0.0 0.0 5992 652 pts/0 S 03:51 0:00 unshare -p -f --mount-proc=/home/deepak/deepak/owncontaier/rootfs/proc chroot rootfs /bin/bash
root 28894 0.0 0.0 20216 2756 pts/0 S+ 03:51 0:00 /bin/bash
root 28897 0.0 0.0 12916 980 pts/1 S+ 03:51 0:00 grep --color=auto /bin/bash
In the example above, the PID is 28894.
The kernel exposes namespaces under /proc/(PID)/ns as files. In this case, /proc/28894/ns/pid is the process namespace we're hoping to join.
Get the namespace info of chroot container
root@containerd121:~# sudo ls -l /proc/28894/ns
total 0
lrwxrwxrwx 1 root root 0 Mar 28 03:54 cgroup -> cgroup:[4026531835]
lrwxrwxrwx 1 root root 0 Mar 28 03:54 ipc -> ipc:[4026531839]
lrwxrwxrwx 1 root root 0 Mar 28 03:54 mnt -> mnt:[4026532517]
lrwxrwxrwx 1 root root 0 Mar 28 03:54 net -> net:[4026532101]
lrwxrwxrwx 1 root root 0 Mar 28 03:54 pid -> pid:[4026532527]
lrwxrwxrwx 1 root root 0 Mar 28 03:54 user -> user:[4026531837]
lrwxrwxrwx 1 root root 0 Mar 28 03:54 uts -> uts:[4026531838]
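Each link target above has the form type:[inode], and two processes are in the same namespace exactly when those inode numbers match. A small Python 3 sketch of reading and comparing them follows; the function names are hypothetical, and reading another process's ns directory usually requires root.

```python
import os
import re

NS_LINK = re.compile(r"^(\w+):\[(\d+)\]$")


def parse_ns_link(target: str):
    """Split a link target such as 'pid:[4026532527]' into (type, inode)."""
    match = NS_LINK.match(target)
    if match is None:
        raise ValueError("not a namespace link: %r" % target)
    return match.group(1), int(match.group(2))


def read_namespaces(pid: int, proc_root: str = "/proc") -> dict:
    """Return {entry name: inode} for every file in /proc/<pid>/ns."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    result = {}
    for name in os.listdir(ns_dir):
        _, inode = parse_ns_link(os.readlink(os.path.join(ns_dir, name)))
        result[name] = inode
    return result


def shared_namespaces(first: dict, second: dict) -> list:
    """Names of the namespaces that two processes have in common."""
    return sorted(name for name, inode in first.items() if second.get(name) == inode)
```

Comparing read_namespaces(28894) with the result for a host shell would be one way to confirm which namespaces the container actually separated.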
Enter the container's namespace using nsenter
The nsenter command provides a wrapper around setns to enter a namespace. We'll provide the namespace file, then run unshare to remount /proc and chroot to set up the chroot. This time, instead of creating a new namespace, our shell will join the existing one (that of PID 28894 here).
Note that ps now lists the original container's shell as PID 1 alongside the new shell, because both run in the same PID namespace.
Entering the chroot container's namespace
deepak@containerd121:~/deepak/owncontaier$ sudo nsenter --pid=/proc/28894/ns/pid unshare -f --mount-proc=$PWD/rootfs/proc chroot rootfs /bin/bash
root@containerd121:/# ps aux
USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND
root 1 0.0 0.0 20216 2756 ? S+ 10:51 0:00 /bin/bash
root 4 0.0 0.0 5992 652 ? S 10:58 0:00 unshare -f --mount-proc=/home/deepak/deepak/owncontaier/rootfs/proc chroot rootfs /bin/bash
root 5 0.0 0.0 20216 3204 ? S 10:58 0:00 /bin/bash
root 6 0.0 0.0 17496 1984 ? R+ 10:58 0:00 ps aux
root@containerd121:/#
Volumes, which inject files and directories from the host into the container, rely on the Linux mount feature.
For this example, we'll create some files on the host, then expose them read-only to the chrooted shell using mount.
First, let's make a new directory to mount into the chroot and create a file there.
Create a directory to mount into chroot container
deepak@containerd121:~/deepak/owncontaier$ sudo mkdir readonlyfiles
deepak@containerd121:~/deepak/owncontaier$ sudo touch readonlyfiles/hi.txt
deepak@containerd121:~/deepak/owncontaier$ ls -ltrh readonlyfiles/hi.txt
-rw-r--r-- 1 root root 0 Mar 28 04:14 readonlyfiles/hi.txt
deepak@containerd121:~/deepak/owncontaier$ sudo chmod 666 readonlyfiles/hi.txt
deepak@containerd121:~/deepak/owncontaier$ ls -ltrh readonlyfiles/hi.txt
-rw-rw-rw- 1 root root 0 Mar 28 04:14 readonlyfiles/hi.txt
deepak@containerd121:~/deepak/owncontaier$ sudo echo "hello" > readonlyfiles/hi.txt
deepak@containerd121:~/deepak/owncontaier$ cat readonlyfiles/hi.txt
hello
Next, we'll create a target directory in our container and bind mount the directory providing the -o ro argument to make it read-only.
Mount directory
deepak@containerd121:~/deepak/owncontaier$ sudo mkdir -p rootfs/var/readonlyfiles
deepak@containerd121:~/deepak/owncontaier$ sudo mount --bind -o ro $PWD/readonlyfiles $PWD/rootfs/var/readonlyfiles
deepak@containerd121:~/deepak/owncontaier$ ls rootfs/var/readonlyfiles/
hi.txt
The chrooted process can now see the mounted files.
Seeing mounted file within the chroot container
deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs /bin/bash
root@containerd121:/# ls
bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var
root@containerd121:/# ls /var/readonlyfiles/hi.txt
/var/readonlyfiles/hi.txt
root@containerd121:/# cat /var/readonlyfiles/hi.txt
hello
However, it can't write to them, since the directory is mounted read-only.
Verify that the chroot process can't edit the file
deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs /bin/bash
root@containerd121:/# echo "bye" > /var/readonlyfiles/hi.txt
bash: /var/readonlyfiles/hi.txt: Read-only file system
root@containerd121:/#
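The read-only flag can also be confirmed without attempting a write, by looking at the options column of /proc/mounts. The Python 3 sketch below parses text in that format; the helper names are hypothetical, and the sample line in the usage note is invented for illustration rather than captured from the host above.

```python
def mount_options(mounts_text: str, mount_point: str):
    """Return the option list for mount_point from /proc/mounts-style text."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Format: device, mount point, fs type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")  # a later mount shadows an earlier one
    return found


def is_read_only(mounts_text: str, mount_point: str) -> bool:
    """True when mount_point is listed and carries the 'ro' option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options
```

Given a hypothetical line such as "/dev/sda1 /var/readonlyfiles ext4 ro,relatime 0 0", is_read_only reports True for /var/readonlyfiles.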
To remove the mount, run the command below from the host machine:
sudo umount $PWD/rootfs/var/readonlyfiles
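Cgroups, short for control groups, let the kernel impose limits on resources such as memory and CPU. The kernel exposes cgroups through the /sys/fs/cgroup directory.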
Cgroup directory info
deepak@containerd121:~/deepak/owncontaier$ ls /sys/fs/cgroup/
blkio cpu cpu,cpuacct cpuacct cpuset devices freezer hugetlb memory net_cls net_cls,net_prio net_prio perf_event pids systemd
deepak@containerd121:~/deepak/owncontaier$ ls /sys/fs/cgroup/memory/
cgroup.clone_children memory.failcnt memory.kmem.tcp.failcnt memory.max_usage_in_bytes memory.stat system.slice
cgroup.event_control memory.force_empty memory.kmem.tcp.limit_in_bytes memory.move_charge_at_immigrate memory.swappiness tasks
cgroup.procs memory.kmem.failcnt memory.kmem.tcp.max_usage_in_bytes memory.numa_stat memory.usage_in_bytes user.slice
cgroup.sane_behavior memory.kmem.limit_in_bytes memory.kmem.tcp.usage_in_bytes memory.oom_control memory.use_hierarchy
init.scope memory.kmem.max_usage_in_bytes memory.kmem.usage_in_bytes memory.pressure_level notify_on_release
kubepods memory.kmem.slabinfo memory.limit_in_bytes memory.soft_limit_in_bytes release_agent
deepak@containerd121:~/deepak/owncontaier$
For this example we'll create a cgroup to restrict the memory of a process. Creating a cgroup is easy, just create a directory. In this case we'll create a memory group called “demo”. Once created, the kernel fills the directory with files that can be used to configure the cgroup.
Creation of Demo Cgroups for memory
deepak@containerd121:~/deepak/owncontaier$ sudo mkdir /sys/fs/cgroup/memory/demo
[sudo] password for deepak:
deepak@containerd121:~/deepak/owncontaier$ ls /sys/fs/cgroup/memory/demo
cgroup.clone_children memory.kmem.failcnt memory.kmem.tcp.limit_in_bytes memory.max_usage_in_bytes memory.soft_limit_in_bytes notify_on_release
cgroup.event_control memory.kmem.limit_in_bytes memory.kmem.tcp.max_usage_in_bytes memory.move_charge_at_immigrate memory.stat tasks
cgroup.procs memory.kmem.max_usage_in_bytes memory.kmem.tcp.usage_in_bytes memory.numa_stat memory.swappiness
memory.failcnt memory.kmem.slabinfo memory.kmem.usage_in_bytes memory.oom_control memory.usage_in_bytes
memory.force_empty memory.kmem.tcp.failcnt memory.limit_in_bytes memory.pressure_level memory.use_hierarchy
deepak@containerd121:~/deepak/owncontaier$
To adjust a value, we just write to the corresponding file. Let's limit the cgroup to 100MB of memory and turn off swap.
Adjusting the memory limit and disabling swap
deepak@containerd121:~/deepak/owncontaier$ sudo su
root@containerd121:/home/deepak/deepak/owncontaier#
root@containerd121:/home/deepak/deepak/owncontaier# echo "100000000" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes^C
root@containerd121:/home/deepak/deepak/owncontaier# cat /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
9223372036854771712
root@containerd121:/home/deepak/deepak/owncontaier# echo "100000000" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
root@containerd121:/home/deepak/deepak/owncontaier# cat /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
99999744
root@containerd121:/home/deepak/deepak/owncontaier# cat /sys/fs/cgroup/memory/demo/memory.swappiness
60
root@containerd121:/home/deepak/deepak/owncontaier# echo "0" > /sys/fs/cgroup/memory/demo/memory.swappiness
root@containerd121:/home/deepak/deepak/owncontaier# cat /sys/fs/cgroup/memory/demo/memory.swappiness
0
root@containerd121:/home/deepak/deepak/owncontaier#
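The value read back (99999744) differs from the value written (100000000). That is consistent with the limit being rounded down to a whole number of memory pages. The Python 3 sketch below reproduces the arithmetic, assuming a 4096-byte page size, which is typical for x86-64 but is an assumption about this host; effective_limit is a hypothetical helper name.

```python
PAGE_SIZE = 4096  # assumption: the common x86-64 page size


def effective_limit(requested_bytes: int, page_size: int = PAGE_SIZE) -> int:
    """Round a requested byte limit down to a multiple of the page size."""
    return (requested_bytes // page_size) * page_size
```

Under the same assumption, the default of 9223372036854771712 shown before the limit was set is simply the largest signed 64-bit value rounded down to a page boundary.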
The tasks file is special: it contains the list of processes assigned to the cgroup. To join the cgroup, we write our own PID (that of our bash shell) to it.
Adding our own shell to the memory cgroup
root@containerd121:/home/deepak/deepak/owncontaier# echo $$
50574
root@containerd121:/home/deepak/deepak/owncontaier# echo $$ > /sys/fs/cgroup/memory/demo/tasks
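Because every setting is a plain file write, the steps so far are easy to script. The Python 3 sketch below writes the same three files used in this walkthrough; write_value and confine are hypothetical helper names, and against the real /sys/fs/cgroup/memory/demo directory the calls would need root.

```python
import os


def write_value(cgroup_dir: str, name: str, value) -> None:
    """Write one setting, the equivalent of `echo value > cgroup_dir/name`."""
    with open(os.path.join(cgroup_dir, name), "w") as handle:
        handle.write("%s\n" % value)


def confine(cgroup_dir: str, limit_bytes: int, pid: int, swappiness: int = 0) -> None:
    """Set the memory limit and swappiness, then move pid into the cgroup."""
    write_value(cgroup_dir, "memory.limit_in_bytes", limit_bytes)
    write_value(cgroup_dir, "memory.swappiness", swappiness)
    # Joining last means the limits are already in place when the process arrives.
    write_value(cgroup_dir, "tasks", pid)
```

A call such as confine("/sys/fs/cgroup/memory/demo", 100000000, os.getpid()) would mirror the manual echo commands above.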
Now see the process behaviour with memory restriction applied
Process behaviour with memory restriction
root@containerd121:/home/deepak/deepak/owncontaier# echo $$ > /sys/fs/cgroup/memory/demo/tasks
root@containerd121:/home/deepak/deepak/owncontaier# which python
/usr/bin/python
root@containerd121:/home/deepak/deepak/owncontaier# python --version
Python 2.7.12
root@containerd121:/home/deepak/deepak/owncontaier# vim test.py
f = open("/dev/urandom", "r")
data = ""
i = 0
while True:
    data += f.read(10000000)  # 10 MB
    i += 1
    print "%dmb" % (i*10,)
root@containerd121:/home/deepak/deepak/owncontaier# echo "10000000" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
root@containerd121:/home/deepak/deepak/owncontaier# python test.py
Killed
root@containerd121:/home/deepak/deepak/owncontaier# echo "100000000" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes
root@containerd121:/home/deepak/deepak/owncontaier# python test.py
10mb
20mb
30mb
40mb
50mb
60mb
70mb
80mb
Killed
root@containerd121:/home/deepak/deepak/owncontaier#
A cgroup can't be removed until every process in its tasks file has exited or been reassigned to another group. Exit the shell and remove the directory with rmdir (don't use rm -r).
Removing the cgroup used for the experiment
deepak@containerd121:~/deepak/owncontaier$ rmdir /sys/fs/cgroup/memory/demo
rmdir: failed to remove '/sys/fs/cgroup/memory/demo': Permission denied
deepak@containerd121:~/deepak/owncontaier$ sudo rmdir /sys/fs/cgroup/memory/demo
[sudo] password for deepak:
rmdir: failed to remove '/sys/fs/cgroup/memory/demo': Device or resource busy
deepak@containerd121:~/deepak/owncontaier$ sudo rmdir /sys/fs/cgroup/memory/demo
deepak@containerd121:~/deepak/owncontaier$
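The "Device or resource busy" error above is what rmdir reports while the group still has members, presumably because the root shell had not yet exited at that point. A Python 3 sketch of checking the tasks file before removing the directory follows; cgroup_pids and remove_cgroup are hypothetical helper names.

```python
import os


def cgroup_pids(cgroup_dir: str) -> list:
    """Return the PIDs currently listed in the cgroup's tasks file."""
    with open(os.path.join(cgroup_dir, "tasks")) as handle:
        return [int(line) for line in handle if line.strip()]


def remove_cgroup(cgroup_dir: str) -> None:
    """Remove an empty cgroup; refuse while processes are still assigned."""
    remaining = cgroup_pids(cgroup_dir)
    if remaining:
        raise RuntimeError("cgroup still has processes: %s" % remaining)
    # rmdir, not a recursive delete: the kernel owns the files inside.
    os.rmdir(cgroup_dir)
```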
Containers are extremely effective ways of running arbitrary code from the internet as root, and this is where the low overhead of containers hurts us. Containers are significantly easier to break out of than a VM. As a result, many technologies used to improve the security of containers, such as SELinux, seccomp, and capabilities, involve limiting the power of processes already running as root.
In this section we'll be exploring Linux capabilities.
Consider a Go program which listens on port 80.
A Go program fails to listen on port 80
deepak@containerd121:~/deepak/owncontaier$ cat listen.go
package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	if _, err := net.Listen("tcp", ":80"); err != nil {
		fmt.Fprintln(os.Stdout, err)
		os.Exit(2)
	}
	fmt.Println("success")
}
deepak@containerd121:~/deepak/owncontaier$ rm listen
deepak@containerd121:~/deepak/owncontaier$ go build -o listen listen.go
deepak@containerd121:~/deepak/owncontaier$ ./listen
listen tcp :80: bind: permission denied
deepak@containerd121:~/deepak/owncontaier$
Predictably, the program above fails; listening on port 80 (a privileged port, below 1024) requires permissions we don't have. Of course we could just use sudo, but we'd like to give the binary only the one permission it needs to listen on lower ports.
Capabilities are a set of discrete powers that together make up everything root can do. They range from setting the system clock to killing arbitrary processes. In this case, CAP_NET_BIND_SERVICE allows executables to listen on lower ports.
We can grant the executable CAP_NET_BIND_SERVICE using the setcap command.
Listening on port 80 succeeds after setting the capability
deepak@containerd121:~/deepak/owncontaier$ go build -o listen listen.go
deepak@containerd121:~/deepak/owncontaier$ ./listen
listen tcp :80: bind: permission denied
deepak@containerd121:~/deepak/owncontaier$ getcap listen
deepak@containerd121:~/deepak/owncontaier$ sudo setcap cap_net_bind_service=+ep listen
deepak@containerd121:~/deepak/owncontaier$ getcap listen
listen = cap_net_bind_service+ep
deepak@containerd121:~/deepak/owncontaier$ ./listen
success
deepak@containerd121:~/deepak/owncontaier$
Similarly, we can use capsh to drop capabilities such as CAP_CHOWN from a process running with root permissions.
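Under the hood, a process's capabilities are bitmasks, visible as hexadecimal fields such as CapEff in /proc/(PID)/status. The Python 3 sketch below decodes such a mask. The bit numbers come from the Linux capability header; the helper names are hypothetical, and drop_capability only computes what a mask would look like with one bit cleared, it does not change any process.

```python
# Bit positions from <linux/capability.h>.
CAP_CHOWN = 0
CAP_KILL = 5
CAP_NET_BIND_SERVICE = 10
CAP_SYS_TIME = 25


def effective_mask(status_text: str) -> str:
    """Extract the CapEff hex mask from /proc/<pid>/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    raise ValueError("no CapEff line found")


def has_capability(mask_hex: str, cap: int) -> bool:
    """True when bit number cap is set in the hex mask."""
    return bool((int(mask_hex, 16) >> cap) & 1)


def drop_capability(mask_hex: str, cap: int) -> str:
    """Return the mask with bit number cap cleared (a calculation only)."""
    return "%016x" % (int(mask_hex, 16) & ~(1 << cap))
```

A mask of 0000000000000400 has only bit 10 set, which corresponds to CAP_NET_BIND_SERVICE and nothing else.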
https://ericchiang.github.io/post/containers-from-scratch/
https://www.youtube.com/watch?v=gMpldbcMHuI
https://ericchiang.github.io