An intro to containers without Docker or rkt

Introduction

Layman's explanation

Do you think Docker and rkt are magic? A container behaves like a lightweight VM and appears to do a lot of magical things to achieve that. In reality, it is a user-friendly wrapper over Linux kernel features.

Technical explanation

Often thought of as cheap VMs, containers are just isolated groups of processes running on a single host. That isolation leverages several underlying technologies built into the Linux kernel: namespaces, cgroups, chroots and lots of terms you've probably heard before.

Steps in brief

- Get the base file system (known as the base image)
- Use chroot to isolate the file system from the host
- Use unshare for namespace isolation
- Use mount to mount files or directories into the container as volumes
- Use cgroups to apply resource quotas to the container
- Use setcap and capsh to set Linux capabilities for the container

Container file systems

Container images, the things you download from the internet, are literally just tarballs (or tarballs in tarballs if you're fancy). The least magical part of a container is the set of files you interact with.

The tarball below holds something that looks like a Debian file system and will be our playground for isolating processes.

Base file system for creating container

deepak@containerd121:~/deepak/owncontaier$ wget https://github.com/ericchiang/containers-from-scratch/releases/download/v0.1.0/rootfs.tar.gz

deepak@containerd121:~/deepak/owncontaier$ sha256sum rootfs.tar.gz

c79bfb46b9cf842055761a49161831aee8f4e667ad9e84ab57ab324a49bc828c rootfs.tar.gz

First, explode the tarball and poke around.

File system in the downloaded tarball

deepak@containerd121:~/deepak/owncontaier$ tar -zxvf rootfs.tar.gz

deepak@containerd121:~/deepak/owncontaier$ ls rootfs

bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var

The resulting directory looks an awful lot like a Linux system. There's a bin directory with executables, an etc with system configuration, a lib with shared libraries, and so on.
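Because an image layer is just a tar archive, the standard library is enough to inspect one. The sketch below builds a tiny stand-in archive in memory and lists its top-level entries, much like ls rootfs does after extraction. The file names inside the stand-in archive and both helper names are made up for illustration.

```python
import io
import tarfile


def top_level_entries(tar_bytes):
    """Return the sorted top-level names under the archive's root directory."""
    names = set()
    with tarfile.open(fileobj=io.BytesIO(tar_bytes), mode="r:gz") as archive:
        for member in archive.getmembers():
            parts = member.name.split("/")
            # Entries look like "rootfs/bin/sh"; keep the component after "rootfs".
            if len(parts) > 1 and parts[1]:
                names.add(parts[1])
    return sorted(names)


def make_demo_tarball():
    """Build a small gzip-compressed tarball in memory (hypothetical contents)."""
    buffer = io.BytesIO()
    with tarfile.open(fileobj=buffer, mode="w:gz") as archive:
        for path in ("rootfs/bin/sh", "rootfs/etc/hostname", "rootfs/usr/bin/env"):
            data = b"placeholder\n"
            info = tarfile.TarInfo(name=path)
            info.size = len(data)
            archive.addfile(info, io.BytesIO(data))
    return buffer.getvalue()


print(top_level_entries(make_demo_tarball()))  # ['bin', 'etc', 'usr']
```

To try it on the real download, the bytes could be read from rootfs.tar.gz instead of the in-memory stand-in.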

chroot for container file system isolation from host

chroot allows us to restrict a process' view of the file system. In this case, we'll restrict our process to the “rootfs” directory and then exec a shell.

In the command below, we set rootfs as the root directory and run /bin/bash as the shell.

chroot to set root directory

deepak@containerd121:~/deepak/owncontaier$ ls

rootfs rootfs.tar.gz

deepak@containerd121:~/deepak/owncontaier$ ls -ltrh /usr/local/bin/python

ls: cannot access '/usr/local/bin/python': No such file or directory

deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs /bin/bash

root@containerd121:/#

root@containerd121:/# ls -ltrh /usr/local/bin/python

lrwxrwxrwx 1 1001 1001 7 Sep 24 2016 /usr/local/bin/python -> python3

root@containerd121:/# which python

/usr/local/bin/python

root@containerd121:/# /usr/bin/python -c 'print "Hello, container world!"'

Hello, container world!

root@containerd121:/# exit

deepak@containerd121:~/deepak/owncontaier$

deepak@containerd121:~/deepak/owncontaier$ which python

/usr/bin/python

It's worth noting that the above works because of all the things baked into the tarball. When we execute the Python interpreter, we're executing rootfs/usr/bin/python, not the host's Python. That interpreter depends on shared libraries and device files that have been intentionally included in the archive.
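As a rough sketch of what the chroot command does, the same steps can be written with Python's os module. The helper names are hypothetical, and run_in_chroot is only defined here, not called, because it needs root privileges.

```python
import os


def path_seen_from_host(rootfs, path_inside):
    """Map an absolute path inside the chroot to the file the host would see."""
    return os.path.join(os.path.abspath(rootfs), path_inside.lstrip("/"))


def run_in_chroot(rootfs, argv):
    """Confine this process to rootfs, then replace it with argv.

    Needs root privileges (CAP_SYS_CHROOT); defined for illustration only.
    """
    os.chroot(rootfs)        # "/" now refers to rootfs for this process
    os.chdir("/")            # don't keep a working directory outside the new root
    os.execv(argv[0], argv)  # argv[0] is resolved inside the new root


# The interpreter a chrooted process runs lives under rootfs on the host.
print(path_seen_from_host("/tmp/rootfs", "/usr/bin/python"))  # /tmp/rootfs/usr/bin/python
```

Run as root, run_in_chroot("rootfs", ["/bin/bash"]) would behave roughly like sudo chroot rootfs /bin/bash.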

Speaking of applications, instead of a shell we can run one directly in our chroot, which is loosely similar to running a command with docker exec.

Executing a program within chroot

deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs python -m SimpleHTTPServer

Fatal Python error: Failed to open /dev/urandom

Aborted (core dumped)

The command fails here because the interpreter cannot open /dev/urandom inside the chroot. One likely cause is that the tarball was extracted without root privileges, in which case tar cannot create the device nodes under rootfs/dev.

Why is namespace isolation needed?

chroot alone doesn't provide PID namespace isolation. In other words, host processes are still visible from within the chroot.

The example below shows that a cat process started on the host can be seen, and killed, from within rootfs.

Host processes visible within the rootfs

deepak@containerd121:~/deepak/owncontaier$ cat &

[1] 130840

deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs /bin/bash

root@containerd121:/# ps ax | grep 130840

1747 ? S+ 0:00 grep 130840

130840 ? T 0:00 cat

root@containerd121:/# exit

deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs /bin/bash

root@containerd121:/# kill -9 130840

root@containerd121:/# ps ax | grep 130840

3162 ? S+ 0:00 grep 130840

So, we need namespace isolation here.
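The visibility check that ps performs can be sketched by listing the numeric directories under /proc. The helper name and the stand-in directory below are illustrative; on a real system the default argument would read the live /proc.

```python
import os
import tempfile


def visible_pids(proc_root="/proc"):
    """Return the PIDs that appear as numeric directories under proc_root."""
    return sorted(int(name) for name in os.listdir(proc_root) if name.isdigit())


# Demonstrate on a stand-in directory so the example runs anywhere.
with tempfile.TemporaryDirectory() as fake_proc:
    for entry in ("1", "130840", "self", "meminfo"):
        os.mkdir(os.path.join(fake_proc, entry))
    print(visible_pids(fake_proc))  # [1, 130840]
```

Inside a plain chroot that has the host's /proc mounted, this list includes every host process, which is exactly the problem namespaces address.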

unshare command for namespace isolation

Creating a namespace is easy: it takes a single syscall, unshare, with one argument. The unshare command-line tool gives us a nice wrapper around this syscall and lets us set up namespaces manually. In this case, we'll create a PID namespace for the shell, then execute the chroot as in the last example.
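To make the "single syscall" concrete, here is a minimal sketch that calls unshare(2) through ctypes. The flag values are the Linux ones from sched.h; the function names are hypothetical, and nothing is invoked at import time because the call needs root privileges. It deliberately omits the /proc remount that --mount-proc performs.

```python
import ctypes
import os

# Flag values from <sched.h> on Linux.
CLONE_NEWNS = 0x00020000   # mount namespace
CLONE_NEWUTS = 0x04000000  # hostname namespace
CLONE_NEWPID = 0x20000000  # PID namespace
CLONE_NEWNET = 0x40000000  # network namespace


def unshare(flags):
    """Call unshare(2) through libc; raises OSError on failure (e.g. EPERM)."""
    libc = ctypes.CDLL(None, use_errno=True)
    if libc.unshare(ctypes.c_int(flags)) != 0:
        errno = ctypes.get_errno()
        raise OSError(errno, os.strerror(errno))


def run_in_new_pid_namespace(argv):
    """Roughly what `unshare -p -f` does: unshare, fork, exec in the child."""
    unshare(CLONE_NEWPID)
    pid = os.fork()  # the first child becomes PID 1 of the new namespace
    if pid == 0:
        os.execvp(argv[0], argv)
    _, status = os.waitpid(pid, 0)
    return status
```

The fork matters: the process that calls unshare with CLONE_NEWPID stays in its old PID namespace, and only its children enter the new one, which is why the command-line tool is run with -f.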

namespace isolation of chroot container

deepak@containerd121:~/deepak/owncontaier$ cat &

[1] 9208

deepak@containerd121:~/deepak/owncontaier$ sudo unshare -p -f --mount-proc=$PWD/rootfs/proc chroot rootfs /bin/bash

root@containerd121:/# ps ax

PID TTY STAT TIME COMMAND

1 ? S 0:00 /bin/bash

2 ? R+ 0:00 ps ax

Is host networking visible to the chroot container?

The chroot container above still sees the host's network interfaces, as the example below shows.

Host networking is visible to the chroot container

deepak@containerd121:~/deepak/owncontaier$ ip addr show | grep inet

inet 127.0.0.1/8 scope host lo

inet6 ::1/128 scope host

inet 10.106.175.121/24 brd 10.106.175.255 scope global eth0

inet6 fe80::5866:f7ff:fe3b:75b1/64 scope link

inet 10.88.0.1/16 scope global cni0

inet6 fe80::84da:c4ff:fefc:6316/64 scope link

inet6 fe80::f426:c6ff:fe36:9e6b/64 scope link

inet6 fe80::f86b:9aff:fe65:678c/64 scope link

inet 192.168.170.0/32 brd 192.168.170.0 scope global tunl0

inet6 fe80::ecee:eeff:feee:eeee/64 scope link

deepak@containerd121:~/deepak/owncontaier$ ip addr show | grep inet | wc -l

deepak@containerd121:~/deepak/owncontaier$ sudo unshare -p -f --mount-proc=$PWD/rootfs/proc chroot rootfs /bin/bash

root@containerd121:/# ip addr show | grep inet | wc -l

root@containerd121:/# exit

exit

So, we may need to create our own network and link it to our chroot container. In some cases, it's fine for the container to share the host's networking.
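A quick way to check what a process can see, sketched here with the standard socket module: the list of interface names is the same inside and outside the container until a separate network namespace is created. The function name is illustrative.

```python
import socket


def visible_interfaces():
    """Return the names of the network interfaces this process can see."""
    return [name for _index, name in socket.if_nameindex()]


print(visible_interfaces())
```

On the host shown above this would presumably include names such as lo, eth0, cni0 and tunl0; inside a fresh network namespace only lo would be expected.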

Power of namespace

A powerful aspect of namespaces is their composability: processes may choose to separate some namespaces but share others. For instance, it may be useful for two programs to have isolated PID namespaces but share a network namespace (e.g. Kubernetes pods).

Finding the namespace of a container

Run the chroot container in one shell, then from another shell on the host machine, find the PID of the container's bash process.

Find the PID of the chroot container

root@containerd121:~# ps aux | grep /bin/bash | grep root

root 28892 0.0 0.0 51416 3708 pts/0 S 03:51 0:00 sudo unshare -p -f --mount-proc=/home/deepak/deepak/owncontaier/rootfs/proc chroot rootfs /bin/bash

root 28893 0.0 0.0 5992 652 pts/0 S 03:51 0:00 unshare -p -f --mount-proc=/home/deepak/deepak/owncontaier/rootfs/proc chroot rootfs /bin/bash

root 28894 0.0 0.0 20216 2756 pts/0 S+ 03:51 0:00 /bin/bash

root 28897 0.0 0.0 12916 980 pts/1 S+ 03:51 0:00 grep --color=auto /bin/bash

In the above example, the PID is 28894.

Get the namespace info using /proc

The kernel exposes namespaces under /proc/(PID)/ns as files. In this case, /proc/28894/ns/pid is the process namespace we're hoping to join.

Get the namespace info of chroot container

root@containerd121:~# sudo ls -l /proc/28894/ns

total 0

lrwxrwxrwx 1 root root 0 Mar 28 03:54 cgroup -> cgroup:[4026531835]

lrwxrwxrwx 1 root root 0 Mar 28 03:54 ipc -> ipc:[4026531839]

lrwxrwxrwx 1 root root 0 Mar 28 03:54 mnt -> mnt:[4026532517]

lrwxrwxrwx 1 root root 0 Mar 28 03:54 net -> net:[4026532101]

lrwxrwxrwx 1 root root 0 Mar 28 03:54 pid -> pid:[4026532527]

lrwxrwxrwx 1 root root 0 Mar 28 03:54 user -> user:[4026531837]

lrwxrwxrwx 1 root root 0 Mar 28 03:54 uts -> uts:[4026531838]
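
The same information can be read programmatically, since each entry under /proc/(PID)/ns is a symlink whose target names the namespace. The sketch below compares two processes and reports which namespaces they share. It runs against a stand-in directory; PID 28894 reuses identifiers from the listing above, while the pid and mnt identifiers given to PID 1 are made up, as are the helper names.

```python
import os
import tempfile


def read_namespaces(pid, proc_root="/proc"):
    """Map namespace type (e.g. "pid") to its identifier (e.g. "pid:[4026532527]")."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    return {name: os.readlink(os.path.join(ns_dir, name)) for name in sorted(os.listdir(ns_dir))}


def shared_namespaces(first, second):
    """Return the namespace types for which both processes hold the same identifier."""
    return sorted(kind for kind in first if first[kind] == second.get(kind))


def make_fake_process(proc_root, pid, namespaces):
    """Create a stand-in <proc_root>/<pid>/ns directory of symlinks for the demo."""
    ns_dir = os.path.join(proc_root, str(pid), "ns")
    os.makedirs(ns_dir)
    for kind, ident in namespaces.items():
        os.symlink("%s:[%d]" % (kind, ident), os.path.join(ns_dir, kind))


with tempfile.TemporaryDirectory() as fake_proc:
    make_fake_process(fake_proc, 28894, {"pid": 4026532527, "mnt": 4026532517, "net": 4026532101})
    make_fake_process(fake_proc, 1, {"pid": 4026531836, "mnt": 4026531840, "net": 4026532101})
    container = read_namespaces(28894, fake_proc)
    host_init = read_namespaces(1, fake_proc)
    print(shared_namespaces(container, host_init))  # ['net']
```

Reading another user's /proc/(PID)/ns entries on a real host generally requires root, which is why the listing above uses sudo.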

Enter the container namespace using nsenter

The nsenter command provides a wrapper around the setns syscall to enter a namespace. We'll provide the namespace file, then run unshare to remount /proc and chroot to set up the root directory. This time, instead of creating a new namespace, our shell will join the existing one (that of PID 28894 here).

Note that ps inside this shell shows the same process tree as the original chroot container, including that container's /bin/bash as PID 1.

Enter the chroot container namespace

deepak@containerd121:~/deepak/owncontaier$ sudo nsenter --pid=/proc/28894/ns/pid unshare -f --mount-proc=$PWD/rootfs/proc chroot rootfs /bin/bash

root@containerd121:/# ps aux

USER PID %CPU %MEM VSZ RSS TTY STAT START TIME COMMAND

root 1 0.0 0.0 20216 2756 ? S+ 10:51 0:00 /bin/bash

root 4 0.0 0.0 5992 652 ? S 10:58 0:00 unshare -f --mount-proc=/home/deepak/deepak/owncontaier/rootfs/proc chroot rootfs /bin/bash

root 5 0.0 0.0 20216 3204 ? S 10:58 0:00 /bin/bash

root 6 0.0 0.0 17496 1984 ? R+ 10:58 0:00 ps aux

root@containerd121:/#
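
For comparison, a minimal sketch of the setns(2) call that nsenter wraps, again through ctypes. The function names are hypothetical and the code is only defined, not run, since joining another process's namespace needs root privileges.

```python
import ctypes
import os


def join_namespace(ns_path, nstype=0):
    """Call setns(2) on a namespace file such as /proc/<pid>/ns/pid.

    nstype=0 lets the kernel accept any namespace type. Needs root privileges.
    """
    fd = os.open(ns_path, os.O_RDONLY)  # raises FileNotFoundError if the path is gone
    try:
        libc = ctypes.CDLL(None, use_errno=True)
        if libc.setns(fd, nstype) != 0:
            errno = ctypes.get_errno()
            raise OSError(errno, os.strerror(errno))
    finally:
        os.close(fd)


def run_in_pid_namespace_of(pid, argv):
    """Join the PID namespace of pid, then fork so the child actually lives in it."""
    join_namespace("/proc/%d/ns/pid" % pid)
    child = os.fork()  # setns on a PID namespace only affects future children
    if child == 0:
        os.execvp(argv[0], argv)
    _, status = os.waitpid(child, 0)
    return status
```

As with unshare, the fork is what places the new shell in the target PID namespace.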

Enabling volume mounts to chroot

Volume mounts rely on the Linux mount feature.

For this example, we'll create some files on the host, then expose them read-only to the chrooted shell using mount.

First, let's make a new directory to mount into the chroot and create a file there.

Create a directory to mount into chroot container

deepak@containerd121:~/deepak/owncontaier$ sudo mkdir readonlyfiles

deepak@containerd121:~/deepak/owncontaier$ sudo touch readonlyfiles/hi.txt

deepak@containerd121:~/deepak/owncontaier$ ls -ltrh readonlyfiles/hi.txt

-rw-r--r-- 1 root root 0 Mar 28 04:14 readonlyfiles/hi.txt

deepak@containerd121:~/deepak/owncontaier$ sudo chmod 666 readonlyfiles/hi.txt

deepak@containerd121:~/deepak/owncontaier$ ls -ltrh readonlyfiles/hi.txt

-rw-rw-rw- 1 root root 0 Mar 28 04:14 readonlyfiles/hi.txt

deepak@containerd121:~/deepak/owncontaier$ sudo echo "hello" > readonlyfiles/hi.txt

deepak@containerd121:~/deepak/owncontaier$ cat readonlyfiles/hi.txt

hello

Next, we'll create a target directory in our container and bind mount the directory providing the -o ro argument to make it read-only.

Mount directory

deepak@containerd121:~/deepak/owncontaier$ sudo mkdir -p rootfs/var/readonlyfiles

deepak@containerd121:~/deepak/owncontaier$ sudo mount --bind -o ro $PWD/readonlyfiles $PWD/rootfs/var/readonlyfiles

deepak@containerd121:~/deepak/owncontaier$ ls rootfs/var/readonlyfiles/

hi.txt

The chrooted process can now see the mounted files.

Seeing mounted file within the chroot container

deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs /bin/bash

root@containerd121:/# ls

bin boot dev etc home lib lib64 media mnt opt proc root run sbin srv sys tmp usr var

root@containerd121:/# ls /var/readonlyfiles/hi.txt

/var/readonlyfiles/hi.txt

root@containerd121:/# cat /var/readonlyfiles/hi.txt

hello

However, it can't write to them, since the directory is mounted read-only.

Verify that chroot process can't edit the file

deepak@containerd121:~/deepak/owncontaier$ sudo chroot rootfs /bin/bash

root@containerd121:/# echo "bye" > /var/readonlyfiles/hi.txt

bash: /var/readonlyfiles/hi.txt: Read-only file system

root@containerd121:/#
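
Whether a mount is read-only can also be confirmed from the mount table. The sketch below parses text in the /proc/mounts format and looks for the ro option. The sample lines are hypothetical (the device name and file system type are assumptions); on a real host the text would come from reading /proc/mounts.

```python
def mount_options(mounts_text, mount_point):
    """Return the option list for mount_point from /proc/mounts-style text, or None."""
    found = None
    for line in mounts_text.splitlines():
        fields = line.split()
        # Fields: device, mount point, filesystem type, options, dump, pass.
        if len(fields) >= 4 and fields[1] == mount_point:
            found = fields[3].split(",")  # later entries override earlier ones
    return found


def is_read_only(mounts_text, mount_point):
    """True when mount_point is mounted with the "ro" option."""
    options = mount_options(mounts_text, mount_point)
    return options is not None and "ro" in options


# Hypothetical excerpt; real text would come from open("/proc/mounts").read().
SAMPLE = """\
/dev/sda1 / ext4 rw,relatime 0 0
/dev/sda1 /home/deepak/deepak/owncontaier/rootfs/var/readonlyfiles ext4 ro,relatime 0 0
"""

print(is_read_only(SAMPLE, "/home/deepak/deepak/owncontaier/rootfs/var/readonlyfiles"))  # True
print(is_read_only(SAMPLE, "/"))  # False
```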

To remove the mount, run the command below from the host machine.

sudo umount $PWD/rootfs/var/readonlyfiles

cgroups for resource management

The kernel exposes cgroups through the /sys/fs/cgroup directory.

Cgroup directory info

deepak@containerd121:~/deepak/owncontaier$ ls /sys/fs/cgroup/

blkio cpu cpu,cpuacct cpuacct cpuset devices freezer hugetlb memory net_cls net_cls,net_prio net_prio perf_event pids systemd

deepak@containerd121:~/deepak/owncontaier$ ls /sys/fs/cgroup/memory/

cgroup.clone_children memory.failcnt memory.kmem.tcp.failcnt memory.max_usage_in_bytes memory.stat system.slice

cgroup.event_control memory.force_empty memory.kmem.tcp.limit_in_bytes memory.move_charge_at_immigrate memory.swappiness tasks

cgroup.procs memory.kmem.failcnt memory.kmem.tcp.max_usage_in_bytes memory.numa_stat memory.usage_in_bytes user.slice

cgroup.sane_behavior memory.kmem.limit_in_bytes memory.kmem.tcp.usage_in_bytes memory.oom_control memory.use_hierarchy

init.scope memory.kmem.max_usage_in_bytes memory.kmem.usage_in_bytes memory.pressure_level notify_on_release

kubepods memory.kmem.slabinfo memory.limit_in_bytes memory.soft_limit_in_bytes release_agent

deepak@containerd121:~/deepak/owncontaier$

For this example we'll create a cgroup to restrict the memory of a process. Creating a cgroup is easy, just create a directory. In this case we'll create a memory group called “demo”. Once created, the kernel fills the directory with files that can be used to configure the cgroup.

Creation of Demo Cgroups for memory

deepak@containerd121:~/deepak/owncontaier$ sudo mkdir /sys/fs/cgroup/memory/demo

[sudo] password for deepak:

deepak@containerd121:~/deepak/owncontaier$ ls /sys/fs/cgroup/memory/demo

cgroup.clone_children memory.kmem.failcnt memory.kmem.tcp.limit_in_bytes memory.max_usage_in_bytes memory.soft_limit_in_bytes notify_on_release

cgroup.event_control memory.kmem.limit_in_bytes memory.kmem.tcp.max_usage_in_bytes memory.move_charge_at_immigrate memory.stat tasks

cgroup.procs memory.kmem.max_usage_in_bytes memory.kmem.tcp.usage_in_bytes memory.numa_stat memory.swappiness

memory.failcnt memory.kmem.slabinfo memory.kmem.usage_in_bytes memory.oom_control memory.usage_in_bytes

memory.force_empty memory.kmem.tcp.failcnt memory.limit_in_bytes memory.pressure_level memory.use_hierarchy

deepak@containerd121:~/deepak/owncontaier$

To adjust a value we just have to write to the corresponding file. Let's limit the cgroup to 100MB of memory and turn off swap.

Adjusting the memory limit and disabling swap

deepak@containerd121:~/deepak/owncontaier$ sudo su

root@containerd121:/home/deepak/deepak/owncontaier#

root@containerd121:/home/deepak/deepak/owncontaier# echo "100000000" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes^C

root@containerd121:/home/deepak/deepak/owncontaier# cat /sys/fs/cgroup/memory/demo/memory.limit_in_bytes

9223372036854771712

root@containerd121:/home/deepak/deepak/owncontaier# echo "100000000" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes

root@containerd121:/home/deepak/deepak/owncontaier# cat /sys/fs/cgroup/memory/demo/memory.limit_in_bytes

99999744

root@containerd121:/home/deepak/deepak/owncontaier# cat /sys/fs/cgroup/memory/demo/memory.swappiness

root@containerd121:/home/deepak/deepak/owncontaier# echo "0" > /sys/fs/cgroup/memory/demo/memory.swappiness

root@containerd121:/home/deepak/deepak/owncontaier# cat /sys/fs/cgroup/memory/demo/memory.swappiness

root@containerd121:/home/deepak/deepak/owncontaier#
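
The values read back differ slightly from what was written. A plausible explanation is that the kernel tracks the limit in whole memory pages, so the byte count is rounded down to a multiple of the page size. The arithmetic below assumes a 4096-byte page, which is an assumption about this host rather than something shown in the transcript.

```python
PAGE_SIZE = 4096  # assumed; check with os.sysconf("SC_PAGE_SIZE") on a real host


def round_down_to_page(limit_bytes, page_size=PAGE_SIZE):
    """Round a byte count down to a whole number of pages."""
    return limit_bytes // page_size * page_size


print(round_down_to_page(100000000))  # 99999744, the value read back after the write
print(round_down_to_page(2**63 - 1))  # 9223372036854771712, the initial "no limit" value
```

Under that assumption both numbers in the transcript fall out directly: 100000000 becomes 99999744, and the initial value is the largest signed 64-bit integer rounded down to a page boundary.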

The tasks file is special: it contains the list of processes assigned to the cgroup. To join the cgroup, we write our own PID (that of our bash shell) to it.

Add your own shell to the memory cgroup

root@containerd121:/home/deepak/deepak/owncontaier# echo $$

50574

root@containerd121:/home/deepak/deepak/owncontaier# echo $$ > /sys/fs/cgroup/memory/demo/tasks
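
The whole sequence (create the directory, write the limit, write the swappiness, write the PID) is just file I/O, so it can be sketched in a few lines. The helper names are hypothetical, and the demo writes to a scratch directory rather than the real cgroup v1 hierarchy, where the script would need root and the kernel would create the control files itself.

```python
import os
import tempfile


def write_setting(cgroup_dir, filename, value):
    """Write one value to a cgroup control file, as `echo value > file` does."""
    with open(os.path.join(cgroup_dir, filename), "w") as control:
        control.write("%s\n" % value)


def configure_memory_cgroup(cgroup_dir, limit_bytes, pid):
    """Create the group, cap its memory, set swappiness to 0, and move pid into it."""
    os.makedirs(cgroup_dir, exist_ok=True)  # on cgroupfs, mkdir creates the cgroup
    write_setting(cgroup_dir, "memory.limit_in_bytes", limit_bytes)
    write_setting(cgroup_dir, "memory.swappiness", 0)
    write_setting(cgroup_dir, "tasks", pid)


# Dry run against a scratch directory; on a real host cgroup_dir would be
# /sys/fs/cgroup/memory/demo.
with tempfile.TemporaryDirectory() as scratch:
    demo = os.path.join(scratch, "demo")
    configure_memory_cgroup(demo, 100000000, os.getpid())
    print(sorted(os.listdir(demo)))  # ['memory.limit_in_bytes', 'memory.swappiness', 'tasks']
```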

Now let's see how a process behaves with the memory restriction applied.

Process behaviour with memory restriction

root@containerd121:/home/deepak/deepak/owncontaier# echo $$ > /sys/fs/cgroup/memory/demo/tasks

root@containerd121:/home/deepak/deepak/owncontaier# which python

/usr/bin/python

root@containerd121:/home/deepak/deepak/owncontaier# python --version

Python 2.7.12

root@containerd121:/home/deepak/deepak/owncontaier# vim test.py

# Python 2 script: keeps appending 10 MB chunks until the kernel kills it
f = open("/dev/urandom", "r")
data = ""
i = 0
while True:
    data += f.read(10000000)  # 10mb
    i += 1
    print "%dmb" % (i*10,)

root@containerd121:/home/deepak/deepak/owncontaier# echo "10000000" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes

root@containerd121:/home/deepak/deepak/owncontaier# python test.py

Killed

root@containerd121:/home/deepak/deepak/owncontaier# echo "100000000" > /sys/fs/cgroup/memory/demo/memory.limit_in_bytes

root@containerd121:/home/deepak/deepak/owncontaier# python test.py

10mb

20mb

30mb

40mb

50mb

60mb

70mb

80mb

Killed

root@containerd121:/home/deepak/deepak/owncontaier#

A cgroup can't be removed until every process in its tasks file has exited or been reassigned to another group. Exit the shell and remove the directory with rmdir (don't use rm -r).
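
Before attempting the removal, the tasks file can be read to see which processes are still members. This is a small sketch with hypothetical helper names; the demo uses a scratch file, and the rmdir branch is only expected to succeed on a real cgroup file system, where the kernel-provided control files do not block directory removal.

```python
import os
import tempfile


def remaining_tasks(cgroup_dir):
    """Return the PIDs still listed in the cgroup's tasks file."""
    with open(os.path.join(cgroup_dir, "tasks")) as tasks:
        return [int(line) for line in tasks.read().split()]


def remove_if_empty(cgroup_dir):
    """rmdir the cgroup only when no process is left in it; report success."""
    if remaining_tasks(cgroup_dir):
        return False  # would fail with "Device or resource busy"
    os.rmdir(cgroup_dir)
    return True


with tempfile.TemporaryDirectory() as scratch:
    with open(os.path.join(scratch, "tasks"), "w") as tasks:
        tasks.write("50574\n")
    print(remaining_tasks(scratch))  # [50574]
    print(remove_if_empty(scratch))  # False
```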

Remove the cgroup used for the demo

deepak@containerd121:~/deepak/owncontaier$ rmdir /sys/fs/cgroup/memory/demo

rmdir: failed to remove '/sys/fs/cgroup/memory/demo': Permission denied

deepak@containerd121:~/deepak/owncontaier$ sudo rmdir /sys/fs/cgroup/memory/demo

[sudo] password for deepak:

rmdir: failed to remove '/sys/fs/cgroup/memory/demo': Device or resource busy

deepak@containerd121:~/deepak/owncontaier$ sudo rmdir /sys/fs/cgroup/memory/demo

deepak@containerd121:~/deepak/owncontaier$

Use setcap to set capabilities

Containers are extremely effective ways of running arbitrary code from the internet as root, and this is where the low overhead of containers hurts us. Containers are significantly easier to break out of than a VM. As a result, many technologies used to improve the security of containers, such as SELinux, seccomp, and capabilities, involve limiting the power of processes already running as root.

In this section we'll be exploring Linux capabilities.

Consider a Go program that listens on port 80.

A Go program fails to listen on port 80

deepak@containerd121:~/deepak/owncontaier$ cat listen.go

package main

import (
	"fmt"
	"net"
	"os"
)

func main() {
	// Binding to a port below 1024 needs root or CAP_NET_BIND_SERVICE.
	if _, err := net.Listen("tcp", ":80"); err != nil {
		fmt.Fprintln(os.Stdout, err)
		os.Exit(2)
	}
	fmt.Println("success")
}

deepak@containerd121:~/deepak/owncontaier$ rm listen

deepak@containerd121:~/deepak/owncontaier$ go build -o listen listen.go

deepak@containerd121:~/deepak/owncontaier$ ./listen

listen tcp :80: bind: permission denied

deepak@containerd121:~/deepak/owncontaier$

Predictably, the above program fails; listening on port 80 (or any port below 1024) requires permissions we don't have. Of course we can just use sudo, but we'd like to give the binary just the one permission to listen on lower ports.

Capabilities are a set of discrete powers that together make up everything root can do. This ranges from things like setting the system clock to killing arbitrary processes. In this case, CAP_NET_BIND_SERVICE allows executables to listen on lower ports.
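
Each capability corresponds to one bit in a mask, which is how the kernel reports a process's effective set in the CapEff field of /proc/(PID)/status. The sketch below decodes such a mask. The bit positions are the ones defined in linux/capability.h; the status excerpts and helper names are hypothetical.

```python
# Bit positions from <linux/capability.h>.
CAP_CHOWN = 0
CAP_KILL = 5
CAP_NET_BIND_SERVICE = 10
CAP_SYS_TIME = 25


def has_capability(mask_hex, capability):
    """Check one capability bit in a hex mask such as the CapEff field."""
    return bool(int(mask_hex, 16) >> capability & 1)


def effective_mask(status_text):
    """Extract the CapEff value from /proc/<pid>/status-style text."""
    for line in status_text.splitlines():
        if line.startswith("CapEff:"):
            return line.split()[1]
    return None


# Hypothetical status excerpts: an unprivileged process, and one holding only
# CAP_NET_BIND_SERVICE (bit 10, i.e. 0x400).
unprivileged = effective_mask("CapEff:\t0000000000000000")
bind_only = effective_mask("CapEff:\t0000000000000400")
print(has_capability(unprivileged, CAP_NET_BIND_SERVICE))  # False
print(has_capability(bind_only, CAP_NET_BIND_SERVICE))     # True
print(has_capability(bind_only, CAP_CHOWN))                # False
```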

We can grant the executable CAP_NET_BIND_SERVICE using the setcap command.

Listening on port 80 succeeds after setting the capability

deepak@containerd121:~/deepak/owncontaier$ go build -o listen listen.go

deepak@containerd121:~/deepak/owncontaier$ ./listen

listen tcp :80: bind: permission denied

deepak@containerd121:~/deepak/owncontaier$ getcap listen

deepak@containerd121:~/deepak/owncontaier$ sudo setcap cap_net_bind_service=+ep listen

deepak@containerd121:~/deepak/owncontaier$ getcap listen

listen = cap_net_bind_service+ep

deepak@containerd121:~/deepak/owncontaier$ ./listen

success

deepak@containerd121:~/deepak/owncontaier$

Similarly, we can use capsh to drop selected capabilities, such as CAP_CHOWN, from a process running as root.

Reference

https://ericchiang.github.io/post/containers-from-scratch/

https://www.youtube.com/watch?v=gMpldbcMHuI

https://ericchiang.github.io
