Linux containers are built upon the following underlying kernel features:
pivot_root system call
pivot_root changes the root directory of a process, and all children of that process inherit the changed root directory. It can be seen as a more secure alternative to the chroot system call.
control groups (cgroups)
cgroups provide resource control for processes. With cgroups, resource usage can be limited and monitored.
namespaces
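namespaces limit a process's view of host resources, so that each container sees only its own processes, network interfaces, and file systems.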
capabilities
capabilities allow fine-grained privilege control for processes, thus limiting what operations processes are allowed to use.
Others
Other features in use include OverlayFS, Seccomp, AppArmor, and SELinux.
Namespaces: Namespaces provide process isolation, allowing each container to have its own view of system resources like processes, network interfaces, and file systems. The main namespaces used in containers are PID (process ID), NET (network), IPC (inter-process communication), UTS (hostname), and Mount (file system).
Control Groups (cgroups): Cgroups allow resource limitation and prioritization for processes within a container. This feature enables controlling and allocating CPU, memory, block I/O, and other resources, preventing a single container from monopolizing the host's resources.
Union File Systems (UnionFS): Union file systems, like OverlayFS and AUFS, enable the layering of read-only images and a writable container layer. This allows sharing common read-only components among multiple containers while keeping container-specific changes isolated.
Chroot: Chroot changes the root directory for a process, restricting its view of the file system to a specific directory. It offers only a basic form of isolation; container runtimes typically rely on pivot_root and namespaces, which provide more robust isolation.
Seccomp (Secure Computing Mode): Seccomp allows fine-grained control over system call usage within a container. It can restrict the system calls a containerized process can make, reducing its attack surface and enhancing security.
Capability Management: Linux capabilities allow fine-grained privilege adjustments for processes within containers. This ensures that processes have only the necessary privileges and not the full privileges of the root user.
AppArmor and SELinux: These are Linux security modules that enforce mandatory access controls and limit the actions processes can perform within a container, providing an additional layer of security.
Together, these kernel features enable the creation of isolated, lightweight, and portable containers, making them a powerful tool for modern software development and deployment.
kernel features used by containers
pivot_root syscall
-> new root directory structure
control groups (cgroups)
-> resource control of processes
namespaces
-> limit process view of host resources
capabilities
-> fine-grained privilege control
Pivot_root
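The pivot_root system call changes the root mount in the mount namespace of the calling process and moves the old root to a directory below the new one, from where it can be unmounted.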
pivot_root system call
mount namespace
Chroot
The chroot system call changes the root directory for a process, restricting its view of the file system to a specific directory hierarchy. The new hierarchy is also called a chroot jail or sandbox.
The small example below uses chroot to create a sandbox that contains only a bash shell. First we check which libraries the bash binary depends on, then we create the directory structure under /tmp/sandbox, which becomes our root directory once we enter the jail. Inside the jail we can use all the built-in bash commands, but nothing more (e.g. ls isn't available).
Chroot system call
chroot jail
sandbox
Simple chroot jail
check dependent libraries
create directory structure
copy libs and bins into the new structure
enter the sandbox with the chroot command
exit the sandbox with the exit command
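The steps on this slide can be sketched as the script below. It is a minimal sketch, not the original example: the SANDBOX, BASH_BIN and LIBS variables are introduced here for illustration (SANDBOX defaults to the /tmp/sandbox directory named above), the library paths are whatever ldd reports on your system, and the chroot step is only attempted when the script runs as root.

```shell
#!/bin/sh
# Sketch of a bash-only chroot jail. SANDBOX is an illustrative variable;
# it defaults to the /tmp/sandbox directory used in the text.
SANDBOX="${SANDBOX:-/tmp/sandbox}"
BASH_BIN="$(command -v bash)"

# 1. Check dependent libraries: keep only the absolute paths in ldd's output.
LIBS="$(ldd "$BASH_BIN" | grep -o '/[^ ]*')"

# 2. Create the directory structure.
mkdir -p "$SANDBOX/bin"

# 3. Copy libs and bins into the new structure, preserving library paths.
cp "$BASH_BIN" "$SANDBOX/bin/bash"
for lib in $LIBS; do
    mkdir -p "$SANDBOX$(dirname "$lib")"
    cp "$lib" "$SANDBOX$lib"
done

# 4. Enter the sandbox. Interactively this would be
#        chroot "$SANDBOX" /bin/bash      (leave again with "exit")
#    Here a non-interactive check is run instead, and only as root.
if [ "$(id -u)" -eq 0 ]; then
    chroot "$SANDBOX" /bin/bash -c 'echo "inside the jail: $PWD"; type ls' || true
else
    echo "sandbox prepared in $SANDBOX; run chroot as root to enter it"
fi
```

Inside the jail, `type ls` is expected to fail because only bash and its libraries were copied.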
Control Groups (cgroups)
Cgroups allow resource limitation and prioritization for a set of processes. By placing all processes of a container in one set (cgroup.procs), we control the entire container. Cgroups thus give us more control over how the host's resources are distributed among the containers and prevent a single container from monopolizing them.
There are two cgroup implementations: version 1 and version 2. Controllers from both implementations can be mixed, but a specific controller can be active in only one version at a time. E.g., if the pids controller is active in cgroups version 1, it cannot be activated in cgroups version 2 at the same time.
Cgroups
cgroup.procs (list of controlled processes)
resource control and prioritization of processes
Controller Types
cpu
memory
network
devices
io
In version 1, the controllers were implemented without a unified design, which resulted in a separate mount for each controller.
Cgroups version 1
each controller has its own mountpoint
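One way to see this layout on a running system is to list the cgroup mounts from /proc/mounts. This is a small sketch; the function name list_cgroup_mounts is made up for this example, and on a host that uses only version 2 it prints a single cgroup2 line.

```shell
# Print the filesystem type and mount point of every cgroup mount.
# Under version 1 there is one "cgroup" line per mounted controller hierarchy.
# The optional argument (a mounts file) defaults to /proc/mounts.
list_cgroup_mounts() {
    awk '$3 == "cgroup" || $3 == "cgroup2" { print $3, $2 }' "${1:-/proc/mounts}"
}

list_cgroup_mounts
```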
Version 2 unified the directory hierarchy, with the root mounted on /sys/fs/cgroup; i.e., there is only one mount point.
Cgroups version 2
only one mount point
unified directory hierarchy
cgroup.controllers lists the available controllers
cgroup.subtree_control lists/modifies the set of controllers enabled for the subtree
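The two interface files can be inspected directly. The sketch below assumes the version 2 hierarchy is mounted at /sys/fs/cgroup; CGROOT and show_controllers are names chosen for this example. Enabling a controller requires root and is therefore shown only as a comment.

```shell
# Root of the cgroup v2 hierarchy (assumed default mount point).
CGROOT="${CGROOT:-/sys/fs/cgroup}"

# Show which controllers a cgroup directory offers and which of them
# are enabled for its child groups.
show_controllers() {
    if [ -f "$1/cgroup.controllers" ]; then
        echo "available: $(cat "$1/cgroup.controllers")"
        echo "enabled for subtree: $(cat "$1/cgroup.subtree_control")"
    else
        echo "no cgroup v2 hierarchy at $1"
    fi
}

show_controllers "$CGROOT"

# As root, a controller is enabled for the child groups like this:
#   echo "+pids" > /sys/fs/cgroup/cgroup.subtree_control
```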
Namespaces
Namespaces provide process isolation, allowing each container to have its own view of system resources like processes, network interfaces, and file systems. The main namespaces used in containers are PID (process ID), NET (network), IPC (inter-process communication), UTS (hostname), and Mount (file system).
Namespaces
cgroup (cgroup root directory)
ipc (inter process communication)
network
mount
pid
time
user
uts (UNIX Time-Sharing System; hostname and domain name)
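The namespaces a process belongs to are visible under /proc/<pid>/ns. The sketch below lists them for the current shell; list_namespaces is a helper name invented for this example.

```shell
# Print "<namespace> <identifier>" for each namespace of a process
# (default: the current shell). Two processes share a namespace
# exactly when the identifiers match.
list_namespaces() {
    for ns in /proc/"${1:-$$}"/ns/*; do
        echo "$(basename "$ns") $(readlink "$ns")"
    done
}

list_namespaces
```

Comparing the output for a process on the host with that of a process inside a container shows which namespaces the two do not share.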
Capabilities
Capabilities limit which privileged operations a process is allowed to perform. Examples are CAP_SYS_ADMIN, CAP_NET_ADMIN, and CAP_SYS_CHROOT.
Capabilities
CAP_SYS_ADMIN
CAP_NET_ADMIN
CAP_SYS_CHROOT
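Whether a process holds these capabilities can be read from the CapEff bit mask in /proc/<pid>/status. The following is a sketch under the assumption that the bit numbers match <linux/capability.h> (12, 18 and 21 for the three capabilities above); cap_bit_set and CAPEFF are names introduced for this example.

```shell
# Succeeds if bit number $2 is set in the hexadecimal capability mask $1.
cap_bit_set() {
    [ $(( (0x$1 >> $2) & 1 )) -eq 1 ]
}

# Effective capability set of the current shell as a hex string.
CAPEFF="$(awk '/^CapEff:/ { print $2 }' "/proc/$$/status")"
CAPEFF="${CAPEFF:-0}"   # treat the set as empty if /proc is unavailable

# Bit numbers as defined in <linux/capability.h>.
for cap in CAP_NET_ADMIN:12 CAP_SYS_CHROOT:18 CAP_SYS_ADMIN:21; do
    name="${cap%%:*}"
    bit="${cap##*:}"
    if cap_bit_set "$CAPEFF" "$bit"; then
        echo "$name: present"
    else
        echo "$name: missing"
    fi
done
```

Run as an unprivileged user, all three are normally reported as missing; inside a container started as root, the result depends on which capabilities the runtime has dropped.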
Adding network interfaces to a container
sh# lxc config device add <container_id> \
<container_nic> nic nictype=macvlan \
parent=<host_nic> \
name=<container_nic>
lxc config device
Removing network interfaces from a container
sh# lxc config device remove <container_id> <container_nic>