This document lays out a technical vision for making Chromium OS-based systems difficult for remote attackers to compromise using various system-level mechanisms. Three objectives guide this vision:
Efforts to secure Linux environments tend to revolve around the principle of least privilege and applying exploit mitigation tactics wherever possible. While the exploit mitigation techniques are effective, they are never a perfect defense and often the specific techniques deployed vary from distribution to distribution. In addition, the principle of least privilege is excellent in a server environment and for locking down system services on desktops. However, desktop systems are meant to be general purpose. This makes it incredibly difficult to determine the least privilege needed if a program has not ever been seen on the system before (or was written since the system was installed!). The end result is that the risks from interactively executed applications are addressed only using exploit mitigations and not as comprehensively as desired.
Chromium OS has an advantage. All native programs run by the end user are known in advance since all general purpose applications are web applications. We use this knowledge to apply comprehensive access control enforcement in addition to the well-known exploit mitigation techniques. This combination allows Chromium OS to benefit from the great work securing Linux in both end-user and server environments!
Control Groups (cgroups) are a somewhat recent addition to the Linux kernel. They are hierarchical collections of tasks that can be created arbitrarily. Once a task has been associated with a hierarchy, it falls under runtime constraints ranging from limited device node access to limited CPU and memory usage. cgroups restrictions can be specified as percentages or as constants, which provides an intuitive means of "guaranteeing" that processes operate within their given bounds. This feature is great for constraining denial of service attacks and general robustness issues (and combines well with rlimit). The device filtering is useful for limiting /dev access in constrained namespaces.
Namespaces provide a means of isolating processes, and process trees, from other running processes on a system. Their use has been driven largely by the Linux VServer. Namespaces can be created only when a new process is started using clone(). The following namespaces are currently available in the upstream kernel:
In addition to namespacing and cgroups, Linux supports parceling superuser privileges using capabilities. Privileges that were once limited to uid=0 are now available in a coarse-grained fashion using runtime and file system (extended attribute)-based labeling. In addition, process tree capabilities inheritance is possible with a lightweight kernel patch. With file system capabilities enabled, specific process trees can also disable uid=0 from having any default privilege (other than that granted by file system permissions) using the securebits SECURE_NOROOT and SECURE_NO_SETUID_FIXUP. All processes in the subtree could then be locked into this pure capability-based superuser privilege mode barring a kernel privilege escalation vulnerability. All capabilities are broken into effective, permitted, and inheritable sets, which can be applied to a file or process. In addition, every process has a starting bounding set that places an upper limit on which capabilities can be used, even if they appear in one of the other sets. At present, we're not aware of any Linux distributions that make heavy use of capabilities.
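To make the interaction between the file sets and the bounding set concrete, here is a minimal sketch of the rule the kernel applies to the permitted set on exec, as described in capabilities(7): new permitted = (process inheritable AND file inheritable) OR (file permitted AND bounding set). The sketch assumes the root special case does not apply (a non-root process, or SECURE_NOROOT set) and leaves out the ambient set found in newer kernels. The helper names cap_mask and exec_permitted are ours, for illustration only.

```shell
#!/bin/sh
# Capability numbers from <linux/capability.h>.
CAP_SETPCAP=8
CAP_NET_BIND_SERVICE=10
CAP_NET_BROADCAST=11
CAP_NET_ADMIN=12

# cap_mask NUM... : print a bitmask with the given capability bits set.
# (Hypothetical helper, not part of any existing tool.)
cap_mask() {
    cm_mask=0
    for cm_cap in "$@"; do
        cm_mask=$((cm_mask | (1 << cm_cap)))
    done
    echo "$cm_mask"
}

# exec_permitted P_INHERITABLE F_INHERITABLE F_PERMITTED BOUNDING
# Print the permitted set of the program after exec (ambient set omitted).
exec_permitted() {
    echo $(( ($1 & $2) | ($3 & $4) ))
}

# A file granting three network capabilities, run under a bounding set
# that has already dropped CAP_NET_ADMIN.
file_permitted=$(cap_mask $CAP_NET_BIND_SERVICE $CAP_NET_BROADCAST $CAP_NET_ADMIN)
bounding=$(cap_mask $CAP_NET_BIND_SERVICE $CAP_NET_BROADCAST)
printf '0x%x\n' "$(exec_permitted 0 0 "$file_permitted" "$bounding")"   # prints 0xc00
```

The result contains only bits 10 and 11: the file asked for CAP_NET_ADMIN, but the bounding set caps what the file's permitted set can grant.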
Linux Security Modules (LSM) is a Linux kernel framework that implements a number of kernel task-based hooks. It allows a security module to be implemented that enforces mandatory access controls and can be stacked (if supported by the module) with other security modules. Tomoyo 2.x, SELinux, and SMACK are examples of these systems.
grsecurity is a standalone kernel patch that provides role-based access controls, kernel hardening, and bug fixes, as well as extensive detection functionality.
The kernel supports a notification system to inform userland of process events: fork, id changes, and so on. The process events subsystem allows for low-cost monitoring of task creation and removal as well as other pertinent runtime information. Communication is handled over a netlink connection provided by the CONFIG_CONNECTOR option.
This project will be undertaken iteratively to increase both the amount of process isolation and overall system security as the implementation proceeds.
Regardless of the implementation, the binary will need the cap_setpcap extended attribute set (+ep) to be able to operate without root privileges. (If we move to a root file system that doesn't support extended attributes, any specialized binaries can be mounted off of an ext3 loopback, or the kernel-based inheritance patch can be used.)
Tomoyo will provide mandatory access controls to ensure that processes don't exceed their expected boundaries. Initially, the coverage will be largely limited to additional file system enforcements. Protection around ptrace, ioctl, and other areas is needed before this will be considered a good solution.
Whichever MAC solution we choose will be configured to run in enforcement mode for all processes, except when running in development mode. In that case, the development mode parent process runs in permissive mode to allow the developer to take whatever actions he wishes.
It turns out that this is pretty easy as long as we don't focus on making all of Xorg non-root-based. For example, there is a command run by the login manager upon successful user authentication:
/bin/bash --login /etc/X11/Xsessionrc %session
which we could change to
/sbin/capsh --secbits=0x2f --drop=[all] -- --login /etc/X11/Xsessionrc %session
in order to run the user's session in SECURE_NOROOT, if we were using the capsh tool (capabilities-aware shell).
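The value 0x2f packs several securebits flags together. As a rough sketch, the function below decodes a securebits value into flag names using the bit positions from <linux/securebits.h>; the function name describe_secbits is ours, not part of capsh or minijail.

```shell
#!/bin/sh
# describe_secbits VALUE : print the names of the securebits set in VALUE.
# Bit positions follow <linux/securebits.h>; the function itself is a
# hypothetical helper for illustration.
describe_secbits() {
    ds_bits=$(($1))
    ds_out=""
    for ds_pair in \
        SECURE_NOROOT:0 SECURE_NOROOT_LOCKED:1 \
        SECURE_NO_SETUID_FIXUP:2 SECURE_NO_SETUID_FIXUP_LOCKED:3 \
        SECURE_KEEP_CAPS:4 SECURE_KEEP_CAPS_LOCKED:5
    do
        ds_name=${ds_pair%%:*}
        ds_shift=${ds_pair##*:}
        if [ $((ds_bits & (1 << ds_shift))) -ne 0 ]; then
            ds_out="${ds_out:+$ds_out }$ds_name"
        fi
    done
    printf '%s\n' "$ds_out"
}

describe_secbits 0x2f
```

For 0x2f this reports SECURE_NOROOT and SECURE_NO_SETUID_FIXUP together with their _LOCKED variants, plus SECURE_KEEP_CAPS_LOCKED. In other words, both restrictions are switched on and locked, and SECURE_KEEP_CAPS is locked in the off state.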
However, since we have minijail, we can use it to do more than just run in SECURE_NOROOT:
/sbin/minijail --init --namespace=pid,vfs --cgroup=chromeuser
--secbits=0x2f --drop=[all] --exec=/bin/bash -- --login /etc/Xsessionrc
This will dump the new process in its own namespace where minijail acts as pid 1. The entire X session will only be able to see a /proc that is related to its pid namespace, and it will have access only to devices whitelisted for the cgroup: chromeuser. No SUID binaries or binaries with any additional extended capability attributes set will be executable in this session.
The biggest impact is that when it comes time to perform screen unlocking, the xscreensaver process will not be able to do anything
Note: We have not chroot'd or net-namespaced the binaries. At present, /dev/ is limited by cgroup filtering (discussed below) and by mounting a fresh /proc with the namespace view. While we will have stripped power from any simple root privilege escalation attack, a user running as uid=0 will still have normal discretionary access controls to a crazy number of files and devices as the root user. In Phase 2, we will look at further segmenting access.
Big Note: Any privileged actions must be brokered by preconfigured, preexecuted binaries.
With this Big Note in mind, we can look back at our slim.conf. If we find that we need to launch some processes with some privileges, we can tone down how aggressive the first call to minijail is. It can set up the namespace and lock down root, but it can leave the bounding set a bit wider; that way, we can launch utilities that may need capabilities like pulseaudio. This can be done inside the Xsessionrc by calling minijail on all subsequent binaries with specific bounding set changes, etc. They can all, thankfully, live in their new pid namespace unless we lock them down further (chroot, etc). Initially, we'll start with the above configuration and tweak as problems are introduced.
Guiding resource utilization with control groups
Control groups (cgroups) will be used to segment the population with respect to device access and resource utilization. To that end, we can preconfigure a few control groups at start via a simple /etc/init.d/cgroups script:
mkdir -p /cgroups
chown root:root /cgroups
chmod 700 /cgroups
mount none /cgroups -t cgroup -o cpu,memory,devices
With that done, we can leave it to minijail invocations to add the pid to the /cgroups/<cgroup>/tasks file. The biggest challenge will be nested cgroups like chrome/sandbox, since we will not mount /cgroups in the chrome user namespace and users will be unable to see the /cgroups file system. In Phase 1, we'll just have to let renderers live in the chrome cgroup and hope for the best. Segmenting chrome user processes from the system services should be enough to guarantee that a Chromium browser CPU DoS won't peg the system too badly, but we'll see. If it can tie up xorg, the user experience will be the same. However, in Phase 2, we will introduce a cgroupsd daemon. This daemon will monitor new process creation (via a TBD mechanism with _low_ power/cpu needs) and automatically add them to the appropriate cgroup.
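Attaching a process to a group amounts to writing its pid into that group's tasks file. A minimal sketch of that step follows; the function name cgroup_add_task and the CGROUP_ROOT variable are assumptions of ours (minijail's real implementation is not shown in this document), and the default root simply mirrors the /cgroups mount point used above.

```shell
#!/bin/sh
# CGROUP_ROOT is an assumed variable; it defaults to the mount point
# used by the init script.
CGROUP_ROOT=${CGROUP_ROOT:-/cgroups}

# cgroup_add_task GROUP PID : move PID into GROUP by writing to its tasks file.
# (Hypothetical helper, for illustration only.)
cgroup_add_task() {
    cg_group=$1
    cg_pid=$2
    cg_tasks="$CGROUP_ROOT/$cg_group/tasks"
    if [ ! -e "$cg_tasks" ]; then
        echo "no such cgroup: $cg_group" >&2
        return 1
    fi
    echo "$cg_pid" > "$cg_tasks"
}

# Example (not run here): place the current shell in the chromeuser group.
#   cgroup_add_task chromeuser $$
```

A cgroupsd daemon could call the same helper for every new pid it sees, which keeps the policy about which group a process belongs to separate from the mechanics of the write.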
Devices will be added quite simply with:
echo 'c 1:3 mr' > /cgroups/1/devices.allow
Memory per group can be determined based on system memory. Below, we limit chrome to 80 percent of total memory:
total_mem=$(free -b | grep Mem | tr -s ' ' | cut -f2 -d' ')
echo $((total_mem / 5 * 4)) > /cgroups/user/chrome/memory.limit_in_bytes
CPU usage can also be determined using the system total, which is available in cpu.shares. Below, we give all chrome processes 80 percent of the CPU shares:
total_cpu=$(cat /cgroups/cpu.shares)  # system total, read from the root group
echo $((total_cpu / 5 * 4)) > /cgroups/user/chrome/cpu.shares
Of course, we can tweak the total number of shares to make specific allocations. The allocations should then be used for fair scheduling.
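The two calculations above follow the same pattern, so they can be folded into small helpers. The sketch below is one way to do it; the function names are ours, and reading MemTotal from /proc/meminfo is a substitution for parsing the output of free. The arithmetic assumes the shell uses 64-bit integers.

```shell
#!/bin/sh
# mem_total_bytes [FILE] : print total system memory in bytes, read from a
# meminfo-style file (defaults to /proc/meminfo, which reports kB).
mem_total_bytes() {
    mt_file=${1:-/proc/meminfo}
    mt_kb=$(awk '/^MemTotal:/ { print $2 }' "$mt_file")
    echo $((mt_kb * 1024))
}

# scale_percent TOTAL PERCENT : print PERCENT percent of TOTAL, rounded down.
scale_percent() {
    echo $(( $1 * $2 / 100 ))
}

# Example (not run here), using the paths from the text:
#   scale_percent "$(mem_total_bytes)" 80 > /cgroups/user/chrome/memory.limit_in_bytes
#   scale_percent "$(cat /cgroups/cpu.shares)" 80 > /cgroups/user/chrome/cpu.shares
```

Multiplying before dividing keeps more precision for small totals such as a share count of 1024.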
Longer term, we may be able to use 'freezer' support to freeze all processes prior to suspend, or use cpusets to ensure that the Chromium browser, or perhaps even an extension, is privately allocated an entire CPU core. In addition, if any of these items imply too much overhead, it is possible to achieve similar (and even more focused) results using rlimits.
The final goal is to move to a pure capability-based system which means that no service should need root access. To this end, we'll need to modify the startup process for these daemons. The easiest approach is to just wrap their start-stop-daemon calls with calls to minijail, either in the control panel or in /etc/init.d. Each one should get its own namespacing with chroot'ing if possible. (If it seems difficult to chroot a specific binary, then we should consider doing so with the Chromium OS project's LinuxSUIDSandbox.) Capabilities required will be determined using strace | grep -E 'EPERM|EACCES' while locking down the binaries. Each binary will have the capabilities it needs added to its extended attributes. For example, dhclient needs CAP_NET_BIND_SERVICE|CAP_NET_BROADCAST|CAP_NET_ADMIN:
setcap cap_net_bind_service,cap_net_broadcast,cap_net_admin=ep /sbin/dhclient3
Then, dhclient will be called with minijail dropping all capabilities except those three (and cap_setpcap if needed). If dhclient is called from connman, then that process can already be running with a restricted capability bounding set.
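When hunting for the capabilities a daemon needs, it helps to reduce a long strace log to just the system calls that were refused. The following is a small sketch of such a filter; the function name denied_calls and the trace path in the usage comment are illustrative, and the pattern only handles plain strace lines plus the pid prefixes added by -f.

```shell
#!/bin/sh
# denied_calls : read strace output on stdin and print the unique names of
# system calls that failed with EPERM or EACCES, one per line.
# (Hypothetical helper; it does not handle timestamp prefixes.)
denied_calls() {
    grep -E '= -1 E(PERM|ACCES) ' |
        sed -E 's/^(\[pid +[0-9]+\] +)?([0-9]+ +)?([a-z0-9_]+)\(.*$/\3/' |
        sort -u
}

# Example (not run here; the trace path is illustrative):
#   strace -f -o /tmp/dhclient.trace /sbin/dhclient3 eth0
#   denied_calls < /tmp/dhclient.trace
```

Each reported call still has to be mapped to a capability by hand, but the short list makes it clear where to look.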
At present, our root file system supports extended attributes. If we move to squashfs or another file system without xattr, we will have to work around the restriction. This can be done using the patch referenced in the Technology section or by mounting a loopback file system with the desired binaries with the appropriate xattrs.
Once we dump the user in SECURE_NOROOT land, sudo and su become useless. To allow continued tinkering, we can use the secondary console if enabled or a loopback ssh daemon. As long as we don't lock these alternative entry points the same way as the primary user session, they will be perfectly sane ways to implement a secure but useful development mode.
For our Mandatory Access Control (MAC) needs, we are considering the external grsecurity kernel patch as well as the currently in-kernel Tomoyo module and accompanying tomoyo-ccstools package. In either case, a similar approach applies. Once installed, an initial policy can be configured using learning mode. We can then enable enforcement on process trees which shouldn't change: dhclient, wpa_supplicant, etc. The Chromium browser itself will likely run in a permissive mode as we explore extensions and other changes. However, as we approach releases, we can configure a final policy using learning mode both with active users and through automated testing that exercises all the expected use cases for the system.
The Tomoyo tool for editing the policy is ccs-editpolicy. Initially, there should be no need to perform any special customization other than converting process trees from disabled to learning or permissive. In addition, we will need to make sure that the development mode runs in permissive or disabled mode.
In addition to the normal file system, /dev and /proc can be dangerous for an unprivileged root user. While we are filtering devices per-cgroup, we should ensure that CONFIG_STRICT_DEVMEM is set to limit /dev/mem usefulness. If we patch Xorg, we may be able to get rid of /dev/mem entirely. In addition, a review of what is available in /proc and /dev for each group will be crucial. Whenever we place a process tree in a new VFS, we can mount --bind in only the files we want. This is harder with /proc, but doable if needed. /proc may be remounted read-only at the very least. We may be able to make use of the Linux VServer's 'setattr' tool to hide /proc entries on namespace mount. If so, this would be done in the minijail code but would require that we support the vserver kernel patches. However, namespacing and chroot'ing will hopefully cover a lot of ground.
User home directories and the /home partition should be mounted nosuid, nodev, and ideally, noexec. We should attempt to limit user access to included scripting engines (dash, bash, or any others) if possible to aid in enforcing noexec.
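A quick way to confirm that a mount point carries the expected flags is to check its entry in /proc/mounts. Here is a minimal sketch of such a check; the function name mount_has_opts is ours, and the optional third argument exists only so the check can be pointed at a different mounts-format file.

```shell
#!/bin/sh
# mount_has_opts MOUNTPOINT OPT1,OPT2,... [MOUNTS_FILE]
# Succeed only if MOUNTPOINT is listed and its options include every
# requested option. If the mount point appears more than once, the last
# entry wins. (Hypothetical helper, for illustration only.)
mount_has_opts() {
    mh_point=$1
    mh_required=$2
    mh_file=${3:-/proc/mounts}
    awk -v point="$mh_point" -v required="$mh_required" '
        $2 == point {
            found = 1
            missing = 0
            n = split($4, have, ",")
            m = split(required, want, ",")
            for (i = 1; i <= m; i++) {
                ok = 0
                for (j = 1; j <= n; j++)
                    if (have[j] == want[i])
                        ok = 1
                if (!ok)
                    missing = 1
            }
        }
        END {
            if (found && !missing)
                exit 0
            exit 1
        }
    ' "$mh_file"
}

# Example (not run here):
#   mount_has_opts /home nosuid,nodev,noexec || echo "/home is not locked down" >&2
```

A check like this could run in automated testing so that a regression in the mount options is caught before release.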
cgroupsd will also be used to enforce device filtering and resource management on plugins like Flash.
We're probably also going to see some pain with Adobe Flash and other binary plugins unless we give them access. Reviewing and integrating plugins with this design will be critical to avoiding introducing a trivial backdoor through the protections.
net namespace. We can then expose to the system a virtual interface with a virtual, internal, IP address. We could even optionally enforce userland proxy use (for truly dodgy inmates). However, this may introduce robustness issues if a user is assigned a physical address that is the same or in the same netmask as the virtual ones. Given that we can't control the eth0 address, we have delayed pursuing this until Phase 2. When we get there, it will be worth investigating and deploying if possible to keep any process from being able to bind to an external port.
That change may be overkill or add more complexity than the gain. Another option would be to add a new secure bit along the lines of SECURE_UNSAFE. If that secure bit is set on a tree, then no process in the tree can change to UID/GID 0.
GSTFakeVideo. Not only will this avoid direct attacks on random webcam drivers, it will also mean we can later offer an interface for doing real-time video stream filtering: custom effects, etc.
In addition to /dev/video, we'll want to position userland code for audio interception (e.g., /dev/dsp). This can be done using something like esound or pulseaudio. If we go with one of those daemons anyway, we can get this for free.
After the audio/video experience, we're left with one major exposed surface for plugins which require video card device access. Since we will want to support accelerated 3D and other fast rendering, we'll be exposing (possibly binary-only) video card drivers via X/DRI. This is a larger problem that will be addressed in a more detailed design document on the issue.
A single daemon can monitor process creation and uid changes via the proc events kernel interface. If it sees any process become uid/euid==0, then we have someone running a privilege escalation exploit. If we determine an exceptional-event user interface or a reboot path that will notify the user to put-a-paperclip-in to reset the device, then we can trigger it immediately. While an exploit can target this behavior, it is just one more layer of defense.
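The proc events interface is netlink-based and is not reachable from a plain shell script, so the sketch below swaps in a much cruder technique: polling /proc and reading the effective uid from each process's status file. It is racy and far more expensive than the kernel notification, and it is only meant to illustrate the check itself. The function name list_euid0_pids is ours, and a real detector would have to restrict the scan to the user session's process tree, since system services may legitimately run as uid 0.

```shell
#!/bin/sh
# list_euid0_pids [PROC_ROOT] : print the pid of every process whose
# effective uid is 0, by scanning PROC_ROOT/<pid>/status.
# The Uid: line holds the real, effective, saved, and filesystem uids.
# (Hypothetical polling stand-in, not the proc events interface.)
list_euid0_pids() {
    lp_root=${1:-/proc}
    for lp_status in "$lp_root"/[0-9]*/status; do
        [ -r "$lp_status" ] || continue
        lp_euid=$(awk '/^Uid:/ { print $3; exit }' "$lp_status" 2>/dev/null)
        if [ "$lp_euid" = "0" ]; then
            lp_dir=${lp_status%/status}
            echo "${lp_dir##*/}"
        fi
    done
}

# Example (not run here): list_euid0_pids
```

Inside a pid namespace with a fresh /proc, the scan naturally sees only the session's own processes, which is where an unexpected euid of 0 would be meaningful.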
The Linux Auditing Framework may be very useful for doing detection, but its cost may outweigh the benefit. Since we are not expecting a huge number of process creation events, we can monitor system calls ranging from ptrace to clone(2) to fork. If we avoid high traffic system calls, we should be able to enforce some basic system call detection without sandboxing explicitly. In addition, if we don't use auditd, but instead a custom listener, we can react to an event immediately, for example by terminating the calling process or triggering a reboot into the recovery system.
Here are some of the ideas we have, but there is a lot of area to research:
There are a huge number of potential changes to the kernel we can pursue. We'll start with known approaches and then expand as permitted into newer areas as possible:
Longer term, we'll also look at:
In order to quantify effectiveness, we should take the time to truly enumerate the attack surfaces yielded by these hardening steps: