Container Security Protection—Linux Kernel Security Mechanism
As a lightweight virtualized implementation, the container technology took into account security factors at the time of design, which constitute an important basis for container security protection. This chapter describes security risks and threats facing containers and common protection ideas and methods.
Linux Kernel Security Mechanism
Basic security mainly refers to the implementation of functional modules related to the Linux kernel, including the isolation and management of container resources. At the time of design, the container technology (both Docker and LXC) has similar security considerations. For example, the kernel namespace is used to build a relatively isolated running environment, guaranteeing that each container runs independently. cgroups is used to isolate, restrict, and audit shared resources (such as CPU, memory, and disk I/O), so as to avoid the competition of containers for system resource. These Linux kernel modular functions provide the basic security assurance for containers.
Namespaces are a powerful feature of the Linux kernel. When you start a container, behind the scenes Docker creates a set of namespaces and control groups for the container. Namespaces provide the first and most straightforward form of isolation for containers: Processes running within a container cannot see, and even less affect, processes running in another container, or in the host system. There are process ID (PID), network, interprocess communication (IPC), mount (mnt), UTS, and user ID namespaces, among others.
For instance, PID namespaces isolate the processes of different users, ensuring that processes in different PID namespaces can have the same PID and facilitating container nesting. For process interaction in a container, Docker adopts the Inter-Process Communication (IPC) mechanisms, including semaphores, message queues, and shared memory. Namespace information should be added at the time of applying for IPC resources, ensuring that each IPC resource has a unique 32-bit ID.
For container network isolation, Docker uses the network namespace mechanism. Each network namespace has its own network device, IP address, and routing table. Each container also gets its own network stack, meaning that a container does not get privileged access to the sockets or interfaces of another container.
Kernel namespaces were introduced in the Linux kernel version 2.6.26. This means that since July 2008 (date of the 2.6.26 release), namespace code has been exercised and scrutinized on a large number of production systems. So both the design and the implementation are pretty mature.
Control Groups (cgroups) are another key component of Linux Containers. They implement resource accounting and limiting. The code of control groups was first started by Google in 2006, and initially merged in kernel 2.6.24.
cgroups are important in the implementation of container technology. They provide many useful metrics, but they also help ensure that each container gets its fair share of memory, CPU, and disk I/O; and, more importantly, that a single container cannot bring the system down by exhausting one of those resources.
So while they do not play a role in preventing one container from accessing or affecting the data and processes of another container, they are essential to fend off some denial-of-service attacks. They are particularly important on multi-tenant platforms, like public and private PaaS, to guarantee a consistent uptime (and performance) even when some applications start to misbehave.
Linux Kernel Capabilities
Your average server needs to run a bunch of processes as root, including the SSH, cron, syslog daemon, hardware management tools, network configuration tools, and more. Once the service is compromised, attackers will have the highest privileges. Linux kernel capabilities provide finer-grained access control over file system mounting and access and kernel module loading.
In most cases, container services can exploit Linux kernel capabilities to avoid using real host root privileges. Therefore, containers can run with a reduced capability set. For instance, it is possible to deny all “mount” operations, deny access to raw sockets, and deny access to some file system operations like creating new device nodes and changing the owner/attributes of files. This means that even if an intruder manages to compromise a container, it will be much harder to do serious damage, to the host from within the container.
By default, Docker adopts a whitelist approach to drop all capabilities except those needed. You can see a full list of available capabilities[i] in Linux manpages.
One primary risk with running Docker containers is that the default set of capabilities and mounts given to a container may cause incomplete isolation between containers. Furthermore, attackers could exploit host system vulnerabilities to launch an attack against containers. Docker supports the addition and removal of capabilities, allowing use of a non-default profile. This may make Docker more secure. The best practice for users would be to remove all capabilities except those explicitly required for their processes.
Other Kernel Security Features
Capabilities are just one of the many security features provided by modern Linux kernels. It is also possible to leverage existing, well-known systems like TOMOYO[ii], AppArmor[iii], SELinux[iv], and GRSEC[v], with Docker to harden the security of Docker containers.
You can run a kernel with GRSEC and PAX. This adds many safety checks, both at compile-time and run-time; it also defeats many exploits, thanks to techniques like address randomization. It does not require Docker-specific configuration, since those security features apply system-wide, independent of containers.
If your distribution comes with security model templates for Docker containers, you can use them out of the box. For instance, Docker has officially released a template that works with AppArmor and Red Hat comes with SELinux policies for Docker. These templates provide an extra safety net. You can define your own policies using your favorite access control mechanism.
Mandatory access control (MAC) is a security strategy that restricts the ability individual resource owners have to grant or deny access to resource objects in a file system. That is to say, the system forcibly performs access control, regardless of user behaviors.
Introduced by the National Security Agency (NSA), Security-Enhanced Linux (SELinux) is an implementation of MAC and is the most outstanding security subsystem on Linux. Under the restrictions of SELinux, processes can access only files needed for a specific task. Among currently available Linux security modules, SELinux is the most powerful and thoroughly tested. It is built on the basis of 20 years of research on MAC.
Method: Add the SELinux option when launching the Docker service.
AppArmor is also a MAC system to confine programs to a limited set of resources. It can deny programs to read/write a certain directory and open/read/write network ports.
During running, the Docker service determines whether the current kernel supports AppArmor. If yes, it creates the default AppArmor configuration file /etc/apparmor.d/docker and then applies this file. When a container is launched, Docker applies the corresponding AppArmor configuration during initiation. The –security-opt option can also be set to specify the AppArmor configuration file for the container.
To create an AppArmor rule and apply it to the container, you need to first copy a template.
Compared with the virtualization technology, the container technology does not have Hypervisor. That is to say, the container’s resource management does not have Virtual Machine Manager (VMM) and therefore entirely relies on the host’s operating system. Thus, how to make sure that containers are properly used according to their specification becomes an important prerequisite for container security.
Together with the Center for Internet Security (CIS), Docker released a detailed benchmark configuration file[vi], which provides configuration suggestions from the perspectives of container host, Docker dameon, image configuration, and running configuration.
(To be continued)
 1Here, it refers to Shocker of Github, which describes how to escape from Docker containers and read the content in /etc/host on the host. For complete code, visit the following link: http://stealth.openwall.net/xSports/shocker.c