Intel's Clear Containers technology allows admins to benefit from the ease of container-based deployment without giving up the security of virtualization. For more than a year, rkt's KVM stage1 has supported VM-based container isolation, but we can build more advanced security features atop it. Using introspection technology, we can automatically detect a wide range of privilege escalation attacks on containers and provide appropriate remediation, making it significantly more difficult for attackers to make a single compromised container the beachhead for an infrastructure-wide assault.
Today we announce rkt’s ability to automatically detect privilege escalation attacks on containers. If such an attack is detected, the container will automatically shut down and a new instance will be started. Direct integration with rkt means users will benefit from this detection and remediation technology with minimal local configuration changes and without having to modify their application containers in any way.
The Unix security model inherited by Linux separates users into two classes: privileged (the root user) and unprivileged (any other user on the system), with the kernel enforcing the separation. If an attacker is able to exploit a vulnerability in the kernel, they can bypass that separation and cause a process running as an unprivileged user to gain root privileges. From there they can attack the rest of the system with ease.
This scenario is difficult to detect and handle, since in a traditional environment the kernel is the most privileged component in a system. If the kernel has been tampered with, the kernel can no longer be trusted to provide accurate information. Virtualized environments add another level of privilege in the form of the hypervisor, and by integrating with the hypervisor, it becomes possible to obtain a more accurate view of the actual system state of virtualized guests.
In our implementation, the kernel notifies the hypervisor each time a process is created or destroyed. The permissions associated with that process are stored at the hypervisor level and verified to ensure that they are internally consistent. For instance, if a process is running as an unprivileged user, it should not be able to directly create a child process that is running as root. An attack on the kernel may be able to modify the kernel’s internal representation of this state, but will not be able to affect the hypervisor’s state.
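The bookkeeping described above can be sketched as follows. This is a minimal illustration, not rkt's or the hypervisor's actual code: `ShadowProcessTable`, `Credentials`, and `InconsistentCredentials` are hypothetical names, and the model tracks only a single user ID per process, with PID 1 assumed to start as root.

```python
from dataclasses import dataclass
from typing import Dict, Optional

ROOT_UID = 0


@dataclass(frozen=True)
class Credentials:
    uid: int  # user ID the kernel reported when the process was created


class InconsistentCredentials(Exception):
    """Raised when a reported process does not fit the tracked state."""


class ShadowProcessTable:
    """Hypothetical hypervisor-side record of guest process credentials."""

    def __init__(self) -> None:
        # Assumption: the guest's first process (PID 1) runs as root.
        self._procs: Dict[int, Credentials] = {1: Credentials(uid=ROOT_UID)}

    def on_process_created(self, pid: int, parent_pid: int, uid: int) -> None:
        """Handle the kernel's 'process created' notification."""
        parent = self._procs.get(parent_pid)
        if parent is None:
            raise InconsistentCredentials(f"unknown parent pid {parent_pid}")
        # An unprivileged parent must not directly create a root child.
        if parent.uid != ROOT_UID and uid == ROOT_UID:
            raise InconsistentCredentials(
                f"pid {pid} is root but its parent {parent_pid} is not"
            )
        self._procs[pid] = Credentials(uid=uid)

    def on_process_destroyed(self, pid: int) -> None:
        """Handle the kernel's 'process destroyed' notification."""
        self._procs.pop(pid, None)

    def lookup(self, pid: int) -> Optional[Credentials]:
        """Return the recorded credentials, or None if the pid is unknown."""
        return self._procs.get(pid)
```

Because the table lives outside the guest, an attacker who rewrites the kernel's own credential structures leaves this copy untouched.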
This state can then be verified whenever a process performs an action requiring a permissions check. For example, when a process requests that a file be opened, the kernel now calls out to the hypervisor. The hypervisor examines the kernel’s view of the process state and checks that it remains consistent with the hypervisor’s own record. If the two match, execution is allowed to continue. If they do not, the kernel’s internal process state has been modified, and the administrator can be alerted that the container has been compromised. The container state can then be saved to disk and the container either terminated or restarted in a clean state.
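As a rough sketch of that check, assuming a simplified model in which the only tracked state is a user ID per process: `PermissionCheckMonitor`, `Verdict`, and the `on_compromise` callback are hypothetical names used for illustration, and the callback stands in for whatever alerting, snapshotting, or restart policy is configured.

```python
from enum import Enum
from typing import Callable, Dict


class Verdict(Enum):
    ALLOW = "allow"
    COMPROMISED = "compromised"


class PermissionCheckMonitor:
    """Hypothetical hypervisor-side check run on each permissions check."""

    def __init__(self, on_compromise: Callable[[int], None]) -> None:
        self._shadow_uids: Dict[int, int] = {}  # pid -> uid recorded earlier
        self._on_compromise = on_compromise

    def record(self, pid: int, uid: int) -> None:
        """Store the uid reported when the process was created."""
        self._shadow_uids[pid] = uid

    def check(self, pid: int, kernel_uid: int) -> Verdict:
        """Compare the kernel's current view with the recorded state."""
        expected = self._shadow_uids.get(pid)
        if expected is not None and expected == kernel_uid:
            return Verdict.ALLOW
        # Mismatch (or untracked process): the kernel's state was modified.
        self._on_compromise(pid)
        return Verdict.COMPROMISED
```

A caller might pass a function that logs the event and asks the container runtime to save state and restart the container. Treating an untracked process as compromised is a design choice of this sketch, not something the description above specifies.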
By isolating examination to cases where a permissions check is performed, the overhead of this approach is minimised to the point where most real-world use cases will see no measurable performance impact.
The SUID flag on an executable file indicates that the executable should run as the user who owns the file, no matter who executes it. This is most commonly used to allow unprivileged users to execute a small set of binaries as root. This violates the expectations outlined above – it becomes legitimate for an unprivileged process to gain root privileges.
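For readers who want to see which files carry this flag, the set-user-ID bit can be read from a file's mode. The helper names below are illustrative; `os.stat` and the `stat` module constants are part of the Python standard library.

```python
import os
import stat


def has_suid_bit(mode: int) -> bool:
    """Return True if a file mode (as found in st_mode) has the SUID bit set."""
    return bool(mode & stat.S_ISUID)


def runs_as_owner(path: str) -> bool:
    """Return True if `path` is a regular file with the SUID bit set."""
    st = os.stat(path)
    return stat.S_ISREG(st.st_mode) and has_suid_bit(st.st_mode)
```

On many Linux distributions, a binary such as `/usr/bin/passwd` is owned by root and has this bit set, though that depends on the system.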
False positives can be avoided by having the kernel notify the hypervisor that such a transition has occurred, allowing the hypervisor to update its internal state. However, if an attacker is able to influence the kernel’s control flow, they may be able to trigger the same state update for illegitimate transitions. We avoid this by storing process-specific data in an otherwise unused CPU register when entering the SUID execution path via legitimate means. When the hypervisor receives a notification that the kernel wishes to update a process’s credentials, it reads this CPU register and verifies that the state matches. If so, the update is allowed to proceed. If not, it is treated as any other unauthorized privilege escalation.
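The register handshake can be modelled in a few lines. Everything here is an assumption made for illustration: the text does not say which register is used or what data is stored in it, so `SPARE_REGISTER`, `suid_token`, and the single-use clearing of the register are hypothetical choices, and the CPU is simulated with a dictionary.

```python
from typing import Dict, Optional

SPARE_REGISTER = "spare0"  # hypothetical name for the otherwise unused register


def suid_token(pid: int, new_uid: int) -> int:
    """Hypothetical process-specific value; the real contents are not specified."""
    return (pid << 32) | (new_uid & 0xFFFFFFFF)


class GuestCpu:
    """Simulated guest CPU state that the hypervisor can read."""

    def __init__(self) -> None:
        self.registers: Dict[str, Optional[int]] = {SPARE_REGISTER: None}


def kernel_enter_suid_exec(cpu: GuestCpu, pid: int, new_uid: int) -> None:
    """Legitimate SUID exec path: leave the token in the spare register."""
    cpu.registers[SPARE_REGISTER] = suid_token(pid, new_uid)


def hypervisor_allow_credential_update(
    cpu: GuestCpu, shadow_uids: Dict[int, int], pid: int, new_uid: int
) -> bool:
    """Accept a credential update only if the register holds the matching token."""
    if cpu.registers[SPARE_REGISTER] != suid_token(pid, new_uid):
        return False  # treated like any other unauthorized escalation
    shadow_uids[pid] = new_uid
    cpu.registers[SPARE_REGISTER] = None  # assumption: a token is single use
    return True
```

An update request that arrives without the legitimate path having run first finds no matching token and is rejected, leaving the hypervisor's record unchanged.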
rkt incorporates a pluggable architecture, allowing for multiple “stage1” modules that are responsible for configuring and starting a container. This makes it straightforward to incorporate additional container management and monitoring code with minimal modification to rkt itself. This feature was implemented with fewer than 30 changed lines of code in the core rkt runtime, with all additional modifications isolated in the stage1.
Privilege escalation detection doesn’t solve all security issues, but it plays an important role in container security. It can identify privilege escalation attacks that are triggered by modification of existing kernel state. Vulnerabilities such as “Dirty COW”, which instead rely on injecting new code into legitimate SUID applications, would not be caught by this monitoring. Attacks that occur entirely in userspace will also not be identified – a vulnerability in a privileged component within a container would still allow an attacker to gain that component’s privileges.
The Kernel Self Protection Project continues to develop Linux features that will help mitigate many kernel vulnerabilities before they even reach the point of being exploitable, and the grsecurity project already provides patches that will block many of these attacks. We see these as complementary to our work, and hope that eventually the kernel will become secure enough that it can be entirely trusted without needing additional monitoring. Until then, we hope that this feature will help identify new exploits earlier.
The bleeding edge of this work, along with more details, can be found in the pull request that adds privilege escalation detection to rkt and the KVM stage1. Check it out today and join us in this new approach to securing the internet.
SELinux is a kernel feature that allows fine-grained policies restricting the behaviour of applications, and rkt makes use of SELinux to increase isolation between containers running on a shared system. However, SELinux suffers from the same issue as traditional Unix privilege separation – it relies on the kernel to impose those restrictions. If the kernel can be tampered with, SELinux isolation can be disabled.
Some specific workloads, such as serving large quantities of static content, may see a measurable impact from this monitoring, but most workloads will see negligible overhead.