Containers to Clusters: Advancing Kubernetes, etcd, and more at CoreOS

December 13, 2016 · By Brandon Philips

At Tectonic Summit on Monday, we discussed the core premise of CoreOS: securing the internet and applying operational knowledge into software. We shared how CoreOS makes infrastructure run well and update itself automatically, from Container Linux by CoreOS, to CoreOS Tectonic – what we refer to as self-driving infrastructure.

Over the last few years we’ve been working with customers to understand where to push this technology forward. Today we dive into other parts of the stack, including rkt, etcd, and dex, exploring how they tie in with Kubernetes and the investments we are planning for 2017.

Containers and standards

Containers are the shared foundation of the entire ecosystem, enabling projects like Container Linux and Kubernetes. Because they are such a fundamental part of the stack, we have worked hard for the last two years at CoreOS to ensure the container ecosystem is built on shared standards. We believe that standards are central to bringing engineers together from disparate organizations, building a wide ecosystem of interoperating products, and showing customers that containers are a stable foundation to build their infrastructure upon.

To that end, eighteen months ago CoreOS – alongside Docker, Google, Intel, and about 50 other organizations – formed the Open Container Initiative. This initiative is working on both a runtime and an image specification for the container industry. Today both specifications are publishing regular 1.0 release candidates, and in early 2017, as the final 1.0 releases are made, we look forward to seeing container registries, runtimes, and other products emerge that let users take direct advantage of the specifications.

Pod native container engine, rkt by CoreOS

Our pod native container engine, rkt, first focused on security and standards. It has largely achieved these goals as a container engine using best practices in privilege architecture, enabling users to have a virtualization wrapper around pods, and being a system that can consume both OCI and Docker images.

We have also seen continued adoption of rkt in many important settings:

  • Laptop Kubernetes, minikube, can use rkt with a single flag
  • BlaBlaCar, a popular European car sharing service, uses rkt in production and is moving to Kubernetes
  • Container Linux services, like the kubelet and etcd, now run under rkt
  • Google ContainerVM, used by GKE, is beginning to use rkt for kubelet mount management

We have published a post with details of our plans for rkt in 2017, where you can learn more about the Kubernetes + rkt CRI progress and our plans for leveraging the reference OCI container runtime, runc. Our big hope is that in 2017, by bringing a pod native container engine to Kubernetes, we can further accelerate innovation in the platform.

We also want to note that while we continue to invest in rkt and its future in Kubernetes, we will fully support users of either the Docker Engine or rkt moving forward, end-to-end, in CoreOS Tectonic and Container Linux, just as we support users making other choices in their deployments.

Privilege Monitoring with rkt VM pods

rkt is the only container engine that can, out of the box, execute container images using either native Linux containerization or virtual machine containment for greater isolation. Using a virtual machine, described in detail in the original Clear Containers blog post, adds protection if a process gains escalated privileges. But security is never a finished process, and defense in depth is a well established best practice. So rather than stopping here, we asked: what if we could detect those escalated privileges and stop a container from continuing to run?

The result is a new prototype (currently available as a pull request on rkt's GitHub) which introduces an external process privilege state machine. The virtual machine traps out to it on important kernel events, like fork, exit, exec, and setuid, and the external state machine checks that each one is a valid transition. If a process has performed an invalid transition and then encounters a permissions check within the kernel, the policy agent will kill the VM. In the future you could imagine capturing the VM's core for later analysis.
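
To make the idea concrete, here is a minimal sketch of such an external state machine. It is not the code from the rkt pull request: the class name, the event handlers, and the transition rules (for example, "only root may change to a different uid") are illustrative assumptions, and exec of setuid binaries is deliberately ignored. The monitor mirrors the uid each process should legitimately hold and flags the VM to be killed when a permissions check sees a different uid.

```python
from dataclasses import dataclass, field


@dataclass
class PrivilegeMonitor:
    """Hypothetical monitor living outside the VM; all rules are assumptions."""

    # pid -> uid the monitor believes the process legitimately holds
    uids: dict = field(default_factory=dict)
    vm_killed: bool = False

    def on_fork(self, parent: int, child: int) -> None:
        # A child legitimately inherits its parent's uid.
        self.uids[child] = self.uids[parent]

    def on_exec(self, pid: int) -> None:
        # Assumption: exec keeps the tracked uid (setuid binaries ignored).
        pass

    def on_setuid(self, pid: int, new_uid: int) -> None:
        # Assumed rule: only root (uid 0) may switch to a different uid.
        if self.uids[pid] == 0 or new_uid == self.uids[pid]:
            self.uids[pid] = new_uid
        # Otherwise the tracked uid stays unchanged, so a later permissions
        # check made with the new uid will be treated as invalid.

    def on_exit(self, pid: int) -> None:
        self.uids.pop(pid, None)

    def on_permission_check(self, pid: int, observed_uid: int) -> bool:
        # Flag the VM if the kernel's uid differs from the tracked uid.
        if self.uids.get(pid) != observed_uid:
            self.vm_killed = True
        return not self.vm_killed


monitor = PrivilegeMonitor(uids={1: 0})       # pid 1 starts as root
monitor.on_fork(1, 42)                        # child 42 inherits uid 0
monitor.on_setuid(42, 1000)                   # valid: root drops privileges
ok_before = monitor.on_permission_check(42, 1000)
ok_after = monitor.on_permission_check(42, 0)  # uid 0 without a valid transition
```

In this sketch the first check passes because the drop from root to uid 1000 was a recorded transition, while the second check sees uid 0 with no matching transition and marks the VM as killed.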

A quick demo of this concept is below. In this example, we are using a kernel containing an intentional exploit that raises permissions of a process when /proc/interrupts is opened.

This is an experimental concept and we hope to make it available in a future release of rkt. It is all part of our ongoing efforts to explore new security concepts and ideas.

Kubernetes Scaling

In many ways, Kubernetes is an application like any other: it has a horizontally scalable "frontend" tier and a backend database. Kubernetes's database is etcd, a consistent key-value database introduced by CoreOS in 2013. Because etcd is the database of Kubernetes, we have a big responsibility for the scaling properties of the system.

To help with this, earlier this year we introduced etcd v3.0, with a new datastore and API that enables Kubernetes to reach significant new scaling milestones: 5,000 nodes and 150,000 pods. This new backend can be enabled in Kubernetes v1.5 and will be enabled by default in v1.6.

As part of the etcd v3 release, we worked hard to ensure etcd had stable and consistent memory usage and latency. Over the last few months we have been benchmarking other consistent key-value systems, notably ZooKeeper and Consul, to see how etcd squares up, and early results show etcd taking the top spot on important metrics. The detailed results will be published soon in a more comprehensive blog post, but in the meantime here are two graphs comparing etcd and ZooKeeper on resident memory for a 512MB dataset and write latency under load.

Self-driving Kubernetes

We discussed self-driving Kubernetes, and now let’s take a quick look at the underlying architecture that enables it: self-hosted Kubernetes. The idea behind self-hosted Kubernetes is rather straightforward: use Kubernetes to manage Kubernetes to the greatest degree possible. It is easy to understand by analogy to Linux. Early in Linux's history, Linus Torvalds didn't have Linux on which to develop Linux; instead he used a Unix-like system called Minix. Later, once Linux (the kernel) was working and the compiler was ported, Linus could use Linux to develop Linux.

In a similar way, self-hosted Kubernetes runs the Kubernetes components themselves as containers in pods. You can read more and see a demo on our blog post.

We are happy to say that our work with the Kubernetes community is enabling the self-driving capabilities of CoreOS Tectonic. And it isn't just Tectonic that is adopting this method of running Kubernetes: we have been working with the upstream teams behind kubeadm and other projects to ensure self-hosted becomes a first class Kubernetes experience.

Self-driving is just part of the self-hosted story. It also enables opportunities to remove monitoring and scaling toil. For example, want to scale up the Kubernetes Scheduler for redundancy? Easy, just a few clicks in Tectonic Console or keystrokes using kubectl:

And just because you have a self-driving mode on your car doesn’t mean you ignore the check engine light. You use the same APIs and monitoring you use for your applications to monitor the Kubernetes system itself. See above, where CPU and memory usage of the scheduler pod are just a few clicks away, powered by Prometheus (another project to which CoreOS heavily contributes).

Kubernetes User Identity

Another point is the importance of user management and identity in either a single or a federated Kubernetes cluster. Dex on Tectonic uses well understood standards like OpenID Connect and other widely used protocols. It plugs into LDAP systems too.

A major milestone with dex v2 is that it no longer requires an external database, which makes it easier to run in a highly available way.

Self-Driving: What we have done and will continue to do

We are charging down a path to deliver users 100% API-driven application deployment using cloud native technologies like Kubernetes, etcd, rkt, Dex, and Prometheus. As experts in open source, we realize that community-driven development and building on shared standards enable faster innovation. Through the entire stack we are committed to ensuring it is delivered to users in the best way possible so we can continue on our quest to secure the internet.