A Site Reliability Engineer (SRE) is a person that operates an application by writing software. They are an engineer, a developer, who knows how to develop software specifically for a particular application domain. The resulting piece of software has an application's operational domain knowledge programmed into it.
Our team has been busy in the Kubernetes community designing and implementing this concept to reliably create, configure, and manage complex application instances atop Kubernetes.
We call this new class of software Operators. An Operator is an application-specific controller that extends the Kubernetes API to create, configure, and manage instances of complex stateful applications on behalf of a Kubernetes user. It builds upon the basic Kubernetes resource and controller concepts but includes domain or application-specific knowledge to automate common tasks.
With Kubernetes, it is relatively easy to manage and scale web apps, mobile backends, and API services right out of the box. Why? Because these applications are generally stateless, so the basic Kubernetes APIs, like Deployments, can scale and recover from failures without additional knowledge.
A larger challenge is managing stateful applications, like databases, caches, and monitoring systems. These systems require application domain knowledge to correctly scale, upgrade, and reconfigure while protecting against data loss or unavailability. We want this application-specific operational knowledge encoded into software that leverages the powerful Kubernetes abstractions to run and manage the application correctly.
An Operator is software that encodes this domain knowledge and extends the Kubernetes API through the third party resources mechanism, enabling users to create, configure, and manage applications. Like Kubernetes's built-in resources, an Operator doesn't manage just a single instance of the application, but multiple instances across the cluster.
To demonstrate the Operator concept in running code, we have two concrete examples to announce as open source projects today:
The etcd Operator creates, configures, and manages etcd clusters. etcd is a reliable, distributed key-value store introduced by CoreOS for sustaining the most critical data in a distributed system, and is the primary configuration datastore of Kubernetes itself.
The Prometheus Operator creates, configures, and manages Prometheus monitoring instances. Prometheus is a powerful monitoring, metrics, and alerting tool, and a Cloud Native Computing Foundation (CNCF) project supported by the CoreOS team.
Operators build upon two central Kubernetes concepts: Resources and Controllers. As an example, the built-in ReplicaSet resource lets users set a desired number number of Pods to run, and controllers inside Kubernetes ensure the desired state set in the ReplicaSet resource remains true by creating or removing running Pods. There are many fundamental controllers and resources in Kubernetes that work in this manner, including Services, Deployments, and Daemon Sets.
An Operator builds upon the basic Kubernetes resource and controller concepts and adds a set of knowledge or configuration that allows the Operator to execute common application tasks. For example, when scaling an etcd cluster manually, a user has to perform a number of steps: create a DNS name for the new etcd member, launch the new etcd instance, and then use the etcd administrative tools (
etcdctl member add) to tell the existing cluster about this new member. Instead with the etcd Operator a user can simply increase the etcd cluster size field by 1.
Other examples of complex administrative tasks that an Operator might handle include safe coordination of application upgrades, configuration of backups to offsite storage, service discovery via native Kubernetes APIs, application TLS certificate configuration, and disaster recovery.
Operators, by their nature, are application-specific, so the hard work is going to be encoding all of the application operational domain knowledge into a reasonable configuration resource and control loop. There are some common patterns that we have found while building operators that we think are important for any application:
Operators should install as a single deployment e.g.
kubectl create -f https://coreos.com/operators/etcd/latest/deployment.yaml and take no additional action once installed.
Operators should create a new third party type when installed into Kubernetes. A user will create new application instance using this type.
Operators should leverage built-in Kubernetes primitives like Services and Replica Sets when possible to leverage well-tested and well-understood code.
Operators should be backwards compatible and always understand previous versions of resources a user has created.
Operators should be designed so application instances continue to run unaffected if the Operator is stopped or removed.
Operators should give users the ability to declare a desired version and orchestrate application upgrades based on the desired version. Not upgrading software is a common source of operational bugs and security issues and Operators can help users more confidently address this burden.
Operators should be tested against a "Chaos Monkey" test suite that simulates potential failures of Pods, configuration, and networking.
The etcd Operator and Prometheus Operator introduced by CoreOS today showcase the power of the Kubernetes platform. For the last year, we have worked alongside the wider Kubernetes community, laser-focused on making Kubernetes stable, secure, easy to manage, and quick to install.
Now, as the foundation for Kubernetes has been laid, our new focus is the system to be built on top: software that extends Kubernetes with new capabilities. We envision a future where users install Postgres Operators, Cassandra Operators, or Redis Operators on their Kubernetes clusters, and operate scalable instances of these programs as easily they deploy replicas of their stateless web applications today.
To learn more, dive into the GitHub repos, discuss on our community channels, or come talk with the CoreOS team at KubeCon on Tuesday, November 8. Don't miss my keynote on Tuesday, November 8 at 5:25 p.m. PT, where I'll cover Operators and other Kubernetes topics.
Q: How is this different than StatefulSets (previously PetSets)?
A: StatefulSets are designed to enable support in Kubernetes for applications that require the cluster to give them "stateful resources" like static IPs and storage. Applications that need this more stateful deployment model still need Operator automation to alert and act on failure, backup, or reconfigure. So, an Operator for applications needing these deployment properties could use StatefulSets instead of leveraging ReplicaSets or Deployments.
Q: How is this different from configuration management like Puppet or Chef?
A: Containers and Kubernetes are the big differentiation that make Operators possible. With these two technologies deploying new software, coordinating distributed configuration, and checking on multi-host system state is consistent and easy using Kubernetes APIs. Operators glue these primitives together in a useful way for application consumers; it isn't just about configuration but the entire, live, application state.
Q: How is this different than Helm?
A: Helm is a tool for packaging multiple Kubernetes resources into a single package. The concept of packaging up multiple applications together and using Operators that actively manage applications are complementary. For example, traefik is a load balancer that can use etcd as its backend database. You could create a Helm Chart that deploys a traefik Deployment and etcd cluster instance together. The etcd cluster would then be deployed and managed by the etcd Operator.
Q: What if someone is new to Kubernetes? What does this mean?
A: This shouldn't change anything for new users except make it easier for them to deploy complex applications like etcd, Prometheus, and others in the future. Our recommended onboarding path for Kubernetes is still minikube, kubectl run, and then maybe start playing with the Prometheus Operator to monitor the app you deployed with
Q: Is the code available for etcd Operator and Prometheus Operator today?
Q: Do you have plans for other Operators?
A: Yes, that is likely in the future. We would also love to see new Operators get built by the community as well. Let us know what other Operators you would like to see built next.
Q: How do Operators help secure a cluster?
A: Not upgrading software is a common source of operational bugs and security issues and Operators can help users more confidently address the burden of doing a correct upgrade.
Q: Can Operators help with disaster recovery?
A: Operators can make it easy to periodically back up application state and recover previous state from the backup. A feature we hope will become common with Operators is easily enabling users to deploy new instances from backups.