The ability to autoscale workloads based on metrics such as CPU and memory usage is one of the most powerful features of Kubernetes. Of course, to enable this feature we first need a method of gathering and storing these metrics. Today this is most often accomplished using Heapster, but this method can be cumbersome and support from the various contributors to the project has been inconsistent – and in fact it may soon be phased out.
Fortunately, the new Kubernetes metrics APIs are paving the way for a more consistent and efficient way to supply metrics data for autoscaling, with Prometheus as the source of that data. It's no secret that we at CoreOS are big fans of Prometheus, so in this post we will explain the metrics APIs, what's new, and our recommended method of scaling Kubernetes workloads going forward.
This post assumes you have a basic understanding of Kubernetes and monitoring.
The Heapster problem
Heapster provides metric collection and basic monitoring capabilities and it supports multiple data sinks to store the collected metrics. The code for each sink resides within the Heapster repository. Heapster also enables the use of the Horizontal Pod Autoscaler (HPA) to automatically scale workloads based on metrics.
There are two problems with the architecture Heapster has chosen to implement. First, it assumes the data store is a bare time-series database for which there is a direct write path. This makes it fundamentally incompatible with Prometheus, as Prometheus uses a pull-based model. Because the rest of the Kubernetes ecosystem has first-class Prometheus support, however, it's not uncommon to run Prometheus, Heapster, and an additional non-Prometheus data store exclusively for Heapster (typically InfluxDB), which is a less-than-ideal scenario.
Second, because the code for each sink is considered part of the core Heapster code base, the result is a "vendor dump," where vendors implement support for their systems but often swiftly abandon the code. This is a common cause of frustration when maintaining Heapster. At the time of this writing, many of the 15 available sinks have been unsupported for a long time.
What's more, even though Heapster doesn’t implement Prometheus as a data sink, it exposes metrics in Prometheus format. This often causes additional confusion.
A bit over a year ago, SIG-Instrumentation was founded and this problem was one of the first we tackled. Contributors and maintainers of the Heapster, Prometheus, and Kubernetes projects came together to design the Kubernetes resource and custom metrics APIs, which point the way forward to a better approach to autoscaling.
Resource and custom metrics APIs
To avoid repeating Heapster's mistakes, the resource and custom metrics APIs were intentionally created as mere API definitions and not implementations. They are installed into a Kubernetes cluster as aggregated APIs, which allows implementations to be switched out while the APIs stay the same. Both APIs are defined to respond with the current value of the requested metric/query and are both available in beta starting with Kubernetes 1.8.0. Historical metrics APIs may be defined and implemented in the future.
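Because both APIs are just definitions served through the aggregation layer, a client only needs to know the request path, not which implementation answers it. The following is a minimal Python sketch of what those paths look like. The group/version strings reflect the beta APIs as we understand them, while the helper names and the example metric `worker_queue_size` are purely illustrative, so verify the paths against your own cluster.

```python
# Group/version prefixes of the two aggregated APIs (beta as of Kubernetes 1.8).
RESOURCE_METRICS_API = "/apis/metrics.k8s.io/v1beta1"
CUSTOM_METRICS_API = "/apis/custom.metrics.k8s.io/v1beta1"


def pod_resource_metrics_path(namespace: str, pod: str) -> str:
    """Path for the current CPU/memory usage of a single pod (hypothetical helper)."""
    return f"{RESOURCE_METRICS_API}/namespaces/{namespace}/pods/{pod}"


def pod_custom_metric_path(namespace: str, metric: str, pod: str = "*") -> str:
    """Path for the current value of a custom metric; "*" asks for all pods in the namespace."""
    return f"{CUSTOM_METRICS_API}/namespaces/{namespace}/pods/{pod}/{metric}"


if __name__ == "__main__":
    print(pod_resource_metrics_path("default", "worker-1"))
    print(pod_custom_metric_path("default", "worker_queue_size"))
```

Whichever implementation is installed behind these paths can be swapped out without the caller noticing, which is the point of defining APIs rather than shipping one implementation.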
The canonical implementation of the resource metrics API is the Metrics Server, which simply gathers what are referred to as the resource metrics: CPU and memory (and possibly more in the future). It gathers these from all the kubelets in a cluster through the kubelet’s stats API and simply keeps all values for Pods and Nodes in memory.
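To illustrate what "keeping all values in memory" implies, here is a small Python sketch of a store that retains only the most recent sample per pod. This is not the Metrics Server's actual code; the class, field names, and units are assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Tuple


@dataclass(frozen=True)
class ResourceSample:
    """One point-in-time reading for a pod (hypothetical shape)."""
    cpu_millicores: int
    memory_bytes: int


class LatestMetricsStore:
    """Keeps only the newest sample per (namespace, pod); no history is retained."""

    def __init__(self) -> None:
        self._pods: Dict[Tuple[str, str], ResourceSample] = {}

    def record(self, namespace: str, pod: str, sample: ResourceSample) -> None:
        # A new scrape simply overwrites the previous value.
        self._pods[(namespace, pod)] = sample

    def current(self, namespace: str, pod: str) -> Optional[ResourceSample]:
        # Returns None when the pod has never been scraped.
        return self._pods.get((namespace, pod))


if __name__ == "__main__":
    store = LatestMetricsStore()
    store.record("default", "worker-1", ResourceSample(120, 64 * 1024 * 1024))
    store.record("default", "worker-1", ResourceSample(250, 80 * 1024 * 1024))
    print(store.current("default", "worker-1"))
```

A design like this answers "what is the usage right now?" cheaply, and it also shows why anything historical has to come from a separate monitoring system.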
The custom metrics API, as the name says, allows requesting arbitrary metrics. Custom metrics API implementations are specific to the respective backing monitoring system. Prometheus was the first monitoring system that an adapter was developed for, simply due to it being a very popular choice to monitor Kubernetes. This Kubernetes Custom Metrics Adapter for Prometheus can be found in the k8s-prometheus-adapter repository on GitHub. Requests to the adapter (aka the Prometheus implementation of the custom metrics API) are converted to a Prometheus query and executed against the respective Prometheus server. The result Prometheus returns is then returned by the custom metrics API adapter.
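The translation step can be pictured roughly as follows. This Python sketch turns a custom metrics request (metric name, namespace, set of pods) into a PromQL string; it is not the adapter's actual logic. The `namespace` and `pod` label names, the `sum(...) by (pod)` aggregation, and the function name are all assumptions made for illustration.

```python
import re
from typing import Sequence

# Conservative check: lowercase alphanumerics and hyphens only, so no regex escaping is needed.
_POD_NAME = re.compile(r"^[a-z0-9]([-a-z0-9]*[a-z0-9])?$")


def build_promql(metric: str, namespace: str, pods: Sequence[str]) -> str:
    """Build a per-pod PromQL query for one metric (hypothetical mapping)."""
    if not pods:
        raise ValueError("at least one pod name is required")
    for pod in pods:
        if not _POD_NAME.match(pod):
            raise ValueError(f"unsupported pod name: {pod!r}")
    # Match any of the requested pods with a single regex label matcher.
    pod_regex = "|".join(pods)
    return f'sum({metric}{{namespace="{namespace}",pod=~"{pod_regex}"}}) by (pod)'


if __name__ == "__main__":
    print(build_promql("worker_queue_size", "default", ["worker-1", "worker-2"]))
```

In a real adapter the resulting query would be sent to the Prometheus server and the returned samples mapped back onto the Kubernetes objects named in the request.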
This architecture solves all the problems we intended to solve:
- Resource metrics can be used more reliably and consistently.
- There is no "vendor dump" for data sinks. Whoever implements an adapter must maintain it.
- Pull-based as well as push-based monitoring systems can be supported.
- Running Heapster with a datastore like InfluxDB in addition to Prometheus will not be necessary anymore.
- Prometheus can consistently be used to monitor, alert and autoscale.
Better yet, because the Kubernetes metrics APIs are standardized, we can now also consistently autoscale on custom metrics, such as worker queue size, in addition to plain CPU and memory.
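To see how a custom metric such as queue size feeds into a scaling decision, here is a Python sketch of the ratio-based calculation described in the Kubernetes HPA documentation: the desired replica count is the current count scaled by the ratio of the observed value to the target value, rounded up. The function name, the default bounds, and the example numbers are assumptions, and the real controller applies additional rules (such as a tolerance band) that are left out here.

```python
import math


def desired_replicas(
    current_replicas: int,
    current_value: float,
    target_value: float,
    min_replicas: int = 1,
    max_replicas: int = 10,
) -> int:
    """Simplified HPA-style calculation: scale by current/target, then clamp to bounds."""
    if target_value <= 0:
        raise ValueError("target_value must be positive")
    raw = math.ceil(current_replicas * (current_value / target_value))
    return max(min_replicas, min(max_replicas, raw))


if __name__ == "__main__":
    # 4 workers, each averaging 30 queued items, with a target of 10 per worker.
    print(desired_replicas(4, 30, 10, max_replicas=20))
```

With 4 workers averaging 30 queued items each against a target of 10, the ratio is 3, so the sketch asks for 12 replicas, or fewer if `max_replicas` caps it.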
What to do going forward
Using the Custom Metrics Adapter for Prometheus means we can autoscale on arbitrary metrics that we already collect with Prometheus, without the need to run Heapster at all. In fact, one of the areas SIG-Instrumentation is currently working on is phasing out Heapster, meaning it will eventually be unsupported. Thus, I recommend switching to the resource and custom metrics APIs sooner rather than later. To use the resource and custom metrics APIs with the HPA, one must pass the following flag to the kube-controller-manager:
If you have any questions, feel free to follow up with me on Twitter (@FredBrancz) or Kubernetes Slack (@brancz). I also want to give Solly Ross (@directXMan12) a huge shout-out as he worked on all of this from the HPA to defining the resource and custom metrics APIs as well as implementing the Custom Metrics Adapter for Prometheus.
Finally, if you are interested in this area and would like to contribute, please join us on the SIG-Instrumentation biweekly call on Thursdays at 17:30 UTC. See you there!