This blog post is the first in a series exploring the write performance of three distributed, consistent key-value stores: etcd, Zookeeper, and Consul. The post is written by the etcd team.
The Role of Consistent Key-value Stores
Many modern distributed applications are built on top of distributed consistent key-value stores. Applications in the Hadoop ecosystem and many parts of the "Netflix stack" use Zookeeper. Consul exposes a service discovery and health checking API and supports cluster tools such as Nomad. The Kubernetes container orchestration system, Vitess horizontal scaling for MySQL, Google Key Transparency project, and many others are built on etcd. With so many mission-critical clustering, service discovery, and database applications built on these consistent key-value stores, measuring their reliability and performance is paramount.
The Need for Write Performance
The ideal key-value store ingests many keys per second, quickly persists and acknowledges each write, and holds lots of data. If a store can’t keep up with writes, then requests will time out, possibly triggering failovers and downtime. If writes are slow, then applications appear sluggish. With too much data, a store may crawl or even be rendered inoperable.
We used dbtester to simulate writes and found that etcd outperforms similar consistent distributed key-value store software on these benchmarks. At a low level, the architectural decisions behind etcd demonstrably utilize resources more uniformly and efficiently. These decisions translate to reliably good throughput, latency, and total capacity under reasonable workloads and at scale. This in turn helps applications utilizing etcd, such as Kubernetes, be reliable, easy to monitor, and efficient.
There are many dimensions to performance; this post will drill down on key creation, populating the key-value store, to illustrate the mechanics under the hood.
Before jumping to high-level performance, it’s helpful to first highlight differences in key-value store behavior through resource utilization and concurrency; writes offer a good opportunity for this. Writes work the disk because they must be persisted down to media. That data must then replicate across machines, inducing considerable network traffic between the cluster’s servers. That traffic makes up part of the complete overhead from processing writes, which consumes CPU. Finally, putting keys into the store draws on memory directly for key user-data and indirectly for book-keeping.
According to a recent user survey, the majority of etcd deployments use virtual machines. To abide by the most common platform, all tests run on Google Cloud Platform Compute Engine virtual machines with a Linux OS. Each cluster uses three VMs, enough to tolerate a single node failure. Each VM has 16 dedicated vCPUs, 30GB memory, and a 300GB SSD with 150 MB/s sustained writes. This configuration is powerful enough to simulate traffic from 1,000 clients, which is a minimum for etcd’s use cases and the chosen target for the following resource measurements. All tests were run with multiple trials; the deviation among runs was relatively small and did not impact any general conclusions. The setup (with etcd) is diagrammed below:
All benchmarks use the following software configuration:
|Zookeeper||r3.4.9||Java 8 (JRE build 1.8.0_121-b13)|
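The three-VM cluster size in the setup above follows from majority-quorum arithmetic, which the consensus protocols behind all three stores rely on. A minimal sketch of that arithmetic; the function names are illustrative and not taken from any of the stores:

```python
def quorum(members: int) -> int:
    """Smallest number of members that forms a majority."""
    return members // 2 + 1


def tolerated_failures(members: int) -> int:
    """Members that can fail while a majority is still reachable."""
    return members - quorum(members)


for n in (1, 3, 5):
    print(f"{n} members: quorum {quorum(n)}, tolerates {tolerated_failures(n)} failure(s)")
```

Three members is the smallest cluster that survives a single failure; a fourth member raises the quorum size without raising the number of failures tolerated.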
Each resource utilization test creates one million unique 256-byte keys with 1024-byte values. The key length was selected to stress the store using a common maximum path length. The value length was selected because it’s the expected average size for protobuf-encoded Kubernetes values. Although exact average key and value lengths are workload-dependent, the chosen lengths are representative of a trade-off between extremes. A more precise sensitivity study would offer more insight into best-case performance characteristics for each store, but risks belaboring the point.
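To get a feel for the size of this workload, the sketch below generates keys and values of the stated lengths and computes the raw user data involved. How dbtester actually builds its keys is not described here, so the zero-padded key scheme and the helper names are assumptions made purely for illustration.

```python
import os

KEY_BYTES = 256        # key length used in the tests
VALUE_BYTES = 1024     # value length used in the tests
TOTAL_KEYS = 1_000_000


def make_key(i: int, size: int = KEY_BYTES) -> bytes:
    """Unique fixed-width key: the decimal index, left-padded with zeros (hypothetical scheme)."""
    return str(i).zfill(size).encode("ascii")


def make_value(size: int = VALUE_BYTES) -> bytes:
    """Random payload of the requested size."""
    return os.urandom(size)


def raw_payload_bytes(keys: int = TOTAL_KEYS) -> int:
    """User data only, before replication, logging, or indexing overhead."""
    return keys * (KEY_BYTES + VALUE_BYTES)


print(raw_payload_bytes() / 1e9, "GB of raw key-value data")
```

One million such pairs amount to 1.28 GB of raw data per replica, before any write-ahead logging, snapshotting, or index overhead is counted.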
Write operations must persist to disk; they log consensus proposals, compact away old data, and save store snapshots. For the most part, writes should be dominated by logging consensus proposals. etcd’s log streams protobuf-encoded proposals to a sequence of preallocated files, syncing at page boundaries with a rolling CRC32 checksum on each entry. Zookeeper’s transaction log is similar, but is jute-encoded and checksums with Adler32. Consul takes a different approach, instead logging to its boltdb/bolt backend, raft-boltdb.
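The idea of a rolling checksum can be sketched as follows: each entry's checksum continues from the previous entry's checksum, so corruption or reordering anywhere in the log breaks the chain from that point on. This is only an illustration under stated assumptions. It uses Python's `zlib.crc32` and an in-memory list of `(payload, crc)` tuples as stand-ins; it is not etcd's on-disk record format, and the helper names are hypothetical. `zlib.adler32` is shown alongside as the per-entry alternative mentioned for Zookeeper.

```python
import zlib


def append_entries(entries, seed=0):
    """Return (payload, crc) records where each CRC continues from the previous one."""
    records = []
    crc = seed
    for payload in entries:
        crc = zlib.crc32(payload, crc)  # running checksum over all entries so far
        records.append((payload, crc))
    return records


def verify(records, seed=0):
    """Recompute the chain; return the index of the first bad record, or None."""
    crc = seed
    for i, (payload, stored) in enumerate(records):
        crc = zlib.crc32(payload, crc)
        if crc != stored:
            return i
    return None


def adler32_of(payload: bytes) -> int:
    """Standalone Adler-32 checksum of a single entry."""
    return zlib.adler32(payload)


log = append_entries([b"put a", b"put b", b"put c"])
print(verify(log))                  # intact chain
log[1] = (b"put x", log[1][1])      # tamper with the second payload
print(verify(log))                  # chain breaks at the tampered record
```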
The chart below shows how scaling client concurrency impacts disk writes. As expected, when concurrency increases, disk bandwidth, as measured from /proc/diskstats over an ext4 filesystem, tends to increase to match increased request pressure. The disk bandwidth for etcd grows steadily; it writes more data than Zookeeper because it must also write to boltDB in addition to its log. Zookeeper, on the other hand, loses its data rate on account of writing out full state snapshots; these full snapshots contrast with etcd’s incremental and concurrent commits to its backend, which write only updates and do so without stopping the world. Consul’s data rate is initially greater than etcd’s, possibly due to write amplification from removing committed raft proposals from its B+Tree, before fluctuating as it takes several seconds to write out snapshots.
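As a rough sketch of how write bandwidth can be derived from /proc/diskstats, the code below takes two samples of the file and converts the change in the sectors-written counter into bytes per second. The device name `sda` and the sample counter values are made up for illustration; this is not the collection code used for the chart.

```python
SECTOR_BYTES = 512  # /proc/diskstats counts 512-byte sectors regardless of the device


def sectors_written(diskstats_text: str, device: str) -> int:
    """Return the cumulative sectors-written counter for one device."""
    for line in diskstats_text.splitlines():
        fields = line.split()
        # Layout: major, minor, name, then per-device counters;
        # sectors written is the 7th counter (index 9 overall).
        if len(fields) > 9 and fields[2] == device:
            return int(fields[9])
    raise KeyError(device)


def write_bandwidth(before: str, after: str, device: str, seconds: float) -> float:
    """Average bytes written per second between two samples."""
    delta = sectors_written(after, device) - sectors_written(before, device)
    return delta * SECTOR_BYTES / seconds


# Two made-up samples taken one second apart.
sample_t0 = "   8       0 sda 100 0 2000 50 200 0 4096 80 0 100 130"
sample_t1 = "   8       0 sda 100 0 2000 50 260 0 8192 95 0 110 145"
print(write_bandwidth(sample_t0, sample_t1, "sda", 1.0), "bytes/sec")
```

On a Linux host, the two samples would come from reading /proc/diskstats at the start and end of the interval.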
The network is central to a distributed key-value store. Clients communicate with the key-value store cluster’s servers and the servers, being distributed, communicate with each other. Each key-value store has its own client protocol; etcd clients use gRPC with Protocol Buffer v3 over HTTP/2, Zookeeper clients use Jute over a custom streaming TCP protocol, and Consul speaks JSON. Likewise, each has its own server protocol over TCP; etcd peers stream protobuf-encoded raft RPC proposals, Zookeeper uses TCP streams for ordered bi-directional jute-encoded ZAB channels, and Consul issues raft RPCs encoded with MsgPack.
The chart below shows total network utilization for all servers and clients. For the most part, etcd has the lowest network usage, aside from Consul clients receiving slightly less data. This can be explained by etcd’s Put responses containing a header with revision data, whereas Consul simply responds with a plaintext "true". The heavier inter-server traffic for Zookeeper and Consul is likely due to transmitting large snapshots and to less space-efficient protocol encoding.
Even if the storage and network are fast, the cluster must be careful with processing overhead. Opportunities to waste CPU abound: many messages must be encoded and decoded, poor concurrency control can contend on locks, system calls can be made with alarming frequency, and memory heaps can thrash. Since etcd, Zookeeper, and Consul all expect a leader server to process writes, poor CPU utilization can easily sink performance.
The graph below shows the server CPU utilization, measured with top -b -d 1, when scaling clients.
etcd CPU utilization scales as expected both on average and for maximum load; as more connections are added, CPU load increases in turn. Most striking is Zookeeper’s average drop at 700 but rise at 1,000 clients; the logs report "Too busy to snap, skipping" and "Creating new log file" as utilization goes from 1,073% to 230%. This drop also happens at 1,000 clients but is less obvious from the average; utilization goes from 894% to 321%. Similarly, Consul CPU utilization drops for ten seconds when processing snapshots, going from 389% CPU to 16%.
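Percentages above 100% can be hard to read at a glance. The sketch below converts such readings into a share of the whole machine, assuming top is reporting per-process usage where 100% equals one fully busy core (consistent with the readings above 100% quoted here) and using the 16 vCPUs per VM from the test setup. The helper name is illustrative.

```python
VCPUS = 16  # per VM, from the test setup


def share_of_machine(top_percent: float, vcpus: int = VCPUS) -> float:
    """Convert a top %CPU reading (100 = one full core) to a fraction of the VM."""
    return top_percent / (100.0 * vcpus)


# Before/after readings quoted in the text.
readings = [
    ("Zookeeper, 700 clients", 1073, 230),
    ("Zookeeper, 1,000 clients", 894, 321),
    ("Consul, during snapshot", 389, 16),
]
for label, before, after in readings:
    print(f"{label}: {share_of_machine(before):.0%} -> {share_of_machine(after):.0%}")
```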
When a key-value store is designed for only managing metadata-sized data, most of that data can be cached in memory. Maintaining an in-memory database buys speed, but at the cost of an excessive memory footprint that may trigger frequent garbage collection and disk swapping, degrading overall performance. While Zookeeper and Consul load all key-value data in-memory, etcd only keeps a small resident, in-memory index, backing most of its data directly through a memory-mapped file in boltDB. Keeping the data only in boltDB incurs disk accesses on account of demand paging but, overall, etcd better respects operating system facilities.
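The memory-mapped approach can be illustrated with a small sketch: a file is mapped read-only and a slice is read from it, leaving it to the operating system to page in only the parts that are touched. The file layout here (64 page-sized records in a temporary file) is invented for the example and has nothing to do with boltDB's actual format.

```python
import mmap
import os
import tempfile

PAGE = mmap.PAGESIZE


def read_record(path: str, offset: int, length: int) -> bytes:
    """Read a slice of a file through a read-only memory mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as view:
            # Only the pages backing this slice need to be resident;
            # the kernel faults them in on first access.
            return view[offset:offset + length]


# Build a small file of page-sized records to stand in for a database file.
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    for i in range(64):
        tmp.write(bytes([i]) * PAGE)
    db_path = tmp.name

record = read_record(db_path, 10 * PAGE, 16)
os.remove(db_path)
print(record)
```

The process never allocates a buffer for the whole file; resident memory grows only with the pages actually read, which is the trade-off against demand-paging disk accesses described above.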
The graph below shows the effect of adding more keys into a cluster on its total memory footprint. Most notably, etcd uses less than half the amount of memory as Zookeeper or Consul once an appreciable number of keys are in the store. Zookeeper places second, claiming four times as much memory; this is in line with the recommendation to carefully tune JVM heap settings. Finally, although Consul uses boltDB like etcd, its in-memory store negates the footprint advantage found in etcd, consuming the most memory of the three.
Blasting the store
With physical resources settled, focus can return to aggregated benchmarking. First, to find the maximum key ingestion rate, system concurrency is scaled up to a thousand clients. These best ingest rates give a basis for measuring the latency under load, thus gauging the total wait time. Likewise, using each system’s client count with the best ingest rate, the total capacity can be stressed by measuring the drop in throughput as keys scale up from one million to three million.
As more clients concurrently write to the cluster, the ingestion rate should ideally rise steadily before leveling off. However, the graph below shows this is not the case when scaling the number of clients while writing out a million keys. Instead, Zookeeper (maximum rate 43,558 req/sec) fluctuates wildly; this is not surprising since it must be explicitly configured to allow large numbers of connections. Consul’s throughput (maximum rate 16,486 req/sec) cleanly scales, but dips under concurrency pressure to low rates. The throughput for etcd (maximum rate 34,747 req/sec) is overall stable, slowly rising with concurrency. Finally, despite Consul and Zookeeper using significantly more CPU, the maximum throughput still lags behind etcd.
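The concurrency sweep described above can be sketched as a small harness: write a fixed number of keys from a varying number of concurrent clients, record the rate for each, and pick the client count with the best rate. This is not dbtester; the `put` callable, the lock-protected dictionary standing in for a store, and the function names are all assumptions for illustration.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor


def measure_throughput(put, total_keys: int, clients: int) -> float:
    """Issue total_keys writes from `clients` concurrent workers; return writes/sec."""
    def worker(client_id: int) -> None:
        # Each client writes a disjoint slice of the key space.
        for i in range(client_id, total_keys, clients):
            put(f"key-{i}", b"v")

    start = time.perf_counter()
    with ThreadPoolExecutor(max_workers=clients) as pool:
        list(pool.map(worker, range(clients)))  # wait for all workers, surface errors
    elapsed = time.perf_counter() - start
    return total_keys / elapsed


def best_concurrency(put, total_keys: int, client_counts):
    """Return (clients, rate) for the client count with the highest measured rate."""
    rates = {c: measure_throughput(put, total_keys, c) for c in client_counts}
    best = max(rates, key=rates.get)
    return best, rates[best]


# Stand-in store: a dictionary guarded by a lock instead of a real cluster.
store, lock = {}, threading.Lock()


def put(key: str, value: bytes) -> None:
    with lock:
        store[key] = value


clients, rate = best_concurrency(put, 10_000, [1, 10, 100])
print(f"best rate {rate:.0f} writes/sec at {clients} clients")
```

Against a real cluster, `put` would wrap a client call for the store under test, and the per-count rates would be plotted rather than reduced to a single maximum.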
Given the best throughput for the store, the latency should be at a local minimum and stable; queuing effects will delay additional concurrent operations. Likewise, ideally latencies would remain low and stable as the total number of keys increases; if requests become unpredictable, there may be cascading timeouts, flapping monitoring alerts, or failures. However, judging by the latency measurements shown below, only etcd has both the lowest average latencies and tight, stable bounds at scale.
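Average latency alone can hide the instability discussed here, so it helps to look at the minimum, a high percentile, and the maximum together. A minimal sketch of such a summary, using a nearest-rank percentile; the function names and the sample data are illustrative and not taken from the benchmark results:

```python
import math
import statistics


def percentile(sorted_values, p: float) -> float:
    """Nearest-rank percentile of an ascending list, with p in (0, 100]."""
    rank = max(1, math.ceil(p * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(latencies_ms):
    """Summary statistics that expose both typical and worst-case latency."""
    ordered = sorted(latencies_ms)
    return {
        "min": ordered[0],
        "avg": statistics.fmean(ordered),
        "p99": percentile(ordered, 99),
        "max": ordered[-1],
    }


# Made-up samples: mostly fast requests plus two slow outliers.
samples = [1.0] * 98 + [1000.0, 1000.0]
print(summarize(samples))
```

In the made-up sample, the minimum looks excellent while the 99th percentile and maximum reveal the spikes, which is the pattern of good minimum latency but unpredictable bounds.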
A word on what happened to the other servers. Zookeeper struggles serving concurrent clients at its best throughput; once it triggers a snapshot, client requests begin failing. The server logs list errors such as "Too busy to snap, skipping", "fsync-ing the write ahead log", and "fsync-ing the write ahead log in SyncThread: 1 took 1,038 ms which will adversely effect operation latency", finally culminating in leader loss with "Exception when following the leader". Client requests occasionally failed with errors such as "zk: could not connect to a server" and "zk: connection closed". Consul reports no errors, although it probably should; it experiences wide variance and degraded performance, likely due to its heavy write amplification.
Armed with the best amount of concurrency for the best average throughput for one million keys, it’s possible to test throughput as capacity scales. The graph below shows the time-series latency, with a logarithmic scale for latency spikes, as keys are added to the store up to three million keys. Both Zookeeper and Consul latency spikes grow after about half a million keys. No spikes happen with etcd, owing to efficient concurrent snapshotting, but there is a slight latency increase starting slightly before a million keys.
Of particular note, just before two million keys, Zookeeper completely fails. Its followers fall behind, failing to receive snapshots in time; this manifests in leader elections taking up to 20 seconds, live-locking the cluster.
etcd stably delivers better throughput and latency than Zookeeper or Consul when creating a million keys and more. Furthermore, it achieves this with half as much memory, showing better efficiency. However, there is some room for improvement; Zookeeper manages to deliver better minimum latency than etcd, at the cost of unpredictable average latency.
All benchmarks were generated with etcd’s open source dbtester. The sample testing parameters for the above tests are available for anyone wishing to reproduce the results. For simpler, etcd-only benchmarks, try the etcd3 benchmark tool.
A future post will cover the read and update performance.
To learn more about etcd, please visit coreos.com/etcd.