Announcing etcd 3.1

January 20, 2017 · By Anthony Romano

A new year and a new milestone release of etcd. Hot on the heels of 17 bugfix releases to etcd 3.0, two alphas, and two release candidates, the etcd team is proud to announce etcd 3.1. This edition of etcd features performance, reliability, and API enhancements over the 3.0 series. It also introduces the first iteration of the etcd v3 gRPC proxy, a smart proxy for offloading client requests away from the core cluster.

Fast linearized reads

etcd provides both serialized and linearized consistency models for reading keys. A serialized read is fast because it needs no consensus, but it is unsuitable for many applications because it may return stale data. A linearized read returns the most recent keys by going through etcd’s underlying raft protocol, and therefore carries greater overhead. While etcd 3.0 processes linearized reads through direct raft proposals, chewing through precious disk bandwidth and incurring the corresponding latency penalty, linearized reads in 3.1 issue idempotent, fsync-free linearized raft index requests. These index requests are also batched; concurrent linearized reads coalesce into a single index request. The upshot is that linearized reads have lower latency and better throughput.
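
For a concrete feel of the two read modes, here is a minimal sketch using the Go clientv3 package; it assumes a member listening on localhost:2379 and an existing key named foo.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Linearized read (the default): goes through the raft index protocol
	// and returns the most recently committed value.
	lresp, err := cli.Get(ctx, "foo")
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("linearized:", lresp.Kvs)

	// Serialized read: answered locally by the contacted member without
	// consensus, so it is faster but may return stale data.
	sresp, err := cli.Get(ctx, "foo", clientv3.WithSerializable())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("serialized:", sresp.Kvs)
}
```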

The graph below, based on quick measurements from a laptop with the etcd benchmark tool, illustrates the improvement as concurrency increases. The 3.1 linearized read implementation clearly outperforms 3.0 linearized reads even at low concurrency and rapidly reaches its maximum throughput. These quick measurements are only meant to give a sense of the overall trend; we have deliberately omitted specific numbers since only the relative behavior matters here. We have an in-depth benchmarking series, with rigorous benchmarks, coming soon.

[Figure: Read Throughput over Concurrency]

Availability and Reliability

In the past, upgrading an etcd cluster meant temporarily losing the leader and therefore a brief loss of availability. If an etcd member needed to be taken offline, when upgrading for example, and that member happened to be the cluster leader, the followers would time out on the leader and initiate an election. Waiting for a new leader caused a short cluster outage. To improve availability in this case, the leader in 3.1 now automatically transfers its leadership to another member before going offline.

etcd’s quorum-based consensus has a drawback: permanent quorum loss takes the cluster down for good. To guard against operator errors that would inflict quorum loss on the cluster, etcd now checks member health by default before reconfiguring cluster membership. These checks use peer liveness information to ensure membership changes are safe. A member removal request is rejected if removing the member would cost quorum among the remaining active members. A member add request is rejected if quorum would be lost should the new member never join, for example because it was configured with a bad address.
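
As a sketch of how these safety checks surface to clients, the Go clientv3 Cluster API simply returns an error when a reconfiguration request is rejected; the endpoint and peer URL below are illustrative.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// List current members to pick a removal target.
	mresp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range mresp.Members {
		fmt.Println(m.ID, m.Name, m.PeerURLs)
	}

	// With the new health checks, this request fails with an error if removing
	// the member would cost quorum among the currently active members.
	if _, err := cli.MemberRemove(ctx, mresp.Members[0].ID); err != nil {
		fmt.Println("removal rejected:", err)
	}

	// Likewise, adding a member is rejected if quorum would be lost should the
	// new member (for example, one configured with a bad address) never join.
	if _, err := cli.MemberAdd(ctx, []string{"http://10.0.0.9:2380"}); err != nil {
		fmt.Println("add rejected:", err)
	}
}
```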

New APIs

Based on user feedback, the etcd v3 API now includes features to better manage leases, efficiently process keys by revision, and reduce total round-trips. Leases now support non-destructive TTL fetches, useful for checking the time left on a lease, and listing attached keys, useful for finding all resources attached to a session. Key range requests can specify minimum and maximum modification and creation revisions, useful when monitoring wait lists for distributed locks and elections. Watches can optionally return the old key value on delete events, saving the cost of a round trip.
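
The sketch below exercises these additions through the Go clientv3 package; the key names, lease TTL, and revision number are hypothetical and only serve to show the new options.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Non-destructive lease TTL fetch: check the time left on a lease and list
	// the keys attached to it, without touching the lease itself.
	lease, err := cli.Grant(ctx, 60)
	if err != nil {
		log.Fatal(err)
	}
	if _, err := cli.Put(ctx, "session/abc", "data", clientv3.WithLease(lease.ID)); err != nil {
		log.Fatal(err)
	}
	ttl, err := cli.TimeToLive(ctx, lease.ID, clientv3.WithAttachedKeys())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Printf("lease %x: %ds left, keys %q\n", ttl.ID, ttl.TTL, ttl.Keys)

	// Range request filtered by revision: only keys under "locks/" modified at
	// or after revision 100 (e.g., waiters behind the current lock holder).
	resp, err := cli.Get(ctx, "locks/", clientv3.WithPrefix(), clientv3.WithMinModRev(100))
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("waiters:", resp.Kvs)

	// Watch with the previous key value included in events, so a delete event
	// carries the old value without an extra Get round trip.
	wctx, cancel := context.WithTimeout(ctx, 10*time.Second)
	defer cancel()
	for wresp := range cli.Watch(wctx, "session/abc", clientv3.WithPrevKV()) {
		for _, ev := range wresp.Events {
			if ev.PrevKv != nil {
				fmt.Printf("%s %q, previous value %q\n", ev.Type, ev.Kv.Key, ev.PrevKv.Value)
			}
		}
	}
}
```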

The etcd v3 authentication API, which was alpha in the 3.0 series, is now stable; future changes to the authentication API will not break older etcd v3 clients. The role-based authentication model is similar to the one found in the etcd v2 API. The major differences are authentication tokens, a much faster mechanism than etcd v2’s per-request bcrypt calls, and permissions governed by key range intervals rather than only key prefixes as in etcd v2.
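
As a rough example of the range-interval permission model, the Go clientv3 Auth API can grant a role access to a half-open key interval; the role, user, and interval below are made up, and fully enabling authentication also requires a root user, which is omitted from this sketch.

```go
package main

import (
	"context"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"localhost:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Grant a role read-write access over the half-open key interval
	// ["job/", "job0"), i.e. every key beginning with "job/".
	if _, err := cli.RoleAdd(ctx, "worker"); err != nil {
		log.Fatal(err)
	}
	if _, err := cli.RoleGrantPermission(ctx, "worker", "job/", "job0",
		clientv3.PermissionType(clientv3.PermReadWrite)); err != nil {
		log.Fatal(err)
	}

	// Bind a user to the role. Turning authentication on with AuthEnable would
	// additionally require creating a root user first.
	if _, err := cli.UserAdd(ctx, "alice", "secret"); err != nil {
		log.Fatal(err)
	}
	if _, err := cli.UserGrantRole(ctx, "alice", "worker"); err != nil {
		log.Fatal(err)
	}
}
```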

Introducing gRPC proxy

An etcd cluster replicates its data to all of its members. The overhead of this replication counterintuitively causes an etcd cluster to slow down, rather than scale, as members are added. This performance trade-off means cluster size is fixed by the desired fault tolerance; scaling must be achieved through other means. The new etcd gRPC proxy aims to reduce load on the core etcd cluster through caching and request coalescing.

The gRPC proxy includes a cache of recently accessed keys. The cache serves serialized key fetches, which do not need to go through consensus and would otherwise be handled by a cluster member. The advantage is that the proxy absorbs serialized key request spam from misbehaving or misconfigured clients. The graph below shows the effect of this cache when repeatedly fetching the same key until CPU saturation; it effectively eliminates CPU load on the etcd server.

[Figure: Proxy versus Unproxied CPU Utilization]

The proxy also coalesces watches from many clients into a single watch stream. This coalescing conserves total open connections to the cluster and reduces overall network traffic from the cluster by deferring event fan-out to the proxy.
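
To give a feel for how clients use the proxy, here is a minimal sketch with the Go clientv3 package; the proxy listen address and key are assumptions, and the startup command in the comment is abbreviated.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// Assumes a gRPC proxy has been started in front of the cluster, roughly
	// `etcd grpc-proxy start --endpoints=... --listen-addr=127.0.0.1:23790`;
	// the address and flags here are illustrative.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:23790"}, // the proxy, not a cluster member
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx := context.Background()

	// Serialized reads of hot keys can be answered from the proxy's cache
	// without reaching the core cluster.
	resp, err := cli.Get(ctx, "config/flags", clientv3.WithSerializable())
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("read through proxy:", resp.Kvs)

	// Watches created by many clients on the same key are coalesced by the
	// proxy into a single upstream watch stream; events fan out at the proxy.
	wctx, cancel := context.WithTimeout(ctx, 30*time.Second)
	defer cancel()
	for wresp := range cli.Watch(wctx, "config/flags") {
		for _, ev := range wresp.Events {
			fmt.Printf("%s %q = %q\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
		}
	}
}
```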

Learn more

The latest and greatest etcd developments can be found in the etcd GitHub repository. The project also hosts signed binaries for 3.1.0 and historical releases on the etcd release page. The GitHub repository also has the most up-to-date etcd documentation.

As always, the etcd team is committed to building the best distributed consistent key-value store; feel free to report any bugs, ask questions, or make suggestions on the etcd issue tracker.