etcd3: A new etcd

June 30, 2016 · By Anthony Romano and Xiang Li

Over the past few months, CoreOS has been diligently finalizing the etcd3 API beta, testing the system and working with users to make etcd even better. Today etcd v3.0.0, the distributed key value store developed by CoreOS, is available.

In practice, etcd3 is already integrated into a large-scale distributed system, Kubernetes, and we have implemented distributed coordination primitives, including distributed locks, elections, and software transactional memory, to ensure that the etcd3 API is flexible enough to support a variety of applications. Today we’re proud to announce that etcd3 is ready for general use.

"The increased efficiency, resiliency and scale of etcd v3.0 helps enable Kubernetes to scale to tens of thousands of machines under management and beyond,” said David Aronchick, product manager, Google. “The Google Container Engine (GKE) team thanks the CoreOS team for their continued contributions to the Kubernetes project that make it a production-grade container management system for the cloud native community."

etcd 3.0 marks the first stable release of the etcd3 API and data model. Upgrades are simple, because the same etcd2 JSON endpoints and internal cluster protocol are still provided in etcd3. Nevertheless, etcd3 is a wholesale API redesign based on feedback from etcd2 users and experience with scaling etcd2 in practice. This post highlights some notable etcd3 improvements in efficiency, reliability, and concurrency control.

"It's been exciting to watch the etcd technology and community evolve and grow to as part of the Kubernetes project. The etcd 3 release can help further this evolution and we look forward to bringing many of the new features and capabilities in the Red Hat OpenShift container application platform products," said Timothy St. Clair, principal software engineer, Red Hat.

From etcd2 to etcd3

etcd was originally designed to solve machine coordination for CoreOS updates. Now it is used for distributed networking, discovery, configuration data, scheduling, and load balancing services. Parts of the original design proved successful: etcd evolved into a key value store with JSON endpoints, watchers on continuous key updates, and time-to-live (TTL) keys. Unfortunately, some of these features tended to be chatty on the wire with clients, could put quorum pressure on the cluster when idling, and could unpredictably garbage-collect older key revisions. Although etcd2 can satisfactorily coordinate machines, coordinating the proliferating microservices in today’s infrastructure requires an etcd supporting tens of thousands of active clients operating on a million keys within a single cluster.

The etcd3 system reflects the lessons learned from etcd2. The base server interface uses gRPC instead of JSON for increased efficiency. Support for JSON endpoints is maintained through a gRPC gateway. The new API revisits the design of key expiry TTLs, replacing them with a lightweight streaming lease keepalive model. Watchers are redesigned as well, replacing the older event model with one that streams and multiplexes events over key intervals. The v3 data model does away with explicit key hierarchies and unreliable watch windows, replacing them with a flat binary key space with transactional, multiversion concurrency control semantics.

"etcd has been used as a central configuration management store in our Elastic Services Infrastructure (ESI) system, which is working in production," said Ichiro Fukuda, chief architect, infrastructure at NTT Innovation Institute, Inc. "We are very excited about the etcd v3 release from CoreOS and look forward to expanding our use of it to enhance and scale our services."

We invite you to join us in celebrating the performance and scalability improvements that make the production-ready etcd v3 a cornerstone of cloud native distributed systems. Companies like Huawei, NTT Innovation Institute, and PingCAP rely on this fast key-value store for their projects and products.

“Congratulations to CoreOS for the etcd v3.0 release. etcd is used in Huawei’s PaaS platform as a critical component in the architecture for synchronizing distributed data and for watching data changes. With etcd 3.0, we expect it will improve the scalability, performance and reliability of the overall platform significantly. We are pleased to have been working with the CoreOS team as well as the community on this technology, and we look forward to continuing our collaboration with CoreOS in the hope of further advancing the technology and its ecosystem,” said Ying Xiong, chief architect of PaaS at Huawei.

Efficiency and scaling

RPCs

Native etcd3 clients communicate over a gRPC protocol. The protocol messages are defined using protobuf, which simplifies the generation of efficient RPC stub code and makes protocol extensions easier to manage. For comparison, even after optimizing etcd2’s client-side JSON parsing, etcd3 gRPC still holds a 2x message processing performance edge over etcd2. Likewise, gRPC is better at handling connections. Where gRPC uses HTTP/2 to multiplex multiple streams of RPCs over a single TCP connection, a JSON client must establish a new connection for every request.
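
To make this concrete, here is a minimal Go sketch using the clientv3 package; the endpoint address, key, and value are placeholder assumptions. Every request below travels over the client's single multiplexed gRPC (HTTP/2) connection:

    package main

    import (
        "context"
        "fmt"
        "time"

        "github.com/coreos/etcd/clientv3"
    )

    func main() {
        // All of this client's RPCs share one gRPC (HTTP/2) connection.
        cli, err := clientv3.New(clientv3.Config{
            Endpoints:   []string{"localhost:2379"},
            DialTimeout: 5 * time.Second,
        })
        if err != nil {
            panic(err)
        }
        defer cli.Close()

        ctx, cancel := context.WithTimeout(context.Background(), time.Second)
        defer cancel()
        if _, err := cli.Put(ctx, "foo", "bar"); err != nil {
            panic(err)
        }
        resp, err := cli.Get(ctx, "foo")
        if err != nil {
            panic(err)
        }
        for _, kv := range resp.Kvs {
            fmt.Printf("%s -> %s\n", kv.Key, kv.Value)
        }
    }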

Leases

Keys expire in etcd2 through a time-to-live (TTL) mechanism. For every key with a TTL, a client must periodically refresh the key to keep it from being automatically deleted when the TTL expires. Each refresh establishes a new connection and issues a consensus proposal to etcd to update the key. To keep all TTL keys alive, an idling cluster must have a minimum request throughput of the number of TTL keys divided by the average TTL; for example, 10,000 keys with a one-minute TTL need roughly 167 refresh proposals per second even when nothing is changing.

Leases in etcd3 replace the earlier notion of key TTLs. Leases reduce keep-alive traffic and eliminate steady-state consensus updates. Instead of a key having a TTL, a lease with a TTL is attached to a key. When the lease’s TTL expires, it deletes all attached keys. This model reduces keep-alive traffic when multiple keys are attached to the same lease. The keep-alive connection thrashing in etcd2 is avoided by multiplexing keep-alives on a lease’s single gRPC stream. Likewise, keep-alives are processed by the leader, avoiding any consensus overhead when idling.
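
A rough sketch of the lease flow, reusing a client like the one created above (the key names, addresses, and 30-second TTL are illustrative assumptions):

    package example

    import (
        "context"
        "fmt"

        "github.com/coreos/etcd/clientv3"
    )

    // keepServiceAlive attaches two illustrative keys to a single lease and then
    // refreshes that one lease; when the keep-alives stop, both keys expire together.
    func keepServiceAlive(cli *clientv3.Client) error {
        ctx := context.Background()

        lease, err := cli.Grant(ctx, 30) // one lease with a 30-second TTL
        if err != nil {
            return err
        }
        if _, err := cli.Put(ctx, "svc/a", "10.0.0.1", clientv3.WithLease(lease.ID)); err != nil {
            return err
        }
        if _, err := cli.Put(ctx, "svc/b", "10.0.0.2", clientv3.WithLease(lease.ID)); err != nil {
            return err
        }

        // Keep-alives for the lease are multiplexed over a single gRPC stream;
        // there is no per-key refresh write while the lease stays alive.
        ch, err := cli.KeepAlive(ctx, lease.ID)
        if err != nil {
            return err
        }
        for ka := range ch {
            fmt.Println("lease refreshed, TTL:", ka.TTL)
        }
        return nil
    }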

Watchers

A watch in etcd waits for changes to keys. Unlike systems such as ZooKeeper or Consul that return one event per watch request, etcd can continuously watch from the current revision. In etcd2, these streaming watches use long polling over HTTP, forcing the etcd2 server to wastefully hold open a TCP connection per watch. When an application with thousands of clients watches thousands of keys, it can quickly exhaust etcd2 server socket and memory resources.

The etcd3 API multiplexes watches on a single connection. Instead of opening a new connection, a client registers a watcher on a bidirectional gRPC stream. The stream delivers events tagged with a watcher’s registered ID. Multiple watch streams can even share the same TCP connection. Multiplexing and stream connection sharing reduce etcd3’s memory footprint by at least an order of magnitude.
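
The sketch below registers two watchers on hypothetical keys, again assuming a clientv3 client like the one above; both watchers ride the client's shared gRPC stream rather than holding a long-polling connection apiece:

    package example

    import (
        "context"
        "fmt"

        "github.com/coreos/etcd/clientv3"
    )

    // watchKeys registers two watchers; their events are multiplexed on the
    // client's gRPC stream instead of one open TCP connection per watch.
    func watchKeys(cli *clientv3.Client) {
        ctx := context.Background()

        jobs := cli.Watch(ctx, "jobs/", clientv3.WithPrefix())
        cfg := cli.Watch(ctx, "config/flag")

        for {
            select {
            case resp := <-jobs:
                for _, ev := range resp.Events {
                    fmt.Printf("jobs: %s %s = %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
                }
            case resp := <-cfg:
                for _, ev := range resp.Events {
                    fmt.Printf("config: %s %s = %s\n", ev.Type, ev.Kv.Key, ev.Kv.Value)
                }
            }
        }
    }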

Data model with reliable events

Like any key-value store, etcd’s data model maps keys to values. The etcd2 model only keeps the most recent key-value mappings available; older versions are discarded. However, applications which track all key changes or scan the entire key space need a reliable event stream to consistently reconstruct past key states. To avoid prematurely dropping events so that these applications can work even if briefly disconnected, etcd2 maintains a short global sliding window of events. However, if a watch begins on a revision that the window passed over, the watch may miss discarded events.

etcd3 does away with this unpredictable window, instead retaining historical key revisions through a new multi-version concurrency control model. The retention policy for this history can be configured by cluster administrators for fine-grained storage management. Usually etcd3 discards old revisions of keys on a timer. A typical etcd3 cluster retains superseded key data for hours. To reliably handle longer client disconnection, not just transient network disruptions, watchers simply resume from the last observed historical revision. To read from the store at a particular point in time, a read request may be tagged with a revision, and etcd returns the key’s value as it was at that revision.
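
As an illustrative sketch (the key argument and the lastSeenRev bookkeeping are assumptions about how an application might track its progress), a client can read a key as of an earlier revision and then resume its watch from the revision after the last one it observed:

    package example

    import (
        "context"
        "fmt"

        "github.com/coreos/etcd/clientv3"
    )

    // readAndResume reads a key as of an older revision, then resumes watching
    // from the revision after the last one observed, so no events are missed.
    func readAndResume(cli *clientv3.Client, key string, lastSeenRev int64) error {
        ctx := context.Background()

        // Point-in-time read: the value of the key as of lastSeenRev.
        resp, err := cli.Get(ctx, key, clientv3.WithRev(lastSeenRev))
        if err != nil {
            return err
        }
        for _, kv := range resp.Kvs {
            fmt.Printf("at rev %d: %s = %s\n", lastSeenRev, kv.Key, kv.Value)
        }

        // Resume watching from the next revision; the retained MVCC history
        // replays whatever the client missed while it was disconnected.
        for wresp := range cli.Watch(ctx, key, clientv3.WithRev(lastSeenRev+1)) {
            for _, ev := range wresp.Events {
                fmt.Printf("rev %d: %s %s = %s\n", ev.Kv.ModRevision, ev.Type, ev.Kv.Key, ev.Kv.Value)
            }
        }
        return nil
    }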

In addition to preserving historical data in the key value store, etcd3 trades etcd2’s hierarchical key structure for a flat binary key space. In practice, applications tended to either fetch individual keys, or to recursively fetch all keys under a directory. Usage patterns didn’t justify the overhead of maintaining hierarchy information. etcd3 instead uses key ranges that search for keys in an interval. This interval model supports both querying on prefixes and, with a convention for key naming, listing keys as if from a directory.
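
For instance, a directory-style listing becomes a range query over a shared prefix; in this sketch the "app/config/" prefix is an assumed naming convention rather than anything etcd enforces:

    package example

    import (
        "context"
        "fmt"

        "github.com/coreos/etcd/clientv3"
    )

    // listDir emulates an etcd2-style directory listing on the flat v3 key space
    // by fetching the range of keys that share a common prefix.
    func listDir(cli *clientv3.Client) error {
        resp, err := cli.Get(context.Background(), "app/config/",
            clientv3.WithPrefix(),
            clientv3.WithSort(clientv3.SortByKey, clientv3.SortAscend))
        if err != nil {
            return err
        }
        for _, kv := range resp.Kvs {
            fmt.Printf("%s = %s\n", kv.Key, kv.Value)
        }
        return nil
    }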

Concurrency control

When multiple clients concurrently read and modify a key or a set of keys, it is important to have synchronization primitives to prevent data races from corrupting application state. For this purpose, etcd2 offers both load-link/store-conditional and compare-and-swap operations; a client specifies a previous revision index or previous value to match before updating a key. Although these operations suffice for simple semaphores and limited atomic updates, they are inadequate for describing more sophisticated approaches to serializing data access, such as distributed locks and transactional memory.

etcd3 can serialize multiple operations into a single conditional mini-transaction. Each transaction includes a conjunction of conditional guards (e.g., checks on key version, value, modified revision, and creation revision), a list of the operations to apply when all conditions evaluate to true (i.e., Get, Put, Delete), and a list of operations to apply if any of the conditions evaluate to false. Transactions make distributed locks safe in etcd3 because accesses can be conditional based on whether the client still holds its lock. This means that even if a client loses its claim on a lock, whether due to clock skew or missing expiration events, etcd3 will refuse to honor the stale request.
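
The sketch below shows one such guarded update with hypothetical lock and job keys: the Put applies only if the lock key still holds this client's owner ID, and otherwise the transaction falls through to the Else branch and reads back the current holder:

    package example

    import (
        "context"

        "github.com/coreos/etcd/clientv3"
    )

    // guardedWrite assigns a job only while this client still owns the lock key;
    // if the ownership check fails, the stale write is refused atomically.
    func guardedWrite(cli *clientv3.Client, ownerID string) (bool, error) {
        resp, err := cli.Txn(context.Background()).
            If(clientv3.Compare(clientv3.Value("locks/job"), "=", ownerID)).
            Then(clientv3.OpPut("jobs/job/assignee", ownerID)).
            Else(clientv3.OpGet("locks/job")).
            Commit()
        if err != nil {
            return false, err
        }
        return resp.Succeeded, nil
    }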

Looking forward

etcd3 represents a conceptual leap over the etcd2 model. The new etcd3 API is more efficient, scaling to the evolving demands that applications place on etcd today, and to the hyperscale clusters of tomorrow.

“We use etcd extensively in our large scale distributed storage product - TiKV. We ported etcd’s raft library to Rust to support consistent replication in our storage layer, and use etcd to coordinate the storage nodes in the TiKV cluster,” said Qi Liu, CEO of PingCAP. “We are glad to work with the etcd team, and very happy about their new release!”

With etcd3, data delivery is more reliable through multi-versioned historical key data. Likewise, etcd3’s transaction primitives make atomic updates to multiple keys easy, and synchronizing multiple processes safe.

etcd is under active development to continue to be the best-in-class cluster consensus software. The project follows a stable release model for fast development of new features without sacrificing stability. In the future, the etcd project plans to add smart proxying for better scale-out, protocol gateways for custom etcd personalities, and more testing for better assurance throughout the system.

Bugs, suggestions, or general questions? Feel free to tell us about them!

Eager to try out the power of distributed computing based on etcd, Kubernetes, and other technologies from CoreOS? Check out the free Tectonic Starter plan and explore today.