Because etcd writes data to disk and persists proposals on disk, its performance depends on disk performance.
Although etcd is not particularly I/O intensive, it requires a low latency block device for optimal performance and stability. Because etcd’s consensus protocol depends on persistently storing metadata to a log (WAL), etcd is sensitive to disk-write latency. Slow disks and disk activity from other processes can cause long fsync latencies.
Those latencies can cause etcd to miss heartbeats, not commit new proposals to the disk on time, and ultimately experience request timeouts and temporary leader loss. High write latencies also lead to an OpenShift API slowness, which affects cluster performance. Because of these reasons, avoid colocating other workloads on the control-plane nodes that are I/O sensitive or intensive and share the same underlying I/O infrastructure.
In terms of latency, run etcd on top of a block device that can write at least 50 IOPS of 8000 bytes long sequentially. That is, with a latency of 10ms, keep in mind that uses fdatasync to synchronize each write in the WAL. For heavy loaded clusters, sequential 500 IOPS of 8000 bytes (2 ms) are recommended. To measure those numbers, you can use a benchmarking tool, such as fio.
To achieve such performance, run etcd on machines that are backed by SSD or NVMe disks with low latency and high throughput. Consider single-level cell (SLC) solid-state drives (SSDs), which provide 1 bit per memory cell, are durable and reliable, and are ideal for write-intensive workloads.
|
The load on etcd arises from static factors, such as the number of nodes and pods, and dynamic factors, including changes in endpoints due to pod autoscaling, pod restarts, job executions, and other workload-related events. To accurately size your etcd setup, you must analyze the specific requirements of your workload. Consider the number of nodes, pods, and other relevant factors that impact the load on etcd.
|
The following hard drive practices provide optimal etcd performance:
-
Use dedicated etcd drives. Avoid drives that communicate over the network, such as iSCSI. Do not place log files or other heavy workloads on etcd drives.
-
Prefer drives with low latency to support fast read and write operations.
-
Prefer high-bandwidth writes for faster compactions and defragmentation.
-
Prefer high-bandwidth reads for faster recovery from failures.
-
Use solid state drives as a minimum selection. Prefer NVMe drives for production environments.
-
Use server-grade hardware for increased reliability.
|
Avoid NAS or SAN setups and spinning drives. Ceph Rados Block Device (RBD) and other types of network-attached storage can result in unpredictable network latency. To provide fast storage to etcd nodes at scale, use PCI passthrough to pass NVM devices directly to the nodes.
|
Always benchmark by using utilities such as fio. You can use such utilities to continuously monitor the cluster performance as it increases.
|
Avoid using the Network File System (NFS) protocol or other network based file systems.
|
Some key metrics to monitor on a deployed OpenShift Container Platform cluster are p99 of etcd disk write ahead log duration and the number of etcd leader changes. Use Prometheus to track these metrics.
|
The etcd member database sizes can vary in a cluster during normal operations. This difference does not affect cluster upgrades, even if the leader size is different from the other members.
|
To validate the hardware for etcd before or after you create the OpenShift Container Platform cluster, you can use fio.
The output reports whether the disk is fast enough to host etcd by comparing the 99th percentile of the fsync metric captured from the run to see if it is less than 10 ms. A few of the most important etcd metrics that might affected by I/O performance are as follow:
-
etcd_disk_wal_fsync_duration_seconds_bucket
metric reports the etcd’s WAL fsync duration
-
etcd_disk_backend_commit_duration_seconds_bucket
metric reports the etcd backend commit latency duration
-
etcd_server_leader_changes_seen_total
metric reports the leader changes
Because etcd replicates the requests among all the members, its performance strongly depends on network input/output (I/O) latency. High network latencies result in etcd heartbeats taking longer than the election timeout, which results in leader elections that are disruptive to the cluster. A key metric to monitor on a deployed OpenShift Container Platform cluster is the 99th percentile of etcd network peer latency on each etcd cluster member. Use Prometheus to track the metric.
The histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[2m]))
metric reports the round trip time for etcd to finish replicating the client requests between the members. Ensure that it is less than 50 ms.