This document describes the process to replace a single etcd member. This procedure assumes that there is still an etcd quorum in the cluster.

If you have lost the majority of your master hosts, leading to etcd quorum loss, then you must follow the disaster recovery procedure to recover from lost master hosts instead of this procedure.

If the TLS certificates are not valid on the member being replaced, then you must follow the procedure to recover from expired control plane certificates instead of this procedure.

To replace a single master host:

Removing a failed master host from the etcd cluster

Follow these steps to remove a failed master host from the etcd cluster.

Prerequisites
  • Access to the cluster as a user with the cluster-admin role.

  • SSH access to an active master host.

Procedure
  1. Access an active master host.

  2. View the list of Pods associated with etcd:

    [core@ip-10-0-128-73 ~]$ oc get pods -n openshift-etcd
    NAME                                                     READY   STATUS    RESTARTS   AGE
    etcd-member-ip-10-0-128-73.us-east-2.compute.internal    2/2     Running   0          15h
    etcd-member-ip-10-0-147-172.us-east-2.compute.internal   2/2     Running   7          122m
    etcd-member-ip-10-0-171-108.us-east-2.compute.internal   2/2     Running   0          15h
  3. Run the etcd-member-remove.sh script and pass in the name of the etcd member to remove:

    [core@ip-10-0-128-73 ~]$ sudo -E etcd-member-remove.sh etcd-member-ip-10-0-147-172.us-east-2.compute.internal
    Downloading etcdctl binary..
    etcdctl version: 3.3.10
    API version: 3.3
    etcd client certs already backed up and available ./assets/backup/
    Member 23e4736df4451b32 removed from cluster 6e25bab1bb556673
    etcd member etcd-member-ip-10-0-147-172.us-east-2.compute.internal with 23e4736df4451b32 successfully removed..
  4. Verify that the etcd member has been successfully removed from the cluster:

    1. Connect to the running etcd container:

      [core@ip-10-0-128-73 ~] id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
    2. In the etcd container, export the variables needed for connecting to etcd:

      sh-4.2# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name *peer*crt) ETCDCTL_KEY=$(find /etc/ssl/ -name *peer*key)
    3. In the etcd container, execute etcdctl member list and verify that the removed member is no longer listed:

      sh-4.2#  etcdctl member list -w table
      
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+
      |        ID        | STATUS  |                   NAME                   |                            PEER ADDRS                            |       CLIENT ADDRS        |
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+
      | 29e461db6be4eaaa | started | etcd-member-ip-10-0-128-73.us-east-2.compute.internal | https://etcd-2.clustername.devcluster.openshift.com:2380 | https://10.0.128.73:2379 |
      |  cbe982c74cbb42f | started |  etcd-member-ip-10-0-171-108.us-east-2.compute.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 |   https://10.0.171.108:2379 |
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+

Adding a master host back to the etcd cluster

Follow these steps to add a master host back to the etcd cluster. This procedure assumes that you previously removed the master host from the cluster and that its etcd dependencies, such as TLS certificates and DNS, are valid.

Prerequisites
  • Access to the cluster as a user with the cluster-admin role.

  • SSH access to the master host to add to the etcd cluster.

  • The IP address of an existing active etcd member.

Procedure
  1. Access the master host to add to the etcd cluster.

    You must run this procedure on the master host that is being added to the etcd cluster.

  2. Run the etcd-member-add.sh script and pass in two parameters:

    • the IP address of an existing etcd member

    • the name of the etcd member to add

    [core@ip-10-0-147-172 ~]$ sudo -E etcd-member-add.sh 10.0.128.73 etcd-member-ip-10-0-147-172.us-east-2.compute.internal
    Downloading etcdctl binary..
    etcdctl version: 3.3.10
    API version: 3.3
    etcd-member.yaml found in ./assets/backup/
    etcd.conf backup upready exists ./assets/backup/etcd.conf
    Stopping etcd..
    Waiting for etcd-member to stop
    etcd data-dir backup found ./assets/backup/etcd..
    Updating etcd membership..
    Removing etcd data_dir /var/lib/etcd..
    
    ETCD_NAME="etcd-member-ip-10-0-147-172.us-east-2.compute.internal"
    ETCD_INITIAL_CLUSTER="etcd-member-ip-10-0-147-172.us-east-2.compute.internal=https://etcd-1.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-171-108.us-east-2.compute.internal=https://etcd-2.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-128-73.us-east-2.compute.internal=https://etcd-0.clustername.devcluster.openshift.com:2380"
    ETCD_INITIAL_ADVERTISE_PEER_URLS="https://etcd-1.clustername.devcluster.openshift.com:2380"
    ETCD_INITIAL_CLUSTER_STATE="existing"'
    Member  1e42c7070decd39 added to cluster 6e25bab1bb556673
    Starting etcd..
  3. Verify that the new member is in the list of Pods associated with etcd and that its status is Running:

    [core@ip-10-0-147-172 ~]$ oc get pods -n openshift-etcd
    NAME                                                     READY   STATUS    RESTARTS   AGE
    etcd-member-ip-10-0-128-73.us-east-2.compute.internal    2/2     Running   0          15h
    etcd-member-ip-10-0-147-172.us-east-2.compute.internal   2/2     Running   7          122m
    etcd-member-ip-10-0-171-108.us-east-2.compute.internal   2/2     Running   0          15h
  4. Verify that the etcd member has been successfully added to the etcd cluster:

    1. Connect to the running etcd container:

      [core@ip-10-0-147-172 ~] id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
    2. In the etcd container, export the variables needed for connecting to etcd:

      sh-4.2# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name *peer*crt) ETCDCTL_KEY=$(find /etc/ssl/ -name *peer*key)
    3. In the etcd container, execute etcdctl member list and verify that the new member is listed:

      sh-4.2#  etcdctl member list -w table
      
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+
      |        ID        | STATUS  |                   NAME                   |                            PEER ADDRS                            |       CLIENT ADDRS        |
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+
      | 29e461db6be4eaaa | started | etcd-member-ip-10-0-128-73.us-east-2.compute.internal | https://etcd-2.clustername.devcluster.openshift.com:2380 | https://10.0.128.73:2379 |
      |  cbe982c74cbb42f | started | etcd-member-ip-10-0-147-172.us-east-2.compute.internal | https://etcd-0.clustername.devcluster.openshift.com:2380 | https://10.0.147.172:2379 |
      | a752f80bcb0da3e8 | started |   etcd-member-ip-10-0-171-108.us-east-2.compute.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 |   https://10.0.171.108:2379 |
      +------------------+---------+------------------------------------------+------------------------------------------------------------------+---------------------------+

      It may take up to 10 minutes for the new member to start.

    4. In the etcd container, execute etcdctl endpoint health and verify that the new member is healthy:

      sh-4.2# etcdctl endpoint health --cluster
      https://10.0.128.73:2379 is healthy: successfully committed proposal: took = 4.5576ms
      https://10.0.147.172:2379 is healthy: successfully committed proposal: took = 5.1521ms
      https://10.0.171.108:2379 is healthy: successfully committed proposal: took = 4.2631ms