In a highly available hosted control plane, three etcd pods run as part of a stateful set and together form the etcd cluster. To recover the etcd cluster, first identify the unhealthy etcd pods by checking the cluster health.

Checking the status of an etcd cluster

You can check the health of the etcd cluster by logging in to any etcd pod.

Procedure
  1. Log in to an etcd pod by entering the following command:

    $ oc rsh -n <hosted_control_plane_namespace> -c etcd <etcd_pod_name>
  2. Print the health status of the etcd cluster by entering the following command:

    sh-4.4$ etcdctl endpoint health --cluster -w table
    Example output
    ENDPOINT                                                HEALTH  TOOK        ERROR
    https://etcd-0.etcd-discovery.clusters-hosted.svc:2379  true    9.117698ms
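
The health check can also be filtered in a script. The following is a minimal sketch and not part of the documented procedure. The function name unhealthy_endpoints is hypothetical. It reads the table output on standard input and prints each endpoint whose HEALTH value is not true. It assumes the column order shown in the example output and strips any | border characters that etcdctl might draw around the table.

```shell
# Hypothetical helper: reads `etcdctl endpoint health --cluster -w table`
# output on stdin and prints every endpoint whose HEALTH column is not "true".
unhealthy_endpoints() {
  # Remove table borders so bordered and plain layouts parse the same way,
  # then keep only rows that start with an endpoint URL.
  tr -d '|' | awk '$1 ~ /^https?:\/\// && $2 != "true" { print $1 }'
}
```

If the etcd container exposes the same environment to oc exec as to an interactive oc rsh session (an assumption to verify in your environment), the helper might be used from a workstation as follows:

    $ oc exec -n <hosted_control_plane_namespace> -c etcd <etcd_pod_name> -- etcdctl endpoint health --cluster -w table | unhealthy_endpoints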

Recovering a failing etcd pod

In a three-node etcd cluster, each etcd pod has its own persistent volume claim (PVC) to store its data. An etcd pod might fail because its data is corrupted or missing. You can recover a failing etcd pod by deleting the pod together with its PVC.

Procedure
  1. To confirm that the etcd pod is failing, enter the following command:

    $ oc get pods -l app=etcd -n <hosted_control_plane_namespace>
    Example output
    NAME     READY   STATUS             RESTARTS     AGE
    etcd-0   2/2     Running            0            64m
    etcd-1   2/2     Running            0            45m
    etcd-2   1/2     CrashLoopBackOff   1 (5s ago)   64m

    The failing etcd pod might have the CrashLoopBackOff or Error status.

  2. Delete the failing pod and its PVC by entering the following command:

    $ oc delete pvc/<etcd_pvc_name> pod/<etcd_pod_name> -n <hosted_control_plane_namespace> --wait=false
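
The first step of this procedure can be scripted. The following is a minimal sketch, assuming the column layout shown in the example output. The function name failing_etcd_pods is hypothetical. It reads the oc get pods listing on standard input and prints the name of each pod whose STATUS is CrashLoopBackOff or Error.

```shell
# Hypothetical helper: reads `oc get pods` output on stdin and prints the
# names of pods whose STATUS column is CrashLoopBackOff or Error.
failing_etcd_pods() {
  # $1 = NAME, $3 = STATUS; the header row never matches either status.
  awk '$3 == "CrashLoopBackOff" || $3 == "Error" { print $1 }'
}
```

For example:

    $ oc get pods -l app=etcd -n <hosted_control_plane_namespace> | failing_etcd_pods

Before you delete the pod, one way to find the PVC name to use for <etcd_pvc_name> is to read the claim name from the pod specification:

    $ oc get pod <etcd_pod_name> -n <hosted_control_plane_namespace> -o jsonpath='{.spec.volumes[*].persistentVolumeClaim.claimName}'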

Verification
  • Verify that a new etcd pod is up and running by entering the following command:

    $ oc get pods -l app=etcd -n <hosted_control_plane_namespace>
    Example output
    NAME     READY   STATUS    RESTARTS   AGE
    etcd-0   2/2     Running   0          67m
    etcd-1   2/2     Running   0          48m
    etcd-2   2/2     Running   0          2m2s
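
The verification can be expressed as a check that succeeds only when every listed pod is fully ready. The following is a minimal sketch, assuming the column layout shown in the example output. The function name all_etcd_pods_ready is hypothetical. It reads the oc get pods listing on standard input and returns exit status 0 only if at least one pod is listed and every pod has the Running status with all of its containers ready.

```shell
# Hypothetical helper: reads `oc get pods` output on stdin and exits 0 only
# when every listed pod is Running with all containers ready (for example 2/2).
all_etcd_pods_ready() {
  awk '
    $1 == "NAME" { next }                 # skip the header row
    {
      split($2, ready, "/")               # READY column, for example "1/2"
      seen++
      if ($3 != "Running" || ready[1] != ready[2]) bad++
    }
    END { exit (seen > 0 && bad == 0) ? 0 : 1 }
  '
}
```

For example, to poll until the replacement pod is ready:

    $ until oc get pods -l app=etcd -n <hosted_control_plane_namespace> | all_etcd_pods_ready; do sleep 10; done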