To restore the cluster to a previous state, you must have previously backed up etcd data by creating a snapshot. You will use this snapshot to restore the cluster state.

Restoring to a previous cluster state

You can use a saved etcd backup to restore your cluster to a previous state.

Prerequisites

  • Access to the cluster as a user with the cluster-admin role.

  • SSH access to master hosts.

  • A backup directory containing both the etcd snapshot and static Kubernetes API server resources taken from the same backup. The file names in the directory must be in the following formats: snapshot_<datetimestamp>.db and static_kuberesources_<datetimestamp>.tar.gz.

    You must use the same etcd backup directory on all master hosts in the cluster.

    If the etcd backup was taken from OpenShift Container Platform 4.3.0 or 4.3.1, it is a single file that contains the etcd snapshot and static Kubernetes API server resources. The etcd-snapshot-restore.sh script is backward compatible and accepts this single file, which must be in the format snapshot_db_kuberesources_<datetimestamp>.tar.gz.
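    For example, a backup directory in the two-file format might contain files similar to the following. The timestamp shown in these file names is illustrative only:

    $ ls /home/core/backup
    snapshot_2019-05-15_165233.db
    static_kuberesources_2019-05-15_165233.tar.gz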

Procedure

  1. Prepare each master host in your cluster to be restored.

    You should run the restore script on all of your master hosts within a short period of time so that the cluster members come up at about the same time and form a quorum. For this reason, it is recommended to stage each master host in a separate terminal, so that the restore script can then be started quickly on each.

    1. Copy the etcd backup directory to a master host.

      This procedure assumes that you copied the backup directory containing the etcd snapshot and static Kubernetes API server resources to the /home/core/ directory of your master host.
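      For example, assuming the backup directory is named backup and you can reach the master host over SSH as the core user, you might copy it with a command similar to the following, where <master_host> is a placeholder for the host name or IP address:

      $ scp -r ./backup core@<master_host>:/home/core/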

    2. Access the master host.
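      For example, if you have direct SSH access as the core user, you can connect with a command similar to the following, again using <master_host> as a placeholder:

      $ ssh core@<master_host>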

    3. Set the INITIAL_CLUSTER variable to the list of members in the format of <name>=<url>. This variable will be passed to the restore script and must be exactly the same for each member.

      [core@ip-10-0-143-125 ~]$ export INITIAL_CLUSTER="etcd-member-ip-10-0-143-125.ec2.internal=https://etcd-0.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-35-108.ec2.internal=https://etcd-1.clustername.devcluster.openshift.com:2380,etcd-member-ip-10-0-10-16.ec2.internal=https://etcd-2.clustername.devcluster.openshift.com:2380"
    4. If the cluster-wide proxy is enabled, be sure that you have exported the NO_PROXY, HTTP_PROXY, and HTTPS_PROXY environment variables.

      You can check whether the proxy is enabled by reviewing the output of oc get proxy cluster -o yaml. The proxy is enabled if the httpProxy, httpsProxy, and noProxy fields have values set.
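      For example, if the proxy is enabled, you might export values similar to the following before running the restore script. The proxy URL and noProxy list shown here are placeholders; use the values from your cluster's proxy configuration:

      [core@ip-10-0-143-125 ~]$ export HTTP_PROXY=http://<proxy_host>:<proxy_port>
      [core@ip-10-0-143-125 ~]$ export HTTPS_PROXY=http://<proxy_host>:<proxy_port>
      [core@ip-10-0-143-125 ~]$ export NO_PROXY=<no_proxy_list>

      Because the restore script is run with sudo -E in the next step, these exported variables are passed through to it.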

    5. Repeat these steps on your other master hosts, each in a separate terminal. Be sure to use the backup directory containing the same set of backup files on each master host.

  2. Run the restore script on all of your master hosts.

    1. Start the etcd-snapshot-restore.sh script on your first master host. Pass in two parameters: the path to the etcd backup directory and list of members, which is defined by the INITIAL_CLUSTER variable.

      [core@ip-10-0-143-125 ~]$ sudo -E /usr/local/bin/etcd-snapshot-restore.sh /home/core/backup $INITIAL_CLUSTER
      Creating asset directory ./assets
      Downloading etcdctl binary..
      etcdctl version: 3.3.10
      API version: 3.3
      Backing up /etc/kubernetes/manifests/etcd-member.yaml to ./assets/backup/
      Stopping all static pods..
      ..stopping kube-scheduler-pod.yaml
      ..stopping kube-controller-manager-pod.yaml
      ..stopping kube-apiserver-pod.yaml
      ..stopping etcd-member.yaml
      Stopping etcd..
      Waiting for etcd-member to stop
      Stopping kubelet..
      Stopping all containers..
      Backing up etcd data-dir..
      Removing etcd data-dir /var/lib/etcd
      Restoring etcd member etcd-member-ip-10-0-143-125.ec2.internal from snapshot..
      2019-05-15 19:03:34.647589 I | pkg/netutil: resolving etcd-0.clustername.devcluster.openshift.com:2380 to
      2019-05-15 19:03:34.883545 I | mvcc: restore compact to 361491
      2019-05-15 19:03:34.915679 I | etcdserver/membership: added member cbe982c74cbb42f [https://etcd-0.clustername.devcluster.openshift.com:2380] to cluster 807ae3bffc8d69ca
      Starting static pods..
      ..starting kube-scheduler-pod.yaml
      ..starting kube-controller-manager-pod.yaml
      ..starting kube-apiserver-pod.yaml
      ..starting etcd-member.yaml
      Starting kubelet..
    2. After the restore process starts on the first master host, run the script on your other master hosts.

  3. Verify that the Machine Configs have been applied.

    In a terminal that has access to the cluster as a cluster-admin user, run the following command.

    $ oc get machineconfigpool
    NAME     CONFIG                                             UPDATED   UPDATING
    master   rendered-master-50e7e00374e80b767fcc922bdfbc522b   True      False

    When the snapshot has been applied, the currentConfig of the master will match the ID from when the etcd snapshot was taken. The currentConfig name for masters is in the format rendered-master-<currentConfig>.
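    One way to check this on each master node, assuming the Machine Config Operator has set the machineconfiguration.openshift.io/currentConfig annotation, is to read that annotation directly, where <master_node> is a placeholder for a master node name:

    $ oc get node <master_node> -o jsonpath='{.metadata.annotations.machineconfiguration\.openshift\.io/currentConfig}'
    rendered-master-50e7e00374e80b767fcc922bdfbc522b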

  4. Verify that all master hosts have started and joined the cluster.

    1. Access a master host and connect to the running etcd container.

      [core@ip-10-0-143-125 ~]$ id=$(sudo crictl ps --name etcd-member | awk 'FNR==2{ print $1}') && sudo crictl exec -it $id /bin/sh
    2. In the etcd container, export variables needed for connecting to etcd.

      sh-4.3# export ETCDCTL_API=3 ETCDCTL_CACERT=/etc/ssl/etcd/ca.crt ETCDCTL_CERT=$(find /etc/ssl/ -name "*peer*crt") ETCDCTL_KEY=$(find /etc/ssl/ -name "*peer*key")
    3. In the etcd container, execute etcdctl member list and verify that the three members show as started.

      sh-4.3#  etcdctl member list -w table
      |        ID        | STATUS  |                   NAME                   |                            PEER ADDRS                            |       CLIENT ADDRS        |
      | 29e461db6be4eaaa | started | etcd-member-ip-10-0-164-170.ec2.internal | https://etcd-2.clustername.devcluster.openshift.com:2380 | |
      |  cbe982c74cbb42f | started | etcd-member-ip-10-0-143-125.ec2.internal | https://etcd-0.clustername.devcluster.openshift.com:2380 | |
      | a752f80bcb0da3e8 | started |   etcd-member-ip-10-0-156-2.ec2.internal | https://etcd-1.clustername.devcluster.openshift.com:2380 | |

      It may take up to 20 minutes for each new member to start.
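      If you prefer to watch the members come up instead of re-running the command manually, a minimal sketch is a loop inside the etcd container that repeats the same command periodically:

      sh-4.3# while true; do etcdctl member list -w table; sleep 30; done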