Overview

Following an OpenShift Container Platform upgrade, it may be desirable in extreme cases to downgrade your cluster to a previous version. The following sections outline the steps required on each system in a cluster to downgrade from OpenShift Container Platform 3.7 to 3.6.

These steps are currently only supported for RPM-based installations of OpenShift Container Platform and assume downtime of the entire cluster.

Verifying Backups

The Ansible playbook used during the upgrade process should have created a backup of the master-config.yaml file and the etcd data directory. Ensure these exist on your masters and etcd members:

/etc/origin/master/master-config.yaml.<timestamp>
/var/lib/etcd/openshift-backup-pre-upgrade-<timestamp>

Also, back up the node-config.yaml file on each node (including masters, which have the node component on them) with a timestamp:

/etc/origin/node/node-config.yaml.<timestamp>

If you are using an external etcd cluster (versus the single embedded etcd), the backup is likely created on all etcd members, though only one is required for the recovery process.
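
The presence checks above can be scripted. The following is a minimal sketch, not part of the upgrade tooling: the function name backup_exists is hypothetical, and it assumes the backup names follow the patterns listed above and that the paths contain no spaces.

```shell
# Hypothetical helper: succeed if at least one existing path matches the
# glob pattern in $1, printing each match.
# The body runs in a subshell, so "exit" only ends the function.
backup_exists() (
    found=1
    # $1 is left unquoted on purpose so the shell expands the glob; an
    # unmatched glob stays literal and fails the -e test below.
    for path in $1; do
        if [ -e "$path" ]; then
            echo "found: $path"
            found=0
        fi
    done
    exit "$found"
)

# Example, on a master that also hosts etcd:
#   backup_exists '/etc/origin/master/master-config.yaml.*' || echo "no master-config backup"
#   backup_exists '/var/lib/etcd/openshift-backup-pre-upgrade-*' || echo "no etcd backup"
```

A non-zero return for either pattern suggests the backup should be located or recreated before continuing.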

The RPM downgrade process in a later step should create .rpmsave backups of the following files, but it may be a good idea to keep a separate copy regardless:

/etc/sysconfig/atomic-openshift-master
/etc/sysconfig/atomic-openshift-master-api
/etc/sysconfig/atomic-openshift-master-controllers
/etc/etcd/etcd.conf (only required if using external etcd)

Shutting Down the Cluster

On all masters, nodes, and etcd members (if using an external etcd cluster), ensure the relevant services are stopped.

On all master hosts:

# systemctl stop atomic-openshift-master-api atomic-openshift-master-controllers

On all master and node hosts:

# systemctl stop atomic-openshift-node

On any external etcd hosts:

# systemctl stop etcd
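
When many hosts are involved, it can help to derive the stop command from the host's role. The sketch below is only an illustration: the role labels (master, node, etcd) and both function names are assumptions made for this example, and the command is printed for review rather than executed.

```shell
# Hypothetical helper: print the services to stop for a host role.
# Masters also run the node component, so they stop three services.
services_for_role() {
    case "$1" in
        master) echo "atomic-openshift-master-api atomic-openshift-master-controllers atomic-openshift-node" ;;
        node)   echo "atomic-openshift-node" ;;
        etcd)   echo "etcd" ;;
        *)      echo "unknown role: $1" >&2; return 1 ;;
    esac
}

# Print (rather than run) the stop command so it can be reviewed first.
stop_command_for_role() {
    services=$(services_for_role "$1") || return 1
    echo "systemctl stop $services"
}

# Example:
#   stop_command_for_role node
```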

Removing RPMs

  1. The *-excluder packages add entries to the exclude directive in the host’s /etc/yum.conf file when installed. Run the following commands on each host to remove the atomic-openshift-* and docker packages from the exclude list:

    # atomic-openshift-excluder unexclude
    # atomic-openshift-docker-excluder unexclude
  2. On all masters, nodes, and etcd members (if using an external etcd cluster), remove the following packages:

    # yum remove atomic-openshift \
        atomic-openshift-clients \
        atomic-openshift-node \
        atomic-openshift-master-api \
        atomic-openshift-master-controllers \
        openvswitch \
        atomic-openshift-sdn-ovs \
        tuned-profiles-atomic-openshift-node \
        atomic-openshift-excluder \
        atomic-openshift-docker-excluder
  3. If you are using external etcd, also remove the etcd package:

    # yum remove etcd

    If using the embedded etcd, leave the etcd package installed. It is required for running the etcdctl command to issue operations in later steps.
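
Before removing packages in step 2, it may be worth confirming that the unexclude commands in step 1 took effect. The following sketch checks a yum configuration file for leftover entries; the function name still_excluded is hypothetical, and it assumes the exclusions sit on a single exclude= line, which may not match every host.

```shell
# Hypothetical helper: succeed if the exclude directive in the yum
# configuration file $1 still lists atomic-openshift or docker packages.
# Assumes a single-line "exclude=" directive.
still_excluded() {
    grep -E '^exclude[[:space:]]*=.*(atomic-openshift|docker)' "$1" >/dev/null
}

# Example:
#   still_excluded /etc/yum.conf && echo "packages are still excluded"
```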

Downgrading Docker

Both OpenShift Container Platform 3.6 and 3.7 require Docker 1.12, so Docker does not need to be downgraded.

Reinstalling RPMs

Disable the OpenShift Container Platform 3.7 repositories, and re-enable the 3.6 repositories:

# subscription-manager repos \
    --disable=rhel-7-server-ose-3.7-rpms \
    --enable=rhel-7-server-ose-3.6-rpms

On each master, install the following packages:

# yum install atomic-openshift \
    atomic-openshift-clients \
    atomic-openshift-node \
    atomic-openshift-master-api \
    atomic-openshift-master-controllers \
    openvswitch \
    atomic-openshift-sdn-ovs \
    tuned-profiles-atomic-openshift-node \
    atomic-openshift-excluder \
    atomic-openshift-docker-excluder

On each node, install the following packages:

# yum install atomic-openshift \
    atomic-openshift-node \
    openvswitch \
    atomic-openshift-sdn-ovs \
    tuned-profiles-atomic-openshift-node \
    atomic-openshift-excluder \
    atomic-openshift-docker-excluder

If using an external etcd cluster, install the following package on each etcd member:

# yum install etcd
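
After reinstalling, a quick check that the installed packages come from the 3.6 repositories can catch a repository that was left enabled by mistake. This is a sketch under the assumption that package versions begin with the release number (as in 3.6.173.0.49); the function name version_in_release is hypothetical.

```shell
# Hypothetical helper: succeed if version string $1 belongs to release $2,
# for example "3.6.173.0.49" and "3.6".
version_in_release() {
    case "$1" in
        "$2".*) return 0 ;;
        *)      return 1 ;;
    esac
}

# Example, querying the installed package version with rpm:
#   version_in_release "$(rpm -q --queryformat '%{VERSION}' atomic-openshift)" 3.6 \
#       || echo "atomic-openshift is not a 3.6 package"
```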

Restoring etcd

The restore procedure for etcd configuration files replaces the appropriate files, then restarts the service.

If an etcd host has become corrupted and the /etc/etcd/etcd.conf file is lost, restore it using:

$ ssh master-0
# cp /backup/yesterday/master-0-files/etcd.conf /etc/etcd/etcd.conf
# restorecon -Rv /etc/etcd/etcd.conf
# systemctl restart etcd.service

In this example, the backup file is stored in the /backup/yesterday/master-0-files/etcd.conf path, which can be located on an external NFS share, S3 bucket, or other storage solution.

Restoring etcd v2 & v3 data

The following process restores healthy data files and starts the etcd cluster as a single node, then adds the rest of the nodes if an etcd cluster is required.

Procedure

  1. Stop all etcd services:

    # systemctl stop etcd.service
  2. To ensure the proper backup is restored, delete the etcd directories:

    • To back up the current etcd data before you delete the directory, run the following command:

      # mv /var/lib/etcd /var/lib/etcd.old
      # mkdir /var/lib/etcd
      # chown -R etcd:etcd /var/lib/etcd/
      # restorecon -Rv /var/lib/etcd/
    • Or, to delete the directory and the etcd data, run the following command:

      # rm -Rf /var/lib/etcd/*

      In an all-in-one cluster, the etcd data directory is located in the /var/lib/origin/openshift.local.etcd directory.

  3. Restore a healthy backup data file to each of the etcd nodes. Perform this step on all etcd hosts, including master hosts collocated with etcd.

    # cp -R /backup/etcd-xxx/* /var/lib/etcd/
    # mv /var/lib/etcd/db /var/lib/etcd/member/snap/db
    # chcon -R --reference /backup/etcd-xxx/* /var/lib/etcd/
    # chown -R etcd:etcd /var/lib/etcd/
  4. Run the etcd service on each host, forcing a new cluster.

    This creates a custom drop-in file for the etcd service, which overrides the execution command, adding the --force-new-cluster option:

    # mkdir -p /etc/systemd/system/etcd.service.d/
    # echo "[Service]" > /etc/systemd/system/etcd.service.d/temp.conf
    # echo "ExecStart=" >> /etc/systemd/system/etcd.service.d/temp.conf
    # sed -n '/ExecStart/s/"$/ --force-new-cluster"/p' \
        /usr/lib/systemd/system/etcd.service \
        >> /etc/systemd/system/etcd.service.d/temp.conf
    
    # systemctl daemon-reload
    # systemctl restart etcd
  5. Check for error messages:

    $ journalctl -fu etcd.service
  6. Check for health status:

    # etcdctl2 cluster-health
    member 5ee217d17301 is healthy: got healthy result from https://192.168.55.8:2379
    cluster is healthy
  7. Restart the etcd service in cluster mode:

    # rm -f /etc/systemd/system/etcd.service.d/temp.conf
    # systemctl daemon-reload
    # systemctl restart etcd
  8. Check for health status and member list:

    # etcdctl2 cluster-health
    member 5ee217d17301 is healthy: got healthy result from https://192.168.55.8:2379
    cluster is healthy
    
    # etcdctl2 member list
    5ee217d17301: name=master-0.example.com peerURLs=http://localhost:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
  9. After the first instance is running, you can restore the rest of your etcd servers.
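
To see what the drop-in file from step 4 will contain before writing it, the same sed expression can be wrapped in a function that prints to standard output. This is a minimal sketch: the function name force_new_cluster_dropin is hypothetical, and, like the original commands, it assumes the ExecStart line in the packaged unit file ends in a double quote.

```shell
# Hypothetical helper: print a drop-in that overrides ExecStart with
# --force-new-cluster appended, reading the unit file given as $1.
force_new_cluster_dropin() {
    echo "[Service]"
    # An empty ExecStart= clears the packaged command before redefining it.
    echo "ExecStart="
    # Append the option inside the closing quote of the ExecStart line.
    sed -n '/ExecStart/s/"$/ --force-new-cluster"/p' "$1"
}

# Example:
#   force_new_cluster_dropin /usr/lib/systemd/system/etcd.service
```

If the output shows only the [Service] header and the empty ExecStart= line, the unit file did not match the expected format and the drop-in would leave etcd without a start command.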

Fix the peerURLs parameter

After restoring the data and creating a new cluster, the peerURLs parameter shows localhost instead of the IP where etcd is listening for peer communication:

# etcdctl2 member list
5ee217d17301: name=master-0.example.com peerURLs=http://localhost:2380 clientURLs=https://192.168.55.8:2379 isLeader=true

Procedure

  1. Get the member ID using etcdctl member list:

    $ etcdctl member list
  2. Get the IP where etcd listens for peer communication:

    $ ss -l4n | grep 2380
  3. Update the member information with that IP:

    # etcdctl2 member update 5ee217d17301 https://192.168.55.8:2380
    Updated member with ID 5ee217d17301 in cluster
  4. To verify, check that the IP is in the member list:

    $ etcdctl2 member list
    5ee217d17301: name=master-0.example.com peerURLs=https://192.168.55.8:2380 clientURLs=https://192.168.55.8:2379 isLeader=true
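
Finding the members that still advertise localhost can be automated by filtering the member list output. The sketch below assumes the output format shown above (ID followed by a colon, then space-separated key=value fields); the function name localhost_peer_ids is hypothetical.

```shell
# Hypothetical helper: read "member list" output on stdin and print the
# IDs of members whose peerURLs still point at localhost.
localhost_peer_ids() {
    # $1 is the "ID:" field; strip the trailing colon before printing.
    awk '/peerURLs=[a-z]+:\/\/localhost:/ { sub(/:$/, "", $1); print $1 }'
}

# Example:
#   etcdctl2 member list | localhost_peer_ids
```

An empty result after the update indicates that no member is left with a localhost peer URL.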

Restoring etcd for v3

The restore procedure for v3 data is similar to the restore procedure for the v2 data.

Snapshot integrity may be optionally verified at restore time. If the snapshot is taken with etcdctl snapshot save, it will have an integrity hash that is checked by etcdctl snapshot restore. If the snapshot is copied from the data directory, there is no integrity hash and it will only restore by using --skip-hash-check.

The procedure to restore only the v3 data must be performed on a single etcd host. You can then add the rest of the nodes to the cluster.

Procedure

  1. Stop all etcd services:

    # systemctl stop etcd.service
  2. Clear all old data, because etcdctl recreates it on the node where the restore procedure is performed:

    # rm -Rf /var/lib/etcd
  3. Run the snapshot restore command, substituting the values from the /etc/etcd/etcd.conf file:

    # etcdctl3 snapshot restore /backup/etcd-xxxxxx/backup.db \
      --data-dir /var/lib/etcd \
      --name master-0.example.com \
      --initial-cluster "master-0.example.com=https://192.168.55.8:2380" \
      --initial-cluster-token "etcd-cluster-1" \
      --initial-advertise-peer-urls https://192.168.55.8:2380
    
    2017-10-03 08:55:32.440779 I | mvcc: restore compact to 1041269
    2017-10-03 08:55:32.468244 I | etcdserver/membership: added member 40bef1f6c79b3163 [https://192.168.55.8:2380] to cluster 26841ebcf610583c
  4. Restore permissions and SELinux context to the restored files:

    # chown -R etcd:etcd /var/lib/etcd/
    # restorecon -Rv /var/lib/etcd
  5. Start the etcd service:

    # systemctl start etcd
  6. Check for any error messages:

    $ journalctl -fu etcd.service
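
Substituting values from /etc/etcd/etcd.conf by hand in step 3 is error-prone, so the command line can be assembled from the file instead. The following is a sketch only: the function name print_restore_command is hypothetical, it assumes the configuration file holds plain KEY=value lines that a shell can source, and it prints the command for review instead of running it.

```shell
# Hypothetical helper: print the snapshot restore command for snapshot $1,
# filling in values from the etcd configuration file $2.
# The body runs in a subshell so the sourced variables do not leak.
print_restore_command() (
    snapshot="$1"
    # Pass a path containing a slash; a bare file name would be
    # looked up in PATH by the "." builtin.
    . "$2"
    echo "etcdctl3 snapshot restore $snapshot" \
        "--data-dir /var/lib/etcd" \
        "--name $ETCD_NAME" \
        "--initial-cluster \"$ETCD_INITIAL_CLUSTER\"" \
        "--initial-cluster-token \"$ETCD_INITIAL_CLUSTER_TOKEN\"" \
        "--initial-advertise-peer-urls $ETCD_INITIAL_ADVERTISE_PEER_URLS"
)

# Example:
#   print_restore_command /backup/etcd-xxxxxx/backup.db /etc/etcd/etcd.conf
```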

Bringing OpenShift Container Platform services back online

After you finish your changes, bring OpenShift Container Platform back online.

Procedure

  1. On each OpenShift Container Platform master, restore your master and node configuration from backup and enable and restart all relevant services:

    # cp ${MYBACKUPDIR}/etc/sysconfig/atomic-openshift-master-api /etc/sysconfig/atomic-openshift-master-api
    # cp ${MYBACKUPDIR}/etc/sysconfig/atomic-openshift-master-controllers /etc/sysconfig/atomic-openshift-master-controllers
    # cp ${MYBACKUPDIR}/etc/origin/master/master-config.yaml.<timestamp> /etc/origin/master/master-config.yaml
    # cp ${MYBACKUPDIR}/etc/origin/node/node-config.yaml.<timestamp> /etc/origin/node/node-config.yaml
    # systemctl enable atomic-openshift-master-api
    # systemctl enable atomic-openshift-master-controllers
    # systemctl enable atomic-openshift-node
    # systemctl start atomic-openshift-master-api
    # systemctl start atomic-openshift-master-controllers
    # systemctl start atomic-openshift-node
  2. On each OpenShift Container Platform node, restore your node-config.yaml file from backup and enable and restart the atomic-openshift-node service:

    # cp /etc/origin/node/node-config.yaml.<timestamp> /etc/origin/node/node-config.yaml
    # systemctl enable atomic-openshift-node
    # systemctl start atomic-openshift-node

Verifying the Downgrade

  1. To verify the downgrade, first check that all nodes are marked as Ready:

    # oc get nodes
    NAME                        STATUS                     AGE
    master.example.com          Ready,SchedulingDisabled   165d
    node1.example.com           Ready                      165d
    node2.example.com           Ready                      165d
  2. Then, verify that you are running the expected versions of the docker-registry and router images, if deployed:

    # oc get -n default dc/docker-registry -o json | grep \"image\"
        "image": "openshift3/ose-docker-registry:v3.6.173.0.49",
    # oc get -n default dc/router -o json | grep \"image\"
        "image": "openshift3/ose-haproxy-router:v3.6.173.0.49",
  3. You can use the diagnostics tool on the master to look for common issues and provide suggestions:

    # oc adm diagnostics
    ...
    [Note] Summary of diagnostics execution:
    [Note] Completed with no errors or warnings seen.
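
On larger clusters, scanning the node list by eye in step 1 is impractical. The sketch below filters the oc get nodes output for nodes that are not Ready; it assumes the three-column layout shown above, with the status in the second column, and the function name not_ready_nodes is hypothetical.

```shell
# Hypothetical helper: read "oc get nodes" output on stdin and print the
# names of nodes whose STATUS column does not start with "Ready".
# NR > 1 skips the header row.
not_ready_nodes() {
    awk 'NR > 1 && $2 !~ /^Ready/ { print $1 }'
}

# Example:
#   oc get nodes | not_ready_nodes
```

Nodes reported as Ready,SchedulingDisabled are treated as ready here, matching the master entry in the sample output.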