
You can hibernate your OpenShift Container Platform cluster for up to 90 days.

About cluster hibernation

You can hibernate your OpenShift Container Platform cluster to save money on cloud hosting costs. You can hibernate the cluster for up to 90 days and expect it to resume successfully.

You must wait at least 24 hours after cluster installation before hibernating your cluster to allow for the first certificate rotation.

If you must hibernate your cluster before the 24-hour certificate rotation, use the following procedure instead: Enabling OpenShift 4 Clusters to Stop and Resume Cluster VMs.

When you hibernate a cluster, you must hibernate all of the cluster nodes. Suspending only some nodes is not supported.

After resuming, it can take up to 45 minutes for the cluster to become ready.

Prerequisites

  • Take an etcd backup prior to hibernating the cluster.

    It is important to take an etcd backup before hibernating so that your cluster can be restored if you encounter any issues when resuming the cluster. A minimal backup sketch follows at the end of this item.

    For example, the following conditions can cause the resumed cluster to malfunction:

    • etcd data corruption during hibernation

    • Node failure due to hardware issues

    • Network connectivity issues

    If your cluster fails to recover, follow the steps to restore to a previous cluster state.
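
    The following is a minimal sketch of taking that backup, assuming the cluster-backup.sh script that OpenShift Container Platform provides on control plane nodes; <node_name> is a placeholder for one of your control plane nodes. Follow the full etcd backup procedure documented for your cluster version.

      $ oc debug node/<node_name>
      sh-4.4# chroot /host
      sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup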

Hibernating a cluster

You can hibernate a cluster for up to 90 days. The cluster can recover even if certificates expire while it is in hibernation.

Prerequisites
  • The cluster has been running for at least 24 hours to allow the first certificate rotation to complete.

    If you must hibernate your cluster before the 24-hour certificate rotation, use the following procedure instead: Enabling OpenShift 4 Clusters to Stop and Resume Cluster VMs.

  • You have taken an etcd backup.

  • You have access to the cluster as a user with the cluster-admin role.

Procedure
  1. Confirm that your cluster has been installed for at least 24 hours.

  2. Ensure that all nodes are in a good state by running the following command:

    $ oc get nodes
    Example output
    NAME                                       STATUS   ROLES                  AGE   VERSION
    ci-ln-812tb4k-72292-8bcj7-master-0         Ready    control-plane,master   32m   v1.31.3
    ci-ln-812tb4k-72292-8bcj7-master-1         Ready    control-plane,master   32m   v1.31.3
    ci-ln-812tb4k-72292-8bcj7-master-2         Ready    control-plane,master   32m   v1.31.3
    ci-ln-812tb4k-72292-8bcj7-worker-a-zhdvk   Ready    worker                 19m   v1.31.3
    ci-ln-812tb4k-72292-8bcj7-worker-b-9hrmv   Ready    worker                 19m   v1.31.3
    ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2   Ready    worker                 19m   v1.31.3

    All nodes should show Ready in the STATUS column.

  3. Ensure that all cluster Operators are in a good state by running the following command:

    $ oc get clusteroperators
    Example output
    NAME                      VERSION   AVAILABLE  PROGRESSING  DEGRADED  SINCE   MESSAGE
    authentication            4.18.0-0  True       False        False     51m
    baremetal                 4.18.0-0  True       False        False     72m
    cloud-controller-manager  4.18.0-0  True       False        False     75m
    cloud-credential          4.18.0-0  True       False        False     77m
    cluster-api               4.18.0-0  True       False        False     42m
    cluster-autoscaler        4.18.0-0  True       False        False     72m
    config-operator           4.18.0-0  True       False        False     72m
    console                   4.18.0-0  True       False        False     55m
    ...

    All cluster Operators should show AVAILABLE=True, PROGRESSING=False, and DEGRADED=False.

  4. Ensure that all machine config pools are in a good state by running the following command:

    $ oc get mcp
    Example output
    NAME    CONFIG                                            UPDATED  UPDATING  DEGRADED  MACHINECOUNT  READYMACHINECOUNT  UPDATEDMACHINECOUNT  DEGRADEDMACHINECOUNT  AGE
    master  rendered-master-87871f187930e67233c837e1d07f49c7  True     False     False     3             3                  3                    0                     96m
    worker  rendered-worker-3c4c459dc5d90017983d7e72928b8aed  True     False     False     3             3                  3                    0                     96m

    All machine config pools should show UPDATING=False and DEGRADED=False.

  5. Stop the cluster virtual machines:

    Use the tools that are native to your cluster’s cloud environment to shut down the cluster’s virtual machines. An example sketch follows this step.

    If you use a bastion virtual machine, do not shut down this virtual machine.
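
    The following sketch assumes an AWS-hosted cluster and selects the running instances by the kubernetes.io/cluster/<infra_id> tag that the installer applies; <infra_id> is a placeholder for your cluster’s infrastructure ID, and other cloud providers have their own equivalents:

      $ aws ec2 describe-instances \
          --filters "Name=tag:kubernetes.io/cluster/<infra_id>,Values=owned" \
                    "Name=instance-state-name,Values=running" \
          --query "Reservations[].Instances[].InstanceId" \
          --output text | xargs aws ec2 stop-instances --instance-ids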

Resuming a hibernated cluster

When you resume a hibernated cluster within 90 days, you might have to approve certificate signing requests (CSRs) for the nodes to become ready.

It can take around 45 minutes for the cluster to resume, depending on the size of your cluster.

Prerequisites
  • You hibernated your cluster less than 90 days ago.

  • You have access to the cluster as a user with the cluster-admin role.

Procedure
  1. Within 90 days of cluster hibernation, resume the cluster virtual machines:

    Use the tools that are native to your cluster’s cloud environment to resume the cluster’s virtual machines. An example sketch follows this procedure.

  2. Wait about 5 minutes, depending on the number of nodes in your cluster.

  3. Approve CSRs for the nodes:

    1. Check that there is a CSR for each node in the NotReady state:

      $ oc get csr
      Example output
      NAME       AGE  SIGNERNAME                                   REQUESTOR                                                                  REQUESTEDDURATION  CONDITION
      csr-4dwsd  37m  kubernetes.io/kube-apiserver-client          system:node:ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2                       24h                Pending
      csr-4vrbr  49m  kubernetes.io/kube-apiserver-client          system:node:ci-ln-812tb4k-72292-8bcj7-master-1                             24h                Pending
      csr-4wk5x  51m  kubernetes.io/kubelet-serving                system:node:ci-ln-812tb4k-72292-8bcj7-master-1                             <none>             Pending
      csr-84vb6  51m  kubernetes.io/kube-apiserver-client-kubelet  system:serviceaccount:openshift-machine-config-operator:node-bootstrapper  <none>             Pending
    2. Approve each valid CSR by running the following command, or use the bulk-approval sketch that follows this procedure:

      $ oc adm certificate approve <csr_name>
    3. Verify that all necessary CSRs were approved by running the following command:

      $ oc get csr
      Example output
      NAME       AGE  SIGNERNAME                                   REQUESTOR                                                                  REQUESTEDDURATION  CONDITION
      csr-4dwsd  37m  kubernetes.io/kube-apiserver-client          system:node:ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2                       24h                Approved,Issued
      csr-4vrbr  49m  kubernetes.io/kube-apiserver-client          system:node:ci-ln-812tb4k-72292-8bcj7-master-1                             24h                Approved,Issued
      csr-4wk5x  51m  kubernetes.io/kubelet-serving                system:node:ci-ln-812tb4k-72292-8bcj7-master-1                             <none>             Approved,Issued
      csr-84vb6  51m  kubernetes.io/kube-apiserver-client-kubelet  system:serviceaccount:openshift-machine-config-operator:node-bootstrapper  <none>             Approved,Issued

      CSRs should show Approved,Issued in the CONDITION column.

  4. Verify that all nodes now show as ready by running the following command:

    $ oc get nodes
    Example output
    NAME                                       STATUS   ROLES                  AGE   VERSION
    ci-ln-812tb4k-72292-8bcj7-master-0         Ready    control-plane,master   32m   v1.31.3
    ci-ln-812tb4k-72292-8bcj7-master-1         Ready    control-plane,master   32m   v1.31.3
    ci-ln-812tb4k-72292-8bcj7-master-2         Ready    control-plane,master   32m   v1.31.3
    ci-ln-812tb4k-72292-8bcj7-worker-a-zhdvk   Ready    worker                 19m   v1.31.3
    ci-ln-812tb4k-72292-8bcj7-worker-b-9hrmv   Ready    worker                 19m   v1.31.3
    ci-ln-812tb4k-72292-8bcj7-worker-c-q8mw2   Ready    worker                 19m   v1.31.3

    All nodes should show Ready in the STATUS column. It might take a few minutes for all nodes to become ready after approving the CSRs.

  5. Wait for the cluster Operators to restart and load the new certificates.

    This might take 5 to 10 minutes.

  6. Verify that all cluster Operators are in a good state by running the following command:

    $ oc get clusteroperators
    Example output
    NAME                      VERSION   AVAILABLE  PROGRESSING  DEGRADED  SINCE   MESSAGE
    authentication            4.18.0-0  True       False        False     51m
    baremetal                 4.18.0-0  True       False        False     72m
    cloud-controller-manager  4.18.0-0  True       False        False     75m
    cloud-credential          4.18.0-0  True       False        False     77m
    cluster-api               4.18.0-0  True       False        False     42m
    cluster-autoscaler        4.18.0-0  True       False        False     72m
    config-operator           4.18.0-0  True       False        False     72m
    console                   4.18.0-0  True       False        False     55m
    ...

    All cluster Operators should show AVAILABLE=True, PROGRESSING=False, and DEGRADED=False.
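
As an illustration of step 1, the following sketch assumes an AWS-hosted cluster and starts the stopped instances that carry the kubernetes.io/cluster/<infra_id> tag; <infra_id> is a placeholder for your cluster’s infrastructure ID, and other cloud providers have their own equivalents:

  $ aws ec2 describe-instances \
      --filters "Name=tag:kubernetes.io/cluster/<infra_id>,Values=owned" \
                "Name=instance-state-name,Values=stopped" \
      --query "Reservations[].Instances[].InstanceId" \
      --output text | xargs aws ec2 start-instances --instance-ids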
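
As an alternative for step 3, after you have reviewed the pending requests, you can approve all pending CSRs in a single pass. This sketch assumes that every pending CSR is valid and should be approved:

  $ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve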