# RELEASE_IMAGE=<release_image> (1)
Follow this procedure to recover from a situation where your control plane certificates have expired.
SSH access to master hosts.
Access a master host with an expired certificate as the root user.
Obtain the cluster-kube-apiserver-operator
image reference for a release.
# RELEASE_IMAGE=<release_image> (1)
1 | An example value for <release_image> is quay.io/openshift-release-dev/ocp-release:4.3.0-x86_64 . See the Repository Tags page for a list of available tags. |
# KAO_IMAGE=$( oc adm release info --registry-config='/var/lib/kubelet/config.json' "${RELEASE_IMAGE}" --image-for=cluster-kube-apiserver-operator )
Pull the cluster-kube-apiserver-operator
image.
# podman pull --authfile=/var/lib/kubelet/config.json "${KAO_IMAGE}"
Create a recovery API server.
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver create
Run the export KUBECONFIG
command from the output of the above command, which is needed for the oc
commands later in this procedure.
# export KUBECONFIG=/<path_to_recovery_kubeconfig>/admin.kubeconfig
Wait for the recovery API server to come up.
# until oc get namespace kube-system 2>/dev/null 1>&2; do echo 'Waiting for recovery apiserver to come up.'; sleep 1; done
Run the regenerate-certificates
command. It fixes the certificates in the API, overwrites the old certificates on the local drive, and restarts static Pods to pick them up.
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" regenerate-certificates
After the certificates are fixed in the API, use the following commands to force new rollouts for the control plane. It will reinstall itself on the other nodes because the kubelet is connected to API servers using an internal load balancer.
# oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
# oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
# oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge
Create a bootstrap kubeconfig with a valid user.
Run the recover-kubeconfig.sh
script and save the output to a file called kubeconfig
.
# recover-kubeconfig.sh > kubeconfig
Copy the kubeconfig
file to all master hosts and move it to /etc/kubernetes/kubeconfig
.
Get the CA certificate used to validate connections from the API server.
# oc get configmap kube-apiserver-to-kubelet-client-ca -n openshift-kube-apiserver-operator --template='{{ index .data "ca-bundle.crt" }}' > /etc/kubernetes/kubelet-ca.crt
Copy the /etc/kubernetes/kubelet-ca.crt
file to all other master hosts and nodes.
Add the machine-config-daemon-force
file to all master hosts and nodes to
force the Machine Config Daemon to accept this certificate update.
# touch /run/machine-config-daemon-force
Recover the kubelet on all masters.
On a master host, stop the kubelet.
# systemctl stop kubelet
Delete stale kubelet data.
# rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig
Restart the kubelet.
# systemctl start kubelet
Repeat these steps on all other master hosts.
If necessary, recover the kubelet on the worker nodes.
After the master nodes are restored, the worker nodes might restore themselves. You can verify this by running the oc get nodes
command. If the worker nodes are not listed, then perform the following steps on each worker node.
Stop the kubelet.
# systemctl stop kubelet
Delete stale kubelet data.
# rm -rf /var/lib/kubelet/pki /var/lib/kubelet/kubeconfig
Restart the kubelet.
# systemctl start kubelet
Approve the pending node-bootstrapper
certificates signing requests (CSRs).
Get the list of current CSRs.
# oc get csr
Review the details of a CSR to verify it is valid.
# oc describe csr <csr_name> (1)
1 | <csr_name> is the name of a CSR from the list of current CSRs. |
Approve each valid CSR.
# oc adm certificate approve <csr_name>
Be sure to approve all pending node-bootstrapper
CSRs.
Destroy the recovery API server because it is no longer needed.
# podman run -it --network=host -v /etc/kubernetes/:/etc/kubernetes/:Z --entrypoint=/usr/bin/cluster-kube-apiserver-operator "${KAO_IMAGE}" recovery-apiserver destroy
Wait for the control plane to restart and pick up the new certificates. This might take up to 10 minutes.