You can use the Lifecycle Agent to do a manual image-based upgrade of a single-node OpenShift cluster.
When you deploy the Lifecycle Agent on a cluster, an ImageBasedUpgrade custom resource (CR) is automatically created. You update this CR to specify the image repository of the seed image and to move through the different stages.
After you have created all the resources that you need during the upgrade, you can move on to the Prep stage.
For more information, see the "Creating ConfigMap objects for the image-based upgrade with Lifecycle Agent" section.
In a disconnected environment, if the seed cluster's release image registry is different from the target cluster's release image registry, you must create an image mirroring resource, such as an ImageDigestMirrorSet (IDMS), so that the target cluster can pull the release images. You can retrieve the release registry used in the seed image by running the following command:
$ skopeo inspect docker://<imagename> | jq -r '.Labels."com.openshift.lifecycle-agent.seed_cluster_info" | fromjson | .release_registry'
You have created resources to back up and restore your clusters.
Check that you have patched your ImageBasedUpgrade CR:
apiVersion: lca.openshift.io/v1
kind: ImageBasedUpgrade
metadata:
  name: upgrade
spec:
  stage: Idle
  seedImageRef:
    version: 4.15.2 (1)
    image: <seed_container_image> (2)
    pullSecretRef: <seed_pull_secret> (3)
  autoRollbackOnFailure: {}
  # initMonitorTimeoutSeconds: 1800 (4)
  extraManifests: (5)
  - name: example-extra-manifests-cm
    namespace: openshift-lifecycle-agent
  - name: example-catalogsources-cm
    namespace: openshift-lifecycle-agent
  oadpContent: (6)
  - name: oadp-cm-example
    namespace: openshift-adp
1 | Target platform version. The value must match the version of the seed image. |
2 | Repository where the target cluster can pull the seed image from. |
3 | Reference to a secret with credentials to pull container images if the images are in a private registry. |
4 | Optional: Time frame in seconds to roll back if the upgrade does not complete within that time frame after the first reboot. If not defined or set to 0, the default value of 1800 seconds (30 minutes) is used. |
5 | Optional: List of ConfigMap resources that contain your custom catalog sources to retain after the upgrade and your extra manifests to apply to the target cluster that are not part of the seed image. |
6 | List of ConfigMap resources that contain the OADP Backup and Restore CRs. |
To start the Prep stage, change the value of the stage field to Prep in the ImageBasedUpgrade CR by running the following command:
$ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Prep"}}' --type=merge -n openshift-lifecycle-agent
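If you prefer not to poll the CR manually, you can wait for the stage to finish. The following is a minimal sketch that assumes the PrepCompleted condition type reported in the ImageBasedUpgrade status, as shown in the example output later in this procedure; the 30-minute timeout is only an example:
$ oc wait ibu upgrade --for=condition=PrepCompleted --timeout=30m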
If you provide ConfigMap objects for OADP resources and extra manifests, Lifecycle Agent validates the specified ConfigMap objects during the Prep stage.
You might encounter the following issues:
Validation warnings or errors if the Lifecycle Agent detects any issues with the extraManifests parameters.
Validation errors if the Lifecycle Agent detects any issues with the oadpContent parameters.
Validation warnings do not block the Upgrade stage but you must decide if it is safe to proceed with the upgrade. These warnings, for example missing CRDs, namespaces, or dry run failures, update the status.conditions for the Prep stage and annotation fields in the ImageBasedUpgrade CR with details about the warning.
# ...
metadata:
  annotations:
    extra-manifest.lca.openshift.io/validation-warning: '...'
# ...
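To review a warning without opening the full CR, you can print the annotations directly. This is a sketch that simply dumps all annotations on the CR, including the validation warning shown above:
$ oc get ibu upgrade -o jsonpath='{.metadata.annotations}'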
However, validation errors, such as adding MachineConfig or Operator manifests to extra manifests, cause the Prep stage to fail and block the Upgrade stage.
When the validations pass, the cluster creates a new ostree stateroot, which involves pulling and unpacking the seed image, and running host-level commands. Finally, all the required images are precached on the target cluster.
Check the status of the ImageBasedUpgrade CR by running the following command:
$ oc get ibu -o yaml
conditions:
- lastTransitionTime: "2024-01-01T09:00:00Z"
  message: In progress
  observedGeneration: 13
  reason: InProgress
  status: "False"
  type: Idle
- lastTransitionTime: "2024-01-01T09:00:00Z"
  message: Prep completed
  observedGeneration: 13
  reason: Completed
  status: "False"
  type: PrepInProgress
- lastTransitionTime: "2024-01-01T09:00:00Z"
  message: Prep stage completed successfully
  observedGeneration: 13
  reason: Completed
  status: "True"
  type: PrepCompleted
observedGeneration: 13
validNextStages:
- Idle
- Upgrade
After you generate the seed image and complete the Prep stage, you can upgrade the target cluster.
During the upgrade process, the OADP Operator creates a backup of the artifacts specified in the OADP custom resources (CRs), then the Lifecycle Agent upgrades the cluster.
If the upgrade fails or stops, an automatic rollback is initiated. If you have an issue after the upgrade, you can initiate a manual rollback. For more information about manual rollback, see "Moving to the Rollback stage of the image-based upgrade with Lifecycle Agent".
You have completed the Prep stage.
To move to the Upgrade stage, change the value of the stage field to Upgrade in the ImageBasedUpgrade CR by running the following command:
$ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Upgrade"}}' --type=merge
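Optionally, you can block until the upgrade finishes. This sketch assumes the UpgradeCompleted condition type shown in the status output below; because the target cluster reboots during this stage, the command can fail while the API server is unreachable and might need to be rerun:
$ oc wait ibu upgrade --for=condition=UpgradeCompleted --timeout=60m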
Check the status of the ImageBasedUpgrade
CR by running the following command:
$ oc get ibu -o yaml
status:
  conditions:
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: In progress
    observedGeneration: 5
    reason: InProgress
    status: "False"
    type: Idle
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: Prep completed
    observedGeneration: 5
    reason: Completed
    status: "False"
    type: PrepInProgress
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: Prep completed successfully
    observedGeneration: 5
    reason: Completed
    status: "True"
    type: PrepCompleted
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: |-
      Waiting for system to stabilize: one or more health checks failed
        - one or more ClusterOperators not yet ready: authentication
        - one or more MachineConfigPools not yet ready: master
        - one or more ClusterServiceVersions not yet ready: sriov-fec.v2.8.0
    observedGeneration: 1
    reason: InProgress
    status: "True"
    type: UpgradeInProgress
  observedGeneration: 1
  rollbackAvailabilityExpiration: "2024-05-19T14:01:52Z"
  validNextStages:
  - Rollback
The OADP Operator creates a backup of the data specified in the OADP Backup and Restore CRs and the target cluster reboots.
Monitor the status of the CR by running the following command:
$ oc get ibu -o yaml
If you are satisfied with the upgrade, finalize the changes by patching the value of the stage field to Idle in the ImageBasedUpgrade CR by running the following command:
$ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Idle"}}' --type=merge
You cannot roll back the changes after you move to the Idle stage.
The Lifecycle Agent deletes all resources created during the upgrade process.
You can remove the OADP Operator and its configuration files after a successful upgrade. For more information, see "Deleting Operators from a cluster".
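If you want to wait for this cleanup to finish before removing the OADP Operator, the following sketch waits for the Idle condition type shown in the status output to become true; the timeout is only an example:
$ oc wait ibu upgrade --for=condition=Idle --timeout=30m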
Check the status of the ImageBasedUpgrade CR by running the following command:
$ oc get ibu -o yaml
status:
  conditions:
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: In progress
    observedGeneration: 5
    reason: InProgress
    status: "False"
    type: Idle
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: Prep completed
    observedGeneration: 5
    reason: Completed
    status: "False"
    type: PrepInProgress
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: Prep completed successfully
    observedGeneration: 5
    reason: Completed
    status: "True"
    type: PrepCompleted
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: Upgrade completed
    observedGeneration: 1
    reason: Completed
    status: "False"
    type: UpgradeInProgress
  - lastTransitionTime: "2024-01-01T09:00:00Z"
    message: Upgrade completed
    observedGeneration: 1
    reason: Completed
    status: "True"
    type: UpgradeCompleted
  observedGeneration: 1
  rollbackAvailabilityExpiration: "2024-01-01T09:00:00Z"
  validNextStages:
  - Idle
  - Rollback
Check the status of the cluster restoration by running the following command:
$ oc get restores -n openshift-adp -o custom-columns=NAME:.metadata.name,Status:.status.phase,Reason:.status.failureReason
NAME             Status      Reason
acm-klusterlet   Completed   <none> (1)
apache-app       Completed   <none>
localvolume      Completed   <none>
1 | The acm-klusterlet is specific to RHACM environments only. |
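You can check the corresponding backups in the same way. This is a sketch that assumes the OADP Backup CRs are created in the openshift-adp namespace, as configured in the oadpContent section:
$ oc get backups.velero.io -n openshift-adp -o custom-columns=NAME:.metadata.name,Status:.status.phase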
An automatic rollback is initiated if the upgrade does not complete within the time frame specified in the initMonitorTimeoutSeconds field after rebooting.
apiVersion: lca.openshift.io/v1
kind: ImageBasedUpgrade
metadata:
  name: upgrade
spec:
  stage: Idle
  seedImageRef:
    version: 4.15.2
    image: <seed_container_image>
  autoRollbackOnFailure: {}
  # initMonitorTimeoutSeconds: 1800 (1)
# ...
1 | Optional: The time frame in seconds to roll back if the upgrade does not complete within that time frame after the first reboot. If not defined or set to 0, the default value of 1800 seconds (30 minutes) is used. |
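If you want a different rollback window, you can patch the field instead of editing the CR manually. This is a sketch that assumes the field belongs to spec.autoRollbackOnFailure, as the commented example above suggests, and that you apply it before moving to the Upgrade stage; the 3600-second value is only an example:
$ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"autoRollbackOnFailure": {"initMonitorTimeoutSeconds": 3600}}}' --type=merge -n openshift-lifecycle-agent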
You can manually roll back the changes if you encounter unresolvable issues after an upgrade.
You have logged into the hub cluster as a user with cluster-admin privileges.
You ensured that the control plane certificates on the original stateroot are valid. If the certificates expired, see "Recovering from expired control plane certificates".
To move to the Rollback stage, patch the value of the stage field to Rollback in the ImageBasedUpgrade CR by running the following command:
$ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Rollback"}}' --type=merge
The Lifecycle Agent reboots the cluster with the previously installed version of OpenShift Container Platform and restores the applications.
If you are satisfied with the changes, finalize the rollback by patching the value of the stage field to Idle in the ImageBasedUpgrade CR by running the following command:
$ oc patch imagebasedupgrades.lca.openshift.io upgrade -p='{"spec": {"stage": "Idle"}}' --type=merge -n openshift-lifecycle-agent
If you move to the |
Perform troubleshooting steps on the managed clusters that are affected by an issue.
If you are using the |
You can use the oc adm must-gather CLI to collect information for debugging and troubleshooting.
Collect data about the Operators by running the following command:
$ oc adm must-gather \
--dest-dir=must-gather/tmp \
--image=$(oc -n openshift-lifecycle-agent get deployment.apps/lifecycle-agent-controller-manager -o jsonpath='{.spec.template.spec.containers[?(@.name == "manager")].image}') \
--image=quay.io/konveyor/oadp-must-gather:latest \ (1)
--image=quay.io/openshift/origin-must-gather:latest (2)
1 | Optional: Add this option if you need to gather more information from the OADP Operator. |
2 | Optional: Add this option if you need to gather more information from the SR-IOV Operator. |
During the finalize stage or when you stop the process at the Prep stage, Lifecycle Agent cleans up the following resources:
Stateroot that is no longer required
Precaching resources
OADP CRs
ImageBasedUpgrade CR
If the Lifecycle Agent fails to perform the above steps, it transitions to the AbortFailed or FinalizeFailed states.
The condition message and log show which steps failed.
message: failed to delete all the backup CRs. Perform cleanup manually then add 'lca.openshift.io/manual-cleanup-done' annotation to ibu CR to transition back to Idle
observedGeneration: 5
reason: AbortFailed
status: "False"
type: Idle
Inspect the logs to determine why the failure occurred.
To prompt Lifecycle Agent to retry the cleanup, add the lca.openshift.io/manual-cleanup-done annotation to the ImageBasedUpgrade CR.
After observing this annotation, Lifecycle Agent retries the cleanup and, if it is successful, the ImageBasedUpgrade stage transitions to Idle.
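For example, the following sketch adds the annotation with the oc CLI. The value is illustrative, because the Lifecycle Agent is assumed to react to the presence of the annotation key rather than to a specific value:
$ oc annotate ibu upgrade lca.openshift.io/manual-cleanup-done=true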
If the cleanup fails again, you can manually clean up the resources.
If you stop the process at the Prep stage, the Lifecycle Agent cleans up the new stateroot. When finalizing after a successful upgrade or a rollback, the Lifecycle Agent cleans up the old stateroot.
If this step fails, it is recommended that you inspect the logs to determine why the failure occurred.
Check if there are any existing deployments in the stateroot by running the following command:
$ ostree admin status
If there are any, clean up the existing deployment by running the following command:
$ ostree admin undeploy <index_of_deployment>
After cleaning up all the deployments of the stateroot, wipe the stateroot directory by running the following commands:
Ensure that the booted deployment is not in this stateroot.
$ stateroot="<stateroot_to_delete>"
$ unshare -m /bin/sh -c "mount -o remount,rw /sysroot && rm -rf /sysroot/ostree/deploy/${stateroot}"
Automatic cleanup of OADP resources can fail due to connection issues between the Lifecycle Agent and the S3 backend. After you restore the connection and add the lca.openshift.io/manual-cleanup-done annotation, the Lifecycle Agent can successfully clean up the backup resources.
Check the backend connectivity by running the following command:
$ oc get backupstoragelocations.velero.io -n openshift-adp
NAME                          PHASE       LAST VALIDATED   AGE   DEFAULT
dataprotectionapplication-1   Available   33s              8d    true
Remove all backup resources and then add the lca.openshift.io/manual-cleanup-done annotation to the ImageBasedUpgrade CR.
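As a sketch, assuming that all remaining Backup CRs live in the openshift-adp namespace and that only the presence of the annotation key matters, the cleanup could look like this:
$ oc delete backups.velero.io -n openshift-adp --all
$ oc annotate ibu upgrade lca.openshift.io/manual-cleanup-done=true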
When LVM Storage is used to provide dynamic persistent volume storage, LVM Storage might not restore the persistent volume contents if it is configured incorrectly.
Your Backup CRs might be missing fields that are needed to restore your persistent volumes.
You can check for events in your application pod to determine if you have this issue by running the following command:
$ oc describe pod <your_app_name>
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Warning FailedScheduling 58s (x2 over 66s) default-scheduler 0/1 nodes are available: pod has unbound immediate PersistentVolumeClaims. preemption: 0/1 nodes are available: 1 Preemption is not helpful for scheduling..
Normal Scheduled 56s default-scheduler Successfully assigned default/db-1234 to sno1.example.lab
Warning FailedMount 24s (x7 over 55s) kubelet MountVolume.SetUp failed for volume "pvc-1234" : rpc error: code = Unknown desc = VolumeID is not found
You must include logicalvolumes.topolvm.io in the application Backup CR.
Without this resource, the application restores its persistent volume claims and persistent volume manifests correctly; however, the logicalvolume associated with this persistent volume is not restored properly after pivot.
apiVersion: velero.io/v1
kind: Backup
metadata:
  labels:
    velero.io/storage-location: default
  name: small-app
  namespace: openshift-adp
spec:
  includedNamespaces:
  - test
  includedNamespaceScopedResources:
  - secrets
  - persistentvolumeclaims
  - deployments
  - statefulsets
  includedClusterScopedResources: (1)
  - persistentVolumes
  - volumesnapshotcontents
  - logicalvolumes.topolvm.io
1 | To restore the persistent volumes for your application, you must configure this section as shown. |
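Because the Lifecycle Agent consumes the OADP CRs through the ConfigMap objects listed in oadpContent, regenerate the ConfigMap after you edit the Backup CR. This is a sketch only: the file name backup-small-app.yaml is illustrative, oadp-cm-example matches the name used earlier in this document, and if your ConfigMap also wraps Restore CRs you must add them with additional --from-file flags:
$ oc create configmap oadp-cm-example -n openshift-adp --from-file=backup-small-app.yaml --dry-run=client -o yaml | oc apply -f -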
The expected resources for the applications are restored but the persistent volume contents are not preserved after upgrading.
List the persistent volumes for your applications by running the following command before pivot:
$ oc get pv,pvc,logicalvolumes.topolvm.io -A
NAME                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
persistentvolume/pvc-1234   1Gi        RWO            Retain           Bound    default/pvc-db   lvms-vg1                4h45m

NAMESPACE   NAME                           STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
default     persistentvolumeclaim/pvc-db   Bound    pvc-1234   1Gi        RWO            lvms-vg1       4h45m

NAMESPACE   NAME                                AGE
            logicalvolume.topolvm.io/pvc-1234   4h45m
List the persistent volumes for your applications by running the following command after pivot:
$ oc get pv,pvc,logicalvolumes.topolvm.io -A
NAME                        CAPACITY   ACCESS MODES   RECLAIM POLICY   STATUS   CLAIM            STORAGECLASS   REASON   AGE
persistentvolume/pvc-1234   1Gi        RWO            Delete           Bound    default/pvc-db   lvms-vg1                19s

NAMESPACE   NAME                           STATUS   VOLUME     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
default     persistentvolumeclaim/pvc-db   Bound    pvc-1234   1Gi        RWO            lvms-vg1       19s

NAMESPACE   NAME                                AGE
            logicalvolume.topolvm.io/pvc-1234   18s
The reason for this issue is that the logicalvolume status is not preserved in the Restore CR.
This status is important because it is required for Velero to reference the volumes that must be preserved after pivoting.
You must include the following fields in the application Restore CR:
apiVersion: velero.io/v1
kind: Restore
metadata:
  name: sample-vote-app
  namespace: openshift-adp
  labels:
    velero.io/storage-location: default
  annotations:
    lca.openshift.io/apply-wave: "3"
spec:
  backupName: sample-vote-app
  restorePVs: true (1)
  restoreStatus: (2)
    includedResources:
    - logicalvolumes
1 | To preserve the persistent volumes for your application, you must set restorePVs to true. |
2 | To preserve the persistent volumes for your application, you must configure this section as shown. |
The backup or restoration of artifacts failed.
You can debug Backup and Restore CRs and retrieve logs with the Velero CLI tool.
The Velero CLI tool provides more detailed information than the OpenShift CLI tool.
Describe the Backup CR that contains errors by running the following command:
$ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero describe backup -n openshift-adp backup-acm-klusterlet --details
Describe the Restore CR that contains errors by running the following command:
$ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero describe restore -n openshift-adp restore-acm-klusterlet --details
Download the backed up resources to a local directory by running the following command:
$ oc exec -n openshift-adp velero-7c87d58c7b-sw6fc -c velero -- ./velero backup download -n openshift-adp backup-acm-klusterlet -o ~/backup-acm-klusterlet.tar.gz
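The commands above reference a specific Velero pod name. As a sketch, assuming the OADP-managed Velero deployment is named velero in the openshift-adp namespace (as the pod name above suggests), you can target the deployment instead so that the command keeps working after the pod is rescheduled:
$ oc exec -n openshift-adp deployment/velero -c velero -- ./velero describe backup -n openshift-adp backup-acm-klusterlet --details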