OpenShift Virtualization supports using disaster recovery (DR) solutions to ensure that your environment can recover after a site outage. To use these methods, you must plan your OpenShift Virtualization deployment in advance.
For an overview of disaster recovery (DR) concepts, architecture, and planning considerations, see the Red Hat OpenShift Virtualization disaster recovery guide in the Red Hat Knowledgebase.
The two primary DR methods for OpenShift Virtualization are Metropolitan Disaster Recovery (Metro-DR) and Regional-DR.
Metro-DR uses synchronous replication. It writes to storage at both the primary and secondary sites so that the data is always synchronized between sites. Because the storage provider is responsible for ensuring that the synchronization succeeds, the environment must meet the throughput and latency requirements of the storage provider.
Define applications for disaster recovery by using VMs that Red Hat Advanced Cluster Management (RHACM) manages or discovers.
An RHACM-managed application that includes a VM must be created by using a GitOps workflow and by creating an RHACM application or ApplicationSet
.
There are several actions you can take to improve your experience and chance of success when defining an RHACM-managed VM.
Because data volumes create persistent volume claims (PVCs) implicitly, data volumes and VMs with data volume templates do not fit as neatly into the GitOps model.
Select a RHEL image from the software catalog to use the import method. Red Hat recommends using a specific version of the image rather than a floating tag for consistent results. The KubeVirt community maintains container disks for other operating systems in a Quay repository.
pullMethod: node
Use the pod pullMethod: node
when creating a data volume from a registry source to take advantage of the OpenShift Container Platform pull secret, which is required to pull container images from the Red Hat registry.
You can configure any VM in the cluster that is not an RHACM-managed application as an RHACM-discovered application. This includes VMs imported by using the Migration Toolkit for Virtualization (MTV), VMs created by using the OpenShift Virtualization web console, or VMs created by any other means, such as the CLI.
There are several actions you can take to improve your experience and chance of success when defining an RHACM-discovered VM.
Because automatic labeling is not currently available, the application owner must manually label the components of the VM application when using MTV, the OpenShift Virtualization web console, or a custom VM.
After creating the VM, apply a common label to the following resources associated with the VM: VirtualMachine
, DataVolume
, PersistentVolumeClaim
, Service
, Route
, Secret
, ConfigMap
, VirtualMachinePreference
, and VirtualMachineInstancetype
. Do not label virtual machine instances (VMIs) or pods; OpenShift Virtualization creates and manages these automatically.
You must apply the common label to everything in the namespace that you want to protect, including objects that you added to the VM that are not listed here. |
VirtualMachine
object in the VMWorking VMs typically also contain data volumes, persistent volume claims (PVCs), services, routes, secrets, ConfigMap
objects, and VirtualMachineSnapshot
objects.
This includes other pod-based workloads and VMs.
VMs typically act similarly to pod-based workloads during both relocate and failover disaster recovery flows.
Use relocate to move an application from the primary environment to the secondary environment when the primary environment is still accessible. During relocate, the VM is gracefully terminated, any unreplicated data is synchronized to the secondary environment, and the VM starts in the secondary environment.
Because the VM terminates gracefully, there is no data loss. Therefore, the VM operating system will not perform crash recovery.
Use failover when there is a critical failure in the primary environment that makes it impractical or impossible to use relocation to move the workload to a secondary environment. When failover is executed, the storage is fenced from the primary environment, the I/O to the VM disks is abruptly halted, and the VM restarts in the secondary environment using the replicated data.
You should expect data loss due to failover. The extent of loss depends on whether you use Metro-DR, which uses synchronous replication, or Regional-DR, which uses asynchronous replication. Because Regional-DR uses snapshot-based replication intervals, the window of data loss is proportional to the replication interval length. When the VM restarts, the operating system might perform crash recovery.
The following DR solutions combine Red Hat Advanced Cluster Management (RHACM), Red Hat Ceph Storage, and OpenShift Data Foundation components. You can use them to failover applications from the primary to the secondary site, and to relocate the applications back to the primary site after you restore the disaster site.
OpenShift Virtualization supports the Metro-DR solution for OpenShift Data Foundation, which provides two-way synchronous data replication between managed OpenShift Virtualization clusters installed on primary and secondary sites.
This synchronous solution is only available to metropolitan distance data centers with a network round-trip latency of 10 milliseconds or less.
Multiple disk VMs are supported.
To prevent data corruption, you must ensure that storage is fenced during failover.
Fencing means isolating a node so that workloads do not run on it. |
For more information about using the Metro-DR solution for OpenShift Data Foundation with OpenShift Virtualization, see IBM’s OpenShift Data Foundation Metro-DR documentation.
OpenShift Virtualization supports the Regional-DR solution for OpenShift Data Foundation, which provides asynchronous data replication at regular intervals between managed OpenShift Virtualization clusters installed on primary and secondary sites.
Regional-DR supports higher network latency between the primary and secondary sites.
Regional-DR uses RBD snapshots to replicate data asynchronously. Currently, your applications must be resilient to small variances between VM disks. You can prevent these variances by using single disk VMs.
Using the import method when selecting a population source for your VM disk is recommended. However, you can protect VMs that use cloned PVCs if you select a VolumeReplicationClass
that enables image flattening. For more information, see the OpenShift Data Foundation documentation.
For more information about using the Regional-DR solution for OpenShift Data Foundation with OpenShift Virtualization, see IBM’s OpenShift Data Foundation Regional-DR documentation.