You can use the Self Node Remediation Operator to automatically reboot unhealthy nodes. This remediation strategy minimizes downtime for stateful applications and ReadWriteOnce (RWO) volumes, and restores compute capacity if transient failures occur.
The Self Node Remediation Operator runs on the cluster nodes and reboots nodes that are identified as unhealthy. The Operator uses the MachineHealthCheck or NodeHealthCheck controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the MachineHealthCheck or the NodeHealthCheck resource creates the SelfNodeRemediation custom resource (CR), which triggers the Self Node Remediation Operator.
The SelfNodeRemediation CR resembles the following YAML file:
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediation
metadata:
  name: selfnoderemediation-sample
  namespace: openshift-operators
spec:
status:
  lastError: <last_error_message> (1)
1 | Displays the last error that occurred during remediation. When remediation succeeds or if no errors occur, the field is left empty. |
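To read the lastError field from a live cluster, a query along the following lines may help. This is a minimal sketch: the function name snr_last_error is hypothetical, and it assumes that the snr short name used in the troubleshooting commands of this document resolves to the SelfNodeRemediation resource.

```shell
# Hypothetical helper: print status.lastError of one SelfNodeRemediation CR.
# Arguments: <cr_name> <namespace>.
# An empty line means that no error is recorded on the CR.
snr_last_error() {
  oc get snr "$1" -n "$2" -o jsonpath='{.status.lastError}{"\n"}'
}
```

For example, snr_last_error selfnoderemediation-sample openshift-operators queries the sample CR shown above.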
The Self Node Remediation Operator minimizes downtime for stateful applications and restores compute capacity if transient failures occur. You can use this Operator regardless of the management interface, such as IPMI or an API to provision a node, and regardless of the cluster installation type, such as installer-provisioned infrastructure or user-provisioned infrastructure.
The Self Node Remediation Operator creates the SelfNodeRemediationConfig CR with the name self-node-remediation-config. The CR is created in the namespace of the Self Node Remediation Operator.
A change in the SelfNodeRemediationConfig CR re-creates the Self Node Remediation daemon set.
The SelfNodeRemediationConfig CR resembles the following YAML file:
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationConfig
metadata:
  name: self-node-remediation-config
  namespace: openshift-operators
spec:
  safeTimeToAssumeNodeRebootedSeconds: 180 (1)
  watchdogFilePath: /dev/watchdog (2)
  isSoftwareRebootEnabled: true (3)
  apiServerTimeout: 15s (4)
  apiCheckInterval: 5s (5)
  maxApiErrorThreshold: 3 (6)
  peerApiServerTimeout: 5s (7)
  peerDialTimeout: 5s (8)
  peerRequestTimeout: 5s (9)
  peerUpdateInterval: 15m (10)
1 | Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value. |
2 | Specify the file path of the watchdog device in the nodes. If you enter an incorrect path to the watchdog device, the Self Node Remediation Operator automatically detects the softdog device path.
If a watchdog device is unavailable, the |
3 | Specify if you want to enable software reboot of the unhealthy nodes. By default, the value of isSoftwareRebootEnabled is set to true . To disable the software reboot, set the parameter value to false . |
4 | Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation. The timeout duration must be more than or equal to 10 milliseconds. |
5 | Specify the frequency to check connectivity with each API server. The check interval must be greater than or equal to 1 second. |
6 | Specify a threshold value. After reaching this threshold, the node starts contacting its peers. The threshold value must be greater than or equal to 1. |
7 | Specify the duration of the timeout for the peer to connect the API server. The timeout duration must be more than or equal to 10 milliseconds. |
8 | Specify the duration of the timeout for establishing connection with the peer. The timeout duration must be more than or equal to 10 milliseconds. |
9 | Specify the duration of the timeout to get a response from the peer. The timeout duration must be more than or equal to 10 milliseconds. |
10 | Specify the frequency to update peer information, such as IP address. The update interval must be greater than or equal to 10 seconds. |
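One way to change a single parameter of the configuration CR is a JSON merge patch. The following is a minimal sketch rather than the documented procedure: the function name snrc_set is hypothetical, and it assumes the CR name self-node-remediation-config described above.

```shell
# Hypothetical helper: set one spec field of the SelfNodeRemediationConfig CR
# by using a JSON merge patch.
# Arguments: <namespace> <field> <json_value>.
# <json_value> must be valid JSON: 240 for a number, '"10s"' for a string.
snrc_set() {
  oc patch selfnoderemediationconfig self-node-remediation-config \
    -n "$1" --type merge -p "{\"spec\":{\"$2\":$3}}"
}
```

For example, snrc_set openshift-operators safeTimeToAssumeNodeRebootedSeconds 240 raises the reboot assumption timeout. Because a change in this CR re-creates the daemon set, expect the Self Node Remediation pods to restart after the patch.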
You can edit the
|
The Self Node Remediation Operator also creates the SelfNodeRemediationTemplate Custom Resource Definition (CRD). This CRD defines the remediation strategy for the nodes. The following remediation strategies are available:
ResourceDeletion
This remediation strategy removes the pods and associated volume attachments on the node rather than the node object. This strategy helps to recover workloads faster. ResourceDeletion is the default remediation strategy.
NodeDeletion
This remediation strategy removes the node object.
The Self Node Remediation Operator creates the following SelfNodeRemediationTemplate CRs for each strategy:
self-node-remediation-resource-deletion-template, which the ResourceDeletion remediation strategy uses
self-node-remediation-node-deletion-template, which the NodeDeletion remediation strategy uses
The SelfNodeRemediationTemplate CR resembles the following YAML file:
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationTemplate
metadata:
  creationTimestamp: "2022-03-02T08:02:40Z"
  name: self-node-remediation-<remediation_object>-deletion-template (1)
  namespace: openshift-operators
spec:
  template:
    spec:
      remediationStrategy: <remediation_strategy> (2)
1 | Specifies the type of remediation template based on the remediation strategy. Replace <remediation_object> with either resource or node , for example, self-node-remediation-resource-deletion-template . |
2 | Specifies the remediation strategy. The remediation strategy can either be ResourceDeletion or NodeDeletion . |
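To see which templates exist and which strategy each one declares, a listing such as the following may be useful. It is a sketch under assumptions: the function name snrt_strategies is hypothetical, and the snrt short name is taken from the cleanup commands later in this document.

```shell
# Hypothetical helper: list SelfNodeRemediationTemplate CRs in a namespace
# together with the remediation strategy that each one declares.
# Argument: <namespace>.
snrt_strategies() {
  oc get snrt -n "$1" \
    -o custom-columns='NAME:.metadata.name,STRATEGY:.spec.template.spec.remediationStrategy'
}
```

For example, snrt_strategies openshift-operators lists the templates in the default Operator namespace.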
Watchdog devices can be any of the following:
Independently powered hardware devices
Hardware devices that share power with the hosts they control
Virtual devices implemented in software, or softdog
Hardware watchdog and softdog devices have electronic or software timers, respectively. These watchdog devices are used to ensure that the machine enters a safe state when an error condition is detected. The cluster is required to repeatedly reset the watchdog timer to prove that it is in a healthy state. This timer might elapse due to fault conditions, such as deadlocks, CPU starvation, and loss of network or disk access. If the timer expires, the watchdog device assumes that a fault has occurred and the device triggers a forced reset of the node.
Hardware watchdog devices are more reliable than softdog devices.
The Self Node Remediation Operator determines the remediation strategy based on the watchdog devices that are present.
If a hardware watchdog device is configured and available, the Operator uses it for remediation. If a hardware watchdog device is not configured, the Operator enables and uses a softdog device for remediation.
If neither type of watchdog device is supported, either by the system or by the configuration, the Operator remediates nodes by using software reboot.
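To check which watchdog devices a particular node exposes, you could inspect the host from a debug pod. The following is a rough sketch, not part of the Operator: the function name node_watchdog_info is hypothetical, and it assumes that you are allowed to start a debug pod on the node and that lsmod is available on the host.

```shell
# Hypothetical helper: look for watchdog device files and a loaded softdog
# kernel module on one cluster node.
# Argument: <node_name>.
node_watchdog_info() {
  oc debug "node/$1" -- chroot /host sh -c \
    'ls -l /dev/watchdog* 2>/dev/null; lsmod | grep -i softdog || echo "softdog module not loaded"'
}
```

For example, node_watchdog_info <node_name> prints any /dev/watchdog* device files on that node and reports whether the softdog module is loaded.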
You can use the OpenShift Container Platform web console to install the Self Node Remediation Operator.
Log in as a user with cluster-admin privileges.
In the OpenShift Container Platform web console, navigate to Operators → OperatorHub.
Search for the Self Node Remediation Operator from the list of available Operators, and then click Install.
Keep the default selection of Installation mode and namespace to ensure that the Operator is installed to the openshift-operators namespace.
Click Install.
To confirm that the installation is successful:
Navigate to the Operators → Installed Operators page.
Check that the Operator is installed in the openshift-operators namespace and its status is Succeeded.
If the Operator is not installed successfully:
Navigate to the Operators → Installed Operators page and inspect the Status column for any errors or failures.
Navigate to the Workloads → Pods page and check the logs in any pods in the self-node-remediation-controller-manager project that are reporting issues.
You can use the OpenShift CLI (oc) to install the Self Node Remediation Operator.
You can install the Self Node Remediation Operator in your own namespace or in the openshift-operators namespace.
To install the Operator in your own namespace, follow the steps in the procedure.
To install the Operator in the openshift-operators namespace, skip to step 3 of the procedure because the steps to create a new Namespace custom resource (CR) and an OperatorGroup CR are not required.
Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.
Create a Namespace custom resource (CR) for the Self Node Remediation Operator:
Define the Namespace CR and save the YAML file, for example, self-node-remediation-namespace.yaml:
apiVersion: v1
kind: Namespace
metadata:
  name: self-node-remediation
To create the Namespace CR, run the following command:
$ oc create -f self-node-remediation-namespace.yaml
Create an OperatorGroup CR:
Define the OperatorGroup CR and save the YAML file, for example, self-node-remediation-operator-group.yaml:
apiVersion: operators.coreos.com/v1
kind: OperatorGroup
metadata:
  name: self-node-remediation-operator
  namespace: self-node-remediation
To create the OperatorGroup CR, run the following command:
$ oc create -f self-node-remediation-operator-group.yaml
Create a Subscription CR:
Define the Subscription CR and save the YAML file, for example, self-node-remediation-subscription.yaml:
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: self-node-remediation-operator
  namespace: self-node-remediation (1)
spec:
  channel: stable
  installPlanApproval: Manual (2)
  name: self-node-remediation-operator
  source: redhat-operators
  sourceNamespace: openshift-marketplace
  package: self-node-remediation
1 | Specify the Namespace where you want to install the Self Node Remediation Operator. To install the Self Node Remediation Operator in the openshift-operators namespace, specify openshift-operators in the Subscription CR. |
2 | Set the approval strategy to Manual in case your specified version is superseded by a later version in the catalog. This plan prevents an automatic upgrade to a later version and requires manual approval before the starting CSV can complete the installation. |
To create the Subscription CR, run the following command:
$ oc create -f self-node-remediation-subscription.yaml
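Because the Subscription sets installPlanApproval to Manual, Operator Lifecycle Manager generally waits for the generated install plan to be approved before the CSV can reach the Succeeded phase. A minimal sketch of that approval follows; the function names are hypothetical, and the install plan name must be taken from the listing because it is generated by the cluster.

```shell
# Hypothetical helper: list the install plans in a namespace.
# Argument: <namespace>.
list_install_plans() {
  oc get installplan -n "$1"
}

# Hypothetical helper: approve one install plan so that the installation proceeds.
# Arguments: <install_plan_name> <namespace>.
approve_install_plan() {
  oc patch installplan "$1" -n "$2" --type merge -p '{"spec":{"approved":true}}'
}
```

For example, run list_install_plans self-node-remediation first, and then pass the name that it reports to approve_install_plan.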
Verify that the installation succeeded by inspecting the CSV resource:
$ oc get csv -n self-node-remediation
NAME DISPLAY VERSION REPLACES PHASE
self-node-remediation.v.0.4.0 Self Node Remediation Operator v.0.4.0 Succeeded
Verify that the Self Node Remediation Operator is up and running:
$ oc get deploy -n self-node-remediation
NAME READY UP-TO-DATE AVAILABLE AGE
self-node-remediation-controller-manager 1/1 1 1 28h
Verify that the Self Node Remediation Operator created the SelfNodeRemediationConfig CR:
$ oc get selfnoderemediationconfig -n self-node-remediation
NAME AGE
self-node-remediation-config 28h
Verify that each self node remediation pod is scheduled and running on each worker node:
$ oc get daemonset -n self-node-remediation
NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE
self-node-remediation-ds 3 3 3 3 3 <none> 28h
This command is unsupported for the control plane nodes.
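The daemon set check above can also be scripted. The following sketch compares the desired and ready pod counts of the self-node-remediation-ds daemon set and returns a nonzero status when they differ. The function name snr_ds_ready is hypothetical, and the status fields are the standard DaemonSet fields desiredNumberScheduled and numberReady.

```shell
# Hypothetical helper: succeed only when every desired daemon set pod is ready.
# Argument: <namespace>.
# Note: counts, desired, and ready are plain (global) shell variables.
snr_ds_ready() {
  counts=$(oc get daemonset self-node-remediation-ds -n "$1" \
    -o jsonpath='{.status.desiredNumberScheduled} {.status.numberReady}')
  desired=${counts% *}   # text before the last space
  ready=${counts#* }     # text after the first space
  echo "desired=$desired ready=$ready"
  [ -n "$desired" ] && [ "$desired" = "$ready" ]
}
```

For example, snr_ds_ready self-node-remediation && echo "all pods ready" prints the confirmation only when both counts match.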
Use the following procedure to configure the machine health checks to use the Self Node Remediation Operator as a remediation provider.
Install the OpenShift CLI (oc).
Log in as a user with cluster-admin privileges.
Create a SelfNodeRemediationTemplate CR:
Define the SelfNodeRemediationTemplate CR:
apiVersion: self-node-remediation.medik8s.io/v1alpha1
kind: SelfNodeRemediationTemplate
metadata:
  namespace: openshift-machine-api
  name: selfnoderemediationtemplate-sample
spec:
  template:
    spec:
      remediationStrategy: ResourceDeletion (1)
1 | Specifies the remediation strategy. The default strategy is ResourceDeletion . |
To create the SelfNodeRemediationTemplate CR, run the following command:
$ oc create -f <snr-name>.yaml
Create or update the MachineHealthCheck CR to point to the SelfNodeRemediationTemplate CR:
Define or update the MachineHealthCheck CR:
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: machine-health-check
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: "worker"
      machine.openshift.io/cluster-api-machine-type: "worker"
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  maxUnhealthy: "40%"
  nodeStartupTimeout: "10m"
  remediationTemplate: (1)
    kind: SelfNodeRemediationTemplate
    apiVersion: self-node-remediation.medik8s.io/v1alpha1
    name: selfnoderemediationtemplate-sample
1 | Specifies the details for the remediation template. |
To create a MachineHealthCheck CR, run the following command:
$ oc create -f <file-name>.yaml
To update a MachineHealthCheck CR, run the following command:
$ oc apply -f <file-name>.yaml
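After you create or update the MachineHealthCheck CR, you might want to confirm that it references the intended template. The following is a minimal sketch: the function name mhc_remediation_template is hypothetical, and the namespace openshift-machine-api is taken from the example CR above.

```shell
# Hypothetical helper: print the kind and name of the remediation template
# that a MachineHealthCheck CR references, in the form <kind>/<name>.
# Argument: <machine_health_check_name>.
mhc_remediation_template() {
  oc get machinehealthcheck "$1" -n openshift-machine-api \
    -o jsonpath='{.spec.remediationTemplate.kind}/{.spec.remediationTemplate.name}{"\n"}'
}
```

For example, mhc_remediation_template machine-health-check queries the CR defined in this procedure.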
You want to troubleshoot issues with the Self Node Remediation Operator.
Check the Operator logs.
The Self Node Remediation Operator is installed but the daemon set is not available.
Check the Operator logs for errors or warnings.
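One way to read the Operator logs from the CLI is to query the controller manager deployment that the installation verification lists. This is a sketch: the function name snr_operator_logs is hypothetical, and the default of 100 lines is an arbitrary choice.

```shell
# Hypothetical helper: print recent log lines of the Operator controller manager.
# Arguments: <namespace> [<line_count>], where <line_count> defaults to 100.
snr_operator_logs() {
  oc logs deployment/self-node-remediation-controller-manager \
    -n "$1" --tail="${2:-100}"
}
```

For example, snr_operator_logs self-node-remediation 200 prints the last 200 log lines.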
An unhealthy node was not remediated.
Verify that the SelfNodeRemediation CR was created by running the following command:
$ oc get snr -A
If the MachineHealthCheck controller did not create the SelfNodeRemediation CR when the node turned unhealthy, check the logs of the MachineHealthCheck controller. Additionally, ensure that the MachineHealthCheck CR includes the required specification to use the remediation template.
If the SelfNodeRemediation CR was created, ensure that its name matches the unhealthy node or the machine object.
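To check for a SelfNodeRemediation CR with a specific name, a field selector on metadata.name can narrow the listing. The following is a sketch under assumptions: the function name snr_named is hypothetical, and the argument can be either the node name or the machine object name, depending on which one the CR is expected to match.

```shell
# Hypothetical helper: list SelfNodeRemediation CRs in all namespaces whose
# name equals the given node or machine name.
# Argument: <node_or_machine_name>.
snr_named() {
  oc get snr -A --field-selector "metadata.name=$1"
}
```

If the command reports that no resources were found, no CR with that name exists in any namespace.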
The Self Node Remediation Operator resources, such as the daemon set, configuration CR, and the remediation template CR, still exist after you uninstall the Operator.
To remove the Self Node Remediation Operator resources, delete the resources by running the following commands for each resource type:
$ oc delete ds <self-node-remediation-ds> -n <namespace>
$ oc delete snrc <self-node-remediation-config> -n <namespace>
$ oc delete snrt <self-node-remediation-template> -n <namespace>
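Before and after deleting, it can help to list what remains in the namespace. The following sketch reuses the ds, snrc, and snrt resource names from the commands above; the function name snr_leftovers is hypothetical.

```shell
# Hypothetical helper: list daemon sets, configuration CRs, and remediation
# template CRs that remain in a namespace.
# Argument: <namespace>.
snr_leftovers() {
  oc get ds,snrc,snrt -n "$1"
}
```

For example, snr_leftovers <namespace> shows the resources that are candidates for the delete commands.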
To collect debugging information about the Self Node Remediation Operator, use the must-gather tool. For information about the must-gather image for the Self Node Remediation Operator, see Gathering data about specific features.
The Self Node Remediation Operator is supported in a restricted network environment. For more information, see Using Operator Lifecycle Manager on restricted networks.