$ oc get machinehealthcheck example -n openshift-machine-api
You can configure and deploy a machine health check to automatically repair damaged machines in a machine pool.
This process is not applicable to clusters where you manually provisioned the machines yourself. You can use the advanced machine management and scaling capabilities only in clusters where the machine API is operational. |
You can define conditions under which machines in a cluster are considered unhealthy by using a MachineHealthCheck
resource.
Machines matching the conditions are automatically remediated.
To monitor machine health, create a MachineHealthCheck
custom resource (CR) that includes a label for the set of machines to monitor and a condition to check, such as staying in the NotReady
status for 15 minutes or displaying a permanent condition in the node-problem-detector.
The controller that observes a MachineHealthCheck
CR checks for the condition that you defined. If a machine fails the health check, the machine is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a machine deleted
event.
For machines with the master role, the machine health check reports the number of unhealthy nodes, but the machine is not deleted. For example: Example output
To limit the disruptive impact of machine deletions, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the |
To stop the check, remove the custom resource.
There are limitations to consider before deploying a machine health check:
Only machines owned by a machine set are remediated by a machine health check.
Control plane machines are not currently supported and are not remediated if they are unhealthy.
If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
If the corresponding node for a machine does not join the cluster after the nodeStartupTimeout
, the machine is remediated.
A machine is remediated immediately if the Machine
resource phase is Failed
.
See Short-circuiting machine health check remediation for more information about short circuiting
MachineHealthCheck
resourceThe MachineHealthCheck
resource resembles the following YAML file:
MachineHealthCheck
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
name: example (1)
namespace: openshift-machine-api
spec:
selector:
matchLabels:
machine.openshift.io/cluster-api-machine-role: <role> (2)
machine.openshift.io/cluster-api-machine-type: <role> (2)
machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> (3)
unhealthyConditions:
- type: "Ready"
timeout: "300s" (4)
status: "False"
- type: "Ready"
timeout: "300s" (4)
status: "Unknown"
maxUnhealthy: "40%" (5)
nodeStartupTimeout: "10m" (6)
1 | Specify the name of the machine health check to deploy. |
2 | Specify a label for the machine pool that you want to check. |
3 | Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a . |
4 | Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine. |
5 | Specify the amount of unhealthy machines allowed in the targeted pool. This can be set as a percentage or an integer. |
6 | Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy. |
The |
Short circuiting ensures that machine health checks remediate machines only when the cluster is healthy.
Short-circuiting is configured through the maxUnhealthy
field in the MachineHealthCheck
resource.
If the user defines a value for the maxUnhealthy
field,
before remediating any machines, the MachineHealthCheck
compares the value of maxUnhealthy
with the number of machines within its target pool that it has determined to be unhealthy.
Remediation is not performed if the number of unhealthy machines exceeds the maxUnhealthy
limit.
If |
The maxUnhealthy
field can be set as either an integer or percentage.
There are different remediation implementations depending on the maxUnhealthy
value.
maxUnhealthy
by using an absolute valueIf maxUnhealthy
is set to 2
:
Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy
These values are independent of how many machines are being checked by the machine health check.
maxUnhealthy
by using percentagesIf maxUnhealthy
is set to 40%
and there are 25 machines being checked:
Remediation will be performed if 10 or fewer nodes are unhealthy
Remediation will not be performed if 11 or more nodes are unhealthy
If maxUnhealthy
is set to 40%
and there are 6 machines being checked:
Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy
The allowed number of machines is rounded down when the percentage of |
MachineHealthCheck
resourceYou can create a MachineHealthCheck
resource for all MachineSets
in your cluster.
You should not create a MachineHealthCheck
resource that targets control plane machines.
Install the oc
command line interface.
Create a healthcheck.yml
file that contains the definition of your machine health check.
Apply the healthcheck.yml
file to your cluster:
$ oc apply -f healthcheck.yml