Machine health checks automatically repair unhealthy machines in a particular machine pool.
To monitor machine health, create a resource to define the configuration for a controller. Set a condition to check, such as staying in the
NotReady status for five minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.
You cannot apply a machine health check to a machine with the master role.
The controller that observes a
MachineHealthCheck resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and one is created to take its place. When a machine is deleted, you see a
machine deleted event.
To limit disruptive impact of the machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the
maxUnhealthy threshold allows for in the targeted pool of machines, remediation stops and therefore enables manual intervention.
Consider the timeouts carefully, accounting for workloads and requirements.
Long timeouts can result in long periods of downtime for the workload on the unhealthy machine.
Too short timeouts can result in a remediation loop. For example, the timeout for checking the
NotReady status must be long enough to allow the machine to complete the startup process.
To stop the check, remove the resource.
For example, you should stop the check during the upgrade process because the nodes in the cluster might become temporarily unavailable. The
MachineHealthCheck might identify such nodes as unhealthy and reboot them. To avoid rebooting such nodes, remove any
MachineHealthCheck resource that you have deployed before updating the cluster.
MachineHealthCheck resource that is deployed by default (such as
machine-api-termination-handler) cannot be removed and will be recreated.
Limitations when deploying machine health checks
There are limitations to consider before deploying a machine health check:
Only machines owned by a machine set are remediated by a machine health check.
Control plane machines are not currently supported and are not remediated if they are unhealthy.
If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
If the corresponding node for a machine does not join the cluster after the
nodeStartupTimeout, the machine is remediated.
A machine is remediated immediately if the
Machine resource phase is