Machine health checks automatically repair unhealthy machines in a particular machine pool.
To monitor machine health, you create a resource to define the configuration for a controller. You set a condition to check for, such as staying in the
NotReady status for 15 minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.
You cannot apply a machine health check to a machine with the master role.
The controller that observes a
MachineHealthCheck resource checks for the status that you defined. If a machine fails the health check, it is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a
machine deleted event. To limit disruptive impact of the machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the
maxUnhealthy threshold allows for in the targeted pool of machines, remediation stops so that manual intervention can take place.
To stop the check, you remove the resource.