MachineHealthChecks automatically repairs unhealthy Machines in a particular
MachinePool.
To monitor machine health, you create a resource to define the
configuration for a controller. You set a condition to check for, such as
staying in the NotReady
status for 15 minutes or displaying a permanent condition
in the node-problem-detector, and a label for the set of machines to monitor.
|
You cannot apply a MachineHealthCheck to a machine with the master role.
|
The controller that observes a MachineHealthCheck resource checks for the status
that you defined. If a machine fails the health check, it is automatically deleted
and a new one is created to take its place. When a machine is deleted, you
see a machine deleted
event. To limit disruptive impact of the machine
deletion, the controller drains and deletes only one node at a time. If there
are more unhealthy machines than the maxUnhealthy
threshold allows for in the
targeted pool of machines, remediation stops so that manual intervention can take
place.
To stop the check, you remove the resource.