The Poison Pill Operator runs on the cluster nodes and reboots nodes that are identified as unhealthy. The Operator uses the
MachineHealthCheck controller to detect the health of a node in the cluster. When a node is identified as unhealthy, the
MachineHealthCheck resource creates the
PoisonPillRemediation custom resource (CR), which triggers the Poison Pill Operator.
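For example, you can point a MachineHealthCheck resource at the Poison Pill Operator through its external remediation template. The following is a minimal sketch; the resource names, the selector label, and the unhealthy conditions are illustrative and assume that a PoisonPillRemediationTemplate CR named poison-pill-remediation-template already exists:

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: worker-health-check  # hypothetical name
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: worker
  unhealthyConditions:
  - type: Ready
    status: "False"
    timeout: 300s  # how long a node can stay not Ready before it is considered unhealthy
  remediationTemplate:  # delegates remediation to the Poison Pill Operator
    apiVersion: poison-pill.medik8s.io/v1alpha1
    kind: PoisonPillRemediationTemplate
    name: poison-pill-remediation-template

When a machine that matches the selector becomes unhealthy, the MachineHealthCheck controller creates a PoisonPillRemediation CR from this template rather than performing the default machine deletion.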
The Poison Pill Operator minimizes downtime for stateful applications and restores compute capacity when transient failures occur. You can use this Operator regardless of the management interface used to provision a node, such as IPMI or an API, and regardless of the cluster installation type, such as installer-provisioned infrastructure or user-provisioned infrastructure.
The Poison Pill Operator creates the
PoisonPillConfig CR with the name
poison-pill-config in the Poison Pill Operator’s namespace. You can edit this CR. However, you cannot create a new CR for the Poison Pill Operator.
A change in the
PoisonPillConfig CR re-creates the Poison Pill daemon set.
The PoisonPillConfig CR resembles the following YAML file:
apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillConfig
metadata:
  name: poison-pill-config
  namespace: openshift-operators
spec:
  safeTimeToAssumeNodeRebootedSeconds: 180 (1)
  watchdogFilePath: /test/watchdog1 (2)
  isSoftwareRebootEnabled: true (3)
  apiServerTimeout: 15s (4)
  apiCheckInterval: 5s (5)
  maxApiErrorThreshold: 3 (6)
  peerApiServerTimeout: 5s (7)
  peerDialTimeout: 5s (8)
  peerRequestTimeout: 5s (9)
  peerUpdateInterval: 15m (10)
1. Specify the timeout duration for the surviving peer, after which the Operator can assume that an unhealthy node has been rebooted. The Operator automatically calculates the lower limit for this value. However, if different nodes have different watchdog timeouts, you must change this value to a higher value.
2. Specify the file path of the watchdog device in the nodes. If a watchdog device is unavailable, the PoisonPillConfig CR uses a software reboot.
3. Specify whether to enable software reboot of the unhealthy nodes. By default, the value of isSoftwareRebootEnabled is set to true. To disable the software reboot, set the parameter value to false.
4. Specify the timeout duration to check connectivity with each API server. When this duration elapses, the Operator starts remediation.
5. Specify the frequency with which to check connectivity with each API server.
6. Specify a threshold value. After reaching this threshold, the node starts contacting its peers.
7. Specify the timeout duration for the peer to connect to the API server.
8. Specify the timeout duration for establishing a connection with the peer.
9. Specify the timeout duration for getting a response from the peer.
10. Specify the frequency with which to update peer information, such as the IP address.
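For example, if some of your nodes use a hardware watchdog with a longer timeout, you must set safeTimeToAssumeNodeRebootedSeconds to a higher value, as noted above. A minimal sketch of such an edit, with the value 300 chosen for illustration:

apiVersion: poison-pill.medik8s.io/v1alpha1
kind: PoisonPillConfig
metadata:
  name: poison-pill-config
  namespace: openshift-operators
spec:
  safeTimeToAssumeNodeRebootedSeconds: 300  # raised above the longest watchdog timeout in the cluster

Note that saving this change re-creates the Poison Pill daemon set.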
Watchdog devices can be any of the following:
Independently powered hardware devices
Hardware devices that share power with the hosts they control
Virtual devices implemented in software, or softdog
Hardware watchdog and
softdog devices have electronic or software timers, respectively. These watchdog devices are used to ensure that the machine enters a safe state when an error condition is detected. The cluster is required to repeatedly reset the watchdog timer to prove that it is in a healthy state. This timer might elapse due to fault conditions, such as deadlocks, CPU starvation, and loss of network or disk access. If the timer expires, the watchdog device assumes that a fault has occurred and the device triggers a forced reset of the node.
Hardware watchdog devices are more reliable than softdog devices.
The Poison Pill Operator determines the remediation strategy based on the watchdog devices that are present.
If a hardware watchdog device is configured and available, the Operator uses it for remediation. If a hardware watchdog device is not configured, the Operator enables and uses a
softdog device for remediation.
If neither watchdog device is supported, either by the system or by the configuration, the Operator remediates nodes by using a software reboot.
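If you want to ensure that remediation never falls back to a software reboot, you can disable it in the PoisonPillConfig CR. A minimal sketch of the relevant spec fields, assuming a hardware watchdog is exposed at /dev/watchdog (the device path is illustrative):

spec:
  watchdogFilePath: /dev/watchdog  # path to the hardware watchdog device on the nodes
  isSoftwareRebootEnabled: false   # do not fall back to a software reboot

With this configuration, the Operator does not use a software reboot even when no watchdog device is available.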