If you use remote worker nodes, consider which objects to use to run your applications.
It is recommended that you use daemon sets or static pods, based on the behavior you want in the event of network issues or power loss. In addition, you can use Kubernetes zones and tolerations to control or avoid pod evictions if the control plane cannot reach remote worker nodes.
- Daemon sets
Daemon sets are the best approach to managing pods on remote worker nodes for the following reasons:
- Daemon sets do not typically need rescheduling behavior. If a node disconnects from the cluster, pods on the node can continue to run. OpenShift Container Platform does not change the state of daemon set pods, and leaves the pods in the state they last reported. For example, if a daemon set pod is in the Running state, when a node stops communicating, the pod keeps running and is assumed to be running by OpenShift Container Platform.
- Daemon set pods, by default, are created with NoExecute tolerations for the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints with no tolerationSeconds value. These default values ensure that daemon set pods are never evicted if the control plane cannot reach a node. For example:
Tolerations added to daemon set pods by default
tolerations:
- key: node.kubernetes.io/not-ready
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/unreachable
  operator: Exists
  effect: NoExecute
- key: node.kubernetes.io/disk-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/memory-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/pid-pressure
  operator: Exists
  effect: NoSchedule
- key: node.kubernetes.io/unschedulable
  operator: Exists
  effect: NoSchedule
- Daemon sets can use labels to ensure that a workload runs on a matching worker node (a sketch follows the note below).
- You can use an OpenShift Container Platform service endpoint to load balance daemon set pods.
Note: Daemon sets do not schedule pods after a reboot of the node if OpenShift Container Platform cannot reach the node.
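The following minimal daemon set sketch uses a node selector to target remote worker nodes, as mentioned in the list above. The node label, namespace, and image are hypothetical placeholders, not values that OpenShift Container Platform defines:
Example daemon set targeting labeled remote worker nodes
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: remote-monitor # hypothetical name
  namespace: remote-workers # hypothetical namespace
spec:
  selector:
    matchLabels:
      app: remote-monitor
  template:
    metadata:
      labels:
        app: remote-monitor
    spec:
      nodeSelector: # schedule only to nodes that carry this hypothetical label
        node-role.kubernetes.io/remote-worker: ""
      containers:
      - name: monitor
        image: registry.example.com/remote-monitor:latest # hypothetical image
Because these are daemon set pods, they receive the default NoExecute tolerations shown above and remain bound if the control plane cannot reach the node.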
- Static pods
If you want pods to restart when a node reboots, after a power loss for example, consider static pods. The kubelet on a node automatically restarts static pods as the node restarts (a sketch follows the note below).
Note: Static pods cannot use secrets and config maps.
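As an illustration, here is a minimal static pod sketch. You place the manifest in the static pod directory that the kubelet watches on the node, typically /etc/kubernetes/manifests; the pod name and image are hypothetical placeholders:
Example static pod manifest
apiVersion: v1
kind: Pod
metadata:
  name: static-web # hypothetical name; the kubelet appends the node name to the mirror pod
spec:
  containers:
  - name: web
    image: registry.example.com/nginx:latest # hypothetical image
    ports:
    - containerPort: 80
Per the restriction noted above, the manifest cannot reference secrets or config maps.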
- Kubernetes zones
Kubernetes zones can slow down the rate or, in some cases, completely stop pod evictions.
When the control plane cannot reach a node, the node controller, by default, applies node.kubernetes.io/unreachable taints and evicts pods at a rate of 0.1 nodes per second. However, in a cluster that uses Kubernetes zones, pod eviction behavior is altered.
If a zone is fully disrupted, where all nodes in the zone have a Ready condition that is False or Unknown, the control plane does not apply the node.kubernetes.io/unreachable taint to the nodes in that zone.
For partially disrupted zones, where more than 55% of the nodes have a False or Unknown condition, the pod eviction rate is reduced to 0.01 nodes per second. Nodes in smaller clusters, with fewer than 50 nodes, are not tainted. Your cluster must have more than three zones for these behaviors to take effect.
You assign a node to a specific zone by applying the topology.kubernetes.io/region label in the node specification.
Sample node labels for Kubernetes zones
kind: Node
apiVersion: v1
metadata:
  labels:
    topology.kubernetes.io/region: east
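Alternatively, you can apply the label to an existing node from the command line, where <node-name> is a placeholder for your node:
$ oc label node <node-name> topology.kubernetes.io/region=east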
You can adjust the frequency at which the kubelet checks the state of each node.
To set the interval that affects the timing of when the on-premise node controller marks nodes with the Unhealthy or Unreachable condition, create a KubeletConfig object that contains the node-status-update-frequency and node-status-report-frequency parameters.
The kubelet on each node determines the node status as defined by the node-status-update-frequency setting and reports that status to the cluster based on the node-status-report-frequency setting. By default, the kubelet determines the node status every 10 seconds and reports the status every minute. However, if the node state changes, the kubelet reports the change to the cluster immediately. OpenShift Container Platform uses the node-status-report-frequency setting only when the Node Lease feature gate is enabled, which is the default state in OpenShift Container Platform clusters. If the Node Lease feature gate is disabled, the node reports its status based on the node-status-update-frequency setting.
Example kubelet config
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: disable-cpu-units
spec:
  machineConfigPoolSelector:
    matchLabels:
      machineconfiguration.openshift.io/role: worker (1)
  kubeletConfig:
    node-status-update-frequency: (2)
      - "10s"
    node-status-report-frequency: (3)
      - "1m"
(1) Specify the type of node to which this KubeletConfig object applies, using the label from the MachineConfig object.
(2) Specify the frequency at which the kubelet checks the status of a node associated with this MachineConfig object. The default value is 10s. If you change this default, the node-status-report-frequency value is changed to the same value.
(3) Specify the frequency at which the kubelet reports the status of a node associated with this MachineConfig object. The default value is 1m.
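As a usage sketch, assuming the example is saved in a file named kubelet-config.yaml (a hypothetical file name), you apply it with the CLI:
$ oc apply -f kubelet-config.yaml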
The node-status-update-frequency parameter works with the node-monitor-grace-period and pod-eviction-timeout parameters.
- The node-monitor-grace-period parameter specifies how long OpenShift Container Platform waits before marking a node associated with a MachineConfig object as Unhealthy if the controller manager does not receive the node heartbeat. Workloads on the node continue to run after this time. If the remote worker node rejoins the cluster after node-monitor-grace-period expires, pods continue to run. New pods can be scheduled to that node. The default node-monitor-grace-period interval is 40s. The node-status-update-frequency value must be lower than the node-monitor-grace-period value.
- The pod-eviction-timeout parameter specifies the amount of time OpenShift Container Platform waits after marking a node that is associated with a MachineConfig object as Unreachable before it starts marking pods for eviction. Evicted pods are rescheduled on other nodes. If the remote worker node rejoins the cluster after pod-eviction-timeout expires, the pods running on the remote worker node are terminated because the node controller has evicted the pods on-premise. Pods can then be rescheduled to that node. The default pod-eviction-timeout interval is 5m0s.
Note: Modifying the node-monitor-grace-period and pod-eviction-timeout parameters is not supported.
- Tolerations
You can use pod tolerations to mitigate the effects if the on-premise node controller adds a node.kubernetes.io/unreachable taint with a NoExecute effect to a node it cannot reach.
A taint with the NoExecute effect affects pods that are running on the node in the following ways:
- Pods that do not tolerate the taint are queued for eviction.
- Pods that tolerate the taint without specifying a tolerationSeconds value in their toleration specification remain bound forever.
- Pods that tolerate the taint with a specified tolerationSeconds value remain bound for the specified amount of time. After the time elapses, the pods are queued for eviction.
You can delay or avoid pod eviction by configuring pod tolerations with the NoExecute effect for the node.kubernetes.io/unreachable and node.kubernetes.io/not-ready taints.
Example toleration in a pod spec
...
tolerations:
- key: "node.kubernetes.io/unreachable"
  operator: "Exists"
  effect: "NoExecute" (1)
- key: "node.kubernetes.io/not-ready"
  operator: "Exists"
  effect: "NoExecute" (2)
  tolerationSeconds: 600
...
(1) The NoExecute effect without tolerationSeconds lets pods remain forever if the control plane cannot reach the node.
(2) The NoExecute effect with tolerationSeconds: 600 lets pods remain for 10 minutes if the control plane marks the node as Unhealthy.
OpenShift Container Platform uses the tolerationSeconds value after the pod-eviction-timeout value elapses.
- Other types of OpenShift Container Platform objects
You can use replica sets, deployments, and replication controllers. The scheduler can reschedule these pods onto other nodes after a node is disconnected for five minutes. Rescheduling onto other nodes can be beneficial for some workloads, such as REST APIs, where an administrator can guarantee that a specific number of pods are running and accessible (see the sketch after the following note).
Note: When working with remote worker nodes, rescheduling pods on different nodes might not be acceptable if remote worker nodes are intended to be reserved for specific functions.
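For illustration, here is a minimal deployment sketch that maintains three replicas of a REST API workload; the names and image are hypothetical placeholders:
Example deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: rest-api # hypothetical name
spec:
  replicas: 3 # the scheduler reschedules pods to healthy nodes to maintain this count
  selector:
    matchLabels:
      app: rest-api
  template:
    metadata:
      labels:
        app: rest-api
    spec:
      containers:
      - name: api
        image: registry.example.com/rest-api:latest # hypothetical image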
Stateful sets do not get restarted when there is an outage. The pods remain in the terminating state until the control plane can acknowledge that the pods are terminated.
To avoid scheduling pods to a node that does not have access to the same type of persistent storage, OpenShift Container Platform cannot migrate pods that require persistent volumes to other zones in the case of network separation.