×

Understanding taints and tolerations

A taint allows a node to refuse a pod to be scheduled unless that pod has a matching toleration.

You apply taints to a node through the Node specification (NodeSpec) and apply tolerations to a pod through the Pod specification (PodSpec). When you apply a taint a node, the scheduler cannot place a pod on that node unless the pod can tolerate the taint.

Example taint in a node specification
spec:
  taints:
  - effect: NoExecute
    key: key1
    value: value1
....
Example toleration in a Pod spec
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600
....

Taints and tolerations consist of a key, value, and effect.

Table 1. Taint and toleration components
Parameter Description

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

NoSchedule [1]

  • New pods that do not match the taint are not scheduled onto that node.

  • Existing pods on the node remain.

PreferNoSchedule

  • New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.

  • Existing pods on the node remain.

NoExecute

  • New pods that do not match the taint cannot be scheduled onto that node.

  • Existing pods on the node that do not have a matching toleration are removed.

operator

Equal

The key/value/effect parameters must match. This is the default.

Exists

The key/effect parameters must match. You must leave a blank value parameter, which matches any.

  1. If you add a NoSchedule taint to a control plane node, the node must have the node-role.kubernetes.io/master=:NoSchedule taint, which is added by default.

    For example:

    apiVersion: v1
    kind: Node
    metadata:
      annotations:
        machine.openshift.io/machine: openshift-machine-api/ci-ln-62s7gtb-f76d1-v8jxv-master-0
        machineconfiguration.openshift.io/currentConfig: rendered-master-cdc1ab7da414629332cc4c3926e6e59c
    ...
    spec:
      taints:
      - effect: NoSchedule
        key: node-role.kubernetes.io/master
    ...

A toleration matches a taint:

  • If the operator parameter is set to Equal:

    • the key parameters are the same;

    • the value parameters are the same;

    • the effect parameters are the same.

  • If the operator parameter is set to Exists:

    • the key parameters are the same;

    • the effect parameters are the same.

The following taints are built into OpenShift Container Platform:

  • node.kubernetes.io/not-ready: The node is not ready. This corresponds to the node condition Ready=False.

  • node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the node condition Ready=Unknown.

  • node.kubernetes.io/memory-pressure: The node has memory pressure issues. This corresponds to the node condition MemoryPressure=True.

  • node.kubernetes.io/disk-pressure: The node has disk pressure issues. This corresponds to the node condition DiskPressure=True.

  • node.kubernetes.io/network-unavailable: The node network is unavailable.

  • node.kubernetes.io/unschedulable: The node is unschedulable.

  • node.cloudprovider.kubernetes.io/uninitialized: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.

  • node.kubernetes.io/pid-pressure: The node has pid pressure. This corresponds to the node condition PIDPressure=True.

    OpenShift Container Platform does not set a default pid.available evictionHard.

Understanding how to use toleration seconds to delay pod evictions

You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the Pod specification or MachineSet object. If a taint with the NoExecute effect is added to a node, a pod that does tolerate the taint, which has the tolerationSeconds parameter, the pod is not evicted until that time period expires.

Example output
spec:
  tolerations:
  - key: "key1"
    operator: "Equal"
    value: "value1"
    effect: "NoExecute"
    tolerationSeconds: 3600

Here, if this pod is running but does not have a matching toleration, the pod stays bound to the node for 3,600 seconds and then be evicted. If the taint is removed before that time, the pod is not evicted.

Understanding how to use multiple taints

You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Container Platform processes multiple taints and tolerations as follows:

  1. Process the taints for which the pod has a matching toleration.

  2. The remaining unmatched taints have the indicated effects on the pod:

    • If there is at least one unmatched taint with effect NoSchedule, OpenShift Container Platform cannot schedule a pod onto that node.

    • If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OpenShift Container Platform tries to not schedule the pod onto the node.

    • If there is at least one unmatched taint with effect NoExecute, OpenShift Container Platform evicts the pod from the node if it is already running on the node, or the pod is not scheduled onto the node if it is not yet running on the node.

      • Pods that do not tolerate the taint are evicted immediately.

      • Pods that tolerate the taint without specifying tolerationSeconds in their Pod specification remain bound forever.

      • Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

For example:

  • Add the following taints to the node:

    $ oc adm taint nodes node1 key1=value1:NoSchedule
    $ oc adm taint nodes node1 key1=value1:NoExecute
    $ oc adm taint nodes node1 key2=value2:NoSchedule
  • The pod has the following tolerations:

    spec:
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoSchedule"
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoExecute"

In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.

Understanding pod scheduling and node conditions (taint node by condition)

The Taint Nodes By Condition feature, which is enabled by default, automatically taints nodes that report conditions such as memory pressure and disk pressure. If a node reports a condition, a taint is added until the condition clears. The taints have the NoSchedule effect, which means no pod can be scheduled on the node unless the pod has a matching toleration.

The scheduler checks for these taints on nodes before scheduling pods. If the taint is present, the pod is scheduled on a different node. Because the scheduler checks for taints and not the actual node conditions, you configure the scheduler to ignore some of these node conditions by adding appropriate pod tolerations.

To ensure backward compatibility, the daemon set controller automatically adds the following tolerations to all daemons:

  • node.kubernetes.io/memory-pressure

  • node.kubernetes.io/disk-pressure

  • node.kubernetes.io/unschedulable (1.10 or later)

  • node.kubernetes.io/network-unavailable (host network only)

You can also add arbitrary tolerations to daemon sets.

The control plane also adds the node.kubernetes.io/memory-pressure toleration on pods that have a QoS class. This is because Kubernetes manages pods in the Guaranteed or Burstable QoS classes. The new BestEffort pods do not get scheduled onto the affected node.

Understanding evicting pods by condition (taint-based evictions)

The Taint-Based Evictions feature, which is enabled by default, evicts pods from a node that experiences specific conditions, such as not-ready and unreachable. When a node experiences one of these conditions, OpenShift Container Platform automatically adds taints to the node, and starts evicting and rescheduling the pods on different nodes.

Taint Based Evictions have a NoExecute effect, where any pod that does not tolerate the taint is evicted immediately and any pod that does tolerate the taint will never be evicted, unless the pod uses the tolerationSeconds parameter.

The tolerationSeconds parameter allows you to specify how long a pod stays bound to a node that has a node condition. If the condition still exists after the tolerationSeconds period, the taint remains on the node and the pods with a matching toleration are evicted. If the condition clears before the tolerationSeconds period, pods with matching tolerations are not removed.

If you use the tolerationSeconds parameter with no value, pods are never evicted because of the not ready and unreachable node conditions.

OpenShift Container Platform evicts pods in a rate-limited way to prevent massive pod evictions in scenarios such as the master becoming partitioned from the nodes.

By default, if more than 55% of nodes in a given zone are unhealthy, the node lifecycle controller changes that zone’s state to PartialDisruption and the rate of pod evictions is reduced. For small clusters (by default, 50 nodes or less) in this state, nodes in this zone are not tainted and evictions are stopped.

For more information, see Rate limits on eviction in the Kubernetes documentation.

OpenShift Container Platform automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless the Pod configuration specifies either toleration.

spec:
  tolerations:
  - key: node.kubernetes.io/not-ready
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300 (1)
  - key: node.kubernetes.io/unreachable
    operator: Exists
    effect: NoExecute
    tolerationSeconds: 300
1 These tolerations ensure that the default pod behavior is to remain bound for five minutes after one of these node conditions problems is detected.

You can configure these tolerations as needed. For example, if you have an application with a lot of local state, you might want