Understanding taints and tolerations

A taint allows a node to refuse pod to be scheduled unless that pod has a matching toleration.

You apply taints to a node through the node specification (NodeSpec) and apply tolerations to a pod through the pod specification (PodSpec). A taint on a node instructs the node to repel all pods that do not tolerate the taint.

Taints and tolerations consist of a key, value, and effect. An operator allows you to leave one of these parameters empty.

Table 1. Taint and toleration components
Parameter Description

key

The key is any string, up to 253 characters. The key must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

value

The value is any string, up to 63 characters. The value must begin with a letter or number, and may contain letters, numbers, hyphens, dots, and underscores.

effect

The effect is one of the following:

NoSchedule

  • New pods that do not match the taint are not scheduled onto that node.

  • Existing pods on the node remain.

PreferNoSchedule

  • New pods that do not match the taint might be scheduled onto that node, but the scheduler tries not to.

  • Existing pods on the node remain.

NoExecute

  • New pods that do not match the taint cannot be scheduled onto that node.

  • Existing pods on the node that do not have a matching toleration are removed.

operator

Equal

The key/value/effect parameters must match. This is the default.

Exists

The key/effect parameters must match. You must leave a blank value parameter, which matches any.

A toleration matches a taint:

  • If the operator parameter is set to Equal:

    • the key parameters are the same;

    • the value parameters are the same;

    • the effect parameters are the same.

  • If the operator parameter is set to Exists:

    • the key parameters are the same;

    • the effect parameters are the same.

The following taints are built into kubernetes:

  • node.kubernetes.io/not-ready: The node is not ready. This corresponds to the node condition Ready=False.

  • node.kubernetes.io/unreachable: The node is unreachable from the node controller. This corresponds to the node condition Ready=Unknown.

  • node.kubernetes.io/out-of-disk: The node has insufficient free space on the node for adding new pods. This corresponds to the node condition OutOfDisk=True.

  • node.kubernetes.io/memory-pressure: The node has memory pressure issues. This corresponds to the node condition MemoryPressure=True.

  • node.kubernetes.io/disk-pressure: The node has disk pressure issues. This corresponds to the node condition DiskPressure=True.

  • node.kubernetes.io/network-unavailable: The node network is unavailable.

  • node.kubernetes.io/unschedulable: The node is unschedulable.

  • node.cloudprovider.kubernetes.io/uninitialized: When the node controller is started with an external cloud provider, this taint is set on a node to mark it as unusable. After a controller from the cloud-controller-manager initializes this node, the kubelet removes this taint.

Understanding how to use toleration seconds to delay pod evictions

You can specify how long a pod can remain bound to a node before being evicted by specifying the tolerationSeconds parameter in the pod specification. If a taint with the NoExecute effect is added to a node, any pods that do not tolerate the taint are evicted immediately (pods that do tolerate the taint are not evicted). However, if a pod that to be evicted has the tolerationSeconds parameter, the pod is not evicted until that time period expires.

For example:

tolerations:
- key: "key1"
  operator: "Equal"
  value: "value1"
  effect: "NoExecute"
  tolerationSeconds: 3600

Here, if this pod is running but does not have a matching taint, the pod stays bound to the node for 3,600 seconds and then be evicted. If the taint is removed before that time, the pod is not evicted.

Understanding how to use multiple taints

You can put multiple taints on the same node and multiple tolerations on the same pod. OpenShift Container Platform processes multiple taints and tolerations as follows:

  1. Process the taints for which the pod has a matching toleration.

  2. The remaining unmatched taints have the indicated effects on the pod:

    • If there is at least one unmatched taint with effect NoSchedule, OpenShift Container Platform cannot schedule a pod onto that node.

    • If there is no unmatched taint with effect NoSchedule but there is at least one unmatched taint with effect PreferNoSchedule, OpenShift Container Platform tries to not schedule the pod onto the node.

    • If there is at least one unmatched taint with effect NoExecute, OpenShift Container Platform evicts the pod from the node (if it is already running on the node), or the pod is not scheduled onto the node (if it is not yet running on the node).

      • Pods that do not tolerate the taint are evicted immediately.

      • Pods that tolerate the taint without specifying tolerationSeconds in their toleration specification remain bound forever.

      • Pods that tolerate the taint with a specified tolerationSeconds remain bound for the specified amount of time.

For example:

  • The node has the following taints:

    $ oc adm taint nodes node1 key1=value1:NoSchedule
    $ oc adm taint nodes node1 key1=value1:NoExecute
    $ oc adm taint nodes node1 key2=value2:NoSchedule
  • The pod has the following tolerations:

    tolerations:
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoSchedule"
    - key: "key1"
      operator: "Equal"
      value: "value1"
      effect: "NoExecute"

In this case, the pod cannot be scheduled onto the node, because there is no toleration matching the third taint. The pod continues running if it is already running on the node when the taint is added, because the third taint is the only one of the three that is not tolerated by the pod.

Preventing pod eviction for node problems

OpenShift Container Platform can be configured to represent node unreachable and node not ready conditions as taints. This allows per-pod specification of how long to remain bound to a node that becomes unreachable or not ready, rather than using the default of five minutes.

The Taint-Based Evictions feature is enabled by default. The taints are automatically added by the node controller and the normal logic for evicting pods from Ready nodes is disabled.

  • If a node enters a not ready state, the node.kubernetes.io/not-ready:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.

  • If a node enters a not reachable state, the node.kubernetes.io/unreachable:NoExecute taint is added and pods cannot be scheduled on the node. Existing pods remain for the toleration seconds period.

This feature, in combination with tolerationSeconds, allows a pod to specify how long it should stay bound to a node that has one or both of these problems.

Understanding pod scheduling and node conditions (Taint Node by Condition)

OpenShift Container Platform automatically taints nodes that report conditions such as memory pressure and disk pressure. If a node reports a condition, a taint is added until the condition clears. The taints have the NoSchedule effect, which means no pod can be scheduled on the node, unless the pod has a matching toleration. This feature, Taint Nodes By Condition, is enabled by default.

The scheduler checks for these taints on nodes before scheduling pods. If the taint is present, the pod is scheduled on a different node. Because the scheduler checks for taints and not the actual Node conditions, you configure the scheduler to ignore some of these node conditions by adding appropriate Pod tolerations.

The DaemonSet controller automatically adds the following tolerations to all daemons, to ensure backward compatibility:

  • node.kubernetes.io/memory-pressure

  • node.kubernetes.io/disk-pressure

  • node.kubernetes.io/out-of-disk (only for critical pods)

  • node.kubernetes.io/unschedulable (1.10 or later)

  • node.kubernetes.io/network-unavailable (host network only)

You can also add arbitrary tolerations to DaemonSets.

Understanding evicting pods by condition (Taint-Based Evictions)

The Taint-Based Evictions feature, enabled by default, evicts pods from a node that experiences specific conditions, such as not-ready and unreachable. When a node experiences one of these conditions, OpenShift Container Platform automatically adds taints to the node, and starts evicting and rescheduling the pods on different nodes.

Taint Based Evictions has a NoExecute effect, where any pod that does not tolerate the taint will be evicted immediately and any pod that does tolerate the taint will never be evicted.

OpenShift Container Platform evicts pods in a rate-limited way to prevent massive pod evictions in scenarios such as the master becoming partitioned from the nodes.

This feature, in combination with tolerationSeconds, allows you to specify how long a pod should stay bound to a node that has a node condition. If the condition still exists after the tolerationSections period, the taint remains on the node and the pods are evicted in a rate-limited manner. If the condition clears before the tolerationSeconds period, pods are not removed.

OpenShift Container Platform automatically adds a toleration for node.kubernetes.io/not-ready and node.kubernetes.io/unreachable with tolerationSeconds=300, unless the pod configuration specifies either toleration.

spec
  tolerations:
    - key: node.kubernetes.io/not-ready
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300
    - key: node.kubernetes.io/unreachable
      operator: Exists
      effect: NoExecute
      tolerationSeconds: 300

These tolerations ensure that the default pod behavior is to remain bound for 5 minutes after one of these node conditions problems is detected.

You can configure these tolerations as needed. For example, if you have an application with a lot of local state you might want to keep the pods bound to node for a longer time in the event of network partition, allowing for the partition to recover and avoiding pod eviction.

DaemonSet pods are created with NoExecute tolerations for the following taints with no tolerationSeconds:

  • node.kubernetes.io/unreachable

  • node.kubernetes.io/not-ready

This ensures that DaemonSet pods are never evicted due to these node conditions, even if the DefaultTolerationSeconds admission controller is disabled.

Adding taints and tolerations

You add taints to nodes and tolerations to pods allow the node to control which pods should (or should not) be scheduled on them.

Procedure
  1. Use the following command using the parameters described in the taint and toleration components table:

    $ oc adm taint nodes <node-name> <key>=<value>:<effect>

    For example:

    $ oc adm taint nodes node1 key1=value1:NoSchedule

    This example places a taint on node1 that has key key1, value value1, and taint effect NoSchedule.

  2. Add a toleration to a pod by editing the pod specification to include a tolerations section:

    Sample pod configuration file with Equal operator
    tolerations:
    - key: "key1" (1)
      operator: "Equal" (1)
      value: "value1" (1)
      effect: "NoExecute" (1)
      tolerationSeconds: 3600 (2)
    1 The toleration parameters, as described in the taint and toleration components table.
    2 The tolerationSeconds parameter specifies how long a pod can remain bound to a node before being evicted.

    For example:

    Sample pod configuration file with Exists operator
    tolerations:
    - key: "key1"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 3600

    Both of these tolerations match the taint created by the oc adm taint command above. A pod with either toleration would be able to schedule onto node1.

Dedicating a Node for a User using taints and tolerations

You can specify a set of nodes for exclusive use by a particular set of users.

Procedure

To specify dedicated nodes:

  1. Add a taint to those nodes:

    For example:

    $ oc adm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a corresponding toleration to the pods by writing a custom admission controller.

    Only the pods with the tolerations are allowed to use the dedicated nodes.

Binding a user to a Node using taints and tolerations

You can configure a node so that particular users can use only the dedicated nodes.

Procedure

To configure a node so that users can use only that node:

  1. Add a taint to those nodes:

    For example:

    $ oc adm taint nodes node1 dedicated=groupName:NoSchedule
  2. Add a corresponding toleration to the pods by writing a custom admission controller.

    The admission controller should add a node affinity to require that the pods can only schedule onto nodes labeled with the key:value label (dedicated=groupName).

  3. Add a label similar to the taint (such as the key:value label) to the dedicated nodes.

Controlling Nodes with special hardware using taints and tolerations

In a cluster where a small subset of nodes have specialized hardware (for example GPUs), you can use taints and tolerations to keep pods that do not need the specialized hardware off of those nodes, leaving the nodes for pods that do need the specialized hardware. You can also require pods that need specialized hardware to use specific nodes.

Procedure

To ensure pods are blocked from the specialized hardware:

  1. Taint the nodes that have the specialized hardware using one of the following commands:

    $ oc adm taint nodes <node-name> disktype=ssd:NoSchedule
    $ oc adm taint nodes <node-name> disktype=ssd:PreferNoSchedule
  2. Adding a corresponding toleration to pods that use the special hardware using an admission controller.

For example, the admission controller could use some characteristic(s) of the pod to determine that the pod should be allowed to use the special nodes by adding a toleration.

To ensure pods can only use the specialized hardware, you need some additional mechanism. For example, you could label the nodes that have the special hardware and use node affinity on the pods that need the hardware.

Removing taints and tolerations

You can remove taints from nodes and tolerations from pods as needed.

Procedure

To remove taints and tolerations:

  1. To remove a taint from a node:

    $ oc adm taint nodes <node-name> <key>-

    For example:

    $ oc adm taint nodes ip-10-0-132-248.ec2.internal key1-
    
    node/ip-10-0-132-248.ec2.internal untainted
  2. To remove a toleration from a pod, edit the pod specification to remove the toleration:

    tolerations:
    - key: "key2"
      operator: "Exists"
      effect: "NoExecute"
      tolerationSeconds: 3600