To reboot a node without causing an outage for applications running on the platform, it is important to first evacuate the pods. For pods that are made highly available by the routing tier, nothing else needs to be done. For other pods needing storage, typically databases, it is critical to ensure that they can remain in operation with one pod temporarily going offline. While implementing resiliency for stateful pods is different for each application, in all cases it is important to configure the scheduler to use node anti-affinity to ensure that the pods are properly spread across available nodes.

Another challenge is how to handle nodes that are running critical infrastructure such as the router or the registry. The same node evacuation process applies, though it is important to understand certain edge cases.

Understanding infrastructure node rebooting

Infrastructure nodes are nodes that are labeled to run pieces of the OpenShift Container Platform environment. Currently, the easiest way to manage node reboots is to ensure that there are at least three nodes available to run infrastructure. The nodes to run the infrastructure are called master nodes.

The scenario below demonstrates a common mistake that can lead to service interruptions for the applications running on OpenShift Container Platform when only two nodes are available.

  • Node A is marked unschedulable and all pods are evacuated.

  • The registry pod running on that node is now redeployed on node B. This means node B is now running both registry pods.

  • Node B is now marked unschedulable and is evacuated.

  • The service exposing the two pod endpoints on node B, for a brief period of time, loses all endpoints until they are redeployed to node A.

The same process using three master nodes for infrastructure does not result in a service disruption. However, due to pod scheduling, the last node that is evacuated and brought back in to rotation is left running zero registries. The other two nodes will run two and one registries respectively. The best solution is to rely on pod anti-affinity.

Rebooting a node using pod anti-affinity

Pod anti-affinity is slightly different than node anti-affinity. Node anti-affinity can be violated if there are no other suitable locations to deploy a pod. Pod anti-affinity can be set to either required or preferred.

With this in place, if only two infrastructure nodes are available and one is rebooted, the container image registry pod is prevented from running on the other node. oc get pods reports the pod as unready until a suitable node is available. Once a node is available and all pods are back in ready state, the next node can be restarted.

Procedure

To reboot a node using pod anti-affinity:

  1. Edit the node specification to configure pod anti-affinity:

    apiVersion: v1
    kind: Pod
    metadata:
      name: with-pod-antiaffinity
    spec:
      affinity:
        podAntiAffinity: (1)
          preferredDuringSchedulingIgnoredDuringExecution: (2)
          - weight: 100 (3)
            podAffinityTerm:
              labelSelector:
                matchExpressions:
                - key: registry (4)
                  operator: In (5)
                  values:
                  - default
              topologyKey: kubernetes.io/hostname
    1 Stanza to configure pod anti-affinity.
    2 Defines a preferred rule.
    3 Specifies a weight for a preferred rule. The node with the highest weight is preferred.
    4 Description of the pod label that determines when the anti-affinity rule applies. Specify a key and value for the label.
    5 The operator represents the relationship between the label on the existing pod and the set of values in the matchExpression parameters in the specification for the new pod. Can be In, NotIn, Exists, or DoesNotExist.

    This example assumes the container image registry pod has a label of registry=default. Pod anti-affinity can use any Kubernetes match expression.

  2. Enable the MatchInterPodAffinity scheduler predicate in the scheduling policy file. For more information, see Configuring the Default Scheduler.

Understanding how to reboot nodes running routers

In most cases, a pod running an OpenShift Container Platform router exposes a host port.

The PodFitsPorts scheduler predicate ensures that no router pods using the same port can run on the same node, and pod anti-affinity is achieved. If the routers are relying on IP failover for high availability, there is nothing else that is needed.

For router pods relying on an external service such as AWS Elastic Load Balancing for high availability, it is that service’s responsibility to react to router pod restarts.

In rare cases, a router pod may not have a host port configured. In those cases, it is important to follow the recommended restart process for infrastructure nodes.