Managing Nodes | Cluster Administration | OpenShift Container Platform 3.4

Overview
Listing Nodes
Adding Nodes
Deleting Nodes
Updating Labels on Nodes
Listing Pods on Nodes
Marking Nodes as Unschedulable or Schedulable
Evacuating Pods on Nodes
Rebooting Nodes
Configuring Node Resources
- Setting Maximum Pods Per Node
Resetting Docker Storage
Changing Node Traffic Interface

Overview

You can manage nodes in your instance using the CLI.

When you perform node management operations, the CLI interacts with node objects that are representations of actual node hosts. The master uses the information from node objects to validate nodes with health checks.

Listing Nodes

To list all nodes that are known to the master:

$ oc get nodes
NAME                        STATUS                     AGE
master.example.com          Ready,SchedulingDisabled   165d
node1.example.com           Ready                      165d
node2.example.com           Ready                      165d

To only list information about a single node, replace <node> with the full node name:

$ oc get node <node>

The STATUS column in the output of these commands can show nodes with the following conditions:

Table 1. Node Conditions
Condition	Description
`Ready`	The node is passing the health checks performed from the master by returning `StatusOK`.
`NotReady`	The node is not passing the health checks performed from the master.
`SchedulingDisabled`	Pods cannot be scheduled for placement on the node.

The STATUS column can also show Unknown for a node if the CLI cannot find any node condition.
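
As a small illustration of reading the STATUS column, the sketch below prints the names of nodes whose status does not start with Ready. It assumes the three-column layout (NAME, STATUS, AGE) shown above and uses the --no-headers flag of oc get; the function name not_ready_nodes is made up for this example.

```shell
# Print the name of every node whose STATUS column does not start with "Ready".
# Assumes the column layout shown above: NAME STATUS AGE.
# A status of "Ready,SchedulingDisabled" still counts as Ready here.
not_ready_nodes() {
    oc get nodes --no-headers | awk '$2 !~ /^Ready/ { print $1 }'
}
```

Calling not_ready_nodes prints nothing when every node is passing its health checks.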

To get more detailed information about a specific node, including the reason for the current condition:

$ oc describe node <node>

For example:

$ oc describe node node1.example.com
Name:			node1.example.com (1)
Labels:			beta.kubernetes.io/arch=amd64 (2)
			beta.kubernetes.io/os=linux
			kubernetes.io/hostname=node1.example.com
			region=infra
			storagenode=glusterfs
			zone=default
Taints:			<none>
CreationTimestamp:	Mon, 18 Jun 2018 14:02:45 -0400
Phase:
Conditions:                                 (3)
  Type			Status	LastHeartbeatTime			LastTransitionTime			Reason				Message
  ----			------	-----------------			------------------			------				-------
  OutOfDisk 		False 	Mon, 18 Jun 2018 16:53:16 -0400 	Mon, 18 Jun 2018 14:02:44 -0400 	KubeletHasSufficientDisk 	kubelet has sufficient disk space available
  MemoryPressure 	False 	Mon, 18 Jun 2018 16:53:16 -0400 	Mon, 18 Jun 2018 14:02:44 -0400 	KubeletHasSufficientMemory 	kubelet has sufficient memory available
  DiskPressure 		False 	Mon, 18 Jun 2018 16:53:16 -0400 	Mon, 18 Jun 2018 14:02:44 -0400 	KubeletHasNoDiskPressure 	kubelet has no disk pressure
  Ready 		True 	Mon, 18 Jun 2018 16:53:16 -0400 	Mon, 18 Jun 2018 15:03:16 -0400 	KubeletReady 			kubelet is posting ready status
Addresses:		10.74.157.70,10.74.157.70
Capacity:                                    (4)
 alpha.kubernetes.io/nvidia-gpu:	0
 cpu:					1
 memory:				8009836Ki
 pods:					10
Allocatable:
 alpha.kubernetes.io/nvidia-gpu:	0
 cpu:					1
 memory:				8009836Ki
 pods:					10
System Info:                                 (5)
 Machine ID:			6d48b4ed915236703275389b
 System UUID:			BFE23571-E9B-828F-25ABC6838CBC
 Boot ID:			c07a5150-dbe-9c3b-f7e748edfb74
 Kernel Version:		3.10.0-862.3.3.el7.x86_64
 OS Image:			Employee SKU
 Operating System:		linux
 Architecture:			amd64
 Container Runtime Version:	docker://1.10.3
 Kubelet Version:		v1.4.0+776c994
 Kube-Proxy Version:		v1.4.0+776c994
ExternalID:			node1.example.com  (6)
Non-terminated Pods:		(3 in total)                     (7)
  Namespace			Name				CPU Requests	CPU Limits	Memory Requests	Memory Limits
  ---------			----				------------	----------	---------------	-------------
  default			router-1-6tuk1			100m (10%)	0 (0%)		256Mi (3%)	0 (0%)
  storage-project		deploy-heketi-1-z4mtx		0 (0%)		0 (0%)		0 (0%)		0 (0%)
  storage-project		glusterfs-r7lp2			0 (0%)		0 (0%)		0 (0%)		0 (0%)
Allocated resources:                                             (9)
  (Total limits may be over 100 percent, i.e., overcommitted.)
  CPU Requests	CPU Limits	Memory Requests	Memory Limits
  ------------	----------	---------------	-------------
  100m (10%)	0 (0%)		256Mi (3%)	0 (0%)
Events:                                                          (8)
  FirstSeen	LastSeen	Count	From					SubobjectPath	Type		Reason			Message
  ---------	--------	-----	----					-------------	--------	------			-------
  1h		1h		1	{kubelet node1.example.com}		Normal		Starting		Starting kubelet.
  1h		1h		2	{kubelet node1.example.com}		Normal		NodeHasSufficientDisk	Node node1.example.com status is now: NodeHasSufficientDisk
  1h		1h		2	{kubelet node1.example.com}		Normal		NodeHasSufficientMemory	Node node1.example.com status is now: NodeHasSufficientMemory
  1h		1h		2	{kubelet node1.example.com}		Normal		NodeHasNoDiskPressure	Node node1.example.com status is now: NodeHasNoDiskPressure
  1h		1h		1	{kubelet node1.example.com}		Warning		Rebooted		Node node1.example.com has been rebooted, boot id: c07a5150-dbe-9c3b-f7e748edfb74
  1h		1h		1	{kubelet node1.example.com}		Normal		NodeNotReady		Node node1.example.com status is now: NodeNotReady
  1h		1h		1	{kubelet node1.example.com}		Normal		NodeReady		Node node1.example.com status is now: NodeReady

1	The name of the node.
2	The labels applied to the node.
3	Node conditions.
4	The resource capacity of the node and its allocatable resources.
5	Information about the node host.
6	The host name of the node.
7	The pods on the node.
8	Events affecting the node.
9	The total resource requests and limits of the pods on the node.

Adding Nodes

To add nodes to your existing OpenShift Container Platform cluster, you can run an Ansible playbook that handles installing the node components, generating the required certificates, and other important steps. See the advanced installation method for instructions on running the playbook directly.

Alternatively, if you used the quick installation method, you can re-run the installer to add nodes, which performs the same steps.

Deleting Nodes

When you delete a node using the CLI, the node object is deleted in Kubernetes, but the pods that exist on the node itself are not deleted. Any bare pods not backed by a replication controller become inaccessible to OpenShift Container Platform, pods backed by replication controllers are rescheduled to other available nodes, and local manifest pods must be deleted manually.

To delete a node from the OpenShift Container Platform cluster:

Evacuate pods from the node you are preparing to delete.
Delete the node object:
```
$ oc delete node <node>
```
Check that the node has been removed from the node list:
```
$ oc get nodes
```
Pods should now be scheduled only on the remaining nodes that are in the Ready state.
If you want to uninstall all OpenShift Container Platform content from the node host, including all pods and containers, continue to Uninstalling Nodes and follow the procedure using the uninstall.yml playbook. The procedure assumes general understanding of the advanced installation method using Ansible.
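
The deletion steps above can be sketched as a single shell function. This is only an outline under stated assumptions: remove_node is a hypothetical helper name, and the function marks the node unschedulable first because evacuation requires it.

```shell
# Remove a node from the cluster (remove_node is a hypothetical helper).
# Order: mark unschedulable, evacuate pods, delete the node object,
# then list the remaining nodes so you can confirm the node is gone.
remove_node() (
    node="$1"
    [ -n "$node" ] || { echo "usage: remove_node <node>" >&2; exit 1; }
    oadm manage-node "$node" --schedulable=false &&
    oadm manage-node "$node" --evacuate &&
    oc delete node "$node" &&
    oc get nodes
)
```

For example, remove_node node2.example.com stops at the first command that fails, so the node object is not deleted if evacuation did not succeed.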

Updating Labels on Nodes

To add or update labels on a node:

$ oc label node <node> <key_1>=<value_1> ... <key_n>=<value_n>

To see more detailed usage:

$ oc label -h
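
When a label key already exists on the node, oc label refuses to change its value unless the --overwrite flag is given. The following sketch wraps that behavior; relabel_node is a hypothetical helper name, and the label values in the usage note are only examples.

```shell
# Add labels to a node, replacing existing values for the same keys.
# relabel_node is a hypothetical helper. Each argument after the node name
# must look like key=value; anything else is rejected before oc is called.
relabel_node() (
    [ $# -ge 2 ] || { echo "usage: relabel_node <node> <key>=<value>..." >&2; exit 1; }
    node="$1"; shift
    for pair in "$@"; do
        case "$pair" in
            ?*=*) ;;
            *) echo "not a key=value pair: $pair" >&2; exit 1 ;;
        esac
    done
    oc label node "$node" "$@" --overwrite
)
```

A call such as relabel_node node1.example.com region=infra zone=default runs oc label once with both labels.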

Listing Pods on Nodes

To list all or selected pods on one or more nodes:

$ oadm manage-node <node1> <node2> \
    --list-pods [--pod-selector=<pod_selector>] [-o json|yaml]

To list all or selected pods on selected nodes:

$ oadm manage-node --selector=<node_selector> \
    --list-pods [--pod-selector=<pod_selector>] [-o json|yaml]

Marking Nodes as Unschedulable or Schedulable

By default, healthy nodes with a Ready status are marked as schedulable, meaning that new pods are allowed for placement on the node. Manually marking a node as unschedulable blocks any new pods from being scheduled on the node. Existing pods on the node are not affected.

To mark a node or nodes as unschedulable:

$ oadm manage-node <node1> <node2> --schedulable=false

For example:

$ oadm manage-node node1.example.com --schedulable=false
NAME                 LABELS                                        STATUS
node1.example.com    kubernetes.io/hostname=node1.example.com      Ready,SchedulingDisabled

To mark a currently unschedulable node or nodes as schedulable:

$ oadm manage-node <node1> <node2> --schedulable

Alternatively, instead of listing node names (e.g., <node1> <node2>), you can use the --selector=<node_selector> option to mark selected nodes as schedulable or unschedulable.

Evacuating Pods on Nodes

Evacuating pods allows you to migrate all or selected pods from a given node or nodes. Nodes must first be marked unschedulable to perform pod evacuation.

Only pods backed by a replication controller can be evacuated; the replication controllers create new pods on other nodes and remove the existing pods from the specified node(s). Bare pods, meaning those not backed by a replication controller, are unaffected by default. You can evacuate a subset of pods by specifying a pod selector. The pod selector is based on labels, so all pods with the specified label are evacuated.

To list pods that will be migrated without actually performing the evacuation, use the --dry-run option:

$ oadm manage-node <node1> <node2> \
    --evacuate --dry-run [--pod-selector=<pod_selector>]

To actually evacuate all or selected pods on one or more nodes:

$ oadm manage-node <node1> <node2> \
    --evacuate [--pod-selector=<pod_selector>]

You can force deletion of bare pods by using the --force option:

$ oadm manage-node <node1> <node2> \
    --evacuate --force [--pod-selector=<pod_selector>]

Alternatively, instead of listing node names (e.g., <node1> <node2>), you can use the --selector=<node_selector> option to evacuate pods on selected nodes.
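
One way to combine the dry run and the real evacuation is to show the preview and ask for confirmation before moving anything. The sketch below does that for a single node; preview_then_evacuate is a hypothetical helper name, and any extra arguments (such as a pod selector) are passed through unchanged.

```shell
# Preview an evacuation, then perform it only after the operator types "y".
# preview_then_evacuate is a hypothetical helper for this sketch.
preview_then_evacuate() (
    [ $# -ge 1 ] || { echo "usage: preview_then_evacuate <node> [options]" >&2; exit 1; }
    node="$1"; shift
    # Nodes must be unschedulable before pods can be evacuated.
    oadm manage-node "$node" --schedulable=false || exit 1
    oadm manage-node "$node" --evacuate --dry-run "$@" || exit 1
    printf 'Evacuate these pods from %s? [y/N] ' "$node"
    read -r answer
    [ "$answer" = y ] && oadm manage-node "$node" --evacuate "$@"
)
```

Any answer other than y leaves the node unschedulable with its pods still in place.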

Rebooting Nodes

To reboot a node without causing an outage for applications running on the platform, it is important to first evacuate the pods. For pods that are made highly available by the routing tier, nothing else needs to be done. For other pods needing storage, typically databases, it is critical to ensure that they can remain in operation with one pod temporarily going offline. While implementing resiliency for stateful pods is different for each application, in all cases it is important to configure the scheduler to use node anti-affinity to ensure that the pods are properly spread across available nodes.

Another challenge is how to handle nodes that are running critical infrastructure such as the router or the registry. The same node evacuation process applies, though it is important to understand certain edge cases.
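
After the host has been rebooted, it is useful to wait until the node reports Ready again before marking it schedulable or moving on to the next node. The following sketch polls oc get node for that purpose; wait_for_ready is a hypothetical helper name, and the default of 30 attempts spaced 10 seconds apart is an arbitrary choice.

```shell
# Poll until the node's STATUS starts with "Ready", or give up.
# Arguments: node name, number of attempts (default 30),
# seconds to sleep between attempts (default 10).
wait_for_ready() (
    node="$1"; tries="${2:-30}"; delay="${3:-10}"
    while [ "$tries" -gt 0 ]; do
        status=$(oc get node "$node" --no-headers 2>/dev/null | awk '{ print $2 }')
        case "$status" in
            Ready*) exit 0 ;;   # includes "Ready,SchedulingDisabled"
        esac
        tries=$((tries - 1))
        sleep "$delay"
    done
    exit 1
)
```

A reboot then follows this order: mark the node unschedulable, evacuate it, reboot the host, run wait_for_ready <node>, and finally mark the node schedulable again.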

Infrastructure Nodes

Infrastructure nodes are nodes that are labeled to run pieces of the OpenShift Container Platform environment. Currently, the easiest way to manage node reboots is to ensure that there are at least three nodes available to run infrastructure. The scenario below demonstrates a common mistake that can lead to service interruptions for the applications running on OpenShift Container Platform when only two nodes are available.

Node A is marked unschedulable and all pods are evacuated.
The registry pod running on that node is now redeployed on node B. This means node B is now running both registry pods.
Node B is now marked unschedulable and is evacuated.
The service exposing the two pod endpoints on node B, for a brief period of time, loses all endpoints until they are redeployed to node A.

The same process using three infrastructure nodes does not result in a service disruption. However, due to pod scheduling, the last node that is evacuated and brought back into rotation is left running zero registries. The other two nodes run two registries and one registry, respectively. The best solution is to rely on pod anti-affinity. This is an alpha feature in Kubernetes that is available for testing now, but is not yet supported for production workloads.

Using Pod Anti-Affinity for the Docker Registry Pod

Pod anti-affinity is slightly different from node anti-affinity. Node anti-affinity can be violated if there are no other suitable locations to deploy a pod. Pod anti-affinity can be set to either required or preferred.

Using the docker-registry pod as an example, the first step in enabling this feature is to set the scheduler.alpha.kubernetes.io/affinity annotation on the pod. Since this pod uses a deployment configuration, the most appropriate place to add the annotation is the pod template’s metadata.

$ oc edit dc/docker-registry -o yaml

...
  template:
    metadata:
      annotations:
        scheduler.alpha.kubernetes.io/affinity: |
          {
            "podAntiAffinity": {
              "requiredDuringSchedulingIgnoredDuringExecution": [{
                "labelSelector": {
                  "matchExpressions": [{
                    "key": "docker-registry",
                    "operator": "In",
                    "values":["default"]
                  }]
                },
                "topologyKey": "kubernetes.io/hostname"
              }]
            }
          }

scheduler.alpha.kubernetes.io/affinity is internally stored as a string even though the contents are JSON. The above example shows how this string can be added as an annotation to a YAML deployment configuration.

This example assumes the Docker registry pod has a label of docker-registry=default. Pod anti-affinity can use any Kubernetes match expression.

The last required step is to enable the MatchInterPodAffinity scheduler predicate in /etc/origin/master/scheduler.json. With this in place, if only two infrastructure nodes are available and one is rebooted, the Docker registry pod is prevented from running on the other node. oc get pods reports the pod as unready until a suitable node is available. Once a node is available and all pods are back in ready state, the next node can be restarted.
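
Before relying on this behavior, it may help to confirm that the predicate is actually listed in the scheduler policy file. The sketch below assumes that predicates appear in the file as objects with a "name" field; has_predicate is a hypothetical helper, and it does a plain text match rather than parsing the JSON.

```shell
# Succeed if the policy file ($1) contains an entry named $2.
# This is a text match on "name": "<predicate>", not a JSON parse,
# so it cannot tell predicates apart from priorities with the same name.
has_predicate() {
    grep -Eq "\"name\"[[:space:]]*:[[:space:]]*\"$2\"" "$1"
}
```

For example, has_predicate /etc/origin/master/scheduler.json MatchInterPodAffinity || echo "predicate missing" reports when the predicate has not been added yet.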

Handling Nodes Running Routers

In most cases, a pod running an OpenShift Container Platform router will expose a host port. The PodFitsPorts scheduler predicate ensures that no router pods using the same port can run on the same node, and pod anti-affinity is achieved. If the routers rely on IP failover for high availability, nothing else is needed. For router pods relying on an external service such as AWS Elastic Load Balancing for high availability, it is that service’s responsibility to react to router pod restarts.

In rare cases, a router pod might not have a host port configured. In those cases, it is important to follow the recommended restart process for infrastructure nodes.

Configuring Node Resources

You can configure node resources by adding kubelet arguments to the node configuration file (/etc/origin/node/node-config.yaml). Add the kubeletArguments section and include any desired options:

kubeletArguments:
  max-pods: (1)
    - "40"
  resolv-conf: (2)
    - "/etc/resolv.conf"
  image-gc-high-threshold: (3)
    - "90"
  image-gc-low-threshold: (4)
    - "80"

1	Maximum number of pods that can run on this kubelet.
2	Resolver configuration file used as the basis for the container DNS resolution configuration.
3	The percent of disk usage after which image garbage collection is always run. Default: 90%
4	The percent of disk usage before which image garbage collection is never run. Lowest disk usage to garbage collect to. Default: 80%

To view all available kubelet options:

$ kubelet -h

This can also be set during an advanced installation using the openshift_node_kubelet_args variable. For example:

openshift_node_kubelet_args={'max-pods': ['40'], 'resolv-conf': ['/etc/resolv.conf'],  'image-gc-high-threshold': ['90'], 'image-gc-low-threshold': ['80']}

Setting Maximum Pods Per Node

In the /etc/origin/node/node-config.yaml file, two parameters control the maximum number of pods that can be scheduled to a node: pods-per-core and max-pods. When both options are in use, the lower of the two limits the number of pods on a node.

pods-per-core sets the number of pods the node can run based on the number of processor cores on the node. For example, if pods-per-core is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node will be 40.

kubeletArguments:
  pods-per-core:
    - "10"

Setting pods-per-core to 0 disables this limit.

max-pods sets the number of pods the node can run to a fixed value, regardless of the properties of the node.

kubeletArguments:
  max-pods:
    - "250"

The values shown in the examples above are the defaults: pods-per-core is 10 and max-pods is 250. This means that, by default, pods-per-core is the limiting factor unless the node has 25 or more cores.
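
The interaction between the two limits is simple arithmetic: the lower of cores × pods-per-core and max-pods applies, and a pods-per-core of 0 leaves max-pods as the only limit. The sketch below works this out; effective_pod_limit is a hypothetical helper name, not an OpenShift command.

```shell
# Print the pod limit that applies to a node.
# Arguments: number of cores, pods-per-core, max-pods.
# A pods-per-core of 0 disables that limit, so max-pods alone applies.
effective_pod_limit() (
    cores="$1"; per_core="$2"; max_pods="$3"
    by_core=$((cores * per_core))
    if [ "$per_core" -eq 0 ] || [ "$by_core" -gt "$max_pods" ]; then
        echo "$max_pods"
    else
        echo "$by_core"
    fi
)
```

With the defaults, effective_pod_limit 4 10 250 prints 40, while effective_pod_limit 32 10 250 prints 250.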

Resetting Docker Storage

As you download Docker images and run and delete containers, Docker does not always free up mapped disk space. As a result, over time you can run out of space on a node, which might prevent OpenShift Container Platform from being able to create new pods or cause pod creation to take several minutes.

For example, the following shows pods that are still in the ContainerCreating state after six minutes, and the events log shows a FailedSync event.

$ oc get pod
NAME                               READY     STATUS              RESTARTS   AGE
cakephp-mysql-persistent-1-build   0/1       ContainerCreating   0          6m
mysql-1-9767d                      0/1       ContainerCreating   0          2m
mysql-1-deploy                     0/1       ContainerCreating   0          6m

$ oc get events
LASTSEEN   FIRSTSEEN   COUNT     NAME                               KIND                    SUBOBJECT                     TYPE      REASON                         SOURCE                                                 MESSAGE
6m         6m          1         cakephp-mysql-persistent-1-build   Pod                                                   Normal    Scheduled                      default-scheduler                                      Successfully assigned cakephp-mysql-persistent-1-build to ip-172-31-71-195.us-east-2.compute.internal
2m         5m          4         cakephp-mysql-persistent-1-build   Pod                                                   Warning   FailedSync                     kubelet, ip-172-31-71-195.us-east-2.compute.internal   Error syncing pod
2m         4m          4         cakephp-mysql-persistent-1-build   Pod                                                   Normal    SandboxChanged                 kubelet, ip-172-31-71-195.us-east-2.compute.internal   Pod sandbox changed, it will be killed and re-created.
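
To spot this symptom quickly, you can filter the pod list for the ContainerCreating status. The sketch below assumes the five-column oc get pod layout shown above (NAME, READY, STATUS, RESTARTS, AGE); creating_pods is a hypothetical helper name.

```shell
# Print the name and age of each pod in the current project whose STATUS
# is ContainerCreating. Assumes columns: NAME READY STATUS RESTARTS AGE.
creating_pods() {
    oc get pod --no-headers | awk '$3 == "ContainerCreating" { print $1, $5 }'
}
```

Pods that stay in this list for several minutes are candidates for the investigation described here; a pod that appears only briefly is normal.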

One solution to this problem is to reset Docker storage to remove artifacts not needed by Docker.

On the node where you want to reset Docker storage:

Run the following command to mark the node as unschedulable:
```
$ oadm manage-node <node> --schedulable=false
```
Run the following command to shut down Docker and the atomic-openshift-node service:
```
$ systemctl stop docker atomic-openshift-node
```
Run the following command to remove the local volume directory:
```
$ rm -rf /var/lib/origin/openshift.local.volumes
```
This command clears the local image cache. As a result, images, including ose-* images, will need to be re-pulled. This might result in slower pod start times while the image store recovers.
Remove the /var/lib/docker directory:
```
$ rm -rf /var/lib/docker
```
Run the following command to reset the Docker storage:
```
$ docker-storage-setup --reset
```
Run the following command to recreate the Docker storage:
```
$ docker-storage-setup
```
Recreate the /var/lib/docker directory:
```
$ mkdir /var/lib/docker
```
Run the following command to restart Docker and the atomic-openshift-node service:
```
$ systemctl start docker atomic-openshift-node
```
Run the following command to mark the node as schedulable:
```
$ oadm manage-node <node> --schedulable=true
```
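
Because the procedure is destructive, it can be worth rehearsing the exact command sequence first. The sketch below strings the steps above together and, by default, only prints them; reset_docker_storage and the DRY_RUN variable are hypothetical names introduced for this example. Set DRY_RUN=0 to execute the commands for real.

```shell
# Print (default) or run (DRY_RUN=0) the Docker storage reset steps for a node.
# reset_docker_storage and DRY_RUN are hypothetical names for this sketch.
# The oadm steps assume cluster-admin credentials are available on this host.
reset_docker_storage() (
    node="$1"
    [ -n "$node" ] || { echo "usage: reset_docker_storage <node>" >&2; exit 1; }
    run() {
        if [ "${DRY_RUN:-1}" = 1 ]; then echo "$*"; else "$@"; fi
    }
    run oadm manage-node "$node" --schedulable=false &&
    run systemctl stop docker atomic-openshift-node &&
    run rm -rf /var/lib/origin/openshift.local.volumes &&
    run rm -rf /var/lib/docker &&
    run docker-storage-setup --reset &&
    run docker-storage-setup &&
    run mkdir /var/lib/docker &&
    run systemctl start docker atomic-openshift-node &&
    run oadm manage-node "$node" --schedulable=true
)
```

In real mode the chain stops at the first failing command, which leaves the node unschedulable so that no new pods land on a half-reset host.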

Changing Node Traffic Interface

By default, DNS routes all node traffic. During node registration, the master receives the node IP addresses from the DNS configuration, and therefore accessing nodes via DNS is the most flexible solution for most deployments.

If your deployment is using a cloud provider, then the node gets the IP information from the cloud provider. However, openshift-sdn attempts to determine the IP through a variety of methods, including a DNS lookup on the nodeName (if set), or on the system hostname (if nodeName is not set).

However, you may need to change the node traffic interface. For example, where:

OpenShift Container Platform is installed in a cloud provider where internal hostnames are not configured/resolvable by all hosts.
The node’s IP from the master’s perspective is not the same as the node’s IP from its own perspective.

Configuring the openshift_set_node_ip Ansible variable forces node traffic through an interface other than the default network interface.

To change the node traffic interface:

Set the openshift_set_node_ip Ansible variable to true.
Set the openshift_ip variable to the IP address for the node you want to configure.

Although openshift_set_node_ip can be useful as a workaround for the cases stated in this section, it is generally not suited for production environments. This is because the node will no longer function properly if it receives a new IP address.