Recommended host practices | Scalability and performance

Recommended node host practices
Create a KubeletConfig CRD to edit kubelet parameters
Master node sizing
Recommended etcd practices
Additional resources

This topic provides recommended host practices for OpenShift Container Platform.

Recommended node host practices

The OpenShift Container Platform node configuration file contains important options. For example, two parameters control the maximum number of pods that can be scheduled to a node: podsPerCore and maxPods.

When both options are in use, the lower of the two values limits the number of pods on a node. Exceeding these values can result in:

Increased CPU utilization.
Slow pod scheduling.
Potential out-of-memory scenarios, depending on the amount of memory in the node.
Exhausting the pool of IP addresses.
Resource overcommitting, leading to poor user application performance.

In Kubernetes, a pod that is holding a single container actually uses two containers. The second container is used to set up networking prior to the actual container starting. Therefore, a system running 10 pods will actually have 20 containers running.

podsPerCore sets the number of pods the node can run based on the number of processor cores on the node. For example, if podsPerCore is set to 10 on a node with 4 processor cores, the maximum number of pods allowed on the node will be 40.

kubeletConfig:
  podsPerCore: 10

Setting podsPerCore to 0 disables this limit. The default is 0. podsPerCore cannot exceed maxPods.

maxPods sets the number of pods the node can run to a fixed value, regardless of the properties of the node.

 kubeletConfig:
    maxPods: 250

Create a KubeletConfig CRD to edit kubelet parameters

The kubelet configuration is currently serialized as an ignition configuration, so it can be directly edited. However, there is also a new kubelet-config-controller added to the Machine Config Controller (MCC). This allows you to create a KubeletConfig custom resource (CR) to edit the kubelet parameters.

Procedure

Run:
```
$ oc get machineconfig
```
This provides a list of the available machine configuration objects you can select. By default, the two kubelet-related configs are 01-master-kubelet and 01-worker-kubelet.

To check the current value of max Pods per node, run:

# oc describe node <node-ip> | grep Allocatable -A6

Look for value: pods: <value>.

For example:

# oc describe node ip-172-31-128-158.us-east-2.compute.internal | grep Allocatable -A6
Allocatable:
 attachable-volumes-aws-ebs:  25
 cpu:                         3500m
 hugepages-1Gi:               0
 hugepages-2Mi:               0
 memory:                      15341844Ki
 pods:                        250

To set the max Pods per node on the worker nodes, create a YAML file that contains the kubelet configuration. For example, max-worker-pods.yaml:
```
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigSelector: 01-worker-kubelet
  kubeletConfig:
    maxPods: 250
```
The rate at which the kubelet talks to the API server depends on queries per second (QPS) and burst values. The default values, 5 for kubeAPIQPS and 10 for kubeAPIBurst, are good enough if there are limited pods running on each node. Updating the kubelet QPS and burst rates is recommended if there are enough CPU and memory resources on the node:
```
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: set-max-pods
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: large-pods
  kubeletConfig:
    maxPods: <pod_count>
    kubeAPIBurst: <burst_rate>
    kubeAPIQPS: <QPS>
```
1. Run:
  $ oc label machineconfigpool worker custom-kubelet=large-pods
2. Run:
  $ oc create -f change-maxPods-cr.yaml
3. Run:
  $ oc get kubeletconfig
  This should return set-max-pods.
  
  Depending on the number of worker nodes in the cluster, wait for the worker nodes to be rebooted one by one. For a cluster with 3 worker nodes, this could take about 10 to 15 minutes.
Check for maxPods changing for the worker nodes:
```
$ oc describe node
```
1. Verify the change by running:
  $ oc get kubeletconfigs set-max-pods -o yaml
  This should show a status of True and type:Success.

Procedure

By default, only one machine is allowed to be unavailable when applying the kubelet-related configuration to the available worker nodes. For a large cluster, it can take a long time for the configuration change to be reflected. At any time, you can adjust the number of machines that are updating to speed up the process. . Run:

$ oc edit machineconfigpool worker

Set maxUnavailable to the desired value.
```
spec:
  maxUnavailable: <node_count>
```
When setting the value, consider the number of worker nodes that can be unavailable without affecting the applications running on the cluster.

Master node sizing

The master node resource requirements depend on the number of nodes in the cluster. The following master node size recommendations are based on the results of control plane density focused testing.

Number of worker nodes	CPU cores	Memory (GB)
25	4	16
100	8	32
250	16	64

Because you cannot modify the master node size in a running OpenShift Container Platform 4.1 cluster, you must estimate your total node count and use the suggested master size during installation.

In OpenShift Container Platform 4.1, half of a CPU core (500 millicore) is now reserved by the system by default compared to OpenShift Container Platform 3.11 and previous versions. The sizes are determined taking that into consideration.

Recommended etcd practices

For large and dense clusters, etcd can suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd including defragmentation needs to be done to free up space in the data store. It is highly recommended that you monitor Prometheus for etcd metrics and defragment it when needed before etcd raises a cluster-wide alarm that puts the cluster into a maintenance mode, which only accepts key reads and deletes. Some of the key metrics to monitor are etcd_server_quota_backend_bytes which is the current quota limit, etcd_mvcc_db_total_size_in_use_in_bytes which indicates the actual database usage after a history compaction, and etcd_debugging_mvcc_db_total_size_in_bytes which shows the database size including free space waiting for defragmentation.

Additional resources

OpenShift Container Platform cluster maximums