kubeletConfig: podsPerCore: 10
This topic provides recommended host practices for OpenShift Container Platform.
The OpenShift Container Platform node configuration file contains important options. For
example, two parameters control the maximum number of pods that can be scheduled
to a node:
When both options are in use, the lower of the two values limits the number of pods on a node. Exceeding these values can result in:
Increased CPU utilization.
Slow pod scheduling.
Potential out-of-memory scenarios, depending on the amount of memory in the node.
Exhausting the pool of IP addresses.
Resource overcommitting, leading to poor user application performance.
In Kubernetes, a pod that is holding a single container actually uses two containers. The second container is used to set up networking prior to the actual container starting. Therefore, a system running 10 pods will actually have 20 containers running.
podsPerCore sets the number of pods the node can run based on the number of
processor cores on the node. For example, if
podsPerCore is set to
10 on a
node with 4 processor cores, the maximum number of pods allowed on the node will
kubeletConfig: podsPerCore: 10
0 disables this limit. The default is
podsPerCore cannot exceed
maxPods sets the number of pods the node can run to a fixed value, regardless
of the properties of the node.
kubeletConfig: maxPods: 250
The kubelet configuration is currently serialized as an Ignition configuration, so it can be directly edited. However, there is also a new
kubelet-config-controller added to the Machine Config Controller (MCC). This allows you to create a
KubeletConfig custom resource (CR) to edit the kubelet parameters.
$ oc get machineconfig
This provides a list of the available machine configuration objects you can select. By default, the two kubelet-related configs are
To check the current value of max pods per node, run:
# oc describe node <node-ip> | grep Allocatable -A6
value: pods: <value>.
# oc describe node ip-172-31-128-158.us-east-2.compute.internal | grep Allocatable -A6
Allocatable: attachable-volumes-aws-ebs: 25 cpu: 3500m hugepages-1Gi: 0 hugepages-2Mi: 0 memory: 15341844Ki pods: 250
To set the max pods per node on the worker nodes, create a custom resource file that contains the kubelet configuration. For example,
apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: name: set-max-pods spec: machineConfigPoolSelector: matchLabels: custom-kubelet: large-pods kubeletConfig: maxPods: 500
The rate at which the kubelet talks to the API server depends on queries per second (QPS) and burst values. The default values,
kubeAPIBurst, are good enough if there are limited pods running on each node. Updating the kubelet QPS and burst rates is recommended if there are enough CPU and memory resources on the node:
apiVersion: machineconfiguration.openshift.io/v1 kind: KubeletConfig metadata: name: set-max-pods spec: machineConfigPoolSelector: matchLabels: custom-kubelet: large-pods kubeletConfig: maxPods: <pod_count> kubeAPIBurst: <burst_rate> kubeAPIQPS: <QPS>
$ oc label machineconfigpool worker custom-kubelet=large-pods
$ oc create -f change-maxPods-cr.yaml
$ oc get kubeletconfig
This should return
Depending on the number of worker nodes in the cluster, wait for the worker nodes to be rebooted one by one. For a cluster with 3 worker nodes, this could take about 10 to 15 minutes.
maxPods changing for the worker nodes:
$ oc describe node
Verify the change by running:
$ oc get kubeletconfigs set-max-pods -o yaml
This should show a status of
By default, only one machine is allowed to be unavailable when applying the kubelet-related configuration to the available worker nodes. For a large cluster, it can take a long time for the configuration change to be reflected. At any time, you can adjust the number of machines that are updating to speed up the process.
$ oc edit machineconfigpool worker
maxUnavailable to the desired value.
spec: maxUnavailable: <node_count>
When setting the value, consider the number of worker nodes that can be unavailable without affecting the applications running on the cluster.
The master node resource requirements depend on the number of nodes in the cluster. The following master node size recommendations are based on the results of control plane density focused testing.
|Number of worker nodes||CPU cores||Memory (GB)|
Because you cannot modify the master node size in a running OpenShift Container Platform 4.4 cluster, you must estimate your total node count and use the suggested master size during installation.
In OpenShift Container Platform 4.4, half of a CPU core (500 millicore) is now reserved by the system by default compared to OpenShift Container Platform 3.11 and previous versions. The sizes are determined taking that into consideration.
For large and dense clusters, etcd can suffer from poor performance if the keyspace grows excessively large and exceeds the space quota.Periodic maintenance of etcd including defragmentation needs to be done to free up space in the data store. It is highly recommended that you monitor
Prometheus for etcd metrics and defragment it when needed before etcd raises a cluster-wide alarm that puts the cluster into a maintenance mode, which only accepts key reads and deletes. Some of the key metrics to monitor are
etcd_server_quota_backend_bytes which is the current quota limit,
etcd_mvcc_db_total_size_in_use_in_bytes which indicates the actual database usage after a history compaction, and
etcd_debugging_mvcc_db_total_size_in_bytes which shows the database size including free space waiting for defragmentation.
Etcd replicates requests among all the members, so its performance strongly depends on network input/output (IO) latency. High network latencies result in etcd heartbeats taking longer than the election timeout, which leads to leader elections that are disruptive to the cluster. A key metric to monitor on a deployed OpenShift Container Platform cluster is the 99th percentile of etcd network peer latency on each etcd cluster member. Use Prometheus to track the metric.
histogram_quantile(0.99, rate(etcd_network_peer_round_trip_time_seconds_bucket[2m])) reports the round trip time for etcd to finish replicating the client requests between the members; it should be less than 50 ms.