[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved] - [Hard-Eviction-Thresholds]
To provide more reliable scheduling and minimize node resource overcommitment, each node can reserve a portion of its resources for use by all underlying node components (such as kubelet, kube-proxy) and the remaining system components (such as sshd, NetworkManager) on the host. Once specified, the scheduler has more information about the resources (e.g., memory, CPU) a node has allocated for pods.
CPU and memory resources reserved for node components in OpenShift Container Platform are based on two node settings:
Setting | Description |
---|---|
|
Resources reserved for node components. Default is none. |
|
Resources reserved for the remaining system components. Default is none. |
If a flag is not set, it defaults to 0
. If none of the flags are set, the
allocated resource is set to the node’s capacity as it was before the
introduction of allocatable resources.
An allocated amount of a resource is computed based on the following formula:
[Allocatable] = [Node Capacity] - [kube-reserved] - [system-reserved] - [Hard-Eviction-Thresholds]
The withholding of |
If [Allocatable]
is negative, it is set to 0.
Each node reports system resources utilized by the container runtime and kubelet.
To better aid your ability to configure --system-reserved
and --kube-reserved
,
you can introspect corresponding node’s resource usage using the node summary API,
which is accessible at /api/v1/nodes/<node>/proxy/stats/summary
.
The node is able to limit the total amount of resources that pods may consume based on the configured allocatable value. This feature significantly improves the reliability of the node by preventing pods from starving system services (for example: container runtime, node agent, etc.) for resources. It is strongly encouraged that administrators reserve resources based on the desired node utilization target in order to improve node reliability.
The node enforces resource constraints using a new cgroup hierarchy that enforces quality of service. All pods are launched in a dedicated cgroup hierarchy separate from system daemons.
Optionally, the node can be made to enforce kube-reserved and system-reserved by
specifying those tokens in the enforce-node-allocatable flag. If specified, the
corresponding --kube-reserved-cgroup
or --system-reserved-cgroup
needs to be provided.
In future releases, the node and container runtime will be packaged in a common cgroup
separate from system.slice
. Until that time, we do not recommend users
change the default value of enforce-node-allocatable flag.
Administrators should treat system daemons similar to Guaranteed pods. System daemons can burst within their bounding control groups and this behavior needs to be managed as part of cluster deployments. Enforcing system-reserved limits can lead to critical system services being CPU starved or OOM killed on the node. The recommendation is to enforce system-reserved only if operators have profiled their nodes exhaustively to determine precise estimates and are confident in their ability to recover if any process in that group is OOM killed.
As a result, we strongly recommended that users only enforce node allocatable for
pods
by default, and set aside appropriate reservations for system daemons to maintain
overall node reliability.
If a node is under memory pressure, it can impact the entire node and all pods running on it. If a system daemon is using more than its reserved amount of memory, an OOM event may occur that can impact the entire node and all pods running on it. To avoid (or reduce the probability of) system OOMs the node provides out-of-resource handling.
You can reserve some memory using the --eviction-hard
flag. The node attempts to evict
pods whenever memory availability on the node drops below the absolute value or percentage.
If system daemons do not exist on a node, pods are limited to the memory
capacity - eviction-hard
. For this reason, resources set aside as a buffer for eviction
before reaching out of memory conditions are not available for pods.
The following is an example to illustrate the impact of node allocatable for memory:
Node capacity is 32Gi
--kube-reserved is 2Gi
--system-reserved is 1Gi
--eviction-hard is set to 100Mi
.
For this node, the effective node allocatable value is 28.9Gi
. If the node
and system components use up all their reservation, the memory available for pods is 28.9Gi
,
and kubelet will evict pods when it exceeds this usage.
If you enforce node allocatable (28.9Gi
) via top level cgroups, then pods can never exceed 28.9Gi
.
Evictions would not be performed unless system daemons are consuming more than 3.1Gi
of memory.
If system daemons do not use up all their reservation, with the above example,
pods would face memcg OOM kills from their bounding cgroup before node evictions kick in.
To better enforce QoS under this situation, the node applies the hard eviction thresholds to
the top-level cgroup for all pods to be Node Allocatable + Eviction Hard Thresholds
.
If system daemons do not use up all their reservation, the node will evict pods whenever
they consume more than 28.9Gi
of memory. If eviction does not occur in time, a pod
will be OOM killed if pods consume 29Gi
of memory.
OpenShift Container Platform supports the CPU and memory resource types for allocation. If
your administrator enabled the ephemeral storage technology preview, the
ephemeral-resource
resource type is supported as well. For the cpu
type, the
resource quantity is specified in units of cores, such as 200m
, 0.5
, or 1
.
For memory
and ephemeral-storage
, it is specified in units of bytes,
such as 200Ki
, 50Mi
, or 5Gi
.
As an administrator, you can set these using a Custom Resource (CR) through a set of <resource_type>=<resource_quantity>
pairs
(e.g., cpu=200m,memory=512Mi).
To help you determine setting for --system-reserved
and --kube-reserved
you can introspect the corresponding node’s resource usage
using the node summary API, which is accessible at /api/v1/nodes/<node>/proxy/stats/summary
. Enter the following command for your node:
$ oc get --raw /api/v1/nodes/<node>/proxy/stats/summary
For example, to access the resources from cluster.node22
node, you can enter:
$ oc get --raw /api/v1/nodes/cluster.node22/proxy/stats/summary { "node": { "nodeName": "cluster.node22", "systemContainers": [ { "cpu": { "usageCoreNanoSeconds": 929684480915, "usageNanoCores": 190998084 }, "memory": { "rssBytes": 176726016, "usageBytes": 1397895168, "workingSetBytes": 1050509312 }, "name": "kubelet" }, { "cpu": { "usageCoreNanoSeconds": 128521955903, "usageNanoCores": 5928600 }, "memory": { "rssBytes": 35958784, "usageBytes": 129671168, "workingSetBytes": 102416384 }, "name": "runtime" } ] } }
Obtain the label associated with the static Machine Config Pool CRD for the type of node you want to configure. Perform one of the following steps:
View the Machine Config Pool:
$ oc describe machineconfigpool <name>
For example:
$ oc describe machineconfigpool worker
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
creationTimestamp: 2019-02-08T14:52:39Z
generation: 1
labels:
custom-kubelet: small-pods (1)
1 | If a label has been added it appears under labels . |
If the label is not present, add a key/value pair:
$ oc label machineconfigpool worker custom-kubelet=small-pods
Create a Custom Resource (CR) for your configuration change.
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: set-allocatable (1)
spec:
machineConfigPoolSelector:
matchLabels:
custom-kubelet: small-pods (2)
kubeletConfig:
systemReserved:
cpu: 500m
memory: 512Mi
kubeReserved:
cpu: 500m
memory: 512Mi