Understanding overcommitment

Requests and limits enable administrators to allow and manage the overcommitment of resources on a node. The scheduler uses requests for scheduling your container and providing a minimum service guarantee. Limits constrain the amount of compute resource that may be consumed on your node.

Azure Red Hat OpenShift administrators can control the level of overcommit and manage container density on nodes by configuring masters to override the ratio between request and limit set on developer containers. In conjunction with a per-project LimitRange specifying limits and defaults, this adjusts the container limit and request to achieve the desired level of overcommit.

These overrides have no effect if no limits have been set on containers. To ensure that the overrides apply, create a LimitRange object with default limits, either per individual project or in the project template.

After these overrides, the container limits and requests must still be validated by any LimitRange objects in the project. It is possible, for example, for developers to specify a limit close to the minimum limit, and have the request then be overridden below the minimum limit, causing the pod to be forbidden. This unfortunate user experience should be addressed with future work, but for now, configure this capability and LimitRanges with caution.

Understanding resource requests and overcommitment

For each compute resource, a container may specify a resource request and limit. Scheduling decisions are made based on the request to ensure that a node has enough capacity available to meet the requested value. If a container specifies limits, but omits requests, the requests are defaulted to the limits. A container is not able to exceed the specified limit on the node.

The enforcement of limits is dependent upon the compute resource type. If a container makes no request or limit, the container is scheduled to a node with no resource guarantees. In practice, the container is able to consume as much of the specified resource as is available with the lowest local priority. In low resource situations, containers that specify no resource requests are given the lowest quality of service.

Scheduling is based on resources requested, while quota and hard limits refer to resource limits, which can be set higher than requested resources. The difference between request and limit determines the level of overcommit. For instance, if a container is given a memory request of 1Gi and a memory limit of 2Gi, it is scheduled based on the 1Gi request being available on the node but can use up to 2Gi, so it is 200% overcommitted.
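The arithmetic in this example can be checked with a short shell sketch. The function name overcommit_percent is hypothetical and is not part of any OpenShift tooling; it takes the request and the limit in the same unit (MiB here) and prints the limit as a percentage of the request.

```shell
# Hypothetical helper (not an OpenShift command): print the limit as a
# percentage of the request. Both arguments must use the same unit.
overcommit_percent() {
  echo $(( $2 * 100 / $1 ))
}

# 1Gi request (1024 MiB) with a 2Gi limit (2048 MiB)
overcommit_percent 1024 2048   # prints 200
```

A result of 100 means the request and limit are equal, so the container is not overcommitted.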

Configuring Buffer Chunk Limiting for Fluentd

If the Fluentd log collector is unable to keep up with a high number of logs, Fluentd performs file buffering to reduce memory usage and prevent data loss.

Fluentd file buffering stores records in chunks. Chunks are stored in buffers.

You can tune file buffering in your cluster by editing environment variables in the Fluentd Daemonset:

To modify the FILE_BUFFER_LIMIT or BUFFER_SIZE_LIMIT parameters in the Fluentd Daemonset, you must set cluster logging to the unmanaged state. Operators in an unmanaged state are unsupported and the cluster administrator assumes full control of the individual component configurations and upgrades.

  • BUFFER_SIZE_LIMIT. This parameter determines the maximum size of each chunk file before Fluentd creates a new chunk. The default is 8M. This parameter sets the Fluentd chunk_limit_size variable.

    A high BUFFER_SIZE_LIMIT can collect more records per chunk file. However, bigger records take longer to be sent to the logstore.

  • FILE_BUFFER_LIMIT. This parameter determines the file buffer size per logging output. This value is only a request based on the available space on the node where a Fluentd pod is scheduled. Azure Red Hat OpenShift does not allow Fluentd to exceed the node capacity. The default is 256Mi.

    A high FILE_BUFFER_LIMIT could translate to a higher BUFFER_QUEUE_LIMIT based on the number of outputs. However, if the node’s space is under pressure, Fluentd can fail.

    By default, the number_of_outputs is 1 if all the logs are sent to a single resource, and is incremented by 1 for each additional resource. You might have multiple outputs if you use the Log Forwarding API, the Fluentd Forward protocol, or syslog protocol to forward logs to external locations.

    The permanent volume size must be larger than FILE_BUFFER_LIMIT multiplied by the number of outputs.

  • BUFFER_QUEUE_LIMIT. This parameter is the maximum number of buffer chunks allowed. The BUFFER_QUEUE_LIMIT parameter is not directly tunable. Azure Red Hat OpenShift calculates this value based on the number of logging outputs, the chunk size, and the filesystem space available. The default is 32 chunks. To change the BUFFER_QUEUE_LIMIT, you must change the value of FILE_BUFFER_LIMIT. The BUFFER_QUEUE_LIMIT parameter sets the Fluentd queue_limit_length parameter.

    Azure Red Hat OpenShift calculates the BUFFER_QUEUE_LIMIT as (FILE_BUFFER_LIMIT / (number_of_outputs * BUFFER_SIZE_LIMIT)).

    Using the default set of values, the value of BUFFER_QUEUE_LIMIT is 32:

    • FILE_BUFFER_LIMIT = 256Mi

    • number_of_outputs = 1

    • BUFFER_SIZE_LIMIT = 8M
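
The BUFFER_QUEUE_LIMIT formula can be sketched as shell arithmetic. The function name buffer_queue_limit is hypothetical, and sizes are passed as plain MiB numbers for simplicity; this only illustrates the calculation and is not the code that Azure Red Hat OpenShift runs.

```shell
# Hypothetical helper:
#   FILE_BUFFER_LIMIT / (number_of_outputs * BUFFER_SIZE_LIMIT)
# Arguments: file buffer limit (MiB), number of outputs, chunk size (MiB).
buffer_queue_limit() {
  echo $(( $1 / ($2 * $3) ))
}

# Defaults from this section: 256Mi file buffer, 1 output, 8M chunks
buffer_queue_limit 256 1 8   # prints 32

# The same file buffer split across 2 outputs
buffer_queue_limit 256 2 8   # prints 16
```

The second call shows why adding outputs without raising FILE_BUFFER_LIMIT reduces the number of chunks each output can queue.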


Azure Red Hat OpenShift uses the Fluentd file buffer plug-in to configure how the chunks are stored. You can see the location of the buffer file using the following command:

$ oc get cm fluentd -o json | jq -r '.data."fluent.conf"'
   @type file (1)
   path '/var/lib/fluentd/retry-elasticsearch' (2)
1 The Fluentd file buffer plugin. Do not change this value.
2 The path where buffer chunks are stored.
  • Set cluster logging to the unmanaged state. Operators in an unmanaged state are unsupported and the cluster administrator assumes full control of the individual component configurations and upgrades.


To configure Buffer Chunk Limiting:

  1. Edit either of the following parameters in the fluentd Daemonset.

              - name: FILE_BUFFER_LIMIT (1)
                value: "256Mi"
              - name: BUFFER_SIZE_LIMIT (2)
                value: 8Mi
    1 Specify the Fluentd file buffer size per output.
    2 Specify the maximum size of each Fluentd buffer chunk.

Understanding compute resources and containers

The node-enforced behavior for compute resources is specific to the resource type.

Understanding container CPU requests

A container is guaranteed the amount of CPU it requests and is additionally able to consume excess CPU available on the node, up to any limit specified by the container. If multiple containers are attempting to use excess CPU, CPU time is distributed based on the amount of CPU requested by each container.

For example, if one container requested 500m of CPU time and another container requested 250m of CPU time, then any extra CPU time available on the node is distributed among the containers in a 2:1 ratio. If a container specified a limit, it will be throttled not to use more CPU than the specified limit. CPU requests are enforced using the CFS shares support in the Linux kernel. By default, CPU limits are enforced using the CFS quota support in the Linux kernel over a 100ms measuring interval, though this can be disabled.
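
As a rough illustration of the 2:1 split described here, the following sketch divides a pool of spare CPU in proportion to each container's request. The function name and the 300m of spare CPU are assumptions made for the example; the kernel's CFS scheduler performs this weighting itself, and no such command exists.

```shell
# Hypothetical helper: spare CPU (millicores) that one container receives,
# weighted by its request relative to the sum of all competing requests.
# Arguments: spare CPU, this container's request, total of all requests.
spare_cpu_share() {
  echo $(( $1 * $2 / $3 ))
}

# 300m of spare CPU shared by containers requesting 500m and 250m
spare_cpu_share 300 500 750   # prints 200
spare_cpu_share 300 250 750   # prints 100
```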

Understanding container memory requests

A container is guaranteed the amount of memory it requests. A container can use more memory than requested, but once it exceeds its requested amount, it could be terminated in a low memory situation on the node. If a container uses less memory than requested, it will not be terminated unless system tasks or daemons need more memory than was accounted for in the node’s resource reservation. If a container specifies a limit on memory, it is immediately terminated if it exceeds the limit amount.
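
The rules in this paragraph can be summarized as a small decision sketch. The function name memory_status and its output strings are hypothetical; the sketch only restates the behavior described here and ignores the case where system daemons exceed the node's resource reservation.

```shell
# Hypothetical helper: describe the termination risk for one container.
# Arguments: current usage, request, and limit in MiB (limit 0 = not set).
memory_status() {
  usage=$1
  request=$2
  limit=$3
  if [ "$limit" -ne 0 ] && [ "$usage" -gt "$limit" ]; then
    echo "terminated: over limit"
  elif [ "$usage" -gt "$request" ]; then
    echo "at risk: over request, may be terminated under memory pressure"
  else
    echo "within request"
  fi
}

memory_status 200 256 512
memory_status 300 256 512
memory_status 600 256 512
```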

Understanding overcommitment and quality of service classes

A node is overcommitted when it has a pod scheduled that makes no request, or when the sum of limits across all pods on that node exceeds available machine capacity.

In an overcommitted environment, it is possible that the pods on the node will attempt to use more compute resource than is available at any given point in time. When this occurs, the node must give priority to one pod over another. The facility used to make this decision is referred to as a Quality of Service (QoS) Class.

For each compute resource, a container is classified into one of three QoS classes, listed in decreasing order of priority:

Table 1. Quality of Service Classes

Priority | Class Name | Description
1 (highest) | Guaranteed | If limits and optionally requests are set (not equal to 0) for all resources and they are equal, then the container is classified as Guaranteed.
2 | Burstable | If requests and optionally limits are set (not equal to 0) for all resources, and they are not equal, then the container is classified as Burstable.
3 (lowest) | BestEffort | If requests and limits are not set for any of the resources, then the container is classified as BestEffort.
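
The classification rules in Table 1 can be sketched for a single resource as follows. The function name qos_class is hypothetical, and passing 0 to mean "not set" is a convention chosen for this example; the sketch follows the simplified rules in this section and is not how the node itself computes the class.

```shell
# Hypothetical helper: classify one resource of a container.
# Arguments: request and limit; pass 0 for a value that is not set.
qos_class() {
  request=$1
  limit=$2
  # A limit without a request defaults the request to the limit.
  if [ "$request" -eq 0 ]; then
    request=$limit
  fi
  if [ "$request" -eq 0 ] && [ "$limit" -eq 0 ]; then
    echo BestEffort
  elif [ "$request" -eq "$limit" ]; then
    echo Guaranteed
  else
    echo Burstable
  fi
}

qos_class 512 512   # prints Guaranteed
qos_class 256 512   # prints Burstable
qos_class 0 0       # prints BestEffort
```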

Memory is an incompressible resource, so in low memory situations, containers that have the lowest priority are terminated first:

  • Guaranteed containers are considered top priority, and are guaranteed to only be terminated if they exceed their limits, or if the system is under memory pressure and there are no lower priority containers that can be evicted.

  • Burstable containers under system memory pressure are more likely to be terminated once they exceed their requests and no other BestEffort containers exist.

  • BestEffort containers are treated with the lowest priority. Processes in these containers are first to be terminated if the system runs out of memory.

Understanding how to reserve memory across quality of service tiers

You can use the qos-reserved parameter to specify a percentage of memory to be reserved by a pod in a particular QoS level. This feature attempts to reserve requested resources to exclude pods from lower QoS classes from using resources requested by pods in higher QoS classes.

Azure Red Hat OpenShift uses the qos-reserved parameter as follows:

  • A value of qos-reserved=memory=100% will prevent the Burstable and BestEffort QoS classes from consuming memory that was requested by a higher QoS class. This increases the risk of inducing OOM on BestEffort and Burstable workloads in favor of increasing memory resource guarantees for Guaranteed and Burstable workloads.

  • A value of qos-reserved=memory=50% will allow the Burstable and BestEffort QoS classes to consume half of the memory requested by a higher QoS class.

  • A value of qos-reserved=memory=0% will allow the Burstable and BestEffort QoS classes to consume up to the full node allocatable amount if available, but increases the risk that a Guaranteed workload will not have access to requested memory. This condition effectively disables this feature.
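
To make these percentages concrete, the sketch below computes how much of the memory requested by a higher QoS class remains usable by lower classes for a given qos-reserved value. The helper name is hypothetical, and the 4096 MiB request is an example figure, not a default.

```shell
# Hypothetical helper: memory (MiB) requested by a higher QoS class that
# lower classes may still consume.
# Arguments: requested memory (MiB), qos-reserved percentage (0-100).
unreserved_memory() {
  echo $(( $1 * (100 - $2) / 100 ))
}

unreserved_memory 4096 100   # prints 0: nothing is left for lower classes
unreserved_memory 4096 50    # prints 2048: half of the request
unreserved_memory 4096 0     # prints 4096: the full request
```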

Understanding swap memory and QoS

You can disable swap by default on your nodes in order to preserve quality of service (QoS) guarantees. Otherwise, physical resources on a node can be oversubscribed, affecting the resource guarantees the Kubernetes scheduler makes during pod placement.

For example, if two guaranteed pods have reached their memory limit, each container could start using swap memory. Eventually, if there is not enough swap space, processes in the pods can be terminated due to the system being oversubscribed.

Failing to disable swap results in nodes not recognizing that they are experiencing MemoryPressure, so pods do not receive the memory they requested at scheduling time. As a result, additional pods are placed on the node, which further increases memory pressure and ultimately increases your risk of experiencing a system out of memory (OOM) event.

If swap is enabled, any out-of-resource handling eviction thresholds for available memory will not work as expected. Take advantage of out-of-resource handling to allow pods to be evicted from a node when it is under memory pressure, and rescheduled on an alternative node that has no such pressure.

Understanding node overcommitment

In an overcommitted environment, it is important to properly configure your node to provide best system behavior.

When the node starts, it ensures that the kernel tunable flags for memory management are set properly. The kernel should never fail memory allocations unless it runs out of physical memory.

To ensure this behavior, Azure Red Hat OpenShift configures the kernel to always overcommit memory by setting the vm.overcommit_memory parameter to 1, overriding the default operating system setting.

Azure Red Hat OpenShift also configures the kernel not to panic when it runs out of memory by setting the vm.panic_on_oom parameter to 0. A setting of 0 instructs the kernel to call oom_killer in an Out of Memory (OOM) condition, which kills processes based on priority.

You can view the current setting by running the following commands on your nodes:

$ sysctl -a |grep commit
vm.overcommit_memory = 1

$ sysctl -a |grep panic
vm.panic_on_oom = 0

The above flags should already be set on nodes, and no further action is required.
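
If you prefer to compare each value against the expected setting, a sketch like the following reads the value from /proc/sys/vm, which is where Linux exposes these sysctl settings as files. The function name check_vm_setting is hypothetical, and the optional third argument (an alternative directory) exists only so that the helper can be tried out without touching a real node.

```shell
# Hypothetical helper: compare a kernel memory setting with the expected
# value and report the result.
# Arguments: setting name, expected value, optional directory to read from
# (defaults to /proc/sys/vm).
check_vm_setting() {
  name=$1
  expected=$2
  dir=${3:-/proc/sys/vm}
  actual=$(cat "$dir/$name")
  if [ "$actual" = "$expected" ]; then
    echo "$name = $actual (ok)"
  else
    echo "$name = $actual (expected $expected)"
  fi
}
```

For example, running check_vm_setting overcommit_memory 1 and check_vm_setting panic_on_oom 0 on a node checks the two settings described in this section.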

You can also perform the following configurations for each node:

  • Disable or enforce CPU limits using CPU CFS quotas

  • Reserve resources for system processes

  • Reserve memory across quality of service tiers

Disabling or enforcing CPU limits using CPU CFS quotas

Nodes by default enforce specified CPU limits using the Completely Fair Scheduler (CFS) quota support in the Linux kernel.

  1. Obtain the label associated with the static Machine Config Pool CRD for the type of node you want to configure. Perform one of the following steps:

    1. View the Machine Config Pool:

      $ oc describe machineconfigpool <name>

      For example:

      $ oc describe machineconfigpool worker
      apiVersion: machineconfiguration.openshift.io/v1
      kind: MachineConfigPool
      metadata:
        creationTimestamp: 2019-02-08T14:52:39Z
        generation: 1
        labels:
          custom-kubelet: small-pods (1)
      1 If a label has been added it appears under labels.
    2. If the label is not present, add a key/value pair:

      $ oc label machineconfigpool worker custom-kubelet=small-pods
  2. Create a Custom Resource (CR) for your configuration change.

    Sample configuration for disabling CPU limits
    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: disable-cpu-units (1)
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: small-pods (2)
      kubeletConfig:
        cpu-cfs-quota: (3)
          - "false"
    1 Assign a name to the CR.
    2 Specify the label to apply the configuration change.
    3 Set the cpu-cfs-quota parameter to false.

If CPU limit enforcement is disabled, it is important to understand the impact that will have on your node:

  • If a container makes a request for CPU, it will continue to be enforced by CFS shares in the Linux kernel.

  • If a container makes no explicit request for CPU, but it does specify a limit, the request will default to the specified limit, and be enforced by CFS shares in the Linux kernel.

  • If a container specifies both a request and a limit for CPU, the request will be enforced by CFS shares in the Linux kernel, and the limit will have no impact on the node.
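
The three cases in this list can be sketched as a small decision helper. The function name cpu_enforcement and its output format are hypothetical; the sketch only restates which value ends up driving CFS shares and whether a CFS quota applies, and it does not read any real node configuration.

```shell
# Hypothetical helper: describe CPU enforcement for one container.
# Arguments: CPU request and limit in millicores (0 = not set), and
# whether CFS quota enforcement is enabled ("true" or "false").
cpu_enforcement() {
  request=$1
  limit=$2
  quota_enabled=$3
  # A limit without a request defaults the request to the limit.
  if [ "$request" -eq 0 ]; then
    request=$limit
  fi
  if [ "$quota_enabled" = "true" ] && [ "$limit" -ne 0 ]; then
    echo "shares from ${request}m request, quota at ${limit}m"
  else
    echo "shares from ${request}m request, no quota"
  fi
}

cpu_enforcement 250 0 false     # request only
cpu_enforcement 0 500 false     # limit only: request defaults to the limit
cpu_enforcement 250 500 false   # both set: the limit has no effect
cpu_enforcement 250 500 true    # default behavior, for comparison
```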

Reserving resources for system processes

To provide more reliable scheduling and minimize node resource overcommitment, each node can reserve a portion of its resources for the system daemons that must run on the node for your cluster to function, such as sshd. In particular, it is recommended that you reserve incompressible resources such as memory.


To explicitly reserve resources for non-pod processes, allocate node resources by specifying resources available for scheduling. For more details, see Allocating Resources for Nodes.

Disabling overcommitment for a node

When enabled, overcommitment can be disabled on each node.


To disable overcommitment in a node, run the following command on that node:

$ sysctl -w vm.overcommit_memory=0

Disabling overcommitment for a project

When enabled, overcommitment can be disabled per-project. For example, you can allow infrastructure components to be configured independently of overcommitment.


To disable overcommitment in a project:

  1. Edit the project object file.

  2. Add the following annotation:

    quota.openshift.io/cluster-resource-override-enabled: "false"
  3. Create the project object:

    $ oc create -f <file-name>.yaml