Overview

The node must preserve node stability when available compute resources are low. This is especially important when dealing with incompressible resources such as memory or disk. If either resource is exhausted, the node becomes unstable.

If swap memory is not disabled, the node does not recognize that it is under MemoryPressure.

To take advantage of memory-based evictions, operators must disable swap.

Eviction Policy

Using eviction policies, a node can proactively monitor for and prevent total starvation of a compute resource.

In cases where a node is running low on available resources, it can proactively fail one or more pods in order to reclaim the starved resource using an eviction policy. When the node fails a pod, it terminates all containers in the pod, and the PodPhase is transitioned to Failed.

Platform administrators can configure eviction settings within the node-config.yaml file.

Eviction Signals

The node can be configured to trigger eviction decisions on the signals described in the table below. The value of each signal is described in the description column based on the node summary API.

To view the signals:

curl <certificate details> \
  https://<master>/api/v1/nodes/<node>/proxy/stats/summary

Table 1. Supported Eviction Signals

Eviction Signal     Description
memory.available    memory.available = node.status.capacity[memory] - node.stats.memory.workingSet
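The formula in Table 1 can be checked by hand. The following is a minimal sketch: the function name and the sample byte counts are illustrative assumptions, not part of the node summary API.

```python
GI = 1024 ** 3
MI = 1024 ** 2


def memory_available(capacity_bytes: int, working_set_bytes: int) -> int:
    """memory.available = node.status.capacity[memory] - node.stats.memory.workingSet"""
    return capacity_bytes - working_set_bytes


# Hypothetical node: 10Gi of capacity with a working set of 9Gi + 768Mi.
available = memory_available(10 * GI, 9 * GI + 768 * MI)
print(available // MI, "Mi available")
```

With these sample values the node has 256Mi available, which would meet a threshold such as memory.available<500Mi.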

In future releases, the node will support the ability to trigger eviction decisions based on disk pressure. Until then, use container and image garbage collection.

Eviction Thresholds

You can configure a node to specify eviction thresholds, which trigger the node to reclaim resources.

Eviction thresholds can be either soft, where you allow a grace period before the node reclaims resources, or hard, where the node takes immediate action when the threshold is met.

Thresholds are configured in the following form:

<eviction_signal><operator><quantity>
  • Valid eviction_signal tokens are those listed under Eviction Signals.

  • The only valid operator token is <.

  • Valid quantity tokens must match the quantity representation used by Kubernetes.

For example, using the memory.available signal, in order to construct a threshold for when the memory available drops below 500Mi, the form would be:

memory.available<500Mi
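To make the threshold form concrete, the sketch below splits a threshold string into its three tokens. It is not the node's own parser: parse_threshold and Threshold are hypothetical names, and only plain numbers and the binary suffixes Ki, Mi, Gi, and Ti are handled, which is a small subset of the Kubernetes quantity syntax.

```python
import re
from typing import NamedTuple

# Assumption: only binary suffixes are supported in this sketch.
_SUFFIXES = {"Ki": 1024, "Mi": 1024 ** 2, "Gi": 1024 ** 3, "Ti": 1024 ** 4}
_PATTERN = re.compile(r"^([a-z.]+)(<)(\d+(?:\.\d+)?)(Ki|Mi|Gi|Ti)?$")


class Threshold(NamedTuple):
    signal: str
    operator: str
    quantity_bytes: int


def parse_threshold(text: str) -> Threshold:
    """Split '<eviction_signal><operator><quantity>' into its parts."""
    match = _PATTERN.match(text)
    if match is None:
        raise ValueError(f"not a valid eviction threshold: {text!r}")
    signal, operator, number, suffix = match.groups()
    multiplier = _SUFFIXES.get(suffix, 1)  # no suffix means plain bytes
    return Threshold(signal, operator, int(float(number) * multiplier))


print(parse_threshold("memory.available<500Mi"))
```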

Soft Eviction Thresholds

A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The node does not reclaim resources associated with the eviction signal until that grace period is exceeded. If no grace period is provided, the node errors on startup.

In addition, if a soft eviction threshold is met, an operator can specify a maximum allowed pod termination grace period to use when evicting pods from the node. If specified, the node uses the lesser of pod.Spec.TerminationGracePeriodSeconds and the maximum-allowed grace period. If not specified, the node kills pods immediately with no graceful termination.

To configure soft eviction thresholds, the following flags are supported:

  • eviction-soft: a set of eviction thresholds (for example, memory.available<1.5Gi) that, if met over a corresponding grace period, triggers a pod eviction.

  • eviction-soft-grace-period: a set of eviction grace periods (for example, memory.available=1m30s) that correspond to how long a soft eviction threshold must hold before triggering a pod eviction.

  • eviction-max-pod-grace-period: the maximum-allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
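The interaction between these three settings can be sketched as follows. This is an illustration under stated assumptions, not the node's implementation: SoftThreshold and termination_grace_period are hypothetical names, times are plain seconds, and treating the grace period as elapsed once the threshold has held for exactly that long is an assumption about the boundary.

```python
from typing import Optional

MI = 1024 ** 2


class SoftThreshold:
    """Tracks how long memory.available has stayed below a soft threshold."""

    def __init__(self, threshold_bytes: int, grace_period_s: float) -> None:
        self.threshold_bytes = threshold_bytes
        self.grace_period_s = grace_period_s
        self._first_met_at: Optional[float] = None

    def observe(self, available_bytes: int, now_s: float) -> bool:
        """Return True once the threshold has held for the whole grace period."""
        if available_bytes < self.threshold_bytes:
            if self._first_met_at is None:
                self._first_met_at = now_s
            return now_s - self._first_met_at >= self.grace_period_s
        self._first_met_at = None  # threshold no longer met: start over
        return False


def termination_grace_period(pod_grace_s: int, max_allowed_s: Optional[int]) -> int:
    """Lesser of the pod's own grace period and the maximum allowed one."""
    if max_allowed_s is None:
        return 0  # no maximum configured: pods are killed immediately
    return min(pod_grace_s, max_allowed_s)


soft = SoftThreshold(threshold_bytes=1536 * MI, grace_period_s=90)
print(soft.observe(1000 * MI, now_s=0), soft.observe(1000 * MI, now_s=90))
```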

Hard Eviction Thresholds

A hard eviction threshold has no grace period and, if observed, the node takes immediate action to reclaim the associated starved resource. If a hard eviction threshold is met, the node kills the pod immediately with no graceful termination.

To configure hard eviction thresholds, the following flag is supported:

  • eviction-hard: a set of eviction thresholds (for example, memory.available<1Gi) that, if met, triggers a pod eviction.

Oscillation of Node Conditions

If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node condition oscillates between true and false, which can confuse the scheduler.

To protect against this oscillation, set the following flag to control how long the node must wait before transitioning out of a pressure condition:

  • eviction-pressure-transition-period: the duration that the node has to wait before transitioning out of an eviction pressure condition.

Before toggling the condition back to false, the node ensures that it has not observed a met eviction threshold for the specified pressure condition for the period specified.
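The damping effect of the transition period can be sketched as a small state holder. PressureCondition is a hypothetical name, times are plain seconds, and the sketch only illustrates the rule described above.

```python
from typing import Optional


class PressureCondition:
    """Keeps a pressure condition true until no threshold has been met
    for the whole transition period."""

    def __init__(self, transition_period_s: float) -> None:
        self.transition_period_s = transition_period_s
        self._last_met_at: Optional[float] = None

    def update(self, threshold_met: bool, now_s: float) -> bool:
        """Record one observation and return the condition to report."""
        if threshold_met:
            self._last_met_at = now_s
            return True
        if self._last_met_at is None:
            return False  # the threshold has never been met
        # Stay under pressure until the quiet period is long enough.
        return now_s - self._last_met_at < self.transition_period_s


condition = PressureCondition(transition_period_s=300)
print(condition.update(True, 0), condition.update(False, 10))
```

A node that dips below the threshold once and then recovers keeps reporting the condition as true for the full period, so the scheduler sees one stable signal instead of a flapping one.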

Eviction Monitoring Interval

The node evaluates and monitors eviction thresholds every 10 seconds, and this value cannot be modified. This is the housekeeping interval.

Mapping Eviction Signals to Node Conditions

The node can map one or more eviction signals to a corresponding node condition.

If an eviction threshold is met, independent of its associated grace period, the node reports a condition indicating that the node is under pressure.

The following node conditions are defined, each corresponding to the specified eviction signal.

Table 2. Node Conditions Related to Low Resources

Node Condition    Eviction Signal     Description
MemoryPressure    memory.available    Available memory on the node has satisfied an eviction threshold.

When a pressure condition is set, the node continues to report node status updates at the frequency specified by the node-status-update-frequency argument, which defaults to 10s.

Eviction of Pods

If an eviction threshold is met and the grace period has passed, the node initiates the process of evicting pods until it observes that the signal no longer meets its defined threshold.

The node ranks pods for eviction by their quality of service, and, among those with the same quality of service, by the consumption of the starved compute resource relative to the pod’s scheduling request.

  • BestEffort: pods that consume the most of the starved resource are failed first.

  • Burstable: pods that consume the most of the starved resource relative to their request for that resource are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.

  • Guaranteed: pods that consume the most of the starved resource relative to their request are failed first. If no pod has exceeded its request, the strategy targets the largest consumer of the starved resource.

A Guaranteed pod is never evicted because of another pod’s resource consumption unless a system daemon (such as the node, docker, or journald) is consuming more resources than were reserved through system-reserved or kube-reserved allocations, or the node has only Guaranteed pods remaining.

In the latter case, the node evicts the Guaranteed pod that least impacts node stability and limits the impact of the unexpected consumption on other Guaranteed pods.
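The ranking described in this section can be sketched as a sort. The Pod type and eviction_order function are hypothetical, and the sketch reads "relative to the request" as usage minus request, which is one possible interpretation; the special handling of Guaranteed pods is not modeled.

```python
from dataclasses import dataclass
from typing import List

# Lower value means evicted earlier.
QOS_ORDER = {"BestEffort": 0, "Burstable": 1, "Guaranteed": 2}


@dataclass
class Pod:
    name: str
    qos: str
    usage_bytes: int
    request_bytes: int = 0  # BestEffort pods have no request


def eviction_order(pods: List[Pod]) -> List[Pod]:
    """Rank by quality of service, then by usage above the request."""
    return sorted(
        pods,
        key=lambda p: (QOS_ORDER[p.qos], -(p.usage_bytes - p.request_bytes)),
    )


pods = [
    Pod("a", "BestEffort", usage_bytes=100),
    Pod("b", "BestEffort", usage_bytes=300),
    Pod("c", "Burstable", usage_bytes=500, request_bytes=400),
    Pod("d", "Burstable", usage_bytes=900, request_bytes=600),
    Pod("e", "Guaranteed", usage_bytes=200, request_bytes=200),
]
print([p.name for p in eviction_order(pods)])
```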

Scheduler

The scheduler views node conditions when placing additional pods on the node. For example, if the node has an eviction threshold like the following:

eviction-hard is "memory.available<500Mi"

and available memory falls below 500Mi, the node reports MemoryPressure as true in Node.Status.Conditions.

Table 3. Node Conditions and Scheduler Behavior

Node Condition    Scheduler Behavior
MemoryPressure    BestEffort pods are not scheduled to the node.

This means that if the scheduler sees the node reporting MemoryPressure it will not place BestEffort pods on that node.

Example Scenario

Consider the following scenario:

  • Node memory capacity of 10Gi.

  • The operator wants to reserve 10% of memory capacity for system daemons (kernel, node, etc.).

  • The operator wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.

A node reports two values:

  • Capacity: How much resource is on the machine

  • Allocatable: How much resource is made available for scheduling.

The goal is to allow the scheduler to fully allocate a node and to not have evictions occur.

Evictions should only occur if pods use more than their requested amount of resource.

To facilitate this scenario, the node configuration file (the node-config.yaml file) is modified as follows:

kubeletArguments:
  eviction-hard: # (1)
    - "memory.available<500Mi"
  system-reserved:
    - "memory=1.5Gi"

(1) This threshold can be either eviction-hard or eviction-soft.

Soft eviction usage is more common when you are targeting a certain level of utilization, but can tolerate temporary spikes. It is recommended that the soft eviction threshold always be set at a lower level of utilization than the hard eviction threshold, but the time period is operator specific. The system reservation should also cover the soft eviction threshold.

Implicit in this configuration is the understanding that system-reserved should include the amount of memory covered by the eviction threshold.

To reach that capacity, either some pod is using more than its request, or the system is using more than 1Gi.

If a node has 10 Gi of capacity, and you want to reserve 10% of that capacity for the system daemons, do the following:

capacity = 10 Gi
system-reserved = 10 Gi * .1 = 1 Gi

The node allocatable value in this setting becomes:

allocatable = capacity - system-reserved = 9 Gi

This means that, by default, the scheduler can schedule pods that request up to 9 Gi of memory in total to that node.

If you want to turn on eviction so that eviction is triggered when the node observes that available memory falls below 10% of capacity for 30 seconds, or immediately when it falls below 5% of capacity, you need the scheduler to see allocatable as 8Gi. Therefore, ensure your system reservation covers the greater of your eviction thresholds.

capacity = 10 Gi
eviction-threshold = 10 Gi * .1 = 1 Gi
system-reserved = (10 Gi * .1) + eviction-threshold = 2 Gi
allocatable = capacity - system-reserved = 8 Gi

You must set system-reserved equal to the amount of resource you want to reserve for system-daemons, plus the amount of resource you want to reserve before triggering evictions.

This configuration ensures that the scheduler does not place pods on a node that would immediately induce memory pressure and trigger eviction, assuming those pods use less than their configured request.
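The reservation arithmetic in this scenario can be sketched as a small helper. The function name and the choice of Mi as the unit are assumptions made for illustration; the rule it applies is the one stated above, that system-reserved covers the system daemons plus the greater of the eviction thresholds.

```python
from typing import List, Tuple


def reservation(
    capacity_mi: int, daemon_percent: int, eviction_percents: List[int]
) -> Tuple[int, int]:
    """Return (system_reserved_mi, allocatable_mi).

    system-reserved = share for system daemons + greatest eviction threshold
    allocatable     = capacity - system-reserved
    """
    daemons_mi = capacity_mi * daemon_percent // 100
    eviction_mi = capacity_mi * max(eviction_percents) // 100
    system_reserved_mi = daemons_mi + eviction_mi
    return system_reserved_mi, capacity_mi - system_reserved_mi


CAPACITY_MI = 10 * 1024  # the 10 Gi node from the scenario

# Soft threshold at 10% and hard threshold at 5% of capacity.
print(reservation(CAPACITY_MI, 10, [10, 5]))
# Hard threshold at 5% of capacity only.
print(reservation(CAPACITY_MI, 10, [5]))
```

With both thresholds configured, the reservation is 2048 Mi and allocatable is 8192 Mi (8 Gi). With only the 5% hard threshold, the reservation is 1536 Mi (1.5 Gi) and allocatable is 8704 Mi (8.5 Gi).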

Out of Resource and Out of Memory

If the node experiences a system out of memory (OOM) event before it is able to reclaim memory, the node depends on the OOM killer to respond.

The node sets an oom_score_adj value for each container based on the quality of service for the pod.

Table 4. Quality of Service OOM Scores

Quality of Service    oom_score_adj Value
Guaranteed            -998
BestEffort            1000
Burstable             min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999)
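The values in Table 4 can be computed with a short function. This is a sketch of the table only: the function name and signature are hypothetical, and the use of integer division in the Burstable formula is an assumption.

```python
GI = 1024 ** 3


def oom_score_adj(
    qos: str, memory_request_bytes: int = 0, machine_memory_capacity_bytes: int = 1
) -> int:
    """Return the oom_score_adj value for a container, following Table 4."""
    if qos == "Guaranteed":
        return -998
    if qos == "BestEffort":
        return 1000
    if qos != "Burstable":
        raise ValueError(f"unknown quality of service: {qos!r}")
    # Burstable: the larger the request, the lower (safer) the adjustment,
    # clamped to the range 2..999.
    adj = 1000 - (1000 * memory_request_bytes) // machine_memory_capacity_bytes
    return min(max(2, adj), 999)


# A Burstable container requesting 4Gi on a 10Gi machine.
print(oom_score_adj("Burstable", 4 * GI, 10 * GI))
```

The clamping keeps every Burstable container between the Guaranteed and BestEffort values, so the quality of service ordering is preserved regardless of the request size.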

If the node is unable to reclaim memory prior to experiencing a system OOM event, the OOM killer calculates an oom_score:

% of node memory a container is using + oom_score_adj = oom_score

The node then kills the container with the highest score.

Containers with the lowest quality of service that are consuming the largest amount of memory relative to the scheduling request are failed first.

Unlike pod eviction, if a pod container is OOM failed, it can be restarted by the node based on its RestartPolicy.

DaemonSets and Out of Resource Handling

If a node evicts a pod that was created by a DaemonSet, the pod is immediately recreated and rescheduled back to the same node, because the node cannot distinguish a pod created by a DaemonSet from a pod created by any other object.

In general, DaemonSets should not create BestEffort pods, so that their pods are not identified as candidates for eviction. Instead, DaemonSets should ideally launch Guaranteed pods.