curl <certificate details> \ https://<master>/api/v1/nodes/<node>/proxy/stats/summary
The node must preserve node stability when available compute resources are low. This is especially important when dealing with incompressible resources such as memory or disk. If either resource is exhausted, the node becomes unstable.
Failure to disable swap memory makes the node not recognize it is under MemoryPressure. To take advantage of memory based evictions, operators must disable swap. |
Using eviction policies, a node can proactively monitor for and prevent against total starvation of a compute resource.
In cases where a node is running low on available resources, it can proactively
fail one or more pods in order to reclaim the starved resource using an eviction
policy. When the node fails a pod, it terminates all containers in the pod, and
the PodPhase
is transitioned to Failed.
Platform administrators can configure eviction settings within the node-config.yaml file.
The node can be configured to trigger eviction decisions on the signals described in the table below. The value of each signal is described in the description column based on the node summary API.
To view the signals:
curl <certificate details> \ https://<master>/api/v1/nodes/<node>/proxy/stats/summary
Eviction Signal | Description |
---|---|
|
|
In future releases, the node will support the ability to trigger eviction decisions based on disk pressure. Until then, use container and image garbage collection. |
You can configure a node to specify eviction thresholds, which trigger the node to reclaim resources.
Eviction thresholds can be soft, for when you allow a grace period before reclaiming resources, and hard, for when the node takes immediate action when a threshold is met.
Thresholds are configured in the following form:
<eviction_signal><operator><quantity>
Valid eviction-signal
tokens as defined by eviction signals.
Valid operator
tokens are <
.
Valid quantity
tokens must match the quantity representation used by
Kubernetes.
For example, using the memory.available
signal, in order to construct a
threshold for when the memory available drops below 500Mi, the form would be:
memory.available<500Mi
A soft eviction threshold pairs an eviction threshold with a required administrator-specified grace period. The node does not reclaim resources associated with the eviction signal until that grace period is exceeded. If no grace period is provided, the node errors on startup.
In addition, if a soft eviction threshold is met, an operator can specify a
maximum allowed pod termination grace period to use when evicting pods from the
node. If specified, the node uses the lesser value among the
pod.Spec.TerminationGracePeriodSeconds
and the maximum-allowed grace period.
If not specified, the node kills pods immediately with no graceful termination.
To configure soft eviction thresholds, the following flags are supported:
eviction-soft
: a set of eviction thresholds (for example,
memory.available<1.5Gi
) that, if met over a corresponding grace period,
triggers a pod eviction.
eviction-soft-grace-period
: a set of eviction grace periods (for
example, memory.available=1m30s
) that correspond to how long a soft eviction
threshold must hold before triggering a pod eviction.
eviction-max-pod-grace-period
: the maximum-allowed grace period (in
seconds) to use when terminating pods in response to a soft eviction threshold
being met.
A hard eviction threshold has no grace period and, if observed, the node takes immediate action to reclaim the associated starved resource. If a hard eviction threshold is met, the node kills the pod immediately with no graceful termination.
To configure hard eviction thresholds, the following flag is supported:
eviction-hard
: a set of eviction thresholds (for example,
memory.available<1Gi
) that, if met, triggers a pod eviction.
If a node is oscillating above and below a soft eviction threshold, but not exceeding its associated grace period, the corresponding node condition oscillates between true and false, which can confuse the scheduler.
To protect this, set the following flag to control how long the node must wait before transitioning out of a pressure condition:
eviction-pressure-transition-period
: the duration that the node has
to wait before transitioning out of an eviction pressure condition.
Before toggling the condition back to false, the node ensures that it has not observed a met eviction threshold for the specified pressure condition for the period specified.
The node evaluates and monitors eviction thresholds every 10 seconds and the value can not be modified. This is the housekeeping interval.
The node can map one or more eviction signals to a corresponding node condition.
If an eviction threshold is met, independent of its associated grace period, the node reports a condition indicating that the node is under pressure.
The following node conditions are defined that correspond to the specified eviction signal.
Node Condition | Eviction Signal | Description |
---|---|---|
|
|
Available memory on the node has satisfied an eviction threshold. |
When the above is set the node continues to report node status updates at the
frequency specified by the node-status-update-frequency
argument, which
defaults to 10s
.
If an eviction threshold is met and the grace period is passed, the node initiates the process of evicting pods until it observes the signal going below its defined threshold.
The node ranks pods for eviction by their quality of service, and, among those with the same quality of service, by the consumption of the starved compute resource relative to the pod’s scheduling request.
BestEffort
: pods that consume the most of the starved resource are failed
first.
Burstable
: pods that consume the most of the starved resource relative to their
request for that resource are failed first. If no pod has exceeded its request,
the strategy targets the largest consumer of the starved resource.
Guaranteed
: pods that consume the most of the starved resource relative to
their request are failed first. If no pod has exceeded its request, the strategy
targets the largest consumer of the starved resource.
A Guaranteed
pod will never be evicted because of another pod’s resource
consumption unless a system daemon (node, docker, journald, etc) is
consuming more resources than were reserved via system-reserved, or
kube-reserved allocations or if the node has only Guaranteed
pods remaining.
If the latter, the node evicts a Guaranteed
pod that least impacts node
stability and limits the impact of the unexpected consumption to other
Guaranteed
pods.
The scheduler views node conditions when placing additional pods on the node. For example, if the node has an eviction threshold like the following:
eviction-hard is "memory.available<500Mi"
and available memory falls below 500Mi, the node reports a value in Node.Status.Conditions
as MemoryPressure
as true.
Node Condition | Scheduler Behavior |
---|---|
|
|
This means that if the scheduler sees the node reporting MemoryPressure
it
will not place BestEffort
pods on that node.
Consider the following scenario:
Node memory capacity of 10Gi
.
The operator wants to reserve 10% of memory capacity for system daemons (kernel, node, etc.).
The operator wants to evict pods at 95% memory utilization to reduce thrashing and incidence of system OOM.
A node reports two values:
Capacity
: How much resource is on the machine
Allocatable
: How much resource is made available for scheduling.
The goal is to allow the scheduler to fully allocate a node and to not have evictions occur.
Evictions should only occur if pods use more than their requested amount of resource.
To facilitate this scenario, the node configuration file (the node-config.yaml file) is modified as follows:
kubeletArguments: eviction-hard: (1) - "memory.available<500Mi" system-reserved: - "memory=1.5Gi"
1 | This threshold can either be eviction-hard or eviction-soft . |
Soft eviction usage is more common when you are targeting a certain level of utilization, but can tolerate temporary spikes. It is recommended that the soft eviction threshold is always less than the hard eviction threshold, but the time period is operator specific. The system reservation should also cover the soft eviction threshold. |
Implicit in this configuration is the understanding that system-reserved
should include the amount of memory covered by the eviction threshold.
To reach that capacity, either some pod is using more than its request, or the
system is using more than 1Gi
.
If a node has 10 Gi of capacity, and you want to reserve 10% of that capacity for the system daemons, do the following:
capacity = 10 Gi system-reserved = 10 Gi * .1 = 1 Gi
The node allocatable value in this setting becomes:
allocatable = capacity - system-reserved = 9 Gi
This means by default, the scheduler will schedule pods that request 9 Gi of memory to that node.
If you want to turn on eviction so that eviction is triggered when the node observes that available memory falls below 10% of capacity for 30 seconds, or immediately when it falls below 5% of capacity, you need the scheduler to see allocatable as 8Gi. Therefore, ensure your system reservation covers the greater of your eviction thresholds.
capacity = 10 Gi eviction-threshold = 10 Gi * .1 = 1 Gi system-reserved = (10Gi * .1) + eviction-threshold = 2 Gi allocatable = capacity - system-reserved = 8 Gi
You must set system-reserved
equal to the amount of resource you want to
reserve for system-daemons, plus the amount of resource you want to reserve
before triggering evictions.
This configuration ensures that the scheduler does not place pods on a node that immediately induce memory pressure and trigger eviction assuming those pods use less than their configured request.
If the node experiences a system out of memory (OOM) event before it is able to reclaim memory, the node depends on the OOM killer to respond.
The node sets a oom_score_adj
value for each container based on the quality
of service for the pod.
Quality of Service | oom_score_adj Value |
---|---|
|
-998 |
|
1000 |
|
min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
If the node is unable to reclaim memory prior to experiencing a system OOM
event, the oom_killer
calculates an oom_score
:
% of node memory a container is using + `oom_score_adj` = `oom_score`
The node then kills the container with the highest score.
Containers with the lowest quality of service that are consuming the largest amount of memory relative to the scheduling request are failed first.
Unlike pod eviction, if a pod container is OOM failed, it can be restarted by
the node based on its RestartPolicy
.
If a node evicts a pod that was created by a DaemonSet, the pod will immediately be recreated and rescheduled back to the same node, because the node has no ability to distinguish a pod created from a DaemonSet versus any other object.
In general, DaemonSets should not create BestEffort
pods to avoid being
identified as a candidate pod for eviction. Instead DaemonSets should ideally
launch Guaranteed
pods.