$ oc describe mcp/worker-rt
Tune nodes for low latency by using the cluster performance profile. You can restrict CPUs for infra and application containers, configure huge pages, Hyper-Threading, and configure CPU partitions for latency-sensitive processes.
Learn about the Performance Profile Creator (PPC) and how you can use it to create a performance profile.
In Telco, clusters using |
The Performance Profile Creator (PPC) is a command-line tool, delivered with the Node Tuning Operator, used to create the performance profile.
The tool consumes must-gather
data from the cluster and several user-supplied profile arguments. The PPC generates a performance profile that is appropriate for your hardware and topology.
The tool is run by one of the following methods:
Invoking podman
Calling a wrapper script
The Performance Profile Creator (PPC) tool requires must-gather
data. As a cluster administrator, run the must-gather
command to capture information about your cluster.
Access to the cluster as a user with the cluster-admin
role.
The OpenShift CLI (oc
) installed.
Optional: Verify that a matching machine config pool exists with a label:
$ oc describe mcp/worker-rt
Name: worker-rt
Namespace:
Labels: machineconfiguration.openshift.io/role=worker-rt
If a matching label does not exist add a label for a machine config pool (MCP) that matches with the MCP name:
$ oc label mcp <mcp_name> machineconfiguration.openshift.io/role=<mcp_name>
Navigate to the directory where you want to store the must-gather
data.
Collect cluster information by running the following command:
$ oc adm must-gather
Optional: Create a compressed file from the must-gather
directory:
$ tar cvaf must-gather.tar.gz must-gather/
Compressed output is required if you are running the Performance Profile Creator wrapper script. |
As a cluster administrator, you can run podman
and the Performance Profile Creator to create a performance profile.
Access to the cluster as a user with the cluster-admin
role.
A cluster installed on bare-metal hardware.
A node with podman
and OpenShift CLI (oc
) installed.
Access to the Node Tuning Operator image.
Check the machine config pool:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-acd1358917e9f98cbdb599aea622d78b True False False 3 3 3 0 22h
worker-cnf rendered-worker-cnf-1d871ac76e1951d32b2fe92369879826 False True False 2 1 1 0 22h
Use Podman to authenticate to registry.redhat.io
:
$ podman login registry.redhat.io
Username: <username>
Password: <password>
Optional: Display help for the PPC tool:
$ podman run --rm --entrypoint performance-profile-creator registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.14 -h
A tool that automates creation of Performance Profiles
Usage:
performance-profile-creator [flags]
Flags:
--disable-ht Disable Hyperthreading
-h, --help help for performance-profile-creator
--info string Show cluster information; requires --must-gather-dir-path, ignore the other arguments. [Valid values: log, json] (default "log")
--mcp-name string MCP name corresponding to the target machines (required)
--must-gather-dir-path string Must gather directory path (default "must-gather")
--offlined-cpu-count int Number of offlined CPUs
--per-pod-power-management Enable Per Pod Power Management
--power-consumption-mode string The power consumption mode. [Valid values: default, low-latency, ultra-low-latency] (default "default")
--profile-name string Name of the performance profile to be created (default "performance")
--reserved-cpu-count int Number of reserved CPUs (required)
--rt-kernel Enable Real Time Kernel (required)
--split-reserved-cpus-across-numa Split the Reserved CPUs across NUMA nodes
--topology-manager-policy string Kubelet Topology Manager Policy of the performance profile to be created. [Valid values: single-numa-node, best-effort, restricted] (default "restricted")
--user-level-networking Run with User level Networking(DPDK) enabled
Run the Performance Profile Creator tool in discovery mode:
Discovery mode inspects your cluster by using the output from
Using this information you can set appropriate values for some of the arguments supplied to the Performance Profile Creator tool. |
$ podman run --entrypoint performance-profile-creator -v <path_to_must-gather>/must-gather:/must-gather:z registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.14 --info log --must-gather-dir-path /must-gather
This command uses the performance profile creator as a new entry point to The
The |
Run podman
:
$ podman run --entrypoint performance-profile-creator -v /must-gather:/must-gather:z registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.14 --mcp-name=worker-cnf --reserved-cpu-count=4 --rt-kernel=true --split-reserved-cpus-across-numa=false --must-gather-dir-path /must-gather --power-consumption-mode=ultra-low-latency --offlined-cpu-count=6 > my-performance-profile.yaml
The Performance Profile Creator arguments are shown in the Performance Profile Creator arguments table. The following arguments are required:
The |
Review the created YAML file:
$ cat my-performance-profile.yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: performance
spec:
cpu:
isolated: 2-39,48-79
offlined: 42-47
reserved: 0-1,40-41
machineConfigPoolSelector:
machineconfiguration.openshift.io/role: worker-cnf
nodeSelector:
node-role.kubernetes.io/worker-cnf: ""
numa:
topologyPolicy: restricted
realTimeKernel:
enabled: true
workloadHints:
highPowerConsumption: true
realTime: true
Apply the generated profile:
$ oc apply -f my-performance-profile.yaml
For more information about the must-gather
tool,
see Gathering data about your cluster.
The following example illustrates how to run podman
to create a performance profile with 20 reserved CPUs that are to be split across the NUMA nodes.
Node hardware configuration:
80 CPUs
Hyperthreading enabled
Two NUMA nodes
Even numbered CPUs run on NUMA node 0 and odd numbered CPUs run on NUMA node 1
Run podman
to create the performance profile:
$ podman run --entrypoint performance-profile-creator -v /must-gather:/must-gather:z registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.14 --mcp-name=worker-cnf --reserved-cpu-count=20 --rt-kernel=true --split-reserved-cpus-across-numa=true --must-gather-dir-path /must-gather > my-performance-profile.yaml
The created profile is described in the following YAML:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: performance
spec:
cpu:
isolated: 10-39,50-79
reserved: 0-9,40-49
nodeSelector:
node-role.kubernetes.io/worker-cnf: ""
numa:
topologyPolicy: restricted
realTimeKernel:
enabled: true
In this case, 10 CPUs are reserved on NUMA node 0 and 10 are reserved on NUMA node 1. |
The performance profile wrapper script simplifies the running of the Performance Profile Creator (PPC) tool. It hides the complexities associated with running podman
and specifying the mapping directories and it enables the creation of the performance profile.
Access to the Node Tuning Operator image.
Access to the must-gather
tarball.
Create a file on your local machine named, for example, run-perf-profile-creator.sh
:
$ vi run-perf-profile-creator.sh
Paste the following code into the file:
#!/bin/bash
readonly CONTAINER_RUNTIME=${CONTAINER_RUNTIME:-podman}
readonly CURRENT_SCRIPT=$(basename "$0")
readonly CMD="${CONTAINER_RUNTIME} run --entrypoint performance-profile-creator"
readonly IMG_EXISTS_CMD="${CONTAINER_RUNTIME} image exists"
readonly IMG_PULL_CMD="${CONTAINER_RUNTIME} image pull"
readonly MUST_GATHER_VOL="/must-gather"
NTO_IMG="registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.14"
MG_TARBALL=""
DATA_DIR=""
usage() {
print "Wrapper usage:"
print " ${CURRENT_SCRIPT} [-h] [-p image][-t path] -- [performance-profile-creator flags]"
print ""
print "Options:"
print " -h help for ${CURRENT_SCRIPT}"
print " -p Node Tuning Operator image"
print " -t path to a must-gather tarball"
${IMG_EXISTS_CMD} "${NTO_IMG}" && ${CMD} "${NTO_IMG}" -h
}
function cleanup {
[ -d "${DATA_DIR}" ] && rm -rf "${DATA_DIR}"
}
trap cleanup EXIT
exit_error() {
print "error: $*"
usage
exit 1
}
print() {
echo "$*" >&2
}
check_requirements() {
${IMG_EXISTS_CMD} "${NTO_IMG}" || ${IMG_PULL_CMD} "${NTO_IMG}" || \
exit_error "Node Tuning Operator image not found"
[ -n "${MG_TARBALL}" ] || exit_error "Must-gather tarball file path is mandatory"
[ -f "${MG_TARBALL}" ] || exit_error "Must-gather tarball file not found"
DATA_DIR=$(mktemp -d -t "${CURRENT_SCRIPT}XXXX") || exit_error "Cannot create the data directory"
tar -zxf "${MG_TARBALL}" --directory "${DATA_DIR}" || exit_error "Cannot decompress the must-gather tarball"
chmod a+rx "${DATA_DIR}"
return 0
}
main() {
while getopts ':hp:t:' OPT; do
case "${OPT}" in
h)
usage
exit 0
;;
p)
NTO_IMG="${OPTARG}"
;;
t)
MG_TARBALL="${OPTARG}"
;;
?)
exit_error "invalid argument: ${OPTARG}"
;;
esac
done
shift $((OPTIND - 1))
check_requirements || exit 1
${CMD} -v "${DATA_DIR}:${MUST_GATHER_VOL}:z" "${NTO_IMG}" "$@" --must-gather-dir-path "${MUST_GATHER_VOL}"
echo "" 1>&2
}
main "$@"
Add execute permissions for everyone on this script:
$ chmod a+x run-perf-profile-creator.sh
Optional: Display the run-perf-profile-creator.sh
command usage:
$ ./run-perf-profile-creator.sh -h
Wrapper usage:
run-perf-profile-creator.sh [-h] [-p image][-t path] -- [performance-profile-creator flags]
Options:
-h help for run-perf-profile-creator.sh
-p Node Tuning Operator image (1)
-t path to a must-gather tarball (2)
A tool that automates creation of Performance Profiles
Usage:
performance-profile-creator [flags]
Flags:
--disable-ht Disable Hyperthreading
-h, --help help for performance-profile-creator
--info string Show cluster information; requires --must-gather-dir-path, ignore the other arguments. [Valid values: log, json] (default "log")
--mcp-name string MCP name corresponding to the target machines (required)
--must-gather-dir-path string Must gather directory path (default "must-gather")
--offlined-cpu-count int Number of offlined CPUs
--per-pod-power-management Enable Per Pod Power Management
--power-consumption-mode string The power consumption mode. [Valid values: default, low-latency, ultra-low-latency] (default "default")
--profile-name string Name of the performance profile to be created (default "performance")
--reserved-cpu-count int Number of reserved CPUs (required)
--rt-kernel Enable Real Time Kernel (required)
--split-reserved-cpus-across-numa Split the Reserved CPUs across NUMA nodes
--topology-manager-policy string Kubelet Topology Manager Policy of the performance profile to be created. [Valid values: single-numa-node, best-effort, restricted] (default "restricted")
--user-level-networking Run with User level Networking(DPDK) enabled
There two types of arguments:
|
1 | Optional: Specify the Node Tuning Operator image. If not set, the default upstream image is used: registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.14 . |
2 | -t is a required wrapper script argument and specifies the path to a must-gather tarball. |
Run the performance profile creator tool in discovery mode:
Discovery mode inspects your cluster using the output from
Using this information you can set appropriate values for some of the arguments supplied to the Performance Profile Creator tool. |
$ ./run-perf-profile-creator.sh -t /must-gather/must-gather.tar.gz -- --info=log
The |
Check the machine config pool:
$ oc get mcp
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
master rendered-master-acd1358917e9f98cbdb599aea622d78b True False False 3 3 3 0 22h
worker-cnf rendered-worker-cnf-1d871ac76e1951d32b2fe92369879826 False True False 2 1 1 0 22h
Create a performance profile:
$ ./run-perf-profile-creator.sh -t /must-gather/must-gather.tar.gz -- --mcp-name=worker-cnf --reserved-cpu-count=2 --rt-kernel=true > my-performance-profile.yaml
The Performance Profile Creator arguments are shown in the Performance Profile Creator arguments table. The following arguments are required:
The |
Review the created YAML file:
$ cat my-performance-profile.yaml
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: performance
spec:
cpu:
isolated: 1-39,41-79
reserved: 0,40
nodeSelector:
node-role.kubernetes.io/worker-cnf: ""
numa:
topologyPolicy: restricted
realTimeKernel:
enabled: false
Apply the generated profile:
Install the Node Tuning Operator before applying the profile. |
$ oc apply -f my-performance-profile.yaml
Argument | Description | ||
---|---|---|---|
|
Disable hyperthreading. Possible values: Default:
|
||
|
This captures cluster information and is used in discovery mode only. Discovery mode also requires the Possible values:
Default: |
||
|
MCP name for example |
||
|
Must gather directory path. This parameter is required. When the user runs the tool with the wrapper script |
||
|
Number of offlined CPUs.
|
||
|
The power consumption mode. Possible values:
Default: |
||
|
Enable per pod power management. You cannot use this argument if you configured Possible values: Default: |
||
|
Name of the performance profile to create.
Default: |
||
|
Number of reserved CPUs. This parameter is required.
|
||
|
Enable real-time kernel. This parameter is required. Possible values: |
||
|
Split the reserved CPUs across NUMA nodes. Possible values: Default: |
||
|
Kubelet Topology Manager policy of the performance profile to be created. Possible values:
Default: |
||
|
Run with user level networking (DPDK) enabled. Possible values: Default: |
Use the following reference performance profiles as the basis to develop your own custom profiles.
To maximize machine performance in a cluster that uses Open vSwitch with the Data Plane Development Kit (OVS-DPDK) on Red Hat OpenStack Platform (RHOSP), you can use a performance profile.
You can use the following performance profile template to create a profile for your deployment.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: cnf-performanceprofile
spec:
additionalKernelArgs:
- nmi_watchdog=0
- audit=0
- mce=off
- processor.max_cstate=1
- idle=poll
- intel_idle.max_cstate=0
- default_hugepagesz=1GB
- hugepagesz=1G
- intel_iommu=on
cpu:
isolated: <CPU_ISOLATED>
reserved: <CPU_RESERVED>
hugepages:
defaultHugepagesSize: 1G
pages:
- count: <HUGEPAGES_COUNT>
node: 0
size: 1G
nodeSelector:
node-role.kubernetes.io/worker: ''
realTimeKernel:
enabled: false
globallyDisableIrqLoadBalancing: true
Insert values that are appropriate for your configuration for the CPU_ISOLATED
, CPU_RESERVED
, and HUGEPAGES_COUNT
keys.
The following performance profile configures node-level performance settings for OpenShift Container Platform clusters on commodity hardware to host telco RAN DU workloads.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
# if you change this name make sure the 'include' line in TunedPerformancePatch.yaml
# matches this name: include=openshift-node-performance-${PerformanceProfile.metadata.name}
# Also in file 'validatorCRs/informDuValidator.yaml':
# name: 50-performance-${PerformanceProfile.metadata.name}
name: openshift-node-performance-profile
annotations:
ran.openshift.io/reference-configuration: "ran-du.redhat.com"
spec:
additionalKernelArgs:
- "rcupdate.rcu_normal_after_boot=0"
- "efi=runtime"
- "vfio_pci.enable_sriov=1"
- "vfio_pci.disable_idle_d3=1"
- "module_blacklist=irdma"
cpu:
isolated: $isolated
reserved: $reserved
hugepages:
defaultHugepagesSize: $defaultHugepagesSize
pages:
- size: $size
count: $count
node: $node
machineConfigPoolSelector:
pools.operator.machineconfiguration.openshift.io/$mcp: ""
nodeSelector:
node-role.kubernetes.io/$mcp: ""
numa:
topologyPolicy: "restricted"
# To use the standard (non-realtime) kernel, set enabled to false
realTimeKernel:
enabled: true
workloadHints:
# WorkloadHints defines the set of upper level flags for different type of workloads.
# See https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md#workloadhints
# for detailed descriptions of each item.
# The configuration below is set for a low latency, performance mode.
realTime: true
highPowerConsumption: false
perPodPowerManagement: false
The following performance profile configures node-level performance settings for OpenShift Container Platform clusters on commodity hardware to host telco core workloads.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
# if you change this name make sure the 'include' line in TunedPerformancePatch.yaml
# matches this name: include=openshift-node-performance-${PerformanceProfile.metadata.name}
# Also in file 'validatorCRs/informDuValidator.yaml':
# name: 50-performance-${PerformanceProfile.metadata.name}
name: openshift-node-performance-profile
annotations:
ran.openshift.io/reference-configuration: "ran-du.redhat.com"
spec:
additionalKernelArgs:
- "rcupdate.rcu_normal_after_boot=0"
- "efi=runtime"
- "vfio_pci.enable_sriov=1"
- "vfio_pci.disable_idle_d3=1"
- "module_blacklist=irdma"
cpu:
isolated: $isolated
reserved: $reserved
hugepages:
defaultHugepagesSize: $defaultHugepagesSize
pages:
- size: $size
count: $count
node: $node
machineConfigPoolSelector:
pools.operator.machineconfiguration.openshift.io/$mcp: ""
nodeSelector:
node-role.kubernetes.io/$mcp: ""
numa:
topologyPolicy: "restricted"
# To use the standard (non-realtime) kernel, set enabled to false
realTimeKernel:
enabled: true
workloadHints:
# WorkloadHints defines the set of upper level flags for different type of workloads.
# See https://github.com/openshift/cluster-node-tuning-operator/blob/master/docs/performanceprofile/performance_profile.md#workloadhints
# for detailed descriptions of each item.
# The configuration below is set for a low latency, performance mode.
realTime: true
highPowerConsumption: false
perPodPowerManagement: false
The Node Tuning Operator supports v2
, v1
, and v1alpha1
for the performance profile apiVersion
field. The v1 and v1alpha1 APIs are identical. The v2 API includes an optional boolean field globallyDisableIrqLoadBalancing
with a default value of false
.
When you upgrade the Node Tuning Operator performance profile custom resource definition (CRD) from v1 or v1alpha1 to v2, globallyDisableIrqLoadBalancing
is set to true
on existing profiles.
|
When upgrading Node Tuning Operator API version from v1alpha1 to v1, the v1alpha1 performance profiles are converted on-the-fly using a "None" Conversion strategy and served to the Node Tuning Operator with API version v1.
When upgrading from an older Node Tuning Operator API version, the existing v1 and v1alpha1 performance profiles are converted using a conversion webhook that injects the globallyDisableIrqLoadBalancing
field with a value of true
.
Create a PerformanceProfile
appropriate for the environment’s hardware and topology by using the Performance Profile Creator (PPC) tool. The following table describes the possible values set for the power-consumption-mode
flag associated with the PPC tool and the workload hint that is applied.
Performance Profile creator setting | Hint | Environment | Description |
---|---|---|---|
Default |
|
High throughput cluster without latency requirements |
Performance achieved through CPU partitioning only. |
Low-latency |
|
Regional data-centers |
Both energy savings and low-latency are desirable: compromise between power management, latency and throughput. |
Ultra-low-latency |
|
Far edge clusters, latency critical workloads |
Optimized for absolute minimal latency and maximum determinism at the cost of increased power consumption. |
Per-pod power management |
|
Critical and non-critical workloads |
Allows for power management per pod. |
The following configuration is commonly used in a telco RAN DU deployment.
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: workload-hints
spec:
...
workloadHints:
realTime: true
highPowerConsumption: false
perPodPowerManagement: false (1)
1 | Disables some debugging and monitoring features that can affect system latency. |
When the |
For more information how combinations of power consumption and real-time settings impact latency, see Understanding workload hints.
You can enable power savings for a node that has low priority workloads that are colocated with high priority workloads without impacting the latency or throughput of the high priority workloads. Power saving is possible without modifications to the workloads themselves.
The feature is supported on Intel Ice Lake and later generations of Intel CPUs. The capabilities of the processor might impact the latency and throughput of the high priority workloads. |
You enabled C-states and operating system controlled P-states in the BIOS
Generate a PerformanceProfile
with the per-pod-power-management
argument set to true
:
$ podman run --entrypoint performance-profile-creator -v \
/must-gather:/must-gather:z registry.redhat.io/openshift4/ose-cluster-node-tuning-operator:v4.14 \
--mcp-name=worker-cnf --reserved-cpu-count=20 --rt-kernel=true \
--split-reserved-cpus-across-numa=false --topology-manager-policy=single-numa-node \
--must-gather-dir-path /must-gather --power-consumption-mode=low-latency \ (1)
--per-pod-power-management=true > my-performance-profile.yaml
1 | The power-consumption-mode argument must be default or low-latency when the per-pod-power-management argument is set to true . |
PerformanceProfile
with perPodPowerManagement
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: performance
spec:
[.....]
workloadHints:
realTime: true
highPowerConsumption: false
perPodPowerManagement: true
Set the default cpufreq
governor as an additional kernel argument in the PerformanceProfile
custom resource (CR):
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: performance
spec:
...
additionalKernelArgs:
- cpufreq.default_governor=schedutil (1)
1 | Using the schedutil governor is recommended, however, you can use other governors such as the ondemand or powersave governors. |
Set the maximum CPU frequency in the TunedPerformancePatch
CR:
spec:
profile:
- data: |
[sysfs]
/sys/devices/system/cpu/intel_pstate/max_perf_pct = <x> (1)
1 | The max_perf_pct controls the maximum frequency that the cpufreq driver is allowed to set as a percentage of the maximum supported cpu frequency. This value applies to all CPUs. You can check the maximum supported frequency in /sys/devices/system/cpu/cpu0/cpufreq/cpuinfo_max_freq . As a starting point, you can use a percentage that caps all CPUs at the All Cores Turbo frequency. The All Cores Turbo frequency is the frequency that all cores will run at when the cores are all fully occupied. |
Generic housekeeping and workload tasks use CPUs in a way that may impact latency-sensitive processes. By default, the container runtime uses all online CPUs to run all containers together, which can result in context switches and spikes in latency. Partitioning the CPUs prevents noisy processes from interfering with latency-sensitive processes by separating them from each other. The following table describes how processes run on a CPU after you have tuned the node using the Node Tuning Operator:
Process type | Details |
---|---|
|
Runs on any CPU except where low latency workload is running |
Infrastructure pods |
Runs on any CPU except where low latency workload is running |
Interrupts |
Redirects to reserved CPUs (optional in OpenShift Container Platform 4.7 and later) |
Kernel processes |
Pins to reserved CPUs |
Latency-sensitive workload pods |
Pins to a specific set of exclusive CPUs from the isolated pool |
OS processes/systemd services |
Pins to reserved CPUs |
The allocatable capacity of cores on a node for pods of all QoS process types, Burstable
, BestEffort
, or Guaranteed
, is equal to the capacity of the isolated pool. The capacity of the reserved pool is removed from the node’s total core capacity for use by the cluster and operating system housekeeping duties.
A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 25 cores to QoS Guaranteed
pods and 25 cores for BestEffort
or Burstable
pods. This matches the capacity of the isolated pool.
A node features a capacity of 100 cores. Using a performance profile, the cluster administrator allocates 50 cores to the isolated pool and 50 cores to the reserved pool. The cluster administrator assigns 50 cores to QoS Guaranteed
pods and one core for BestEffort
or Burstable
pods. This exceeds the capacity of the isolated pool by one core. Pod scheduling fails because of insufficient CPU capacity.
The exact partitioning pattern to use depends on many factors like hardware, workload characteristics and the expected system load. Some sample use cases are as follows:
If the latency-sensitive workload uses specific hardware, such as a network interface controller (NIC), ensure that the CPUs in the isolated pool are as close as possible to this hardware. At a minimum, you should place the workload in the same Non-Uniform Memory Access (NUMA) node.
The reserved pool is used for handling all interrupts. When depending on system networking, allocate a sufficiently-sized reserve pool to handle all the incoming packet interrupts. In 4.14 and later versions, workloads can optionally be labeled as sensitive.
The decision regarding which specific CPUs should be used for reserved and isolated partitions requires detailed analysis and measurements. Factors like NUMA affinity of devices and memory play a role. The selection also depends on the workload architecture and the specific use case.
The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node. |
To ensure that housekeeping tasks and workloads do not interfere with each other, specify two groups of CPUs in the spec
section of the performance profile.
isolated
- Specifies the CPUs for the application container workloads. These CPUs have the lowest latency. Processes in this group have no interruptions and can, for example, reach much higher DPDK zero packet loss bandwidth.
reserved
- Specifies the CPUs for the cluster and operating system housekeeping duties. Threads in the reserved
group are often busy. Do not run latency-sensitive applications in the reserved
group. Latency-sensitive applications run in the isolated
group.
Create a performance profile appropriate for the environment’s hardware and topology.
Add the reserved
and isolated
parameters with the CPUs you want reserved and isolated for the infra and application containers:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: infra-cpus
spec:
cpu:
reserved: "0-4,9" (1)
isolated: "5-8" (2)
nodeSelector: (3)
node-role.kubernetes.io/worker: ""
1 | Specify which CPUs are for infra containers to perform cluster and operating system housekeeping duties. |
2 | Specify which CPUs are for application containers to run workloads. |
3 | Optional: Specify a node selector to apply the performance profile to specific nodes. |
To configure Hyper-Threading for an OpenShift Container Platform cluster, set the CPU threads in the performance profile to the same cores that are configured for the reserved or isolated CPU pools.
If you configure a performance profile, and subsequently change the Hyper-Threading configuration for the host, ensure that you update the CPU |
Disabling a previously enabled host Hyper-Threading configuration can cause the CPU core IDs listed in the |
Access to the cluster as a user with the cluster-admin
role.
Install the OpenShift CLI (oc).
Ascertain which threads are running on what CPUs for the host you want to configure.
You can view which threads are running on the host CPUs by logging in to the cluster and running the following command:
$ lscpu --all --extended
CPU NODE SOCKET CORE L1d:L1i:L2:L3 ONLINE MAXMHZ MINMHZ
0 0 0 0 0:0:0:0 yes 4800.0000 400.0000
1 0 0 1 1:1:1:0 yes 4800.0000 400.0000
2 0 0 2 2:2:2:0 yes 4800.0000 400.0000
3 0 0 3 3:3:3:0 yes 4800.0000 400.0000
4 0 0 0 0:0:0:0 yes 4800.0000 400.0000
5 0 0 1 1:1:1:0 yes 4800.0000 400.0000
6 0 0 2 2:2:2:0 yes 4800.0000 400.0000
7 0 0 3 3:3:3:0 yes 4800.0000 400.0000
In this example, there are eight logical CPU cores running on four physical CPU cores. CPU0 and CPU4 are running on physical Core0, CPU1 and CPU5 are running on physical Core 1, and so on.
Alternatively, to view the threads that are set for a particular physical CPU core (cpu0
in the example below), open a shell prompt and run the following:
$ cat /sys/devices/system/cpu/cpu0/topology/thread_siblings_list
0-4
Apply the isolated and reserved CPUs in the PerformanceProfile
YAML. For example, you can set logical cores CPU0 and CPU4 as isolated
, and logical cores CPU1 to CPU3 and CPU5 to CPU7 as reserved
. When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs.
...
cpu:
isolated: 0,4
reserved: 1-3,5-7
...
The reserved and isolated CPU pools must not overlap and together must span all available cores in the worker node. |
Hyper-Threading is enabled by default on most Intel processors. If you enable Hyper-Threading, all threads processed by a particular core must be isolated or processed on the same core. When Hyper-Threading is enabled, all guaranteed pods must use multiples of the simultaneous multi-threading (SMT) level to avoid a "noisy neighbor" situation that can cause the pod to fail. See Static policy options for more information. |
When configuring clusters for low latency processing, consider whether you want to disable Hyper-Threading before you deploy the cluster. To disable Hyper-Threading, perform the following steps:
Create a performance profile that is appropriate for your hardware and topology.
Set nosmt
as an additional kernel argument. The following example performance profile illustrates this setting:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: example-performanceprofile
spec:
additionalKernelArgs:
- nmi_watchdog=0
- audit=0
- mce=off
- processor.max_cstate=1
- idle=poll
- intel_idle.max_cstate=0
- nosmt
cpu:
isolated: 2-3
reserved: 0-1
hugepages:
defaultHugepagesSize: 1G
pages:
- count: 2
node: 0
size: 1G
nodeSelector:
node-role.kubernetes.io/performance: ''
realTimeKernel:
enabled: true
When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs. |
The Node Tuning Operator can manage host CPUs by dividing them into reserved CPUs for cluster and operating system housekeeping duties, including pod infra containers, and isolated CPUs for application containers to run the workloads. This allows you to set CPUs for low latency workloads as isolated.
Device interrupts are load balanced between all isolated and reserved CPUs to avoid CPUs being overloaded, with the exception of CPUs where there is a guaranteed pod running. Guaranteed pod CPUs are prevented from processing device interrupts when the relevant annotations are set for the pod.
In the performance profile, globallyDisableIrqLoadBalancing
is used to manage whether device interrupts are processed or not. For certain workloads, the reserved CPUs are not always sufficient for dealing with device interrupts, and for this reason, device interrupts are not globally disabled on the isolated CPUs. By default, Node Tuning Operator does not disable device interrupts on isolated CPUs.
Some IRQ controllers lack support for IRQ affinity setting and will always expose all online CPUs as the IRQ mask. These IRQ controllers effectively run on CPU 0.
The following are examples of drivers and hardware that Red Hat are aware lack support for IRQ affinity setting. The list is, by no means, exhaustive:
Some RAID controller drivers, such as megaraid_sas
Many non-volatile memory express (NVMe) drivers
Some LAN on motherboard (LOM) network controllers
The driver uses managed_irqs
The reason they do not support IRQ affinity setting might be associated with factors such as the type of processor, the IRQ controller, or the circuitry connections in the motherboard. |
If the effective affinity of any IRQ is set to an isolated CPU, it might be a sign of some hardware or driver not supporting IRQ affinity setting. To find the effective affinity, log in to the host and run the following command:
$ find /proc/irq -name effective_affinity -printf "%p: " -exec cat {} \;
/proc/irq/0/effective_affinity: 1
/proc/irq/1/effective_affinity: 8
/proc/irq/2/effective_affinity: 0
/proc/irq/3/effective_affinity: 1
/proc/irq/4/effective_affinity: 2
/proc/irq/5/effective_affinity: 1
/proc/irq/6/effective_affinity: 1
/proc/irq/7/effective_affinity: 1
/proc/irq/8/effective_affinity: 1
/proc/irq/9/effective_affinity: 2
/proc/irq/10/effective_affinity: 1
/proc/irq/11/effective_affinity: 1
/proc/irq/12/effective_affinity: 4
/proc/irq/13/effective_affinity: 1
/proc/irq/14/effective_affinity: 1
/proc/irq/15/effective_affinity: 1
/proc/irq/24/effective_affinity: 2
/proc/irq/25/effective_affinity: 4
/proc/irq/26/effective_affinity: 2
/proc/irq/27/effective_affinity: 1
/proc/irq/28/effective_affinity: 8
/proc/irq/29/effective_affinity: 4
/proc/irq/30/effective_affinity: 4
/proc/irq/31/effective_affinity: 8
/proc/irq/32/effective_affinity: 8
/proc/irq/33/effective_affinity: 1
/proc/irq/34/effective_affinity: 2
Some drivers use managed_irqs
, whose affinity is managed internally by the kernel and userspace cannot change the affinity. In some cases, these IRQs might be assigned to isolated CPUs. For more information about managed_irqs
, see Affinity of managed interrupts cannot be changed even if they target isolated CPU.
Configure a cluster node for IRQ dynamic load balancing to control which cores can receive device interrupt requests (IRQ).
For core isolation, all server hardware components must support IRQ affinity. To check if the hardware components of your server support IRQ affinity, view the server’s hardware specifications or contact your hardware provider.
Log in to the OpenShift Container Platform cluster as a user with cluster-admin privileges.
Set the performance profile apiVersion
to use performance.openshift.io/v2
.
Remove the globallyDisableIrqLoadBalancing
field or set it to false
.
Set the appropriate isolated and reserved CPUs. The following snippet illustrates a profile that reserves 2 CPUs. IRQ load-balancing is enabled for pods running on the isolated
CPU set:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: dynamic-irq-profile
spec:
cpu:
isolated: 2-5
reserved: 0-1
...
When you configure reserved and isolated CPUs, the infra containers in pods use the reserved CPUs and the application containers use the isolated CPUs. |
Create the pod that uses exclusive CPUs, and set irq-load-balancing.crio.io
and cpu-quota.crio.io
annotations to disable
. For example:
apiVersion: v1
kind: Pod
metadata:
name: dynamic-irq-pod
annotations:
irq-load-balancing.crio.io: "disable"
cpu-quota.crio.io: "disable"
spec:
containers:
- name: dynamic-irq-pod
image: "registry.redhat.io/openshift4/cnf-tests-rhel8:v4.14"
command: ["sleep", "10h"]
resources:
requests:
cpu: 2
memory: "200M"
limits:
cpu: 2
memory: "200M"
nodeSelector:
node-role.kubernetes.io/worker-cnf: ""
runtimeClassName: performance-dynamic-irq-profile
...
Enter the pod runtimeClassName
in the form performance-<profile_name>, where <profile_name> is the name
from the PerformanceProfile
YAML, in this example, performance-dynamic-irq-profile
.
Set the node selector to target a cnf-worker.
Ensure the pod is running correctly. Status should be running
, and the correct cnf-worker node should be set:
$ oc get pod -o wide
NAME READY STATUS RESTARTS AGE IP NODE NOMINATED NODE READINESS GATES
dynamic-irq-pod 1/1 Running 0 5h33m <ip-address> <node-name> <none> <none>
Get the CPUs that the pod configured for IRQ dynamic load balancing runs on:
$ oc exec -it dynamic-irq-pod -- /bin/bash -c "grep Cpus_allowed_list /proc/self/status | awk '{print $2}'"
Cpus_allowed_list: 2-3
Ensure the node configuration is applied correctly. Log in to the node to verify the configuration.
$ oc debug node/<node-name>
Starting pod/<node-name>-debug ...
To use host binaries, run `chroot /host`
Pod IP: <ip-address>
If you don't see a command prompt, try pressing enter.
sh-4.4#
Verify that you can use the node file system:
sh-4.4# chroot /host
sh-4.4#
Ensure the default system CPU affinity mask does not include the dynamic-irq-pod
CPUs, for example, CPUs 2 and 3.
$ cat /proc/irq/default_smp_affinity
33
Ensure the system IRQs are not configured to run on the dynamic-irq-pod
CPUs:
find /proc/irq/ -name smp_affinity_list -exec sh -c 'i="$1"; mask=$(cat $i); file=$(echo $i); echo $file: $mask' _ {} \;
/proc/irq/0/smp_affinity_list: 0-5
/proc/irq/1/smp_affinity_list: 5
/proc/irq/2/smp_affinity_list: 0-5
/proc/irq/3/smp_affinity_list: 0-5
/proc/irq/4/smp_affinity_list: 0
/proc/irq/5/smp_affinity_list: 0-5
/proc/irq/6/smp_affinity_list: 0-5
/proc/irq/7/smp_affinity_list: 0-5
/proc/irq/8/smp_affinity_list: 4
/proc/irq/9/smp_affinity_list: 4
/proc/irq/10/smp_affinity_list: 0-5
/proc/irq/11/smp_affinity_list: 0
/proc/irq/12/smp_affinity_list: 1
/proc/irq/13/smp_affinity_list: 0-5
/proc/irq/14/smp_affinity_list: 1
/proc/irq/15/smp_affinity_list: 0
/proc/irq/24/smp_affinity_list: 1
/proc/irq/25/smp_affinity_list: 1
/proc/irq/26/smp_affinity_list: 1
/proc/irq/27/smp_affinity_list: 5
/proc/irq/28/smp_affinity_list: 1
/proc/irq/29/smp_affinity_list: 0
/proc/irq/30/smp_affinity_list: 0-5
Nodes must pre-allocate huge pages used in an OpenShift Container Platform cluster. Use the Node Tuning Operator to allocate huge pages on a specific node.
OpenShift Container Platform provides a method for creating and allocating huge pages. Node Tuning Operator provides an easier method for doing this using the performance profile.
For example, in the hugepages
pages
section of the performance profile, you can specify multiple blocks of size
, count
, and, optionally, node
:
hugepages:
defaultHugepagesSize: "1G"
pages:
- size: "1G"
count: 4
node: 0 (1)
1 | node is the NUMA node in which the huge pages are allocated. If you omit node , the pages are evenly spread across all NUMA nodes. |
Wait for the relevant machine config pool status that indicates the update is finished. |
These are the only configuration steps you need to do to allocate huge pages.
To verify the configuration, see the /proc/meminfo
file on the node:
$ oc debug node/ip-10-0-141-105.ec2.internal
# grep -i huge /proc/meminfo
AnonHugePages: ###### ##
ShmemHugePages: 0 kB
HugePages_Total: 2
HugePages_Free: 2
HugePages_Rsvd: 0
HugePages_Surp: 0
Hugepagesize: #### ##
Hugetlb: #### ##
Use oc describe
to report the new size:
$ oc describe node worker-0.ocp4poc.example.com | grep -i huge
hugepages-1g=true
hugepages-###: ###
hugepages-###: ###
You can request huge pages with different sizes under the same container. This allows you to define more complicated pods consisting of containers with different huge page size needs.
For example, you can define sizes 1G
and 2M
and the Node Tuning Operator will configure both sizes on the node, as shown here:
spec:
hugepages:
defaultHugepagesSize: 1G
pages:
- count: 1024
node: 0
size: 2M
- count: 4
node: 1
size: 1G
The Node Tuning Operator facilitates reducing NIC queues for enhanced performance. Adjustments are made using the performance profile, allowing customization of queues for different network devices.
The performance profile lets you adjust the queue count for each network device.
Supported network devices:
Non-virtual network devices
Network devices that support multiple queues (channels)
Unsupported network devices:
Pure software network interfaces
Block devices
Intel DPDK virtual functions
Access to the cluster as a user with the cluster-admin
role.
Install the OpenShift CLI (oc
).
Log in to the OpenShift Container Platform cluster running the Node Tuning Operator as a user with cluster-admin
privileges.
Create and apply a performance profile appropriate for your hardware and topology. For guidance on creating a profile, see the "Creating a performance profile" section.
Edit this created performance profile:
$ oc edit -f <your_profile_name>.yaml
Populate the spec
field with the net
object. The object list can contain two fields:
userLevelNetworking
is a required field specified as a boolean flag. If userLevelNetworking
is true
, the queue count is set to the reserved CPU count for all supported devices. The default is false
.
devices
is an optional field specifying a list of devices that will have the queues set to the reserved CPU count. If the device list is empty, the configuration applies to all network devices. The configuration is as follows:
interfaceName
: This field specifies the interface name, and it supports shell-style wildcards, which can be positive or negative.
Example wildcard syntax is as follows: <string> .*
Negative rules are prefixed with an exclamation mark. To apply the net queue changes to all devices other than the excluded list, use !<device>
, for example, !eno1
.
vendorID
: The network device vendor ID represented as a 16-bit hexadecimal number with a 0x
prefix.
deviceID
: The network device ID (model) represented as a 16-bit hexadecimal number with a 0x
prefix.
When a When two or more devices are specified, the net queues count is set to any net device that matches one of them. |
Set the queue count to the reserved CPU count for all devices by using this example performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: manual
spec:
cpu:
isolated: 3-51,55-103
reserved: 0-2,52-54
net:
userLevelNetworking: true
nodeSelector:
node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices matching any of the defined device identifiers by using this example performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: manual
spec:
cpu:
isolated: 3-51,55-103
reserved: 0-2,52-54
net:
userLevelNetworking: true
devices:
- interfaceName: "eth0"
- interfaceName: "eth1"
- vendorID: "0x1af4"
deviceID: "0x1000"
nodeSelector:
node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices starting with the interface name eth
by using this example performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: manual
spec:
cpu:
isolated: 3-51,55-103
reserved: 0-2,52-54
net:
userLevelNetworking: true
devices:
- interfaceName: "eth*"
nodeSelector:
node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices with an interface named anything other than eno1
by using this example performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: manual
spec:
cpu:
isolated: 3-51,55-103
reserved: 0-2,52-54
net:
userLevelNetworking: true
devices:
- interfaceName: "!eno1"
nodeSelector:
node-role.kubernetes.io/worker-cnf: ""
Set the queue count to the reserved CPU count for all devices that have an interface name eth0
, vendorID
of 0x1af4
, and deviceID
of 0x1000
by using this example performance profile:
apiVersion: performance.openshift.io/v2
kind: PerformanceProfile
metadata:
name: manual
spec:
cpu:
isolated: 3-51,55-103
reserved: 0-2,52-54
net:
userLevelNetworking: true
devices:
- interfaceName: "eth0"
- vendorID: "0x1af4"
deviceID: "0x1000"
nodeSelector:
node-role.kubernetes.io/worker-cnf: ""
Apply the updated performance profile:
$ oc apply -f <your_profile_name>.yaml
In this section, a number of examples illustrate different performance profiles and how to verify the changes are applied.
In this example, the net queue count is set to the reserved CPU count (2) for all supported devices.
The relevant section from the performance profile is:
apiVersion: performance.openshift.io/v2
metadata:
name: performance
spec:
kind: PerformanceProfile
spec:
cpu:
reserved: 0-1 #total = 2
isolated: 2-8
net:
userLevelNetworking: true
# ...
Display the status of the queues associated with a device using the following command:
Run this command on the node where the performance profile was applied. |
$ ethtool -l <device>
Verify the queue status before the profile is applied:
$ ethtool -l ens4
Channel parameters for ens4:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 4
Verify the queue status after the profile is applied:
$ ethtool -l ens4
Channel parameters for ens4:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 2 (1)
1 | The combined channel shows that the total count of reserved CPUs for all supported devices is 2. This matches what is configured in the performance profile. |
In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices with a specific vendorID
.
The relevant section from the performance profile is:
apiVersion: performance.openshift.io/v2
metadata:
name: performance
spec:
kind: PerformanceProfile
spec:
cpu:
reserved: 0-1 #total = 2
isolated: 2-8
net:
userLevelNetworking: true
devices:
- vendorID = 0x1af4
# ...
Display the status of the queues associated with a device using the following command:
Run this command on the node where the performance profile was applied. |
$ ethtool -l <device>
Verify the queue status after the profile is applied:
$ ethtool -l ens4
Channel parameters for ens4:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 2 (1)
1 | The total count of reserved CPUs for all supported devices with vendorID=0x1af4 is 2.
For example, if there is another network device ens2 with vendorID=0x1af4 it will also have total net queues of 2. This matches what is configured in the performance profile. |
In this example, the net queue count is set to the reserved CPU count (2) for all supported network devices that match any of the defined device identifiers.
The command udevadm info
provides a detailed report on a device. In this example the devices are:
# udevadm info -p /sys/class/net/ens4
...
E: ID_MODEL_ID=0x1000
E: ID_VENDOR_ID=0x1af4
E: INTERFACE=ens4
...
# udevadm info -p /sys/class/net/eth0
...
E: ID_MODEL_ID=0x1002
E: ID_VENDOR_ID=0x1001
E: INTERFACE=eth0
...
Set the net queues to 2 for a device with interfaceName
equal to eth0
and any devices that have a vendorID=0x1af4
with the following performance profile:
apiVersion: performance.openshift.io/v2
metadata:
name: performance
spec:
kind: PerformanceProfile
spec:
cpu:
reserved: 0-1 #total = 2
isolated: 2-8
net:
userLevelNetworking: true
devices:
- interfaceName = eth0
- vendorID = 0x1af4
...
Verify the queue status after the profile is applied:
$ ethtool -l ens4
Channel parameters for ens4:
Pre-set maximums:
RX: 0
TX: 0
Other: 0
Combined: 4
Current hardware settings:
RX: 0
TX: 0
Other: 0
Combined: 2 (1)
1 | The total count of reserved CPUs for all supported devices with vendorID=0x1af4 is set to 2.
For example, if there is another network device ens2 with vendorID=0x1af4 , it will also have the total net queues set to 2. Similarly, a device with interfaceName equal to eth0 will have total net queues set to 2. |
Log messages detailing the assigned devices are recorded in the respective Tuned daemon logs. The following messages might be recorded to the /var/log/tuned/tuned.log
file:
An INFO
message is recorded detailing the successfully assigned devices:
INFO tuned.plugins.base: instance net_test (net): assigning devices ens1, ens2, ens3
A WARNING
message is recorded if none of the devices can be assigned:
WARNING tuned.plugins.base: instance net_test: no matching devices available