CPU Manager manages groups of CPUs and constrains workloads to specific CPUs.

CPU Manager is useful for workloads that have some of these attributes:

  • Require as much CPU time as possible.

  • Are sensitive to processor cache misses.

  • Are low-latency network applications.

  • Coordinate with other processes and benefit from sharing a single processor cache.

Setting up CPU Manager

  1. Optional: Label a node:

    # oc label node perf-node.example.com cpumanager=true
  2. Edit the MachineConfigPool of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:

    # oc edit machineconfigpool worker
  3. Add a label to the worker machine config pool:

      creationTimestamp: 2020-xx-xxx
      generation: 3
        custom-kubelet: cpumanager-enabled
  4. Create a KubeletConfig, cpumanager-kubeletconfig.yaml, custom resource (CR). Refer to the label created in the previous step to have the correct nodes updated with the new kubelet config. See the machineConfigPoolSelector section:

    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
      name: cpumanager-enabled
          custom-kubelet: cpumanager-enabled
         cpuManagerPolicy: static (1)
         cpuManagerReconcilePeriod: 5s (2)
    1 Specify a policy:
    • none. This policy explicitly enables the existing default CPU affinity scheme, providing no affinity beyond what the scheduler does automatically.

    • static. This policy allows pods with certain resource characteristics to be granted increased CPU affinity and exclusivity on the node.

    2 Optional. Specify the CPU Manager reconcile frequency. The default is 5s.
  5. Create the dynamic kubelet config:

    # oc create -f cpumanager-kubeletconfig.yaml

    This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config Operator (MCO) reboots the node. To enable CPU Manager, a reboot is not needed.

  6. Check for the merged kubelet config:

    # oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
    Example output
           "ownerReferences": [
                    "apiVersion": "machineconfiguration.openshift.io/v1",
                    "kind": "KubeletConfig",
                    "name": "cpumanager-enabled",
                    "uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878"
  7. Check the worker for the updated kubelet.conf:

    # oc debug node/perf-node.example.com
    sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
    Example output
    cpuManagerPolicy: static        (1)
    cpuManagerReconcilePeriod: 5s   (1)
    1 These settings were defined when you created the KubeletConfig CR.
  8. Create a pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod:

    # cat cpumanager-pod.yaml
    Example output
    apiVersion: v1
    kind: Pod
      generateName: cpumanager-
      - name: cpumanager
        image: gcr.io/google_containers/pause-amd64:3.0
            cpu: 1
            memory: "1G"
            cpu: 1
            memory: "1G"
        cpumanager: "true"
  9. Create the pod:

    # oc create -f cpumanager-pod.yaml
  10. Verify that the pod is scheduled to the node that you labeled:

    # oc describe pod cpumanager
    Example output
    Name:               cpumanager-6cqz7
    Namespace:          default
    Priority:           0
    PriorityClassName:  <none>
    Node:  perf-node.example.com/xxx.xx.xx.xxx
          cpu:     1
          memory:  1G
          cpu:        1
          memory:     1G
    QoS Class:       Guaranteed
    Node-Selectors:  cpumanager=true
  11. Verify that the cgroups are set up correctly. Get the process ID (PID) of the pause process:

    # ├─init.scope
    │ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17
      │ ├─crio-b5437308f1a574c542bdf08563b865c0345c8f8c0b0a655612c.scope
      │ └─32706 /pause

    Pods of quality of service (QoS) tier Guaranteed are placed within the kubepods.slice. Pods of other QoS tiers end up in child cgroups of kubepods:

    # cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
    # for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done
    Example output
    cpuset.cpus 1
    tasks 32706
  12. Check the allowed CPU list for the task:

    # grep ^Cpus_allowed_list /proc/32706/status
    Example output
     Cpus_allowed_list:    1
  13. Verify that another pod (in this case, the pod in the burstable QoS tier) on the system cannot run on the core allocated for the Guaranteed pod:

    # cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
    # oc describe node perf-node.example.com
    Example output
     attachable-volumes-aws-ebs:  39
     cpu:                         2
     ephemeral-storage:           124768236Ki
     hugepages-1Gi:               0
     hugepages-2Mi:               0
     memory:                      8162900Ki
     pods:                        250
     attachable-volumes-aws-ebs:  39
     cpu:                         1500m
     ephemeral-storage:           124768236Ki
     hugepages-1Gi:               0
     hugepages-2Mi:               0
     memory:                      7548500Ki
     pods:                        250
    -------                               ----                           ------------  ----------  ---------------  -------------  ---
      default                                 cpumanager-6cqz7               1 (66%)       1 (66%)     1G (12%)         1G (12%)       29m
    Allocated resources:
      (Total limits may be over 100 percent, i.e., overcommitted.)
      Resource                    Requests          Limits
      --------                    --------          ------
      cpu                         1440m (96%)       1 (66%)

    This VM has two CPU cores. The system-reserved setting reserves 500 millicores, meaning that half of one core is subtracted from the total capacity of the node to arrive at the Node Allocatable amount. You can see that Allocatable CPU is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:

    NAME                    READY   STATUS    RESTARTS   AGE
    cpumanager-6cqz7        1/1     Running   0          33m
    cpumanager-7qc2t        0/1     Pending   0          11s