Kubernetes exposes sysctl settings so that users can modify certain kernel parameters at runtime for the namespaces within a container. Only namespaced sysctls can be set independently on pods. A sysctl that is not namespaced is called node-level and cannot be set within OpenShift Dedicated. Moreover, only sysctls considered safe are whitelisted by default; a cluster administrator can manually enable other, unsafe sysctls on a node to make them available to users.

About sysctls

In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system. The parameters cover various subsystems, such as:

  • kernel (common prefix: kernel.)

  • networking (common prefix: net.)

  • virtual memory (common prefix: vm.)

  • MDADM (common prefix: dev.)

More subsystems are described in Kernel documentation. To get a list of all parameters, run:

$ sudo sysctl -a
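
Because each parameter is a file under /proc/sys/, a dotted sysctl name can be translated into a path and read directly. The following is a minimal sketch of that mapping; the helper name sysctl_path is hypothetical, and the sketch assumes that no component of the name contains a dot of its own (some network interface names do).

```shell
#!/bin/sh
# Hypothetical helper: map a dotted sysctl name to its file under /proc/sys.
# Simplification: every dot is treated as a path separator.
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Reading the file returns the same value that sysctl reports for the name.
path=$(sysctl_path kernel.ostype)
if [ -r "$path" ]; then
    cat "$path"
fi
```

Reading through /proc/sys/ does not require root for most parameters, which makes it convenient for inspecting values from inside a container.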

Namespaced versus node-level sysctls

A number of sysctls are namespaced in the Linux kernel. This means that you can set them independently for each pod on a node. Being namespaced is a requirement for sysctls to be accessible in a pod context within Kubernetes.

The following sysctls are known to be namespaced:

  • kernel.shm*

  • kernel.msg*

  • kernel.sem

  • fs.mqueue.*

Additionally, most of the sysctls in the net.* group are known to be namespaced. Their namespace adoption differs based on the kernel version and distributor.
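
The list above can be expressed as a small pattern check. This is only a sketch: the function name is_namespaced is hypothetical, and it treats every net.* parameter as namespaced, which is an oversimplification because the text says only most of them are and that this varies by kernel version.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the sysctl name matches one of the
# namespaced groups listed in this section.
is_namespaced() {
    case $1 in
        kernel.shm*|kernel.msg*|kernel.sem|fs.mqueue.*) return 0 ;;
        net.*) return 0 ;;  # assumption: treat all of net.* as namespaced
        *) return 1 ;;
    esac
}

for name in kernel.shmmax fs.mqueue.msg_max vm.swappiness; do
    if is_namespaced "$name"; then
        echo "$name: namespaced"
    else
        echo "$name: node-level"
    fi
done
```

A check like this is useful only as a first filter; the kernel on the node remains the authority on whether a given parameter is namespaced.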

Sysctls that are not namespaced are called node-level and must be set manually by the cluster administrator, either by means of the underlying Linux distribution of the nodes, such as by modifying the /etc/sysctl.conf file, or by using a DaemonSet with privileged containers.

Consider marking nodes with special sysctls as tainted. Only schedule pods onto them that need those sysctl settings. Use the taints and tolerations feature to mark the nodes.

Safe versus unsafe sysctls

Sysctls are grouped into safe and unsafe sysctls.

For a sysctl to be considered safe, it must use proper namespacing and must be properly isolated between pods on the same node. This means that if you set a sysctl for one pod it must not:

  • Influence any other pod on the node

  • Harm the node’s health

  • Gain CPU or memory resources outside of the resource limits of a pod

OpenShift Dedicated supports, or whitelists, the following sysctls in the safe set:

  • kernel.shm_rmid_forced

  • net.ipv4.ip_local_port_range

  • net.ipv4.tcp_syncookies

All safe sysctls are enabled by default. You can use a sysctl in a pod by modifying the pod specification.

Any sysctl not whitelisted by OpenShift Dedicated is considered unsafe for OpenShift Dedicated. Note that being namespaced alone is not sufficient for the sysctl to be considered safe.
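
To illustrate the distinction, the following sketch classifies a sysctl against the three whitelisted names listed above. The function name is_safe_sysctl is hypothetical, and the list is copied from this section rather than queried from a cluster.

```shell
#!/bin/sh
# Hypothetical helper: succeed only for the sysctls in the safe set
# listed in this section; everything else is treated as unsafe.
is_safe_sysctl() {
    case $1 in
        kernel.shm_rmid_forced|net.ipv4.ip_local_port_range|net.ipv4.tcp_syncookies)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

# kernel.msgmax is namespaced, but it is not in the safe set.
for name in kernel.shm_rmid_forced kernel.msgmax; do
    if is_safe_sysctl "$name"; then
        echo "$name: safe, allowed by default"
    else
        echo "$name: unsafe, must be enabled by a cluster administrator"
    fi
done
```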

All unsafe sysctls are disabled by default, and the cluster administrator must manually enable them on a per-node basis. Pods that use disabled unsafe sysctls are scheduled but do not launch:

$ oc get pod

NAME        READY   STATUS            RESTARTS   AGE
hello-pod   0/1     SysctlForbidden   0          14s

Setting sysctls for a pod

You can set sysctls on pods using the pod’s securityContext. The securityContext applies to all containers in the same pod.

Safe sysctls are allowed by default. A pod with unsafe sysctls fails to launch on any node unless the cluster administrator explicitly enables unsafe sysctls for that node. As with node-level sysctls, use the taints and tolerations feature or labels on nodes to schedule those pods onto the right nodes.

The following example uses the pod securityContext to set a safe sysctl, kernel.shm_rmid_forced, and two unsafe sysctls, net.ipv4.route.min_pmtu and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.

To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects.

Procedure

To use safe and unsafe sysctls:

  1. Modify the YAML file that defines the pod and add the securityContext spec, as shown in the following example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: sysctl-example
    spec:
      securityContext:
        sysctls:
        - name: kernel.shm_rmid_forced
          value: "0"
        - name: net.ipv4.route.min_pmtu
          value: "552"
        - name: kernel.msgmax
          value: "65536"
      ...
  2. Create the pod:

    $ oc apply -f <file-name>.yaml

    If the unsafe sysctls are not allowed for the node, the pod is scheduled, but does not deploy:

    $ oc get pod
    
    NAME        READY   STATUS            RESTARTS   AGE
    hello-pod   0/1     SysctlForbidden   0          14s
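
The SysctlForbidden status can also be detected in a script. The following is a sketch that extracts the STATUS column from output in the format shown above; the helper name pod_status is hypothetical, and the sample text is copied from this section so that the snippet runs without a cluster.

```shell
#!/bin/sh
# Hypothetical helper: read "oc get pod" output on stdin and print the
# STATUS column (third field) for the named pod, skipping the header line.
pod_status() {
    awk -v pod="$1" 'NR > 1 && $1 == pod { print $3 }'
}

# Sample output copied from this section. Against a live cluster, the
# equivalent would be: oc get pod | pod_status hello-pod
sample='NAME        READY   STATUS            RESTARTS   AGE
hello-pod   0/1     SysctlForbidden   0          14s'

status=$(printf '%s\n' "$sample" | pod_status hello-pod)
if [ "$status" = "SysctlForbidden" ]; then
    echo "hello-pod was rejected: an unsafe sysctl is not allowed on the node"
fi
```

Parsing columns by position is fragile if the output format changes, so treat this as a quick diagnostic rather than a stable interface.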

Enabling unsafe sysctls

A cluster administrator can allow certain unsafe sysctls for very special situations such as high-performance or real-time application tuning.

If you want to use unsafe sysctls, a cluster administrator must enable them individually for a specific type of node. The sysctls must be namespaced.

Because they are unsafe by nature, you use unsafe sysctls at your own risk. They can lead to severe problems, such as improper behavior of containers, resource shortage, or breaking a node.

Procedure
  1. Add a label to the MachineConfigPool where the containers with the unsafe sysctls will run:

    $ oc edit machineconfigpool worker
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      creationTimestamp: 2019-02-08T14:52:39Z
      generation: 1
      labels:
        custom-kubelet: sysctl (1)
    1 Add a key: value pair label.
  2. Create a KubeletConfig Custom Resource (CR):

    apiVersion: machineconfiguration.openshift.io/v1
    kind: KubeletConfig
    metadata:
      name: custom-kubelet
    spec:
      machineConfigPoolSelector:
        matchLabels:
          custom-kubelet: sysctl (1)
      kubeletConfig:
        allowedUnsafeSysctls: (2)
          - "kernel.msg*"
          - "net.ipv4.route.min_pmtu"
    1 Specify the label from the MachineConfigPool.
    2 List the unsafe sysctls you want to allow.
  3. Create the object:

    $ oc apply -f set-sysctl-worker.yaml

    A new MachineConfig with a name in the 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet format is created.

  4. Wait for the cluster to reboot, using the MachineConfigPool object status fields to track progress:

    For example:

    status:
      conditions:
        - lastTransitionTime: '2019-08-11T15:32:00Z'
          message: >-
            All nodes are updating to
            rendered-worker-ccbfb5d2838d65013ab36300b7b3dc13
          reason: ''
          status: 'True'
          type: Updating

    A message similar to the following appears when the cluster is ready:

        - lastTransitionTime: '2019-08-11T16:00:00Z'
          message: >-
            All nodes are updated with
            rendered-worker-ccbfb5d2838d65013ab36300b7b3dc13
          reason: ''
          status: 'True'
          type: Updated
  5. When the cluster is ready, check for the merged KubeletConfig in the new MachineConfig:

    $ oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
            "ownerReferences": [
                {
                    "apiVersion": "machineconfiguration.openshift.io/v1",
                    "blockOwnerDeletion": true,
                    "controller": true,
                    "kind": "KubeletConfig",
                    "name": "custom-kubelet",
                    "uid": "3f64a766-bae8-11e9-abe8-0a1a2a4813f2"

    You can now add unsafe sysctls to pods as needed.
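
Before adding an unsafe sysctl to a pod, it can help to check the name against the allowedUnsafeSysctls entries from the KubeletConfig. The following is a sketch of that check, assuming that an entry ending in * acts as a prefix wildcard (as the kernel.msg* entry suggests) and that any other entry must match exactly. The function name sysctl_allowed is hypothetical and is not part of oc or the kubelet.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the sysctl name (first argument) matches
# any of the allow-list entries (remaining arguments).
# Assumption: a trailing "*" is a prefix wildcard; other entries are exact.
sysctl_allowed() {
    name=$1
    shift
    for pattern in "$@"; do
        case $pattern in
            *\*)
                prefix=${pattern%\*}   # strip the trailing "*"
                case $name in
                    "$prefix"*) return 0 ;;
                esac
                ;;
            *)
                if [ "$name" = "$pattern" ]; then
                    return 0
                fi
                ;;
        esac
    done
    return 1
}

# Entries taken from the KubeletConfig example in this procedure.
for name in kernel.msgmax net.ipv4.route.min_pmtu kernel.shmmax; do
    if sysctl_allowed "$name" "kernel.msg*" "net.ipv4.route.min_pmtu"; then
        echo "$name: allowed on nodes in the labeled pool"
    else
        echo "$name: not allowed, pod would be rejected"
    fi
done
```

A pod that requests a sysctl outside this list is expected to end up in the SysctlForbidden state described earlier.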