Sysctl settings are exposed through Kubernetes, allowing users to modify certain kernel parameters at runtime. Only sysctls that are namespaced can be set independently on pods. A sysctl that is not namespaced is called node-level, and you must set it by another method, such as by using the Node Tuning Operator.

Network sysctls are a special category of sysctl. Network sysctls include:

  • System-wide sysctls, for example net.ipv4.ip_local_port_range, that are valid for all networking. You can set these independently for each pod on a node.

  • Interface-specific sysctls, for example net.ipv4.conf.IFNAME.accept_local, that only apply to a specific additional network interface for a given pod. You can set these independently for each additional network configuration. You set these by using a configuration in the tuning-cni after the network interfaces are created.

Only sysctls that are considered safe are allowed by default. A cluster administrator can manually enable other, unsafe sysctls on a node to make them available to users.

About sysctls

In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available from the /proc/sys/ virtual process file system. The parameters cover various subsystems, such as:

  • kernel (common prefix: kernel.)

  • networking (common prefix: net.)

  • virtual memory (common prefix: vm.)

  • MDADM (common prefix: dev.)

More subsystems are described in Kernel documentation. To get a list of all parameters, run:

$ sudo sysctl -a

Namespaced and node-level sysctls

A number of sysctls are namespaced in the Linux kernel. This means that you can set them independently for each pod on a node. Being namespaced is a requirement for sysctls to be accessible in a pod context within Kubernetes.

The following sysctls are known to be namespaced:

  • kernel.shm*

  • kernel.msg*

  • kernel.sem

  • fs.mqueue.*

Additionally, most of the sysctls in the net.* group are known to be namespaced. Their namespace adoption differs based on the kernel version and distributor.

Sysctls that are not namespaced are called node-level and must be set manually by the cluster administrator, either by means of the underlying Linux distribution of the nodes, such as by modifying the /etc/sysctl.conf file, or by using a daemon set with privileged containers. Alternatively, you can use the Node Tuning Operator to set node-level sysctls.
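As an illustration of the Node Tuning Operator approach, the following is a minimal sketch of a Tuned custom resource that sets one node-level sysctl. The resource name, the node label example.com/node-level-sysctl, and the choice of vm.max_map_count with the value 262144 are assumptions made for this example; adjust them to your environment.

```yaml
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
  name: node-level-sysctl-example            # hypothetical name
  namespace: openshift-cluster-node-tuning-operator
spec:
  profile:
  - name: node-level-sysctl-example
    data: |
      [main]
      summary=Example profile that sets a node-level sysctl
      include=openshift-node
      [sysctl]
      # vm.* sysctls are not namespaced, so they apply to the whole node.
      vm.max_map_count=262144
  recommend:
  - match:
    - label: example.com/node-level-sysctl   # hypothetical node label
    priority: 20
    profile: node-level-sysctl-example
```

In this sketch, the profile is recommended only for nodes that carry the assumed label, so the setting does not spread to every node in the cluster.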

Consider marking nodes with special sysctls as tainted. Only schedule pods onto them that need those sysctl settings. Use the taints and tolerations feature to mark the nodes.
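A minimal sketch of the pod side of this arrangement follows. It assumes that the node was tainted with a hypothetical key and value, sysctl=special, and the NoSchedule effect; the pod name is also illustrative.

```yaml
# Assumes the node was tainted beforehand, for example:
#   oc adm taint nodes <node_name> sysctl=special:NoSchedule
# The key "sysctl" and value "special" are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-tolerating-pod    # hypothetical name
spec:
  tolerations:
  - key: "sysctl"
    operator: "Equal"
    value: "special"
    effect: "NoSchedule"
  containers:
  - name: podexample
    image: centos
    command: ["/bin/bash", "-c", "sleep INF"]
```

A toleration only permits scheduling onto the tainted nodes. To require that the pod lands on one of them, you would typically combine it with a node selector or node affinity.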

Safe and unsafe sysctls

Sysctls are grouped into safe and unsafe sysctls.

For system-wide sysctls to be considered safe, they must be namespaced. A namespaced sysctl ensures there is isolation between namespaces and therefore pods. Setting a sysctl for one pod must not do any of the following:

  • Influence any other pod on the node

  • Harm the node health

  • Gain CPU or memory resources outside of the resource limits of a pod

Being namespaced alone is not sufficient for the sysctl to be considered safe.

Any sysctl that is not on the OpenShift Container Platform allowed list is considered unsafe.

Unsafe sysctls are not allowed by default. For system-wide sysctls, the cluster administrator must manually enable them on a per-node basis. Pods that request unsafe sysctls that have not been enabled are scheduled but do not launch.

You cannot manually enable interface-specific unsafe sysctls.
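One way to enable system-wide unsafe sysctls on a group of nodes is through the kubelet's allowedUnsafeSysctls setting. The following is a sketch only: it assumes that the target machine config pool carries a hypothetical label, custom-kubelet: sysctl, and the listed sysctls are illustrative choices.

```yaml
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet             # hypothetical name
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: sysctl       # hypothetical label on the machine config pool
  kubeletConfig:
    # Namespaced sysctls that are not on the safe list; illustrative values.
    allowedUnsafeSysctls:
    - "kernel.msg*"
    - "net.core.somaxconn"
```

Pods can then request these sysctls through their securityContext, but only on nodes in the selected pool.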

OpenShift Container Platform adds the following system-wide and interface-specific safe sysctls to an allowed safe list:

Table 1. System-wide safe sysctls
sysctl Description

kernel.shm_rmid_forced

When set to 1, all shared memory objects in the current IPC namespace are automatically forced to use IPC_RMID. For more information, see shm_rmid_forced.

net.ipv4.ip_local_port_range

Defines the local port range that is used by TCP and UDP to choose the local port. The first number is the first port number, and the second number is the last local port number. If possible, it is better if these numbers have different parity (one even and one odd value). They must be greater than or equal to ip_unprivileged_port_start. The default values are 32768 and 60999 respectively. For more information, see ip_local_port_range.

net.ipv4.tcp_syncookies

When net.ipv4.tcp_syncookies is set, the kernel handles TCP SYN packets normally until the half-open connection queue is full, at which time, the SYN cookie functionality kicks in. This functionality allows the system to keep accepting valid connections, even if under a denial-of-service attack. For more information, see tcp_syncookies.

net.ipv4.ping_group_range

This restricts ICMP_PROTO datagram sockets to users in the group range. The default is 1 0, meaning that nobody, not even root, can create ping sockets. For more information, see ping_group_range.

net.ipv4.ip_unprivileged_port_start

This defines the first unprivileged port in the network namespace. To disable all privileged ports, set this to 0. Privileged ports must not overlap with the ip_local_port_range. For more information, see ip_unprivileged_port_start.

Table 2. Interface-specific safe sysctls
sysctl Description

net.ipv4.conf.IFNAME.accept_redirects

Accept IPv4 ICMP redirect messages.

net.ipv4.conf.IFNAME.accept_source_route

Accept IPv4 packets with strict source route (SRR) option.

net.ipv4.conf.IFNAME.arp_accept

Define behavior for gratuitous ARP frames with an IPv4 address that is not already present in the ARP table:

  • 0 - Do not create new entries in the ARP table.

  • 1 - Create new entries in the ARP table.

net.ipv4.conf.IFNAME.arp_notify

Define mode for notification of IPv4 address and device changes.

net.ipv4.conf.IFNAME.disable_policy

Disable IPSEC policy (SPD) for this IPv4 interface.

net.ipv4.conf.IFNAME.secure_redirects

Accept ICMP redirect messages only to gateways listed in the interface’s current gateway list.

net.ipv4.conf.IFNAME.send_redirects

Send ICMP redirect messages. This is enabled only if the node acts as a router; a host should not send ICMP redirect messages. Routers use redirects to notify a host about a better routing path that is available for a particular destination.

net.ipv6.conf.IFNAME.accept_ra

Accept IPv6 Router advertisements; autoconfigure using them. It also determines whether or not to transmit router solicitations. Router solicitations are transmitted only if the functional setting is to accept router advertisements.

net.ipv6.conf.IFNAME.accept_redirects

Accept IPv6 ICMP redirect messages.

net.ipv6.conf.IFNAME.accept_source_route

Accept IPv6 packets with SRR option.

net.ipv6.conf.IFNAME.arp_accept

Define behavior for gratuitous ARP frames with an IPv6 address that is not already present in the ARP table:

  • 0 - Do not create new entries in the ARP table.

  • 1 - Create new entries in the ARP table.

net.ipv6.conf.IFNAME.arp_notify

Define mode for notification of IPv6 address and device changes.

net.ipv6.neigh.IFNAME.base_reachable_time_ms

This parameter controls the hardware address to IP mapping lifetime in the neighbour table for IPv6.

net.ipv6.neigh.IFNAME.retrans_time_ms

Set the retransmit timer for neighbor discovery messages.

When setting these values using the tuning CNI plugin, use the value IFNAME literally. The interface name is represented by the IFNAME token, and is replaced with the actual name of the interface at runtime.
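To illustrate the literal IFNAME token, the following is a minimal sketch of a network attachment definition that chains the tuning CNI plugin after a bridge plugin. The name tuningnad, the default namespace, the bridge plugin, and the chosen sysctl value are assumptions for this example.

```yaml
apiVersion: k8s.cni.cncf.io/v1
kind: NetworkAttachmentDefinition
metadata:
  name: tuningnad                  # hypothetical name
  namespace: default
spec:
  # IFNAME is written literally; it is replaced with the actual
  # interface name at runtime.
  config: |-
    {
      "cniVersion": "0.4.0",
      "name": "tuningnad",
      "plugins": [
        { "type": "bridge" },
        {
          "type": "tuning",
          "sysctl": {
            "net.ipv4.conf.IFNAME.accept_redirects": "1"
          }
        }
      ]
    }
```

Because the tuning plugin runs after the interface is created, the sysctl applies only to the additional interface that this attachment adds to the pod.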

Starting a pod with safe sysctls

You can set sysctls on pods using the pod’s securityContext. The securityContext applies to all containers in the same pod.

Safe sysctls are allowed by default.

This example uses the pod securityContext to set the following safe sysctls:

  • kernel.shm_rmid_forced

  • net.ipv4.ip_local_port_range

  • net.ipv4.tcp_syncookies

  • net.ipv4.ping_group_range

To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects.

Use this procedure to start a pod with the configured sysctl settings.

In most cases you modify an existing pod definition and add the securityContext spec.

Procedure
  1. Create a YAML file sysctl_pod.yaml that defines an example pod and add the securityContext spec, as shown in the following example:

    apiVersion: v1
    kind: Pod
    metadata:
      name: sysctl-example
      namespace: default
    spec:
      containers:
      - name: podexample
        image: centos
        command: ["/bin/bash", "-c", "sleep INF"]
        securityContext:
          runAsUser: 2000 # (1)
          runAsGroup: 3000 # (2)
          allowPrivilegeEscalation: false # (3)
          capabilities: # (4)
            drop: ["ALL"]
      securityContext:
        runAsNonRoot: true # (5)
        seccompProfile: # (6)
          type: RuntimeDefault
        sysctls:
        - name: kernel.shm_rmid_forced
          value: "1"
        - name: net.ipv4.ip_local_port_range
          value: "32770       60666"
        - name: net.ipv4.tcp_syncookies
          value: "0"
        - name: net.ipv4.ping_group_range
          value: "0           200000000"
    1 runAsUser controls which user ID the container is run with.
    2 runAsGroup controls which primary group ID the container is run with.
    3 allowPrivilegeEscalation determines whether a process in the container can gain more privileges than its parent process. If unspecified, it defaults to true. This boolean directly controls whether the no_new_privs flag is set on the container process.
    4 capabilities permit privileged actions without giving full root access. This setting drops all capabilities from the container.
    5 runAsNonRoot: true requires that the container runs as a user with any UID other than 0.
    6 RuntimeDefault enables the default seccomp profile for a pod or container workload.
  2. Create the pod by running the following command:

    $ oc apply -f sysctl_pod.yaml
  3. Verify that the pod is created by running the following command:

    $ oc get pod
    Example output
    NAME              READY   STATUS            RESTARTS   AGE
    sysctl-example    1/1     Running           0          14s
  4. Log in to the pod by running the following command:

    $ oc rsh sysctl-example
  5. Verify the values of the configured sysctl flags. For example, find the value of kernel.shm_rmid_forced by running the following command:

    sh-4.4# sysctl kernel.shm_rmid_forced
    Expected output
    kernel.shm_rmid_forced = 1

Starting a pod