Sysctl settings are exposed through Kubernetes, allowing users to modify certain kernel parameters at runtime for namespaces within a container. Only sysctls that are namespaced can be set independently on pods. A sysctl that is not namespaced is called node-level and cannot be set through a pod in OpenShift Container Platform. Moreover, only those sysctls considered safe are whitelisted by default; a cluster administrator can manually enable other, unsafe sysctls on a node to make them available to users.
In Linux, the sysctl interface allows an administrator to modify kernel parameters at runtime. Parameters are available via the /proc/sys/ virtual process file system. The parameters cover various subsystems, such as:
kernel (common prefix: kernel.)
networking (common prefix: net.)
virtual memory (common prefix: vm.)
MDADM (common prefix: dev.)
More subsystems are described in Kernel documentation. To get a list of all parameters, run:
$ sudo sysctl -a
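Each dotted sysctl name corresponds to a file under /proc/sys/, with the dots replaced by slashes. The following is a minimal sketch of that mapping; the helper name sysctl_path is hypothetical, and the sketch assumes that no component of the name itself contains a dot.

```shell
# Convert a dotted sysctl name to its path under /proc/sys.
# Assumption: no component contains a dot of its own (an interface name
# such as eth0.100 would need special handling).
sysctl_path() {
    printf '/proc/sys/%s\n' "$(printf '%s' "$1" | tr '.' '/')"
}

# Print the path for one parameter, then its current value if readable.
path="$(sysctl_path kernel.shm_rmid_forced)"
echo "$path"
if [ -r "$path" ]; then
    cat "$path"
fi
```

Reading the file directly and running sysctl with the dotted name should report the same value.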
A number of sysctls are namespaced in the Linux kernel. This means that you can set them independently for each pod on a node. Being namespaced is a requirement for sysctls to be accessible in a pod context within Kubernetes.
The following sysctls are known to be namespaced:
kernel.shm*
kernel.msg*
kernel.sem
fs.mqueue.*
Additionally, most of the sysctls in the net.* group are known to be namespaced. Their namespace adoption differs based on the kernel version and distributor.
Sysctls that are not namespaced are called node-level and must be set manually by the cluster administrator, either by means of the underlying Linux distribution of the nodes, such as by modifying the /etc/sysctl.conf file, or by using a daemon set with privileged containers.
Consider marking nodes with special sysctls as tainted, and schedule onto them only the pods that need those sysctl settings. Use the taints and toleration feature to mark the nodes.
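As a rough sketch, a node can be tainted with oc adm taint. The helper name taint_sysctl_node and the taint key special-sysctls=true are hypothetical and chosen only for illustration; pods that should run on the node then need a matching toleration in their spec.

```shell
# CLI used to reach the cluster. Set OC=echo to print the command
# instead of running it (dry run).
OC="${OC:-oc}"

# Taint a node so that only pods tolerating the taint are scheduled on it.
# The key "special-sysctls" and value "true" are hypothetical placeholders.
taint_sysctl_node() {
    node="$1"
    "$OC" adm taint nodes "$node" special-sysctls=true:NoSchedule
}

# Usage: taint_sysctl_node <node-name>
```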
Sysctls are grouped into safe and unsafe sysctls.
For a sysctl to be considered safe, it must use proper namespacing and must be properly isolated between pods on the same node. This means that if you set a sysctl for one pod it must not:
Influence any other pod on the node
Harm the node’s health
Gain CPU or memory resources outside of the resource limits of a pod
OpenShift Container Platform supports, or whitelists, the following sysctls in the safe set:
kernel.shm_rmid_forced
net.ipv4.ip_local_port_range
net.ipv4.tcp_syncookies
All safe sysctls are enabled by default. You can use a sysctl in a pod by modifying the Pod spec.
Any sysctl not whitelisted by OpenShift Container Platform is considered unsafe for OpenShift Container Platform. Note that being namespaced alone is not sufficient for the sysctl to be considered safe.
All unsafe sysctls are disabled by default, and the cluster administrator must manually enable them on a per-node basis. Pods with disabled unsafe sysctls are scheduled but do not launch.
$ oc get pod
NAME        READY   STATUS            RESTARTS   AGE
hello-pod   0/1     SysctlForbidden   0          14s
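To see in advance whether a sysctl needs to be enabled on the node, you can compare its name with the safe set. The following sketch hard-codes the three safe sysctls listed in this section; the function name is_safe_sysctl is hypothetical, and the list may differ between releases.

```shell
# Return 0 if the name is in the safe set listed in this section,
# 1 otherwise. The list is copied from this document, not queried
# from a cluster.
is_safe_sysctl() {
    case "$1" in
        kernel.shm_rmid_forced|net.ipv4.ip_local_port_range|net.ipv4.tcp_syncookies)
            return 0 ;;
        *)
            return 1 ;;
    esac
}

for candidate in kernel.shm_rmid_forced kernel.msgmax; do
    if is_safe_sysctl "$candidate"; then
        echo "$candidate: safe"
    else
        echo "$candidate: unsafe, must be allowed on the node"
    fi
done
```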
You can set sysctls on pods using the pod’s securityContext. The securityContext applies to all containers in the same pod.
Safe sysctls are allowed by default. A pod with unsafe sysctls fails to launch on any node unless the cluster administrator explicitly enables unsafe sysctls for that node. As with node-level sysctls, use the taints and toleration feature or labels on nodes to schedule those pods onto the right nodes.
The following example uses the pod securityContext to set a safe sysctl, kernel.shm_rmid_forced, and two unsafe sysctls, net.ipv4.route.min_pmtu and kernel.msgmax. There is no distinction between safe and unsafe sysctls in the specification.
To avoid destabilizing your operating system, modify sysctl parameters only after you understand their effects.
To use safe and unsafe sysctls:
Modify the YAML file that defines the pod and add the securityContext spec, as shown in the following example:
apiVersion: v1
kind: Pod
metadata:
  name: sysctl-example
spec:
  securityContext:
    sysctls:
    - name: kernel.shm_rmid_forced
      value: "0"
    - name: net.ipv4.route.min_pmtu
      value: "552"
    - name: kernel.msgmax
      value: "65536"
...
Create the pod:
$ oc apply -f <file-name>.yaml
If the unsafe sysctls are not allowed for the node, the pod is scheduled, but does not deploy:
$ oc get pod
NAME        READY   STATUS            RESTARTS   AGE
hello-pod   0/1     SysctlForbidden   0          14s
A cluster administrator can allow certain unsafe sysctls for very special situations such as high-performance or real-time application tuning.
If you want to use unsafe sysctls, a cluster administrator must enable them individually for a specific type of node. The sysctls must be namespaced.
Because they are unsafe, use these sysctls at your own risk. They can lead to severe problems, such as improper behavior of containers, resource shortage, or breaking a node.
Add a label to the machine config pool where the containers with the unsafe sysctls will run:
$ oc edit machineconfigpool worker
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
  creationTimestamp: 2019-02-08T14:52:39Z
  generation: 1
  labels:
    custom-kubelet: sysctl (1)
(1) Add a key: value pair as a label.
Create a KubeletConfig custom resource (CR):
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
  name: custom-kubelet
spec:
  machineConfigPoolSelector:
    matchLabels:
      custom-kubelet: sysctl (1)
  kubeletConfig:
    allowedUnsafeSysctls: (2)
      - "kernel.msg*"
      - "net.ipv4.route.min_pmtu"
(1) Specify the label from the machine config pool.
(2) List the unsafe sysctls you want to allow.
Create the object:
$ oc apply -f set-sysctl-worker.yaml
A new MachineConfig object named in the 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet format is created.
Wait for the cluster to reboot. Use the machineconfigpool object status fields to track progress. For example:
status:
  conditions:
    - lastTransitionTime: '2019-08-11T15:32:00Z'
      message: >-
        All nodes are updating to
        rendered-worker-ccbfb5d2838d65013ab36300b7b3dc13
      reason: ''
      status: 'True'
      type: Updating
A message similar to the following appears when the cluster is ready:
    - lastTransitionTime: '2019-08-11T16:00:00Z'
      message: >-
        All nodes are updated with
        rendered-worker-ccbfb5d2838d65013ab36300b7b3dc13
      reason: ''
      status: 'True'
      type: Updated
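Instead of polling the status fields by hand, you can block until the Updated condition is reported. The following sketch wraps oc wait; the helper name wait_for_pool and the default timeout of 30m are assumptions made for illustration. The pool might still report Updated from a previous rollout for a short time after you create the KubeletConfig object, so treat an immediate return with caution.

```shell
# CLI used to reach the cluster. Set OC=echo to print the command
# instead of running it (dry run).
OC="${OC:-oc}"

# Block until the given machine config pool reports condition Updated,
# or until the timeout expires. The 30m default is an assumption.
wait_for_pool() {
    pool="$1"
    timeout="${2:-30m}"
    "$OC" wait "machineconfigpool/$pool" --for=condition=Updated --timeout="$timeout"
}

# Usage: wait_for_pool worker
```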
When the cluster is ready, check for the merged KubeletConfig object in the new MachineConfig object:
$ oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
        "ownerReferences": [
            {
                "apiVersion": "machineconfiguration.openshift.io/v1",
                "blockOwnerDeletion": true,
                "controller": true,
                "kind": "KubeletConfig",
                "name": "custom-kubelet",
                "uid": "3f64a766-bae8-11e9-abe8-0a1a2a4813f2"
You can now add unsafe sysctls to pods as needed.
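Before adding an unsafe sysctl to a pod, it can help to check the name against the allowedUnsafeSysctls entries configured for the node. The following is a minimal sketch, assuming that an entry ending in * matches every name that begins with the text before the *, and that any other entry must match exactly. The function name sysctl_allowed is hypothetical.

```shell
# Return 0 if the sysctl name ($1) matches any of the remaining arguments.
# Assumption: a pattern ending in "*" is a prefix match; any other
# pattern must equal the name exactly.
sysctl_allowed() {
    name="$1"
    shift
    for pattern in "$@"; do
        case "$pattern" in
            *\*)
                # Strip the trailing "*" and compare as a literal prefix.
                prefix="${pattern%\*}"
                case "$name" in
                    "$prefix"*) return 0 ;;
                esac
                ;;
            *)
                if [ "$name" = "$pattern" ]; then
                    return 0
                fi
                ;;
        esac
    done
    return 1
}

# Check a few names against the entries from the KubeletConfig example.
for candidate in kernel.msgmax net.ipv4.route.min_pmtu kernel.shmmax; do
    if sysctl_allowed "$candidate" "kernel.msg*" "net.ipv4.route.min_pmtu"; then
        echo "$candidate: allowed"
    else
        echo "$candidate: not allowed"
    fi
done
```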