Using node disruption policies to minimize disruption from machine config changes | Machine configuration

Example node disruption policies
Configuring node restart behaviors upon machine config changes

By default, when you make certain changes to the fields in a MachineConfig object, the Machine Config Operator (MCO) drains and reboots the nodes associated with that machine config. However, you can create a node disruption policy that defines a set of changes to some Ignition config objects that would require little or no disruption to your workloads.

A node disruption policy allows you to define the configuration changes that cause a disruption to your cluster, and which changes do not. This allows you to reduce node downtime when making small machine configuration changes in your cluster. To configure the policy, you modify the MachineConfiguration object, which is in the openshift-machine-config-operator namespace. See the example node disruption policies in the MachineConfiguration objects that follow.

There are machine configuration changes that always require a reboot, regardless of any node disruption policies. For more information, see About the Machine Config Operator.

After you create the node disruption policy, the MCO validates the policy to search for potential issues in the file, such as problems with formatting. The MCO then merges the policy with the cluster defaults and populates the status.nodeDisruptionPolicyStatus fields in the machine config with the actions to be performed upon future changes to the machine config. The configurations in your policy always overwrite the cluster defaults.

The MCO does not validate whether a change can be successfully applied by your node disruption policy. Therefore, you are responsible to ensure the accuracy of your node disruption policies.

For example, you can configure a node disruption policy so that sudo configurations do not require a node drain and reboot. Or, you can configure your cluster so that updates to sshd are applied with only a reload of that one service.

The node disruption policy feature is a Technology Preview feature only. Technology Preview features are not supported with Red Hat production service level agreements (SLAs) and might not be functionally complete. Red Hat does not recommend using them in production. These features provide early access to upcoming product features, enabling customers to test functionality and provide feedback during the development process.

For more information about the support scope of Red Hat Technology Preview features, see Technology Preview Features Support Scope.

You can control the behavior of the MCO when making the changes to the following Ignition configuration objects:

configuration files: You add to or update the files in the /var or /etc directory.
systemd units: You create and set the status of a systemd service or modify an existing systemd service.
users and groups: You change SSH keys in the passwd section postinstallation.
ICSP, ITMS, IDMS objects: You can remove mirroring rules from an ImageContentSourcePolicy (ICSP), ImageTagMirrorSet (ITMS), and ImageDigestMirrorSet (IDMS) object.

When you make any of these changes, the node disruption policy determines which of the following actions are required when the MCO implements the changes:

Reboot: The MCO drains and reboots the nodes. This is the default behavior.
None: The MCO does not drain or reboot the nodes. The MCO applies the changes with no further action.
Drain: The MCO cordons and drains the nodes of their workloads. The workloads restart with the new configurations.
Reload: For services, the MCO reloads the specified services without restarting the service.
Restart: For services, the MCO fully restarts the specified services.
DaemonReload: The MCO reloads the systemd manager configuration.
Special: This is an internal MCO-only action and cannot be set by the user.

The Reboot and None actions cannot be used with any other actions, as the Reboot and None actions override the others.
Actions are applied in the order that they are set in the node disruption policy list.
If you make other machine config changes that do require a reboot or other disruption to the nodes, that reboot supercedes the node disruption policy actions.

Example node disruption policies

The following example MachineConfiguration objects contain a node disruption policy.

A MachineConfiguration object and a MachineConfig object are different objects. A MachineConfiguration object is a singleton object in the MCO namespace that contains configuration parameters for the MCO operator. A MachineConfig object defines changes that are applied to a machine config pool.

The following example MachineConfiguration object shows no user defined policies. The default node disruption policy values are shown in the status stanza.

Default node disruption policy

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
  name: cluster
spec:
  logLevel: Normal
  managementState: Managed
  operatorLogLevel: Normal
status:
  nodeDisruptionPolicyStatus:
    clusterPolicies:
      files:
      - actions:
        - type: None
        path: /etc/mco/internal-registry-pull-secret.json
      - actions:
        - type: None
        path: /var/lib/kubelet/config.json
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/machine-config-daemon/no-reboot/containers-gpg.pub
      - actions:
        - reload:
            serviceName: crio.service
          type: Reload
        path: /etc/containers/policy.json
      - actions:
        - type: Special
        path: /etc/containers/registries.conf
      sshkey:
        actions:
        - type: None
  readyReplicas: 0

In the following example, when changes are made to the SSH keys, the MCO drains the cluster nodes, reloads the crio.service, reloads the systemd configuration, and restarts the crio-service.

Example node disruption policy for an SSH key change

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
  namespace: openshift-machine-config-operator
# ...
spec:
  nodeDisruptionPolicy:
    sshkey:
      actions:
      - type: Drain
      - reload:
          serviceName: crio.service
        type: Reload
      - type: DaemonReload
      - restart:
          serviceName: crio.service
        type: Restart
# ...

In the following example, when changes are made to the /etc/chrony.conf file, the MCO reloads the chronyd.service on the cluster nodes.

Example node disruption policy for a configuration file change

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
  namespace: openshift-machine-config-operator
# ...
spec:
  nodeDisruptionPolicy:
    files:
    - actions:
      - reload:
          serviceName: chronyd.service
        type: Reload
      path: /etc/chrony.conf

In the following example, when changes are made to the auditd.service systemd unit, the MCO drains the cluster nodes, reloads the crio.service, reloads the systemd manager configuration, and restarts the crio.service.

Example node disruption policy for a systemd unit change

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
  namespace: openshift-machine-config-operator
# ...
spec:
  nodeDisruptionPolicy:
    units:
      - name: auditd.service
        actions:
          - type: Drain
          - type: Reload
            reload:
              serviceName: crio.service
          - type: DaemonReload
          - type: Restart
            restart:
              serviceName: crio.service

In the following example, when changes are made to the registries.conf file, such as by editing an ImageContentSourcePolicy (ICSP) object, the MCO does not drain or reboot the nodes and applies the changes with no further action.

Example node disruption policy for a registries.conf file change

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
  namespace: openshift-machine-config-operator
# ...
spec:
  nodeDisruptionPolicy:
    files:
      - actions:
        - type: None
        path: /etc/containers/registries.conf

Configuring node restart behaviors upon machine config changes

You can create a node disruption policy to define the machine configuration changes that cause a disruption to your cluster, and which changes do not.

You can control how your nodes respond to changes in the files in the /var or /etc directory, the systemd units, the SSH keys, and the registries.conf file.

When you make any of these changes, the node disruption policy determines which of the following actions are required when the MCO implements the changes:

Reboot: The MCO drains and reboots the nodes. This is the default behavior.
None: The MCO does not drain or reboot the nodes. The MCO applies the changes with no further action.
Drain: The MCO cordons and drains the nodes of their workloads. The workloads restart with the new configurations.
Reload: For services, the MCO reloads the specified services without restarting the service.
Restart: For services, the MCO fully restarts the specified services.
DaemonReload: The MCO reloads the systemd manager configuration.
Special: This is an internal MCO-only action and cannot be set by the user.

The Reboot and None actions cannot be used with any other actions, as the Reboot and None actions override the others.
Actions are applied in the order that they are set in the node disruption policy list.
If you make other machine config changes that do require a reboot or other disruption to the nodes, that reboot supercedes the node disruption policy actions.

Prerequisites

You have enabled the TechPreviewNoUpgrade feature set by using the feature gates. For more information, see "Enabling features using feature gates".

Enabling the TechPreviewNoUpgrade feature set on your cluster prevents minor version updates. The TechPreviewNoUpgrade feature set cannot be disabled. Do not enable this feature set on production clusters.

Procedure

Edit the machineconfigurations.operator.openshift.io object to define the node disruption policy:
```
$ oc edit MachineConfiguration cluster -n openshift-machine-config-operator
```

Add a node disruption policy similar to the following:

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  name: cluster
# ...
spec:
  nodeDisruptionPolicy: (1)
    files: (2)
    - actions: (3)
      - reload: (4)
          serviceName: chronyd.service (5)
        type: Reload
      path: /etc/chrony.conf (6)
    sshkey: (7)
      actions:
      - type: Drain
      - reload:
          serviceName: crio.service
        type: Reload
      - type: DaemonReload
      - restart:
          serviceName: crio.service
        type: Restart
    units: (8)
    - actions:
      - type: Drain
      - reload:
          serviceName: crio.service
        type: Reload
      - type: DaemonReload
      - restart:
          serviceName: crio.service
        type: Restart
      name: test.service

1	Specifies the node disruption policy.
2	Specifies a list of machine config file definitions and actions to take to changes on those paths. This list supports a maximum of 50 entries.
3	Specifies the series of actions to be executed upon changes to the specified files. Actions are applied in the order that they are set in this list. This list supports a maximum of 10 entries.
4	Specifies that the listed service is to be reloaded upon changes to the specified files.
5	Specifies the full name of the service to be acted upon.
6	Specifies the location of a file that is managed by a machine config. The actions in the policy apply when changes are made to the file in `path`.
7	Specifies a list of service names and actions to take upon changes to the SSH keys in the cluster.
8	Specifies a list of systemd unit names and actions to take upon changes to those units.

Verification

View the MachineConfiguration object file that you created:

$ oc get MachineConfiguration/cluster -o yaml

Example output

apiVersion: operator.openshift.io/v1
kind: MachineConfiguration
metadata:
  labels:
    machineconfiguration.openshift.io/role: worker
  name: cluster
# ...
status:
  nodeDisruptionPolicyStatus: (1)
    clusterPolicies:
      files:
# ...
      - actions:
        - reload:
            serviceName: chronyd.service
          type: Reload
        path: /etc/chrony.conf
      sshkey:
        actions:
        - type: Drain
        - reload:
            serviceName: crio.service
          type: Reload
        - type: DaemonReload
        - restart:
            serviceName: crio.service
          type: Restart
      units:
      - actions:
        - type: Drain
        - reload:
            serviceName: crio.service
          type: Reload
        - type: DaemonReload
        - restart:
            serviceName: crio.service
          type: Restart
        name: test.se
# ...

1	Specifies the current cluster-validated policies.