There might be situations when the on-disk state of a node differs from what is configured in the machine config. This is known as configuration drift. For example, a cluster admin might manually modify a file, a systemd unit file, or a file permission that was configured through a machine config. This causes configuration drift. Configuration drift can cause problems between nodes in a Machine Config Pool or when the machine configs are updated.
The Machine Config Operator (MCO) uses the Machine Config Daemon (MCD) to check nodes for configuration drift on a regular basis. If detected, the MCO sets the node and the machine config pool (MCP) to Degraded
and reports the error. A degraded node is online and operational, but, it cannot be updated.
The MCD performs configuration drift detection upon each of the following conditions:
When performing configuration drift detection, the MCD validates that the file contents and permissions fully match what the currently-applied machine config specifies. Typically, the MCD detects configuration drift in less than a second after the detection is triggered.
If the MCD detects configuration drift, the MCD performs the following tasks:
-
Emits an error to the console logs
-
Emits a Kubernetes event
-
Stops further detection on the node
-
Sets the node and MCP to degraded
You can check if you have a degraded node by listing the MCPs:
If you have a degraded MCP, the DEGRADEDMACHINECOUNT
field is non-zero, similar to the following output:
Example output
NAME CONFIG UPDATED UPDATING DEGRADED MACHINECOUNT READYMACHINECOUNT UPDATEDMACHINECOUNT DEGRADEDMACHINECOUNT AGE
worker rendered-worker-404caf3180818d8ac1f50c32f14b57c3 False True True 2 1 1 1 5h51m
You can determine if the problem is caused by configuration drift by examining the machine config pool:
Example output
...
Last Transition Time: 2021-12-20T18:54:00Z
Message: Node ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4 is reporting: "content mismatch for file \"/etc/mco-test-file\"" (1)
Reason: 1 nodes are reporting degraded status on sync
Status: True
Type: NodeDegraded (2)
...
1 |
This message shows that a node’s /etc/mco-test-file file, which was added by the machine config, has changed outside of the machine config. |
2 |
The state of the node is NodeDegraded . |
Or, if you know which node is degraded, examine that node:
$ oc describe node/ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4
Example output
...
Annotations: cloud.network.openshift.io/egress-ipconfig: [{"interface":"nic0","ifaddr":{"ipv4":"10.0.128.0/17"},"capacity":{"ip":10}}]
csi.volume.kubernetes.io/nodeid:
{"pd.csi.storage.gke.io":"projects/openshift-gce-devel-ci/zones/us-central1-a/instances/ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4"}
machine.openshift.io/machine: openshift-machine-api/ci-ln-j4h8nkb-72292-pxqxz-worker-a-fjks4
machineconfiguration.openshift.io/controlPlaneTopology: HighlyAvailable
machineconfiguration.openshift.io/currentConfig: rendered-worker-67bd55d0b02b0f659aef33680693a9f9
machineconfiguration.openshift.io/desiredConfig: rendered-worker-67bd55d0b02b0f659aef33680693a9f9
machineconfiguration.openshift.io/reason: content mismatch for file "/etc/mco-test-file" (1)
machineconfiguration.openshift.io/state: Degraded (2)
...
1 |
The error message indicating that configuration drift was detected between the node and the listed machine config. Here the error message indicates that the contents of the /etc/mco-test-file , which was added by the machine config, has changed outside of the machine config. |
2 |
The state of the node is Degraded . |
You can correct configuration drift and return the node to the Ready
state by performing one of the following remediations:
-
Ensure that the contents and file permissions of the files on the node match what is configured in the machine config. You can manually rewrite the file
contents or change the file permissions.
-
Generate a force file on the degraded node. The force file causes the MCD to bypass the usual configuration drift detection and reapplies the current machine config.
|
Generating a force file on a node causes that node to reboot.
|