Operators are a method of packaging, deploying, and managing an OpenShift Container Platform application. They act like an extension of the software vendor’s engineering team, watching over an OpenShift Container Platform environment and using its current state to make decisions in real time. Operators are designed to handle upgrades seamlessly, react to failures automatically, and not take shortcuts, such as skipping a software backup process to save time.
OpenShift Container Platform 4.6 includes a default set of Operators that are required for proper functioning of the cluster. These default Operators are managed by the Cluster Version Operator (CVO).
As a cluster administrator, you can install application Operators from the OperatorHub using the OpenShift Container Platform web console or the CLI. You can then subscribe the Operator to one or more namespaces to make it available for developers on your cluster. Application Operators are managed by Operator Lifecycle Manager (OLM).
If you experience Operator issues, verify Operator subscription status. Check Operator pod health across the cluster and gather Operator logs for diagnosis.
Subscriptions can report the following condition types:
Condition | Description |
---|---|
CatalogSourcesUnhealthy | Some or all of the catalog sources to be used in resolution are unhealthy. |
InstallPlanMissing | An install plan for a subscription is missing. |
InstallPlanPending | An install plan for a subscription is pending installation. |
InstallPlanFailed | An install plan for a subscription has failed. |
Default OpenShift Container Platform cluster Operators are managed by the Cluster Version Operator (CVO) and they do not have a Subscription object.
You can view Operator subscription status using the CLI.
You have access to the cluster as a user with the cluster-admin role.
You have installed the OpenShift CLI (oc).
List Operator subscriptions:
$ oc get subs -n <operator_namespace>
Use the oc describe command to inspect a Subscription resource:
$ oc describe sub <subscription_name> -n <operator_namespace>
In the command output, find the Conditions section for the status of Operator subscription condition types. In the following example, the CatalogSourcesUnhealthy condition type has a status of False because all available catalog sources are healthy:
Conditions:
  Last Transition Time:  2019-07-29T13:42:57Z
  Message:               all available catalogsources are healthy
  Reason:                AllCatalogSourcesHealthy
  Status:                False
  Type:                  CatalogSourcesUnhealthy
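When a subscription reports several conditions, the Conditions section can be condensed to one line per condition type. The following is a minimal sketch rather than a product feature: the sub_conditions function name is hypothetical, and the sketch assumes that each condition lists its Status field before its Type field, as in the preceding example output.

```shell
# Hypothetical helper (not an oc feature): condense the Conditions section of
# `oc describe sub` output, read from stdin, to "<Type>=<Status>" lines.
# Assumes Status precedes Type within each condition, as in the example above.
sub_conditions() {
  awk '
    $1 == "Conditions:"              { in_conditions = 1; next }
    in_conditions && $1 == "Status:" { status = $2 }
    in_conditions && $1 == "Type:"   { print $2 "=" status; status = "" }
  '
}

# Usage (requires cluster access):
#   oc describe sub <subscription_name> -n <operator_namespace> | sub_conditions
```

For the example output above, the function prints CatalogSourcesUnhealthy=False.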
You can list Operator pods within a cluster and their status. You can also collect a detailed Operator pod summary.
You have access to the cluster as a user with the cluster-admin role.
Your API service is still functional.
You have installed the OpenShift CLI (oc).
List Operators running in the cluster. The output includes Operator version, availability, and uptime information:
$ oc get clusteroperators
List Operator pods running in the Operator’s namespace, plus pod status, restarts, and age:
$ oc get pod -n <operator_namespace>
Output a detailed Operator pod summary:
$ oc describe pod <operator_pod_name> -n <operator_namespace>
If an Operator issue is node-specific, query Operator container status on that node.
Start a debug pod for the node:
$ oc debug node/my-node
Set /host as the root directory within the debug shell. The debug pod mounts the host’s root file system in /host within the pod. By changing the root directory to /host, you can run binaries contained in the host’s executable paths:
# chroot /host
OpenShift Container Platform 4.6 cluster nodes running Red Hat Enterprise Linux CoreOS (RHCOS) are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes using SSH is not recommended and nodes will be tainted as accessed. However, if the OpenShift Container Platform API is not available, or the kubelet is not properly functioning on the target node, |
List details about the node’s containers, including state and associated pod IDs:
# crictl ps
List information about a specific Operator container on the node. The following example lists information about the network-operator container:
# crictl ps --name network-operator
Exit from the debug shell.
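In a namespace with many pods, the pod listing from the earlier step can be narrowed to the pods that need attention. The following is a minimal sketch under stated assumptions: the unhealthy_pods function name is hypothetical, and the sketch assumes the default oc get pod column order of NAME, READY, STATUS, RESTARTS, and AGE.

```shell
# Hypothetical helper: read `oc get pod` output on stdin and print the name and
# status of every pod whose STATUS column is neither Running nor Completed.
# Assumes the default column order: NAME READY STATUS RESTARTS AGE.
unhealthy_pods() {
  awk 'NR > 1 && $3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (requires cluster access):
#   oc get pod -n <operator_namespace> | unhealthy_pods
```

A pod in the Running state can still contain containers that are not ready, so an empty result does not rule out a problem. Check the READY column and the oc describe pod output as well.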
If you experience Operator issues, you can gather detailed diagnostic information from Operator pod logs.
You have access to the cluster as a user with the cluster-admin role.
Your API service is still functional.
You have installed the OpenShift CLI (oc).
You have the fully qualified domain names of the control plane, or master, machines.
List the Operator pods that are running in the Operator’s namespace, plus the pod status, restarts, and age:
$ oc get pods -n <operator_namespace>
Review logs for an Operator pod:
$ oc logs pod/<pod_name> -n <operator_namespace>
If an Operator pod has multiple containers, the preceding command will produce an error that includes the name of each container. Query logs from an individual container:
$ oc logs pod/<operator_pod_name> -c <container_name> -n <operator_namespace>
If the API is not functional, review Operator pod and container logs on each master node by using SSH instead. Replace <master-node>.<cluster_name>.<base_domain> with appropriate values.
List pods on each master node:
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl pods
For any Operator pods not showing a Ready status, inspect the pod’s status in detail. Replace <operator_pod_id> with the Operator pod’s ID listed in the output of the preceding command:
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl inspectp <operator_pod_id>
List containers related to an Operator pod:
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl ps --pod=<operator_pod_id>
For any Operator container not showing a Ready status, inspect the container’s status in detail. Replace <container_id> with a container ID listed in the output of the preceding command:
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl inspect <container_id>
Review the logs for any Operator containers not showing a Ready status. Replace <container_id> with a container ID listed in the output of the preceding command:
$ ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl logs -f <container_id>
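To find the pods that need inspection in the SSH steps above, the crictl pods output can be filtered for pods that are not ready. The following is a minimal sketch: the not_ready_pod_ids function name is hypothetical, and the sketch assumes that crictl pods reports the pod state as either Ready or NotReady in a STATE column.

```shell
# Hypothetical helper: read `crictl pods` output on stdin and print the ID of
# every pod sandbox whose STATE column is NotReady.
# The CREATED column contains spaces (for example "2 hours ago"), so the state
# is matched as a whole field anywhere after the ID rather than by position.
not_ready_pod_ids() {
  awk 'NR > 1 {
    for (i = 2; i <= NF; i++) {
      if ($i == "NotReady") { print $1; break }
    }
  }'
}

# Usage, with the same placeholders as the steps above:
#   ssh core@<master-node>.<cluster_name>.<base_domain> sudo crictl pods | not_ready_pod_ids
```

Each ID that the function prints can be passed to crictl inspectp or crictl ps --pod as shown in the procedure.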
OpenShift Container Platform 4.6 cluster nodes running Red Hat Enterprise Linux CoreOS (RHCOS) are immutable and rely on Operators to apply cluster changes. Accessing cluster nodes using SSH is not recommended and nodes will be tainted as accessed. Before attempting to collect diagnostic data over SSH, review whether the data collected by running |
When configuration changes are made by the Machine Config Operator, Red Hat Enterprise Linux CoreOS (RHCOS) must reboot for the changes to take effect. Whether the configuration change is automatic, such as when a kube-apiserver-to-kubelet-signer CA is rotated, or manual, such as when a registry or SSH key is updated, an RHCOS node reboots automatically unless it is paused.
To avoid unwanted disruptions, you can modify the machine config pool to prevent automatic rebooting after the Operator makes changes to the machine config.
Pausing a machine config pool stops all system reboot processes and all configuration changes from being applied.
You have access to the cluster as a user with the cluster-admin role.
You have installed the OpenShift CLI (oc).
You have root access in OpenShift Container Platform.
To pause the autoreboot process after machine config changes are applied:
As root, update the spec.paused field to true in the MachineConfigPool custom resource.
# oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/master
# oc patch --type=merge --patch='{"spec":{"paused":true}}' machineconfigpool/worker
To verify that the machine config pool is paused:
# oc get machineconfigpool/master --template='{{.spec.paused}}'
# oc get machineconfigpool/worker --template='{{.spec.paused}}'
The spec.paused field is true and the machine config pool is paused.
Alternatively, to unpause the autoreboot process:
As root, update the spec.paused field to false in the MachineConfigPool custom resource.
# oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/master
# oc patch --type=merge --patch='{"spec":{"paused":false}}' machineconfigpool/worker
By unpausing a machine config pool, all paused changes are applied at reboot.
To verify that the machine config pool is unpaused:
# oc get machineconfigpool/master --template='{{.spec.paused}}'
# oc get machineconfigpool/worker --template='{{.spec.paused}}'
The spec.paused field is false and the machine config pool is unpaused.
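The pause and unpause commands above differ only in the value of spec.paused and the pool name, so they can be wrapped in one function. The following is a minimal sketch: the set_mcp_paused function name is hypothetical, and the function only repeats the oc patch invocation from the procedure for each pool that you name.

```shell
# Hypothetical wrapper around the oc patch commands shown above.
# Usage: set_mcp_paused <true|false> <pool> [<pool>...]
set_mcp_paused() {
  paused=${1:-}
  case "$paused" in
    true|false) ;;
    *) echo "usage: set_mcp_paused <true|false> <pool>..." >&2; return 1 ;;
  esac
  shift
  for pool in "$@"; do
    # Same merge patch as in the procedure, applied to one pool at a time.
    oc patch --type=merge --patch="{\"spec\":{\"paused\":$paused}}" \
      "machineconfigpool/$pool" || return 1
  done
}

# Usage (requires cluster access):
#   set_mcp_paused true master worker    # pause both pools
#   set_mcp_paused false master worker   # unpause both pools
```

The function rejects any value other than true or false before it calls oc, so a typing mistake does not reach the cluster.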
To see if the machine config pool has pending changes:
# oc get machineconfigpool
NAME     CONFIG                                   UPDATED   UPDATING
master   rendered-master-546383f80705bd5aeaba93   True      False
worker   rendered-worker-b4c51bb33ccaae6fc4a6a5   True      False
When UPDATED is True and UPDATING is False, there are no pending changes. When UPDATED is False and UPDATING is True, there are pending changes.
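The check on the UPDATED and UPDATING columns can be scripted. The following is a minimal sketch: the pools_with_pending_changes function name is hypothetical, and the sketch assumes that NAME, CONFIG, UPDATED, and UPDATING are the first four columns of the oc get machineconfigpool output, as in the example above.

```shell
# Hypothetical helper: read `oc get machineconfigpool` output on stdin and
# print each pool that is not fully updated (UPDATED is not True, or UPDATING
# is not False). Assumes NAME, CONFIG, UPDATED, UPDATING are the first columns.
pools_with_pending_changes() {
  awk 'NR > 1 && ($3 != "True" || $4 != "False") { print $1 }'
}

# Usage (requires cluster access):
#   oc get machineconfigpool | pools_with_pending_changes
```

For the example output above, the function prints nothing because both pools are up to date.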
It is recommended to schedule a maintenance window for a reboot as early as possible by setting |
In Operator Lifecycle Manager (OLM), if you subscribe to an Operator that references images that are not accessible on your network, you can find jobs in the openshift-marketplace namespace that are failing with the following errors:
ImagePullBackOff for
Back-off pulling image "example.com/openshift4/ose-elasticsearch-operator-bundle@sha256:6d2587129c846ec28d384540322b40b05833e7e00b25cca584e004af9a1d292e"
rpc error: code = Unknown desc = error pinging docker registry example.com: Get "https://example.com/v2/": dial tcp: lookup example.com on 10.0.0.1:53: no such host
As a result, the subscription is stuck in this failing state and the Operator is unable to install or upgrade.
You can refresh a failing subscription by deleting the subscription, cluster service version (CSV), and other related objects. After you recreate the subscription, OLM reinstalls the correct version of the Operator.
You have a failing subscription that is unable to pull an inaccessible bundle image.
You have confirmed that the correct bundle image is accessible.
Get the names of the Subscription and ClusterServiceVersion objects from the namespace where the Operator is installed:
$ oc get sub,csv -n <namespace>
NAME PACKAGE SOURCE CHANNEL
subscription.operators.coreos.com/elasticsearch-operator elasticsearch-operator redhat-operators 5.0
NAME DISPLAY VERSION REPLACES PHASE
clusterserviceversion.operators.coreos.com/elasticsearch-operator.5.0.0-65 OpenShift Elasticsearch Operator 5.0.0-65 Succeeded
Delete the subscription:
$ oc delete subscription <subscription_name> -n <namespace>
Delete the cluster service version:
$ oc delete csv <csv_name> -n <namespace>
Get the names of any failing jobs and related config maps in the openshift-marketplace namespace:
$ oc get job,configmap -n openshift-marketplace
NAME COMPLETIONS DURATION AGE
job.batch/1de9443b6324e629ddf31fed0a853a121275806170e34c926d69e53a7fcbccb 1/1 26s 9m30s
NAME DATA AGE
configmap/1de9443b6324e629ddf31fed0a853a121275806170e34c926d69e53a7fcbccb 3 9m30s
Delete the job:
$ oc delete job <job_name> -n openshift-marketplace
This ensures pods that try to pull the inaccessible image are not recreated.
Delete the config map:
$ oc delete configmap <configmap_name> -n openshift-marketplace
Reinstall the Operator using OperatorHub in the web console.
Check that the Operator has been reinstalled successfully:
$ oc get sub,csv,installplan -n <namespace>
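When the openshift-marketplace namespace contains many jobs, the listing from the earlier step can be filtered to the jobs that have not completed. The following is a minimal sketch: the incomplete_jobs function name is hypothetical, and the sketch assumes that job rows start with job.batch/ and that the second column is COMPLETIONS in the form completed/desired, as in the example output above.

```shell
# Hypothetical helper: read `oc get job,configmap` output on stdin and print
# each job whose COMPLETIONS column (for example 0/1) shows fewer completed
# runs than desired. Config map rows and header rows are ignored.
incomplete_jobs() {
  awk '$1 ~ /^job\.batch\// {
    split($2, c, "/")
    if (c[1] != c[2]) print $1
  }'
}

# Usage (requires cluster access):
#   oc get job,configmap -n openshift-marketplace | incomplete_jobs
```

In the example output above, the job and its config map share the same name, so a job name that the function prints also identifies the config map to review. Confirm each name before you delete anything.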