At least one primary shard and its replicas are not allocated to a node.
Check the Elasticsearch cluster health and verify that the cluster status
is red.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- health
List the nodes that have joined the cluster.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/nodes?v
List the Elasticsearch pods and compare them with the nodes in the command output from the previous step.
oc -n openshift-logging get pods -l component=elasticsearch
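The comparison in the previous two steps can be sketched as follows. The pod and node names below are hypothetical sample data; real values come from the `oc get pods` and `_cat/nodes` commands above. By default, an Elasticsearch node's name matches its pod name:

```shell
# Hypothetical sample data: pod names from `oc get pods` and node names
# from the `_cat/nodes` output.
pods="elasticsearch-cdm-1-abc
elasticsearch-cdm-2-def
elasticsearch-cdm-3-ghi"
nodes="elasticsearch-cdm-1-abc
elasticsearch-cdm-3-ghi"

# A pod with no matching Elasticsearch node name has not joined the cluster.
missing=$(echo "$pods" | while read -r p; do
  echo "$nodes" | grep -qx "$p" || echo "$p"
done)
echo "pods not in cluster: $missing"
```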
If some of the Elasticsearch nodes have not joined the cluster, perform the following steps.
Confirm that Elasticsearch has an elected control plane node.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/master?v
Review the pod logs of the elected control plane node for issues.
oc logs <elasticsearch_master_pod_name> -c elasticsearch -n openshift-logging
Review the logs of nodes that have not joined the cluster for issues.
oc logs <elasticsearch_node_name> -c elasticsearch -n openshift-logging
If all the nodes have joined the cluster, check whether the cluster is in the process of recovering.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/recovery?active_only=true
If there is no command output, the recovery process might be delayed or stalled by pending tasks.
Check if there are pending tasks.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- health | grep number_of_pending_tasks
If there are pending tasks, monitor their status.
If their status changes and indicates that the cluster is recovering, continue waiting. The recovery time varies according to the size of the cluster and other factors.
Otherwise, if the status of the pending tasks does not change, this indicates that the recovery has stalled.
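One way to monitor for a stall is to sample `number_of_pending_tasks` twice, a few minutes apart, and compare. The health output below is a hypothetical flattened fragment; real values come from the `health` command above:

```shell
# Hypothetical health output sampled twice, a few minutes apart.
before='{"status":"red","number_of_pending_tasks":7}'
after='{"status":"red","number_of_pending_tasks":7}'

extract() { echo "$1" | sed -n 's/.*"number_of_pending_tasks":\([0-9]*\).*/\1/p'; }
p1=$(extract "$before")
p2=$(extract "$after")

# An unchanged, nonzero pending-task count suggests that recovery has stalled.
if [ "$p1" -gt 0 ] && [ "$p1" -eq "$p2" ]; then
  echo "recovery appears stalled"
fi
```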
If it seems like the recovery has stalled, check if cluster.routing.allocation.enable is set to none.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/settings?pretty
If cluster.routing.allocation.enable is set to none, set it to all.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/settings?pretty -X PUT -d '{"persistent": {"cluster.routing.allocation.enable":"all"}}'
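Before issuing the PUT, you can extract the current value from the settings output. The JSON below is a hypothetical, flattened `_cluster/settings` fragment:

```shell
# Hypothetical `_cluster/settings?pretty` output, flattened for brevity.
settings='{"persistent":{"cluster":{"routing":{"allocation":{"enable":"none"}}}}}'

enable=$(echo "$settings" | sed -n 's/.*"enable" *: *"\([a-z_]*\)".*/\1/p')
if [ "$enable" = "none" ]; then
  # Shard allocation is disabled; re-enable it with the PUT shown above.
  echo "shard allocation is disabled"
fi
```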
Check which indices are still red.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/indices?v
If any indices are still red, try to clear them by performing the following steps.
Clear the cache.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_cache/clear?pretty
Increase the max allocation retries.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty -X PUT -d '{"index.allocation.max_retries":10}'
Delete all the scroll items.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_search/scroll/_all -X DELETE
Increase the timeout.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name>/_settings?pretty -X PUT -d '{"index.unassigned.node_left.delayed_timeout":"10m"}'
If the preceding steps do not clear the red indices, delete the indices individually.
Identify the red index name.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cat/indices?v
Delete the red index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_red_index_name> -X DELETE
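To apply the steps above to every red index, you can extract the red index names from the `_cat/indices` output. The listing below is a hypothetical sample:

```shell
# Hypothetical `_cat/indices?v` output.
indices="health status index        pri rep docs.count
green  open   app-000001   3   1   120000
red    open   app-000002   3   1   95000
red    open   infra-000004 3   1   410000"

# Column 1 is the index health; column 3 is the index name.
red=$(echo "$indices" | awk '$1 == "red" {print $3}')
echo "red indices: $red"
```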
If there are no red indices and the cluster status is red, check for a continuous heavy processing load on a data node.
Check if the Elasticsearch JVM Heap usage is high.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_nodes/stats?pretty
In the command output, review the node_name.jvm.mem.heap_used_percent field to determine the JVM Heap usage.
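A sketch of that check, using a hypothetical fragment of the `_nodes/stats` output for a single node. The 75% threshold matches the JVM heap alert described later in this document:

```shell
# Hypothetical fragment of `_nodes/stats?pretty` output for one node.
stats='{"jvm":{"mem":{"heap_used_percent":82}}}'

heap=$(echo "$stats" | sed -n 's/.*"heap_used_percent" *: *\([0-9]*\).*/\1/p')
# Sustained heap usage above 75% is a sign of memory pressure.
[ "$heap" -gt 75 ] && echo "JVM heap usage is high: ${heap}%"
```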
Check for high CPU utilization.
Search for "Free up or increase disk space" in the Elasticsearch topic, Fix a red or yellow cluster status.
Replica shards for at least one primary shard are not allocated to nodes.
Increase the node count by adjusting nodeCount in the ClusterLogging CR.
Search for "Free up or increase disk space" in the Elasticsearch topic, Fix a red or yellow cluster status.
Elasticsearch does not allocate shards to nodes that reach the low watermark.
Identify the node on which Elasticsearch is deployed.
oc -n openshift-logging get po -o wide
Check if there are unassigned shards.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/health?pretty | grep unassigned_shards
If there are unassigned shards, check the disk space on each node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
In the command output, check the Use% column to determine the used disk percentage on each node.
If the used disk percentage is above 85%, the node has exceeded the low watermark, and shards can no longer be allocated to this node.
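The watermark checks used throughout this document can be sketched against a hypothetical `df -h` sample. The thresholds below are the Elasticsearch defaults: low watermark 85%, high watermark 90%, flood stage 95%:

```shell
# Hypothetical `df -h /elasticsearch/persistent` output for one pod.
df_out="Filesystem  Size  Used Avail Use% Mounted on
/dev/sdb    100G   87G   13G  87% /elasticsearch/persistent"

used=$(echo "$df_out" | awk 'NR == 2 {gsub("%", "", $5); print $5}')

# Default Elasticsearch disk watermarks: low 85%, high 90%, flood stage 95%.
if   [ "$used" -ge 95 ]; then echo "flood stage: writes blocked"
elif [ "$used" -ge 90 ]; then echo "high watermark: shards relocate away"
elif [ "$used" -ge 85 ]; then echo "low watermark: no new shards allocated"
else echo "disk usage below all watermarks"
fi
```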
Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current redundancyPolicy.
oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging resource in your deployment, enter:
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
Delete the index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
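To pick a deletion candidate, sort the indices by creation date and take the oldest. The two-column listing below is hypothetical sample data; the real `indices` output contains additional columns:

```shell
# Hypothetical index listing: name and creation date columns only.
list="app-000001 2023-01-02
app-000002 2023-02-10
app-000003 2023-03-15"

# Sort by date and pick the oldest index as a deletion candidate.
oldest=$(echo "$list" | sort -k2 | head -n 1 | awk '{print $1}')
echo "candidate for deletion: $oldest"
```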
Search for "redundancyPolicy" in the "Sample ClusterLogging custom resource (CR)" section of About the Cluster Logging custom resource.
Elasticsearch attempts to relocate shards away from a node that has reached the high watermark.
Identify the node on which Elasticsearch is deployed.
oc -n openshift-logging get po -o wide
Check the disk space on each node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
Check if the cluster is rebalancing.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_cluster/health?pretty | grep relocating_shards
If the command output shows relocating shards, the high watermark has been exceeded. The default value of the high watermark is 90%.
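A sketch of that check, against a hypothetical `_cluster/health` fragment:

```shell
# Hypothetical `_cluster/health?pretty` fragment.
health='{"status":"yellow","relocating_shards":4}'

relocating=$(echo "$health" | sed -n 's/.*"relocating_shards" *: *\([0-9]*\).*/\1/p')
# A nonzero count means the cluster is rebalancing shards between nodes.
[ "$relocating" -gt 0 ] && echo "cluster is rebalancing: $relocating shards relocating"
```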
The shards relocate to a node with low disk usage that has not crossed any watermark threshold limits.
To allocate shards to a particular node, free up some space.
Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current redundancyPolicy.
oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging resource in your deployment, enter:
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
Delete the index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
Search for "redundancyPolicy" in the "Sample ClusterLogging custom resource (CR)" section of About the Cluster Logging custom resource.
Elasticsearch enforces a read-only index block on every index that meets both of the following conditions:
One or more shards are allocated to the node.
One or more disks on the node exceed the flood stage watermark.
Check the disk space of the Elasticsearch node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
In the command output, check the Use% column to determine the used disk percentage on each node.
If the used disk percentage is above 95%, it signifies that the node has crossed the flood watermark. Writing is blocked for shards allocated on this particular node.
Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current redundancyPolicy.
oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging resource in your deployment, enter:
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
Delete the index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
Continue freeing up and monitoring the disk space until the used disk space drops below 90%. Then, unblock writes to this particular node.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=_all/_settings?pretty -X PUT -d '{"index.blocks.read_only_allow_delete": null}'
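The decision of when to issue that PUT can be sketched as below. The 88% figure is a hypothetical measurement; the 90% threshold is the high watermark that this procedure tells you to get back under before unblocking:

```shell
# Hypothetical current disk usage (percent) after freeing space.
used=88

# This procedure recommends waiting until usage drops below the high
# watermark (90%) before removing the read-only block.
if [ "$used" -lt 90 ]; then
  echo "safe to remove index.blocks.read_only_allow_delete"
fi
```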
Search for "redundancyPolicy" in the "Sample ClusterLogging custom resource (CR)" section of About the Cluster Logging custom resource.
The Elasticsearch node JVM Heap memory used is above 75%.
Consider increasing the heap size.
System CPU usage on the node is high.
Check the CPU of the cluster node. Consider allocating more CPU resources to the node.
Elasticsearch process CPU usage on the node is high.
Check the CPU of the cluster node. Consider allocating more CPU resources to the node.
The Elasticsearch Cluster is predicted to be out of disk space within the next 6 hours based on current disk usage.
Get the disk space of the Elasticsearch node.
for pod in `oc -n openshift-logging get po -l component=elasticsearch -o jsonpath='{.items[*].metadata.name}'`; do echo $pod; oc -n openshift-logging exec -c elasticsearch $pod -- df -h /elasticsearch/persistent; done
In the command output, check the Avail column to determine the free disk space on that node.
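The prediction behind this alert is simple arithmetic: free space divided by the rate of consumption. The measurements below are hypothetical; the 6-hour window is the one stated in the alert description:

```shell
# Hypothetical measurements: free space now, and space consumed over the
# last hour, both in MiB.
free_mib=6000
rate_mib_per_hour=1200

hours_left=$((free_mib / rate_mib_per_hour))
# The alert fires when the node is predicted to fill up within 6 hours.
[ "$hours_left" -le 6 ] && echo "predicted out of disk in ~${hours_left}h"
```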
Try to increase the disk space on all nodes.
If increasing the disk space is not possible, try adding a new data node to the cluster.
If adding a new data node is problematic, decrease the total cluster redundancy policy.
Check the current redundancyPolicy.
oc -n openshift-logging get es elasticsearch -o jsonpath='{.spec.redundancyPolicy}'
If you are using a ClusterLogging resource in your deployment, enter:
oc -n openshift-logging get cl -o jsonpath='{.items[*].spec.logStore.elasticsearch.redundancyPolicy}'
If the cluster redundancyPolicy is higher than SingleRedundancy, set it to SingleRedundancy and save this change.
If the preceding steps do not fix the issue, delete the old indices.
Check the status of all indices on Elasticsearch.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- indices
Identify an old index that can be deleted.
Delete the index.
oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- es_util --query=<elasticsearch_index_name> -X DELETE
Search for "redundancyPolicy" in the "Sample ClusterLogging custom resource (CR)" section of About the Cluster Logging custom resource.
Search for "ElasticsearchDiskSpaceRunningLow" in About Elasticsearch alerting rules.
Search for "Free up or increase disk space" in the Elasticsearch topic, Fix a red or yellow cluster status.
Based on current usage trends, the predicted number of file descriptors on the node is insufficient.
Check and, if needed, configure the value of max_file_descriptors for each node, as described in the Elasticsearch File descriptors topic.
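As a sketch of how to gauge descriptor pressure, compare the descriptors currently open against the node's limit. The counts below are hypothetical, and the 80% warning threshold is an assumption for illustration:

```shell
# Hypothetical counts: descriptors currently open on the node vs the
# node's max_file_descriptors limit from the node stats.
open_fds=58000
max_fds=65536

pct=$((100 * open_fds / max_fds))
# 80% is a hypothetical warning threshold for this sketch.
[ "$pct" -ge 80 ] && echo "file descriptor usage high: ${pct}%"
```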
Search for "ElasticsearchHighFileDescriptorUsage" in About Elasticsearch alerting rules.
Search for "File Descriptors In Use" in OpenShift Logging dashboards.