You can configure the monitoring stack to optimize the performance and scale of your clusters. The following documentation provides information about how to distribute the monitoring components and control the impact of the monitoring stack on CPU and memory resources.
You can move the monitoring stack components to specific nodes:
Use the nodeSelector constraint with labeled nodes to move any of the monitoring stack components to specific nodes.
Assign tolerations to enable moving components to tainted nodes.
By doing so, you control the placement and distribution of the monitoring components across a cluster.
By controlling placement and distribution of monitoring components, you can optimize system resource use, improve performance, and separate workloads based on specific requirements or policies.
You can move any of the components that monitor workloads for user-defined projects to specific worker nodes.
It is not permitted to move components to control plane or infrastructure nodes.
You have access to the cluster as a user with the cluster-admin cluster role or as a user with the user-workload-monitoring-config-edit role in the openshift-user-workload-monitoring project.
A cluster administrator has enabled monitoring for user-defined projects.
You have installed the OpenShift CLI (oc).
If you have not done so yet, add a label to the nodes on which you want to run the monitoring components:
$ oc label nodes <node_name> <node_label> (1)
1 | Replace <node_name> with the name of the node where you want to add the label. Replace <node_label> with the label that you want to add. |
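As a concrete sketch, assuming a worker node named worker-1 and the label monitoring=true (both hypothetical values, not taken from your cluster):
$ oc label nodes worker-1 monitoring=true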
Edit the user-workload-monitoring-config ConfigMap object in the openshift-user-workload-monitoring project:
$ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
Specify the node labels for the nodeSelector constraint for the component under data/config.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    # ...
    <component>: (1)
      nodeSelector:
        <node_label_1> (2)
        <node_label_2> (3)
    # ...
1 | Substitute <component> with the appropriate monitoring stack component name. |
2 | Substitute <node_label_1> with the label you added to the node. |
3 | Optional: Specify additional labels. If you specify additional labels, the pods for the component are only scheduled on the nodes that contain all of the specified labels. |
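For example, a minimal sketch that schedules the prometheus component for user-defined projects onto nodes carrying the hypothetical monitoring: "true" label from the earlier labeling step:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      nodeSelector:
        monitoring: "true"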
If monitoring components remain in a Pending state after you configure the nodeSelector constraint, check the pod events for errors that relate to taints and tolerations.
Save the file to apply the changes. The components specified in the new configuration are automatically moved to the new nodes, and the pods affected by the new configuration are redeployed.
You can assign tolerations to the components that monitor user-defined projects to enable moving them to tainted worker nodes. Scheduling is not permitted on control plane or infrastructure nodes.
You have access to the cluster as a user with the cluster-admin cluster role, or as a user with the user-workload-monitoring-config-edit role in the openshift-user-workload-monitoring project.
A cluster administrator has enabled monitoring for user-defined projects.
You have installed the OpenShift CLI (oc).
Edit the user-workload-monitoring-config config map in the openshift-user-workload-monitoring project:
$ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
Specify tolerations for the component:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    <component>:
      tolerations:
        <toleration_specification>
Substitute <component> and <toleration_specification> accordingly.
For example, oc adm taint nodes node1 key1=value1:NoSchedule adds a taint to node1 with the key key1 and the value value1. This prevents monitoring components from deploying pods on node1 unless a toleration is configured for that taint. The following example configures the thanosRuler component to tolerate the example taint:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    thanosRuler:
      tolerations:
      - key: "key1"
        operator: "Equal"
        value: "value1"
        effect: "NoSchedule"
Save the file to apply the changes. The pods affected by the new configuration are automatically redeployed.
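As an optional check after the redeployment, you can list the pods together with the nodes they were scheduled to and confirm that they landed on the intended tainted nodes:
$ oc -n openshift-user-workload-monitoring get pods -o wide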
Taints and Tolerations (Kubernetes documentation)
You can ensure that the containers that run monitoring components have enough CPU and memory resources by specifying values for resource limits and requests for those components.
You can configure these limits and requests for monitoring components that monitor user-defined projects in the openshift-user-workload-monitoring namespace.
To configure CPU and memory resources, specify values for resource limits and requests in the user-workload-monitoring-config ConfigMap object in the openshift-user-workload-monitoring namespace.
You have access to the cluster as a user with the cluster-admin cluster role, or as a user with the user-workload-monitoring-config-edit role in the openshift-user-workload-monitoring project.
You have installed the OpenShift CLI (oc).
Edit the user-workload-monitoring-config config map in the openshift-user-workload-monitoring project:
$ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
Add values to define resource limits and requests for each component you want to configure.
Ensure that the value set for a limit is always higher than the value set for a request. Otherwise, an error occurs and the container does not run.
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    alertmanager:
      resources:
        limits:
          cpu: 500m
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 500Mi
    prometheus:
      resources:
        limits:
          cpu: 500m
          memory: 3Gi
        requests:
          cpu: 200m
          memory: 500Mi
    thanosRuler:
      resources:
        limits:
          cpu: 500m
          memory: 1Gi
        requests:
          cpu: 200m
          memory: 500Mi
Save the file to apply the changes. The pods affected by the new configuration are automatically redeployed.
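As an optional check, you can inspect the resources that a redeployed pod received. For example, assuming the user-workload Prometheus pod is named prometheus-user-workload-0 (a typical name, used here as an assumption):
$ oc -n openshift-user-workload-monitoring get pod prometheus-user-workload-0 -o jsonpath='{.spec.containers[*].resources}'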
About specifying limits and requests for monitoring components
Kubernetes requests and limits documentation (Kubernetes documentation)
Cluster administrators can use the following measures to control the impact of unbound metrics attributes in user-defined projects:
Limit the number of samples that can be accepted per target scrape in user-defined projects
Limit the number of scraped labels, the length of label names, and the length of label values
Configure the intervals between consecutive scrapes and between Prometheus rule evaluations
Create alerts that fire when a scrape sample threshold is reached or when the target cannot be scraped
Limiting scrape samples can help prevent the issues caused by adding many unbound attributes to labels. Developers can also prevent the underlying cause by limiting the number of unbound attributes that they define for metrics. Using attributes that are bound to a limited set of possible values reduces the number of potential key-value pair combinations, as illustrated after this note.
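For example, a label whose value comes from an unbounded set, such as a per-user path, creates a new time series for every distinct value, whereas a label with a small fixed set of values does not. The metric and label names below are hypothetical:
http_requests_total{path="/users/12345/profile"}   # unbound: a new series for every user ID
http_requests_total{code="200"}                     # bound: only a small set of possible values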
You can set the following scrape and label limits for user-defined projects:
Limit the number of samples that can be accepted per target scrape
Limit the number of scraped labels
Limit the length of label names and label values
You can also set an interval between consecutive scrapes and between Prometheus rule evaluations.
If you set sample or label limits, no further sample data is ingested for that target scrape after the limit is reached.
You have access to the cluster as a user with the cluster-admin cluster role, or as a user with the user-workload-monitoring-config-edit role in the openshift-user-workload-monitoring project.
A cluster administrator has enabled monitoring for user-defined projects.
You have installed the OpenShift CLI (oc).
Edit the user-workload-monitoring-config ConfigMap object in the openshift-user-workload-monitoring project:
$ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
Add the enforced limit and time interval configurations to data/config.yaml:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheus:
      enforcedSampleLimit: 50000 (1)
      enforcedLabelLimit: 500 (2)
      enforcedLabelNameLengthLimit: 50 (3)
      enforcedLabelValueLengthLimit: 600 (4)
      scrapeInterval: 1m30s (5)
      evaluationInterval: 1m15s (6)
1 | A value is required if this parameter is specified. This enforcedSampleLimit example limits the number of samples that can be accepted per target scrape in user-defined projects to 50,000. |
2 | Specifies the maximum number of labels per scrape. The default value is 0, which specifies no limit. |
3 | Specifies the maximum character length for a label name. The default value is 0, which specifies no limit. |
4 | Specifies the maximum character length for a label value. The default value is 0, which specifies no limit. |
5 | Specifies the interval between consecutive scrapes. The interval must be set between 5 seconds and 5 minutes. The default value is 30s. |
6 | Specifies the interval between Prometheus rule evaluations. The interval must be set between 5 seconds and 5 minutes. The default value for Prometheus is 30s. |
Save the file to apply the changes. The limits are applied automatically.
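To see how close individual targets are to the enforced sample limit, you can run a query such as the following in the web console under Observe → Metrics. It uses the same metrics as the alert example later in this section:
scrape_samples_post_metric_relabeling / (scrape_sample_limit > 0)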
You can create alerts that notify you when:
The target cannot be scraped or is not available for the specified for duration
A scrape sample threshold is reached or is exceeded for the specified for duration
You have access to the cluster as a user with the cluster-admin cluster role, or as a user with the user-workload-monitoring-config-edit role in the openshift-user-workload-monitoring project.
A cluster administrator has enabled monitoring for user-defined projects.
You have limited the number of samples that can be accepted per target scrape in user-defined projects by using enforcedSampleLimit.
You have installed the OpenShift CLI (oc).
Create a YAML file with alerts that inform you when the targets are down and when the enforced sample limit is approaching. The file in this example is called monitoring-stack-alerts.yaml:
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: k8s
    role: alert-rules
  name: monitoring-stack-alerts (1)
  namespace: ns1 (2)
spec:
  groups:
  - name: general.rules
    rules:
    - alert: TargetDown (3)
      annotations:
        message: '{{ printf "%.4g" $value }}% of the {{ $labels.job }}/{{ $labels.service
          }} targets in {{ $labels.namespace }} namespace are down.' (4)
      expr: 100 * (count(up == 0) BY (job, namespace, service) / count(up) BY (job,
        namespace, service)) > 10
      for: 10m (5)
      labels:
        severity: warning (6)
    - alert: ApproachingEnforcedSamplesLimit (7)
      annotations:
        message: '{{ $labels.container }} container of the {{ $labels.pod }} pod in the {{ $labels.namespace }} namespace consumes {{ $value | humanizePercentage }} of the samples limit budget.' (8)
      expr: (scrape_samples_post_metric_relabeling / (scrape_sample_limit > 0)) > 0.9 (9)
      for: 10m (10)
      labels:
        severity: warning (11)
1 | Defines the name of the alerting rule. |
2 | Specifies the user-defined project where the alerting rule is deployed. |
3 | The TargetDown alert fires if the target cannot be scraped or is not available for the for duration. |
4 | The message that is displayed when the TargetDown alert fires. |
5 | The conditions for the TargetDown alert must be true for this duration before the alert is fired. |
6 | Defines the severity for the TargetDown alert. |
7 | The ApproachingEnforcedSamplesLimit alert fires when the defined scrape sample threshold is exceeded and lasts for the specified for duration. |
8 | The message that is displayed when the ApproachingEnforcedSamplesLimit alert fires. |
9 | The threshold for the ApproachingEnforcedSamplesLimit alert. In this example, the alert fires when the number of ingested samples exceeds 90% of the configured limit. |
10 | The conditions for the ApproachingEnforcedSamplesLimit alert must be true for this duration before the alert is fired. |
11 | Defines the severity for the ApproachingEnforcedSamplesLimit alert. |
Apply the configuration to the user-defined project:
$ oc apply -f monitoring-stack-alerts.yaml
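Optionally, you can verify that the alerting rule was created in the project, for example:
$ oc -n ns1 get prometheusrule monitoring-stack-alerts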
Additionally, you can check if a target has hit the configured limit:
In the Administrator perspective of the web console, go to Observe → Targets and select an endpoint with a Down status that you want to check.
The Scrape failed: sample limit exceeded message is displayed if the endpoint failed because of an exceeded sample limit.
You can configure pod topology spread constraints for all the pods for user-defined monitoring to control how pod replicas are scheduled to nodes across zones. This ensures that the pods are highly available and run more efficiently, because workloads are spread across nodes in different data centers or hierarchical infrastructure zones.
You can configure pod topology spread constraints for monitoring pods by using the user-workload-monitoring-config config map.
You have access to the cluster as a user with the cluster-admin cluster role or as a user with the user-workload-monitoring-config-edit role in the openshift-user-workload-monitoring project.
A cluster administrator has enabled monitoring for user-defined projects.
You have installed the OpenShift CLI (oc).
Edit the user-workload-monitoring-config config map in the openshift-user-workload-monitoring project:
$ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
Add the following settings under the data/config.yaml field to configure pod topology spread constraints:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    <component>: (1)
      topologySpreadConstraints:
      - maxSkew: <n> (2)
        topologyKey: <key> (3)
        whenUnsatisfiable: <value> (4)
        labelSelector: (5)
          <match_option>
1 | Specify a name of the component for which you want to set up pod topology spread constraints. |
2 | Specify a numeric value for maxSkew, which defines the degree to which pods are allowed to be unevenly distributed. |
3 | Specify a key of node labels for topologyKey. Nodes that have a label with this key and identical values are considered to be in the same topology. The scheduler tries to put a balanced number of pods into each domain. |
4 | Specify a value for whenUnsatisfiable. Available options are DoNotSchedule and ScheduleAnyway. Specify DoNotSchedule if you want the maxSkew value to define the maximum difference allowed between the number of matching pods in the target topology and the global minimum. Specify ScheduleAnyway if you want the scheduler to still schedule the pod but to give higher priority to nodes that might reduce the skew. |
5 | Specify labelSelector to find matching pods. Pods that match this label selector are counted to determine the number of pods in their corresponding topology domain. |
The following example configures pod topology spread constraints for the thanosRuler component:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    thanosRuler:
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: monitoring
        whenUnsatisfiable: ScheduleAnyway
        labelSelector:
          matchLabels:
            app.kubernetes.io/name: thanos-ruler
Save the file to apply the changes. The pods affected by the new configuration are automatically redeployed.
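As an optional check based on the preceding thanosRuler example, you can list the Thanos Ruler pods together with the nodes they run on to review how they were spread:
$ oc -n openshift-user-workload-monitoring get pods -l app.kubernetes.io/name=thanos-ruler -o wide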