
Find troubleshooting steps for common issues with user-defined project monitoring.

Determining why user-defined project metrics are unavailable

If metrics are not displaying when monitoring user-defined projects, follow these steps to troubleshoot the issue.

Procedure
  1. Query the metric name and verify that the project is correct:

    1. From the Developer perspective in the web console, select Observe → Metrics.

    2. Select the project that you want to view metrics for in the Project: list.

    3. Choose a query from the Select query list, or run a custom PromQL query by selecting Show PromQL.

      The metrics are displayed in a chart.

      Queries must be done on a per-project basis. The metrics that are shown relate to the project that you have selected.
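      For example, if your project runs a workload such as the prometheus-example-app service used later in this procedure, a minimal custom PromQL query might look like the following; the metric and label names are illustrative and must match what your application actually exposes:

        version{job="prometheus-example-app"}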

  2. Verify that the pod that you want metrics from is actively serving metrics. Run the following oc exec command against a pod to query the target pod IP address, port, and /metrics endpoint:

    $ oc exec <sample_pod> -n <sample_namespace> -- curl <target_pod_IP>:<port>/metrics

    You must run the command on a pod that has curl installed.

    The following example output shows a result with a valid version metric.

    Example output
      % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                     Dload  Upload   Total   Spent    Left  Speed
    # HELP version Version information about this binary-- --:--:-- --:--:--     0
    # TYPE version gauge
    version{version="v0.1.0"} 1
    100   102  100   102    0     0  51000      0 --:--:-- --:--:-- --:--:-- 51000

    An invalid output indicates that there is a problem with the corresponding application.

  3. If you are using a PodMonitor CRD, verify that the PodMonitor CRD is configured to point to the correct pods using label matching. For more information, see the Prometheus Operator documentation.
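    The following PodMonitor object is a minimal sketch, assuming pods labeled app: prometheus-example-app in the ns1 namespace that expose metrics on a port named web; adjust the name, namespace, labels, and port to match your workload:

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: prometheus-example-monitor
      namespace: ns1
    spec:
      selector:
        matchLabels:
          app: prometheus-example-app
      podMetricsEndpoints:
      - port: web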

  4. If you are using a ServiceMonitor CRD, and if the /metrics endpoint of the pod is showing metric data, follow these steps to verify the configuration:

    1. Verify that the service is pointing to the correct /metrics endpoint. The service labels in this output must match the labels that the ServiceMonitor object uses to select the service, and the /metrics endpoint that the service defines is queried in the subsequent steps.

      $ oc get service prometheus-example-app -n ns1 -o yaml
      Example output
      apiVersion: v1
      kind: Service (1)
      metadata:
        labels: (2)
          app: prometheus-example-app
        name: prometheus-example-app
        namespace: ns1
      spec:
        ports:
        - port: 8080
          protocol: TCP
          targetPort: 8080
          name: web
        selector:
          app: prometheus-example-app
        type: ClusterIP
      1 Specifies that this is a service API.
      2 Specifies the labels that are being used for this service.
    2. Query the service IP address, port, and /metrics endpoint to verify that the same metrics that you received from the curl command against the pod are returned:

      1. Run the following command to find the service IP:

        $ oc get service -n <target_namespace>
      2. Query the /metrics endpoint:

        $ oc exec <sample_pod> -n <sample_namespace> -- curl <service_IP>:<port>/metrics

        Valid metrics are returned in the following example.

        Example output
        % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                       Dload  Upload   Total   Spent    Left  Speed
        100   102  100   102    0     0  51000      0 --:--:-- --:--:-- --:--:--   99k
        # HELP version Version information about this binary
        # TYPE version gauge
        version{version="v0.1.0"} 1
    3. Use label matching to verify that the ServiceMonitor object is configured to point to the desired service. To do this, compare the Service object from the oc get service output to the ServiceMonitor object from the oc get servicemonitor output. The labels must match for the metrics to be displayed.

      For example, from the previous steps, notice how the Service object has the app: prometheus-example-app label and the ServiceMonitor object has the same app: prometheus-example-app match label.
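      The following ServiceMonitor object is a minimal sketch of what the oc get servicemonitor <servicemonitor_name> -n ns1 -o yaml output might look like for this example; the object name is illustrative, and the spec.selector.matchLabels value is what must match the metadata.labels of the Service object shown earlier:

      apiVersion: monitoring.coreos.com/v1
      kind: ServiceMonitor
      metadata:
        name: prometheus-example-monitor
        namespace: ns1
      spec:
        endpoints:
        - port: web
          scheme: http
        selector:
          matchLabels:
            app: prometheus-example-app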

  5. If everything looks valid and the metrics are still unavailable, contact the support team for further help.

Determining why Prometheus is consuming a lot of disk space

Developers can create labels to define attributes for metrics in the form of key-value pairs. The number of potential key-value pairs corresponds to the number of possible values for an attribute. An attribute that has an unlimited number of potential values is called an unbound attribute. For example, a customer_id attribute is unbound because it has an infinite number of possible values.

Every assigned key-value pair has a unique time series. The use of many unbound attributes in labels can result in an exponential increase in the number of time series created. This can impact Prometheus performance and can consume a lot of disk space.
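As an illustration, the following PromQL query counts how many distinct time series a single suspected unbound label produces for one metric; the http_requests_total metric name and the customer_id label are hypothetical placeholders for names used by your own workload. A large result indicates that the label is contributing many time series.

  count(count by (customer_id) (http_requests_total))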

You can use the following measures when Prometheus consumes a lot of disk space:

  • Check the number of scrape samples that are being collected.

  • Check the time series database (TSDB) status using the Prometheus HTTP API for more information about which labels are creating the most time series. Doing so requires cluster administrator privileges.

  • Reduce the number of unique time series that are created by reducing the number of unbound attributes that are assigned to user-defined metrics.

    Using attributes that are bound to a limited set of possible values reduces the number of potential key-value pair combinations.

  • Enforce limits on the number of samples that can be scraped across user-defined projects. This requires cluster administrator privileges.
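    For example, a scrape sample limit can be set centrally for user-defined projects. The following ConfigMap is a sketch, assuming the enforcedSampleLimit field in the user-workload-monitoring-config ConfigMap is available in your environment; the 50000 value is only illustrative:

    apiVersion: v1
    kind: ConfigMap
    metadata:
      name: user-workload-monitoring-config
      namespace: openshift-user-workload-monitoring
    data:
      config.yaml: |
        prometheus:
          enforcedSampleLimit: 50000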

Prerequisites
  • You have access to the cluster as a user with the dedicated-admin role.

  • You have installed the OpenShift CLI (oc).

Procedure
  1. In the Administrator perspective, navigate to Observe → Metrics.

  2. Run the following Prometheus Query Language (PromQL) query in the Expression field. This returns the ten jobs that have the highest number of scrape samples:

    topk(10,count by (job)({__name__=~".+"}))
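    To see which individual metrics contribute the most time series, rather than which jobs, the following complementary query groups the count by metric name instead; it is an illustrative variation of the query above:

    topk(10, count by (__name__)({__name__=~".+"}))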
  3. Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts.

    • If the metrics relate to a user-defined project, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.

    • If the metrics relate to a core Red Hat OpenShift Service on AWS project, create a Red Hat support case on the Red Hat Customer Portal.

  4. Review the TSDB status using the Prometheus HTTP API by running the following commands as a dedicated-admin:

    $ oc login -u <username> -p <password>
    $ host=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath={.spec.host})
    $ token=$(oc whoami -t)
    $ curl -H "Authorization: Bearer $token" -k "https://$host/api/v1/status/tsdb"
    Example output
    "status": "success",