Enabling Cluster Metrics | Installation and Configuration

Overview

The kubelet exposes metrics that can be collected and stored in back-ends by Heapster.

As an OpenShift Container Platform administrator, you can view a cluster’s metrics from all containers and components in one user interface. These metrics are also used by horizontal pod autoscalers in order to determine when and how to scale.

This topic describes using Hawkular Metrics as a metrics engine which stores the data persistently in a Cassandra database. When this is configured, CPU, memory and network-based metrics are viewable from the OpenShift Container Platform web console and are available for use by horizontal pod autoscalers.

Heapster retrieves a list of all nodes from the master server, then contacts each node individually through the /stats endpoint. From there, Heapster scrapes the metrics for CPU, memory and network usage, then exports them into Hawkular Metrics.

Browsing individual pods in the web console displays separate sparkline charts for memory and CPU. The time range displayed is selectable, and these charts automatically update every 30 seconds. If there are multiple containers on the pod, then you can select a specific container to display its metrics.

If resource limits are defined for your project, then you can also see a donut chart for each pod. The donut chart displays usage against the resource limit. For example: 145 Available of 200 MiB, with the donut chart showing 55 MiB Used.

Before You Begin

The components for cluster metrics must be deployed to the openshift-infra project. This allows horizontal pod autoscalers to discover the Heapster service and use it to retrieve metrics that can be used for autoscaling.

All of the following commands in this topic must be executed under the openshift-infra project. To switch to the openshift-infra project:

$ oc project openshift-infra

To enable cluster metrics, you must next configure the following:

Service Accounts
Metrics Data Storage
Metrics Deployer

Service Accounts

You must configure service accounts for:

Metrics Deployer
Heapster

Metrics Deployer Service Account

The Metrics Deployer will be discussed in a later step, but you must first set up a service account for it:

Grant view permissions to the Hawkular Metrics service:

$ oc adm policy add-role-to-user view system:serviceaccount:openshift-infra:hawkular -n openshift-infra

Create a metrics-deployer service account:

$ oc create -f - <<API
apiVersion: v1
kind: ServiceAccount
metadata:
  name: metrics-deployer
secrets:
- name: metrics-deployer
API

Before it can deploy components, the metrics-deployer service account must also be granted the edit permission for the openshift-infra project:
```
$ oadm policy add-role-to-user \
    edit system:serviceaccount:openshift-infra:metrics-deployer
```

Heapster Service Account

The Heapster component requires access to the master server to list all available nodes and access the /stats endpoint for each node. Before it can do this, the Heapster service account requires the cluster-reader permission:

$ oadm policy add-cluster-role-to-user \
    cluster-reader system:serviceaccount:openshift-infra:heapster

The Heapster service account is created automatically during the Deploying the Metrics Components step.

Metrics Data Storage

You can store the metrics data to either persistent storage or to a temporary pod volume.

Persistent Storage

Running OpenShift Container Platform cluster metrics with persistent storage means that your metrics will be stored to a persistent volume and be able to survive a pod being restarted or recreated. This is ideal if you require your metrics data to be guarded from data loss. For production environments it is highly recommended to configure persistent storage for your metrics pods.

The size of the persisted volume can be specified with the CASSANDRA_PV_SIZE template parameter. By default it is set to 10 GB, which may or may not be sufficient for the size of the cluster you are using. If you require more space, for instance 100 GB, you could specify it with something like this:

$ oc process -f metrics-deployer.yaml \
    -v HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.example.com \
    -v CASSANDRA_PV_SIZE=100Gi \
    | oc create -f -

The size requirement of the Cassandra storage is dependent on the number of pods. It is the administrator’s responsibility to ensure that the size requirements are sufficient for their setup and to monitor usage to ensure that the disk does not become full.

If you would like to use dynamically provisioned persistent volumes you must set the DYNAMICALLY_PROVISION_STORAGE template option to true for the Metrics Deployer.

Capacity Planning for Cluster Metrics

After the metrics deployer runs, the output of oc get pods should resemble the following:

 # oc get pods -n openshift-infra
 NAME                                READY             STATUS      RESTARTS          AGE
 hawkular-cassandra-1-l5y4g          1/1               Running     0                 17h
 hawkular-metrics-1t9so              1/1               Running     0                 17h
 heapster-febru                      1/1               Running     0                 17h

OpenShift Container Platform metrics are stored using the Cassandra database, which is deployed with settings of MAX_HEAP_SIZE=512M and NEW_HEAP_SIZE=100M. These values should cover most OpenShift Container Platform metrics installations, but you can modify them in the Cassandra Dockerfile prior to deploying cluster metrics.

By default, metrics data is stored for 7 days. You can configure this with the METRIC_DURATION parameter in the metrics.yaml configuration file. After seven days, Cassandra begins to purge the oldest metrics data. Metrics data for deleted pods and projects is not automatically purged; it is only removed once the data is over seven days old.

If the Cassandra persisted volume runs out of sufficient space, then data loss will occur.

For cluster metrics to work with persistent storage, ensure that the persistent volume has the ReadWriteOnce access mode. If this mode is not active, then the persistent volume claim cannot locate the persistent volume, and Cassandra fails to start.

To use persistent storage with the metric components, ensure that a persistent volume of sufficient size is available. The creation of persistent volume claims is handled by the Metrics Deployer.

OpenShift Container Platform metrics also supports dynamically-provisioned persistent volumes. To use this feature with OpenShift Container Platform metrics, it is necessary to add an additional flag to the metrics-deployer: DYNAMICALLY_PROVISION_STORAGE=true. You can use EBS, GCE, and Cinder storage back-ends to dynamically provision persistent volumes.

In tests performed with with 210 and 990 OpenShift Container Platform nodes, where 10500 pods and 11000 pods were monitored respectively, the Cassandra database grew at the speed shown in the table below:

Table 1. Cassandra Database storage requirements based on number of nodes/pods in the cluster
Number of Nodes	Number of Pods	Cassandra Storage growth speed	Cassandra storage growth per day	Cassandra storage growth per week
210	10500	500 MB per hour	15 GB	75 GB
990	11000	1 GB per hour	30 GB	210 GB

In the above calculation, approximately 20% of the expected size was added as overhead to ensure that the storage requirements do not exceed calculated value.

If the METRICS_DURATION and METRICS_RESOLUTION values are kept at the default (7 days and 15 seconds respectively), it is safe to plan Cassandra storage size requrements for week, as in the values above.

Because OpenShift Container Platform metrics uses the Cassandra database as a datastore for metrics data, if USE_PERSISTANT_STORAGE=true is set during the metrics set up process, PV will be on top in the network storage, with NFS as the default. However, using network storage in combination with Cassandra is not recommended, as per the Cassandra documentation.

Known Issues and Limitations

Testing found that the heapster metrics component is capable of handling up to 12000 pods. If the amount of pods exceed that number, Heapster begins to fall behind in metrics processing, resulting in the possibility of metrics graphs no longer appearing. Work is ongoing to increase the number of pods that Heapster can gather metrics on, as well as upstream development of alternate metrics-gathering solutions.

Non-Persistent Storage

Running OpenShift Container Platform cluster metrics with non-persistent storage means that any stored metrics will be deleted when the pod is deleted. While it is much easier to run cluster metrics with non-persistent data, running with non-persistent data does come with the risk of permanent data loss. However, metrics can still survive a container being restarted.

In order to use non-persistent storage, you must set the USE_PERSISTENT_STORAGE template option to false for the Metrics Deployer.

When using non-persistent storage, metrics data will be written to /var/lib/origin/openshift.local.volumes/pods on the node where the Cassandra pod is running. Ensure /var has enough free space to accommodate metrics storage.

Metrics Deployer

The Metrics Deployer deploys and configures all of the metrics components. You can configure it by passing in information from secrets and by passing parameters to the Metrics Deployer’s template.

Using Secrets

The Metrics Deployer will auto-generate self-signed certificates for use between its components and will generate a re-encrypting route to expose the Hawkular Metrics service. This route is what allows the web console to access the Hawkular Metrics service.

In order for the browser running the web console to trust the connection through this route, it must trust the route’s certificate. This can be accomplished by providing your own certificates signed by a trusted Certificate Authority. The Metric Deployer’s secret allows you to pass your own certificates which it will then use when creating the route.

If you do not wish to provide your own certificates, the router’s default certificate will be used instead.

Providing Your Own Certificates

To provide your own certificate which will be used by the re-encrypting route, you can pass these values as secrets to the Metrics Deployer.

The hawkular-metrics.pem value needs to contain the certificate in its .pem format. You may also need to provide the certificate for the Certificate Authority which signed this pem file via the hawkular-metrics-ca.cert secret.

$ oc secrets new metrics-deployer \
    hawkular-metrics.pem=/home/openshift/metrics/hm.pem \
    hawkular-metrics-ca.cert=/home/openshift/metrics/hm-ca.cert

When these secrets are provided, the deployer uses these values to specify the key, certificate and caCertificate values for the re-encrypting route it generated.

For more information, please see the re-encryption route documentation.

Using the Router’s Default Certificate

If the hawkular-metrics.pem value is not specified, the re-encrypting route will use the router’s default certificate, which may not be trusted by browsers.

A secret named metrics-deployer will still be required in this situation. This can be considered a "dummy" secret since the secret it specifies is not actually used by the component.

To create a "dummy" secret that does not specify a certificate value:

$ oc secrets new metrics-deployer nothing=/dev/null

Deployer Secret Options

The following table contains more advanced configuration options, detailing all the secrets which can be used by the deployer:

Secret Name Description

Secret Name	Description
*hawkular-metrics.pem*	The *pem* file to use for the Hawkular Metrics certificate used for the re-encrypting route. This certificate must contain the host name used by the route (e.g., `HAWKULAR_METRICS_HOSTNAME`). The default router’s certificate is used for the route if this option is unspecified.
*hawkular-metrics-ca.cert*	The certificate for the CA used to sign the *hawkular-metrics.pem. This is optional if the hawkular-metrics.pem* does not contain the CA certificate directly.
*heapster.cert*	The certificate for Heapster to use. This is auto-generated if unspecified.
*heapster.key*	The key to use with the Heapster certificate. This is ignored if *heapster.cert* is not specified
*heapster_client_ca.cert*	The certificate that generates *heapster.cert. This is required if heapster.cert* is specified. Otherwise, the main CA for the OpenShift Container Platform installation is used. In order for horizontal pod autoscaling to function properly, this should not be overridden.
*heapster_allowed_users*	A file containing a comma-separated list of CN to accept from certificates signed with the specified CA. By default, this is set to allow the OpenShift Container Platform service proxy to connect. If you override this, make sure to add `system:master-proxy` to the list in order to allow horizontal pod autoscaling to function properly.

hawkular-metrics.pem

The pem file to use for the Hawkular Metrics certificate used for the re-encrypting route. This certificate must contain the host name used by the route (e.g., HAWKULAR_METRICS_HOSTNAME). The default router’s certificate is used for the route if this option is unspecified.

hawkular-metrics-ca.cert

The certificate for the CA used to sign the hawkular-metrics.pem. This is optional if the hawkular-metrics.pem does not contain the CA certificate directly.

heapster.cert

The certificate for Heapster to use. This is auto-generated if unspecified.

heapster.key

The key to use with the Heapster certificate. This is ignored if heapster.cert is not specified

heapster_client_ca.cert

The certificate that generates heapster.cert. This is required if heapster.cert is specified. Otherwise, the main CA for the OpenShift Container Platform installation is used. In order for horizontal pod autoscaling to function properly, this should not be overridden.

heapster_allowed_users

A file containing a comma-separated list of CN to accept from certificates signed with the specified CA. By default, this is set to allow the OpenShift Container Platform service proxy to connect. If you override this, make sure to add system:master-proxy to the list in order to allow horizontal pod autoscaling to function properly.

Modifying the Deployer Template

The OpenShift Container Platform installer uses a template to deploy the metrics components. The default template can be found at the following path:

/usr/share/ansible/openshift-ansible/roles/openshift_hosted_templates/files/v1.4/enterprise/metrics-deployer.yaml

In case you need to make any changes to this file, copy it to another directory with the file name metrics-deployer.yaml and refer to the new location when using it in the following sections.

Deployer Template Parameters

The deployer template parameter options and their defaults are listed in the default metrics-deployer.yaml file. If required, you can override these values when creating the Metrics Deployer.

Table 2. Template Parameters
Parameter	Description
`METRIC_DURATION`	The number of days metrics should be stored.
`CASSANDRA_PV_SIZE`	The persistent volume size for each of the Cassandra nodes.
`CASSANDRA_NODES`	The number of initial Cassandra nodes to deploy.
`USE_PERSISTENT_STORAGE`	Set to true for persistent storage; set to false to use non-persistent storage.
`DYNAMICALLY_PROVISION_STORAGE`	Set to true to allow for dynamically provisioned storage.
`REDEPLOY`	If set to true, the deployer will try to delete all the existing components before trying to redeploy.
`HAWKULAR_METRICS_HOSTNAME`	External host name where clients can reach Hawkular Metrics. This is the FQDN of the machine running the router pod.
`MASTER_URL`	Internal URL for the master, for authentication retrieval.
`IMAGE_VERSION`	Specify version for metrics components. For example, for openshift/origin-metrics-deployer:latest, set version to latest.
`IMAGE_PREFIX`	Specify prefix for metrics components. For example, for openshift/origin-metrics-deployer:latest, set prefix to openshift/origin-.
`MODE`	Can be set to: preflight to perform validation before a deployment. deploy to perform an initial deployment. refresh to delete and redeploy all components but to keep persisted data and routes. redeploy to delete and redeploy everything (losing all data in the process). validate to re-run validations after a deployment.
`IGNORE_PREFLIGHT`	Can be set to true to disable the preflight checks. This allows the deployer to continue even if the preflight check has failed.
`USER_WRITE_ACCESS`	Sets whether user accounts should be able to write metrics. Defaults to `false` so that only Heapster can write metrics and not individual users. It is recommended to disable user write access; if enabled, any user will be able to write metrics to the system which can affect performance and can cause Cassandra disk usage to unpredictably increase.

The only required parameter is HAWKULAR_METRICS_HOSTNAME. This value is required when creating the deployer, because it specifies the host name for the Hawkular Metrics route. This value should correspond to a fully qualified domain name. You must know the value of HAWKULAR_METRICS_HOSTNAME when configuring the console for metrics access.

If you are using persistent storage with Cassandra, it is the administrator’s responsibility to set a sufficient disk size for the cluster using the CASSANDRA_PV_SIZE parameter. It is also the administrator’s responsibility to monitor disk usage to make sure that it does not become full.

Data loss will result if the Cassandra persisted volume runs out of sufficient space.

All of the other parameters are optional and allow for greater customization. For instance, if you have a custom install in which the Kubernetes master is not available under https://kubernetes.default.svc:443 you can specify the value to use instead with the MASTER_URL parameter. To deploy a specific version of the metrics components, use the IMAGE_VERSION parameter.

It is highly recommended to not use latest for the IMAGE_VERSION. The latest version corresponds to the very latest version available and can cause issues if it brings in a newer version not meant to function on the version of OpenShift Container Platform you are currently running.

Deploying the Metric Components

Because deploying and configuring all the metric components is handled by the Metrics Deployer, you can simply deploy everything in one step.

The following examples show you how to deploy metrics with and without persistent storage using the default template parameters. Optionally, you can specify any of the template parameters when calling these commands.

In accordance with upstream Kubernetes rules, metrics can be collected only on the default interface of eth0.

Example 1. Deploying with Persistent Storage

The following command sets the Hawkular Metrics route to use hawkular-metrics.example.com and is deployed using persistent storage.

You must have a persistent volume of sufficient size available.

$ oc new-app --as=system:serviceaccount:openshift-infra:metrics-deployer \
    -f metrics-deployer.yaml \
    -p HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.example.com

Example 2. Deploying without Persistent Storage

The following command sets the Hawkular Metrics route to use hawkular-metrics.example.com and deploy without persistent storage.

$ oc new-app --as=system:serviceaccount:openshift-infra:metrics-deployer \
    -f metrics-deployer.yaml \
    -p HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.example.com \
    -p USE_PERSISTENT_STORAGE=false

Because this is being deployed without persistent storage, metric data loss can occur.

Metrics Deployer Validations

The metrics deployer does some validation both before and after deployment. If the pre-flight validation fails, the environment for deployment is considered unsuitable and the deployment is aborted. However, you can add IGNORE_PREFLIGHT=true to the deployer parameters if you believe the validation has erred.

If post-deployment validation fails, the deployer finishes in an Error state, which indicates that you should check the deployer logs for issues that may require addressing. For example, the validation may detect that the external hawkular-metrics route is not actually in use, because the route was already created somewhere else. The validation output at the end of a deployment should explain as clearly as possible any issues it finds and what you can do to address them.

Once you have addressed deployment validation issues, you can re-run just the validation by running the deployer again with the MODE=validate parameter added, for example:

$ oc new-app --as=system:serviceaccount:openshift-infra:metrics-deployer \
    -f metrics-deployer.yaml \
    -p HAWKULAR_METRICS_HOSTNAME=hawkular-metrics.example.com \
    -p MODE=validate

There is also a diagnostic for metrics:

$ oadm diagnostics MetricsApiProxy

Setting the Metrics Public URL

The OpenShift Container Platform web console uses the data coming from the Hawkular Metrics service to display its graphs. The URL for accessing the Hawkular Metrics service must be configured via the metricsPublicURL option in the master configuration file (/etc/origin/master/master-config.yaml). This URL corresponds to the route created with the HAWKULAR_METRICS_HOSTNAME template parameter during the deployment of the metrics components.

You must be able to resolve the HAWKULAR_METRICS_HOSTNAME from the browser accessing the console.

For example, if your HAWKULAR_METRICS_HOSTNAME corresponds to hawkular-metrics.example.com, then you must make the following change in the master-config.yaml file:

  assetConfig:
    ...
    metricsPublicURL: "https://hawkular-metrics.example.com/hawkular/metrics"

Once you have updated and saved the master-config.yaml file, you must restart your OpenShift Container Platform instance.

When your OpenShift Container Platform server is back up and running, metrics will be displayed on the pod overview pages.

If you are using self-signed certificates, remember that the Hawkular Metrics service is hosted under a different host name and uses different certificates than the console. You may need to explicitly open a browser tab to the value specified in metricsPublicURL and accept that certificate.

To avoid this issue, use certificates which are configured to be acceptable by your browser.

Accessing Hawkular Metrics Directly

To access and manage metrics more directly, use the Hawkular Metrics API.

When accessing Hawkular Metrics via the API, you will only be able to perform reads. Writing metrics has been disabled by default. If you want for individual users to also be able to write metrics, you must set the USER_WRITE_ACCESS deployer template parameter to true.

However, it is recommended to use the default configuration and only have metrics enter the system via Heapster. If write access is enabled, any user will be able to write metrics to the system, which can affect performance and cause Cassandra disk usage to unpredictably increase.

The Hawkular Metrics documentation covers how to use the API, but there are a few differences when dealing with the version of Hawkular Metrics configured for use on OpenShift Container Platform:

OpenShift Container Platform Projects and Hawkular Tenants

Hawkular Metrics is a multi-tenanted application. It is configured so that a project in OpenShift Container Platform corresponds to a tenant in Hawkular Metrics.

As such, when accessing metrics for a project named MyProject you must set the Hawkular-Tenant header to MyProject.

There is also a special tenant named _system which contains system level metrics. This requires either a cluster-reader or cluster-admin level privileges to access.

Authorization

The Hawkular Metrics service will authenticate the user against OpenShift Container Platform to determine if the user has access to the project it is trying to access.

Hawkular Metrics accepts a bearer token from the client and verifies that token with the OpenShift Container Platform server using a SubjectAccessReview. If the user has proper read privileges for the project, they are allowed to read the metrics for that project. For the _system tenant, the user requesting to read from this tenant must have cluster-reader permission.

When accessing the Hawkular Metrics API, you must pass a bearer token in the Authorization header.

Scaling OpenShift Container Platform Metrics Pods

One set of metrics pods (Cassandra/Hawkular/Heapster) is able to monitor at least 10,000 pods.

Autoscaling the metrics components, such as Hawkular and Heapster, is not supported by OpenShift Container Platform.

Pay attention to system load on nodes where OpenShift Container Platform metrics pods run. Use that information to determine if it is necessary to scale out a number of OpenShift Container Platform metrics pods and spread the load across multiple OpenShift Container Platform nodes. Scaling OpenShift Container Platform metrics heapster pods is not recommended.

Prerequisites

If persistent storage was used to deploy OpenShift Container Platform metrics, then you must create a persistent volume (PV) for the new Cassandra pod to use before you can scale out the number of OpenShift Container Platform metrics Cassandra pods. However, if Cassandra was deployed with dynamically provisioned PVs, then this step is not necessary.

Scaling the Cassandra Components

The Cassandra nodes use persistent storage, therefore scaling up or down is not possible with replication controllers.

Scaling a Cassandra cluster requires you to use the hawkular-cassandra-node template. By default, the Cassandra cluster is a single-node cluster.

To scale up the number of OpenShift Container Platform metrics hawkular pods to two replicas, run:

# oc scale -n openshift-infra --replicas=2 rc hawkular-metrics

Alternatively, update your inventory file and re-run the deployment.

If you add a new node to or remove an existing node from a Cassandra cluster, the data stored in the cluster rebalances across the cluster.

To scale down:

If remotely accessing the container, run the following for the Cassandra node you want to remove:
```
$ oc exec -it <hawkular-cassandra-pod> nodetool decommission
```
If locally accessing the container, run the following instead:
```
$ oc rsh <hawkular-cassandra-pod> nodetool decommission
```
This command can take a while to run since it copies data across the cluster. You can monitor the decommission progress with nodetool netstats -H.
Once the previous command succeeds, scale down the rc for the Cassandra instance to 0.
```
# oc scale -n openshift-infra --replicas=0 rc <hawkular-cassandra-rc>
```
This will remove the Cassandra pod.

If the scale down process completed and the existing Cassandra nodes are functioning as expected, you can also delete the rc for this Cassandra instance and its corresponding persistent volume claim (PVC). Deleting the PVC can permanently delete any data associated with this Cassandra instance, so if the scale down did not fully and successfully complete, you will not be able to recover the lost data.

Cleanup

You can remove everything deloyed by the metrics deployer by performing the following steps:

$ oc delete all,sa,templates,secrets,pvc --selector="metrics-infra"

To remove the deployer components, perform the following steps:

$ oc delete sa,secret metrics-deployer