Aggregating Container Logs | Installation and Configuration

Overview
Pre-deployment Configuration
Deploying the EFK Stack
Post-deployment Configuration
- Elasticsearch
- Fluentd
- Kibana
- Cleanup
Upgrading
Troubleshooting Kibana
External Elasticsearch Instance with Fluentd

Overview

As an OpenShift Enterprise cluster administrator, you can deploy the EFK stack to aggregate logs for a range of OpenShift Enterprise services. Application developers can view the logs of the projects for which they have view access. The EFK stack aggregates logs from hosts and applications, including logs from multiple containers and from pods that have since been deleted.

The EFK stack is a modified version of the ELK stack and consists of:

Elasticsearch: An object store where all logs are stored.
Fluentd: Gathers logs from nodes and feeds them to Elasticsearch.
Kibana: A web UI for Elasticsearch.

Once deployed in a cluster, the stack aggregates logs from all nodes and projects into Elasticsearch, and provides a Kibana UI to view any logs. Cluster administrators can view all logs, but application developers can only view logs for projects they have permission to view. The stack components communicate securely.

Managing Docker Container Logs discusses the use of json-file logging driver options to manage container logs and prevent filling node disks.

Pre-deployment Configuration

Ensure that you have deployed a router for the cluster.
Ensure that you have the necessary storage for Elasticsearch. Note that each Elasticsearch replica requires its own storage volume. See Elasticsearch for more information.
Ansible-based installations should create the logging-deployer-template template in the openshift project. If it is missing, create it with the following command:
$ oc create -n openshift -f \
    /usr/share/openshift/examples/infrastructure-templates/enterprise/logging-deployer.yaml

Create a new project. Once deployed in a single project, the EFK stack collects logs for every project within your OpenShift Enterprise cluster. The examples in this topic use logging as the example project:

$ oadm new-project logging --node-selector=""
$ oc project logging

Specifying a non-empty node selector on the project is not recommended, because it would restrict where Fluentd can be deployed. Instead, supply node selectors to the deployer, which applies them to your other deployment configurations.

Create a secret to provide security-related files to the deployer. While the secret is necessary, the contents of the secret are optional, and will be generated for you if none are supplied.

You can supply the following files when creating a new secret:

File Name	Description
kibana.crt	A browser-facing certificate for the Kibana server.
kibana.key	A key to be used with the Kibana certificate.
kibana-ops.crt	A browser-facing certificate for the Ops Kibana server.
kibana-ops.key	A key to be used with the Ops Kibana certificate.
server-tls.json	JSON TLS options to override the Kibana server defaults. Refer to the Node.js documentation for available options.
ca.crt	A certificate for a CA that will be used to sign all certificates generated by the deployer.
ca.key	A matching CA key.


For example:

$ oc secrets new logging-deployer \
   kibana.crt=/path/to/cert kibana.key=/path/to/key

If a certificate file is not passed as a secret, the deployer will generate a self-signed certificate instead. However, a secret is still required for the deployer to run. In this case, you can create a "dummy" secret that does not specify a certificate value:

$ oc secrets new logging-deployer nothing=/dev/null
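If you keep your certificates in one directory, you can script the choice between the real secret and the dummy secret. The following is a minimal sketch and is not part of the deployer. The function name deployer_secret_cmd and the single-directory layout are assumptions. It prints the oc secrets new command using whichever of the optional files listed above exist, and falls back to nothing=/dev/null otherwise.

```shell
# Hypothetical helper, not part of OpenShift: print the "oc secrets new"
# command for the deployer secret, using whichever optional files exist
# in the given directory. Paths containing spaces are not supported.
deployer_secret_cmd() {
  cert_dir=$1
  args=""
  for f in kibana.crt kibana.key kibana-ops.crt kibana-ops.key \
           server-tls.json ca.crt ca.key; do
    if [ -f "$cert_dir/$f" ]; then
      args="$args $f=$cert_dir/$f"
    fi
  done
  # No files supplied: fall back to the dummy entry so the secret still exists.
  if [ -z "$args" ]; then
    args=" nothing=/dev/null"
  fi
  echo "oc secrets new logging-deployer$args"
}
```

Review the printed command before running it, for example with deployer_secret_cmd /path/to/certs | sh.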

Create the deployer service account:

$ oc create -f - <<API
apiVersion: v1
kind: ServiceAccount
metadata:
  name: logging-deployer
secrets:
- name: logging-deployer
API

The deployer creates a Fluentd service account, aggregated-logging-fluentd, that requires special privileges to operate Fluentd. Add this service account user to the privileged security context constraint:
$ oadm policy add-scc-to-user privileged \
    system:serviceaccount:logging:aggregated-logging-fluentd
Give the Fluentd service account permission to read labels from all pods:
$ oadm policy add-cluster-role-to-user cluster-reader \
    system:serviceaccount:logging:aggregated-logging-fluentd
In both commands, replace logging with the name of the project you created earlier.
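Both commands embed the project name in the service account user. A small helper can therefore print them for any project. This is a hypothetical convenience function: fluentd_sa_cmds is not part of OpenShift. It only prints the two oadm commands above with the project name substituted.

```shell
# Hypothetical helper: print the two oadm commands above for the
# aggregated-logging-fluentd service account in the given project.
fluentd_sa_cmds() {
  project=$1
  sa="system:serviceaccount:${project}:aggregated-logging-fluentd"
  echo "oadm policy add-scc-to-user privileged $sa"
  echo "oadm policy add-cluster-role-to-user cluster-reader $sa"
}
```

For example, fluentd_sa_cmds logging | sh runs both commands for the logging project.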

Deploying the EFK Stack

The EFK stack is deployed using a template.

Run the deployer, specifying at least the parameters in the following example (more are described in the table below):

$ oc new-app logging-deployer-template \
    --param KIBANA_HOSTNAME=kibana.example.com \
    --param ES_CLUSTER_SIZE=1 \
    --param PUBLIC_MASTER_URL=https://localhost:8443

Be sure to replace at least KIBANA_HOSTNAME and PUBLIC_MASTER_URL with values relevant to your deployment.

The available parameters are:

Variable Name	Description
PUBLIC_MASTER_URL	(Required with the oc process command) The external URL for the master. For OAuth use.
ENABLE_OPS_CLUSTER	If set to true, configures a second Elasticsearch cluster and Kibana for operations logs. Fluentd splits logs between the main cluster and a cluster reserved for operations logs (which consists of /var/log/messages on nodes and the logs from the projects default, openshift, and openshift-infra). This means a second Elasticsearch and Kibana are deployed. The deployments are distinguishable by the -ops included in their names and have parallel deployment options (the _OPS_ variants in this table).
KIBANA_HOSTNAME, KIBANA_OPS_HOSTNAME	(Required with the oc process command) The external host name for web clients to reach Kibana.
ES_CLUSTER_SIZE, ES_OPS_CLUSTER_SIZE	(Required with the oc process command) The number of instances of Elasticsearch to deploy. Redundancy requires at least three, and more can be used for scaling.
ES_INSTANCE_RAM, ES_OPS_INSTANCE_RAM	Amount of RAM to reserve per Elasticsearch instance. The default is 8G (for 8GB), and it must be at least 512M. Possible suffixes are G, g, M, and m.
ES_NODE_QUORUM, ES_OPS_NODE_QUORUM	The quorum required to elect a new master. This should be more than half the intended cluster size.
ES_RECOVER_AFTER_NODES, ES_OPS_RECOVER_AFTER_NODES	When restarting the cluster, require this many nodes to be present before starting recovery. Defaults to one less than the cluster size to allow for one missing node.
ES_RECOVER_EXPECTED_NODES, ES_OPS_RECOVER_EXPECTED_NODES	When restarting the cluster, wait for this number of nodes to be present before starting recovery. Defaults to the cluster size.
ES_RECOVER_AFTER_TIME, ES_OPS_RECOVER_AFTER_TIME	When restarting the cluster, this is a timeout for waiting for the expected number of nodes to be present. Defaults to "5m".
IMAGE_PREFIX	The prefix for logging component images. For example, setting the prefix to registry.access.redhat.com/openshift3/ose- results in image names such as registry.access.redhat.com/openshift3/ose-logging-deployment:latest.
IMAGE_VERSION	The version for logging component images. For example, setting the version to 3.1.1 results in registry.access.redhat.com/openshift3/logging-deployment:3.1.1.

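The following sketch illustrates the ES_INSTANCE_RAM rules by checking a value before it is passed to the deployer. The function name es_ram_ok is hypothetical. The check assumes that 1G equals 1024M and rejects leading zeros. The deployer's own validation may differ.

```shell
# Minimal sketch: succeed only if the value is a whole number followed by
# G, g, M, or m and amounts to at least 512M. Assumes 1G = 1024M; leading
# zeros are rejected to keep shell arithmetic in base 10.
es_ram_ok() {
  value=$1
  num=${value%?}              # everything except the last character
  suffix=${value#"$num"}      # the last character
  case $num in
    "" | *[!0-9]* | 0*) return 1 ;;
  esac
  case $suffix in
    G | g) mb=$((num * 1024)) ;;
    M | m) mb=$num ;;
    *) return 1 ;;
  esac
  [ "$mb" -ge 512 ]
}
```

For example, es_ram_ok 8G and es_ram_ok 512M succeed, while es_ram_ok 256m and es_ram_ok 1T fail.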

Running the deployer creates a deployer pod and prints its name.

Wait until the pod is running. It may take several minutes for OpenShift Enterprise to retrieve the deployer image from the registry. You can watch its progress with:

$ oc get pod/<pod_name> -w

If it seems to be taking too long to start, you can retrieve more details about the pod and any associated events with:

$ oc describe pod/<pod_name>

Once it is running, check the logs of the resulting pod to see whether the deployment was successful:

$ oc logs -f <pod_name>

The logs for the openshift and openshift-infra projects are automatically aggregated and grouped into the .operations item in the Kibana interface.

The project where you have deployed the EFK stack (logging, as documented here) is not aggregated into .operations and is found under its ID.

As a cluster administrator, deploy the logging-support-template template that the deployer created:

$ oc process logging-support-template | oc create -f -

Deployment of the logging components should begin automatically. However, deployment is triggered by tags being imported into the ImageStreams created in this step. Not all tags are imported automatically, so this mechanism has become unreliable as multiple versions have been released. You may therefore need to import the tags manually, as follows.

For each of the logging-auth-proxy, logging-kibana, logging-elasticsearch, and logging-fluentd ImageStreams, manually import the tag corresponding to the IMAGE_VERSION specified for (or defaulted by) the deployer:

$ oc import-image <name>:<version> --from <prefix><name>:<version>

For example:

$ oc import-image logging-auth-proxy:3.1.1 \
     --from registry.access.redhat.com/openshift3/logging-auth-proxy:3.1.1
$ oc import-image logging-kibana:3.1.1 \
     --from registry.access.redhat.com/openshift3/logging-kibana:3.1.1
$ oc import-image logging-elasticsearch:3.1.1 \
     --from registry.access.redhat.com/openshift3/logging-elasticsearch:3.1.1
$ oc import-image logging-fluentd:3.1.1 \
     --from registry.access.redhat.com/openshift3/logging-fluentd:3.1.1
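The four commands differ only in the ImageStream name, so a loop can generate them. In this sketch, print_import_cmds is a hypothetical helper. Its default prefix and version are simply the values from the example above. It prints the commands rather than running them.

```shell
# Minimal sketch: print one "oc import-image" command per logging ImageStream.
# Arguments: image prefix and version. The defaults are only the values used
# in the example above; pass your own IMAGE_PREFIX and IMAGE_VERSION.
print_import_cmds() {
  prefix=${1:-registry.access.redhat.com/openshift3/}
  version=${2:-3.1.1}
  for name in logging-auth-proxy logging-kibana logging-elasticsearch logging-fluentd; do
    echo "oc import-image ${name}:${version} --from ${prefix}${name}:${version}"
  done
}
```

For example, print_import_cmds registry.access.redhat.com/openshift3/ 3.1.1 | sh runs the four imports shown above.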

Post-deployment Configuration

Elasticsearch

A highly available environment requires at least three replicas of Elasticsearch, each on a different host. Elasticsearch replicas require their own storage, but an OpenShift Enterprise deployment configuration shares storage volumes between all its pods. Therefore, when scaling up, the EFK deployer gives each Elasticsearch replica its own deployment configuration.

Viewing all Elasticsearch Deployments

To view all current Elasticsearch deployments:

$ oc get dc --selector logging-infra=elasticsearch

Persistent Elasticsearch Storage

The deployer creates an ephemeral deployment in which all of a pod’s data is lost upon restart. For production usage, add a persistent storage volume to each Elasticsearch deployment configuration.

The best-performing volumes are local disks, if it is possible to use them. Doing so requires some preparation as follows.

The relevant service account must be given the privilege to mount and edit a local volume:
$ oadm policy add-scc-to-user privileged \
    system:serviceaccount:logging:aggregated-logging-elasticsearch
Replace logging with the name of the project you created earlier.

Each Elasticsearch replica definition must be patched to claim that privilege, for example:

$ for dc in $(oc get deploymentconfig --selector logging-infra=elasticsearch -o name); do
    oc scale $dc --replicas=0
    oc patch $dc \
       -p '{"spec":{"template":{"spec":{"containers":[{"name":"elasticsearch","securityContext":{"privileged": true}}]}}}}'
  done

The Elasticsearch pods must be located on the correct nodes to use the local storage, and should not move around even if those nodes are taken down for a period of time. This requires giving each Elasticsearch replica a node selector that is unique to the node where an administrator has allocated storage for it. See below for directions on setting a node selector.

Once these steps are taken, a local host mount can be applied to each replica as in this example (where we assume storage is mounted at the same path on each node):

$ for dc in $(oc get deploymentconfig --selector logging-infra=elasticsearch -o name); do
    oc volume $dc \
          --add --overwrite --name=elasticsearch-storage \
          --type=hostPath --path=/usr/local/es-storage
    oc scale $dc --replicas=1
  done

If using host mounts is impractical or undesirable, it may be necessary to attach block storage as a PersistentVolumeClaim as in the following example:

$ oc volume dc/logging-es-<unique> \
          --add --overwrite --name=elasticsearch-storage \
          --type=persistentVolumeClaim --claim-name=logging-es-1

Using NFS storage directly or as a PersistentVolume (or via other NAS such as Gluster) is not supported for Elasticsearch storage, as Lucene relies on filesystem behavior that NFS does not supply. Data corruption and other problems can occur. If NFS storage is a requirement, you can allocate a large file on that storage to serve as a storage device and treat it as a host mount on each host. For example:

$ truncate -s 1T /nfs/storage/elasticsearch-1
$ mkfs.xfs /nfs/storage/elasticsearch-1
$ mkdir -p /usr/local/es-storage
$ mount -o loop /nfs/storage/elasticsearch-1 /usr/local/es-storage
$ chown 1000:1000 /usr/local/es-storage

Then, use /usr/local/es-storage as a host-mount as described above. Performance under this solution is significantly worse than using actual local drives.

Node Selector

Because Elasticsearch can use a lot of resources, all members of a cluster should have low-latency network connections to each other. Ensure this by directing the instances to dedicated nodes, or to a dedicated region within your cluster, using a node selector.

To configure a node selector, edit each deployment configuration and add the nodeSelector parameter to specify the label of the desired nodes:

apiVersion: v1
kind: DeploymentConfig
spec:
  template:
    spec:
      nodeSelector:
        nodelabel: logging-es-node-1

Alternatively you can use the oc patch command:

$ oc patch dc/logging-es-<unique_name> \
   -p '{"spec":{"template":{"spec":{"nodeSelector":{"<label_name>":"<label_value>"}}}}}'
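When using local storage, each Elasticsearch deployment configuration needs a node selector that is unique to the node holding its storage. The following sketch assumes the hypothetical label key nodelabel, with values logging-es-node-1, logging-es-node-2, and so on, as in the YAML example above. You would first label each storage node to match, for example with oc label node <node_name> nodelabel=logging-es-node-1. The hypothetical helper es_node_selector_patches reads deployment configuration names on standard input and prints one oc patch command per name.

```shell
# Minimal sketch: read deployment configuration names, one per line, and print
# an "oc patch" command pinning the Nth one to nodes labeled
# nodelabel=logging-es-node-N. The label key and values are hypothetical.
es_node_selector_patches() {
  n=0
  while IFS= read -r dc; do
    [ -n "$dc" ] || continue    # skip blank lines
    n=$((n + 1))
    printf "oc patch %s -p '{\"spec\":{\"template\":{\"spec\":{\"nodeSelector\":{\"nodelabel\":\"logging-es-node-%d\"}}}}}'\n" "$dc" "$n"
  done
}
```

For example, oc get deploymentconfig --selector logging-infra=elasticsearch -o name | es_node_selector_patches prints the patches. Check that the numbering matches the labels you applied to the nodes before running them.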

Changing the Scale of Elasticsearch

If you need to scale up the number of Elasticsearch instances your cluster uses, it is not as simple as changing the number of Elasticsearch cluster nodes. This is due to the nature of persistent volumes and how Elasticsearch is configured to store its data and recover the cluster. Instead, you must create a deployment configuration for each Elasticsearch cluster node.

During installation, the deployer creates templates with the Elasticsearch configurations provided to it: logging-es-template, and also logging-es-ops-template if the deployer was run with ENABLE_OPS_CLUSTER=true.

The node quorum and recovery settings were initially set based on the ES_CLUSTER_SIZE (or ES_OPS_CLUSTER_SIZE) value provided to the deployer. Because the cluster size is changing, those values need to be updated.

Before changing the number of Elasticsearch cluster nodes, scale down the EFK stack to preserve log data, as described in Upgrading the EFK Logging Stack.

Edit the template of the cluster you are scaling up and change the following parameters to the desired values:
- NODE_QUORUM is the intended cluster size / 2 (rounded down) + 1. For an intended cluster size of 5, the quorum is 3.
- RECOVER_EXPECTED_NODES is the same as the intended cluster size.
- RECOVER_AFTER_NODES is the intended cluster size - 1.

$ oc edit template logging-es[-ops]-template

In addition to the template, every deployment configuration for that cluster needs the same three environment variable values updated. To edit each of the cluster's deployment configurations in turn, use the following command:

$ oc get dc -l component=es[-ops] -o name | xargs -r oc edit

To create an additional deployment configuration, run the following command against the template of the Elasticsearch cluster you want to scale up (logging-es-template or logging-es-ops-template):

$ oc new-app logging-es[-ops]-template

These deployments are named differently, but all have the logging-es prefix. Be aware of the cluster parameters (described in the deployer parameters) that depend on cluster size. They may need corresponding adjustment in the template and in the existing deployments.

After the intended number of deployment configurations are created, scale up your cluster, starting with Elasticsearch as described in Upgrading the EFK Logging Stack.
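The quorum and recovery values above follow directly from the intended cluster size. This minimal sketch prints them using the parameter names from the list above. It relies on integer division to round down. The name es_cluster_params is hypothetical.

```shell
# Minimal sketch: print the quorum and recovery values for an intended
# Elasticsearch cluster size, using the parameter names listed above.
es_cluster_params() {
  size=$1
  # Shell arithmetic is integer-only, so size / 2 rounds down.
  echo "NODE_QUORUM=$((size / 2 + 1))"
  echo "RECOVER_EXPECTED_NODES=$size"
  echo "RECOVER_AFTER_NODES=$((size - 1))"
}
```

For example, es_cluster_params 5 prints NODE_QUORUM=3, RECOVER_EXPECTED_NODES=5, and RECOVER_AFTER_NODES=4. These are the values to enter when editing the template and the deployment configurations.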

The oc new-app logging-es[-ops]-template command creates a deployment configuration without a persistent volume. If you want to create an Elasticsearch cluster node with a persistent volume attached to it, you can instead run the following command at creation time to create your deployment configuration with a persistent volume claim (PVC) attached:

$ oc process logging-es-template | oc volume -f - \
          --add --overwrite --name=elasticsearch-storage \
          --type=persistentVolumeClaim --claim-name=<your_pvc>

Fluentd

Once Elasticsearch is running, scale Fluentd to every node to feed logs into Elasticsearch. The following example is for an OpenShift Enterprise instance with three nodes:

$ oc scale dc/logging-fluentd --replicas=3

You need to scale Fluentd again whenever nodes are added or removed.
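You can derive the count from the node list to keep the number of Fluentd replicas equal to the number of nodes. This sketch assumes that every listed node should run Fluentd. It also assumes the node list comes from a command such as oc get nodes --no-headers. fluentd_scale_cmd is a hypothetical helper that only prints the oc scale command.

```shell
# Minimal sketch: count the non-empty node lines on standard input and print
# the matching "oc scale" command for Fluentd. Assumes every listed node
# should run Fluentd and that the input has no header line.
fluentd_scale_cmd() {
  count=$(grep -c . || true)   # grep -c prints 0 (and exits 1) on empty input
  echo "oc scale dc/logging-fluentd --replicas=$count"
}
```

For example, oc get nodes --no-headers | fluentd_scale_cmd prints the command for the current node count.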

When you make changes to any part of the EFK stack, specifically Elasticsearch or Fluentd, first scale Elasticsearch down to zero and change the Fluentd node selector so that it matches no nodes. Then make the changes and scale Elasticsearch and Fluentd back.

To scale Elasticsearch to zero:

$ oc scale --replicas=0 dc/<ELASTICSEARCH_DC>

Change the nodeSelector in the daemonset configuration so that it matches no nodes:

Check the current Fluentd node selector:

$ oc get ds logging-fluentd -o yaml |grep -A 1 Selector
     nodeSelector:
       logging-infra-fluentd: "true"

Use the oc patch command to modify the daemonset nodeSelector:

$ oc patch ds logging-fluentd -p '{"spec":{"template":{"spec":{"nodeSelector":{"nonexistlabel":"true"}}}}}'

Verify the new Fluentd node selector:

$ oc get ds logging-fluentd -o yaml |grep -A 1 Selector
     nodeSelector:
       nonexistlabel: "true"

Scale Elasticsearch back up from zero:

$ oc scale --replicas=<replicas> dc/<ELASTICSEARCH_DC>

Change nodeSelector in the daemonset configuration back to logging-infra-fluentd: "true".

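Use the oc patch command to modify the daemonset nodeSelector. Setting nonexistlabel to null removes the temporary label, because a patch otherwise keeps map keys that it does not mention:

$ oc patch ds logging-fluentd -p '{"spec":{"template":{"spec":{"nodeSelector":{"nonexistlabel":null,"logging-infra-fluentd":"true"}}}}}'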
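One helper can generate both nodeSelector patches. This is a sketch under an assumption worth checking on your cluster: oc patch merges the nodeSelector map, so a key that the patch does not mention is kept. For that reason, the on variant clears nonexistlabel with null. fluentd_selector_patch is a hypothetical name.

```shell
# Minimal sketch: print the nodeSelector patch for the Fluentd daemonset.
#   off: adds nonexistlabel: "true", which no node carries, so Fluentd stops.
#   on:  restores logging-infra-fluentd: "true" and removes nonexistlabel by
#        setting it to null, since a patch keeps map keys it does not mention.
fluentd_selector_patch() {
  case $1 in
    off) sel='"nonexistlabel":"true"' ;;
    on)  sel='"nonexistlabel":null,"logging-infra-fluentd":"true"' ;;
    *)   echo "usage: fluentd_selector_patch on|off" >&2; return 1 ;;
  esac
  printf '{"spec":{"template":{"spec":{"nodeSelector":{%s}}}}}\n' "$sel"
}
```

For example: oc patch ds logging-fluentd -p "$(fluentd_selector_patch off)" before the change, and the same command with on afterward.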

Kibana

To access the Kibana console from the OpenShift Enterprise web console, add the loggingPublicURL parameter in the /etc/origin/master/master-config.yaml file, with the URL of the Kibana console (the KIBANA_HOSTNAME parameter). The value must be an HTTPS URL:

...
assetConfig:
  ...
  loggingPublicURL: "https://kibana.example.com"
...
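A quick check can catch a non-HTTPS value before you edit master-config.yaml. In this sketch, logging_public_url_line is a hypothetical helper. It accepts only values beginning with https:// and prints the corresponding assetConfig line, indented two spaces as in the example above.

```shell
# Minimal sketch: accept only an https:// URL and print the assetConfig line
# for master-config.yaml, indented as in the example above.
logging_public_url_line() {
  case $1 in
    https://?*) printf '  loggingPublicURL: "%s"\n' "$1" ;;
    *) echo "loggingPublicURL must be an HTTPS URL: $1" >&2; return 1 ;;
  esac
}
```

For example, logging_public_url_line https://kibana.example.com prints the line used above, while an http:// value is rejected.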

Setting the loggingPublicURL parameter creates a View Archive button on the OpenShift Enterprise web console under the Browse → Pods → <pod_name> → Logs tab. This links to the Kibana console.

You can scale the Kibana deployment as usual for redundancy:

$ oc scale dc/logging-kibana --replicas=2

You can see the UI by visiting the site specified at the KIBANA_HOSTNAME variable.

See the Kibana documentation for more information on Kibana.

Cleanup

You can remove everything generated during the deployment while leaving other project contents intact:

$ oc delete all --selector logging-infra=kibana
$ oc delete all --selector logging-infra=fluentd
$ oc delete all --selector logging-infra=elasticsearch
$ oc delete all --selector logging-infra=curator
$ oc delete all,sa,oauthclient --selector logging-infra=support
$ oc delete secret logging-fluentd logging-elasticsearch \
    logging-es-proxy logging-kibana logging-kibana-proxy \
    logging-kibana-ops-proxy

Upgrading

To upgrade the EFK logging stack, see Manual Upgrades.

Troubleshooting Kibana

Using the Kibana console with OpenShift Enterprise can cause problems that are easily solved but are not accompanied by useful error messages. Check the following troubleshooting sections if you experience problems when deploying Kibana on OpenShift Enterprise:

Login Loop

The OAuth2 proxy on the Kibana console must share a secret with the master host’s OAuth2 server. If the secret is not identical on both servers, it can cause a login loop where you are continuously redirected back to the Kibana login page.

To fix this issue, delete the current oauthclient and create a new one from the same template as before:

$ oc delete oauthclient/kibana-proxy
$ oc process logging-support-template | oc create -f -

Cryptic Error When Viewing the Console

When attempting to visit the Kibana console, you may instead receive a browser error:

{"error":"invalid_request","error_description":"The request is missing a required parameter,
 includes an invalid parameter value, includes a parameter more than once, or is otherwise malformed."}

This can be caused by a mismatch between the OAuth2 client and server. The return address for the client must be in a whitelist so the server can securely redirect back after logging in.

Fix this issue by replacing the OAuth client entry:

$ oc delete oauthclient/kibana-proxy
$ oc process logging-support-template | oc create -f -

If the problem persists, check that you are accessing Kibana at a URL listed in the OAuth client. This issue can be caused by accessing the URL at a forwarded port, such as 1443 instead of the standard 443 HTTPS port. You can adjust the server whitelist by editing the OAuth client:

$ oc edit oauthclient/kibana-proxy

503 Error When Viewing the Console

If you receive a proxy error when viewing the Kibana console, it could be caused by one of two issues.

First, there may be no live Kibana pods for the service to route to. For example, if Elasticsearch is slow to start, Kibana may time out trying to reach it. Check whether the relevant service has any endpoints:

$ oc describe service logging-kibana
Name:                   logging-kibana
[...]
Endpoints:              <none>

If any Kibana pods are live, endpoints will be listed. If they are not, check the state of the Kibana pods and deployment. You may need to scale the deployment down and back up again.
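You can script the endpoint check by reading the oc describe output. This sketch assumes that the Endpoints field appears on its own line beginning with Endpoints:, as in the example above. service_has_endpoints is a hypothetical helper that succeeds only when that field lists something other than <none>.

```shell
# Minimal sketch: read "oc describe service" output on standard input and
# succeed only if the Endpoints field lists something other than <none>.
service_has_endpoints() {
  endpoints=$(sed -n 's/^Endpoints:[[:space:]]*//p' | head -n 1)
  [ -n "$endpoints" ] && [ "$endpoints" != "<none>" ]
}
```

For example: oc describe service logging-kibana | service_has_endpoints || echo "logging-kibana has no endpoints"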

Second, the route for accessing the Kibana service may be masked. This can happen if you perform a test deployment in one project, then deploy in a different project without completely removing the first deployment. When multiple routes specify the same host name, the default router routes only to the one created first. Check whether the problematic route is defined in multiple places:

$ oc get route --all-namespaces --selector logging-infra=support
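To spot a masked route quickly, look for a host name that more than one route claims. This sketch assumes the first three columns of the output are NAMESPACE, NAME, and HOST/PORT. duplicate_route_hosts is a hypothetical helper that prints each host that appears more than once.

```shell
# Minimal sketch: read "oc get route --all-namespaces" output on standard
# input (header line first) and print every host that more than one route
# claims. Assumes HOST/PORT is the third whitespace-separated column.
duplicate_route_hosts() {
  awk 'NR > 1 { count[$3]++ } END { for (h in count) if (count[h] > 1) print h }'
}
```

For example: oc get route --all-namespaces --selector logging-infra=support | duplicate_route_hosts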

External Elasticsearch Instance with Fluentd

It is possible to configure the Fluentd pod created with aggregated logging to connect to an externally hosted Elasticsearch instance.

Fluentd determines where to send its logs from the ES_HOST, ES_PORT, OPS_HOST, and OPS_PORT environment variables. It sends application logs to the ES_HOST destination and all operations logs to OPS_HOST. If you have an external Elasticsearch instance that will contain both application and operations logs, ensure that ES_HOST and OPS_HOST are the same and that ES_PORT and OPS_PORT are also the same.

If your externally hosted Elasticsearch does not use TLS, update the *_CLIENT_CERT, *_CLIENT_KEY, and *_CA variables to be empty. If it uses TLS but not mutual TLS, update the *_CLIENT_CERT and *_CLIENT_KEY variables to be empty, and patch or recreate the logging-fluentd secret with the appropriate *_CA for communicating with your Elasticsearch. If it uses mutual TLS, as the provided Elasticsearch does, you only need to patch or recreate the logging-fluentd secret with your client key, client certificate, and CA.
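These rules amount to a small table of environment values. This sketch prints the assignments for a single external instance that stores both application and operations logs. The variable names ES_CLIENT_CERT, ES_CLIENT_KEY, and ES_CA, and their OPS_ counterparts, are assumed from the *_CLIENT_CERT, *_CLIENT_KEY, and *_CA patterns above. Compare them with the environment of your logging-fluentd deployment configuration. fluentd_external_es_env is a hypothetical helper. For the tls and mutual modes, you still need to update the logging-fluentd secret as described above.

```shell
# Minimal sketch: print environment assignments for pointing Fluentd at one
# external Elasticsearch. MODE is none (no TLS), tls (server TLS only) or
# mutual (mutual TLS). ES_*/OPS_* certificate variable names are assumed.
fluentd_external_es_env() {
  host=$1 port=$2 mode=$3
  case $mode in
    none | tls | mutual) ;;
    *) echo "usage: fluentd_external_es_env HOST PORT none|tls|mutual" >&2; return 1 ;;
  esac
  for p in ES OPS; do
    # One external instance holds both log types, so both prefixes get
    # the same host and port.
    echo "${p}_HOST=$host"
    echo "${p}_PORT=$port"
    if [ "$mode" != mutual ]; then
      # No TLS or server-only TLS: Fluentd presents no client certificate.
      echo "${p}_CLIENT_CERT="
      echo "${p}_CLIENT_KEY="
    fi
    if [ "$mode" = none ]; then
      # Without TLS there is no CA to verify against.
      echo "${p}_CA="
    fi
  done
}
```

For example, fluentd_external_es_env es.example.com 9200 tls prints the values to set with oc edit dc/logging-fluentd. The host name is a placeholder, and 9200 is the Elasticsearch default HTTP port.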

You can use oc edit dc/logging-fluentd to update your Fluentd configuration. Scale the number of replicas down to 0 before editing the deployment configuration.

If you are not using the provided Kibana and Elasticsearch images, you will not have the same multi-tenant capabilities and your data will not be restricted by user access to a particular project.