Overview

As an OpenShift Enterprise cluster administrator, you can deploy the EFK stack to aggregate logs for a range of OpenShift Enterprise services. Application developers can view the logs of the projects for which they have view access. The EFK stack aggregates logs from hosts and applications, whether coming from multiple containers or even deleted pods.

The EFK stack is a modified version of the ELK stack and is comprised of:

  • Elasticsearch: An object store where all logs are stored.

  • Fluentd: Gathers logs from nodes and feeds them to Elasticsearch.

  • Kibana: A web UI for Elasticsearch.

Once deployed in a cluster, the stack aggregates logs from all nodes and projects into Elasticsearch, and provides a Kibana UI to view any logs. Cluster administrators can view all logs, but application developers can only view logs for projects they have permission to view. The stack components communicate securely.

Managing Docker Container Logs discusses the use of json-file logging driver options to manage container logs and prevent filling node disks.

Pre-deployment Configuration

  1. Ensure that you have deployed a router for the cluster.

  2. Ensure that you have the necessary storage for Elasticsearch. Note that each Elasticsearch replica requires its own storage volume. See Elasticsearch for more information.

  3. Ansible-based installs should create the logging-deployer-template template in the openshift project. Otherwise you can create it with the following command:

    $ oc create -n openshift -f \
        /usr/share/openshift/examples/infrastructure-templates/enterprise/logging-deployer.yaml
  4. Create a new project. Once implemented in a single project, the EFK stack collects logs for every project within your OpenShift Enterprise cluster. The examples in this topic use logging as an example project:

    $ oadm new-project logging --node-selector=""
    $ oc project logging

    Specifying a non-empty node selector on the project is not recommended, as this would restrict where Fluentd can be deployed. Instead, specify node selectors for the deployer to be applied to your other deployment configurations.

  5. Create a secret to provide security-related files to the deployer. Providing the secret is optional, and the objects will be randomly generated if not supplied.

    You can supply the following files when creating a new secret:

    File Name Description

    kibana.crt

    A browser-facing certificate for the Kibana server.

    kibana.key

    A key to be used with the Kibana certificate.

    kibana-ops.crt

    A browser-facing certificate for the Ops Kibana server.

    kibana-ops.key

    A key to be used with the Ops Kibana certificate.

    server-tls.json

    JSON TLS options to override the Kibana server defaults. Refer to Node.JS docs for available options.

    ca.crt

    A certificate for a CA that will be used to sign all certificates generated by the deployer.

    ca.key

    A matching CA key.

    For example:

    $ oc secrets new logging-deployer \
       kibana.crt=/path/to/cert kibana.key=/path/to/key

    If a certificate file is not passed as a secret, the deployer will generate a self-signed certificate instead. However, a secret is still required for the deployer to run. In this case, you can create a "dummy" secret that does not specify a certificate value:

    $ oc secrets new logging-deployer nothing=/dev/null
  6. Create the deployer service account:

    $ oc create -f - <<API
    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: logging-deployer
    secrets:
    - name: logging-deployer
    API

    Then, enable the logging-deployer service account to create the objects needed for a logging deployment:

    $ oc policy add-role-to-user edit --serviceaccount logging-deployer
  7. Enable the Fluentd service account, which the deployer will create, that requires special privileges to operate Fluentd. Add the service account user to the security context:

    $ oadm policy add-scc-to-user  \
        privileged system:serviceaccount:logging:aggregated-logging-fluentd (1)
    1 Use the new project you created earlier (e.g., logging) when specifying this service account.

    Give the Fluentd service account permission to read labels from all pods:

    $ oadm policy add-cluster-role-to-user cluster-reader \
        system:serviceaccount:logging:aggregated-logging-fluentd (1)
    1 Use the new project you created earlier (e.g., logging) when specifying this service account.

Deploying the EFK Stack

The EFK stack is deployed using a template.

  1. Run the deployer, specifying at least the parameters in the following example (more are described in the table below):

    $ oc new-app logging-deployer-template \
        --param KIBANA_HOSTNAME=kibana.example.com \
        --param ES_CLUSTER_SIZE=1 \
        --param PUBLIC_MASTER_URL=https://localhost:8443

    Be sure to replace at least KIBANA_HOSTNAME and PUBLIC_MASTER_URL with values relevant to your deployment.

    The available parameters are:

    Variable Name Description

    PUBLIC_MASTER_URL

    (Required with the oc new-app command) The external URL for the master. For OAuth use.

    ENABLE_OPS_CLUSTER

    If set to true, configures a second Elasticsearch cluster and Kibana for operations logs. Fluentd splits logs between the main cluster and a cluster reserved for operations logs (which consists of /var/log/messages on nodes and the logs from the projects default, openshift, and openshift-infra). This means a second Elasticsearch and Kibana are deployed. The deployments are distinguishable by the -ops included in their names and have parallel deployment options listed below.

    KIBANA_HOSTNAME, KIBANA_OPS_HOSTNAME

    (Required with the oc new-app command) The external host name for web clients to reach Kibana.

    ES_CLUSTER_SIZE, ES_OPS_CLUSTER_SIZE

    (Required with the oc new-app command) The number of instances of Elasticsearch to deploy. Redundancy requires at least three, and more can be used for scaling.

    ES_INSTANCE_RAM, ES_OPS_INSTANCE_RAM

    Amount of RAM to reserve per Elasticsearch instance. The default is 8G (for 8GB), and it must be at least 512M. Possible suffixes are G,g,M,m.

    ES_NODE_QUORUM, ES_OPS_NODE_QUORUM

    The quorum required to elect a new master. Should be more than half the intended cluster size.

    ES_RECOVER_AFTER_NODES, ES_OPS_RECOVER_AFTER_NODES

    When restarting the cluster, require this many nodes to be present before starting recovery. Defaults to one less than the cluster size to allow for one missing node.

    ES_RECOVER_EXPECTED_NODES, ES_OPS_RECOVER_EXPECTED_NODES

    When restarting the cluster, wait for this number of nodes to be present before starting recovery. By default, the same as the cluster size.

    ES_RECOVER_AFTER_TIME, ES_OPS_RECOVER_AFTER_TIME

    When restarting the cluster, this is a timeout for waiting for the expected number of nodes to be present. Defaults to "5m".

    IMAGE_PREFIX

    The prefix for logging component images. For example, setting the prefix to registry.access.redhat.com/openshift3/ose- creates registry.access.redhat.com/openshift3/ose-logging-deployer:latest.

    IMAGE_VERSION

    The version for logging component images. For example, setting the version to v3.2 creates registry.access.redhat.com/openshift3/ose-logging-deployer:v3.2.

    Running the deployer creates a deployer pod and prints its name.

  2. Wait until the pod is running. It may take several minutes for OpenShift Enterprise to retrieve the deployer image from the registry.

    The logs for the openshift and openshift-infra projects are automatically aggregated and grouped into the .operations item in the Kibana interface.

    The project where you have deployed the EFK stack (logging, as documented here) is not aggregated into .operations and is found under its ID.

    You can watch its progress with:

    $ oc get pod/<pod_name> -w

    If it seems to be taking too long to start, you can retrieve more details about the pod and any associated events with:

    $ oc describe pod/<pod_name>

    When it runs, you can check the logs of the resulting pod to see if the deployment was successful:

    $ oc logs -f <pod_name>
  3. As a cluster administrator, deploy the logging-support-template template that the deployer created:

    $ oc new-app logging-support-template

    Deployment of logging components should begin automatically. However, because deployment is triggered based on tags being imported into the ImageStreams created in this step, and not all tags are automatically imported, this mechanism has become unreliable as multiple versions are released. Therefore, manual importing may be necessary as follows.

    For each ImageStream logging-auth-proxy, logging-kibana, logging-elasticsearch, and logging-fluentd, manually import the tag corresponding to the IMAGE_VERSION specified (or defaulted) for the deployer.

    $ oc import-image <name>:<version> --from <prefix><name>:<tag>

    For example:

    $ oc import-image logging-auth-proxy:3.2.0 \
         --from registry.access.redhat.com/openshift3/logging-auth-proxy:3.2.0
    $ oc import-image logging-kibana:3.2.0 \
         --from registry.access.redhat.com/openshift3/logging-kibana:3.2.0
    $ oc import-image logging-elasticsearch:3.2.0 \
         --from registry.access.redhat.com/openshift3/logging-elasticsearch:3.2.0
    $ oc import-image logging-fluentd:3.2.0 \
         --from registry.access.redhat.com/openshift3/logging-fluentd:3.2.0

Post-deployment Configuration

Elasticsearch

A highly-available environment requires at least three replicas of Elasticsearch; each on a different host. Elasticsearch replicas require their own storage, but an OpenShift Enterprise deployment configuration shares storage volumes between all its pods. So, when scaled up, the EFK deployer ensures each replica of Elasticsearch has its own deployment configuration.

Viewing all Elasticsearch Deployments

To view all current Elasticsearch deployments:

$ oc get dc --selector logging-infra=elasticsearch

Persistent Elasticsearch Storage

The deployer creates an ephemeral deployment in which all of a pod’s data is lost upon restart. For production usage, add a persistent storage volume to each Elasticsearch deployment configuration.

The best-performing volumes are local disks, if it is possible to use them. Doing so requires some preparation as follows.

  1. The relevant service account must be given the privilege to mount and edit a local volume, as follows:

    $ oadm policy add-scc-to-user privileged  \
           system:serviceaccount:logging:aggregated-logging-elasticsearch (1)
    1 Use the new project you created earlier (e.g., logging) when specifying this service account.
  2. Each Elasticsearch replica definition must be patched to claim that privilege, for example:

    $ for dc in $(oc get deploymentconfig --selector logging-infra=elasticsearch -o name); do
        oc scale $dc --replicas=0
        oc patch $dc \
           -p '{"spec":{"template":{"spec":{"containers":[{"name":"elasticsearch","securityContext":{"privileged": true}}]}}}}'
      done
  3. The Elasticsearch pods must be located on the correct nodes to use the local storage, and should not move around even if those nodes are taken down for a period of time. This requires giving each Elasticsearch replica a node selector that is unique to the node where an administrator has allocated storage for it. See below for directions on setting a node selector.

  4. Once these steps are taken, a local host mount can be applied to each replica as in this example (where we assume storage is mounted at the same path on each node):

    $ for dc in $(oc get deploymentconfig --selector logging-infra=elasticsearch -o name); do
        oc set volume $dc \
              --add --overwrite --name=elasticsearch-storage \
              --type=hostPath --path=/usr/local/es-storage
        oc scale $dc --replicas=1
      done

If using host mounts is impractical or undesirable, it may be necessary to attach block storage as a PersistentVolumeClaim as in the following example:

$ oc set volume dc/logging-es-<unique> \
          --add --overwrite --name=elasticsearch-storage \
          --type=persistentVolumeClaim --claim-name=logging-es-1

Using NFS storage directly or as a PersistentVolume (or via other NAS such as Gluster) is not supported for Elasticsearch storage, as Lucene relies on filesystem behavior that NFS does not supply. Data corruption and other problems can occur. If NFS storage is a requirement, you can allocate a large file on that storage to serve as a storage device and treat it as a host mount on each host. For example:

$ truncate -s 1T /nfs/storage/elasticsearch-1
$ mkfs.xfs /nfs/storage/elasticsearch-1
$ mount -o loop /nfs/storage/elasticsearch-1 /usr/local/es-storage
$ chown 1000:1000 /usr/local/es-storage

Then, use /usr/local/es-storage as a host-mount as described above. Performance under this solution is significantly worse than using actual local drives.

Node Selector

Because Elasticsearch can use a lot of resources, all members of a cluster should have low latency network connections to each other. Ensure this by directing the instances to dedicated nodes, or a dedicated region within your cluster, using a node selector.

To configure a node selector, edit each deployment configuration and add the nodeSelector parameter to specify the label of the desired nodes:

apiVersion: v1
kind: DeploymentConfig
spec:
  template:
    spec:
      nodeSelector:
        nodelabel: logging-es-node-1

Alternatively you can use the oc patch command:

$ oc patch dc/logging-es-<unique_name> \
   -p '{"spec":{"template":{"spec":{"nodeSelector":{"nodeLabel":"logging-es-node-1"}}}}}'

Changing the Scale of Elasticsearch

If you need to scale up the number of Elasticsearch instances your cluster uses, it is not as simple as changing the number of Elasticsearch cluster nodes. This is due to the nature of persistent volumes and how Elasticsearch is configured to store its data and recover the cluster. Instead, you must create a deployment configuration for each Elasticsearch cluster node.

During installation, the deployer creates templates with the Elasticsearch configurations provided to it: logging-es-template and logging-es-ops-template if the deployer was run with ENABLE_OPS_CLUSTER=true.

The node quorum and recovery settings were initially set based on the CLUSTER_SIZE value provided to the deployer. Since the cluster size is changing, those values need to be updated.

  1. Prior to changing the number of Elasticsearch cluster nodes, the EFK stack should first be scaled down to preserve log data as described in Upgrading the EFK Logging Stack.

  2. Edit the cluster template you are scaling up and change the parameters to the desired value:

    • NODE_QUORUM is the intended cluster size / 2 (rounded down) + 1. For an intended cluster size of 5, the quorum would be 3.

    • RECOVER_EXPECTED_NODES is the same as the intended cluster size.

    • RECOVER_AFTER_NODES is the intended cluster size - 1.

      $ oc edit template logging-es[-ops]-template
  3. In addition to updating the template, all of the deployment configurations for that cluster also need to have the three environment variable values above updated. To edit each of the configurations for the cluster in series, you use the following.

    $ oc get dc -l component=es[-ops] -o name | xargs -r oc edit
  4. Create an additional deployment configuration, run the following command against the Elasticsearch cluster you want to to scale up for (logging-es-template or logging-es-ops-template).

    $ oc new-app logging-es[-ops]-template

    These deployments will be named differently, but all will have the logging-es prefix. Be aware of the cluster parameters (described in the deployer parameters) based on cluster size that may need corresponding adjustment in the template, as well as existing deployments.

  5. After the intended number of deployment configurations are created, scale up your cluster, starting with Elasticsearch as described in Upgrading the EFK Logging Stack.

    The oc new-app logging-es[-ops]-template command creates a deployment configuration with a persistent volume. If you want to create a Elasticsearch cluster node with a persistent volume attached to it, upon creation you can instead run the following command to create your deployment configuration with a persistent volume claim (PVC) attached.

    $ oc process logging-es-template | oc volume -f - \
              --add --overwrite --name=elasticsearch-storage \
              --type=persistentVolumeClaim --claim-name={your_pvc}`

Fluentd

Once Elasticsearch is running, scale Fluentd to every node to feed logs into Elasticsearch. The following example is for an OpenShift Enterprise instance with three nodes:

$ oc scale dc/logging-fluentd --replicas=3

You will need to scale Fluentd if nodes are added or subtracted.

Kibana

To access the Kibana console from the OpenShift Enterprise web console, add the loggingPublicURL parameter in the /etc/origin/master/master-config.yaml file, with the URL of the Kibana console (the KIBANA_HOSTNAME parameter). The value must be an HTTPS URL:

...
assetConfig:
  ...
  loggingPublicURL: "https://kibana.example.com"
...

Setting the loggingPublicURL parameter creates a View Archive button on the OpenShift Enterprise web console under the BrowsePods<pod_name>Logs tab. This links to the Kibana console.

You can scale the Kibana deployment as usual for redundancy:

$ oc scale dc/logging-kibana --replicas=2

You can see the UI by visiting the site specified at the KIBANA_HOSTNAME variable.

See the Kibana documentation for more information on Kibana.

Cleanup

You can remove everything generated during the deployment while leaving other project contents intact:

$ oc delete all --selector logging-infra=kibana
$ oc delete all --selector logging-infra=fluentd
$ oc delete all --selector logging-infra=elasticsearch
$ oc delete all --selector logging-infra=curator
$ oc delete all,sa,oauthclient --selector logging-infra=support
$ oc delete secret logging-fluentd logging-elasticsearch \
    logging-es-proxy logging-kibana logging-kibana-proxy \
    logging-kibana-ops-proxy

Upgrading

To upgrade the EFK logging stack, see Manual Upgrades.

Troubleshooting Kibana

Using the Kibana console with OpenShift Enterprise can cause problems that are easily solved, but are not accompanied with useful error messages. Check the following troubleshooting sections if you are experiencing any problems when deploying Kibana on OpenShift Enterprise:

Login Loop

The OAuth2 proxy on the Kibana console must share a secret with the master host’s OAuth2 server. If the secret is not identical on both servers, it can cause a login loop where you are continuously redirected back to the Kibana login page.

To fix this issue, delete the current oauthclient, and create a new one, using the same template as before:

$ oc delete oauthclient/kibana-proxy
$ oc new-app logging-support-template

Cryptic Error When Viewing the Console

When attempting to visit the Kibana console, you may instead receive a browser error:

{"error":"invalid_request","error_description":"The request is missing a required parameter,
 includes an invalid parameter value, includes a parameter more than once, or is otherwise malformed."}

This can be caused by a mismatch between the OAuth2 client and server. The return address for the client must be in a whitelist so the server can securely redirect back after logging in.

Fix this issue by replacing the OAuth client entry:

$ oc delete oauthclient/kibana-proxy
$ oc new-app logging-support-template

If the problem persists, check that you are accessing Kibana at a URL listed in the OAuth client. This issue can be caused by accessing the URL at a forwarded port, such as 1443 instead of the standard 443 HTTPS port. You can adjust the server whitelist by editing the OAuth client:

$ oc edit oauthclient/kibana-proxy

503 Error When Viewing the Console

If you receive a proxy error when viewing the Kibana console, it could be caused by one of two issues.

First, Kibana may not be recognizing pods. If Elasticsearch is slow in starting up, Kibana may timeout trying to reach it. Check whether the relevant service has any endpoints:

$ oc describe service logging-kibana
Name:                   logging-kibana
[...]
Endpoints:              <none>

If any Kibana pods are live, endpoints will be listed. If they are not, check the state of the Kibana pods and deployment. You may need to scale the deployment down and back up again.

The second possible issue may be caused if the route for accessing the Kibana service is masked. This can happen if you perform a test deployment in one project, then deploy in a different project without completely removing the first deployment. When multiple routes are sent to the same destination, the default router will only route to the first created. Check the problematic route to see if it is defined in multiple places:

$ oc get route  --all-namespaces --selector logging-infra=support

Sending Logs to an External Elasticsearch Instance

Fluentd sends logs to the value of the ES_HOST, ES_PORT, OPS_HOST, and OPS_PORT environment variables of the Elasticsearch deployment configuration. The application logs are directed to the ES_HOST destination, and operations logs to OPS_HOST.

To direct logs to a specific Elasticsearch instance, edit the deployment configuration and replace the value of the above variables with the desired instance:

$ oc edit dc/<deployment_configuration>

For an external Elasticsearch instance to contain both application and operations logs, you can set ES_HOST and OPS_HOST to the same destination, while ensuring that ES_PORT and OPS_PORT also have the same value.

If your externally hosted Elasticsearch instance does not use TLS, update the *_CLIENT_CERT, *_CLIENT_KEY, and *_CA variables to be empty. If it does use TLS, but not mutual TLS, update the *_CLIENT_CERT and *_CLIENT_KEY variables to be empty and patch or recreate the logging-fluentd secret with the appropriate *_CA value for communicating with your Elasticsearch instance. If it uses Mutual TLS as the provided Elasticsearch instance does, patch or recreate the logging-fluentd secret with your client key, client cert, and CA.

You can use oc edit dc/logging-fluentd to update your Fluentd configuration, making sure to first scale down your number of replicas to zero before editing the deployment configuration.

If you are not using the provided Kibana and Elasticsearch images, you will not have the same multi-tenant capabilities and your data will not be restricted by user access to a particular project.

Performing Elasticsearch Maintenance Operations

As of the Deployer version 3.2.0, an admin certificate, key, and CA that can be used to communicate with and perform administrative operations on Elasticsearch are provided within the logging-elasticsearch secret.

To confirm whether or not your EFK installation provides these, run:

$ oc describe secret logging-elasticsearch

If they are not available, refer to Manual Upgrades to ensure you are on the latest version first.

Connect to an Elasticsearch pod that is in the cluster on which you are attempting to perform maintenance.

To find a pod in a cluster use either:

$ oc get pods -l component=es -o name | head -1
$ oc get pods -l component=es-ops -o name | head -1

Then, connect to a pod:

$ oc rsh <your_Elasticsearch_pod>

Once connected to an Elasticsearch container, you can use the certificates mounted from the secret to communicate with Elasticsearch per its 1.5 Document APIs.

Fluentd sends its logs to Elasticsearch using the index format "{project_name}.{project_uuid}.YYYY.MM.DD" where YYYY.MM.DD is the date of the log record.

For example, to delete all logs for the logging project with uuid 3b3594fa-2ccd-11e6-acb7-0eb6b35eaee3 from June 15, 2016, we can run:

$ curl --key /etc/elasticsearch/keys/admin-key --cert /etc/elasticsearch/keys/admin-cert \
  --cacert /etc/elasticsearch/keys/admin-ca -XDELETE \
  "https://localhost:9200/logging.3b3594fa-2ccd-11e6-acb7-0eb6b35eaee3.2016.06.15"

Configuring Curator

With Aggregated Logging version 3.2.1, Curator is available for use as Tech Preview. To start it, after completing an installation using the 3.2.1 Deployer, scale up the Curator deployment configuration that was created. (It defaults to zero replicas.)

There should be one Curator pod running per Elasticsearch cluster. If you deployed aggregated logging with ENABLE_OPS_CLUSTER=true, then you will have a second deployment configuration (one for the ops cluster and one for the non-ops cluster).

$ oc scale dc/logging-curator --replicas=1
$ oc scale dc/logging-curator-ops --replicas=1

Curator allows administrators to configure scheduled Elasticsearch maintenance operations to be performed automatically on a per-project basis. It is scheduled to perform actions daily based on its configuration. Only one Curator pod is recommended per Elasticsearch cluster. Curator is configured via a mounted YAML configuration file with the following structure:

$PROJECT_NAME:
  $ACTION:
    $UNIT: $VALUE

$PROJECT_NAME:
  $ACTION:
    $UNIT: $VALUE
 ...

The available parameters are:

Variable Name Description

$PROJECT_NAME

The actual name of a project, such as myapp-devel. For OpenShift Enterprise operations logs, use the name .operations as the project name.

$ACTION

The action to take, currently only delete is allowed.

$UNIT

One of days, weeks, or months.

$VALUE

An integer for the number of units.

.defaults

Use .defaults as the $PROJECT_NAME to set the defaults for projects that are not specified.

runhour

(Number) the hour of the day in 24-hour format at which to run the Curator jobs. For use with .defaults.

runminute

(Number) the minute of the hour at which to run the Curator jobs. For use with .defaults.

For example, to configure Curator to

  • delete indices in the myapp-dev project older than 1 day

  • delete indices in the myapp-qe project older than 1 week

  • delete operations logs older than 8 weeks

  • delete all other projects indices after they are 30 days old

  • run the Curator jobs at midnight every day

you would use:

myapp-dev:
 delete:
   days: 1

myapp-qe:
  delete:
    weeks: 1

.operations:
  delete:
    weeks: 8

.defaults:
  delete:
    days: 30
  runhour: 0
  runminute: 0

When you use month as the $UNIT for an operation, Curator starts counting at the first day of the current month, not the current day of the current month. For example, if today is April 15, and you want to delete indices that are 2 months older than today (delete: months: 2), Curator does not delete indices that are dated older than February 15; it deletes indices older than February 1. That is, it goes back to the first day of the current month, then goes back two whole months from that date. If you want to be exact with Curator, it is best to use days (for example, delete: days: 30).

Creating the Curator Configuration

To create the Curator configuration:

  1. Create a YAML file with your configuration settings using your favorite editor.

  2. Create a secret from your created yaml file:

    $ oc secrets new index-management settings=</path/to/your/yaml/file>
  3. Mount your created secret as a volume in your Curator DC:

    $ oc volumes dc/logging-curator \
        --add \
        --type=secret \
        --secret-name=index-management \
        --mount-path=/etc/curator \
        --name=index-management \
        --overwrite

    The mount-path value (e.g. /etc/curator) must match the CURATOR_CONF_LOCATION in the environment.

You can also specify default values for the run hour, run minute, and age in days of the indices when processing the Curator template. Use CURATOR_RUN_HOUR and CURATOR_RUN_MINUTE to set the default runhour and runminute, and use CURATOR_DEFAULT_DAYS to set the default index age.