In software systems, components can become unhealthy due to transient issues such as temporary connectivity loss, configuration errors, or problems with external dependencies. OpenShift Container Platform applications have a number of options to detect and handle unhealthy containers.

Understanding health checks

A health check periodically performs diagnostics on a running container using any combination of readiness, liveness, and startup probes.

You can include one or more probes in the specification for the pod that contains the container on which you want to perform the health checks.

If you want to add or edit health checks in an existing pod, you must edit the pod DeploymentConfig object or use the Developer perspective in the web console. You cannot use the CLI to add or edit health checks for an existing pod.
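As a rough illustration of where a probe sits when it is defined on the controlling object rather than on a running pod, the following sketch shows a DeploymentConfig with a readiness probe in its pod template. The selector label, container name, image, path, and port are hypothetical placeholders, not values taken from this document.

```yaml
# Sketch only: label, container name, image, path, and port are hypothetical.
apiVersion: apps.openshift.io/v1
kind: DeploymentConfig
metadata:
  name: my-application
spec:
  replicas: 1
  selector:
    app: my-application
  template:
    metadata:
      labels:
        app: my-application
    spec:
      containers:
      - name: my-container
        image: registry.example.com/my-app:latest
        readinessProbe:          # probes are defined per container
          httpGet:
            path: /healthz
            port: 8080
```

Because the probe is part of the pod template, pods created from the template after the change carry the probe.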

Readiness probe

A readiness probe determines if a container is ready to accept service requests. If the readiness probe fails for a container, the kubelet removes the pod from the list of available service endpoints.

After a failure, the probe continues to examine the pod. If the pod becomes available, the kubelet adds the pod to the list of available service endpoints.

Liveness probe

A liveness probe determines if a container is still running. If the liveness probe fails due to a condition such as a deadlock, the kubelet kills the container. The pod then responds based on its restart policy.

For example, a liveness probe on a pod with a restartPolicy of Always or OnFailure kills and restarts the container.
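A minimal sketch of how the restart policy and a liveness probe fit together in a pod specification follows. The pod name, container name, image, path, and port are hypothetical and chosen only for illustration.

```yaml
# Sketch only: pod name, container name, image, path, and port are hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: liveness-example
spec:
  restartPolicy: OnFailure     # set at the pod level; Always is the default
  containers:
  - name: my-container
    image: registry.example.com/my-app:latest
    livenessProbe:             # set at the container level
      httpGet:
        path: /healthz
        port: 8080
```

With a restartPolicy of Never, a container that fails its liveness probe is still killed but is not restarted.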

Startup probe

A startup probe indicates whether the application within a container is started. All other probes are disabled until the startup succeeds. If the startup probe does not succeed within a specified time period, the kubelet kills the container, and the container is subject to the pod restartPolicy.

Some applications can require additional start-up time on their first initialization. You can use a startup probe with a liveness or readiness probe to delay that probe long enough to handle lengthy start-up time using the failureThreshold and periodSeconds parameters.

For example, you can add a startup probe, with a failureThreshold of 30 failures and a periodSeconds of 10 seconds (30 * 10s = 300s) for a maximum of 5 minutes, to a liveness probe. After the startup probe succeeds the first time, the liveness probe takes over.
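The 5-minute startup budget described above could be expressed roughly as follows. This is a fragment of a container specification, not a complete object; the /healthz path and port 8080 are assumed for illustration.

```yaml
# Fragment of a container spec. Path and port are hypothetical.
startupProbe:
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30   # up to 30 failed attempts ...
  periodSeconds: 10      # ... 10 seconds apart: 30 * 10s = 300s
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  periodSeconds: 10      # begins only after the startup probe has succeeded
```

Keeping the slow-start allowance on the startup probe means the liveness probe can use a short failure window without killing a container that is still initializing.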

You can configure liveness, readiness, and startup probes with any of the following types of tests:

  • HTTP GET: When using an HTTP GET test, the test determines the healthiness of the container by using a web hook. The test is successful if the HTTP response code is between 200 and 399.

    You can use an HTTP GET test with applications that return HTTP status codes when completely initialized.

  • Container Command: When using a container command test, the probe executes a command inside the container. The probe is successful if the test exits with a 0 status.

  • TCP socket: When using a TCP socket test, the probe attempts to open a socket to the container. The container is only considered healthy if the probe can establish a connection. You can use a TCP socket test with applications that do not start listening until initialization is complete.
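As a sketch of how the three test types look in a container specification, the following fragment assigns one test type to each probe. Any test type can be used with any probe; the pairing here, along with the path, port, and file name, is arbitrary and hypothetical.

```yaml
# Fragment of a container spec; each probe takes exactly one test type.
# Path, port, and file name are hypothetical placeholders.
readinessProbe:
  httpGet:               # succeeds on an HTTP response code from 200 to 399
    path: /healthz
    port: 8080
livenessProbe:
  exec:                  # succeeds if the command exits with status 0
    command:
    - cat
    - /tmp/healthy
startupProbe:
  tcpSocket:             # succeeds if a TCP connection can be established
    port: 8080
```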

You can configure several fields to control the behavior of a probe:

  • initialDelaySeconds: The time, in seconds, after the container starts before the probe can be scheduled. The default is 0.

  • periodSeconds: The delay, in seconds, between performing probes. The default is 10.

  • timeoutSeconds: The number of seconds of inactivity after which the probe times out and the container is assumed to have failed. The default is 1.

  • successThreshold: The number of times that the probe must report success after a failure in order to reset the container status to successful. The value must be 1 for a liveness probe. The default is 1.

  • failureThreshold: The number of times that the probe is allowed to fail. The default is 3. After the specified attempts:

    • for a liveness probe, the container is restarted

    • for a readiness probe, the pod is marked Unready

    • for a startup probe, the container is killed and is subject to the pod’s restartPolicy

The timeoutSeconds parameter has no effect on the readiness and liveness probes for container command probes, as OpenShift Container Platform cannot time out on an exec call into the container. One way to implement a timeout in a container command probe is by using the timeout command to run your liveness or readiness probes, as shown in the examples.
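To make the fields above concrete, the following fragment writes out each field with the default value listed above. The path and port are assumptions for illustration; in practice you would set only the fields whose defaults you want to change.

```yaml
# Fragment of a container spec with the default values written out explicitly.
# Path and port are hypothetical.
readinessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 0   # probe can be scheduled as soon as the container starts
  periodSeconds: 10        # probe every 10 seconds
  timeoutSeconds: 1        # each attempt times out after 1 second
  successThreshold: 1      # one success resets the status after a failure
  failureThreshold: 3      # three consecutive failures mark the pod Unready
```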

Example probes

The following are samples of different probes as they would appear in an object specification.

Sample readiness probe with a container command test in a pod spec
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: health-check
  name: my-application
...
spec:
  containers:
  - name: goproxy-app (1)
    args:
    image: k8s.gcr.io/goproxy:0.1 (2)
    readinessProbe: (3)
      exec: (4)
        command: (5)
        - cat
        - /tmp/healthy
...
1 The container name.
2 The container image to deploy.
3 A readiness probe.
4 A container command test.
5 The commands to execute on the container.
Sample startup probe and liveness probe with HTTP GET tests in a pod spec
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: health-check
  name: my-application
...
spec:
  containers:
  - name: goproxy-app (1)
    args:
    image: k8s.gcr.io/goproxy:0.1 (2)
    livenessProbe: (3)
      httpGet: (4)
        scheme: HTTPS (5)
        path: /healthz
        port: 8080 (6)
        httpHeaders:
        - name: X-Custom-Header
          value: Awesome
    startupProbe: (7)
      httpGet: (8)
        path: /healthz
        port: 8080 (9)
      failureThreshold: 30 (10)
      periodSeconds: 10 (11)
...
1 The container name.
2 Specify the container image to deploy.
3 A liveness probe.
4 An HTTP GET test.
5 The internet scheme: HTTP or HTTPS. The default value is HTTP.
6 The port on which the container is listening.
7 A startup probe.
8 An HTTP GET test.
9 The port on which the container is listening.
10 The number of times to try the probe after a failure.
11 How often, in seconds, to perform the probe.
Sample liveness probe with a container command test that uses a timeout in a pod spec
apiVersion: v1
kind: Pod
metadata:
  labels:
    test: health-check
  name: my-application
...
spec:
  containers:
  - name: goproxy-app (1)
    args:
    image: k8s.gcr.io/goproxy:0.1 (2)
    livenessProbe: (3)
      exec: (4)
        command: (5)
        - /bin/bash
        - '-c'
        - timeout 60 /opt/eap/bin/livenessProbe.sh
      periodSeconds: 10 (6)
      successThreshold: 1 (7)
      failureThreshold: 3 (8)
...
1 The container name.
2 Specify the container image to deploy.
3 The liveness probe.
4 The type of probe, here a container command probe.
5 The command line to execute inside the container.
6 How often in seconds to perform the probe.
7 The number of consecutive successes needed to show success after a failure.
8 The number of times to try the probe after a failure.
Sample readiness probe and liveness probe with a TCP socket test in a deployment
kind: Deployment
apiVersion: apps/v1
...
spec:
...
  template:
    spec:
      containers:
        - resources: {}
          readinessProbe: (1)
            tcpSocket:
              port: 8080
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
          terminationMessagePath: /dev/termination-log
          name: ruby-ex
          livenessProbe: (2)
            tcpSocket:
              port: 8080
            initialDelaySeconds: 15
            timeoutSeconds: 1
            periodSeconds: 10
            successThreshold: 1
            failureThreshold: 3
...
1 The readiness probe.
2 The liveness probe.

Configuring health checks

To configure readiness, liveness, and startup probes, add one or more probes to the specification for the pod that contains the container on which you want to perform the health checks.

If you want to add or edit health checks in an existing pod, you must edit the pod DeploymentConfig object or use the Developer perspective in the web console. You cannot use the CLI to add or edit health checks for an existing pod.

Procedure

To add probes for a container:

  1. Create a Pod object to add one or more probes:

    apiVersion: v1
    kind: Pod
    metadata:
      labels:
        test: health-check
      name: my-application
    spec:
      containers:
      - name: my-container (1)
        args:
        image: k8s.gcr.io/goproxy:0.1 (2)
        livenessProbe: (3)
          tcpSocket:  (4)
            port: 8080 (5)
          initialDelaySeconds: 15 (6)
          timeoutSeconds: 1 (7)
        readinessProbe: (8)
          httpGet: (9)
            host: my-host (10)
            scheme: HTTPS (11)
            path: /healthz
            port: 8080 (12)
        startupProbe: (13)
          exec: (14)
            command: (15)
            - cat
            - /tmp/healthy
          failureThreshold: 30 (16)
          periodSeconds: 10 (17)
    1 Specify the container name.
    2 Specify the container image to deploy.
    3 Optional: Create a Liveness probe.
    4 Specify a test to perform, here a TCP Socket test.
    5 Specify the port on which the container is listening.
    6 Specify the time, in seconds, after the container starts before the probe can be scheduled.
    7 Specify the number of seconds of inactivity after which the probe times out.
    8 Optional: Create a Readiness probe.
    9 Specify the type of test to perform, here an HTTP test.
    10 Specify a host IP address. When host is not defined, the PodIP is used.
    11 Specify HTTP or HTTPS. When scheme is not defined, the HTTP scheme is used.
    12 Specify the port on which the container is listening.
    13 Optional: Create a Startup probe.
    14 Specify the type of test to perform, here a container command test.
    15 Specify the commands to execute on the container.
    16 Specify the number of times to try the probe after a failure.
    17 Specify how often, in seconds, to perform the probe.

    If the initialDelaySeconds value is lower than the periodSeconds value, the first Readiness probe occurs at some point between the two periods due to an issue with timers.

  2. Create the Pod object:

    $ oc create -f <file-name>.yaml

  3. Verify the state of the health check pod:

    $ oc describe pod my-application
    Example output
    Events:
      Type    Reason     Age   From                                  Message
      ----    ------     ----  ----                                  -------
      Normal  Scheduled  9s    default-scheduler                     Successfully assigned openshift-logging/liveness-exec to ip-10-0-143-40.ec2.internal
      Normal  Pulling    2s    kubelet, ip-10-0-143-40.ec2.internal  pulling image "k8s.gcr.io/liveness"
      Normal  Pulled     1s    kubelet, ip-10-0-143-40.ec2.internal  Successfully pulled image "k8s.gcr.io/liveness"
      Normal  Created    1s    kubelet, ip-10-0-143-40.ec2.internal  Created container
      Normal  Started    1s    kubelet, ip-10-0-143-40.ec2.internal  Started container

    The following is the output of a failed probe that restarted a container:

    Sample Liveness check output with unhealthy container
    $ oc describe pod pod1
    Example output
    ....
    
    Events:
      Type     Reason          Age                From                                               Message
      ----     ------          ----               ----                                               -------
      Normal   Scheduled       <unknown>                                                             Successfully assigned aaa/liveness-http to ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj
      Normal   AddedInterface  47s                multus                                             Add eth0 [10.129.2.11/23]
      Normal   Pulled          46s                kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj  Successfully pulled image "k8s.gcr.io/liveness" in 773.406244ms
      Normal   Pulled          28s                kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj  Successfully pulled image "k8s.gcr.io/liveness" in 233.328564ms
      Normal   Created         10s (x3 over 46s)  kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj  Created container liveness
      Normal   Started         10s (x3 over 46s)  kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj  Started container liveness
      Warning  Unhealthy       10s (x6 over 34s)  kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj  Liveness probe failed: HTTP probe failed with statuscode: 500
      Normal   Killing         10s (x2 over 28s)  kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj  Container liveness failed liveness probe, will be restarted
      Normal   Pulling         10s (x3 over 47s)  kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj  Pulling image "k8s.gcr.io/liveness"
      Normal   Pulled          10s                kubelet, ci-ln-37hz77b-f76d1-wdpjv-worker-b-snzrj  Successfully pulled image "k8s.gcr.io/liveness" in 244.116568ms