Performing latency tests for platform verification | Scalability and performance

Prerequisites for running latency tests
About discovery mode for latency tests
Measuring latency
Running the latency tests
- Running oslat
Generating a latency test failure report
Generating a JUnit latency test report
Running latency tests in a disconnected cluster
Troubleshooting errors with the cnf-tests container

You can use the Cloud-native Network Functions (CNF) tests image to run latency tests on a CNF-enabled OpenShift Container Platform cluster, where all the components required for running CNF workloads are installed. Run the latency tests to validate node tuning for your workload.

The cnf-tests container image is available at registry.redhat.io/openshift4/cnf-tests-rhel8:v4.8.

The cnf-tests image also includes several tests that are not supported by Red Hat at this time. Only the latency tests are supported by Red Hat.

Prerequisites for running latency tests

Your cluster must meet the following requirements before you can run the latency tests:

You have configured a performance profile with the Performance Addon Operator.
You have applied all the required CNF configurations in the cluster.
You have a pre-existing MachineConfigPool CR applied in the cluster. The default worker pool is worker-cnf.

Additional resources

For more information about creating the cluster performance profile, see Provisioning real-time and low latency workloads.

About discovery mode for latency tests

Use discovery mode to validate the functionality of a cluster without altering its configuration. Existing environment configurations are used for the tests. The tests can find the configuration items needed and use those items to execute the tests. If resources needed to run a specific test are not found, the test is skipped, providing an appropriate message to the user. After the tests are finished, no cleanup of the pre-configured configuration items is done, and the test environment can be immediately used for another test run.

When running the latency tests, always run the tests with -e DISCOVERY_MODE=true and -ginkgo.focus set to the appropriate latency test. If you do not run the latency tests in discovery mode, your existing live cluster performance profile configuration will be modified by the test run.

Limiting the nodes used during tests

The nodes on which the tests are executed can be limited by specifying a NODES_SELECTOR environment variable, for example, -e NODES_SELECTOR=node-role.kubernetes.io/worker-cnf. Any resources created by the test are limited to nodes with matching labels.

If you want to override the default worker pool, pass the -e ROLE_WORKER_CNF=<custom_worker_pool> variable to the command specifying an appropriate label.

Measuring latency

The cnf-tests image uses the oslat tool to measure the latency of the system.

oslat: Behaves similarly to a CPU-intensive DPDK application and measures all the interruptions and disruptions to the busy loop that simulates CPU heavy data processing.

The tests introduce the following environment variables:

Table 1. Latency test environment variables
Environment variables	Description
`LATENCY_TEST_DELAY`	Specifies the amount of time in seconds after which the test starts running. You can use the variable to allow the CPU manager reconcile loop to update the default CPU pool. The default value is 0.
`LATENCY_TEST_CPUS`	Specifies the number of CPUs that the pod running the latency tests uses. If you do not set the variable, the default configuration includes all isolated CPUs.
`LATENCY_TEST_RUNTIME`	Specifies the amount of time in seconds that the latency test must run. The default value is 300 seconds.
`OSLAT_MAXIMUM_LATENCY`	Specifies the maximum acceptable latency in microseconds for the `oslat` test results. If you do not set a value for `OSLAT_MAXIMUM_LATENCY`, the tool skips the comparison of the expected and the actual maximum latency.
`LATENCY_TEST_RUN`	Boolean parameter that indicates whether the tests should run. `LATENCY_TEST_RUN` is set to `false` by default. To run the latency tests, set this value to `true`.

Running the latency tests

Run the cluster latency tests to validate node tuning for your Cloud-native Network Functions (CNF) workload.

Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.

When executing podman commands as a non-root or non-privileged user, mounting paths can fail with permission denied errors. To make the podman command work, append :Z to the volumes creation; for example, -v $(pwd)/:/kubeconfig:Z. This allows podman to do the proper SELinux relabeling.

Procedure

Open a shell prompt in the directory containing the kubeconfig file.

You provide the test image with a kubeconfig file in current directory and its related $KUBECONFIG environment variable, mounted through a volume. This allows the running container to use the kubeconfig file from inside the container.

Run the latency tests by entering the following command:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
-e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true registry.redhat.io/openshift4/cnf-tests-rhel8:v4.8 \
/usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"

Optional: Append -ginkgo.dryRun to run the latency tests in dry-run mode. This is useful for checking what the tests run.
Optional: Append -ginkgo.v to run the tests with increased verbosity.

Optional: To run the latency tests against a specific performance profile, run the following command, substituting appropriate values:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
-e LATENCY_TEST_RUN=true -e LATENCY_TEST_RUNTIME=600 -e OSLAT_MAXIMUM_LATENCY=20 \
-e PERF_TEST_PROFILE=<performance_profile> registry.redhat.io/openshift4/cnf-tests-rhel8:v4.8 \
/usr/bin/test-run.sh -ginkgo.focus="[performance]\ Latency\ Test"

where:

<performance_profile>: Is the name of the performance profile you want to run the latency tests against.

For valid latency tests results, run the tests for at least 12 hours.

Running oslat

The oslat test simulates a CPU-intensive DPDK application and measures all the interruptions and disruptions to test how the cluster handles CPU heavy data processing.

Always run the latency tests with DISCOVERY_MODE=true set. If you don’t, the test suite will make changes to the running cluster configuration.

Prerequisites

You have logged in to registry.redhat.io with your Customer Portal credentials.
You have applied a cluster performance profile by using the Performance addon operator.

Procedure

To perform the oslat test, run the following command, substituting variable values as appropriate:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
-e LATENCY_TEST_RUN=true -e DISCOVERY_MODE=true -e ROLE_WORKER_CNF=worker-cnf \
-e LATENCY_TEST_CPUS=7 -e LATENCY_TEST_RUNTIME=600 -e OSLAT_MAXIMUM_LATENCY=20 \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.8 \
/usr/bin/test-run.sh -ginkgo.v -ginkgo.focus="oslat"

LATENCY_TEST_CPUS specifices the list of CPUs to test with the oslat command.

The command runs the oslat tool for 10 minutes (600 seconds). The test runs successfully when the maximum observed latency is lower than OSLAT_MAXIMUM_LATENCY (20 μs).

If the results exceed the latency threshold, the test fails.

For valid results, the test should run for at least 12 hours.

Example failure output

running /usr/bin//validationsuite -ginkgo.v -ginkgo.focus=oslat
I0829 12:36:55.386776       8 request.go:668] Waited for 1.000303471s due to client-side throttling, not priority and fairness, request: GET:https://api.cnfdc8.t5g.lab.eng.bos.redhat.com:6443/apis/authentication.k8s.io/v1?timeout=32s
Running Suite: CNF Features e2e validation
==========================================

Discovery mode enabled, skipping setup
running /usr/bin//cnftests -ginkgo.v -ginkgo.focus=oslat
I0829 12:37:01.219077      20 request.go:668] Waited for 1.050010755s due to client-side throttling, not priority and fairness, request: GET:https://api.cnfdc8.t5g.lab.eng.bos.redhat.com:6443/apis/snapshot.storage.k8s.io/v1beta1?timeout=32s
Running Suite: CNF Features e2e integration tests
=================================================
Random Seed: 1630240617
Will run 1 of 142 specs

SSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSSS
------------------------------
[performance] Latency Test with the oslat image
  should succeed
  /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:134
STEP: Waiting two minutes to download the latencyTest image
STEP: Waiting another two minutes to give enough time for the cluster to move the pod to Succeeded phase
Aug 29 12:37:59.324: [INFO]: found mcd machine-config-daemon-wf4w8 for node cnfdc8.clus2.t5g.lab.eng.bos.redhat.com

• Failure [49.246 seconds]
[performance] Latency Test
/go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:59
  with the oslat image
  /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:112
    should succeed [It]
    /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:134

    The current latency 27 is bigger than the expected one 20 (1)
    Expected
        <bool>: false
    to be true
 /go/src/github.com/openshift-kni/cnf-features-deploy/vendor/github.com/openshift-kni/performance-addon-operators/functests/4_latency/latency.go:168

Log file created at: 2021/08/29 13:25:21
Running on machine: oslat-57c2g
Binary: Built with gc go1.16.6 for linux/amd64
Log line format: [IWEF]mmdd hh:mm:ss.uuuuuu threadid file:line] msg
I0829 13:25:21.569182       1 node.go:37] Environment information: /proc/cmdline: BOOT_IMAGE=(hd0,gpt3)/ostree/rhcos-612d89f4519a53ad0b1a132f4add78372661bfb3994f5fe115654971aa58a543/vmlinuz-4.18.0-305.10.2.rt7.83.el8_4.x86_64 ip=dhcp random.trust_cpu=on console=tty0 console=ttyS0,115200n8 ostree=/ostree/boot.0/rhcos/612d89f4519a53ad0b1a132f4add78372661bfb3994f5fe115654971aa58a543/0 ignition.platform.id=openstack root=UUID=5a4ddf16-9372-44d9-ac4e-3ee329e16ab3 rw rootflags=prjquota skew_tick=1 nohz=on rcu_nocbs=1-3 tuned.non_isolcpus=000000ff,ffffffff,ffffffff,fffffff1 intel_pstate=disable nosoftlockup tsc=nowatchdog intel_iommu=on iommu=pt isolcpus=managed_irq,1-3 systemd.cpu_affinity=0,4,5,6,7,8,9,10,11,12,13,14,15,16,17,18,19,20,21,22,23,24,25,26,27,28,29,30,31,32,33,34,35,36,37,38,39,40,41,42,43,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,62,63,64,65,66,67,68,69,70,71,72,73,74,75,76,77,78,79,80,81,82,83,84,85,86,87,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103 default_hugepagesz=1G hugepagesz=2M hugepages=128 nmi_watchdog=0 audit=0 mce=off processor.max_cstate=1 idle=poll intel_idle.max_cstate=0
I0829 13:25:21.569345       1 node.go:44] Environment information: kernel version 4.18.0-305.10.2.rt7.83.el8_4.x86_64
I0829 13:25:21.569367       1 main.go:53] Running the oslat command with arguments \
[--duration 600 --rtprio 1 --cpu-list 4,6,52,54,56,58 --cpu-main-thread 2]
I0829 13:35:22.632263       1 main.go:59] Succeeded to run the oslat command: oslat V 2.00
Total runtime:    600 seconds
Thread priority:  SCHED_FIFO:1
CPU list:     4,6,52,54,56,58
CPU for main thread:  2
Workload:     no
Workload mem:     0 (KiB)
Preheat cores:    6

Pre-heat for 1 seconds...
Test starts...
Test completed.

        Core:  4 6 52 54 56 58
    CPU Freq:  2096 2096 2096 2096 2096 2096 (Mhz)
    001 (us):  19390720316 19141129810 20265099129 20280959461 19391991159 19119877333
    002 (us):  5304 5249 5777 5947 6829 4971
    003 (us):  28 14 434 47 208 21
    004 (us):  1388 853 123568 152817 5576 0
    005 (us):  207850 223544 103827 91812 227236 231563
    006 (us):  60770 122038 277581 323120 122633 122357
    007 (us):  280023 223992 63016 25896 214194 218395
    008 (us):  40604 25152 24368 4264 24440 25115
    009 (us):  6858 3065 5815 810 3286 2116
    010 (us):  1947 936 1452 151 474 361
  ...
     Minimum:  1 1 1 1 1 1 (us)
     Average:  1.000 1.000 1.000 1.000 1.000 1.000 (us)
     Maximum:  37 38 49 28 28 19 (us)
     Max-Min:  36 37 48 27 27 18 (us)
    Duration:  599.667 599.667 599.667 599.667 599.667 599.667 (sec)

1	In this example, the measured latency is outside the maximum allowed value.

Generating a latency test failure report

Use the following procedures to generate a JUnit latency test output and test failure report.

Prerequisites

You have installed the OpenShift CLI (oc).
You have logged in as a user with cluster-admin privileges.

Procedure

Create a test failure report with information about the cluster state and resources for troubleshooting by passing the --report parameter with the path to where the report is dumped:

$ podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/reportdest:<report_folder_path> \
-e KUBECONFIG=/kubeconfig/kubeconfig  -e DISCOVERY_MODE=true \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.8 \
/usr/bin/test-run.sh --report <report_folder_path> \
-ginkgo.focus="\[performance\]\ Latency\ Test"

where:

<report_folder_path>: Is the path to the folder where the report is generated.

Generating a JUnit latency test report

Use the following procedures to generate a JUnit latency test output and test failure report.

Prerequisites

You have installed the OpenShift CLI (oc).
You have logged in as a user with cluster-admin privileges.

Procedure

Create a JUnit-compliant XML report by passing the --junit parameter together with the path to where the report is dumped:

$ podman run -v $(pwd)/:/kubeconfig:Z -v $(pwd)/junitdest:<junit_folder_path> \
-e KUBECONFIG=/kubeconfig/kubeconfig  -e DISCOVERY_MODE=true \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.8 \
/usr/bin/test-run.sh --junit <junit_folder_path> \
-ginkgo.focus="\[performance\]\ Latency\ Test"

where:

<junit_folder_path>: Is the path to the folder where the junit report is generated

Running latency tests in a disconnected cluster

The CNF tests image can run tests in a disconnected cluster that is not able to reach external registries. This requires two steps:

Mirroring the cnf-tests image to the custom disconnected registry.
Instructing the tests to consume the images from the custom disconnected registry.

Mirroring the images to a custom registry accessible from the cluster

A mirror executable is shipped in the image to provide the input required by oc to mirror the test image to a local registry.

Run this command from an intermediate machine that has access to the cluster and registry.redhat.io:
```
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.8 \
/usr/bin/mirror -registry <disconnected_registry> | oc image mirror -f -
```
where:

<disconnected_registry>

Is the disconnected mirror registry you have configured, for example, my.local.registry:5000/.

When you have mirrored the cnf-tests image into the disconnected registry, you must override the original registry used to fetch the images when running the tests, for example:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
-e DISCOVERY_MODE=true -e IMAGE_REGISTRY="<disconnected_registry>" \
-e CNF_TESTS_IMAGE="cnf-tests-rhel8:v4.8" \
/usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"

Configuring the tests to consume images from a custom registry

You can run the latency tests using a custom test image and image registry using CNF_TESTS_IMAGE and IMAGE_REGISTRY variables.

To configure the latency tests to use a custom test image and image registry, run the following command:
```
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
-e IMAGE_REGISTRY="<custom_image_registry>" \
-e CNF_TESTS_IMAGE="<custom_cnf-tests_image>" \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.8 /usr/bin/test-run.sh
```
where:

<custom_image_registry>

is the custom image registry, for example, custom.registry:5000/.

<custom_cnf-tests_image>

is the custom cnf-tests image, for example, custom-cnf-tests-image:latest.

Mirroring images to the cluster internal registry

OpenShift Container Platform provides a built-in container image registry, which runs as a standard workload on the cluster.

Procedure

Gain external access to the registry by exposing it with a route:

$ oc patch configs.imageregistry.operator.openshift.io/cluster --patch '{"spec":{"defaultRoute":true}}' --type=merge

Fetch the registry endpoint by running the following command:

$ REGISTRY=$(oc get route default-route -n openshift-image-registry --template='{{ .spec.host }}')

Create a namespace for exposing the images:
```
$ oc create ns cnftests
```

Make the image stream available to all the namespaces used for tests. This is required to allow the tests namespaces to fetch the images from the cnf-tests image stream. Run the following commands:

$ oc policy add-role-to-user system:image-puller system:serviceaccount:cnf-features-testing:default --namespace=cnftests

$ oc policy add-role-to-user system:image-puller system:serviceaccount:performance-addon-operators-testing:default --namespace=cnftests

Retrieve the docker secret name and auth token by running the following commands:

$ SECRET=$(oc -n cnftests get secret | grep builder-docker | awk {'print $1'}

$ TOKEN=$(oc -n cnftests get secret $SECRET -o jsonpath="{.data['\.dockercfg']}" | base64 --decode | jq '.["image-registry.openshift-image-registry.svc:5000"].auth')

Create a dockerauth.json file, for example:

$ echo "{\"auths\": { \"$REGISTRY\": { \"auth\": $TOKEN } }}" > dockerauth.json

Do the image mirroring:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
registry.redhat.io/openshift4/cnf-tests-rhel8:4.8 \
/usr/bin/mirror -registry $REGISTRY/cnftests |  oc image mirror --insecure=true \
-a=$(pwd)/dockerauth.json -f -

Run the tests:

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
-e DISCOVERY_MODE=true -e IMAGE_REGISTRY=image-registry.openshift-image-registry.svc:5000/cnftests \
cnf-tests-local:latest /usr/bin/test-run.sh -ginkgo.focus="\[performance\]\ Latency\ Test"

Mirroring a different set of test images

You can optionally change the default upstream images that are mirrored for the latency tests.

Procedure

The mirror command tries to mirror the upstream images by default. This can be overridden by passing a file with the following format to the image:
```
[
    {
        "registry": "public.registry.io:5000",
        "image": "imageforcnftests:4.8"
    }
]
```

Pass the file to the mirror command, for example saving it locally as images.json. With the following command, the local path is mounted in /kubeconfig inside the container and that can be passed to the mirror command.

$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.8 /usr/bin/mirror \
--registry "my.local.registry:5000/" --images "/kubeconfig/images.json" \
|  oc image mirror -f -

Troubleshooting errors with the cnf-tests container

To run latency tests, the cluster must be accessible from within the cnf-tests container.

Prerequisites

You have installed the OpenShift CLI (oc).
You have logged in as a user with cluster-admin privileges.

Procedure

Verify that the cluster is accessible from inside the cnf-tests container by running the following command:
```
$ podman run -v $(pwd)/:/kubeconfig:Z -e KUBECONFIG=/kubeconfig/kubeconfig \
registry.redhat.io/openshift4/cnf-tests-rhel8:v4.8 \
oc get nodes
```
If this command does not work, an error related to spanning across DNS, MTU size, or firewall access might be occurring.