Scalability and Performance | Serving | Red Hat OpenShift Serverless 1.34

OpenShift Serverless consists of several components that have different resource requirements and scaling behaviors. These components are horizontally and vertically scalable, but their resource requirements and configuration depend heavily on the actual use case. The components fall into two groups:

Control-plane components: These components are responsible for observing and reacting to custom resources and continuously reconfiguring the system, for example, the controller pods.
Data-plane components: These components are directly involved in request and response handling, for example, the Knative Serving activator component.

The metrics and findings in this section were recorded using the following test setup:

A cluster running OpenShift Container Platform 4.13
The cluster running 4 compute nodes in AWS with a machine type of m6.xlarge
OpenShift Serverless 1.30

Overhead of OpenShift Serverless Serving

As components of OpenShift Serverless Serving are part of the data-plane, requests from clients are routed through:

The ingress-gateway (Kourier or Service Mesh)
The activator component
The queue-proxy sidecar container in each Knative Service

Each of these components introduces an additional network hop and performs additional tasks, for example, adding observability and request queuing. The following are the measured overheads:

Each additional network hop adds 0.5 ms to 1 ms of latency to a request. The activator component is not always part of the data-plane: this depends on the current load of the Knative Service and on whether the Knative Service was scaled to zero before the request.
Depending on the payload size, each of the components consumes up to 1 vCPU for handling 2500 requests per second.
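
The measured overheads can be turned into a rough estimate. The following Python sketch is illustrative only: the function names are hypothetical, it assumes that each component on the request path counts as one additional hop, and it assumes that CPU consumption grows linearly with the request rate, which the measurements above do not guarantee.

```python
def latency_overhead_ms(hops: int, per_hop_ms: tuple = (0.5, 1.0)) -> tuple:
    """Return the (low, high) added latency in ms for a number of extra hops."""
    low, high = per_hop_ms
    return hops * low, hops * high


def cpu_upper_bound_vcpu(requests_per_second: float, components: int,
                         rps_per_vcpu: float = 2500.0) -> float:
    """Upper bound on total vCPU, assuming linear scaling (an assumption)."""
    return components * requests_per_second / rps_per_vcpu


# Request path that includes the activator:
# ingress gateway, activator, queue-proxy = 3 components.
print(latency_overhead_ms(3))         # (1.5, 3.0)
print(cpu_upper_bound_vcpu(5000, 3))  # 6.0
```

When the activator is not on the request path, use 2 components instead of 3.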

Known limitations of OpenShift Serverless Serving

The maximum number of Knative Services that can be created is 3,000. This corresponds to the OpenShift Container Platform Kubernetes services limit of 10,000, since 1 Knative Service creates 3 Kubernetes services.
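
The arithmetic behind this limit can be checked with a few lines of Python. This is only a worked calculation using the two numbers stated above; the constant and function names are hypothetical.

```python
KUBERNETES_SERVICES_LIMIT = 10_000
KUBERNETES_SERVICES_PER_KNATIVE_SERVICE = 3


def kubernetes_services_used(knative_services: int) -> int:
    """Number of Kubernetes services created by a number of Knative Services."""
    return knative_services * KUBERNETES_SERVICES_PER_KNATIVE_SERVICE


# Arithmetic ceiling implied by the Kubernetes services limit.
print(KUBERNETES_SERVICES_LIMIT // KUBERNETES_SERVICES_PER_KNATIVE_SERVICE)  # 3333

# Kubernetes services consumed at the documented maximum of 3,000.
print(kubernetes_services_used(3000))  # 9000
```

The documented maximum of 3,000 sits below the arithmetic ceiling of 3,333, so 1,000 of the 10,000 Kubernetes services remain available for other workloads.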

Scaling and performance of OpenShift Serverless Serving

OpenShift Serverless Serving has to be scaled and configured based on the following parameters:

Number of Knative Services
Number of Revisions
Number of concurrent requests in the system
Size of payloads of the requests
The startup-latency and response latency of the Knative Service added by the user’s web application
Number of changes of the KnativeService custom resource (CR) over time

KnativeServing default configuration

By default, OpenShift Serverless Serving runs all components in high-availability mode with medium-sized CPU and memory requests and limits. This means that the spec.high-availability.replicas field in the KnativeServing CR is automatically set to 2 and all system components are scaled to two replicas. This configuration is suitable for medium workload scenarios and has been tested with:

170 Knative Services
1-2 Revisions per Knative Service
89 test scenarios mainly focused on testing the control plane
48 re-creating scenarios where Knative Services are deleted and re-created
41 stable scenarios, in which requests are slowly but continuously sent to the system

During these test cases, the system components effectively consumed:

Component	Measured Resources
Operator in project `openshift-serverless`	1 GB Memory, 0.2 Cores of CPU
Serving components in project `knative-serving`	5 GB Memory, 2.5 Cores of CPU

Minimal requirements of OpenShift Serverless Serving

While the default setup is suitable for medium-sized workloads, it might be over-sized for smaller setups or under-sized for high-workload scenarios. To configure OpenShift Serverless Serving for a minimal workload scenario, you need to know the idle consumption of the system components.

Idle consumption

The idle consumption is dependent on the number of Knative Services. The following memory usage has been measured for the components in the knative-serving and knative-serving-ingress OpenShift Container Platform projects:

Component	0 Services	100 Services	500 Services	1000 Services
`activator`	55Mi	86Mi	300Mi	450Mi
`autoscaler`	52Mi	102Mi	225Mi	350Mi
`controller`	100Mi	135Mi	310Mi	500Mi
`webhook`	60Mi	60Mi	60Mi	60Mi
`3scale-kourier-gateway`	20Mi	60Mi	190Mi	330Mi
`net-kourier-controller`	90Mi	170Mi	340Mi	430Mi

Either 3scale-kourier-gateway and net-kourier-controller components or istio-ingressgateway and net-istio-controller components are installed.

The memory consumption of net-istio is based on the total number of pods within the mesh.
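
For a number of Knative Services that falls between the measured columns, a rough idle-memory estimate can be obtained by interpolating the table. The following Python sketch assumes that memory grows linearly between the measured points, which the measurements do not confirm, and the function name is hypothetical. It refuses to extrapolate beyond 1000 Services.

```python
from bisect import bisect_right

# Measured idle memory in Mi, copied from the table above.
SERVICE_COUNTS = [0, 100, 500, 1000]
IDLE_MEMORY_MI = {
    "activator": [55, 86, 300, 450],
    "autoscaler": [52, 102, 225, 350],
    "controller": [100, 135, 310, 500],
    "webhook": [60, 60, 60, 60],
    "3scale-kourier-gateway": [20, 60, 190, 330],
    "net-kourier-controller": [90, 170, 340, 430],
}


def estimate_idle_memory_mi(component: str, services: int) -> float:
    """Linearly interpolate idle memory (Mi) between the measured points."""
    if not 0 <= services <= SERVICE_COUNTS[-1]:
        raise ValueError("outside the measured range of 0 to 1000 Services")
    values = IDLE_MEMORY_MI[component]
    # Index of the measured point at the upper end of the bracket.
    hi = min(bisect_right(SERVICE_COUNTS, services), len(SERVICE_COUNTS) - 1)
    lo = hi - 1
    x0, x1 = SERVICE_COUNTS[lo], SERVICE_COUNTS[hi]
    y0, y1 = values[lo], values[hi]
    return y0 + (y1 - y0) * (services - x0) / (x1 - x0)


print(estimate_idle_memory_mi("activator", 250))  # 166.25
print(estimate_idle_memory_mi("webhook", 700))    # 60.0
```

Treat the result as a starting point for memory requests only, and confirm it with measurements from your own cluster.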

Configuring Serving for minimal workloads

Procedure

You can configure Knative Serving for minimal workloads using the KnativeServing custom resource (CR):

A minimal workload configuration in KnativeServing CR

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 1 # (1)
  workloads:
    - name: activator
      replicas: 2 # (2)
      resources:
        - container: activator
          requests:
            cpu: 250m # (3)
            memory: 60Mi # (4)
          limits:
            cpu: 1000m
            memory: 600Mi
    - name: controller
      replicas: 1 # (5)
      resources:
        - container: controller
          requests:
            cpu: 10m
            memory: 100Mi
          limits: # (6)
            cpu: 200m
            memory: 300Mi
    - name: webhook
      replicas: 2
      resources:
        - container: webhook
          requests:
            cpu: 100m # (7)
            memory: 60Mi
          limits:
            cpu: 200m
            memory: 200Mi
  podDisruptionBudgets: # (8)
    - name: activator-pdb
      minAvailable: 1
    - name: webhook-pdb
      minAvailable: 1

1	Setting this to `1` scales all system components to one replica.
2	The activator should always be scaled to a minimum of `2` instances to avoid downtime.
3	Activator CPU requests should not be set lower than `250m`, because a `HorizontalPodAutoscaler` uses this value as a reference to scale up and down.
4	Adjust memory requests to the idle values from the previous table. Also adjust memory limits according to your expected load (this might need custom testing to find the best values).
5	One webhook and one controller are sufficient for a minimal-workload scenario.
6	These limits are sufficient for a minimal-workload scenario, but they might need adjustments depending on your concrete workload.
7	Webhook CPU requests should not be set lower than `100m`, because a `HorizontalPodAutoscaler` uses this value as a reference to scale up and down.
8	Set `minAvailable` in each `PodDisruptionBudget` to a value lower than `replicas` to avoid problems during node maintenance.
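
The numeric guidelines in these callouts can be checked mechanically before a configuration is applied. The following Python sketch is a hypothetical helper, not part of any OpenShift or Knative tooling. It works on a simplified dictionary rather than on the real KnativeServing CR schema, and it assumes that a budget named "<component>-pdb" guards the workload named "<component>", as in the example above.

```python
def cpu_to_millicores(quantity: str) -> int:
    """Convert a Kubernetes CPU quantity such as '250m' or '1' to millicores."""
    if quantity.endswith("m"):
        return int(quantity[:-1])
    return int(float(quantity) * 1000)


def check_minimal_config(workloads: dict, pdb_min_available: dict) -> list:
    """Return the violated guidelines; an empty list means none were found."""
    problems = []
    if workloads["activator"]["replicas"] < 2:
        problems.append("activator: use at least 2 replicas")
    if cpu_to_millicores(workloads["activator"]["cpu_request"]) < 250:
        problems.append("activator: CPU request is below 250m")
    if cpu_to_millicores(workloads["webhook"]["cpu_request"]) < 100:
        problems.append("webhook: CPU request is below 100m")
    for pdb_name, min_available in pdb_min_available.items():
        # Assumption: "<component>-pdb" guards the workload "<component>".
        component = pdb_name.removesuffix("-pdb")
        if min_available >= workloads[component]["replicas"]:
            problems.append(f"{pdb_name}: minAvailable must be lower than replicas")
    return problems


# Values copied from the minimal workload configuration above.
workloads = {
    "activator": {"replicas": 2, "cpu_request": "250m"},
    "controller": {"replicas": 1, "cpu_request": "10m"},
    "webhook": {"replicas": 2, "cpu_request": "100m"},
}
pdbs = {"activator-pdb": 1, "webhook-pdb": 1}
print(check_minimal_config(workloads, pdbs))  # []
```

The sketch requires Python 3.9 or later because it uses `str.removesuffix`.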

Configuring Serving for high workloads

You can configure Knative Serving for high workloads using the KnativeServing custom resource (CR). The following findings are relevant to configuring Knative Serving for a high workload:

These findings have been tested with request payload sizes of 0 to 32 KB. The Knative Service backends used in those tests had a startup latency between 0 and 10 seconds and response times between 0 and 5 seconds.

All data-plane components mainly increase their CPU usage as the request rate and payload size grow, so the CPU requests and limits have to be tested and potentially increased.
The activator component might also need more memory when it has to buffer more or bigger request payloads, so the memory requests and limits might need to be increased as well.
One activator pod can handle approximately 2500 requests per second before latency starts to increase and, at some point, errors occur.
One 3scale-kourier-gateway or istio-ingressgateway pod can also handle approximately 2500 requests per second before latency starts to increase and, at some point, errors occur.
Each of the data-plane components consumes up to 1 vCPU for handling 2500 requests per second. Note that this highly depends on the payload size and the response times of the Knative Service backend.
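
As a starting point for the `replicas` value of the activator or ingress gateway, the per-pod throughput above can be converted into a pod count. The following Python sketch is illustrative only. The function name and the `headroom_percent` parameter are assumptions, not documented settings, and the floor of two pods reflects the guidance to always run at least two instances.

```python
import math


def min_data_plane_pods(peak_rps: float, rps_per_pod: float = 2500.0,
                        headroom_percent: int = 0, floor: int = 2) -> int:
    """Estimate how many pods are needed to serve a peak request rate.

    headroom_percent reserves a share of each pod's capacity (an assumption
    for illustration); floor is the minimum number of pods to keep running.
    """
    if peak_rps < 0 or not 0 <= headroom_percent < 100:
        raise ValueError("peak_rps must be >= 0 and headroom_percent in [0, 100)")
    usable_rps = rps_per_pod * (100 - headroom_percent) / 100
    return max(floor, math.ceil(peak_rps / usable_rps))


print(min_data_plane_pods(1_000))                        # 2
print(min_data_plane_pods(10_000))                       # 4
print(min_data_plane_pods(10_000, headroom_percent=20))  # 5
```

Because the figure of 2500 requests per second depends on payload size and backend response times, validate the result with your own load tests.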

Fast startup and fast response times of your Knative Service user workloads are critical for good performance of the overall system. The Knative Serving components buffer incoming requests when the Knative Service user backend is scaling up or when request concurrency has reached its capacity. If your Knative Service user workload introduces long startup or request latency, it either overloads the activator component (when the CPU and memory configuration is too low) or leads to errors for the calling clients.

Procedure

To fine-tune your installation, use the previous findings combined with your own test results to configure the KnativeServing custom resource:

A high workload configuration in KnativeServing CR

apiVersion: operator.knative.dev/v1beta1
kind: KnativeServing
metadata:
  name: knative-serving
  namespace: knative-serving
spec:
  high-availability:
    replicas: 2 # (1)
  workloads:
    - name: component-name # (2)
      replicas: 2 # (3)
      resources:
        - container: container-name
          requests:
            cpu: # (4)
            memory:
          limits:
            cpu:
            memory:
  podDisruptionBudgets: # (5)
    - name: name-of-pod-disruption-budget
      minAvailable: 1

1	Set this parameter to at least `2` to make sure you always have at least two instances of every component running. You can also use `workloads` to override the replicas for certain components.
2	Use the `workloads` list to configure specific components. Use the `deployment` name of the component and set the `replicas` field.
3	For the `activator`, `webhook`, and `3scale-kourier-gateway` components, which use horizontal pod autoscalers (HPAs), the `replicas` field sets the minimum number of replicas. The actual number of replicas depends on the CPU load and scaling done by the HPAs.
4	Set the requested and limited CPU and memory according to at least the idle consumption while also taking the previous findings and your own test results into consideration.
5	Set `minAvailable` in each `PodDisruptionBudget` to a value lower than `replicas` to avoid problems during node maintenance. The default `minAvailable` is set to `1`, so if you increase the required replicas, you must also increase `minAvailable`.

As each environment is highly specific, it is essential to test and find your own ideal configuration. Use the monitoring and alerting functionality of OpenShift Container Platform to continuously monitor your actual resource consumption and make adjustments if needed.

If you are using the OpenShift Serverless and Service Mesh integration, additional CPU processing is added by the istio-proxy sidecar containers. For more information about this, see the Service Mesh documentation.
