Knative Serving provides automatic scaling, or autoscaling, for applications to match incoming demand. For example, if an application is receiving no traffic, and scale-to-zero is enabled, Knative Serving scales the application down to zero replicas. If scale-to-zero is disabled, the application is scaled down to the minimum number of replicas configured for applications on the cluster. Replicas can also be scaled up to meet demand if traffic to the application increases.
Autoscaling settings for Knative services can be global settings that are configured by cluster administrators, or per-revision settings that are configured for individual services. You can modify per-revision settings for your services by using the OpenShift Container Platform web console, by modifying the YAML file for your service, or by using the Knative (kn) CLI.
Any limits or targets that you set for a service are measured against a single instance of your application.
Scale bounds determine the minimum and maximum numbers of replicas that can serve an application at any given time. You can set scale bounds for an application to help prevent cold starts or control computing costs.
The minimum number of replicas that can serve an application is determined by the min-scale annotation. If scale to zero is not enabled, the min-scale value defaults to 1.
The min-scale value defaults to 0 replicas if the following conditions are met:
The min-scale annotation is not set
Scaling to zero is enabled
The class KPA is used
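As an illustration of the conditions above, a service spec that omits the min-scale annotation and uses the KPA class might look like the following. This is a sketch under assumptions: the autoscaling.knative.dev/class annotation and its kpa.autoscaling.knative.dev value come from upstream Knative and are not shown elsewhere in this document, and the image is the one used in the kn examples. Whether scale to zero is enabled is a cluster-level setting and does not appear in the service spec.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # Assumption: explicit KPA class; verify the key and value for your version.
        autoscaling.knative.dev/class: "kpa.autoscaling.knative.dev"
        # No min-scale annotation is set, so the default value applies.
    spec:
      containers:
        - image: quay.io/openshift-knative/knative-eventing-sources-event-display:latest
```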
Example service spec with min-scale annotation:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/min-scale: "0"
...
Using the Knative (kn) CLI to set the min-scale annotation provides a more streamlined and intuitive user interface than modifying YAML files directly. You can use the kn service command with the --scale-min flag to create or modify the min-scale value for a service.
Knative Serving is installed on the cluster.
You have installed the Knative (kn) CLI.
Set the minimum number of replicas for the service by using the --scale-min flag:
$ kn service create <service_name> --image <image_uri> --scale-min <integer>
$ kn service create example-service --image quay.io/openshift-knative/knative-eventing-sources-event-display:latest --scale-min 2
The maximum number of replicas that can serve an application is determined by the max-scale annotation. If the max-scale annotation is not set, there is no upper limit for the number of replicas created.
Example service spec with max-scale annotation:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/max-scale: "10"
...
Using the Knative (kn) CLI to set the max-scale annotation provides a more streamlined and intuitive user interface than modifying YAML files directly. You can use the kn service command with the --scale-max flag to create or modify the max-scale value for a service.
Knative Serving is installed on the cluster.
You have installed the Knative (kn) CLI.
Set the maximum number of replicas for the service by using the --scale-max flag:
$ kn service create <service_name> --image <image_uri> --scale-max <integer>
$ kn service create example-service --image quay.io/openshift-knative/knative-eventing-sources-event-display:latest --scale-max 10
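Both scale bounds can be set on the same revision template. The following is a minimal sketch, assuming the example-service name and image from the commands above and illustrative bounds of 2 and 10 replicas:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # Keep at least 2 replicas running to help avoid cold starts.
        autoscaling.knative.dev/min-scale: "2"
        # Never run more than 10 replicas, to control computing costs.
        autoscaling.knative.dev/max-scale: "10"
    spec:
      containers:
        - image: quay.io/openshift-knative/knative-eventing-sources-event-display:latest
```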
Concurrency determines the number of simultaneous requests that can be processed by each replica of an application at any given time. Concurrency can be configured as a soft limit or a hard limit:
A soft limit is a target for the number of requests, rather than a strictly enforced bound. For example, if there is a sudden burst of traffic, the soft limit target can be exceeded.
A hard limit is a strictly enforced upper bound on the number of requests. If concurrency reaches the hard limit, surplus requests are buffered and must wait until there is enough free capacity to process them.
A hard limit configuration is recommended only if your application has a clear use case for it. Specifying a low hard limit might have a negative impact on the throughput and latency of an application, and might cause cold starts.
If you set both a soft target and a hard limit, the autoscaler targets the soft target number of concurrent requests, but enforces the hard limit value as the maximum number of concurrent requests.
If the hard limit value is less than the soft limit value, the soft limit value is tuned down, because there is no need to target more requests than the number that can actually be handled.
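A soft target and a hard limit can be combined in one spec. The following is a minimal sketch, assuming illustrative values (a soft target of 40 and a hard limit of 50) and the example-service name and image used in the kn examples in this document:

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # Soft target: the concurrency the autoscaler aims for; bursts can exceed it.
        autoscaling.knative.dev/target: "40"
    spec:
      # Hard limit: strictly enforced maximum of concurrent requests per replica.
      containerConcurrency: 50
      containers:
        - image: quay.io/openshift-knative/knative-eventing-sources-event-display:latest
```

Because the soft target here is below the hard limit, it is not tuned down. If the two values were swapped, the soft target would be reduced as described above.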
A soft limit is a target for the number of requests, rather than a strictly enforced bound. For example, if there is a sudden burst of traffic, the soft limit target can be exceeded. You can specify a soft concurrency target for your Knative service by setting the autoscaling.knative.dev/target annotation in the spec, or by using the kn service command with the correct flags.
Optional: Set the autoscaling.knative.dev/target annotation for your Knative service in the spec of the Service custom resource:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target: "200"
Optional: Use the kn service command to specify the --concurrency-target flag:
$ kn service create <service_name> --image <image_uri> --concurrency-target <integer>
$ kn service create example-service --image quay.io/openshift-knative/knative-eventing-sources-event-display:latest --concurrency-target 50
A hard concurrency limit is a strictly enforced upper bound on the number of requests. If concurrency reaches the hard limit, surplus requests are buffered and must wait until there is enough free capacity to process them. You can specify a hard concurrency limit for your Knative service by modifying the containerConcurrency spec, or by using the kn service command with the correct flags.
Optional: Set the containerConcurrency spec for your Knative service in the spec of the Service custom resource:
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  template:
    spec:
      containerConcurrency: 50
The default value is 0, which means that there is no limit on the number of simultaneous requests that are permitted to flow into one replica of the service at a time.
A value greater than 0 specifies the exact number of requests that are permitted to flow into one replica of the service at a time. This example would enable a hard concurrency limit of 50 requests.
Optional: Use the kn service command to specify the --concurrency-limit flag:
$ kn service create <service_name> --image <image_uri> --concurrency-limit <integer>
$ kn service create example-service --image quay.io/openshift-knative/knative-eventing-sources-event-display:latest --concurrency-limit 50
The target-utilization-percentage value specifies the percentage of the concurrency limit that the autoscaler actually targets. This is also known as specifying the hotness at which a replica runs, which enables the autoscaler to scale up before the defined hard limit is reached.
For example, if the containerConcurrency value is set to 10, and the target-utilization-percentage value is set to 70 percent, the autoscaler creates a new replica when the average number of concurrent requests across all existing replicas reaches 7. Requests numbered 7 to 10 are still sent to the existing replicas, but additional replicas are started in anticipation of being required after the containerConcurrency value is reached.
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  template:
    metadata:
      annotations:
        autoscaling.knative.dev/target-utilization-percentage: "70"
...
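The worked example above, a containerConcurrency value of 10 with a 70 percent target utilization, might be written as a single spec. This is a sketch that assumes the example-service name and image used in the kn examples in this document. With these values the autoscaler targets 10 × 0.70 = 7 concurrent requests per replica.

```yaml
apiVersion: serving.knative.dev/v1
kind: Service
metadata:
  name: example-service
  namespace: default
spec:
  template:
    metadata:
      annotations:
        # Target 70 percent of the hard limit, that is, 7 concurrent requests.
        autoscaling.knative.dev/target-utilization-percentage: "70"
    spec:
      # Hard limit: at most 10 concurrent requests per replica.
      containerConcurrency: 10
      containers:
        - image: quay.io/openshift-knative/knative-eventing-sources-event-display:latest
```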