Recommended cluster scaling practices | Scalability and performance

The guidance in this section is only relevant for installations with cloud provider integration.

Apply the following best practices to scale the number of worker machines in your OpenShift Container Platform cluster. You scale the worker machines by increasing or decreasing the number of replicas that are defined in the worker MachineSet.

Recommended practices for scaling the cluster

When scaling up the cluster to higher node counts:

Spread nodes across all of the available zones for higher availability.
Scale up by no more than 25 to 50 machines at once.
Consider creating new MachineSets in each available zone with alternative instance types of similar size to help mitigate any periodic provider capacity constraints. For example, on AWS, use m5.large and m5d.large.

Cloud providers might implement a quota for API services. Therefore, gradually scale the cluster.

If you set the replicas in the MachineSets to much higher numbers all at once, the controller might not be able to create the machines. The process is limited by the number of requests that the cloud platform hosting OpenShift Container Platform can handle. As the controller creates, checks, and updates the status of machines, it sends more queries to the cloud platform. Because the cloud platform enforces API request limits, excessive queries might cause machine creation to fail.
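As an illustration of gradual scaling, the following sketch prints the `oc scale` commands for growing a MachineSet in batches. The function name `plan_scale_steps` and its arguments are hypothetical helpers, not part of `oc`, and the default batch size of 50 is taken from the upper bound recommended above. The function only prints commands; it does not run them.

```shell
#!/usr/bin/env bash
# Sketch: print the "oc scale" commands needed to grow a MachineSet from
# CURRENT to TARGET replicas in batches of at most STEP machines.
# plan_scale_steps is a hypothetical helper; it prints commands only.
plan_scale_steps() {
  local machineset=$1 current=$2 target=$3 step=${4:-50}
  if [ "$step" -lt 1 ]; then
    echo "step must be at least 1" >&2
    return 1
  fi
  local next=$current
  while [ "$next" -lt "$target" ]; do
    next=$(( next + step ))
    # Never overshoot the requested target.
    if [ "$next" -gt "$target" ]; then
      next=$target
    fi
    echo "oc scale --replicas=${next} machineset ${machineset} -n openshift-machine-api"
  done
}

# Example: plan growth from 10 to 120 replicas, at most 50 machines per step.
plan_scale_steps "<machineset>" 10 120 50
```

With these example values, the sketch prints three commands, for 60, 110, and 120 replicas. Run each command separately and wait for the new machines to start before running the next one.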

Enable machine health checks when scaling to large node counts. In case of failures, the health checks monitor the condition and automatically repair unhealthy machines.

Modifying a MachineSet

To make changes to a MachineSet, edit the MachineSet YAML. Changes you make to a MachineSet do not affect existing machines, so you must then remove all machines associated with the MachineSet, either by deleting each machine or by scaling down the MachineSet to 0 replicas. Finally, scale the replicas back to the desired number.

If you need to scale a MachineSet without making other changes, you do not need to delete the machines.

By default, the OpenShift Container Platform router pods are deployed on workers. Because the router is required to access some cluster resources, including the web console, do not scale the worker MachineSet to 0 unless you first relocate the router pods.

Prerequisites

Install an OpenShift Container Platform cluster and the oc command-line interface (CLI).
Log in to oc as a user with cluster-admin permission.

Procedure

Edit the MachineSet:

$ oc edit machineset <machineset> -n openshift-machine-api

Scale down the MachineSet to 0:

$ oc scale --replicas=0 machineset <machineset> -n openshift-machine-api

Or edit the MachineSet and set spec.replicas to 0:

$ oc edit machineset <machineset> -n openshift-machine-api

Wait for the machines to be removed.

Scale up the MachineSet as needed:

$ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api

Or edit the MachineSet and set spec.replicas to the desired number:

$ oc edit machineset <machineset> -n openshift-machine-api

Wait for the machines to start. The new machines contain the changes that you made to the MachineSet.
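The "Wait for the machines to be removed" step in the procedure above can be scripted. The following sketch polls until no machines remain for a MachineSet. The function name `wait_for_machines_removed`, the 10-second polling interval, and the limit of 60 attempts are hypothetical choices. The sketch also assumes that machines created by a MachineSet carry the `machine.openshift.io/cluster-api-machineset` label, as the sample selector later in this document suggests.

```shell
#!/usr/bin/env bash
# Sketch: poll until no Machine objects remain for a MachineSet.
# wait_for_machines_removed, the interval, and the attempt limit are
# hypothetical; the label selector is an assumption about machine labels.
wait_for_machines_removed() {
  local machineset=$1 interval=${2:-10} max_tries=${3:-60}
  local names remaining try
  for (( try = 1; try <= max_tries; try++ )); do
    # List the machines that belong to this MachineSet, one name per line.
    if ! names=$(oc get machines -n openshift-machine-api \
        -l "machine.openshift.io/cluster-api-machineset=${machineset}" \
        -o name); then
      echo "Failed to list machines for ${machineset}" >&2
      return 1
    fi
    # Count non-empty lines; grep exits non-zero when the count is 0.
    remaining=$(printf '%s' "$names" | grep -c . || true)
    if [ "$remaining" -eq 0 ]; then
      echo "No machines remain for ${machineset}"
      return 0
    fi
    echo "Waiting: ${remaining} machine(s) remain for ${machineset}"
    sleep "$interval"
  done
  echo "Timed out waiting for machines of ${machineset}" >&2
  return 1
}

# Usage, after scaling the MachineSet to 0:
#   wait_for_machines_removed <machineset>
```

The function returns a non-zero status on timeout or if the `oc` call fails, so a calling script can stop before scaling the MachineSet back up.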

About MachineHealthChecks

MachineHealthChecks automatically repair unhealthy Machines in a particular MachinePool.

To monitor machine health, you create a resource to define the configuration for a controller. You set a condition to check for, such as staying in the NotReady status for 15 minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.

You cannot apply a MachineHealthCheck to a machine with the master role.

The controller that observes a MachineHealthCheck resource checks for the status that you defined. If a machine fails the health check, it is automatically deleted and a new one is created to take its place. When a machine is deleted, you see a machine deleted event. To limit the disruptive impact of machine deletion, the controller drains and deletes only one node at a time. If the targeted pool of machines contains more unhealthy machines than the maxUnhealthy threshold allows, remediation stops so that manual intervention can take place.

To stop the check, you remove the resource.
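To make the maxUnhealthy threshold concrete, the following sketch works through the arithmetic for a percentage or an integer value. The function names are hypothetical, and the sketch assumes that a percentage is converted to a machine count by rounding down. This document does not state the rounding rule, so treat the results as illustrative rather than as the controller's exact behavior.

```shell
#!/usr/bin/env bash
# Sketch: decide whether remediation would proceed for a given maxUnhealthy.
# Assumption (not stated in this document): a percentage is converted to a
# machine count by rounding down. Both function names are hypothetical.
max_unhealthy_count() {
  local max_unhealthy=$1 total=$2
  case $max_unhealthy in
    *%) echo $(( total * ${max_unhealthy%\%} / 100 )) ;;  # percentage form
    *)  echo "$max_unhealthy" ;;                           # integer form
  esac
}

# Succeeds when the unhealthy count does not exceed the threshold.
remediation_allowed() {
  local max_unhealthy=$1 total=$2 unhealthy=$3
  [ "$unhealthy" -le "$(max_unhealthy_count "$max_unhealthy" "$total")" ]
}

# Example: 10 machines in the pool with maxUnhealthy set to 40%.
max_unhealthy_count "40%" 10   # prints 4
if remediation_allowed "40%" 10 5; then
  echo "remediate"
else
  echo "remediation stops"     # 5 unhealthy machines exceed the threshold of 4
fi
```

Under these assumptions, a pool of 10 machines tolerates up to 4 unhealthy machines before remediation stops.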

Sample MachineHealthCheck resource

The MachineHealthCheck resource resembles the following YAML file:

MachineHealthCheck

apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example # (1)
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: <role> # (2)
      machine.openshift.io/cluster-api-machine-type: <role> # (2)
      machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> # (3)
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s" # (4)
    status: "False"
  - type: "Ready"
    timeout: "300s" # (4)
    status: "Unknown"
  maxUnhealthy: "40%" # (5)

1	Specify the name of the MachineHealthCheck to deploy.
2	Specify a label for the machine pool that you want to check.
3	Specify the MachineSet to track in `<cluster_name>-<label>-<zone>` format. For example, `prod-node-us-east-1a`.
4	Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the Machine is remediated. Long timeouts can result in long periods of downtime for the workload on the unhealthy Machine.
5	Specify the number of unhealthy machines allowed in the targeted pool of machines. This can be set as a percentage or an integer.

The matchLabels are examples only; you must map your machine groups based on your specific needs.

Creating a MachineHealthCheck resource

You can create a MachineHealthCheck resource for all MachinePools in your cluster except the master pool.

Prerequisites

Install the oc command-line interface (CLI).

Procedure

Create a healthcheck.yml file that contains the definition of your MachineHealthCheck.
Apply the healthcheck.yml file to your cluster:
```
$ oc apply -f healthcheck.yml
```
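The healthcheck.yml file from the first step could be generated with a small script such as the following sketch, which fills in the sample resource shown earlier. The function name `write_healthcheck` is hypothetical, and the default values `worker` and `prod-node-us-east-1a` are illustrative only; replace them with the role and MachineSet name from your own cluster.

```shell
#!/usr/bin/env bash
# Sketch: write a MachineHealthCheck definition based on the sample resource.
# write_healthcheck is a hypothetical helper; "worker" and
# "prod-node-us-east-1a" are illustrative defaults, not required values.
write_healthcheck() {
  local file=${1:-healthcheck.yml}
  local role=${2:-worker}
  local machineset=${3:-prod-node-us-east-1a}
  cat > "$file" <<EOF
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
  name: example
  namespace: openshift-machine-api
spec:
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-machine-role: ${role}
      machine.openshift.io/cluster-api-machine-type: ${role}
      machine.openshift.io/cluster-api-machineset: ${machineset}
  unhealthyConditions:
  - type: "Ready"
    timeout: "300s"
    status: "False"
  - type: "Ready"
    timeout: "300s"
    status: "Unknown"
  maxUnhealthy: "40%"
EOF
}

# Usage:
#   write_healthcheck healthcheck.yml worker <machineset>
#   oc apply -f healthcheck.yml
```

Generating the file from a function keeps the selector labels consistent when you create one MachineHealthCheck per MachineSet.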