Adjust worker nodes

If you incorrectly sized the worker nodes during deployment, adjust them by creating one or more new MachineSets, scale them up, then scale the original MachineSet down before removing them.

Understanding the difference between MachineSets and the MachineConfigPool

MachineSets describe OpenShift Container Platform nodes with respect to the cloud or machine provider.

The MachineConfigPool allows MachineConfigController components to define and provide the status of machines in the context of upgrades.

The MachineConfigPool allows users to configure how upgrades are rolled out to the OpenShift Container Platform nodes in the MachineConfigPool.

NodeSelector can be replaced with a reference to MachineSets.

Scaling a MachineSet manually

If you must add or remove an instance of a machine in a MachineSet, you can manually scale the MachineSet.

This guidance is relevant to fully automated, installer provisioned infrastructure installations. Customized, user provisioned infrastructure installations does not have MachineSets.

Prerequisites
  • Install an OpenShift Container Platform cluster and the oc command line.

  • Log in to oc as a user with cluster-admin permission.

Procedure
  1. View the MachineSets that are in the cluster:

    $ oc get machinesets -n openshift-machine-api

    The MachineSets are listed in the form of <clusterid>-worker-<aws-region-az>.

  2. Scale the MachineSet:

    $ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api

    Or:

    $ oc edit machineset <machineset> -n openshift-machine-api

    You can scale the MachineSet up or down. It takes several minutes for the new machines to be available.

The MachineSet delete policy

Random, Newest, and Oldest are the three supported options. The default is Random, meaning that random machines are chosen and deleted when scaling MachineSets down. The delete policy can be set according to the use case by modifying the particular MachineSet:

spec:
  deletePolicy: <delete policy>
  replicas: <desired replica count>

Specific machines can also be prioritized for deletion by adding the annotation machine.openshift.io/cluster-api-delete-machine to the machine of interest, regardless of the delete policy.

By default, the OpenShift Container Platform router pods are deployed on workers. Because the router is required to access some cluster resources, including the web console, do not scale the worker MachineSet to 0 unless you first relocate the router pods.

Custom MachineSets can be used for use cases requiring that services run on specific nodes and that those services are ignored by the controller when the worker MachineSets are scaling down. This prevents service disruption.

Creating infrastructure MachineSets

You can create a MachineSet to host only infrastructure components. You apply specific Kubernetes labels to these Machines and then update the infrastructure components to run on only those Machines. These infrastructure nodes are not counted toward the total number of subscriptions that are required to run the environment.

OpenShift Container Platform infrastructure components

The following OpenShift Container Platform components are infrastructure components:

  • Kubernetes and OpenShift Container Platform control plane services that run on masters

  • The default router

  • The container image registry

  • The cluster metrics collection, or monitoring service

  • Cluster aggregated logging

  • Service brokers

Any node that runs any other container, pod, or component is a worker node that your subscription must cover.

Creating infrastructure MachineSets for production environments

In a production deployment, deploy at least three MachineSets to hold infrastructure components. Both the logging aggregation solution and the service mesh deploy Elasticsearch, and Elasticsearch requires three instances that are installed on different nodes. For high availability, install deploy these nodes to different availability zones. Since you need different MachineSets for each availability zone, create at least three MachineSets.

Creating a MachineSet

In addition to the ones created by the installation program, you can create your own MachineSets to dynamically manage the machine compute resources for specific workloads of your choice.

Prerequisites
  • Deploy an OpenShift Container Platform cluster.

  • Install the OpenShift CLI (oc).

  • Log in to oc as a user with cluster-admin permission.

Procedure
  1. Create a new YAML file that contains the MachineSet Custom Resource sample, as shown, and is named <file_name>.yaml.

    Ensure that you set the <clusterID> and <role> parameter values.

    1. If you are not sure about which value to set for a specific field, you can check an existing MachineSet from your cluster.

      $ oc get machinesets -n openshift-machine-api
      Example output
      NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
      agl030519-vplxk-worker-us-east-1a   1         1         1       1           55m
      agl030519-vplxk-worker-us-east-1b   1         1         1       1           55m
      agl030519-vplxk-worker-us-east-1c   1         1         1       1           55m
      agl030519-vplxk-worker-us-east-1d   0         0                             55m
      agl030519-vplxk-worker-us-east-1e   0         0                             55m
      agl030519-vplxk-worker-us-east-1f   0         0                             55m
    2. Check values of a specific MachineSet:

      $ oc get machineset <machineset_name> -n \
           openshift-machine-api -o yaml
      Example output
      ...
      template:
          metadata:
            labels:
              machine.openshift.io/cluster-api-cluster: agl030519-vplxk (1)
              machine.openshift.io/cluster-api-machine-role: worker (2)
              machine.openshift.io/cluster-api-machine-type: worker
              machine.openshift.io/cluster-api-machineset: agl030519-vplxk-worker-us-east-1a
      1 The cluster ID.
      2 A default node label.
  2. Create the new MachineSet:

    $ oc create -f <file_name>.yaml
  3. View the list of MachineSets:

    $ oc get machineset -n openshift-machine-api
    Example output
    NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
    agl030519-vplxk-infra-us-east-1a    1         1         1       1           11m
    agl030519-vplxk-worker-us-east-1a   1         1         1       1           55m
    agl030519-vplxk-worker-us-east-1b   1         1         1       1           55m
    agl030519-vplxk-worker-us-east-1c   1         1         1       1           55m
    agl030519-vplxk-worker-us-east-1d   0         0                             55m
    agl030519-vplxk-worker-us-east-1e   0         0                             55m
    agl030519-vplxk-worker-us-east-1f   0         0                             55m

    When the new MachineSet is available, the DESIRED and CURRENT values match. If the MachineSet is not available, wait a few minutes and run the command again.

  4. After the new MachineSet is available, check status of the machine and the node that it references:

    $ oc describe machine <name> -n openshift-machine-api

    For example:

    $ oc describe machine agl030519-vplxk-infra-us-east-1a -n openshift-machine-api
    Example output
    status:
      addresses:
      - address: 10.0.133.18
        type: InternalIP
      - address: ""
        type: ExternalDNS
      - address: ip-10-0-133-18.ec2.internal
        type: InternalDNS
      lastUpdated: "2019-05-03T10:38:17Z"
      nodeRef:
        kind: Node
        name: ip-10-0-133-18.ec2.internal
        uid: 71fb8d75-6d8f-11e9-9ff3-0e3f103c7cd8
      providerStatus:
        apiVersion: awsproviderconfig.openshift.io/v1beta1
        conditions:
        - lastProbeTime: "2019-05-03T10:34:31Z"
          lastTransitionTime: "2019-05-03T10:34:31Z"
          message: machine successfully created
          reason: MachineCreationSucceeded
          status: "True"
          type: MachineCreation
        instanceId: i-09ca0701454124294
        instanceState: running
        kind: AWSMachineProviderStatus
  5. View the new node and confirm that the new node has the label that you specified:

    $ oc get node <node_name> --show-labels

    Review the command output and confirm that node-role.kubernetes.io/<your_label> is in the LABELS list.

Any change to a MachineSet is not applied to existing machines owned by the MachineSet. For example, labels edited or added to an existing MachineSet are not propagated to existing machines and Nodes associated with the MachineSet.

Next steps

If you need MachineSets in other availability zones, repeat this process to create more MachineSets.

Creating MachineSets for different clouds

Use the sample MachineSet for your cloud.

Sample YAML for a MachineSet Custom Resource on AWS

This sample YAML defines a MachineSet that runs in the us-east-1a Amazon Web Services (AWS) zone and creates nodes that are labeled with node-role.kubernetes.io/<role>: ""

In this sample, <infrastructureID> is the infrastructure ID label that is based on the cluster ID that you set when you provisioned the cluster, and <role> is the node label to add.

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
  name: <infrastructureID>-<role>-<zone> (2)
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
      machine.openshift.io/cluster-api-machineset: <infrastructureID>-<role>-<zone> (2)
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
        machine.openshift.io/cluster-api-machine-role: <role> (3)
        machine.openshift.io/cluster-api-machine-type: <role> (3)
        machine.openshift.io/cluster-api-machineset: <infrastructureID>-<role>-<zone> (2)
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/<role>: "" (3)
      providerSpec:
        value:
          ami:
            id: ami-046fe691f52a953f9 (4)
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          blockDevices:
            - ebs:
                iops: 0
                volumeSize: 120
                volumeType: gp2
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: <infrastructureID>-worker-profile (1)
          instanceType: m4.large
          kind: AWSMachineProviderConfig
          placement:
            availabilityZone: us-east-1a
            region: us-east-1
          securityGroups:
            - filters:
                - name: tag:Name
                  values:
                    - <infrastructureID>-worker-sg (1)
          subnet:
            filters:
              - name: tag:Name
                values:
                  - <infrastructureID>-private-us-east-1a (1)
          tags:
            - name: kubernetes.io/cluster/<infrastructureID> (1)
              value: owned
          userDataSecret:
            name: worker-user-data
1 Specify the infrastructure ID that is based on the cluster ID that you set when you provisioned the cluster. If you have the OpenShift CLI and jq package installed, you can obtain the infrastructure ID by running the following command:
$ oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster
2 Specify the infrastructure ID, node label, and zone.
3 Specify the node label to add.
4 Specify a valid Red Hat Enterprise Linux CoreOS (RHCOS) AMI for your AWS zone for your OpenShift Container Platform nodes.

AWS MachineSets support non-guaranteed Spot Instances, which can save you on costs compared to On-Demand Instance prices. You can configure Spot Instances by adding SpotMarketOptions to the MachineSet YAML file.

Sample YAML for a MachineSet Custom Resource on Azure

This sample YAML defines a MachineSet that runs in the 1 Microsoft Azure zone in the centralus region and creates nodes that are labeled with node-role.kubernetes.io/<role>: ""

In this sample, <infrastructureID> is the infrastructure ID label that is based on the cluster ID that you set when you provisioned the cluster, and <role> is the node label to add.

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
    machine.openshift.io/cluster-api-machine-role: <role> (2)
    machine.openshift.io/cluster-api-machine-type: <role> (2)
  name: <infrastructureID>-<role>-<region> (3)
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
      machine.openshift.io/cluster-api-machineset: <infrastructureID>-<role>-<region> (3)
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
        machine.openshift.io/cluster-api-machine-role: <role> (2)
        machine.openshift.io/cluster-api-machine-type: <role> (2)
        machine.openshift.io/cluster-api-machineset: <infrastructureID>-<role>-<region> (3)
    spec:
      metadata:
        creationTimestamp: null
        labels:
          node-role.kubernetes.io/<role>: "" (2)
      providerSpec:
        value:
          apiVersion: azureproviderconfig.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/<infrastructureID>-rg/providers/Microsoft.Compute/images/<infrastructureID>
            sku: ""
            version: ""
          internalLoadBalancer: ""
          kind: AzureMachineProviderSpec
          location: centralus
          managedIdentity: <infrastructureID>-identity (1)
          metadata:
            creationTimestamp: null
          natRule: null
          networkResourceGroup: ""
          osDisk:
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: ""
          resourceGroup: <infrastructureID>-rg (1)
          sshPrivateKey: ""
          sshPublicKey: ""
          subnet: <infrastructureID>-<role>-subnet  (1) (2)
          userDataSecret:
            name: <role>-user-data (2)
          vmSize: Standard_D2s_v3
          vnet: <infrastructureID>-vnet (1)
          zone: "1" (4)
1 Specify the infrastructure ID that is based on the cluster ID that you set when you provisioned the cluster. If you have the OpenShift CLI and jq package installed, you can obtain the infrastructure ID by running the following command:
$ oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster
2 Specify the node label to add.
3 Specify the infrastructure ID, node label, and region.
4 Specify the zone within your region to place Machines on. Be sure that your region supports the zone that you specify.

Sample YAML for a MachineSet Custom Resource on GCP

This sample YAML defines a MachineSet that runs in Google Cloud Platform (GCP) and creates nodes that are labeled with node-role.kubernetes.io/<role>: ""

In this sample, <infrastructureID> is the infrastructure ID label that is based on the cluster ID that you set when you provisioned the cluster, and <role> is the node label to add.

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
  name: <infrastructureID>-w-a (1)
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
      machine.openshift.io/cluster-api-machineset: <infrastructureID>-w-a (1)
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
        machine.openshift.io/cluster-api-machine-role: <role> (3)
        machine.openshift.io/cluster-api-machine-type: <role> (3)
        machine.openshift.io/cluster-api-machineset: <infrastructureID>-w-a (1)
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/<role>: "" (3)
      providerSpec:
        value:
          apiVersion: gcpprovider.openshift.io/v1beta1
          canIPForward: false
          credentialsSecret:
            name: gcp-cloud-credentials
          deletionProtection: false
          disks:
          - autoDelete: true
            boot: true
            image: <infrastructureID>-rhcos-image (1)
            labels: null
            sizeGb: 128
            type: pd-ssd
          kind: GCPMachineProviderSpec
          machineType: n1-standard-4
          metadata:
            creationTimestamp: null
          networkInterfaces:
          - network: <infrastructureID>-network (1)
            subnetwork: <infrastructureID>-<role>-subnet (2)
          projectID: <project_name> (4)
          region: us-central1
          serviceAccounts:
          - email: <infrastructureID>-w@<project_name>.iam.gserviceaccount.com  (1) (4)
            scopes:
            - https://www.googleapis.com/auth/cloud-platform
          tags:
          - <infrastructureID>-<role> (2)
          userDataSecret:
            name: worker-user-data
          zone: us-central1-a
1 Specify the infrastructure ID that is based on the cluster ID that you set when you provisioned the cluster. If you have the OpenShift CLI and jq package installed, you can obtain the infrastructure ID by running the following command:
$ oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster
2 Specify the infrastructure ID and node label.
3 Specify the node label to add.
4 Specify the name of the GCP project that you use for your cluster.

About the ClusterAutoscaler

The ClusterAutoscaler adjusts the size of an OpenShift Container Platform cluster to meet its current deployment needs. It uses declarative, Kubernetes-style arguments to provide infrastructure management that does not rely on objects of a specific cloud provider. The ClusterAutoscaler has a cluster scope, and is not associated with a particular namespace.

The ClusterAutoscaler increases the size of the cluster when there are Pods that failed to schedule on any of the current nodes due to insufficient resources or when another node is necessary to meet deployment needs. The ClusterAutoscaler does not increase the cluster resources beyond the limits that you specify.

Ensure that the maxNodesTotal value in the ClusterAutoscaler definition that you create is large enough to account for the total possible number of machines in your cluster. This value must encompass the number of control plane machines and the possible number of compute machines that you might scale to.

The ClusterAutoscaler decreases the size of the cluster when some nodes are consistently not needed for a significant period, such as when it has low resource use and all of its important Pods can fit on other nodes.

If the following types of Pods are present on a node, the ClusterAutoscaler will not remove the node:

  • Pods with restrictive PodDisruptionBudgets (PDBs).

  • Kube-system Pods that do not run on the node by default.

  • Kube-system Pods that do not have a PDB or have a PDB that is too restrictive.

  • Pods that are not backed by a controller object such as a Deployment, ReplicaSet, or StatefulSet.

  • Pods with local storage.

  • Pods that cannot be moved elsewhere because of a lack of resources, incompatible node selectors or affinity, matching anti-affinity, and so on.

  • Unless they also have a "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation, Pods that have a "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation.

If you configure the ClusterAutoscaler, additional usage restrictions apply:

  • Do not modify the nodes that are in autoscaled node groups directly. All nodes within the same node group have the same capacity and labels and run the same system Pods.

  • Specify requests for your Pods.

  • If you have to prevent Pods from being deleted too quickly, configure appropriate PDBs.

  • Confirm that your cloud provider quota is large enough to support the maximum node pools that you configure.

  • Do not run additional node group autoscalers, especially the ones offered by your cloud provider.

The Horizontal Pod Autoscaler (HPA) and the ClusterAutoscaler modify cluster resources in different ways. The HPA changes the deployment’s or ReplicaSet’s number of replicas based on the current CPU load. If the load increases, the HPA creates new replicas, regardless of the amount of resources available to the cluster. If there are not enough resources, the ClusterAutoscaler adds resources so that the HPA-created Pods can run. If the load decreases, the HPA stops some replicas. If this action causes some nodes to be underutilized or completely empty, the ClusterAutoscaler deletes the unnecessary nodes.

The ClusterAutoscaler takes Pod priorities into account. The Pod Priority and Preemption feature enables scheduling Pods based on priorities if the cluster does not have enough resources, but the ClusterAutoscaler ensures that the cluster has resources to run all Pods. To honor the intention of both features, the ClusterAutoscaler inclues a priority cutoff function. You can use this cutoff to schedule "best-effort" Pods, which do not cause the ClusterAutoscaler to increase resources but instead run only when spare resources are available.

Pods with priority lower than the cutoff value do not cause the cluster to scale up or prevent the cluster from scaling down. No new nodes are added to run the Pods, and nodes running these Pods might be deleted to free resources.

ClusterAutoscaler resource definition

This ClusterAutoscaler resource definition shows the parameters and sample values for the ClusterAutoscaler.

apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10 (1)
  resourceLimits:
    maxNodesTotal: 24 (2)
    cores:
      min: 8 (3)
      max: 128 (4)
    memory:
      min: 4 (5)
      max: 256 (6)
    gpus:
      - type: nvidia.com/gpu (7)
        min: 0 (8)
        max: 16 (9)
      - type: amd.com/gpu (7)
        min: 0 (8)
        max: 4 (9)
  scaleDown: (10)
    enabled: true (11)
    delayAfterAdd: 10m (12)
    delayAfterDelete: 5m (13)
    delayAfterFailure: 30s (14)
    unneededTime: 60s (15)
1 Specify the priority that a pod must exceed to cause the ClusterAutoscaler to deploy additional nodes. Enter a 32-bit integer value. The podPriorityThreshold value is compared to the value of the PriorityClass that you assign to each pod.
2 Specify the maximum number of nodes to deploy. This value is the total number of machines that are deployed in your cluster, not just the ones that the autoscaler controls. Ensure that this value is large enough to account for all of your control plane and compute machines and the total number of replicas that you specify in your MachineAutoscaler resources.
3 Specify the minimum number of cores to deploy.
4 Specify the maximum number of cores to deploy.
5 Specify the minimum amount of memory, in GiB, per node.
6 Specify the maximum amount of memory, in GiB, per node.
7 Optionally, specify the type of GPU node to deploy. Only nvidia.com/gpu and amd.com/gpu are valid types.
8 Specify the minimum number of GPUs to deploy.
9 Specify the maximum number of GPUs to deploy.
10 In this section, you can specify the period to wait for each action by using any valid ParseDuration interval, including ns, us, ms, s, m, and h.
11 Specify whether the ClusterAutoscaler can remove unnecessary nodes.
12 Optionally, specify the period to wait before deleting a node after a node has recently been added. If you do not specify a value, the default value of 10m is used.
13 Specify the period to wait before deleting a node after a node has recently been deleted. If you do not specify a value, the default value of 10s is used.
14 Specify the period to wait before deleting a node after a scale down failure occurred. If you do not specify a value, the default value of 3m is used.
15 Specify the period before an unnecessary node is eligible for deletion. If you do not specify a value, the default value of 10m is used.

Deploying the ClusterAutoscaler

To deploy the ClusterAutoscaler, you create an instance of the ClusterAutoscaler resource.

Procedure
  1. Create a YAML file for the ClusterAutoscaler resource that contains the customized resource definition.

  2. Create the resource in the cluster:

    $ oc create -f <filename>.yaml (1)
    1 <filename> is the name of the resource file that you customized.

About the MachineAutoscaler

The MachineAutoscaler adjusts the number of Machines in the MachineSets that you deploy in an OpenShift Container Platform cluster. You can scale both the default worker MachineSet and any other MachineSets that you create. The MachineAutoscaler makes more Machines when the cluster runs out of resources to support more deployments. Any changes to the values in MachineAutoscaler resources, such as the minimum or maximum number of instances, are immediately applied to the MachineSet they target.

You must deploy a MachineAutoscaler for the ClusterAutoscaler to scale your machines. The ClusterAutoscaler uses the annotations on MachineSets that the MachineAutoscaler sets to determine the resources that it can scale. If you define a ClusterAutoscaler without also defining MachineAutoscalers, the ClusterAutoscaler will never scale your cluster.

MachineAutoscaler resource definition

This MachineAutoscaler resource definition shows the parameters and sample values for the MachineAutoscaler.

apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-us-east-1a" (1)
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1 (2)
  maxReplicas: 12 (3)
  scaleTargetRef: (4)
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet (5)
    name: worker-us-east-1a (6)
1 Specify the MachineAutoscaler name. To make it easier to identify which MachineSet this MachineAutoscaler scales, specify or include the name of the MachineSet to scale. The MachineSet name takes the following form: <clusterid>-<machineset>-<aws-region-az>
2 Specify the minimum number Machines of the specified type that must remain in the specified zone after the ClusterAutoscaler initiates cluster scaling. If running in AWS, GCP, or Azure, this value can be set to 0. For other providers, do not set this value to 0.
3 Specify the maximum number Machines of the specified type that the ClusterAutoscaler can deploy in the specified AWS zone after it initiates cluster scaling. Ensure that the maxNodesTotal value in the ClusterAutoscaler definition is large enough to allow the MachineAutoScaler to deploy this number of machines.
4 In this section, provide values that describe the existing MachineSet to scale.
5 The kind parameter value is always MachineSet.
6 The name value must match the name of an existing MachineSet, as shown in the metadata.name parameter value.

Deploying the MachineAutoscaler

To deploy the MachineAutoscaler, you create an instance of the MachineAutoscaler resource.

Procedure
  1. Create a YAML file for the MachineAutoscaler resource that contains the customized resource definition.

  2. Create the resource in the cluster:

    $ oc create -f <filename>.yaml (1)
    1 <filename> is the name of the resource file that you customized.

Enabling Technology Preview features using FeatureGates

You can turn Technology Preview features on and off for all nodes in the cluster by editing the FeatureGates Custom Resource, named cluster, in the openshift-config project.

The following Technology Preview features are enabled by feature gates:

  • RotateKubeletServerCertificate

  • SupportPodPidsLimit

Turning on Technology Preview features cannot be undone and prevents upgrades.

Procedure

To turn on the Technology Preview features for the entire cluster:

  1. Create the FeatureGates instance:

    1. Switch to the AdministrationCustom Resource Definitions page.

    2. On the Custom Resource Definitions page, click FeatureGate.

    3. On the Custom Resource Definitions page, click the Actions Menu and select View Instances.

    4. On the Feature Gates page, click Create Feature Gates.

    5. Replace the code with following sample:

      apiVersion: config.openshift.io/v1
      kind: FeatureGate
      metadata:
        name: cluster
      spec: {}
    6. Click Create.

  2. To turn on the Technology Preview features, change the spec parameter to:

    apiVersion: config.openshift.io/v1
    kind: FeatureGate
    metadata:
      name: cluster
    spec:
      featureSet: TechPreviewNoUpgrade (1)
    1 Add featureSet: TechPreviewNoUpgrade to enable the Technology Preview features that are affected by FeatureGates.

    Turning on Technology Preview features cannot be undone and prevents upgrades.

etcd tasks

Enable, disable, or back up etcd.

Recommended etcd practices

For large and dense clusters, etcd can suffer from poor performance if the keyspace grows excessively large and exceeds the space quota. Periodic maintenance of etcd, including defragmentation, must be performed to free up space in the data store. It is highly recommended that you monitor Prometheus for etcd metrics and defragment it when required before etcd raises a cluster-wide alarm that puts the cluster into a maintenance mode, which only accepts key reads and deletes. Some of the key metrics to monitor are etcd_server_quota_backend_bytes which is the current quota limit, etcd_mvcc_db_total_size_in_use_in_bytes which indicates the actual database usage after a history compaction, and etcd_debugging_mvcc_db_total_size_in_bytes which shows the database size including free space waiting for defragmentation.

About etcd encryption

By default, etcd data is not encrypted in OpenShift Container Platform. You can enable etcd encryption for your cluster to provide an additional layer of data security. For example, it can help protect the loss of sensitive data if an etcd backup is exposed to the incorrect parties.

When you enable etcd encryption, the following OpenShift API server and Kubernetes API server resources are encrypted:

  • Secrets

  • ConfigMaps

  • Routes

  • OAuth access tokens

  • OAuth authorize tokens

When you enable etcd encryption, encryption keys are created. These keys are rotated on a weekly basis. You must have these keys in order to restore from an etcd backup.

Enabling etcd encryption

You can enable etcd encryption to encrypt sensitive resources in your cluster.

It is not recommended to take a backup of etcd until the initial encryption process is complete. If the encryption process has not completed, the backup might be only partially encrypted.

Prerequisites
  • Access to the cluster as a user with the cluster-admin role.

Procedure
  1. Modify the API server object:

    $ oc edit apiserver
  2. Set the encryption field type to aescbc:

    spec:
      encryption:
        type: aescbc (1)
    1 The aescbc type means that AES-CBC with PKCS#7 padding and a 32 byte key is used to perform the encryption.
  3. Save the file to apply the changes.

    The encryption process starts. It can take 20 minutes or longer for this process to complete, depending on the size of your cluster.

  4. Verify that etcd encryption was successful.

    1. Review the Encrypted status condition for the OpenShift API server to verify that its resources were successfully encrypted:

      $ oc get openshiftapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{"\n"}{.message}{"\n"}'

      The output shows EncryptionCompleted upon successful encryption:

      EncryptionCompleted
      All resources encrypted: routes.route.openshift.io, oauthaccesstokens.oauth.openshift.io, oauthauthorizetokens.oauth.openshift.io

      If the output shows EncryptionInProgress, this means that encryption is still in progress. Wait a few minutes and try again.

    2. Review the Encrypted status condition for the Kubernetes API server to verify that its resources were successfully encrypted:

      $ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{"\n"}{.message}{"\n"}'

      The output shows EncryptionCompleted upon successful encryption:

      EncryptionCompleted
      All resources encrypted: secrets, configmaps

      If the output shows EncryptionInProgress, this means that encryption is still in progress. Wait a few minutes and try again.

Disabling etcd encryption

You can disable encryption of etcd data in your cluster.

Prerequisites
  • Access to the cluster as a user with the cluster-admin role.

Procedure
  1. Modify the API server object:

    $ oc edit apiserver
  2. Set the encryption field type to identity:

    spec:
      encryption:
        type: identity (1)
    1 The identity type is the default value and means that no encryption is performed.
  3. Save the file to apply the changes.

    The decryption process starts. It can take 20 minutes or longer for this process to complete, depending on the size of your cluster.

  4. Verify that etcd decryption was successful.

    1. Review the Encrypted status condition for the OpenShift API server to verify that its resources were successfully decrypted:

      $ oc get openshiftapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{"\n"}{.message}{"\n"}'

      The output shows DecryptionCompleted upon successful decryption:

      DecryptionCompleted
      Encryption mode set to identity and everything is decrypted

      If the output shows DecryptionInProgress, this means that decryption is still in progress. Wait a few minutes and try again.

    2. Review the Encrypted status condition for the Kubernetes API server to verify that its resources were successfully decrypted:

      $ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{"\n"}{.message}{"\n"}'

      The output shows DecryptionCompleted upon successful decryption:

      DecryptionCompleted
      Encryption mode set to identity and everything is decrypted

      If the output shows DecryptionInProgress, this means that decryption is still in progress. Wait a few minutes and try again.

Backing up etcd data

Follow these steps to back up etcd data by creating an etcd snapshot and backing up the resources for the static Pods. This backup can be saved and used at a later time if you need to restore etcd.

Only save a backup from a single master host. Do not take a backup from each master host in the cluster.

Prerequisites
  • You have access to the cluster as a user with the cluster-admin role.

  • You have checked whether the cluster-wide proxy is enabled.

    You can check whether the proxy is enabled by reviewing the output of oc get proxy cluster -o yaml. The proxy is enabled if the httpProxy, httpsProxy, and noProxy fields have values set.

Procedure
  1. Start a debug session for a master node:

    $ oc debug node/<node_name>
  2. Change your root directory to the host:

    sh-4.2# chroot /host
  3. If the cluster-wide proxy is enabled, be sure that you have exported the NO_PROXY, HTTP_PROXY, and HTTPS_PROXY environment variables.

  4. Run the cluster-backup.sh script and pass in the location to save the backup to.

    sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
    Example script output
    1bf371f1b5a483927cd01bb593b0e12cff406eb8d7d0acf4ab079c36a0abd3f7
    etcdctl version: 3.3.18
    API version: 3.3
    found latest kube-apiserver-pod: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7
    found latest kube-controller-manager-pod: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-8
    found latest kube-scheduler-pod: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6
    found latest etcd-pod: /etc/kubernetes/static-pod-resources/etcd-pod-2
    Snapshot saved at /home/core/assets/backup/snapshot_2020-03-18_220218.db
    snapshot db and kube resources are successfully saved to /home/core/assets/backup

    In this example, two files are created in the /home/core/assets/backup/ directory on the master host:

    • snapshot_<datetimestamp>.db: This file is the etcd snapshot.

    • static_kuberesources_<datetimestamp>.tar.gz: This file contains the resources for the static Pods. If etcd encryption is enabled, it also contains the encryption keys for the etcd snapshot.

      If etcd encryption is enabled, it is recommended to store this second file separately from the etcd snapshot for security reasons. However, this file is required in order to restore from the etcd snapshot.

      Keep in mind that etcd encryption only encrypts values, not keys. This means that resource types, namespaces, and object names are unencrypted.

Restoring to a previous cluster state

You can use a saved etcd backup to restore back to a previous cluster state. You use the etcd backup to restore a single control plane host. Then the etcd cluster Operator handles scaling to the remaining master hosts.

Prerequisites
  • Access to the cluster as a user with the cluster-admin role.

  • SSH access to master hosts.

  • A backup directory containing both the etcd snapshot and the resources for the static Pods, which were from the same backup. The file names in the directory must be in the following formats: snapshot_<datetimestamp>.db and static_kuberesources_<datetimestamp>.tar.gz.

Procedure
  1. Select a control plane host to use as the recovery host. This is the host that you will run the restore operation on.

  2. Establish SSH connectivity to each of the control plane nodes, including the recovery host.

    The Kubernetes API server becomes inaccessible after the restore process starts, so you cannot access the control plane nodes. For this reason, it is recommended to establish SSH connectivity to each control plane host in a separate terminal.

    If you do not complete this step, you will not be able to access the master hosts to complete the restore procedure, and you will be unable to recover your cluster from this state.

  3. Copy the etcd backup directory to the recovery control plane host.

    This procedure assumes that you copied the backup directory containing the etcd snapshot and the resources for the static Pods to the /home/core/ directory of your recovery control plane host.

  4. Stop the static Pods on all other control plane nodes.

    It is not required to manually stop the Pods on the recovery host. The recovery script will stop the Pods on the recovery host.

    1. Access a control plane host that is not the recovery host.

    2. Move the existing etcd Pod file out of the kubelet manifest directory:

      [core@ip-10-0-154-194 ~]$ sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp
    3. Verify that the etcd Pods are stopped.

      [core@ip-10-0-154-194 ~]$ sudo crictl ps | grep etcd

      The output of this command should be empty. If it is not empty, wait a few minutes and check again.

    4. Move the existing Kubernetes API server Pod file out of the kubelet manifest directory:

      [core@ip-10-0-154-194 ~]$ sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp
    5. Verify that the Kubernetes API server Pods are stopped.

      [core@ip-10-0-154-194 ~]$ sudo crictl ps | grep kube-apiserver

      The output of this command should be empty. If it is not empty, wait a few minutes and check again.

    6. Move the etcd data directory to a different location:

      [core@ip-10-0-154-194 ~]$ sudo mv /var/lib/etcd/ /tmp
    7. Repeat this step on each of the other master hosts that is not the recovery host.

  5. Access the recovery control plane host.

  6. If the cluster-wide proxy is enabled, be sure that you have exported the NO_PROXY, HTTP_PROXY, and HTTPS_PROXY environment variables.

    You can check whether the proxy is enabled by reviewing the output of oc get proxy cluster -o yaml. The proxy is enabled if the httpProxy, httpsProxy, and noProxy fields have values set.

  7. Run the restore script on the recovery control plane host and pass in the path to the etcd backup directory:

    [core@ip-10-0-143-125 ~]$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/backup
    Example script output
    ...stopping kube-scheduler-pod.yaml
    ...stopping kube-controller-manager-pod.yaml
    ...stopping etcd-pod.yaml
    ...stopping kube-apiserver-pod.yaml
    Waiting for container etcd to stop
    .complete
    Waiting for container etcdctl to stop
    .............................complete
    Waiting for container etcd-metrics to stop
    complete
    Waiting for container kube-controller-manager to stop
    complete
    Waiting for container kube-apiserver to stop
    ..........................................................................................complete
    Waiting for container kube-scheduler to stop
    complete
    Moving etcd data-dir /var/lib/etcd/member to /var/lib/etcd-backup
    starting restore-etcd static pod
    starting kube-apiserver-pod.yaml
    static-pod-resources/kube-apiserver-pod-7/kube-apiserver-pod.yaml
    starting kube-controller-manager-pod.yaml
    static-pod-resources/kube-controller-manager-pod-7/kube-controller-manager-pod.yaml
    starting kube-scheduler-pod.yaml
    static-pod-resources/kube-scheduler-pod-8/kube-scheduler-pod.yaml
  8. Restart the kubelet service on all master hosts.

    1. From the recovery host, run the following command:

      [core@ip-10-0-143-125 ~]$ sudo systemctl restart kubelet.service
    2. Repeat this step on all other master hosts.

  9. Verify that the single member control plane has started successfully.

    1. From the recovery host, verify that the etcd container is running.

      [core@ip-10-0-143-125 ~]$ sudo crictl ps | grep etcd
      Example output
      3ad41b7908e32       36f86e2eeaaffe662df0d21041eb22b8198e0e58abeeae8c743c3e6e977e8009                                                         About a minute ago   Running             etcd                                          0                   7c05f8af362f0
    2. From the recovery host, verify that the etcd Pod is running.

      [core@ip-10-0-143-125 ~]$ oc get pods -n openshift-etcd | grep etcd
      Example output
      NAME                                             READY   STATUS      RESTARTS   AGE
      etcd-ip-10-0-143-125.ec2.internal                1/1     Running     1          2m47s

      If the status is Pending, or the output lists more than one running etcd Pod, wait a few minutes and check again.

  10. Force etcd redeployment.

    In a terminal that has access to the cluster as a cluster-admin user, run the following command:

    $ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge (1)
    1 The forceRedeploymentReason value must be unique, which is why a timestamp is appended.

    When the etcd cluster Operator performs a redeployment, the existing nodes are started with new Pods similar to the initial bootstrap scale up.

  11. Verify all nodes are updated to the latest revision.

    In a terminal that has access to the cluster as a cluster-admin user, run the following command:

    $ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

    Review the NodeInstallerProgressing status condition for etcd to verify that all nodes are at the latest revision. The output shows AllNodesAtLatestRevision upon successful update:

    AllNodesAtLatestRevision
    3 nodes are at revision 3

    If the output shows a message such as 2 nodes are at revision 3; 1 nodes are at revision 4, this means that the update is still in progress. Wait a few minutes and try again.

  12. After etcd is redeployed, force new rollouts for the control plane. The Kubernetes API server will reinstall itself on the other nodes because the kubelet is connected to API servers using an internal load balancer.

    In a terminal that has access to the cluster as a cluster-admin user, run the following commands.

    1. Update the kubeapiserver:

      $ oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

      Verify all nodes are updated to the latest revision.

      $ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

      Review the NodeInstallerProgressing status condition to verify that all nodes are at the latest revision. The output shows AllNodesAtLatestRevision upon successful update:

      AllNodesAtLatestRevision
      3 nodes are at revision 3
    2. Update the kubecontrollermanager:

      $ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

      Verify all nodes are updated to the latest revision.

      $ oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

      Review the NodeInstallerProgressing status condition to verify that all nodes are at the latest revision. The output shows AllNodesAtLatestRevision upon successful update:

      AllNodesAtLatestRevision
      3 nodes are at revision 3
    3. Update the kubescheduler:

      $ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

      Verify all nodes are updated to the latest revision.

      $ oc get kubescheduler -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

      Review the NodeInstallerProgressing status condition to verify that all nodes are at the latest revision. The output shows AllNodesAtLatestRevision upon successful update:

      AllNodesAtLatestRevision
      3 nodes are at revision 3
  13. Verify that all master hosts have started and joined the cluster.

    In a terminal that has access to the cluster as a cluster-admin user, run the following command:

    $ oc get pods -n openshift-etcd | grep etcd
    Example output
    etcd-ip-10-0-143-125.ec2.internal                2/2     Running     0          9h
    etcd-ip-10-0-154-194.ec2.internal                2/2     Running     0          9h
    etcd-ip-10-0-173-171.ec2.internal                2/2     Running     0          9h

Note that it might take several minutes after completing this procedure for all services to be restored. For example, authentication by using oc login might not immediately work until the OAuth server Pods are restarted.

Pod disruption budgets

Understand and configure Pod disruption budgets.

Understanding how to use Pod disruption budgets to specify the number of Pods that must be up

A pod disruption budget is part of the Kubernetes API, which can be managed with oc commands like other object types. They allow the specification of safety constraints on Pods during operations, such as draining a node for maintenance.

PodDisruptionBudget is an API object that specifies the minimum number or percentage of replicas that must be up at a time. Setting these in projects can be helpful during node maintenance (such as scaling a cluster down or a cluster upgrade) and is only honored on voluntary evictions (not on node failures).

A PodDisruptionBudget object’s configuration consists of the following key parts:

  • A label selector, which is a label query over a set of Pods.

  • An availability level, which specifies the minimum number of Pods that must be available simultaneously, either:

    • minAvailable is the number of Pods must always be available, even during a disruption.

    • maxUnavailable is the number of Pods can be unavailable during a disruption.

A maxUnavailable of 0% or 0 or a minAvailable of 100% or equal to the number of replicas is permitted but can block nodes from being drained.

You can check for Pod disruption budgets across all projects with the following:

$ oc get poddisruptionbudget --all-namespaces
Example output
NAMESPACE         NAME          MIN-AVAILABLE   SELECTOR
another-project   another-pdb   4               bar=foo
test-project      my-pdb        2               foo=bar

The PodDisruptionBudget is considered healthy when there are at least minAvailable Pods running in the system. Every Pod above that limit can be evicted.

Depending on your Pod priority and preemption settings, lower-priority Pods might be removed despite their Pod disruption budget requirements.

Specifying the number of pods that must be up with pod disruption budgets

You can use a PodDisruptionBudget object to specify the minimum number or percentage of replicas that must be up at a time.

Procedure

To configure a pod disruption budget:

  1. Create a YAML file with the an object definition similar to the following:

    apiVersion: policy/v1beta1 (1)
    kind: PodDisruptionBudget
    metadata:
      name: my-pdb
    spec:
      minAvailable: 2  (2)
      selector:  (3)
        matchLabels:
          foo: bar
    1 PodDisruptionBudget is part of the policy/v1beta1 API group.
    2 The minimum number of pods that must be available simultaneously. This can be either an integer or a string specifying a percentage, for example, 20%.
    3 A label query over a set of resources. The result of matchLabels and matchExpressions are logically conjoined.

    Or:

    apiVersion: policy/v1beta1 (1)
    kind: PodDisruptionBudget
    metadata:
      name: my-pdb
    spec:
      maxUnavailable: 25% (2)
      selector: (3)
        matchLabels:
          foo: bar
    1 PodDisruptionBudget is part of the policy/v1beta1 API group.
    2 The maximum number of pods that can be unavailable simultaneously. This can be either an integer or a string specifying a percentage, for example, 20%.
    3 A label query over a set of resources. The result of matchLabels and matchExpressions are logically conjoined.
  2. Run the following command to add the object to project:

    $ oc create -f </path/to/file> -n <project_name>