Adjust worker nodes

If you incorrectly sized the worker nodes during deployment, adjust them by creating one or more new machine sets, scale them up, then scale the original machine set down before removing them.

Understanding the difference between machine sets and the machine config pool

MachineSet objects describe OpenShift Container Platform nodes with respect to the cloud or machine provider.

The MachineConfigPool object allows MachineConfigController components to define and provide the status of machines in the context of upgrades.

The MachineConfigPool object allows users to configure how upgrades are rolled out to the OpenShift Container Platform nodes in the machine config pool.

The NodeSelector object can be replaced with a reference to the MachineSet object.

Scaling a machine set manually

If you must add or remove an instance of a machine in a machine set, you can manually scale the machine set.

This guidance is relevant to fully automated, installer-provisioned infrastructure installations. Customized, user-provisioned infrastructure installations does not have machine sets.

Prerequisites
  • Install an OpenShift Container Platform cluster and the oc command line.

  • Log in to oc as a user with cluster-admin permission.

Procedure
  1. View the machine sets that are in the cluster:

    $ oc get machinesets -n openshift-machine-api

    The machine sets are listed in the form of <clusterid>-worker-<aws-region-az>.

  2. Scale the machine set:

    $ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api

    Or:

    $ oc edit machineset <machineset> -n openshift-machine-api

    You can scale the machine set up or down. It takes several minutes for the new machines to be available.

The machine set deletion policy

Random, Newest, and Oldest are the three supported deletion options. The default is Random, meaning that random machines are chosen and deleted when scaling machine sets down. The deletion policy can be set according to the use case by modifying the particular machine set:

spec:
  deletePolicy: <delete_policy>
  replicas: <desired_replica_count>

Specific machines can also be prioritized for deletion by adding the annotation machine.openshift.io/cluster-api-delete-machine to the machine of interest, regardless of the deletion policy.

By default, the OpenShift Container Platform router pods are deployed on workers. Because the router is required to access some cluster resources, including the web console, do not scale the worker machine set to 0 unless you first relocate the router pods.

Custom machine sets can be used for use cases requiring that services run on specific nodes and that those services are ignored by the controller when the worker machine sets are scaling down. This prevents service disruption.

Creating infrastructure machine sets

You can create a MachineSet object to host only infrastructure components. You apply specific Kubernetes labels to these machines and then update the infrastructure components to run on only those machines. These infrastructure nodes are not counted toward the total number of subscriptions that are required to run the environment.

OpenShift Container Platform infrastructure components

The following infrastructure workloads do not incur OpenShift Container Platform worker subscriptions:

  • Kubernetes and OpenShift Container Platform control plane services that run on masters

  • The default router

  • The integrated container image registry

  • The cluster metrics collection, or monitoring service, including components for monitoring user-defined projects

  • Cluster aggregated logging

  • Service brokers

  • Red Hat Quay

  • Red Hat OpenShift Container Storage

  • Red Hat Advanced Cluster Manager

Any node that runs any other container, pod, or component is a worker node that your subscription must cover.

Creating default cluster-wide node selectors

You can use default cluster-wide node selectors on pods together with labels on nodes to constrain all pods created in a cluster to specific nodes.

With cluster-wide node selectors, when you create a pod in that cluster, OpenShift Container Platform adds the default node selectors to the pod and schedules the pod on nodes with matching labels.

You configure cluster-wide node selectors by editing the Scheduler Operator custom resource (CR). You add labels to a node, a machine set, or a machine config. Adding the label to the machine set ensures that if the node or machine goes down, new nodes have the label. Labels added to a node or machine config do not persist if the node or machine goes down.

You can add additional key/value pairs to a pod. But you cannot add a different value for a default key.

Procedure

To add a default cluster-wide node selector:

  1. Edit the Scheduler Operator CR to add the default cluster-wide node selectors:

    $ oc edit scheduler cluster
    Example Scheduler Operator CR with a node selector
    apiVersion: config.openshift.io/v1
    kind: Scheduler
    metadata:
      name: cluster
    ...
    
    spec:
      defaultNodeSelector: type=user-node,region=east (1)
      mastersSchedulable: false
      policy:
        name: ""
    1 Add a node selector with the appropriate <key>:<value> pairs.

    After making this change, wait for the pods in the openshift-kube-apiserver project to redeploy. This can take several minutes. The default cluster-wide node selector does not take effect until the pods redeploy.

  2. Add labels to a node by using a machine set or editing the node directly:

    • Use a machine set to add labels to nodes managed by the machine set when a node is created:

      1. Run the following command to add labels to a MachineSet object:

        $ oc patch MachineSet <name> --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"<key>"="<value>","<key>"="<value>"}}]'  -n openshift-machine-api (1)
        1 Add a <key>/<value> pair for each label.

        For example:

        $ oc patch MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c --type='json' -p='[{"op":"add","path":"/spec/template/spec/metadata/labels", "value":{"type":"user-node","region":"east"}}]'  -n openshift-machine-api
      2. Verify that the labels are added to the MachineSet object by using the oc edit command:

        For example:

        $ oc edit MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
        Example output
        apiVersion: machine.openshift.io/v1beta1
        kind: MachineSet
        metadata:
        ...
        spec:
        ...
          template:
            metadata:
        ...
            spec:
              metadata:
                labels:
                  region: east
                  type: user-node
      3. Redeploy the nodes associated with that machine set by scaling down to 0 and scaling up the nodes:

        For example:

        $ oc scale --replicas=0 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
        $ oc scale --replicas=1 MachineSet ci-ln-l8nry52-f76d1-hl7m7-worker-c -n openshift-machine-api
      4. When the nodes are ready and available, verify that the label is added to the nodes by using the oc get command:

        $ oc get nodes -l <key>=<value>

        For example:

        $ oc get nodes -l type=user-node
        Example output
        NAME                                       STATUS   ROLES    AGE   VERSION
        ci-ln-l8nry52-f76d1-hl7m7-worker-c-vmqzp   Ready    worker   61s   v1.18.3+002a51f
    • Add labels directly to a node:

      1. Edit the Node object for the node:

        $ oc label nodes <name> <key>=<value>

        For example, to label a node:

        $ oc label nodes ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49 type=user-node region=east
      2. Verify that the labels are added to the node using the oc get command:

        $ oc get nodes -l <key>=<value>,<key>=<value>

        For example:

        $ oc get nodes -l type=user-node,region=east
        Example output
        NAME                                       STATUS   ROLES    AGE   VERSION
        ci-ln-l8nry52-f76d1-hl7m7-worker-b-tgq49   Ready    worker   17m   v1.18.3+002a51f

Moving resources to infrastructure machine sets

Some of the infrastructure resources are deployed in your cluster by default. You can move them to the infrastructure machine sets that you created.

Moving the router

You can deploy the router pod to a different machine set. By default, the pod is deployed to a worker node.

Prerequisites
  • Configure additional machine sets in your OpenShift Container Platform cluster.

Procedure
  1. View the IngressController custom resource for the router Operator:

    $ oc get ingresscontroller default -n openshift-ingress-operator -o yaml

    The command output resembles the following text:

    apiVersion: operator.openshift.io/v1
    kind: IngressController
    metadata:
      creationTimestamp: 2019-04-18T12:35:39Z
      finalizers:
      - ingresscontroller.operator.openshift.io/finalizer-ingresscontroller
      generation: 1
      name: default
      namespace: openshift-ingress-operator
      resourceVersion: "11341"
      selfLink: /apis/operator.openshift.io/v1/namespaces/openshift-ingress-operator/ingresscontrollers/default
      uid: 79509e05-61d6-11e9-bc55-02ce4781844a
    spec: {}
    status:
      availableReplicas: 2
      conditions:
      - lastTransitionTime: 2019-04-18T12:36:15Z
        status: "True"
        type: Available
      domain: apps.<cluster>.example.com
      endpointPublishingStrategy:
        type: LoadBalancerService
      selector: ingresscontroller.operator.openshift.io/deployment-ingresscontroller=default
  2. Edit the ingresscontroller resource and change the nodeSelector to use the infra label:

    $ oc edit ingresscontroller default -n openshift-ingress-operator

    Add the nodeSelector stanza that references the infra label to the spec section, as shown:

      spec:
        nodePlacement:
          nodeSelector:
            matchLabels:
              node-role.kubernetes.io/infra: ""
  3. Confirm that the router pod is running on the infra node.

    1. View the list of router pods and note the node name of the running pod:

      $ oc get pod -n openshift-ingress -o wide
      Example output
      NAME                              READY     STATUS        RESTARTS   AGE       IP           NODE                           NOMINATED NODE   READINESS GATES
      router-default-86798b4b5d-bdlvd   1/1      Running       0          28s       10.130.2.4   ip-10-0-217-226.ec2.internal   <none>           <none>
      router-default-955d875f4-255g8    0/1      Terminating   0          19h       10.129.2.4   ip-10-0-148-172.ec2.internal   <none>           <none>

      In this example, the running pod is on the ip-10-0-217-226.ec2.internal node.

    2. View the node status of the running pod:

      $ oc get node <node_name> (1)
      1 Specify the <node_name> that you obtained from the pod list.
      Example output
      NAME                          STATUS  ROLES         AGE   VERSION
      ip-10-0-217-226.ec2.internal  Ready   infra,worker  17h   v1.18.3

      Because the role list includes infra, the pod is running on the correct node.

Moving the default registry

You configure the registry Operator to deploy its pods to different nodes.

Prerequisites
  • Configure additional machine sets in your OpenShift Container Platform cluster.

Procedure
  1. View the config/instance object:

    $ oc get configs.imageregistry.operator.openshift.io cluster -o yaml
    Example output
    apiVersion: imageregistry.operator.openshift.io/v1
    kind: Config
    metadata:
      creationTimestamp: 2019-02-05T13:52:05Z
      finalizers:
      - imageregistry.operator.openshift.io/finalizer
      generation: 1
      name: cluster
      resourceVersion: "56174"
      selfLink: /apis/imageregistry.operator.openshift.io/v1/configs/cluster
      uid: 36fd3724-294d-11e9-a524-12ffeee2931b
    spec:
      httpSecret: d9a012ccd117b1e6616ceccb2c3bb66a5fed1b5e481623
      logging: 2
      managementState: Managed
      proxy: {}
      replicas: 1
      requests:
        read: {}
        write: {}
      storage:
        s3:
          bucket: image-registry-us-east-1-c92e88cad85b48ec8b312344dff03c82-392c
          region: us-east-1
    status:
    ...
  2. Edit the config/instance object:

    $ oc edit configs.imageregistry.operator.openshift.io cluster
  3. Add the following lines of text the spec section of the object:

      nodeSelector:
        node-role.kubernetes.io/infra: ""
  4. Verify the registry pod has been moved to the infrastructure node.

    1. Run the following command to identify the node where the registry pod is located:

      $ oc get pods -o wide -n openshift-image-registry
    2. Confirm the node has the label you specified:

      $ oc describe node <node_name>

      Review the command output and confirm that node-role.kubernetes.io/infra is in the LABELS list.

Creating infrastructure machine sets for production environments

In a production deployment, deploy at least three machine sets to hold infrastructure components. Both the logging aggregation solution and the service mesh deploy Elasticsearch, and Elasticsearch requires three instances that are installed on different nodes. For high availability, install deploy these nodes to different availability zones. Since you need different machine sets for each availability zone, create at least three machine sets.

Creating a machine set

In addition to the ones created by the installation program, you can create your own machine sets to dynamically manage the machine compute resources for specific workloads of your choice.

Prerequisites
  • Deploy an OpenShift Container Platform cluster.

  • Install the OpenShift CLI (oc).

  • Log in to oc as a user with cluster-admin permission.

Procedure
  1. Create a new YAML file that contains the machine set custom resource (CR) sample, as shown, and is named <file_name>.yaml.

    Ensure that you set the <clusterID> and <role> parameter values.

    1. If you are not sure about which value to set for a specific field, you can check an existing machine set from your cluster.

      $ oc get machinesets -n openshift-machine-api
      Example output
      NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
      agl030519-vplxk-worker-us-east-1a   1         1         1       1           55m
      agl030519-vplxk-worker-us-east-1b   1         1         1       1           55m
      agl030519-vplxk-worker-us-east-1c   1         1         1       1           55m
      agl030519-vplxk-worker-us-east-1d   0         0                             55m
      agl030519-vplxk-worker-us-east-1e   0         0                             55m
      agl030519-vplxk-worker-us-east-1f   0         0                             55m
    2. Check values of a specific machine set:

      $ oc get machineset <machineset_name> -n \
           openshift-machine-api -o yaml
      Example output
      ...
      template:
          metadata:
            labels:
              machine.openshift.io/cluster-api-cluster: agl030519-vplxk (1)
              machine.openshift.io/cluster-api-machine-role: worker (2)
              machine.openshift.io/cluster-api-machine-type: worker
              machine.openshift.io/cluster-api-machineset: agl030519-vplxk-worker-us-east-1a
      1 The cluster ID.
      2 A default node label.
  2. Create the new MachineSet CR:

    $ oc create -f <file_name>.yaml
  3. View the list of machine sets:

    $ oc get machineset -n openshift-machine-api
    Example output
    NAME                                DESIRED   CURRENT   READY   AVAILABLE   AGE
    agl030519-vplxk-infra-us-east-1a    1         1         1       1           11m
    agl030519-vplxk-worker-us-east-1a   1         1         1       1           55m
    agl030519-vplxk-worker-us-east-1b   1         1         1       1           55m
    agl030519-vplxk-worker-us-east-1c   1         1         1       1           55m
    agl030519-vplxk-worker-us-east-1d   0         0                             55m
    agl030519-vplxk-worker-us-east-1e   0         0                             55m
    agl030519-vplxk-worker-us-east-1f   0         0                             55m

    When the new machine set is available, the DESIRED and CURRENT values match. If the machine set is not available, wait a few minutes and run the command again.

  4. After the new machine set is available, check status of the machine and the node that it references:

    $ oc describe machine <name> -n openshift-machine-api

    For example:

    $ oc describe machine agl030519-vplxk-infra-us-east-1a -n openshift-machine-api
    Example output
    status:
      addresses:
      - address: 10.0.133.18
        type: InternalIP
      - address: ""
        type: ExternalDNS
      - address: ip-10-0-133-18.ec2.internal
        type: InternalDNS
      lastUpdated: "2019-05-03T10:38:17Z"
      nodeRef:
        kind: Node
        name: ip-10-0-133-18.ec2.internal
        uid: 71fb8d75-6d8f-11e9-9ff3-0e3f103c7cd8
      providerStatus:
        apiVersion: awsproviderconfig.openshift.io/v1beta1
        conditions:
        - lastProbeTime: "2019-05-03T10:34:31Z"
          lastTransitionTime: "2019-05-03T10:34:31Z"
          message: machine successfully created
          reason: MachineCreationSucceeded
          status: "True"
          type: MachineCreation
        instanceId: i-09ca0701454124294
        instanceState: running
        kind: AWSMachineProviderStatus
  5. View the new node and confirm that the new node has the label that you specified:

    $ oc get node <node_name> --show-labels

    Review the command output and confirm that node-role.kubernetes.io/<your_label> is in the LABELS list.

Any change to a machine set is not applied to existing machines owned by the machine set. For example, labels edited or added to an existing machine set are not propagated to existing machines and nodes associated with the machine set.

Creating machine sets for different clouds

Use the sample machine set for your cloud.

Sample YAML for a machine set custom resource on AWS

This sample YAML defines a machine set that runs in the us-east-1a Amazon Web Services (AWS) zone and creates nodes that are labeled with node-role.kubernetes.io/<role>: ""

In this sample, <infrastructureID> is the infrastructure ID label that is based on the cluster ID that you set when you provisioned the cluster, and <role> is the node label to add.

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
  name: <infrastructureID>-<role>-<zone> (2)
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
      machine.openshift.io/cluster-api-machineset: <infrastructureID>-<role>-<zone> (2)
  template:
    metadata:
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
        machine.openshift.io/cluster-api-machine-role: <role> (3)
        machine.openshift.io/cluster-api-machine-type: <role> (3)
        machine.openshift.io/cluster-api-machineset: <infrastructureID>-<role>-<zone> (2)
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/<role>: "" (3)
      providerSpec:
        value:
          ami:
            id: ami-046fe691f52a953f9 (4)
          apiVersion: awsproviderconfig.openshift.io/v1beta1
          blockDevices:
            - ebs:
                iops: 0
                volumeSize: 120
                volumeType: gp2
          credentialsSecret:
            name: aws-cloud-credentials
          deviceIndex: 0
          iamInstanceProfile:
            id: <infrastructureID>-worker-profile (1)
          instanceType: m4.large
          kind: AWSMachineProviderConfig
          placement:
            availabilityZone: us-east-1a
            region: us-east-1
          securityGroups:
            - filters:
                - name: tag:Name
                  values:
                    - <infrastructureID>-worker-sg (1)
          subnet:
            filters:
              - name: tag:Name
                values:
                  - <infrastructureID>-private-us-east-1a (1)
          tags:
            - name: kubernetes.io/cluster/<infrastructureID> (1)
              value: owned
          userDataSecret:
            name: worker-user-data
1 Specify the infrastructure ID that is based on the cluster ID that you set when you provisioned the cluster. If you have the OpenShift CLI installed, you can obtain the infrastructure ID by running the following command:
$ oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster
2 Specify the infrastructure ID, node label, and zone.
3 Specify the node label to add.
4 Specify a valid Red Hat Enterprise Linux CoreOS (RHCOS) AMI for your AWS zone for your OpenShift Container Platform nodes.

Machine sets running on AWS support non-guaranteed Spot Instances. You can save on costs by using Spot Instances at a lower price compared to On-Demand Instances on AWS. Configure Spot Instances by adding spotMarketOptions to the machine set YAML file.

Sample YAML for a machine set custom resource on Azure

This sample YAML defines a machine set that runs in the 1 Microsoft Azure zone in the centralus region and creates nodes that are labeled with node-role.kubernetes.io/<role>: ""

In this sample, <infrastructureID> is the infrastructure ID label that is based on the cluster ID that you set when you provisioned the cluster, and <role> is the node label to add.

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
    machine.openshift.io/cluster-api-machine-role: <role> (2)
    machine.openshift.io/cluster-api-machine-type: <role> (2)
  name: <infrastructureID>-<role>-<region> (3)
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
      machine.openshift.io/cluster-api-machineset: <infrastructureID>-<role>-<region> (3)
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
        machine.openshift.io/cluster-api-machine-role: <role> (2)
        machine.openshift.io/cluster-api-machine-type: <role> (2)
        machine.openshift.io/cluster-api-machineset: <infrastructureID>-<role>-<region> (3)
    spec:
      metadata:
        creationTimestamp: null
        labels:
          node-role.kubernetes.io/<role>: "" (2)
      providerSpec:
        value:
          apiVersion: azureproviderconfig.openshift.io/v1beta1
          credentialsSecret:
            name: azure-cloud-credentials
            namespace: openshift-machine-api
          image:
            offer: ""
            publisher: ""
            resourceID: /resourceGroups/<infrastructureID>-rg/providers/Microsoft.Compute/images/<infrastructureID>
            sku: ""
            version: ""
          internalLoadBalancer: ""
          kind: AzureMachineProviderSpec
          location: centralus
          managedIdentity: <infrastructureID>-identity (1)
          metadata:
            creationTimestamp: null
          natRule: null
          networkResourceGroup: ""
          osDisk:
            diskSizeGB: 128
            managedDisk:
              storageAccountType: Premium_LRS
            osType: Linux
          publicIP: false
          publicLoadBalancer: ""
          resourceGroup: <infrastructureID>-rg (1)
          sshPrivateKey: ""
          sshPublicKey: ""
          subnet: <infrastructureID>-<role>-subnet  (1) (2)
          userDataSecret:
            name: <role>-user-data (2)
          vmSize: Standard_D2s_v3
          vnet: <infrastructureID>-vnet (1)
          zone: "1" (4)
1 Specify the infrastructure ID that is based on the cluster ID that you set when you provisioned the cluster. If you have the OpenShift CLI installed, you can obtain the infrastructure ID by running the following command:
$ oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster
2 Specify the node label to add.
3 Specify the infrastructure ID, node label, and region.
4 Specify the zone within your region to place Machines on. Be sure that your region supports the zone that you specify.

Sample YAML for a machine set custom resource on GCP

This sample YAML defines a machine set that runs in Google Cloud Platform (GCP) and creates nodes that are labeled with node-role.kubernetes.io/<role>: ""

In this sample, <infrastructureID> is the infrastructure ID label that is based on the cluster ID that you set when you provisioned the cluster, and <role> is the node label to add.

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
  name: <infrastructureID>-w-a (1)
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
      machine.openshift.io/cluster-api-machineset: <infrastructureID>-w-a (1)
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructureID> (1)
        machine.openshift.io/cluster-api-machine-role: <role> (3)
        machine.openshift.io/cluster-api-machine-type: <role> (3)
        machine.openshift.io/cluster-api-machineset: <infrastructureID>-w-a (1)
    spec:
      metadata:
        labels:
          node-role.kubernetes.io/<role>: "" (3)
      providerSpec:
        value:
          apiVersion: gcpprovider.openshift.io/v1beta1
          canIPForward: false
          credentialsSecret:
            name: gcp-cloud-credentials
          deletionProtection: false
          disks:
          - autoDelete: true
            boot: true
            image: <infrastructureID>-rhcos-image (1)
            labels: null
            sizeGb: 128
            type: pd-ssd
          kind: GCPMachineProviderSpec
          machineType: n1-standard-4
          metadata:
            creationTimestamp: null
          networkInterfaces:
          - network: <infrastructureID>-network (1)
            subnetwork: <infrastructureID>-<role>-subnet (2)
          projectID: <project_name> (4)
          region: us-central1
          serviceAccounts:
          - email: <infrastructureID>-w@<project_name>.iam.gserviceaccount.com  (1) (4)
            scopes:
            - https://www.googleapis.com/auth/cloud-platform
          tags:
          - <infrastructureID>-<role> (2)
          userDataSecret:
            name: worker-user-data
          zone: us-central1-a
1 Specify the infrastructure ID that is based on the cluster ID that you set when you provisioned the cluster. If you have the OpenShift CLI installed, you can obtain the infrastructure ID by running the following command:
$ oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster
2 Specify the infrastructure ID and node label.
3 Specify the node label to add.
4 Specify the name of the GCP project that you use for your cluster.

Sample YAML for a machine set custom resource on vSphere

This sample YAML defines a machine set that runs on VMware vSphere and creates nodes that are labeled with node-role.kubernetes.io/<role>: "".

In this sample, <infrastructure_id> is the infrastructure ID label that is based on the cluster ID that you set when you provisioned the cluster, and <role> is the node label to add.

apiVersion: machine.openshift.io/v1beta1
kind: MachineSet
metadata:
  creationTimestamp: null
  labels:
    machine.openshift.io/cluster-api-cluster: <infrastructure_id> (1)
  name: <infrastructure_id>-<role> (2)
  namespace: openshift-machine-api
spec:
  replicas: 1
  selector:
    matchLabels:
      machine.openshift.io/cluster-api-cluster: <infrastructure_id> (1)
      machine.openshift.io/cluster-api-machineset: <infrastructure_id>-<role> (2)
  template:
    metadata:
      creationTimestamp: null
      labels:
        machine.openshift.io/cluster-api-cluster: <infrastructure_id> (1)
        machine.openshift.io/cluster-api-machine-role: <role> (3)
        machine.openshift.io/cluster-api-machine-type: <role> (3)
        machine.openshift.io/cluster-api-machineset: <infrastructure_id>-<role> (2)
    spec:
      metadata:
        creationTimestamp: null
        labels:
          node-role.kubernetes.io/<role>: "" (3)
      providerSpec:
        value:
          apiVersion: vsphereprovider.openshift.io/v1beta1
          credentialsSecret:
            name: vsphere-cloud-credentials
          diskGiB: 120
          kind: VSphereMachineProviderSpec
          memoryMiB: 8192
          metadata:
            creationTimestamp: null
          network:
            devices:
            - networkName: "<vm_network_name>" (4)
          numCPUs: 4
          numCoresPerSocket: 1
          snapshot: ""
          template: <vm_template_name> (5)
          userDataSecret:
            name: worker-user-data
          workspace:
            datacenter: <vcenter_datacenter_name> (6)
            datastore: <vcenter_datastore_name> (7)
            folder: <vcenter_vm_folder_path> (8)
            resourcepool: <vsphere_resource_pool> (9)
            server: <vcenter_server_ip> (10)
1 Specify the infrastructure ID that is based on the cluster ID that you set when you provisioned the cluster. If you have the OpenShift CLI (oc) installed, you can obtain the infrastructure ID by running the following command:
$ oc get -o jsonpath='{.status.infrastructureName}{"\n"}' infrastructure cluster
2 Specify the infrastructure ID and node label.
3 Specify the node label to add.
4 Specify the vSphere VM network to deploy the machine set to.
5 Specify the vSphere VM template to use, such as user-5ddjd-rhcos.
6 Specify the vCenter Datacenter to deploy the machine set on.
7 Specify the vCenter Datastore to deploy the machine set on.
8 Specify the path to the vSphere VM folder in vCenter, such as /dc1/vm/user-inst-5ddjd.
9 Specify the vSphere resource pool for your VMs.
10 Specify the vCenter server IP or fully qualified domain name.

Creating an infrastructure node

See Creating infrastructure machine sets for installer-provisioned infrastructure environments or for any cluster where the master nodes are managed by the machine API.

Requirements of the cluster dictate that infrastructure, also called infra nodes, be provisioned. The installer only provides provisions for master and worker nodes. Worker nodes can be designated as infrastructure nodes or application, also called app, nodes through labeling.

Procedure
  1. Add a label to the worker node that you want to act as application node:

    $ oc label node <node-name> node-role.kubernetes.io/app=""
  2. Add a label to the worker nodes that you want to act as infrastructure nodes:

    $ oc label node <node-name> node-role.kubernetes.io/infra=""
  3. Check to see if applicable nodes now have the infra role and app roles:

    $ oc get nodes
  4. Create a default node selector so that pods without a node selector are assigned a subset of nodes to be deployed on, for example by default deployment in worker nodes. As an example, the defaultNodeSelector to deploy pods on worker nodes by default would look like:

    defaultNodeSelector: node-role.kubernetes.io/app=
  5. Move infrastructure resources to the newly labeled infra nodes.

Creating infrastructure machines

If you need infrastructure machines to have dedicated configurations, then you must create an infra pool.

Procedure
  1. Create a machine config pool that contains both the worker role and your custom role as machine config selector:

    $ cat infra.mcp.yaml
    Example output
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfigPool
    metadata:
      name: infra
    spec:
      machineConfigSelector:
        matchExpressions:
          - {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,infra]}
      nodeSelector:
        matchLabels:
          node-role.kubernetes.io/infra: ""
  2. After you have the YAML file, you can create the machine config pool:

    $ oc create -f infra.mcp.yaml
  3. Check the machine configs to ensure that the infrastructure configuration rendered successfully:

    $ oc get machineconfig
    Example output
    NAME                                                        GENERATEDBYCONTROLLER                      IGNITIONVERSION   CREATED
    00-master                                                   365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             31d
    00-worker                                                   365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             31d
    01-master-container-runtime                                 365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             31d
    01-master-kubelet                                           365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             31d
    01-worker-container-runtime                                 365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             31d
    01-worker-kubelet                                           365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             31d
    99-master-1ae2a1e0-a115-11e9-8f14-005056899d54-registries   365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             31d
    99-master-ssh                                                                                          2.2.0             31d
    99-worker-1ae64748-a115-11e9-8f14-005056899d54-registries   365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             31d
    99-worker-ssh                                                                                          2.2.0             31d
    rendered-infra-4e48906dca84ee702959c71a53ee80e7             365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             23m
    rendered-master-072d4b2da7f88162636902b074e9e28e            5b6fb8349a29735e48446d435962dec4547d3090   2.2.0             31d
    rendered-master-3e88ec72aed3886dec061df60d16d1af            02c07496ba0417b3e12b78fb32baf6293d314f79   2.2.0             31d
    rendered-master-419bee7de96134963a15fdf9dd473b25            365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             17d
    rendered-master-53f5c91c7661708adce18739cc0f40fb            365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             13d
    rendered-master-a6a357ec18e5bce7f5ac426fc7c5ffcd            365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             7d3h
    rendered-master-dc7f874ec77fc4b969674204332da037            5b6fb8349a29735e48446d435962dec4547d3090   2.2.0             31d
    rendered-worker-1a75960c52ad18ff5dfa6674eb7e533d            5b6fb8349a29735e48446d435962dec4547d3090   2.2.0             31d
    rendered-worker-2640531be11ba43c61d72e82dc634ce6            5b6fb8349a29735e48446d435962dec4547d3090   2.2.0             31d
    rendered-worker-4e48906dca84ee702959c71a53ee80e7            365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             7d3h
    rendered-worker-4f110718fe88e5f349987854a1147755            365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             17d
    rendered-worker-afc758e194d6188677eb837842d3b379            02c07496ba0417b3e12b78fb32baf6293d314f79   2.2.0             31d
    rendered-worker-daa08cc1e8f5fcdeba24de60cd955cc3            365c1cfd14de5b0e3b85e0fc815b0060f36ab955   2.2.0             13d
  4. Optional: To deploy changes to a custom pool, create a machine config that uses the custom pool name as the label, such as infra. Note that this is not required and only shown for instructional purposes. In this manner, you can apply any custom configurations specific to only your infra nodes.

    $ cat infra.mc.yaml
    Example output
    apiVersion: machineconfiguration.openshift.io/v1
    kind: MachineConfig
    metadata:
      labels:
        machineconfiguration.openshift.io/role: infra
      name: 51-infra
    spec:
      config:
        ignition:
          version: 2.2.0
        storage:
          files:
          - contents:
              source: data:,infra
            filesystem: root
            mode: 0644
            path: /etc/infratest
  5. Apply the machine config, then verify that the infra-labeled nodes are updated with the configuration file:

    $ oc create -f infra.mc.yaml

About the cluster autoscaler

The cluster autoscaler adjusts the size of an OpenShift Container Platform cluster to meet its current deployment needs. It uses declarative, Kubernetes-style arguments to provide infrastructure management that does not rely on objects of a specific cloud provider. The cluster autoscaler has a cluster scope, and is not associated with a particular namespace.

The cluster autoscaler increases the size of the cluster when there are pods that failed to schedule on any of the current nodes due to insufficient resources or when another node is necessary to meet deployment needs. The cluster autoscaler does not increase the cluster resources beyond the limits that you specify.

Ensure that the maxNodesTotal value in the ClusterAutoscaler resource definition that you create is large enough to account for the total possible number of machines in your cluster. This value must encompass the number of control plane machines and the possible number of compute machines that you might scale to.

The cluster autoscaler decreases the size of the cluster when some nodes are consistently not needed for a significant period, such as when it has low resource use and all of its important pods can fit on other nodes.

If the following types of pods are present on a node, the cluster autoscaler will not remove the node:

  • Pods with restrictive pod disruption budgets (PDBs).

  • Kube-system pods that do not run on the node by default.

  • Kube-system pods that do not have a PDB or have a PDB that is too restrictive.

  • Pods that are not backed by a controller object such as a deployment, replica set, or stateful set.

  • Pods with local storage.

  • Pods that cannot be moved elsewhere because of a lack of resources, incompatible node selectors or affinity, matching anti-affinity, and so on.

  • Unless they also have a "cluster-autoscaler.kubernetes.io/safe-to-evict": "true" annotation, pods that have a "cluster-autoscaler.kubernetes.io/safe-to-evict": "false" annotation.

If you configure the cluster autoscaler, additional usage restrictions apply:

  • Do not modify the nodes that are in autoscaled node groups directly. All nodes within the same node group have the same capacity and labels and run the same system pods.

  • Specify requests for your pods.

  • If you have to prevent pods from being deleted too quickly, configure appropriate PDBs.

  • Confirm that your cloud provider quota is large enough to support the maximum node pools that you configure.

  • Do not run additional node group autoscalers, especially the ones offered by your cloud provider.

The horizontal pod autoscaler (HPA) and the cluster autoscaler modify cluster resources in different ways. The HPA changes the deployment’s or replica set’s number of replicas based on the current CPU load. If the load increases, the HPA creates new replicas, regardless of the amount of resources available to the cluster. If there are not enough resources, the cluster autoscaler adds resources so that the HPA-created pods can run. If the load decreases, the HPA stops some replicas. If this action causes some nodes to be underutilized or completely empty, the cluster autoscaler deletes the unnecessary nodes.

The cluster autoscaler takes pod priorities into account. The Pod Priority and Preemption feature enables scheduling pods based on priorities if the cluster does not have enough resources, but the cluster autoscaler ensures that the cluster has resources to run all pods. To honor the intention of both features, the cluster autoscaler includes a priority cutoff function. You can use this cutoff to schedule "best-effort" pods, which do not cause the cluster autoscaler to increase resources but instead run only when spare resources are available.

Pods with priority lower than the cutoff value do not cause the cluster to scale up or prevent the cluster from scaling down. No new nodes are added to run the pods, and nodes running these pods might be deleted to free resources.

ClusterAutoscaler resource definition

This ClusterAutoscaler resource definition shows the parameters and sample values for the cluster autoscaler.

apiVersion: "autoscaling.openshift.io/v1"
kind: "ClusterAutoscaler"
metadata:
  name: "default"
spec:
  podPriorityThreshold: -10 (1)
  resourceLimits:
    maxNodesTotal: 24 (2)
    cores:
      min: 8 (3)
      max: 128 (4)
    memory:
      min: 4 (5)
      max: 256 (6)
    gpus:
      - type: nvidia.com/gpu (7)
        min: 0 (8)
        max: 16 (9)
      - type: amd.com/gpu (7)
        min: 0 (8)
        max: 4 (9)
  scaleDown: (10)
    enabled: true (11)
    delayAfterAdd: 10m (12)
    delayAfterDelete: 5m (13)
    delayAfterFailure: 30s (14)
    unneededTime: 60s (15)
1 Specify the priority that a pod must exceed to cause the cluster autoscaler to deploy additional nodes. Enter a 32-bit integer value. The podPriorityThreshold value is compared to the value of the PriorityClass that you assign to each pod.
2 Specify the maximum number of nodes to deploy. This value is the total number of machines that are deployed in your cluster, not just the ones that the autoscaler controls. Ensure that this value is large enough to account for all of your control plane and compute machines and the total number of replicas that you specify in your MachineAutoscaler resources.
3 Specify the minimum number of cores to deploy.
4 Specify the maximum number of cores to deploy.
5 Specify the minimum amount of memory, in GiB, per node.
6 Specify the maximum amount of memory, in GiB, per node.
7 Optionally, specify the type of GPU node to deploy. Only nvidia.com/gpu and amd.com/gpu are valid types.
8 Specify the minimum number of GPUs to deploy.
9 Specify the maximum number of GPUs to deploy.
10 In this section, you can specify the period to wait for each action by using any valid ParseDuration interval, including ns, us, ms, s, m, and h.
11 Specify whether the cluster autoscaler can remove unnecessary nodes.
12 Optionally, specify the period to wait before deleting a node after a node has recently been added. If you do not specify a value, the default value of 10m is used.
13 Specify the period to wait before deleting a node after a node has recently been deleted. If you do not specify a value, the default value of 10s is used.
14 Specify the period to wait before deleting a node after a scale down failure occurred. If you do not specify a value, the default value of 3m is used.
15 Specify the period before an unnecessary node is eligible for deletion. If you do not specify a value, the default value of 10m is used.

Deploying the cluster autoscaler

To deploy the cluster autoscaler, you create an instance of the ClusterAutoscaler resource.

Procedure
  1. Create a YAML file for the ClusterAutoscaler resource that contains the customized resource definition.

  2. Create the resource in the cluster:

    $ oc create -f <filename>.yaml (1)
    1 <filename> is the name of the resource file that you customized.

About the machine autoscaler

The machine autoscaler adjusts the number of Machines in the machine sets that you deploy in an OpenShift Container Platform cluster. You can scale both the default worker machine set and any other machine sets that you create. The machine autoscaler makes more Machines when the cluster runs out of resources to support more deployments. Any changes to the values in MachineAutoscaler resources, such as the minimum or maximum number of instances, are immediately applied to the machine set they target.

You must deploy a machine autoscaler for the cluster autoscaler to scale your machines. The cluster autoscaler uses the annotations on machine sets that the machine autoscaler sets to determine the resources that it can scale. If you define a cluster autoscaler without also defining machine autoscalers, the cluster autoscaler will never scale your cluster.

MachineAutoscaler resource definition

This MachineAutoscaler resource definition shows the parameters and sample values for the machine autoscaler.

apiVersion: "autoscaling.openshift.io/v1beta1"
kind: "MachineAutoscaler"
metadata:
  name: "worker-us-east-1a" (1)
  namespace: "openshift-machine-api"
spec:
  minReplicas: 1 (2)
  maxReplicas: 12 (3)
  scaleTargetRef: (4)
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet (5)
    name: worker-us-east-1a (6)
1 Specify the machine autoscaler name. To make it easier to identify which machine set this machine autoscaler scales, specify or include the name of the machine set to scale. The machine set name takes the following form: <clusterid>-<machineset>-<aws-region-az>
2 Specify the minimum number machines of the specified type that must remain in the specified zone after the cluster autoscaler initiates cluster scaling. If running in AWS, GCP, or Azure, this value can be set to 0. For other providers, do not set this value to 0.
3 Specify the maximum number machines of the specified type that the cluster autoscaler can deploy in the specified AWS zone after it initiates cluster scaling. Ensure that the maxNodesTotal value in the ClusterAutoscaler resource definition is large enough to allow the machine autoscaler to deploy this number of machines.
4 In this section, provide values that describe the existing machine set to scale.
5 The kind parameter value is always MachineSet.
6 The name value must match the name of an existing machine set, as shown in the metadata.name parameter value.

Deploying the machine autoscaler

To deploy the machine autoscaler, you create an instance of the MachineAutoscaler resource.

Procedure
  1. Create a YAML file for the MachineAutoscaler resource that contains the customized resource definition.

  2. Create the resource in the cluster:

    $ oc create -f <filename>.yaml (1)
    1 <filename> is the name of the resource file that you customized.

Enabling Technology Preview features using FeatureGates

You can turn Technology Preview features on and off for all nodes in the cluster by editing the FeatureGates Custom Resource, named cluster, in the openshift-config project.

The following Technology Preview features are enabled by feature gates:

  • RotateKubeletServerCertificate

  • SupportPodPidsLimit

Turning on Technology Preview features using the FeatureGate custom resource cannot be undone and prevents upgrades.

Procedure

To turn on the Technology Preview features for the entire cluster:

  1. Create the FeatureGates instance:

    1. Switch to the AdministrationCustom Resource Definitions page.

    2. On the Custom Resource Definitions page, click FeatureGate.

    3. On the Custom Resource Definitions page, click the Actions Menu and select View Instances.

    4. On the Feature Gates page, click Create Feature Gates.

    5. Replace the code with following sample:

      apiVersion: config.openshift.io/v1
      kind: FeatureGate
      metadata:
        name: cluster
      spec: {}
    6. Click Create.

  2. To turn on the Technology Preview features, change the spec parameter to:

    apiVersion: config.openshift.io/v1
    kind: FeatureGate
    metadata:
      name: cluster
    spec:
      featureSet: TechPreviewNoUpgrade (1)
    1 Add featureSet: TechPreviewNoUpgrade to enable the Technology Preview features that are affected by FeatureGates.

etcd tasks

Back up etcd, enable or disable etcd encryption, or defragment etcd data.

About etcd encryption

By default, etcd data is not encrypted in OpenShift Container Platform. You can enable etcd encryption for your cluster to provide an additional layer of data security. For example, it can help protect the loss of sensitive data if an etcd backup is exposed to the incorrect parties.

When you enable etcd encryption, the following OpenShift API server and Kubernetes API server resources are encrypted:

  • Secrets

  • Config maps

  • Routes

  • OAuth access tokens

  • OAuth authorize tokens

When you enable etcd encryption, encryption keys are created. These keys are rotated on a weekly basis. You must have these keys in order to restore from an etcd backup.

Enabling etcd encryption

You can enable etcd encryption to encrypt sensitive resources in your cluster.

It is not recommended to take a backup of etcd until the initial encryption process is complete. If the encryption process has not completed, the backup might be only partially encrypted.

Prerequisites
  • Access to the cluster as a user with the cluster-admin role.

Procedure
  1. Modify the APIServer object:

    $ oc edit apiserver
  2. Set the encryption field type to aescbc:

    spec:
      encryption:
        type: aescbc (1)
    1 The aescbc type means that AES-CBC with PKCS#7 padding and a 32 byte key is used to perform the encryption.
  3. Save the file to apply the changes.

    The encryption process starts. It can take 20 minutes or longer for this process to complete, depending on the size of your cluster.

  4. Verify that etcd encryption was successful.

    1. Review the Encrypted status condition for the OpenShift API server to verify that its resources were successfully encrypted:

      $ oc get openshiftapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{"\n"}{.message}{"\n"}'

      The output shows EncryptionCompleted upon successful encryption:

      EncryptionCompleted
      All resources encrypted: routes.route.openshift.io, oauthaccesstokens.oauth.openshift.io, oauthauthorizetokens.oauth.openshift.io

      If the output shows EncryptionInProgress, this means that encryption is still in progress. Wait a few minutes and try again.

    2. Review the Encrypted status condition for the Kubernetes API server to verify that its resources were successfully encrypted:

      $ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{"\n"}{.message}{"\n"}'

      The output shows EncryptionCompleted upon successful encryption:

      EncryptionCompleted
      All resources encrypted: secrets, configmaps

      If the output shows EncryptionInProgress, this means that encryption is still in progress. Wait a few minutes and try again.

Disabling etcd encryption

You can disable encryption of etcd data in your cluster.

Prerequisites
  • Access to the cluster as a user with the cluster-admin role.

Procedure
  1. Modify the APIServer object:

    $ oc edit apiserver
  2. Set the encryption field type to identity:

    spec:
      encryption:
        type: identity (1)
    1 The identity type is the default value and means that no encryption is performed.
  3. Save the file to apply the changes.

    The decryption process starts. It can take 20 minutes or longer for this process to complete, depending on the size of your cluster.

  4. Verify that etcd decryption was successful.

    1. Review the Encrypted status condition for the OpenShift API server to verify that its resources were successfully decrypted:

      $ oc get openshiftapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{"\n"}{.message}{"\n"}'

      The output shows DecryptionCompleted upon successful decryption:

      DecryptionCompleted
      Encryption mode set to identity and everything is decrypted

      If the output shows DecryptionInProgress, this means that decryption is still in progress. Wait a few minutes and try again.

    2. Review the Encrypted status condition for the Kubernetes API server to verify that its resources were successfully decrypted:

      $ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="Encrypted")]}{.reason}{"\n"}{.message}{"\n"}'

      The output shows DecryptionCompleted upon successful decryption:

      DecryptionCompleted
      Encryption mode set to identity and everything is decrypted

      If the output shows DecryptionInProgress, this means that decryption is still in progress. Wait a few minutes and try again.

Backing up etcd data

Follow these steps to back up etcd data by creating an etcd snapshot and backing up the resources for the static pods. This backup can be saved and used at a later time if you need to restore etcd.

Only save a backup from a single master host. Do not take a backup from each master host in the cluster.

Prerequisites
  • You have access to the cluster as a user with the cluster-admin role.

  • You have checked whether the cluster-wide proxy is enabled.

    You can check whether the proxy is enabled by reviewing the output of oc get proxy cluster -o yaml. The proxy is enabled if the httpProxy, httpsProxy, and noProxy fields have values set.

Procedure
  1. Start a debug session for a master node:

    $ oc debug node/<node_name>
  2. Change your root directory to the host:

    sh-4.2# chroot /host
  3. If the cluster-wide proxy is enabled, be sure that you have exported the NO_PROXY, HTTP_PROXY, and HTTPS_PROXY environment variables.

  4. Run the cluster-backup.sh script and pass in the location to save the backup to.

    sh-4.4# /usr/local/bin/cluster-backup.sh /home/core/assets/backup
    Example script output
    1bf371f1b5a483927cd01bb593b0e12cff406eb8d7d0acf4ab079c36a0abd3f7
    etcdctl version: 3.3.18
    API version: 3.3
    found latest kube-apiserver-pod: /etc/kubernetes/static-pod-resources/kube-apiserver-pod-7
    found latest kube-controller-manager-pod: /etc/kubernetes/static-pod-resources/kube-controller-manager-pod-8
    found latest kube-scheduler-pod: /etc/kubernetes/static-pod-resources/kube-scheduler-pod-6
    found latest etcd-pod: /etc/kubernetes/static-pod-resources/etcd-pod-2
    Snapshot saved at /home/core/assets/backup/snapshot_2020-03-18_220218.db
    snapshot db and kube resources are successfully saved to /home/core/assets/backup

    In this example, two files are created in the /home/core/assets/backup/ directory on the master host:

    • snapshot_<datetimestamp>.db: This file is the etcd snapshot.

    • static_kuberesources_<datetimestamp>.tar.gz: This file contains the resources for the static pods. If etcd encryption is enabled, it also contains the encryption keys for the etcd snapshot.

      If etcd encryption is enabled, it is recommended to store this second file separately from the etcd snapshot for security reasons. However, this file is required in order to restore from the etcd snapshot.

      Keep in mind that etcd encryption only encrypts values, not keys. This means that resource types, namespaces, and object names are unencrypted.

Defragmenting etcd data

Manual defragmentation must be performed periodically to reclaim disk space after etcd history compaction and other events cause disk fragmentation.

History compaction is performed automatically every five minutes and leaves gaps in the back-end database. This fragmented space is available for use by etcd, but is not available to the host file system. You must defragment etcd to make this space available to the host file system.

Because etcd writes data to disk, its performance strongly depends on disk performance. Consider defragmenting etcd every month, twice a month, or as needed for your cluster. You can also monitor the etcd_db_total_size_in_bytes metric to determine whether defragmentation is necessary.

Defragmenting etcd is a blocking action. The etcd member will not response until defragmentation is complete. For this reason, wait at least one minute between defragmentation actions on each of the pods to allow the cluster to recover.

Follow this procedure to defragment etcd data on each etcd member.

Prerequisites
  • You have access to the cluster as a user with the cluster-admin role.

Procedure
  1. Determine which etcd member is the leader, because the leader should be defragmented last.

    1. Get the list of etcd pods:

      $ oc get pods -n openshift-etcd -o wide | grep etcd
      Example output
      etcd-ip-10-0-159-225.example.redhat.com                3/3     Running     0          175m   10.0.159.225   ip-10-0-159-225.example.redhat.com   <none>           <none>
      etcd-ip-10-0-191-37.example.redhat.com                 3/3     Running     0          173m   10.0.191.37    ip-10-0-191-37.example.redhat.com    <none>           <none>
      etcd-ip-10-0-199-170.example.redhat.com                3/3     Running     0          176m   10.0.199.170   ip-10-0-199-170.example.redhat.com   <none>           <none>
    2. Choose a pod and run the following command to determine which etcd member is the leader:

      $ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.us-west-1.compute.internal etcdctl endpoint status --cluster -w table
      Example output
      Defaulting container name to etcdctl.
      Use 'oc describe pod/etcd-ip-10-0-159-225.example.redhat.com -n openshift-etcd' to see all of the containers in this pod.
      +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      |         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
      +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      |  https://10.0.191.37:2379 | 251cd44483d811c3 |   3.4.9 |  104 MB |     false |      false |         7 |      91624 |              91624 |        |
      | https://10.0.159.225:2379 | 264c7c58ecbdabee |   3.4.9 |  104 MB |     false |      false |         7 |      91624 |              91624 |        |
      | https://10.0.199.170:2379 | 9ac311f93915cc79 |   3.4.9 |  104 MB |      true |      false |         7 |      91624 |              91624 |        |
      +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

      Based on the IS LEADER column of this output, the https://10.0.199.170:2379 endpoint is the leader. Matching this endpoint with the output of the previous step, the pod name of the leader is etcd-ip-10-0-199-170.example.redhat.com.

  2. Defragment an etcd member.

    1. Connect to the running etcd container, passing in the name of a pod that is not the leader:

      $ oc rsh -n openshift-etcd etcd-ip-10-0-159-225.example.redhat.com
    2. Unset the ETCDCTL_ENDPOINTS environment variable:

      sh-4.4# unset ETCDCTL_ENDPOINTS
    3. Defragment the etcd member:

      sh-4.4# etcdctl --command-timeout=30s --endpoints=https://localhost:2379 defrag
      Example output
      Finished defragmenting etcd member[https://localhost:2379]

      If a timeout error occurs, increase the value for --command-timeout until the command succeeds.

    4. Verify that the database size was reduced:

      sh-4.4# etcdctl endpoint status -w table --cluster
      Example output
      +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      |         ENDPOINT          |        ID        | VERSION | DB SIZE | IS LEADER | IS LEARNER | RAFT TERM | RAFT INDEX | RAFT APPLIED INDEX | ERRORS |
      +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+
      |  https://10.0.191.37:2379 | 251cd44483d811c3 |   3.4.9 |  104 MB |     false |      false |         7 |      91624 |              91624 |        |
      | https://10.0.159.225:2379 | 264c7c58ecbdabee |   3.4.9 |   41 MB |     false |      false |         7 |      91624 |              91624 |        | (1)
      | https://10.0.199.170:2379 | 9ac311f93915cc79 |   3.4.9 |  104 MB |      true |      false |         7 |      91624 |              91624 |        |
      +---------------------------+------------------+---------+---------+-----------+------------+-----------+------------+--------------------+--------+

      This example shows that the database size for this etcd member is now 41 MB as opposed to the starting size of 104 MB.

    5. Repeat these steps to connect to each of the other etcd members and defragment them. Always defragment the leader last.

      Wait at least one minute between defragmentation actions to allow the etcd pod to recover. Until the etcd pod recovers, the etcd member will not respond.

  3. If any NOSPACE alarms were triggered due to the space quota being exceeded, clear them.

    1. Check if there are any NOSPACE alarms:

      sh-4.4# etcdctl alarm list
      Example output
      memberID:12345678912345678912 alarm:NOSPACE
    2. Clear the alarms:

      sh-4.4# etcdctl alarm disarm

Restoring to a previous cluster state

You can use a saved etcd backup to restore back to a previous cluster state. You use the etcd backup to restore a single control plane host. Then the etcd cluster Operator handles scaling to the remaining master hosts.

When you restore your cluster, you must use an etcd backup that was taken from the same z-stream release. For example, an OpenShift Container Platform 4.5.2 cluster must use an etcd backup that was taken from 4.5.2.

Prerequisites
  • Access to the cluster as a user with the cluster-admin role.

  • SSH access to master hosts.

  • A backup directory containing both the etcd snapshot and the resources for the static pods, which were from the same backup. The file names in the directory must be in the following formats: snapshot_<datetimestamp>.db and static_kuberesources_<datetimestamp>.tar.gz.

Procedure
  1. Select a control plane host to use as the recovery host. This is the host that you will run the restore operation on.

  2. Establish SSH connectivity to each of the control plane nodes, including the recovery host.

    The Kubernetes API server becomes inaccessible after the restore process starts, so you cannot access the control plane nodes. For this reason, it is recommended to establish SSH connectivity to each control plane host in a separate terminal.

    If you do not complete this step, you will not be able to access the master hosts to complete the restore procedure, and you will be unable to recover your cluster from this state.

  3. Copy the etcd backup directory to the recovery control plane host.

    This procedure assumes that you copied the backup directory containing the etcd snapshot and the resources for the static pods to the /home/core/ directory of your recovery control plane host.

  4. Stop the static pods on all other control plane nodes.

    It is not required to manually stop the pods on the recovery host. The recovery script will stop the pods on the recovery host.

    1. Access a control plane host that is not the recovery host.

    2. Move the existing etcd pod file out of the kubelet manifest directory:

      [core@ip-10-0-154-194 ~]$ sudo mv /etc/kubernetes/manifests/etcd-pod.yaml /tmp
    3. Verify that the etcd pods are stopped.

      [core@ip-10-0-154-194 ~]$ sudo crictl ps | grep etcd | grep -v operator

      The output of this command should be empty. If it is not empty, wait a few minutes and check again.

    4. Move the existing Kubernetes API server pod file out of the kubelet manifest directory:

      [core@ip-10-0-154-194 ~]$ sudo mv /etc/kubernetes/manifests/kube-apiserver-pod.yaml /tmp
    5. Verify that the Kubernetes API server pods are stopped.

      [core@ip-10-0-154-194 ~]$ sudo crictl ps | grep kube-apiserver | grep -v operator

      The output of this command should be empty. If it is not empty, wait a few minutes and check again.

    6. Move the etcd data directory to a different location:

      [core@ip-10-0-154-194 ~]$ sudo mv /var/lib/etcd/ /tmp
    7. Repeat this step on each of the other master hosts that is not the recovery host.

  5. Access the recovery control plane host.

  6. If the cluster-wide proxy is enabled, be sure that you have exported the NO_PROXY, HTTP_PROXY, and HTTPS_PROXY environment variables.

    You can check whether the proxy is enabled by reviewing the output of oc get proxy cluster -o yaml. The proxy is enabled if the httpProxy, httpsProxy, and noProxy fields have values set.

  7. Run the restore script on the recovery control plane host and pass in the path to the etcd backup directory:

    [core@ip-10-0-143-125 ~]$ sudo -E /usr/local/bin/cluster-restore.sh /home/core/backup
    Example script output
    ...stopping kube-scheduler-pod.yaml
    ...stopping kube-controller-manager-pod.yaml
    ...stopping etcd-pod.yaml
    ...stopping kube-apiserver-pod.yaml
    Waiting for container etcd to stop
    .complete
    Waiting for container etcdctl to stop
    .............................complete
    Waiting for container etcd-metrics to stop
    complete
    Waiting for container kube-controller-manager to stop
    complete
    Waiting for container kube-apiserver to stop
    ..........................................................................................complete
    Waiting for container kube-scheduler to stop
    complete
    Moving etcd data-dir /var/lib/etcd/member to /var/lib/etcd-backup
    starting restore-etcd static pod
    starting kube-apiserver-pod.yaml
    static-pod-resources/kube-apiserver-pod-7/kube-apiserver-pod.yaml
    starting kube-controller-manager-pod.yaml
    static-pod-resources/kube-controller-manager-pod-7/kube-controller-manager-pod.yaml
    starting kube-scheduler-pod.yaml
    static-pod-resources/kube-scheduler-pod-8/kube-scheduler-pod.yaml
  8. Restart the kubelet service on all master hosts.

    1. From the recovery host, run the following command:

      [core@ip-10-0-143-125 ~]$ sudo systemctl restart kubelet.service
    2. Repeat this step on all other master hosts.

  9. Verify that the single member control plane has started successfully.

    1. From the recovery host, verify that the etcd container is running.

      [core@ip-10-0-143-125 ~]$ sudo crictl ps | grep etcd | grep -v operator
      Example output
      3ad41b7908e32       36f86e2eeaaffe662df0d21041eb22b8198e0e58abeeae8c743c3e6e977e8009                                                         About a minute ago   Running             etcd                                          0                   7c05f8af362f0
    2. From the recovery host, verify that the etcd pod is running.

      [core@ip-10-0-143-125 ~]$ oc get pods -n openshift-etcd | grep etcd

      If you attempt to run oc login prior to running this command and receive the following error, wait a few moments for the authentication controllers to start and try again.

      Unable to connect to the server: EOF
      Example output
      NAME                                             READY   STATUS      RESTARTS   AGE
      etcd-ip-10-0-143-125.ec2.internal                1/1     Running     1          2m47s

      If the status is Pending, or the output lists more than one running etcd pod, wait a few minutes and check again.

  10. Force etcd redeployment.

    In a terminal that has access to the cluster as a cluster-admin user, run the following command:

    $ oc patch etcd cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge (1)
    1 The forceRedeploymentReason value must be unique, which is why a timestamp is appended.

    When the etcd cluster Operator performs a redeployment, the existing nodes are started with new pods similar to the initial bootstrap scale up.

  11. Verify all nodes are updated to the latest revision.

    In a terminal that has access to the cluster as a cluster-admin user, run the following command:

    $ oc get etcd -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

    Review the NodeInstallerProgressing status condition for etcd to verify that all nodes are at the latest revision. The output shows AllNodesAtLatestRevision upon successful update:

    AllNodesAtLatestRevision
    3 nodes are at revision 7 (1)
    
    1 In this example, the latest revision number is 7.

    If the output includes multiple revision numbers, such as 2 nodes are at revision 6; 1 nodes are at revision 7, this means that the update is still in progress. Wait a few minutes and try again.

  12. After etcd is redeployed, force new rollouts for the control plane. The Kubernetes API server will reinstall itself on the other nodes because the kubelet is connected to API servers using an internal load balancer.

    In a terminal that has access to the cluster as a cluster-admin user, run the following commands.

    1. Update the kubeapiserver:

      $ oc patch kubeapiserver cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

      Verify all nodes are updated to the latest revision.

      $ oc get kubeapiserver -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

      Review the NodeInstallerProgressing status condition to verify that all nodes are at the latest revision. The output shows AllNodesAtLatestRevision upon successful update:

      AllNodesAtLatestRevision
      3 nodes are at revision 7 (1)
      
      1 In this example, the latest revision number is 7.

      If the output includes multiple revision numbers, such as 2 nodes are at revision 6; 1 nodes are at revision 7, this means that the update is still in progress. Wait a few minutes and try again.

    2. Update the kubecontrollermanager:

      $ oc patch kubecontrollermanager cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

      Verify all nodes are updated to the latest revision.

      $ oc get kubecontrollermanager -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

      Review the NodeInstallerProgressing status condition to verify that all nodes are at the latest revision. The output shows AllNodesAtLatestRevision upon successful update:

      AllNodesAtLatestRevision
      3 nodes are at revision 7 (1)
      
      1 In this example, the latest revision number is 7.

      If the output includes multiple revision numbers, such as 2 nodes are at revision 6; 1 nodes are at revision 7, this means that the update is still in progress. Wait a few minutes and try again.

    3. Update the kubescheduler:

      $ oc patch kubescheduler cluster -p='{"spec": {"forceRedeploymentReason": "recovery-'"$( date --rfc-3339=ns )"'"}}' --type=merge

      Verify all nodes are updated to the latest revision.

      $ oc get kubescheduler -o=jsonpath='{range .items[0].status.conditions[?(@.type=="NodeInstallerProgressing")]}{.reason}{"\n"}{.message}{"\n"}'

      Review the NodeInstallerProgressing status condition to verify that all nodes are at the latest revision. The output shows AllNodesAtLatestRevision upon successful update:

      AllNodesAtLatestRevision
      3 nodes are at revision 7 (1)
      
      1 In this example, the latest revision number is 7.

      If the output includes multiple revision numbers, such as 2 nodes are at revision 6; 1 nodes are at revision 7, this means that the update is still in progress. Wait a few minutes and try again.

  13. Verify that all master hosts have started and joined the cluster.

    In a terminal that has access to the cluster as a cluster-admin user, run the following command:

    $ oc get pods -n openshift-etcd | grep etcd
    Example output
    etcd-ip-10-0-143-125.ec2.internal                2/2     Running     0          9h
    etcd-ip-10-0-154-194.ec2.internal                2/2     Running     0          9h
    etcd-ip-10-0-173-171.ec2.internal                2/2     Running     0          9h

Note that it might take several minutes after completing this procedure for all services to be restored. For example, authentication by using oc login might not immediately work until the OAuth server pods are restarted.

Pod disruption budgets

Understand and configure pod disruption budgets.

Understanding how to use pod disruption budgets to specify the number of pods that must be up

A pod disruption budget is part of the Kubernetes API, which can be managed with oc commands like other object types. They allow the specification of safety constraints on pods during operations, such as draining a node for maintenance.

PodDisruptionBudget is an API object that specifies the minimum number or percentage of replicas that must be up at a time. Setting these in projects can be helpful during node maintenance (such as scaling a cluster down or a cluster upgrade) and is only honored on voluntary evictions (not on node failures).

A PodDisruptionBudget object’s configuration consists of the following key parts:

  • A label selector, which is a label query over a set of pods.

  • An availability level, which specifies the minimum number of pods that must be available simultaneously, either:

    • minAvailable is the number of pods must always be available, even during a disruption.

    • maxUnavailable is the number of pods can be unavailable during a disruption.

A maxUnavailable of 0% or 0 or a minAvailable of 100% or equal to the number of replicas is permitted but can block nodes from being drained.

You can check for pod disruption budgets across all projects with the following:

$ oc get poddisruptionbudget --all-namespaces
Example output
NAMESPACE         NAME          MIN-AVAILABLE   SELECTOR
another-project   another-pdb   4               bar=foo
test-project      my-pdb        2               foo=bar

The PodDisruptionBudget is considered healthy when there are at least minAvailable pods running in the system. Every pod above that limit can be evicted.

Depending on your pod priority and preemption settings, lower-priority pods might be removed despite their pod disruption budget requirements.

Specifying the number of pods that must be up with pod disruption budgets

You can use a PodDisruptionBudget object to specify the minimum number or percentage of replicas that must be up at a time.

Procedure

To configure a pod disruption budget:

  1. Create a YAML file with the an object definition similar to the following:

    apiVersion: policy/v1beta1 (1)
    kind: PodDisruptionBudget
    metadata:
      name: my-pdb
    spec:
      minAvailable: 2  (2)
      selector:  (3)
        matchLabels:
          foo: bar
    1 PodDisruptionBudget is part of the policy/v1beta1 API group.
    2 The minimum number of pods that must be available simultaneously. This can be either an integer or a string specifying a percentage, for example, 20%.
    3 A label query over a set of resources. The result of matchLabels and matchExpressions are logically conjoined.

    Or:

    apiVersion: policy/v1beta1 (1)
    kind: PodDisruptionBudget
    metadata:
      name: my-pdb
    spec:
      maxUnavailable: 25% (2)
      selector: (3)
        matchLabels:
          foo: bar
    1 PodDisruptionBudget is part of the policy/v1beta1 API group.
    2 The maximum number of pods that can be unavailable simultaneously. This can be either an integer or a string specifying a percentage, for example, 20%.
    3 A label query over a set of resources. The result of matchLabels and matchExpressions are logically conjoined.
  2. Run the following command to add the object to project:

    $ oc create -f </path/to/file> -n <project_name>