# subscription-manager register --username=<user_name> --password=<password>
After installing OpenShift Container Platform, you can further expand and customize your cluster to your requirements through certain node tasks.
Understand and work with RHEL compute nodes.
In OpenShift Container Platform 4.9, you have the option of using Red Hat Enterprise Linux (RHEL) machines as compute machines, which are also known as worker machines, in your cluster if you use a user-provisioned infrastructure installation. You must use Red Hat Enterprise Linux CoreOS (RHCOS) machines for the control plane, or master, machines in your cluster.
As with all installations that use user-provisioned infrastructure, if you choose to use RHEL compute machines in your cluster, you take responsibility for all operating system life cycle management and maintenance, including performing system updates, applying patches, and completing all other required tasks.
Because removing OpenShift Container Platform from a machine in the cluster requires destroying the operating system, you must use dedicated hardware for any RHEL machines that you add to the cluster. |
Swap memory is disabled on all RHEL machines that you add to your OpenShift Container Platform cluster. You cannot enable swap memory on these machines. |
You must add any RHEL compute machines to the cluster after you initialize the control plane.
The Red Hat Enterprise Linux (RHEL) compute, or worker, machine hosts in your OpenShift Container Platform environment must meet the following minimum hardware specifications and system-level requirements:
You must have an active OpenShift Container Platform subscription on your Red Hat account. If you do not, contact your sales representative for more information.
Production environments must provide compute machines to support your expected workloads. As a cluster administrator, you must calculate the expected workload and add about 10% for overhead. For production environments, allocate enough resources so that a node host failure does not affect your maximum capacity.
Each system must meet the following hardware requirements:
Physical or virtual system, or an instance running on a public or private IaaS.
Base OS: RHEL 7.9 or RHEL 7.9 through 8.7 with "Minimal" installation option.
Adding RHEL 7 compute machines to an OpenShift Container Platform cluster is deprecated. Deprecated functionality is still included in OpenShift Container Platform and continues to be supported; however, it will be removed in a future release of this product and is not recommended for new deployments. In addition, you cannot upgrade your RHEL 7 compute machines to RHEL 8. You must deploy new RHEL 8 hosts, and the old RHEL 7 hosts should be removed. See the "Deleting nodes" section for more information. For the most recent list of major functionality that has been deprecated or removed within OpenShift Container Platform, refer to the Deprecated and removed features section of the OpenShift Container Platform release notes. |
If you deployed OpenShift Container Platform in FIPS mode, you must enable FIPS on the RHEL machine before you boot it. See Enabling FIPS Mode in the RHEL 7 documentation.
The use of FIPS Validated / Modules in Process cryptographic libraries is only supported on OpenShift Container Platform deployments on the |
NetworkManager 1.0 or later.
1 vCPU.
Minimum 8 GB RAM.
Minimum 15 GB hard disk space for the file system containing /var/
.
Minimum 1 GB hard disk space for the file system containing /usr/local/bin/
.
Minimum 1 GB hard disk space for the file system containing its temporary directory. The temporary system directory is determined according to the rules defined in the tempfile module in the Python standard library.
Each system must meet any additional requirements for your system provider. For example, if you installed your cluster on VMware vSphere, your disks must be configured according to its storage guidelines and the disk.enableUUID=true
attribute must be set.
Each system must be able to access the cluster’s API endpoints by using DNS-resolvable hostnames. Any network security access control that is in place must allow system access to the cluster’s API service endpoints.
Because your cluster has limited access to automatic machine management when you use infrastructure that you provision, you must provide a mechanism for approving cluster certificate signing requests (CSRs) after installation. The kube-controller-manager
only approves the kubelet client CSRs. The machine-approver
cannot guarantee the validity of a serving certificate that is requested by using kubelet credentials because it cannot confirm that the correct machine issued the request. You must determine and implement a method of verifying the validity of the kubelet serving certificate requests and approving them.
Before you can add compute machines that use Red Hat Enterprise Linux (RHEL) as the operating system to an OpenShift Container Platform 4.9 cluster, you must prepare a RHEL 7 machine to run an Ansible playbook that adds the new node to the cluster. This machine is not part of the cluster but must be able to access it.
Install the OpenShift CLI (oc
) on the machine that you run the playbook on.
Log in as a user with cluster-admin
permission.
Ensure that the kubeconfig
file for the cluster and the installation program that you used to install the cluster are on the RHEL 7 machine. One way to accomplish this is to use the same machine that you used to install the cluster.
Configure the machine to access all of the RHEL hosts that you plan to use as compute machines. You can use any method that your company allows, including a bastion with an SSH proxy or a VPN.
Configure a user on the machine that you run the playbook on that has SSH access to all of the RHEL hosts.
If you use SSH key-based authentication, you must manage the key with an SSH agent. |
If you have not already done so, register the machine with RHSM and attach a pool with an OpenShift
subscription to it:
Register the machine with RHSM:
# subscription-manager register --username=<user_name> --password=<password>
Pull the latest subscription data from RHSM:
# subscription-manager refresh
List the available subscriptions:
# subscription-manager list --available --matches '*OpenShift*'
In the output for the previous command, find the pool ID for an OpenShift Container Platform subscription and attach it:
# subscription-manager attach --pool=<pool_id>
Enable the repositories required by OpenShift Container Platform 4.9:
# subscription-manager repos \
--enable="rhel-7-server-rpms" \
--enable="rhel-7-server-extras-rpms" \
--enable="rhel-7-server-ansible-2.9-rpms" \
--enable="rhel-7-server-ose-4.9-rpms"
Install the required packages, including openshift-ansible
:
# yum install openshift-ansible openshift-clients jq
The openshift-ansible
package provides installation program utilities and pulls in other packages that you require to add a RHEL compute node to your cluster, such as Ansible, playbooks, and related configuration files. The openshift-clients
provides the oc
CLI, and the jq
package improves the display of JSON output on your command line.
Before you add a Red Hat Enterprise Linux (RHEL) machine to your OpenShift Container Platform cluster, you must register each host with Red Hat Subscription Manager (RHSM), attach an active OpenShift Container Platform subscription, and enable the required repositories.
On each host, register with RHSM:
# subscription-manager register --username=<user_name> --password=<password>
Pull the latest subscription data from RHSM:
# subscription-manager refresh
List the available subscriptions:
# subscription-manager list --available --matches '*OpenShift*'
In the output for the previous command, find the pool ID for an OpenShift Container Platform subscription and attach it:
# subscription-manager attach --pool=<pool_id>
Disable all yum repositories:
Disable all the enabled RHSM repositories:
# subscription-manager repos --disable="*"
List the remaining yum repositories and note their names under repo id
, if any:
# yum repolist
Use yum-config-manager
to disable the remaining yum repositories:
# yum-config-manager --disable <repo_id>
Alternatively, disable all repositories:
# yum-config-manager --disable \*
Note that this might take a few minutes if you have a large number of available repositories
Enable only the repositories required by OpenShift Container Platform 4.9.
For RHEL 7 nodes, you must enable the following repositories:
# subscription-manager repos \
--enable="rhel-7-server-rpms" \
--enable="rhel-7-fast-datapath-rpms" \
--enable="rhel-7-server-extras-rpms" \
--enable="rhel-7-server-optional-rpms" \
--enable="rhel-7-server-ose-4.9-rpms"
Use of RHEL 7 nodes is deprecated and planned for removal in a future release of OpenShift Container Platform 4. |
For RHEL 8 nodes, you must enable the following repositories:
# subscription-manager repos \
--enable="rhel-8-for-x86_64-baseos-rpms" \
--enable="rhel-8-for-x86_64-appstream-rpms" \
--enable="rhocp-4.9-for-rhel-8-x86_64-rpms" \
--enable="fast-datapath-for-rhel-8-x86_64-rpms"
Stop and disable firewalld on the host:
# systemctl disable --now firewalld.service
You must not enable firewalld later. If you do, you cannot access OpenShift Container Platform logs on the worker. |
You can add compute machines that use Red Hat Enterprise Linux as the operating system to an OpenShift Container Platform 4.9 cluster.
You installed the required packages and performed the necessary configuration on the machine that you run the playbook on.
You prepared the RHEL hosts for installation.
Perform the following steps on the machine that you prepared to run the playbook:
Create an Ansible inventory file that is named /<path>/inventory/hosts
that defines your compute machine hosts and required variables:
[all:vars] ansible_user=root (1) #ansible_become=True (2) openshift_kubeconfig_path="~/.kube/config" (3) [new_workers] (4) mycluster-rhel8-0.example.com mycluster-rhel8-1.example.com
1 | Specify the user name that runs the Ansible tasks on the remote compute machines. |
2 | If you do not specify root for the ansible_user , you must set ansible_become to True and assign the user sudo permissions. |
3 | Specify the path and file name of the kubeconfig file for your cluster. |
4 | List each RHEL machine to add to your cluster. You must provide the fully-qualified domain name for each host. This name is the hostname that the cluster uses to access the machine, so set the correct public or private name to access the machine. |
Navigate to the Ansible playbook directory:
$ cd /usr/share/ansible/openshift-ansible
Run the playbook:
$ ansible-playbook -i /<path>/inventory/hosts playbooks/scaleup.yml (1)
1 | For <path> , specify the path to the Ansible inventory file that you created. |
You must define the following parameters in the Ansible hosts file before you add Red Hat Enterprise Linux (RHEL) compute machines to your cluster.
Paramter | Description | Values |
---|---|---|
|
The SSH user that allows SSH-based authentication without requiring a password. If you use SSH key-based authentication, then you must manage the key with an SSH agent. |
A user name on the system. The default value is |
|
If the values of |
|
|
Specifies a path and file name to a local directory that contains the |
The path and name of the configuration file. |
After you add the Red Hat Enterprise Linux (RHEL) compute machines to your cluster, you can optionally remove the Red Hat Enterprise Linux CoreOS (RHCOS) compute machines to free up resources.
You have added RHEL compute machines to your cluster.
View the list of machines and record the node names of the RHCOS compute machines:
$ oc get nodes -o wide
For each RHCOS compute machine, delete the node:
Mark the node as unschedulable by running the oc adm cordon
command:
$ oc adm cordon <node_name> (1)
1 | Specify the node name of one of the RHCOS compute machines. |
Drain all the pods from the node:
$ oc adm drain <node_name> --force --delete-emptydir-data --ignore-daemonsets (1)
1 | Specify the node name of the RHCOS compute machine that you isolated. |
Delete the node:
$ oc delete nodes <node_name> (1)
1 | Specify the node name of the RHCOS compute machine that you drained. |
Review the list of compute machines to ensure that only the RHEL nodes remain:
$ oc get nodes -o wide
Remove the RHCOS machines from the load balancer for your cluster’s compute machines. You can delete the virtual machines or reimage the physical hardware for the RHCOS compute machines.
You can add more Red Hat Enterprise Linux CoreOS (RHCOS) compute machines to your OpenShift Container Platform cluster on bare metal.
Before you add more compute machines to a cluster that you installed on bare metal infrastructure, you must create RHCOS machines for it to use. You can either use an ISO image or network PXE booting to create the machines.
You installed a cluster on bare metal.
You have installation media and Red Hat Enterprise Linux CoreOS (RHCOS) images that you used to create your cluster. If you do not have these files, you must obtain them by following the instructions in the installation procedure.
You can create more Red Hat Enterprise Linux CoreOS (RHCOS) compute machines for your bare metal cluster by using an ISO image to create the machines.
Obtain the URL of the Ignition config file for the compute machines for your cluster. You uploaded this file to your HTTP server during installation.
Use the ISO file to install RHCOS on more compute machines. Use the same method that you used when you created machines before you installed the cluster:
Burn the ISO image to a disk and boot it directly.
Use ISO redirection with a LOM interface.
After the instance boots, press the TAB
or E
key to edit the kernel command line.
Add the parameters to the kernel command line:
coreos.inst.install_dev=sda (1)
coreos.inst.ignition_url=http://example.com/worker.ign (2)
1 | Specify the block device of the system to install to. |
2 | Specify the URL of the compute Ignition config file. Only HTTP and HTTPS protocols are supported. |
Press Enter
to complete the installation. After RHCOS installs, the system reboots. After the system reboots, it applies the Ignition config file that you specified.
Continue to create more compute machines for your cluster.
You can create more Red Hat Enterprise Linux CoreOS (RHCOS) compute machines for your bare metal cluster by using PXE or iPXE booting.
Obtain the URL of the Ignition config file for the compute machines for your cluster. You uploaded this file to your HTTP server during installation.
Obtain the URLs of the RHCOS ISO image, compressed metal BIOS, kernel
, and initramfs
files that you uploaded to your HTTP server during cluster installation.
You have access to the PXE booting infrastructure that you used to create the machines for your OpenShift Container Platform cluster during installation. The machines must boot from their local disks after RHCOS is installed on them.
If you use UEFI, you have access to the grub.conf
file that you modified during OpenShift Container Platform installation.
Confirm that your PXE or iPXE installation for the RHCOS images is correct.
For PXE:
DEFAULT pxeboot TIMEOUT 20 PROMPT 0 LABEL pxeboot KERNEL http://<HTTP_server>/rhcos-<version>-live-kernel-<architecture> (1) APPEND initrd=http://<HTTP_server>/rhcos-<version>-live-initramfs.<architecture>.img coreos.inst.install_dev=/dev/sda coreos.inst.ignition_url=http://<HTTP_server>/worker.ign coreos.live.rootfs_url=http://<HTTP_server>/rhcos-<version>-live-rootfs.<architecture>.img (2)
1 | Specify the location of the live kernel file that you uploaded to your HTTP server. |
2 | Specify locations of the RHCOS files that you uploaded to your HTTP server. The initrd parameter value is the location of the live initramfs file, the coreos.inst.ignition_url parameter value is the location of the worker Ignition config file, and the coreos.live.rootfs_url parameter value is the location of the live rootfs file. The coreos.inst.ignition_url and coreos.live.rootfs_url parameters only support HTTP and HTTPS. |
This configuration does not enable serial console access on machines with a graphical console. To configure a different console, add one or more console=
arguments to the APPEND
line. For example, add console=tty0 console=ttyS0
to set the first PC serial port as the primary console and the graphical console as a secondary console. For more information, see How does one set up a serial terminal and/or console in Red Hat Enterprise Linux?.
For iPXE:
kernel http://<HTTP_server>/rhcos-<version>-live-kernel-<architecture> initrd=main coreos.inst.install_dev=/dev/sda coreos.inst.ignition_url=http://<HTTP_server>/worker.ign coreos.live.rootfs_url=http://<HTTP_server>/rhcos-<version>-live-rootfs.<architecture>.img (1) initrd --name main http://<HTTP_server>/rhcos-<version>-live-initramfs.<architecture>.img (2)
1 | Specify locations of the RHCOS files that you uploaded to your HTTP server. The kernel parameter value is the location of the kernel file, the initrd=main argument is needed for booting on UEFI systems, the coreos.inst.ignition_url parameter value is the location of the worker Ignition config file, and the coreos.live.rootfs_url parameter value is the location of the live rootfs file. The coreos.inst.ignition_url and coreos.live.rootfs_url parameters only support HTTP and HTTPS. |
2 | Specify the location of the initramfs file that you uploaded to your HTTP server. |
This configuration does not enable serial console access on machines with a graphical console. To configure a different console, add one or more console=
arguments to the kernel
line. For example, add console=tty0 console=ttyS0
to set the first PC serial port as the primary console and the graphical console as a secondary console. For more information, see How does one set up a serial terminal and/or console in Red Hat Enterprise Linux?.
Use the PXE or iPXE infrastructure to create the required compute machines for your cluster.
When you add machines to a cluster, two pending certificate signing requests (CSRs) are generated for each machine that you added. You must confirm that these CSRs are approved or, if necessary, approve them yourself. The client requests must be approved first, followed by the server requests.
You added machines to your cluster.
Confirm that the cluster recognizes the machines:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
master-0 Ready master 63m v1.22.1
master-1 Ready master 63m v1.22.1
master-2 Ready master 64m v1.22.1
The output lists all of the machines that you created.
The preceding output might not include the compute nodes, also known as worker nodes, until some CSRs are approved. |
Review the pending CSRs and ensure that you see the client requests with the Pending
or Approved
status for each machine that you added to the cluster:
$ oc get csr
NAME AGE REQUESTOR CONDITION
csr-8b2br 15m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
csr-8vnps 15m system:serviceaccount:openshift-machine-config-operator:node-bootstrapper Pending
...
In this example, two machines are joining the cluster. You might see more approved CSRs in the list.
If the CSRs were not approved, after all of the pending CSRs for the machines you added are in Pending
status, approve the CSRs for your cluster machines:
Because the CSRs rotate automatically, approve your CSRs within an hour of adding the machines to the cluster. If you do not approve them within an hour, the certificates will rotate, and more than two certificates will be present for each node. You must approve all of these certificates. After the client CSR is approved, the Kubelet creates a secondary CSR for the serving certificate, which requires manual approval. Then, subsequent serving certificate renewal requests are automatically approved by the |
For clusters running on platforms that are not machine API enabled, such as bare metal and other user-provisioned infrastructure, you must implement a method of automatically approving the kubelet serving certificate requests (CSRs). If a request is not approved, then the |
To approve them individually, run the following command for each valid CSR:
$ oc adm certificate approve <csr_name> (1)
1 | <csr_name> is the name of a CSR from the list of current CSRs. |
To approve all pending CSRs, run the following command:
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs --no-run-if-empty oc adm certificate approve
Some Operators might not become available until some CSRs are approved. |
Now that your client requests are approved, you must review the server requests for each machine that you added to the cluster:
$ oc get csr
NAME AGE REQUESTOR CONDITION
csr-bfd72 5m26s system:node:ip-10-0-50-126.us-east-2.compute.internal Pending
csr-c57lv 5m26s system:node:ip-10-0-95-157.us-east-2.compute.internal Pending
...
If the remaining CSRs are not approved, and are in the Pending
status, approve the CSRs for your cluster machines:
To approve them individually, run the following command for each valid CSR:
$ oc adm certificate approve <csr_name> (1)
1 | <csr_name> is the name of a CSR from the list of current CSRs. |
To approve all pending CSRs, run the following command:
$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
After all client and server CSRs have been approved, the machines have the Ready
status. Verify this by running the following command:
$ oc get nodes
NAME STATUS ROLES AGE VERSION
master-0 Ready master 73m v1.22.1
master-1 Ready master 73m v1.22.1
master-2 Ready master 74m v1.22.1
worker-0 Ready worker 11m v1.22.1
worker-1 Ready worker 11m v1.22.1
It can take a few minutes after approval of the server CSRs for the machines to transition to the |
For more information on CSRs, see Certificate Signing Requests.
Understand and deploy machine health checks.
This process is not applicable for clusters with manually provisioned machines. You can use the advanced machine management and scaling capabilities only in clusters where the Machine API is operational. |
Machine health checks automatically repair unhealthy machines in a particular machine pool.
To monitor machine health, create a resource to define the configuration for a controller. Set a condition to check, such as staying in the NotReady
status for five minutes or displaying a permanent condition in the node-problem-detector, and a label for the set of machines to monitor.
You cannot apply a machine health check to a machine with the master role. |
The controller that observes a MachineHealthCheck
resource checks for the defined condition. If a machine fails the health check, the machine is automatically deleted and one is created to take its place. When a machine is deleted, you see a machine deleted
event.
To limit disruptive impact of the machine deletion, the controller drains and deletes only one node at a time. If there are more unhealthy machines than the maxUnhealthy
threshold allows for in the targeted pool of machines, remediation stops and therefore enables manual intervention.
Consider the timeouts carefully, accounting for workloads and requirements.
|
To stop the check, remove the resource.
There are limitations to consider before deploying a machine health check:
Only machines owned by a machine set are remediated by a machine health check.
Control plane machines are not currently supported and are not remediated if they are unhealthy.
If the node for a machine is removed from the cluster, a machine health check considers the machine to be unhealthy and remediates it immediately.
If the corresponding node for a machine does not join the cluster after the nodeStartupTimeout
, the machine is remediated.
A machine is remediated immediately if the Machine
resource phase is Failed
.
The MachineHealthCheck
resource for all cloud-based installation types, and other than bare metal, resembles the following YAML file:
apiVersion: machine.openshift.io/v1beta1
kind: MachineHealthCheck
metadata:
name: example (1)
namespace: openshift-machine-api
spec:
selector:
matchLabels:
machine.openshift.io/cluster-api-machine-role: <role> (2)
machine.openshift.io/cluster-api-machine-type: <role> (2)
machine.openshift.io/cluster-api-machineset: <cluster_name>-<label>-<zone> (3)
unhealthyConditions:
- type: "Ready"
timeout: "300s" (4)
status: "False"
- type: "Ready"
timeout: "300s" (4)
status: "Unknown"
maxUnhealthy: "40%" (5)
nodeStartupTimeout: "10m" (6)
1 | Specify the name of the machine health check to deploy. |
2 | Specify a label for the machine pool that you want to check. |
3 | Specify the machine set to track in <cluster_name>-<label>-<zone> format. For example, prod-node-us-east-1a . |
4 | Specify the timeout duration for a node condition. If a condition is met for the duration of the timeout, the machine will be remediated. Long timeouts can result in long periods of downtime for a workload on an unhealthy machine. |
5 | Specify the amount of machines allowed to be concurrently remediated in the targeted pool. This can be set as a percentage or an integer. If the number of unhealthy machines exceeds the limit set by maxUnhealthy , remediation is not performed. |
6 | Specify the timeout duration that a machine health check must wait for a node to join the cluster before a machine is determined to be unhealthy. |
The |
Short circuiting ensures that machine health checks remediate machines only when the cluster is healthy.
Short-circuiting is configured through the maxUnhealthy
field in the MachineHealthCheck
resource.
If the user defines a value for the maxUnhealthy
field,
before remediating any machines, the MachineHealthCheck
compares the value of maxUnhealthy
with the number of machines within its target pool that it has determined to be unhealthy.
Remediation is not performed if the number of unhealthy machines exceeds the maxUnhealthy
limit.
If |
The appropriate maxUnhealthy
value depends on the scale of the cluster you deploy and how many machines the MachineHealthCheck
covers. For example, you can use the maxUnhealthy
value to cover multiple machine sets across multiple availability zones so that if you lose an entire zone, your maxUnhealthy
setting prevents further remediation within the cluster.
The maxUnhealthy
field can be set as either an integer or percentage.
There are different remediation implementations depending on the maxUnhealthy
value.
maxUnhealthy
by using an absolute valueIf maxUnhealthy
is set to 2
:
Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy
These values are independent of how many machines are being checked by the machine health check.
maxUnhealthy
by using percentagesIf maxUnhealthy
is set to 40%
and there are 25 machines being checked:
Remediation will be performed if 10 or fewer nodes are unhealthy
Remediation will not be performed if 11 or more nodes are unhealthy
If maxUnhealthy
is set to 40%
and there are 6 machines being checked:
Remediation will be performed if 2 or fewer nodes are unhealthy
Remediation will not be performed if 3 or more nodes are unhealthy
The allowed number of machines is rounded down when the percentage of |
You can create a MachineHealthCheck
resource for all MachineSets
in your cluster.
You should not create a MachineHealthCheck
resource that targets control plane machines.
Install the oc
command line interface.
Create a healthcheck.yml
file that contains the definition of your machine health check.
Apply the healthcheck.yml
file to your cluster:
$ oc apply -f healthcheck.yml
To add or remove an instance of a machine in a machine set, you can manually scale the machine set.
This guidance is relevant to fully automated, installer-provisioned infrastructure installations. Customized, user-provisioned infrastructure installations do not have machine sets.
Install an OpenShift Container Platform cluster and the oc
command line.
Log in to oc
as a user with cluster-admin
permission.
View the machine sets that are in the cluster:
$ oc get machinesets -n openshift-machine-api
The machine sets are listed in the form of <clusterid>-worker-<aws-region-az>
.
View the machines that are in the cluster:
$ oc get machine -n openshift-machine-api
Set the annotation on the machine that you want to delete:
$ oc annotate machine/<machine_name> -n openshift-machine-api machine.openshift.io/cluster-api-delete-machine="true"
Cordon and drain the node that you want to delete:
$ oc adm cordon <node_name>
$ oc adm drain <node_name>
Scale the machine set:
$ oc scale --replicas=2 machineset <machineset> -n openshift-machine-api
Or:
$ oc edit machineset <machineset> -n openshift-machine-api
You can alternatively apply the following YAML to scale the machine set:
|
You can scale the machine set up or down. It takes several minutes for the new machines to be available.
Verify the deletion of the intended machine:
$ oc get machines
MachineSet
objects describe OpenShift Container Platform nodes with respect to the cloud or machine provider.
The MachineConfigPool
object allows MachineConfigController
components to define and provide the status of machines in the context of upgrades.
The MachineConfigPool
object allows users to configure how upgrades are rolled out to the OpenShift Container Platform nodes in the machine config pool.
The NodeSelector
object can be replaced with a reference to the MachineSet
object.
The OpenShift Container Platform node configuration file contains important options. For
example, two parameters control the maximum number of pods that can be scheduled
to a node: podsPerCore
and maxPods
.
When both options are in use, the lower of the two values limits the number of pods on a node. Exceeding these values can result in:
Increased CPU utilization.
Slow pod scheduling.
Potential out-of-memory scenarios, depending on the amount of memory in the node.
Exhausting the pool of IP addresses.
Resource overcommitting, leading to poor user application performance.
In Kubernetes, a pod that is holding a single container actually uses two containers. The second container is used to set up networking prior to the actual container starting. Therefore, a system running 10 pods will actually have 20 containers running. |
Disk IOPS throttling from the cloud provider might have an impact on CRI-O and kubelet. They might get overloaded when there are large number of I/O intensive pods running on the nodes. It is recommended that you monitor the disk I/O on the nodes and use volumes with sufficient throughput for the workload. |
podsPerCore
sets the number of pods the node can run based on the number of
processor cores on the node. For example, if podsPerCore
is set to 10
on a
node with 4 processor cores, the maximum number of pods allowed on the node will
be 40
.
kubeletConfig:
podsPerCore: 10
Setting podsPerCore
to 0
disables this limit. The default is 0
.
podsPerCore
cannot exceed maxPods
.
maxPods
sets the number of pods the node can run to a fixed value, regardless
of the properties of the node.
kubeletConfig:
maxPods: 250
The kubelet configuration is currently serialized as an Ignition configuration, so it can be directly edited. However, there is also a new kubelet-config-controller
added to the Machine Config Controller (MCC). This lets you use a KubeletConfig
custom resource (CR) to edit the kubelet parameters.
As the fields in the |
Consider the following guidance:
Create one KubeletConfig
CR for each machine config pool with all the config changes you want for that pool. If you are applying the same content to all of the pools, you need only one KubeletConfig
CR for all of the pools.
Edit an existing KubeletConfig
CR to modify existing settings or add new settings, instead of creating a CR for each change. It is recommended that you create a CR only to modify a different machine config pool, or for changes that are intended to be temporary, so that you can revert the changes.
As needed, create multiple KubeletConfig
CRs with a limit of 10 per cluster. For the first KubeletConfig
CR, the Machine Config Operator (MCO) creates a machine config appended with kubelet
. With each subsequent CR, the controller creates another kubelet
machine config with a numeric suffix. For example, if you have a kubelet
machine config with a -2
suffix, the next kubelet
machine config is appended with -3
.
If you want to delete the machine configs, delete them in reverse order to avoid exceeding the limit. For example, you delete the kubelet-3
machine config before deleting the kubelet-2
machine config.
If you have a machine config with a |
KubeletConfig
CR$ oc get kubeletconfig
NAME AGE
set-max-pods 15m
KubeletConfig
machine config$ oc get mc | grep kubelet
...
99-worker-generated-kubelet-1 b5c5119de007945b6fe6fb215db3b8e2ceb12511 3.2.0 26m
...
The following procedure is an example to show how to configure the maximum number of pods per node on the worker nodes.
Obtain the label associated with the static MachineConfigPool
CR for the type of node you want to configure.
Perform one of the following steps:
View the machine config pool:
$ oc describe machineconfigpool <name>
For example:
$ oc describe machineconfigpool worker
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
creationTimestamp: 2019-02-08T14:52:39Z
generation: 1
labels:
custom-kubelet: set-max-pods (1)
1 | If a label has been added it appears under labels . |
If the label is not present, add a key/value pair:
$ oc label machineconfigpool worker custom-kubelet=set-max-pods
View the available machine configuration objects that you can select:
$ oc get machineconfig
By default, the two kubelet-related configs are 01-master-kubelet
and 01-worker-kubelet
.
Check the current value for the maximum pods per node:
$ oc describe node <node_name>
For example:
$ oc describe node ci-ln-5grqprb-f76d1-ncnqq-worker-a-mdv94
Look for value: pods: <value>
in the Allocatable
stanza:
Allocatable:
attachable-volumes-aws-ebs: 25
cpu: 3500m
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 15341844Ki
pods: 250
Set the maximum pods per node on the worker nodes by creating a custom resource file that contains the kubelet configuration:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: set-max-pods
spec:
machineConfigPoolSelector:
matchLabels:
custom-kubelet: set-max-pods (1)
kubeletConfig:
maxPods: 500 (2)
1 | Enter the label from the machine config pool. |
2 | Add the kubelet configuration. In this example, use maxPods to set the maximum pods per node. |
The rate at which the kubelet talks to the API server depends on queries per second (QPS) and burst values. The default values,
|
Update the machine config pool for workers with the label:
$ oc label machineconfigpool worker custom-kubelet=large-pods
Create the KubeletConfig
object:
$ oc create -f change-maxPods-cr.yaml
Verify that the KubeletConfig
object is created:
$ oc get kubeletconfig
NAME AGE
set-max-pods 15m
Depending on the number of worker nodes in the cluster, wait for the worker nodes to be rebooted one by one. For a cluster with 3 worker nodes, this could take about 10 to 15 minutes.
Verify that the changes are applied to the node:
Check on a worker node that the maxPods
value changed:
$ oc describe node <node_name>
Locate the Allocatable
stanza:
...
Allocatable:
attachable-volumes-gce-pd: 127
cpu: 3500m
ephemeral-storage: 123201474766
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 14225400Ki
pods: 500 (1)
...
1 | In this example, the pods parameter should report the value you set in the KubeletConfig object. |
Verify the change in the KubeletConfig
object:
$ oc get kubeletconfigs set-max-pods -o yaml
This should show a status of True
and type:Success
, as shown in the following example:
spec:
kubeletConfig:
maxPods: 500
machineConfigPoolSelector:
matchLabels:
custom-kubelet: set-max-pods
status:
conditions:
- lastTransitionTime: "2021-06-30T17:04:07Z"
message: Success
status: "True"
type: Success
By default, only one machine is allowed to be unavailable when applying the kubelet-related configuration to the available worker nodes. For a large cluster, it can take a long time for the configuration change to be reflected. At any time, you can adjust the number of machines that are updating to speed up the process.
Edit the worker
machine config pool:
$ oc edit machineconfigpool worker
Set maxUnavailable
to the value that you want:
spec:
maxUnavailable: <node_count>
When setting the value, consider the number of worker nodes that can be unavailable without affecting the applications running on the cluster. |
The control plane node resource requirements depend on the number of nodes in the cluster. The following control plane node size recommendations are based on the results of control plane density focused testing. The control plane tests create the following objects across the cluster in each of the namespaces depending on the node counts:
12 image streams
3 build configurations
6 builds
1 deployment with 2 pod replicas mounting two secrets each
2 deployments with 1 pod replica mounting two secrets
3 services pointing to the previous deployments
3 routes pointing to the previous deployments
10 secrets, 2 of which are mounted by the previous deployments
10 config maps, 2 of which are mounted by the previous deployments
Number of worker nodes | Cluster load (namespaces) | CPU cores | Memory (GB) |
---|---|---|---|
25 |
500 |
4 |
16 |
100 |
1000 |
8 |
32 |
250 |
4000 |
16 |
96 |
On a large and dense cluster with three masters or control plane nodes, the CPU and memory usage will spike up when one of the nodes is stopped, rebooted or fails. The failures can be due to unexpected issues with power, network or underlying infrastructure in addition to intentional cases where the cluster is restarted after shutting it down to save costs. The remaining two control plane nodes must handle the load in order to be highly available which leads to increase in the resource usage. This is also expected during upgrades because the masters are cordoned, drained, and rebooted serially to apply the operating system updates, as well as the control plane Operators update. To avoid cascading failures, keep the overall CPU and memory resource usage on the control plane nodes to at most 60% of all available capacity to handle the resource usage spikes. Increase the CPU and memory on the control plane nodes accordingly to avoid potential downtime due to lack of resources.
The node sizing varies depending on the number of nodes and object counts in the cluster. It also depends on whether the objects are actively being created on the cluster. During object creation, the control plane is more active in terms of resource usage compared to when the objects are in the |
Operator Lifecycle Manager (OLM ) runs on the control plane nodes and it’s memory footprint depends on the number of namespaces and user installed operators that OLM needs to manage on the cluster. Control plane nodes need to be sized accordingly to avoid OOM kills. Following data points are based on the results from cluster maximums testing.
Number of namespaces | OLM memory at idle state (GB) | OLM memory with 5 user operators installed (GB) |
---|---|---|
500 |
0.823 |
1.7 |
1000 |
1.2 |
2.5 |
1500 |
1.7 |
3.2 |
2000 |
2 |
4.4 |
3000 |
2.7 |
5.6 |
4000 |
3.8 |
7.6 |
5000 |
4.2 |
9.02 |
6000 |
5.8 |
11.3 |
7000 |
6.6 |
12.9 |
8000 |
6.9 |
14.8 |
9000 |
8 |
17.7 |
10,000 |
9.9 |
21.6 |
If you used an installer-provisioned infrastructure installation method, you cannot modify the control plane node size in a running OpenShift Container Platform 4.9 cluster. Instead, you must estimate your total node count and use the suggested control plane node size during installation. |
The recommendations are based on the data points captured on OpenShift Container Platform clusters with OpenShift SDN as the network plugin. |
In OpenShift Container Platform 4.9, half of a CPU core (500 millicore) is now reserved by the system by default compared to OpenShift Container Platform 3.11 and previous versions. The sizes are determined taking that into consideration. |
Optional: Label a node:
# oc label node perf-node.example.com cpumanager=true
Edit the MachineConfigPool
of the nodes where CPU Manager should be enabled. In this example, all workers have CPU Manager enabled:
# oc edit machineconfigpool worker
Add a label to the worker machine config pool:
metadata:
creationTimestamp: 2020-xx-xxx
generation: 3
labels:
custom-kubelet: cpumanager-enabled
Create a KubeletConfig
, cpumanager-kubeletconfig.yaml
, custom resource (CR). Refer to the label created in the previous step to have the correct nodes updated with the new kubelet config. See the machineConfigPoolSelector
section:
apiVersion: machineconfiguration.openshift.io/v1
kind: KubeletConfig
metadata:
name: cpumanager-enabled
spec:
machineConfigPoolSelector:
matchLabels:
custom-kubelet: cpumanager-enabled
kubeletConfig:
cpuManagerPolicy: static (1)
cpuManagerReconcilePeriod: 5s (2)
1 | Specify a policy:
|
2 | Optional. Specify the CPU Manager reconcile frequency. The default is 5s . |
Create the dynamic kubelet config:
# oc create -f cpumanager-kubeletconfig.yaml
This adds the CPU Manager feature to the kubelet config and, if needed, the Machine Config Operator (MCO) reboots the node. To enable CPU Manager, a reboot is not needed.
Check for the merged kubelet config:
# oc get machineconfig 99-worker-XXXXXX-XXXXX-XXXX-XXXXX-kubelet -o json | grep ownerReference -A7
"ownerReferences": [
{
"apiVersion": "machineconfiguration.openshift.io/v1",
"kind": "KubeletConfig",
"name": "cpumanager-enabled",
"uid": "7ed5616d-6b72-11e9-aae1-021e1ce18878"
}
]
Check the worker for the updated kubelet.conf
:
# oc debug node/perf-node.example.com
sh-4.2# cat /host/etc/kubernetes/kubelet.conf | grep cpuManager
cpuManagerPolicy: static (1)
cpuManagerReconcilePeriod: 5s (1)
1 | These settings were defined when you created the KubeletConfig CR. |
Create a pod that requests a core or multiple cores. Both limits and requests must have their CPU value set to a whole integer. That is the number of cores that will be dedicated to this pod:
# cat cpumanager-pod.yaml
apiVersion: v1
kind: Pod
metadata:
generateName: cpumanager-
spec:
containers:
- name: cpumanager
image: gcr.io/google_containers/pause-amd64:3.0
resources:
requests:
cpu: 1
memory: "1G"
limits:
cpu: 1
memory: "1G"
nodeSelector:
cpumanager: "true"
Create the pod:
# oc create -f cpumanager-pod.yaml
Verify that the pod is scheduled to the node that you labeled:
# oc describe pod cpumanager
Name: cpumanager-6cqz7
Namespace: default
Priority: 0
PriorityClassName: <none>
Node: perf-node.example.com/xxx.xx.xx.xxx
...
Limits:
cpu: 1
memory: 1G
Requests:
cpu: 1
memory: 1G
...
QoS Class: Guaranteed
Node-Selectors: cpumanager=true
Verify that the cgroups
are set up correctly. Get the process ID (PID) of the pause
process:
# ├─init.scope
│ └─1 /usr/lib/systemd/systemd --switched-root --system --deserialize 17
└─kubepods.slice
├─kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice
│ ├─crio-b5437308f1a574c542bdf08563b865c0345c8f8c0b0a655612c.scope
│ └─32706 /pause
Pods of quality of service (QoS) tier Guaranteed
are placed within the kubepods.slice
. Pods of other QoS tiers end up in child cgroups
of kubepods
:
# cd /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-pod69c01f8e_6b74_11e9_ac0f_0a2b62178a22.slice/crio-b5437308f1ad1a7db0574c542bdf08563b865c0345c86e9585f8c0b0a655612c.scope
# for i in `ls cpuset.cpus tasks` ; do echo -n "$i "; cat $i ; done
cpuset.cpus 1
tasks 32706
Check the allowed CPU list for the task:
# grep ^Cpus_allowed_list /proc/32706/status
Cpus_allowed_list: 1
Verify that another pod (in this case, the pod in the burstable
QoS tier) on the system cannot run on the core allocated for the Guaranteed
pod:
# cat /sys/fs/cgroup/cpuset/kubepods.slice/kubepods-besteffort.slice/kubepods-besteffort-podc494a073_6b77_11e9_98c0_06bba5c387ea.slice/crio-c56982f57b75a2420947f0afc6cafe7534c5734efc34157525fa9abbf99e3849.scope/cpuset.cpus
0
# oc describe node perf-node.example.com
...
Capacity:
attachable-volumes-aws-ebs: 39
cpu: 2
ephemeral-storage: 124768236Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 8162900Ki
pods: 250
Allocatable:
attachable-volumes-aws-ebs: 39
cpu: 1500m
ephemeral-storage: 124768236Ki
hugepages-1Gi: 0
hugepages-2Mi: 0
memory: 7548500Ki
pods: 250
------- ---- ------------ ---------- --------------- ------------- ---
default cpumanager-6cqz7 1 (66%) 1 (66%) 1G (12%) 1G (12%) 29m
Allocated resources:
(Total limits may be over 100 percent, i.e., overcommitted.)
Resource Requests Limits
-------- -------- ------
cpu 1440m (96%) 1 (66%)
This VM has two CPU cores. The system-reserved
setting reserves 500 millicores, meaning that half of one core is subtracted from the total capacity of the node to arrive at the Node Allocatable
amount. You can see that Allocatable CPU
is 1500 millicores. This means you can run one of the CPU Manager pods since each will take one whole core. A whole core is equivalent to 1000 millicores. If you try to schedule a second pod, the system will accept the pod, but it will never be scheduled:
NAME READY STATUS RESTARTS AGE
cpumanager-6cqz7 1/1 Running 0 33m
cpumanager-7qc2t 0/1 Pending 0 11s
Understand and configure huge pages.
Memory is managed in blocks known as pages. On most systems, a page is 4Ki. 1Mi of memory is equal to 256 pages; 1Gi of memory is 256,000 pages, and so on. CPUs have a built-in memory management unit that manages a list of these pages in hardware. The Translation Lookaside Buffer (TLB) is a small hardware cache of virtual-to-physical page mappings. If the virtual address passed in a hardware instruction can be found in the TLB, the mapping can be determined quickly. If not, a TLB miss occurs, and the system falls back to slower, software-based address translation, resulting in performance issues. Since the size of the TLB is fixed, the only way to reduce the chance of a TLB miss is to increase the page size.
A huge page is a memory page that is larger than 4Ki. On x86_64 architectures, there are two common huge page sizes: 2Mi and 1Gi. Sizes vary on other architectures. To use huge pages, code must be written so that applications are aware of them. Transparent Huge Pages (THP) attempt to automate the management of huge pages without application knowledge, but they have limitations. In particular, they are limited to 2Mi page sizes. THP can lead to performance degradation on nodes with high memory utilization or fragmentation due to defragmenting efforts of THP, which can lock memory pages. For this reason, some applications may be designed to (or recommend) usage of pre-allocated huge pages instead of THP.
Nodes must pre-allocate huge pages in order for the node to report its huge page capacity. A node can only pre-allocate huge pages for a single size.
Huge pages can be consumed through container-level resource requirements using the
resource name hugepages-<size>
, where size is the most compact binary
notation using integer values supported on a particular node. For example, if a
node supports 2048KiB page sizes, it exposes a schedulable resource
hugepages-2Mi
. Unlike CPU or memory, huge pages do not support over-commitment.
apiVersion: v1
kind: Pod
metadata:
generateName: hugepages-volume-
spec:
containers:
- securityContext:
privileged: true
image: rhel7:latest
command:
- sleep
- inf
name: example
volumeMounts:
- mountPath: /dev/hugepages
name: hugepage
resources:
limits:
hugepages-2Mi: 100Mi (1)
memory: "1Gi"
cpu: "1"
volumes:
- name: hugepage
emptyDir:
medium: HugePages
1 | Specify the amount of memory for hugepages as the exact amount to be
allocated. Do not specify this value as the amount of memory for hugepages
multiplied by the size of the page. For example, given a huge page size of 2MB,
if you want to use 100MB of huge-page-backed RAM for your application, then you
would allocate 50 huge pages. OpenShift Container Platform handles the math for you. As in
the above example, you can specify 100MB directly. |
Allocating huge pages of a specific size
Some platforms support multiple huge page sizes. To allocate huge pages of a
specific size, precede the huge pages boot command parameters with a huge page
size selection parameter hugepagesz=<size>
. The <size>
value must be
specified in bytes with an optional scale suffix [kKmMgG
]. The default huge
page size can be defined with the default_hugepagesz=<size>
boot parameter.
Huge page requirements
Huge page requests must equal the limits. This is the default if limits are specified, but requests are not.
Huge pages are isolated at a pod scope. Container isolation is planned in a future iteration.
EmptyDir
volumes backed by huge pages must not consume more huge page memory
than the pod request.
Applications that consume huge pages via shmget()
with SHM_HUGETLB
must run
with a supplemental group that matches proc/sys/vm/hugetlb_shm_group.
Nodes must pre-allocate huge pages used in an OpenShift Container Platform cluster. There are two ways of reserving huge pages: at boot time and at run time. Reserving at boot time increases the possibility of success because the memory has not yet been significantly fragmented. The Node Tuning Operator currently supports boot time allocation of huge pages on specific nodes.
To minimize node reboots, the order of the steps below needs to be followed:
Label all nodes that need the same huge pages setting by a label.
$ oc label node <node_using_hugepages> node-role.kubernetes.io/worker-hp=
Create a file with the following content and name it hugepages-tuned-boottime.yaml
:
apiVersion: tuned.openshift.io/v1
kind: Tuned
metadata:
name: hugepages (1)
namespace: openshift-cluster-node-tuning-operator
spec:
profile: (2)
- data: |
[main]
summary=Boot time configuration for hugepages
include=openshift-node
[bootloader]
cmdline_openshift_node_hugepages=hugepagesz=2M hugepages=50 (3)
name: openshift-node-hugepages
recommend:
- machineConfigLabels: (4)
machineconfiguration.openshift.io/role: "worker-hp"
priority: 30
profile: openshift-node-hugepages
1 | Set the name of the Tuned resource to hugepages . |
2 | Set the profile section to allocate huge pages. |
3 | Note the order of parameters is important as some platforms support huge pages of various sizes. |
4 | Enable machine config pool based matching. |
Create the Tuned hugepages
object
$ oc create -f hugepages-tuned-boottime.yaml
Create a file with the following content and name it hugepages-mcp.yaml
:
apiVersion: machineconfiguration.openshift.io/v1
kind: MachineConfigPool
metadata:
name: worker-hp
labels:
worker-hp: ""
spec:
machineConfigSelector:
matchExpressions:
- {key: machineconfiguration.openshift.io/role, operator: In, values: [worker,