×

Review and action cluster notifications

Cluster notifications are messages about the status, health, or performance of your cluster.

Cluster notifications are the primary way that Red Hat Site Reliability Engineering (SRE) communicates with you about the health of your managed cluster. SRE may also use cluster notifications to prompt you to perform an action in order to resolve or prevent an issue with your cluster.

Cluster owners and administrators must regularly review and action cluster notifications to ensure clusters remain healthy and supported.

You can view cluster notifications in the Red Hat Hybrid Cloud Console, in the Cluster history tab for your cluster. By default, only the cluster owner receives cluster notifications as emails. If other users need to receive cluster notification emails, add each user as a notification contact for your cluster.

Cluster notification policy

Cluster notifications are designed to keep you informed about the health of your cluster and high impact events that affect it.

Most cluster notifications are generated and sent automatically to ensure that you are immediately informed of problems or important changes to the state of your cluster.

In certain situations, Red Hat Site Reliability Engineering (SRE) creates and sends cluster notifications to provide additional context and guidance for a complex issue.

Cluster notifications are not sent for low-impact events, low-risk security updates, routine operations and maintenance, or minor, transient issues that are quickly resolved by SRE.

Red Hat services automatically send notifications when:

  • Remote health monitoring or environment verification checks detect an issue in your cluster, for example, when a worker node has low disk space.

  • Significant cluster life cycle events occur, for example, when scheduled maintenance or upgrades begin, or cluster operations are impacted by an event, but do not require customer intervention.

  • Significant cluster management changes occur, for example, when cluster ownership or administrative control is transferred from one user to another.

  • Your cluster subscription is changed or updated, for example, when Red Hat makes updates to subscription terms or features available to your cluster.

SRE creates and sends notifications when:

  • An incident results in a degradation or outage that impacts your cluster’s availability or performance, for example, your cloud provider has a regional outage. SRE sends subsequent notifications to inform you of incident resolution progress, and when the incident is resolved.

  • A security vulnerability, security breach, or unusual activity is detected on your cluster.

  • Red Hat detects that changes you have made are creating or may result in cluster instability.

  • Red Hat detects that your workloads are causing performance degradation or instability in your cluster.

Incident and operations management

This documentation details the Red Hat responsibilities for the OpenShift Dedicated managed service. The cloud provider is responsible for protecting the hardware infrastructure that runs the services offered by the cloud provider. The customer is responsible for incident and operations management of customer application data and any custom networking the customer has configured for the cluster network or virtual network.

Platform monitoring

A Red Hat Site Reliability Engineer (SRE) maintains a centralized monitoring and alerting system for all OpenShift Dedicated cluster components, SRE services, and underlying cloud provider accounts. Platform audit logs are securely forwarded to a centralized SIEM (Security Information and Event Monitoring) system, where they might trigger configured alerts to the SRE team and are also subject to manual review. Audit logs are retained in the SIEM for one year. Audit logs for a given cluster are not deleted at the time the cluster is deleted.

Incident management

An incident is an event that results in a degradation or outage of one or more Red Hat services. An incident can be raised by a customer or Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.

Depending on the impact on the service and customer, the incident is categorized in terms of severity.

The general workflow of how a new incident is managed by Red Hat:

  1. An SRE first responder is alerted to a new incident, and begins an initial investigation.

  2. After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.

  3. The incident lead manages all communication and coordination around recovery, including any relevant notifications or support case updates.

  4. The incident is recovered.

  5. The incident is documented and a root cause analysis is performed within 5 business days of the incident.

  6. A root cause analysis (RCA) draft document is shared with the customer within 7 business days of the incident.

Backup and recovery

All OpenShift Dedicated clusters are backed up using cloud provider snapshots. Notably, this does not include customer data stored on persistent volumes (PVs). All snapshots are taken using the appropriate cloud provider snapshot APIs and are uploaded to a secure object storage bucket (S3 in AWS, and GCS in Google Cloud) in the same account as the cluster.

Component Snapshot frequency Retention Notes

Full object store backup

Daily

7 days

This is a full backup of all Kubernetes objects like etcd. No PVs are backed up in this backup schedule.

Weekly

30 days

Full object store backup

Hourly

24 hour

This is a full backup of all Kubernetes objects like etcd. No PVs are backed up in this backup schedule.

Node root volume

Never

N/A

Nodes are considered to be short-term. Nothing critical should be stored on a node’s root volume.

  • Red Hat does not commit to any Recovery Point Objective (RPO) or Recovery Time Objective (RTO).

  • Customers are responsible for taking regular backups of their data

  • Customers should deploy multi-AZ clusters with workloads that follow Kubernetes best practices to ensure high availability within a region.

  • If an entire cloud region is unavailable, customers must install a new cluster in a different region and restore their apps using their backup data.

Cluster capacity

Evaluating and managing cluster capacity is a responsibility that is shared between Red Hat and the customer. Red Hat SRE is responsible for the capacity of all control plane and infrastructure nodes on the cluster.

Red Hat SRE also evaluates cluster capacity during upgrades and in response to cluster alerts. The impact of a cluster upgrade on capacity is evaluated as part of the upgrade testing process to ensure that capacity is not negatively impacted by new additions to the cluster. During a cluster upgrade, additional worker nodes are added to make sure that total cluster capacity is maintained during the upgrade process.

Capacity evaluations by SRE staff also happen in response to alerts from the cluster, once usage thresholds are exceeded for a certain period of time. Such alerts can also result in a notification to the customer.

Change management

This section describes the policies about how cluster and configuration changes, patches, and releases are managed.

Customer-initiated changes

You can initiate changes using self-service capabilities such as cluster deployment, worker node scaling, or cluster deletion.

Change history is captured in the Cluster History section in the OpenShift Cluster Manager Overview tab, and is available for you to view. The change history includes, but is not limited to, logs from the following changes:

  • Adding or removing identity providers

  • Adding or removing users to or from the dedicated-admins group

  • Scaling the cluster compute nodes

  • Scaling the cluster load balancer

  • Scaling the cluster persistent storage

  • Upgrading the cluster

You can implement a maintenance exclusion by avoiding changes in OpenShift Cluster Manager for the following components:

  • Deleting a cluster

  • Adding, modifying, or removing identity providers

  • Adding, modifying, or removing a user from an elevated group

  • Installing or removing add-ons

  • Modifying cluster networking configurations

  • Adding, modifying, or removing machine pools

  • Enabling or disabling user workload monitoring

  • Initiating an upgrade

To enforce the maintenance exclusion, ensure machine pool autoscaling or automatic upgrade policies have been disabled. After the maintenance exclusion has been lifted, proceed with enabling machine pool autoscaling or automatic upgrade policies as desired.

Red Hat-initiated changes

Red Hat site reliability engineering (SRE) manages the infrastructure, code, and configuration of OpenShift Dedicated using a GitOps workflow and fully automated CI/CD pipelines. This process ensures that Red Hat can safely introduce service improvements on a continuous basis without negatively impacting customers.

Every proposed change undergoes a series of automated verifications immediately upon check-in. Changes are then deployed to a staging environment where they undergo automated integration testing. Finally, changes are deployed to the production environment. Each step is fully automated.

An authorized SRE reviewer must approve advancement to each step. The reviewer cannot be the same individual who proposed the change. All changes and approvals are fully auditable as part of the GitOps workflow.

Some changes are released to production incrementally, using feature flags to control availability of new features to specified clusters or customers.

Patch management

OpenShift Container Platform software and the underlying immutable Red Hat Enterprise Linux CoreOS (RHCOS) operating system image are patched for bugs and vulnerabilities in regular z-stream upgrades. Read more about RHCOS architecture in the OpenShift Container Platform documentation.

Release management

Red Hat does not automatically upgrade your clusters. You can schedule to upgrade the clusters at regular intervals (recurring upgrade) or just once (individual upgrade) using the OpenShift Cluster Manager web console. Red Hat might forcefully upgrade a cluster to a new z-stream version only if the cluster is affected by a critical impact CVE. You can review the history of all cluster upgrade events in the OpenShift Cluster Manager web console. For more information about releases, see the Life Cycle policy.

Security and regulation compliance

Security and regulation compliance includes tasks, such as the implementation of security controls and compliance certification.

Data classification

Red Hat defines and follows a data classification standard to determine the sensitivity of data and highlight inherent risk to the confidentiality and integrity of that data while it is collected, used, transmitted stored, and processed. Customer-owned data is classified at the highest level of sensitivity and handling requirements.

Data management

OpenShift Dedicated uses cloud provider services such as AWS Key Management Service (KMS) and Google Cloud KMS to help securely manage encryption keys for persistent data. These keys are used for encrypting all control plane, infrastructure, and worker node root volumes. Customers can specify their own KMS key for encrypting root volumes at installation time. Persistent volumes (PVs) also use KMS for key management. Customers can specify their own KMS key for encrypting PVs by creating a new StorageClass referencing the KMS key Amazon Resource Name (ARN) or ID.

When a customer deletes their OpenShift Dedicated cluster, all cluster data is permanently deleted, including control plane data volumes and customer application data volumes, such a persistent volumes (PV).

Vulnerability management

Red Hat performs periodic vulnerability scanning of OpenShift Dedicated using industry standard tools. Identified vulnerabilities are tracked to their remediation according to timelines based on severity. Vulnerability scanning and remediation activities are documented for verification by third-party assessors in the course of compliance certification audits.

Network security

Firewall and DDoS protection

Each OpenShift Dedicated cluster is protected by a secure network configuration at the cloud infrastructure level using firewall rules (AWS Security Groups or Google Cloud Compute Engine firewall rules). OpenShift Dedicated customers on AWS are also protected against DDoS attacks with AWS Shield Standard. Similarly, all GCP load balancers and public IP addresses used by OpenShift Dedicated on GCP are protected against DDoS attacks with Google Cloud Armor Standard.

Private clusters and network connectivity

Customers can optionally configure their OpenShift Dedicated cluster endpoints (web console, API, and application router) to be made private so that the cluster control plane or applications are not accessible from the Internet.

For AWS, customers can configure a private network connection to their OpenShift Dedicated cluster through AWS VPC peering, AWS VPN, or AWS Direct Connect.

At this time, private clusters are not supported for OpenShift Dedicated clusters on Google Cloud.

Cluster network access controls

Fine-grained network access control rules can be configured by customers per project by using NetworkPolicy objects and the OpenShift SDN.

Penetration testing

Red Hat performs periodic penetration tests against OpenShift Dedicated. Tests are performed by an independent internal team using industry standard tools and best practices.

Any issues that are discovered are prioritized based on severity. Any issues found belonging to open source projects are shared with the community for resolution.

Compliance

OpenShift Dedicated follows common industry best practices for security and controls. The certifications are outlined in the following table.

Table 1. Security and control certifications for OpenShift Dedicated
Compliance OpenShift Dedicated on AWS OpenShift Dedicated on GCP

HIPAA Qualified

Yes (Only Customer Cloud Subscriptions)

Yes (Only Customer Cloud Subscriptions)

ISO 27001

Yes

Yes

PCI DSS

Yes

Yes

SOC 2 Type 2

Yes

Yes

Additional resources

Disaster recovery

OpenShift Dedicated provides disaster recovery for failures that occur at the pod, worker node, infrastructure node, control plane node, and availability zone levels.

All disaster recovery requires that the customer use best practices for deploying highly available applications, storage, and cluster architecture (for example, single-zone deployment vs. multi-zone deployment) to account for the level of desired availability.

One single-zone cluster will not provide disaster avoidance or recovery in the event of an availability zone or region outage. Multiple single-zone clusters with customer-maintained failover can account for outages at the zone or region levels.

One multi-zone cluster will not provide disaster avoidance or recovery in the event of a full region outage. Multiple multi-zone clusters with customer-maintained failover can account for outages at the region level.

Additional resources