This document details the Red Hat responsibilities for the managed Red Hat OpenShift Service on AWS (ROSA).
AWS - Amazon Web Services
CEE - Customer Experience and Engagement (Red Hat Support)
CI/CD - Continuous Integration / Continuous Delivery
CVE - Common Vulnerabilities and Exposures
PVs - Persistent Volumes
ROSA - Red Hat OpenShift Service on AWS
SRE - Red Hat Site Reliability Engineering
VPC - Virtual Private Cloud
Red Hat site reliability engineers (SREs) maintain a centralized monitoring and alerting system for all ROSA cluster components, the SRE services, and underlying AWS accounts. Platform audit logs are securely forwarded to a centralized security information and event management (SIEM) system, where they may trigger configured alerts to the SRE team and are also subject to manual review. Audit logs are retained in the SIEM system for one year. Audit logs for a given cluster are not deleted at the time the cluster is deleted.
An incident is an event that results in a degradation or outage of one or more Red Hat services. An incident can be raised by a customer or a Customer Experience and Engagement (CEE) member through a support case, directly by the centralized monitoring and alerting system, or directly by a member of the SRE team.
Depending on the impact on the service and customer, the incident is categorized in terms of severity.
When managing a new incident, Red Hat uses the following general workflow:
An SRE first responder is alerted to a new incident and begins an initial investigation.
After the initial investigation, the incident is assigned an incident lead, who coordinates the recovery efforts.
An incident lead manages all communication and coordination around recovery, including any relevant notifications and support case updates.
The incident is recovered.
The incident is documented and a root cause analysis (RCA) is performed within 5 business days of the incident.
An RCA draft document will be shared with the customer within 7 business days of the incident.
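The two RCA timelines in the workflow above can be sketched as a small date calculation. This is a minimal illustration, not Red Hat tooling: it assumes business days are Monday through Friday with no holiday calendar, and the function names and dictionary keys are hypothetical.

```python
from datetime import date, timedelta

# Timelines taken from the workflow above.
RCA_PERFORMED_BUSINESS_DAYS = 5
RCA_DRAFT_SHARED_BUSINESS_DAYS = 7


def add_business_days(start: date, days: int) -> date:
    """Return the date `days` business days after `start`.

    Assumption: business days are Monday-Friday; holidays are ignored.
    """
    current = start
    remaining = days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # 0-4 are Monday through Friday
            remaining -= 1
    return current


def rca_deadlines(incident_date: date) -> dict:
    """Compute the latest dates for the RCA and for sharing its draft."""
    return {
        "rca_performed_by": add_business_days(
            incident_date, RCA_PERFORMED_BUSINESS_DAYS
        ),
        "rca_draft_shared_by": add_business_days(
            incident_date, RCA_DRAFT_SHARED_BUSINESS_DAYS
        ),
    }


if __name__ == "__main__":
    for name, deadline in rca_deadlines(date(2024, 1, 1)).items():
        print(name, deadline.isoformat())
```

A real schedule would also need to account for regional holidays, which this sketch deliberately leaves out.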
Platform notifications are configured using email. Some customer notifications are also sent to an account’s corresponding Red Hat account team, including a Technical Account Manager, if applicable.
The following activities can trigger notifications:
Platform incident
Performance degradation
Cluster capacity warnings
Critical vulnerabilities and resolution
Upgrade scheduling
Customers are responsible for taking regular backups of their data and should deploy multi-AZ clusters with workloads that follow Kubernetes best practices to ensure high availability within a region. If an entire cloud region is unavailable, customers must install a new cluster in a different region and restore their apps using their backup data.
There is no Red Hat-provided backup method available for ROSA clusters with STS. Red Hat does not commit to any Recovery Point Objective (RPO) or Recovery Time Objective (RTO).
Evaluating and managing cluster capacity is a responsibility that is shared between Red Hat and the customer. Red Hat SRE is responsible for the capacity of all control plane and infrastructure nodes on the cluster.
Red Hat SRE also evaluates cluster capacity during upgrades and in response to cluster alerts. The impact of a cluster upgrade on capacity is evaluated as part of the upgrade testing process to ensure that capacity is not negatively impacted by new additions to the cluster. During a cluster upgrade, additional worker nodes are added to make sure that total cluster capacity is maintained during the upgrade process.
Capacity evaluations by the Red Hat SRE staff also happen in response to alerts from the cluster, after usage thresholds are exceeded for a certain period of time. Such alerts can also result in a notification to the customer.
This section describes the policies about how cluster and configuration changes, patches, and releases are managed.
You can initiate changes using self-service capabilities such as cluster deployment, worker node scaling, or cluster deletion.
Change history is captured in the Cluster History section in the OpenShift Cluster Manager Overview tab, and is available for you to view. The change history includes, but is not limited to, logs from the following changes:
Adding or removing identity providers
Adding or removing users to or from the dedicated-admins group
Scaling the cluster compute nodes
Scaling the cluster load balancer
Scaling the cluster persistent storage
Upgrading the cluster
You can implement a maintenance exclusion by avoiding changes in OpenShift Cluster Manager for the following components:
Deleting a cluster
Adding, modifying, or removing identity providers
Adding, modifying, or removing a user from an elevated group
Installing or removing add-ons
Modifying cluster networking configurations
Adding, modifying, or removing machine pools
Enabling or disabling user workload monitoring
Initiating an upgrade
To enforce the maintenance exclusion, ensure machine pool autoscaling or automatic upgrade policies have been disabled. After the maintenance exclusion has been lifted, proceed with enabling machine pool autoscaling or automatic upgrade policies as desired.
Red Hat site reliability engineering (SRE) manages the infrastructure, code, and configuration of Red Hat OpenShift Service on AWS using a GitOps workflow and fully automated CI/CD pipelines. This process ensures that Red Hat can safely introduce service improvements on a continuous basis without negatively impacting customers.
Every proposed change undergoes a series of automated verifications immediately upon check-in. Changes are then deployed to a staging environment where they undergo automated integration testing. Finally, changes are deployed to the production environment. Each step is fully automated.
An authorized SRE reviewer must approve advancement to each step. The reviewer cannot be the same individual who proposed the change. All changes and approvals are fully auditable as part of the GitOps workflow.
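The approval rule above (an authorized reviewer who is not the proposer) can be expressed as a simple check. The following is a hedged sketch only: the `Change` type, the stage names, and the `can_approve` function are hypothetical and do not describe the actual pipeline implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Change:
    """A proposed change moving through the pipeline (hypothetical model)."""
    proposer: str
    target_stage: str  # assumed names, e.g. "staging" or "production"


def can_approve(change: Change, reviewer: str, authorized_reviewers: set) -> bool:
    """Return True only if the reviewer is authorized and did not propose the change."""
    is_authorized = reviewer in authorized_reviewers
    is_independent = reviewer != change.proposer
    return is_authorized and is_independent


if __name__ == "__main__":
    reviewers = {"alice", "bob"}
    change = Change(proposer="alice", target_stage="production")
    print(can_approve(change, "bob", reviewers))    # independent, authorized
    print(can_approve(change, "alice", reviewers))  # proposer cannot self-approve
```

Keeping both conditions in one function makes the separation-of-duties rule easy to audit, in line with the auditability goal stated above.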
Some changes are released to production incrementally, using feature flags to control availability of new features to specified clusters or customers.
OpenShift Container Platform software and the underlying immutable Red Hat CoreOS (RHCOS) operating system image are patched for bugs and vulnerabilities in regular z-stream upgrades. Read more about RHCOS architecture in the OpenShift Container Platform documentation.
Red Hat does not automatically upgrade your clusters. You can schedule cluster upgrades at regular intervals (recurring upgrade) or just once (individual upgrade) using the OpenShift Cluster Manager web console. Red Hat might forcefully upgrade a cluster to a new z-stream version only if the cluster is affected by a critical impact CVE.
Because the required permissions can change between y-stream releases, the policies might have to be updated before an upgrade can be performed. Therefore, you cannot schedule a recurring upgrade on ROSA clusters with STS.
You can review the history of all cluster upgrade events in the OpenShift Cluster Manager web console. For more information about releases, see the Life Cycle policy.
Most access by Red Hat site reliability engineering (SRE) teams is done by using cluster Operators through automated configuration management.
For a list of the available subprocessors, see the Red Hat Subprocessor List on the Red Hat Customer Portal.
SREs access Red Hat OpenShift Service on AWS clusters through the web console or command-line tools. Authentication requires multi-factor authentication (MFA) with industry-standard requirements for password complexity and account lockouts. SREs must authenticate as individuals to ensure auditability. All authentication attempts are logged to a Security Information and Event Management (SIEM) system.
SREs access private clusters using an encrypted HTTP connection. Connections are permitted only from a secured Red Hat network using either an IP allowlist or a private cloud provider link.
SRE adheres to the principle of least privilege when accessing Red Hat OpenShift Service on AWS and AWS components. There are four basic categories of manual SRE access:
SRE admin access through the Red Hat Portal with normal two-factor authentication and no privileged elevation.
SRE admin access through the Red Hat corporate SSO with normal two-factor authentication and no privileged elevation.
OpenShift elevation, which is a manual elevation using Red Hat SSO. Access is limited to 2 hours, is fully audited, and requires management approval.
AWS access or elevation, which is a manual elevation for AWS console or CLI access. Access is limited to 60 minutes and is fully audited.
Each of these access types has a different level of access to components:
Component | Typical SRE admin access (Red Hat Portal) | Typical SRE admin access (Red Hat SSO) | OpenShift elevation | Cloud provider access or elevation |
---|---|---|---|---|
OpenShift Cluster Manager | R/W | No access | No access | No access |
OpenShift console | No access | R/W | R/W | No access |
Node operating system | No access | A specific list of elevated OS and network permissions. | A specific list of elevated OS and network permissions. | No access |
AWS Console | No access | No access, but this is the account used to request cloud provider access. | No access | All cloud provider permissions using the SRE identity. |
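The time limits stated for the two manual elevation types (2 hours for OpenShift elevation, 60 minutes for AWS access or elevation) can be illustrated with a short expiry check. This is a sketch under stated assumptions: the dictionary keys and function names are hypothetical, and the actual enforcement mechanism is not described in this document.

```python
from datetime import datetime, timedelta

# Limits taken from the access categories described above.
# The key names are hypothetical labels chosen for this sketch.
ELEVATION_LIMITS = {
    "openshift_elevation": timedelta(hours=2),
    "aws_elevation": timedelta(minutes=60),
}


def elevation_expires_at(kind: str, granted_at: datetime) -> datetime:
    """Return the moment an elevation of the given kind stops being valid."""
    return granted_at + ELEVATION_LIMITS[kind]


def is_elevation_active(kind: str, granted_at: datetime, now: datetime) -> bool:
    """Return True while `now` falls inside the elevation window."""
    return granted_at <= now < elevation_expires_at(kind, granted_at)


if __name__ == "__main__":
    granted = datetime(2024, 1, 1, 12, 0)
    check = datetime(2024, 1, 1, 13, 30)
    print(is_elevation_active("openshift_elevation", granted, check))
    print(is_elevation_active("aws_elevation", granted, check))
```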
Red Hat personnel do not access AWS accounts in the course of routine Red Hat OpenShift Service on AWS operations. For emergency troubleshooting purposes, the SREs have well-defined and auditable procedures to access cloud infrastructure accounts.
SREs generate a short-lived AWS access token for a reserved role using the AWS Security Token Service (STS). Access to the STS token is audit-logged and traceable back to individual users. Both STS and non-STS clusters use the AWS STS service for SRE access. For non-STS clusters, the BYOCAdminAccess role has the AdministratorAccess IAM policy attached, and this role is used for administration. For STS clusters, the ManagedOpenShift-Support-Role has the ManagedOpenShift-Support-Access policy attached, and this role is used for administration.
Members of the Red Hat Customer Experience and Engagement (CEE) team typically have read-only access to parts of the cluster. Specifically, CEE has limited access to the core and product namespaces and does not have access to the customer namespaces.
Role | Core namespace | Layered product namespace | Customer namespace | AWS account* |
---|---|---|---|---|
OpenShift SRE | Read: All Write: Very limited [1] | Read: All Write: None | Read: None [2] Write: None | Read: All [3] Write: All [3] |
CEE | Read: All Write: None | Read: All Write: None | Read: None [2] Write: None | Read: None Write: None |
Customer administrator | Read: None Write: None | Read: None Write: None | Read: All Write: All | Read: All Write: All |
Customer user | Read: None Write: None | Read: None Write: None | Read: Limited [4] Write: Limited [4] | Read: None Write: None |
Everybody else | Read: None Write: None | Read: None Write: None | Read: None Write: None | Read: None Write: None |
[1] Limited to addressing common use cases such as failing deployments, upgrading a cluster, and replacing bad worker nodes.
[2] Red Hat associates have no access to customer data by default.
[3] SRE access to the AWS account is an emergency procedure for exceptional troubleshooting during a documented incident.
[4] Limited to what is granted through RBAC by the Customer Administrator, as well as namespaces created by the user.
Customer access is limited to namespaces created by the customer and permissions that are granted using RBAC by the Customer Administrator role. Access to the underlying infrastructure or product namespaces is generally not permitted without cluster-admin access. More information on customer access and authentication can be found in the "Understanding Authentication" section of the documentation.
Security and regulation compliance includes tasks such as the implementation of security controls and compliance certification.
Red Hat defines and follows a data classification standard to determine the sensitivity of data and highlight inherent risk to the confidentiality and integrity of that data while it is collected, used, transmitted, stored, and processed. Customer-owned data is classified at the highest level of sensitivity and handling requirements.
Red Hat OpenShift Service on AWS (ROSA) uses AWS Key Management Service (KMS) to help securely manage keys for encrypted data. These keys are used for control plane, infrastructure, and worker data volumes that are encrypted by default. Persistent volumes (PVs) for customer applications also use AWS KMS for key management.
When a customer deletes their ROSA cluster, all cluster data is permanently deleted, including control plane data volumes and customer application data volumes, such as persistent volumes (PVs).
Red Hat performs periodic vulnerability scanning of ROSA using industry standard tools. Identified vulnerabilities are tracked to their remediation according to timelines based on severity. Vulnerability scanning and remediation activities are documented for verification by third-party assessors in the course of compliance certification audits.
Each ROSA cluster is protected by a secure network configuration using firewall rules for AWS Security Groups. ROSA customers are also protected against DDoS attacks with AWS Shield Standard.
Customers can optionally configure their ROSA cluster endpoints, such as the web console, API, and application router, to be private so that the cluster control plane and applications are not accessible from the Internet. Red Hat SRE still requires Internet-accessible endpoints that are protected with IP allowlists.
AWS customers can configure a private network connection to their ROSA cluster through technologies such as AWS VPC peering, AWS VPN, or AWS Direct Connect.
Red Hat performs periodic penetration tests against ROSA. Tests are performed by an independent internal team by using industry standard tools and best practices.
Any issues that may be discovered are prioritized based on severity. Any issues found belonging to open source projects are shared with the community for resolution.
Red Hat OpenShift Service on AWS follows common industry best practices for security and controls. The certifications are outlined in the following table.
Certification | Red Hat OpenShift Service on AWS |
---|---|
HIPAA | Yes |
ISO 27001 | Yes |
ISO 27017 | Yes |
ISO 27018 | Yes |
PCI DSS | Yes |
SOC 2 Type 2 | Yes |
See the Red Hat Subprocessor List for information on SRE residency.
Red Hat OpenShift Service on AWS (ROSA) provides disaster recovery for failures that occur at the pod, worker node, infrastructure node, control plane node, and availability zone levels.
All disaster recovery requires that the customer use best practices for deploying highly available applications, storage, and cluster architecture, such as single-zone deployment or multi-zone deployment, to account for the level of desired availability.
One single-zone cluster will not provide disaster avoidance or recovery in the event of an availability zone or region outage. Multiple single-zone clusters with customer-maintained failover can account for outages at the zone or at the regional level.
One multi-zone cluster will not provide disaster avoidance or recovery in the event of a full region outage. Multiple multi-zone clusters with customer-maintained failover can account for outages at the regional level.
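The coverage rules for single-zone and multi-zone clusters above can be summarized as a small decision function. This is an illustrative reading of the text, not an official model: the function name and level labels are hypothetical, it assumes that additional clusters are placed in different zones or regions as needed, and it assumes a single multi-zone cluster accounts for an availability zone outage.

```python
def outage_levels_accounted_for(
    multi_zone: bool, cluster_count: int, customer_failover: bool
) -> set:
    """Return which of {"availability_zone", "region"} outages a topology can account for.

    Assumptions for this sketch:
    - Extra clusters are deployed in other zones or regions as appropriate.
    - Failover between clusters is built and maintained by the customer.
    """
    levels = set()
    # Multiple clusters only help when customer-maintained failover exists.
    redundant_clusters = cluster_count > 1 and customer_failover
    if multi_zone or redundant_clusters:
        levels.add("availability_zone")
    if redundant_clusters:
        levels.add("region")
    return levels


if __name__ == "__main__":
    print(outage_levels_accounted_for(False, 1, False))  # one single-zone cluster
    print(outage_levels_accounted_for(True, 1, False))   # one multi-zone cluster
    print(outage_levels_accounted_for(True, 2, True))    # multiple multi-zone clusters
```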
For more information about customer or shared responsibilities, see the ROSA Responsibilities document.
For more information about ROSA and its components, see the ROSA Service Definition.