×

Network Observability Operator 1.5.0

The following advisory is available for the Network Observability Operator 1.5.0:

New features and enhancements

DNS tracking enhancements

In 1.5, the TCP protocol is now supported in addition to UDP. New dashboards are also added to the Overview view of the Network Traffic page. For more information, see Configuring DNS tracking and Working with DNS tracking.

Round-trip time (RTT)

You can use TCP handshake Round-Trip Time (RTT) captured from the fentry/tcp_rcv_established Extended Berkeley Packet Filter (eBPF) hookpoint to read smoothed round-trip time (SRTT) and analyze network flows. In the Overview, Network Traffic, and Topology pages in web console, you can monitor network traffic and troubleshoot with RTT metrics, filtering, and edge labeling. For more information, see RTT Overview and Working with RTT.

Metrics, dashboards, and alerts enhancements

The Network Observability metrics dashboards in ObserveDashboardsNetObserv have new metrics types you can use to create Prometheus alerts. You can now define available metrics in the includeList specification. In previous releases, these metrics were defined in the ignoreTags specification. For a complete list of these metrics, see Network Observability Metrics.

Improvements for Network Observability without Loki

You can create Prometheus alerts for the Netobserv dashboard using DNS and RTT metrics, even if you don’t use Loki. In the previous version of Network Observability, 1.4, these metrics were only available for querying and analysis in the Network Traffic, Overview, and Topology views, which are not available without Loki. For more information, see Network Observability Metrics.

Availability zones

You can configure the FlowCollector resource to collect information about the cluster availability zones. This configuration enriches the network flow data with the topology.kubernetes.io/zone label value applied to the nodes. For more information, see Working with availability zones.

Notable enhancements

The 1.5 release of the Network Observability Operator adds improvements and new capabilities to the OpenShift Container Platform web console plugin and the Operator configuration.

Performance enhancements
  • The spec.agent.ebpf.kafkaBatchSize default is changed from 10MB to 1MB to enhance eBPF performance when using Kafka.

    When upgrading from an existing installation, this new value is not set automatically in the configuration. If you monitor a performance regression with the eBPF Agent memory consumption after upgrading, you might consider reducing the kafkaBatchSize to the new value.

Web console enhancements:
  • There are new panels added to the Overview view for DNS and RTT: Min, Max, P90, P99.

  • There are new panel display options added:

    • Focus on one panel while keeping others viewable but with smaller focus.

    • Switch graph type.

    • Show Top and Overall.

  • A collection latency warning is shown in the Custom time range pop-up window.

  • There is enhanced visibility for the contents of the Manage panels and Manage columns pop-up windows.

  • The Differentiated Services Code Point (DSCP) field for egress QoS is available for filtering QoS DSCP in the web console Network Traffic page.

Configuration enhancements:
  • The LokiStack mode in the spec.loki.mode specification simplifies installation by automatically setting URLs, TLS, cluster roles and a cluster role binding, as well as the authToken value. The Manual mode allows more control over configuration of these settings.

  • The API version changes from flows.netobserv.io/v1beta1 to flows.netobserv.io/v1beta2.

Bug fixes

  • Previously, it was not possible to register the console plugin manually in the web console interface if the automatic registration of the console plugin was disabled. If the spec.console.register value was set to false in the FlowCollector resource, the Operator would override and erase the plugin registration. With this fix, setting the spec.console.register value to false does not impact the console plugin registration or registration removal. As a result, the plugin can be safely registered manually. (NETOBSERV-1134)

  • Previously, using the default metrics settings, the NetObserv/Health dashboard was showing an empty graph named Flows Overhead. This metric was only available by removing "namespaces-flows" and "namespaces" from the ignoreTags list. With this fix, this metric is visible when you use the default metrics setting. (NETOBSERV-1351)

  • Previously, the node on which the eBPF Agent was running would not resolve with a specific cluster configuration. This resulted in cascading consequences that culminated in a failure to provide some of the traffic metrics. With this fix, the eBPF agent’s node IP is safely provided by the Operator, inferred from the pod status. Now, the missing metrics are restored. (NETOBSERV-1430)

  • Previously, the Loki error 'Input size too long' error for the Loki Operator did not include additional information to troubleshoot the problem. With this fix, help is directly displayed in the web console next to the error with a direct link for more guidance. (NETOBSERV-1464)

  • Previously, the console plugin read timeout was forced to 30s. With the FlowCollector v1beta2 API update, you can configure the spec.loki.readTimeout specification to update this value according to the Loki Operator queryTimeout limit. (NETOBSERV-1443)

  • Previously, the Operator bundle did not display some of the supported features by CSV annotations as expected, such as features.operators.openshift.io/…​ With this fix, these annotations are set in the CSV as expected. (NETOBSERV-1305)

  • Previously, the FlowCollector status sometimes oscillated between DeploymentInProgress and Ready states during reconciliation. With this fix, the status only becomes Ready when all the underlying components are fully ready.(NETOBSERV-1293)

Known issues

  • When trying to access the web console, cache issues on OCP 4.14.10 prevent access to the Observe view. The web console shows the error message: Failed to get a valid plugin manifest from /api/plugins/monitoring-plugin/. The recommended workaround is to update the cluster to the latest minor version. If this does not work, you need to apply the workarounds described in this Red Hat Knowledgebase article.(NETOBSERV-1493)

Network Observability Operator 1.4.2

The following advisory is available for the Network Observability Operator 1.4.2:

Network Observability Operator 1.4.1

The following advisory is available for the Network Observability Operator 1.4.1:

Bug fixes

  • In 1.4, there was a known issue when sending network flow data to Kafka. The Kafka message key was ignored, causing an error with connection tracking. Now the key is used for partitioning, so each flow from the same connection is sent to the same processor. (NETOBSERV-926)

  • In 1.4, the Inner flow direction was introduced to account for flows between pods running on the same node. Flows with the Inner direction were not taken into account in the generated Prometheus metrics derived from flows, resulting in under-evaluated bytes and packets rates. Now, derived metrics are including flows with the Inner direction, providing correct bytes and packets rates. (NETOBSERV-1344)

Network Observability Operator 1.4.0

The following advisory is available for the Network Observability Operator 1.4.0:

Channel removal

You must switch your channel from v1.0.x to stable to receive the latest Operator updates. The v1.0.x channel is now removed.

New features and enhancements

Notable enhancements

The 1.4 release of the Network Observability Operator adds improvements and new capabilities to the OpenShift Container Platform web console plugin and the Operator configuration.

Web console enhancements:
  • In the Query Options, the Duplicate flows checkbox is added to choose whether or not to show duplicated flows.

  • You can now filter source and destination traffic with arrow up long solid One-way, arrow up long solid arrow down long solid Back-and-forth, and Swap filters.

  • The Network Observability metrics dashboards in ObserveDashboardsNetObserv and NetObserv / Health are modified as follows:

    • The NetObserv dashboard shows top bytes, packets sent, packets received per nodes, namespaces, and workloads. Flow graphs are removed from this dashboard.

    • The NetObserv / Health dashboard shows flows overhead as well as top flow rates per nodes, namespaces, and workloads.

    • Infrastructure and Application metrics are shown in a split-view for namespaces and workloads.

For more information, see Network Observability metrics and Quick filters.

Configuration enhancements:
  • You now have the option to specify different namespaces for any configured ConfigMap or Secret reference, such as in certificates configuration.

  • The spec.processor.clusterName parameter is added so that the name of the cluster appears in the flows data. This is useful in a multi-cluster context. When using OpenShift Container Platform, leave empty to make it automatically determined.

Network Observability without Loki

The Network Observability Operator is now functional and usable without Loki. If Loki is not installed, it can only export flows to KAFKA or IPFIX format and provide metrics in the Network Observability metrics dashboards. For more information, see Network Observability without Loki.

DNS tracking

In 1.4, the Network Observability Operator makes use of eBPF tracepoint hooks to enable DNS tracking. You can monitor your network, conduct security analysis, and troubleshoot DNS issues in the Network Traffic and Overview pages in the web console.

For more information, see Configuring DNS tracking and Working with DNS tracking.

SR-IOV support

You can now collect traffic from a cluster with Single Root I/O Virtualization (SR-IOV) device. For more information, see Configuring the monitoring of SR-IOV interface traffic.

IPFIX exporter support

You can now export eBPF-enriched network flows to the IPFIX collector. For more information, see Export enriched network flow data.

s390x architecture support

Network Observability Operator can now run on s390x architecture. Previously it ran on amd64, ppc64le, or arm64.

Bug fixes

  • Previously, the Prometheus metrics exported by Network Observability were computed out of potentially duplicated network flows. In the related dashboards, from ObserveDashboards, this could result in potentially doubled rates. Note that dashboards from the Network Traffic view were not affected. Now, network flows are filtered to eliminate duplicates prior to metrics calculation, which results in correct traffic rates displayed in the dashboards. (NETOBSERV-1131)

  • Previously, the Network Observability Operator agents were not able to capture traffic on network interfaces when configured with Multus or SR-IOV, non-default network namespaces. Now, all available network namespaces are recognized and used for capturing flows, allowing capturing traffic for SR-IOV. There are configurations needed for the FlowCollector and SRIOVnetwork custom resource to collect traffic. (NETOBSERV-1283)

  • Previously, in the Network Observability Operator details from OperatorsInstalled Operators, the FlowCollector Status field might have reported incorrect information about the state of the deployment. The status field now shows the proper conditions with improved messages. The history of events is kept, ordered by event date. (NETOBSERV-1224)

  • Previously, during spikes of network traffic load, certain eBPF pods were OOM-killed and went into a CrashLoopBackOff state. Now, the eBPF agent memory footprint is improved, so pods are not OOM-killed and entering a CrashLoopBackOff state. (NETOBSERV-975)

  • Previously when processor.metrics.tls was set to PROVIDED the insecureSkipVerify option value was forced to be true. Now you can set insecureSkipVerify to true or false, and provide a CA certificate if needed. (NETOBSERV-1087)

Known issues

  • Since the 1.2.0 release of the Network Observability Operator, using Loki Operator 5.6, a Loki certificate change periodically affects the flowlogs-pipeline pods and results in dropped flows rather than flows written to Loki. The problem self-corrects after some time, but it still causes temporary flow data loss during the Loki certificate change. This issue has only been observed in large-scale environments of 120 nodes or greater. (NETOBSERV-980)

  • Currently, when spec.agent.ebpf.features includes DNSTracking, larger DNS packets require the eBPF agent to look for DNS header outside of the 1st socket buffer (SKB) segment. A new eBPF agent helper function needs to be implemented to support it. Currently, there is no workaround for this issue. (NETOBSERV-1304)

  • Currently, when spec.agent.ebpf.features includes DNSTracking, DNS over TCP packets requires the eBPF agent to look for DNS header outside of the 1st SKB segment. A new eBPF agent helper function needs to be implemented to support it. Currently, there is no workaround for this issue. (NETOBSERV-1245)

  • Currently, when using a KAFKA deployment model, if conversation tracking is configured, conversation events might be duplicated across Kafka consumers, resulting in inconsistent tracking of conversations, and incorrect volumetric data. For that reason, it is not recommended to configure conversation tracking when deploymentModel is set to KAFKA. (NETOBSERV-926)

  • Currently, when the processor.metrics.server.tls.type is configured to use a PROVIDED certificate, the operator enters an unsteady state that might affect its performance and resource consumption. It is recommended to not use a PROVIDED certificate until this issue is resolved, and instead using an auto-generated certificate, setting processor.metrics.server.tls.type to AUTO. (NETOBSERV-1293

Network Observability Operator 1.3.0

The following advisory is available for the Network Observability Operator 1.3.0:

Channel deprecation

You must switch your channel from v1.0.x to stable to receive future Operator updates. The v1.0.x channel is deprecated and planned for removal in the next release.

New features and enhancements

Multi-tenancy in Network Observability

Flow-based metrics dashboard

  • This release adds a new dashboard, which provides an overview of the network flows in your OpenShift Container Platform cluster. For more information, see Network Observability metrics.

Troubleshooting with the must-gather tool

  • Information about the Network Observability Operator can now be included in the must-gather data for troubleshooting. For more information, see Network Observability must-gather.

Multiple architectures now supported

  • Network Observability Operator can now run on an amd64, ppc64le, or arm64 architectures. Previously, it only ran on amd64.

Deprecated features

Deprecated configuration parameter setting

The release of Network Observability Operator 1.3 deprecates the spec.Loki.authToken HOST setting. When using the Loki Operator, you must now only use the FORWARD setting.

Bug fixes

  • Previously, when the Operator was installed from the CLI, the Role and RoleBinding that are necessary for the Cluster Monitoring Operator to read the metrics were not installed as expected. The issue did not occur when the operator was installed from the web console. Now, either way of installing the Operator installs the required Role and RoleBinding. (NETOBSERV-1003)

  • Since version 1.2, the Network Observability Operator can raise alerts when a problem occurs with the flows collection. Previously, due to a bug, the related configuration to disable alerts, spec.processor.metrics.disableAlerts was not working as expected and sometimes ineffectual. Now, this configuration is fixed so that it is possible to disable the alerts. (NETOBSERV-976)

  • Previously, when Network Observability was configured with spec.loki.authToken set to DISABLED, only a kubeadmin cluster administrator was able to view network flows. Other types of cluster administrators received authorization failure. Now, any cluster administrator is able to view network flows. (NETOBSERV-972)

  • Previously, a bug prevented users from setting spec.consolePlugin.portNaming.enable to false. Now, this setting can be set to false to disable port-to-service name translation. (NETOBSERV-971)

  • Previously, the metrics exposed by the console plugin were not collected by the Cluster Monitoring Operator (Prometheus), due to an incorrect configuration. Now the configuration has been fixed so that the console plugin metrics are correctly collected and accessible from the OpenShift Container Platform web console. (NETOBSERV-765)

  • Previously, when processor.metrics.tls was set to AUTO in the FlowCollector, the flowlogs-pipeline servicemonitor did not adapt the appropriate TLS scheme, and metrics were not visible in the web console. Now the issue is fixed for AUTO mode. (NETOBSERV-1070)

  • Previously, certificate configuration, such as used for Kafka and Loki, did not allow specifying a namespace field, implying that the certificates had to be in the same namespace where Network Observability is deployed. Moreover, when using Kafka with TLS/mTLS, the user had to manually copy the certificate(s) to the privileged namespace where the eBPF agent pods are deployed and manually manage certificate updates, such as in the case of certificate rotation. Now, Network Observability setup is simplified by adding a namespace field for certificates in the FlowCollector resource. As a result, users can now install Loki or Kafka in different namespaces without needing to manually copy their certificates in the Network Observability namespace. The original certificates are watched so that the copies are automatically updated when needed. (NETOBSERV-773)

  • Previously, the SCTP, ICMPv4 and ICMPv6 protocols were not covered by the Network Observability agents, resulting in a less comprehensive network flows coverage. These protocols are now recognized to improve the flows coverage. (NETOBSERV-934)

Known issues

  • When processor.metrics.tls is set to PROVIDED in the FlowCollector, the flowlogs-pipeline servicemonitor is not adapted to the TLS scheme. (NETOBSERV-1087)

  • Since the 1.2.0 release of the Network Observability Operator, using Loki Operator 5.6, a Loki certificate change periodically affects the flowlogs-pipeline pods and results in dropped flows rather than flows written to Loki. The problem self-corrects after some time, but it still causes temporary flow data loss during the Loki certificate change. This issue has only been observed in large-scale environments of 120 nodes or greater.(NETOBSERV-980)

Network Observability Operator 1.2.0

The following advisory is available for the Network Observability Operator 1.2.0:

Preparing for the next update

The subscription of an installed Operator specifies an update channel that tracks and receives updates for the Operator. Until the 1.2 release of the Network Observability Operator, the only channel available was v1.0.x. The 1.2 release of the Network Observability Operator introduces the stable update channel for tracking and receiving updates. You must switch your channel from v1.0.x to stable to receive future Operator updates. The v1.0.x channel is deprecated and planned for removal in a following release.

New features and enhancements

Histogram in Traffic Flows view

  • You can now choose to show a histogram bar chart of flows over time. The histogram enables you to visualize the history of flows without hitting the Loki query limit. For more information, see Using the histogram.

Conversation tracking

  • You can now query flows by Log Type, which enables grouping network flows that are part of the same conversation. For more information, see Working with conversations.

Network Observability health alerts

  • The Network Observability Operator now creates automatic alerts if the flowlogs-pipeline is dropping flows because of errors at the write stage or if the Loki ingestion rate limit has been reached. For more information, see Viewing health information.

Bug fixes

  • Previously, after changing the namespace value in the FlowCollector spec, eBPF agent pods running in the previous namespace were not appropriately deleted. Now, the pods running in the previous namespace are appropriately deleted. (NETOBSERV-774)

  • Previously, after changing the caCert.name value in the FlowCollector spec (such as in Loki section), FlowLogs-Pipeline pods and Console plug-in pods were not restarted, therefore they were unaware of the configuration change. Now, the pods are restarted, so they get the configuration change. (NETOBSERV-772)

  • Previously, network flows between pods running on different nodes were sometimes not correctly identified as being duplicates because they are captured by different network interfaces. This resulted in over-estimated metrics displayed in the console plug-in. Now, flows are correctly identified as duplicates, and the console plug-in displays accurate metrics. (NETOBSERV-755)

  • The "reporter" option in the console plug-in is used to filter flows based on the observation point of either source node or destination node. Previously, this option mixed the flows regardless of the node observation point. This was due to network flows being incorrectly reported as Ingress or Egress at the node level. Now, the network flow direction reporting is correct. The "reporter" option filters for source observation point, or destination observation point, as expected. (NETOBSERV-696)

  • Previously, for agents configured to send flows directly to the processor as gRPC+protobuf requests, the submitted payload could be too large and is rejected by the processors' GRPC server. This occurred under very-high-load scenarios and with only some configurations of the agent. The agent logged an error message, such as: grpc: received message larger than max. As a consequence, there was information loss about those flows. Now, the gRPC payload is split into several messages when the size exceeds a threshold. As a result, the server maintains connectivity. (NETOBSERV-617)

Known issue

  • In the 1.2.0 release of the Network Observability Operator, using Loki Operator 5.6, a Loki certificate transition periodically affects the flowlogs-pipeline pods and results in dropped flows rather than flows written to Loki. The problem self-corrects after some time, but it still causes temporary flow data loss during the Loki certificate transition. (NETOBSERV-980)

Notable technical changes

  • Previously, you could install the Network Observability Operator using a custom namespace. This release introduces the conversion webhook which changes the ClusterServiceVersion. Because of this change, all the available namespaces are no longer listed. Additionally, to enable Operator metrics collection, namespaces that are shared with other Operators, like the openshift-operators namespace, cannot be used. Now, the Operator must be installed in the openshift-netobserv-operator namespace. You cannot automatically upgrade to the new Operator version if you previously installed the Network Observability Operator using a custom namespace. If you previously installed the Operator using a custom namespace, you must delete the instance of the Operator that was installed and re-install your operator in the openshift-netobserv-operator namespace. It is important to note that custom namespaces, such as the commonly used netobserv namespace, are still possible for the FlowCollector, Loki, Kafka, and other plug-ins. (NETOBSERV-907)(NETOBSERV-956)

Network Observability Operator 1.1.0

The following advisory is available for the Network Observability Operator 1.1.0:

The Network Observability Operator is now stable and the release channel is upgraded to v1.1.0.

Bug fix

  • Previously, unless the Loki authToken configuration was set to FORWARD mode, authentication was no longer enforced, allowing any user who could connect to the OpenShift Container Platform console in an OpenShift Container Platform cluster to retrieve flows without authentication. Now, regardless of the Loki authToken mode, only cluster administrators can retrieve flows. (BZ#2169468)