OpenShift SDN - Additional Concepts | Architecture

Overview
Design on Masters
Design on Nodes
Packet Flow
Network Isolation

Overview

OpenShift Enterprise uses a software-defined networking (SDN) approach to provide a unified cluster network that enables communication between pods across the OpenShift Enterprise cluster. This pod network is established and maintained by the OpenShift Enterprise SDN, which configures an overlay network using Open vSwitch (OVS).

OpenShift Enterprise SDN provides two SDN plug-ins for configuring the pod network:

The ovs-subnet plug-in is the original plug-in which provides a "flat" pod network where every pod can communicate with every other pod and service.
The ovs-multitenant plug-in provides OpenShift Enterprise project level isolation for pods and services. Each project receives a unique Virtual Network ID (VNID) that identifies traffic from pods assigned to the project. Pods from different projects cannot send packets to or receive packets from pods and services of a different project.

However, projects which receive VNID 0 are more privileged in that they are allowed to communicate with all other pods, and all other pods can communicate with them. In OpenShift Enterprise clusters, the default project has VNID 0. This facilitates certain services like the load balancer, etc. to communicate with all other pods in the cluster and vice versa.

Following is a detailed discussion of the design and operation of OpenShift Enterprise SDN, which may be useful for troubleshooting.

Information on configuring the SDN on masters and nodes is available in Configuring the SDN.

Design on Masters

On an OpenShift Enterprise master, OpenShift Enterprise SDN maintains a registry of nodes, stored in etcd. When the system administrator registers a node, OpenShift Enterprise SDN allocates an unused subnet from the cluster network and stores this subnet in the registry. When a node is deleted, OpenShift Enterprise SDN deletes the subnet from the registry and considers the subnet available to be allocated again.

In the default configuration, the cluster network is the 10.1.0.0/16 class B network, and nodes are allocated /24 subnets (i.e., 10.1.0.0/24, 10.1.1.0/24, 10.1.2.0/24, and so on). This means that the cluster network has 256 subnets available to assign to nodes, and a given node is allocated 254 addresses that it can assign to the containers running on it. The size and address range of the cluster network are configurable, as is the host subnet size.

Note that OpenShift Enterprise SDN on a master does not configure the local (master) host to have access to any cluster network. Consequently, a master host does not have access to pods via the cluster network, unless it is also running as a node.

When using the ovs-multitenant plug-in, the OpenShift Enterprise SDN master also watches for the creation and deletion of projects, and assigns VXLAN VNIDs to them, which will be used later by the nodes to isolate traffic correctly.

Design on Nodes

On a node, OpenShift Enterprise SDN first registers the local host with the SDN master in the aforementioned registry so that the master allocates a subnet to the node.

Next, OpenShift Enterprise SDN creates and configures six network devices:

br0, the OVS bridge device that pod containers will be attached to. OpenShift Enterprise SDN also configures a set of non-subnet-specific flow rules on this bridge. The ovs-multitenant plug-in does this immediately.
lbr0, a Linux bridge device, which is configured as Docker’s bridge and given the cluster subnet gateway address (eg, 10.1.x.1/24).
tun0, an OVS internal port (port 2 on br0). This also gets assigned the cluster subnet gateway address, and is used for external network access. OpenShift Enterprise SDN configures netfilter and routing rules to enable access from the cluster subnet to the external network via NAT.
vlinuxbr and vovsbr, two Linux peer virtual Ethernet interfaces. vlinuxbr is added to lbr0 and vovsbr is added to br0 (port 9 with the ovs-subnet plug-in and port 3 with the ovs-multitenant plug-in) to provide connectivity for containers created directly with Docker outside of OpenShift Enterprise.
vxlan0, the OVS VXLAN device (port 1 on br0), which provides access to containers on remote nodes.

Each time a pod is started on the host, OpenShift Enterprise SDN:

moves the host side of the pod’s veth interface pair from the lbr0 bridge (where Docker placed it when starting the container) to the OVS bridge br0.
adds OpenFlow rules to the OVS database to route traffic addressed to the new pod to the correct OVS port.
in the case of the ovs-multitenant plug-in, adds OpenFlow rules to tag traffic coming from the pod with the pod’s VNID, and to allow traffic into the pod if the traffic’s VNID matches the pod’s VNID (or is the privileged VNID 0). Non-matching traffic is filtered out by a generic rule.

The pod is allocated an IP address in the cluster subnet by Docker itself because Docker is told to use the lbr0 bridge, which OpenShift Enterprise SDN has assigned the cluster gateway address (eg. 10.1.x.1/24). Note that the tun0 is also assigned the cluster gateway IP address because it is the default gateway for all traffic destined for external networks, but these two interfaces do not conflict because the lbr0 interface is only used for IPAM and no OpenShift Enterprise SDN pods are connected to it.

OpenShift SDN nodes also watch for subnet updates from the SDN master. When a new subnet is added, the node adds OpenFlow rules on br0 so that packets with a destination IP address the remote subnet go to vxlan0 (port 1 on br0) and thus out onto the network. The ovs-subnet plug-in sends all packets across the VXLAN with VNID 0, but the ovs-multitenant plug-in uses the appropriate VNID for the source container.

Packet Flow

Suppose you have two containers, A and B, where the peer virtual Ethernet device for container A’s eth0 is named vethA and the peer for container B’s eth0 is named vethB.

If Docker’s use of peer virtual Ethernet devices is not already familiar to you, review Docker’s advanced networking documentation.

Now suppose first that container A is on the local host and container B is also on the local host. Then the flow of packets from container A to container B is as follows:

eth0 (in A’s netns) → vethA → br0 → vethB → eth0 (in B’s netns)

Next, suppose instead that container A is on the local host and container B is on a remote host on the cluster network. Then the flow of packets from container A to container B is as follows:

eth0 (in A’s netns) → vethA → br0 → vxlan0 → network ^[1] → vxlan0 → br0 → vethB → eth0 (in B’s netns)

Finally, if container A connects to an external host, the traffic looks like:

eth0 (in A’s netns) → vethA → br0 → tun0 → (NAT) → eth0 (physical device) → Internet

Almost all packet delivery decisions are performed with OpenFlow rules in the OVS bridge br0, which simplifies the plug-in network architecture and provides flexible routing. In the case of the ovs-multitenant plug-in, this also provides enforceable network isolation.

Network Isolation

You can use the ovs-multitenant plug-in to achieve network isolation. When a packet exits a pod assigned to a non-default project, the OVS bridge br0 tags that packet with the project’s assigned VNID. If the packet is directed to another IP address in the node’s cluster subnet, the OVS bridge only allows the packet to be delivered to the destination pod if the VNIDs match.

If a packet is received from another node via the VXLAN tunnel, the Tunnel ID is used as the VNID, and the OVS bridge only allows the packet to be delivered to a local pod if the tunnel ID matches the destination pod’s VNID.

Packets destined for other cluster subnets are tagged with their VNID and delivered to the VXLAN tunnel with a tunnel destination address of the node owning the cluster subnet.

As described before, VNID 0 is privileged in that traffic with any VNID is allowed to enter any pod assigned VNID 0, and traffic with VNID 0 is allowed to enter any pod. Only the default OpenShift Enterprise project is assigned VNID 0; all other projects are assigned unique, isolation-enabled VNIDs. Cluster administrators can optionally control the pod network for the project using the administrator CLI.

1. After this point, device names refer to devices on container B’s host.