Moving Link Planning to the Portal

Problem

Currently, responsibility is split between the Merge portal and testbed facilities: the portal computes an abstract embedding of an experiment into a network of resources, and it is the facility’s responsibility to translate that abstract embedding into a concrete representation defining how the networks that interconnect nodes within a materialization will actually be constructed. The ideal in play here is separation of concerns. The portal does not need to know the specifics of the technologies used in a particular facility, and it’s up to the Cogs to figure out how to take an embedding specification and materialize it.

However, this separation results in a somewhat false distinction. If we look at the code that implements realization and materialization in the portal, and then at how link planning is done by the fabric library in the Cogs, there is a lot of logical duplication between the two. Yes, the realization engine provides hints such as the number of lanes supported by any particular virtual network path, and this information is used to cut down on the logic, computation, and complexity taken on by the Cogs fabric library. Nonetheless, the fabric library is doing much of what the realization engine is doing. The result is a weak complexity boundary, duplicated complexity, and ample opportunity for bugs and performance degradation.

Solution

This RFC partitions complexity between the portal and testbed facilities along a different axis with regard to network materialization. Planning a network embedding, all the way down to protocol-specific specifications such as VXLAN/VLAN/EVPN, is handled by the portal as part of realization. Provisioning a virtual network plan remains the responsibility of the Cogs.

At first glance, this may seem to make Merge more monolithic, lose some of the distributed nature it was designed around, and degrade the experiment-space/resource-space boundary.

However, Merge has always been a model-driven architecture, and the realization service has always used information in testbed resource models to determine an embedding. We are now simply taking the additional step of incorporating protocol information (such as VXLAN VTEP placement) into realizations, determined using resource capability specifications from the model itself. So the propagation of information upwards has not changed with respect to the resource-space/experiment-space boundary. What has changed is the addition of protocol information flowing downwards. This information is not tied to particular resources, but to classes of resources that carry a capability specification sufficient to support a given protocol. Thus we are not getting into resource specifics in the portal, but we are providing materialization information based on resource classes.
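To make this concrete, a capability specification carried by a resource class might look something like the following Go sketch. All of the names here are invented for illustration; they are not the actual Merge model schema.

    // Illustrative sketch of a capability specification carried by a
    // class of resources. Names are hypothetical, not the Merge schema.
    package model

    // VirtProtocol identifies a network virtualization protocol that a
    // resource class is capable of supporting.
    type VirtProtocol int

    const (
        VLAN VirtProtocol = iota
        VXLAN
        EVPN
    )

    // Capabilities describes what a class of resources can do,
    // independent of any particular resource instance. The realization
    // engine would consult these when deciding, for example, where
    // VTEPs can be placed.
    type Capabilities struct {
        Protocols     []VirtProtocol // virtualization protocols supported
        HardwareEncap bool           // can encap/decap VXLAN in hardware
        MaxVIDs       int            // VLAN ID capacity
    }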

Details

API Additions

The Realization API now includes network virtualization protocol objects as a part of Link realizations.
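As a rough sketch, continuing the illustrative types above, a link realization carrying protocol objects might be shaped like this. The type and field names are hypothetical, not the actual Realization API; the three main fields correspond to the core concepts described in the sections that follow.

    // Hypothetical shape of a link realization with protocol objects.
    // Not the actual Realization API.
    type LinkRealization struct {
        Segments     []Segment     // isolated virtual overlay segments
        Terminations []Termination // how endpoints attach to the link
        Transit      TransitKind   // how traffic flows between terminals
    }

    // Segment is one isolated overlay segment of a link.
    type Segment struct {
        Protocol VirtProtocol // VLAN or VXLAN, from the earlier sketch
        VNI      uint32       // VXLAN network identifier, when VXLAN
        VID      uint16       // VLAN identifier, when VLAN
    }

    // Termination describes how one endpoint lands on the link.
    type Termination struct {
        Kind TerminationKind
        Host string // resource the termination lands on
    }

    type TerminationKind int

    const (
        PhysicalAccessPort TerminationKind = iota
        VLANMultiplexed
        VXLANHypervisor
        VXLANContainer
        VXLANBasic
    )

    type TransitKind int

    const (
        VLANTrunking TransitKind = iota
        VXLANOverlay
        HybridVLANVXLAN
    )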

Link Realizations as Blueprints

With the addition of virtual network protocol data, link realizations now provide sufficient information to act as blueprints for experiment network materialization. These blueprints define what a link embedding is in terms of three core concepts.

Segmentation

Segmentation defines how a link is broken up into isolated virtual overlay segments. The primary reason for needing more than one segment for a given link is to accommodate network emulation. Consider the following diagram.

On the left, basic links are planned as single segments. For a VXLAN-based network this means the same VNI; likewise, for VLAN this means the same VID. The realization engine chooses which isolation protocol to use based on the capabilities of the network elements between the nodes chosen in the realization. This brings up a critical aspect of this approach: if the network encapsulation mechanism is indeed part of the experiment specification (as in some cases it may be), then treating protocol-level details at realization time is a fundamental requirement.

On the right, emulated links are planned as a segment per endpoint. This is required to make traffic flow through a network emulator. When the realization engine sees that a link is emulated, it builds a 1:1 segment/endpoint plan to the emulator chosen by the realization. Another benefit here is that we can effectively use emulator placement, in concert with protocol path viability, as a degree of freedom in embedding at realization time.
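A minimal sketch of that segmentation decision, reusing the illustrative types above (planSegments and newSegment are invented here, not realization engine code):

    // Sketch of the segmentation decision described above: one segment
    // for a basic link, one segment per endpoint for an emulated link.
    type Link struct {
        Emulated  bool
        Protocol  VirtProtocol
        Endpoints []string
    }

    func planSegments(link Link) []Segment {
        if !link.Emulated {
            // Basic link: a single segment shared by all endpoints,
            // i.e. one VNI for VXLAN or one VID for VLAN.
            return []Segment{newSegment(link.Protocol)}
        }
        // Emulated link: a segment per endpoint, so all traffic is
        // forced through the emulator chosen by the realization.
        segments := make([]Segment, 0, len(link.Endpoints))
        for range link.Endpoints {
            segments = append(segments, newSegment(link.Protocol))
        }
        return segments
    }

    // newSegment stands in for VNI/VID allocation, which in reality
    // must hand out identifiers unused within their isolation domain.
    func newSegment(p VirtProtocol) Segment {
        return Segment{Protocol: p}
    }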

Termination

Link termination deals with what the endpoints on a virtual link are and which protocols are used to bring end-user testbed resources onto a virtual link. The following cases are considered (a condensed decision sketch follows the last case).

Physical Access Port

This sort of termination is used when

  • The experiment requires use of the physical port on a testbed resource.
  • The experiment requires use of the physical resource and the fan-out of the experiment node is <= that of the physical resource.
  • The physical resource does not support virtualization and the fan-out of the experiment node is <= that of the physical resource.

VLAN Multiplexed

This sort of termination is used when

  • The experiment requires use of the physical resource and the fan-out of the experiment node is > that of the physical resource.
  • The physical resource does not support virtualization and the fan-out of the experiment node is > that of the physical resource.

VXLAN Hypervisor

This sort of termination is used when

  • The node is realized as a virtual machine using a tap-backed virtual network interface.

VXLAN Container

This sort of termination is used when

  • The node is realized as a container using a veth-backed virtual network interface.
  • Currently the only nodes that use this sort of termination are CPS Sensors and Actuators.

VXLAN Basic

This sort of termination is not currently used, but may come into play, similarly to VLAN Multiplexed, in concert with VRF on an attached switch.
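Condensing the termination cases above into a single decision procedure gives roughly the following sketch. NodeRealization and its fields are invented for illustration; the actual selection logic in the realization engine is certainly richer than this.

    // Sketch of termination selection, condensed from the cases above.
    type NodeRealization struct {
        RequiresPhysPort bool // experiment asks for the physical port
        OnPhysical       bool // realized on a physical resource, whether
                              // required or because the resource does
                              // not support virtualization
        FanOut           int  // experiment node's interface fan-out
        ResourcePorts    int  // physical ports on the chosen resource
        IsVM             bool // realized as a VM (tap-backed interface)
    }

    func selectTermination(n NodeRealization) TerminationKind {
        switch {
        case n.RequiresPhysPort:
            return PhysicalAccessPort
        case n.OnPhysical && n.FanOut <= n.ResourcePorts:
            return PhysicalAccessPort
        case n.OnPhysical: // fan-out exceeds the physical ports
            return VLANMultiplexed
        case n.IsVM:
            return VXLANHypervisor
        default: // container with a veth-backed interface
            return VXLANContainer
        }
    }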

Transit

Link transit deals with how traffic flows between link terminals. Three cases are considered.

VLAN Trunking

This case deals with a pure VLAN overlay. Physical access terminations are denoted as VLAN untagged, and VLAN multiplexed terminations are denoted as VLAN tagged. In this transit scenario, leaf switches connect to other leaf switches through intermediary switches on trunked VLAN ports. The intermediary switches may have an arbitrarily complex trunk interconnection topology, shown as a stacked set of switches for simplicity in the diagram.

VXLAN Overlay

This case deals with a pure VXLAN overlay. The diagram shows two hypervisors connected to a network of routers. Each hypervisor has a set of VTEPs whose remote tunnel endpoints are reached through its first-hop router. The routers in the core of the network need not know or understand VXLAN, as there is no encap/decap going on there. The routing network can be any routed network. In Merge we typically use a BGP underlay and route VXLAN over the top with EVPN.

Hybrid VLAN VXLAN

The hybrid case combines the previous two by supporting VLAN access networks at the edge of the overall network and interconnecting everything through a routed VXLAN core. In the diagram a special edge device is shown that performs VXLAN encap/decap on behalf of the lower, VLAN-based network. This is a useful setup for supporting physical device provisioning where the first-hop switch does not support VXLAN encap/decap in hardware (a common situation in 1G switching platforms).
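In the same illustrative vein, the transit case falls out of the termination kinds present on a link: when both VLAN and VXLAN terminations appear, the hybrid case (and thus an edge device) is required. This reuses the illustrative termination and transit types from earlier.

    // Sketch: pick a transit case from the termination kinds on a link.
    // VLANTrunking when everything is VLAN, VXLANOverlay when everything
    // is VXLAN, and the hybrid case when both appear (which requires an
    // edge device doing hardware encap/decap between the two domains).
    func selectTransit(terms []Termination) TransitKind {
        hasVLAN, hasVXLAN := false, false
        for _, t := range terms {
            switch t.Kind {
            case PhysicalAccessPort, VLANMultiplexed:
                hasVLAN = true
            default:
                hasVXLAN = true
            }
        }
        switch {
        case hasVLAN && hasVXLAN:
            return HybridVLANVXLAN
        case hasVXLAN:
            return VXLANOverlay
        default:
            return VLANTrunking
        }
    }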

Summary

This RFC for moving network planning from the Cogs to the Portal:

  • Eliminates duplicated complexity between the realization engine and the Cogs fabric library.
  • Takes a more structured approach to network planning by defining protocol objects in terms of resource capability classes and exposing those protocol objects as first-class API elements.
  • Defines a different complexity boundary between the portal and facilities that focuses on planning vs provisioning instead of abstract vs concrete embedding.
  • Allows network isolation protocol requirements to play a first-class role in realizations and exposes constraints to users, allowing for greater control.
  1. I like the idea of moving link planning into the portal. When I describe the Merge architecture I also say the portal is the brains, so pushing the computation to the portal makes sense. This relieves site owners of having to do that computation. The counterpoint here is that a site owner may just want control over how the network implementation is done. I’m not advocating that, just mentioning it. I don’t have a solution to it either, as I think it may make multi-site experiments painful if we allow site owners to manage their own networks while making sure that VTEP tunnels and encapsulation are done properly.

  2. It seems that this proposal is calling for the use of “edge devices”. Why can’t we utilize the emulation boxes for this purpose? We may need to plumb the VLAN out to them, but it seems orthogonally connected to the notion of managing the network.

Small comment - can you make the images with black instead of gray text?

Thanks for the feedback.

  1. Yes, I spent a lot of time thinking about the potential detriment to a facility operator that may want to join a Merge network but wants to build experiment networks in their own way, instead of using Merge software to build networks the Merge way. So far no such facility operator exists. And I think the potential uptake is far greater among people who want a solution that ‘just works’ with the Cogs than among those willing to go through the very heavy lift of building a testbed automation system that implements the Merge materialization APIs, or of conforming an existing facility to implement those APIs.

  2. I’m using the word edge here to describe networks at the periphery of the overall testbed network that are VLAN capable only. They enter the core VXLAN-based network through an ‘edge’ device that is neither a switch nor a router; it’s kinda both, and it does translation across the VLAN/VXLAN boundary. But it’s not an ‘edge device’ in the sense of the buzzword - whatever that means.

I’ll look into the text issue. Not sure what your forum settings are; I made the diagrams with a white background, my forum preference is dark mode, and somehow the diagrams automagically became dark mode themselves…

Given the things that we now support (or soon will), like VMs, containers (apparently), network emulation, and cross-site experiments, I don’t know how else a realization could be done. I suppose we could develop a site-to-portal-to-site protocol to coordinate all the site-specific details, but putting the smarts into the portal, which has all the information needed to plan the materialization, seems like the best way.

I very much appreciate the level of detail given here and the diagrams.

I believe the gray issues in the diagrams have been fixed.

Something interesting happened at the intersection of the pathfinder algorithm and the TPA routing model. A route path between any two given TPA endpoints is always optimal; in some cases multiple equal-length paths may exist, but when a TPA-based route path is calculated, only the shortest path or set of paths results.

In some cases this may not be what’s wanted. Say, for example, nodes a and b need to talk to each other through a rendezvous r that is not along the shortest path between a and b. Then asking the current TPA set of algorithms for a route path between a and b will never include r, and thus the pathfinder algorithms will never see or be able to consider r when planning.

A concrete situation in which this arises is when two nodes under the same switch connect to each other with different protocols. Suppose node a has a physical port allocated that uplinks through a leaf switch access port, and node b has a VTEP and connects to the leaf through BGP peering. Let’s further suppose that the leaf switch does not support hardware VXLAN encap/decap, so the two nodes cannot talk through the leaf switch they are directly connected to. The virtual link needs to extend to a higher rendezvous point that can translate between the VLAN and VXLAN domains in hardware.

For now I’m dealing with this problem by avoiding it: in the case I was dealing with, there is no reason the hypervisor cannot just use a VLAN trunk to uplink to the leaf switch in lieu of a VXLAN segment.

The general solution would seem to be to have the TPA path algorithms provide a via option that callers can use to ensure the path between two endpoints includes some node (or set of nodes). This could be used at realization time to find protocol-viable paths. However, the question is: should path length be increased to satisfy an arbitrary protocol decision, or should the protocol decision be driven by the shortest path? Clearly some situations will arise where a sub-optimal path is preferred to support a given set of virtual link protocols - for example, if the experiment really needs full encapsulation at some node to do something like protocol nesting, then taking an extra hop or two is worth it, as otherwise the embedding becomes infeasible.
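One way to provide such a via option is to stitch together shortest sub-paths through the required waypoints. Below is a minimal sketch over an unweighted adjacency-list graph; Graph, shortestPath, and pathVia are all invented for illustration and are not part of the TPA code.

    // Graph is a simple adjacency-list graph for this sketch.
    type Graph map[string][]string

    // shortestPath returns a shortest src->dst path via BFS, or nil if
    // none exists. A stand-in for the real TPA route calculation.
    func shortestPath(g Graph, src, dst string) []string {
        prev := map[string]string{src: src}
        queue := []string{src}
        for len(queue) > 0 {
            u := queue[0]
            queue = queue[1:]
            if u == dst {
                // Walk predecessors back to src to recover the path.
                path := []string{u}
                for u != src {
                    u = prev[u]
                    path = append([]string{u}, path...)
                }
                return path
            }
            for _, w := range g[u] {
                if _, seen := prev[w]; !seen {
                    prev[w] = u
                    queue = append(queue, w)
                }
            }
        }
        return nil
    }

    // pathVia returns a path from a to b guaranteed to pass through
    // each waypoint in vias, in order, by composing shortest sub-paths.
    // The result may be longer than the unconstrained shortest path -
    // which is exactly the trade-off discussed above.
    func pathVia(g Graph, a, b string, vias []string) []string {
        waypoints := append(append([]string{}, vias...), b)
        path := []string{a}
        at := a
        for _, v := range waypoints {
            hop := shortestPath(g, at, v)
            if hop == nil {
                return nil // no viable path through this waypoint
            }
            path = append(path, hop[1:]...)
            at = v
        }
        return path
    }

For the rendezvous example above, pathVia(g, "a", "b", []string{"r"}) always yields a path that includes r, at the cost of possibly exceeding the shortest a-b path length.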