Node emulation and virtualization support

At a high level, Merge is a platform to describe networked system experiments, allocate sufficient resources to instantiate them, automate the materialization of an experiment, support the execution of experiments within a materialization, and collect data.

At the time of writing, all Merge testbeds in production deal with physical nodes. This topic describes the layers in the Merge stack that will need to be enhanced to support node emulation through virtualization. Support in some layers already exists, as the Merge architecture was designed to handle this capability from the outset.

Layers

Expression

There are many questions that arise at the intersection of experiment expression and virtualization.

  • How are emulated node characteristics expressed by the experimenter?
  • How does the experimenter influence the choice of virtualization techniques used to ultimately materialize nodes?
  • How does the experimenter explicitly constrain experiments away from virtualization?

Much of this already exists. For example, in an experiment model the experimenter can say

x = topo.device('nodeX', cores == 4, memory == gb(8))

Here, the only underlying resource that will satisfy this experiment node is one with exactly 4 cores and 8 GB of memory. This may happen to exist in one of the resource pools the experimenter has access to, or it may not. In the case that it does not, a resource that supports virtualization may be used to create a satisfying virtual machine.

The experimenter should be able to explicitly opt out of virtualization if they wish, or for that matter opt in.

x = topo.device('nodeX', cores == 4, virt == true)
y = topo.device('nodeY', cores == 4, virt == false)

The experimenter should also be able to control specific parameters of interest, as not all virtualization techniques are created equal. Take virtual network cards as an example. There is a wide variety out there with different capabilities and performance profiles: virtio, e1000, vfio, PCI passthrough, and so on. In the following example the experimenter specifies a requirement that one node uses a virtio virtual NIC and another uses a Mellanox ConnectX-4 or newer physical NIC.

x = topo.device('x')
y = topo.device('y')
link = topo.connect([x,y])
link[x].spec(nic == virtio)
link[y].spec(nic >= connectx4)

Compilation

In Merge the compilation phase accomplishes a few things:

  • Ensures the user's model is syntactically correct
  • Checks for basic issues such as islanded networks and malformed IP address specifications (a sketch of this check follows below)

It’s not totally clear to me what ground the experiment compiler will need to cover specifically with respect to node emulation and virtualization, so this section is more of a placeholder.
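For concreteness, here is a minimal sketch of the kind of check the compiler already performs, using the islanded-network check as an example. The data structures are simplified stand-ins for illustration, not the actual compiler or XIR types.

# Minimal sketch of an islanded-network check. Devices and links are
# simplified stand-ins, not the real compiler or XIR data structures.

from collections import defaultdict

def connected_components(devices, links):
    """Return the connected components of the topology; more than one
    component means the model contains islanded networks."""
    adj = defaultdict(set)
    for a, b in links:                  # each link is a (device, device) pair
        adj[a].add(b)
        adj[b].add(a)

    seen, components = set(), []
    for d in devices:
        if d in seen:
            continue
        stack, comp = [d], set()
        while stack:
            n = stack.pop()
            if n in comp:
                continue
            comp.add(n)
            stack.extend(adj[n] - comp)
        seen |= comp
        components.append(comp)
    return components

# devices {'a', 'b', 'c'} with only an a-b link yields two components,
# flagging 'c' as islanded
assert len(connected_components({'a', 'b', 'c'}, [('a', 'b')])) == 2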

Realization

This is one place where some support for virtualization already exists. When the realization engine (Bonsai) runs, it checks, for each candidate resource it may allocate to a node, whether that resource has the VirtAlloc tag set. If it does, the realization engine allocates the experiment node as a slice of the capacity owned by that resource, and an allocation is recorded in the allocation table. When subsequent realizations consider this resource, the allocation table entries for it are fetched and the available capacity is the base capacity minus the current allocations against it. If the remaining capacity on the resource is not sufficient, the realization engine moves on.
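A rough sketch of that accounting logic follows. The field and tag names (VirtAlloc, cores, memory) are illustrative; the real Bonsai code works against XIR resource and allocation records rather than plain dictionaries.

# Rough sketch of VirtAlloc capacity accounting. Field names are illustrative,
# not the actual Bonsai/XIR types.

def remaining_capacity(resource, alloc_table):
    """Base capacity minus everything already allocated against this resource."""
    current = [a for a in alloc_table if a['resource'] == resource['id']]
    return {
        'cores':  resource['cores']  - sum(a['cores']  for a in current),
        'memory': resource['memory'] - sum(a['memory'] for a in current),
    }

def try_allocate(node, resource, alloc_table):
    if not resource.get('VirtAlloc'):
        return False                      # not virtualizable: whole-node allocation only
    free = remaining_capacity(resource, alloc_table)
    if node['cores'] <= free['cores'] and node['memory'] <= free['memory']:
        alloc_table.append({'resource': resource['id'],
                            'cores': node['cores'],
                            'memory': node['memory']})
        return True
    return False                          # insufficient capacity; the engine moves on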

The realization engine is generic in nature. It does not intrinsically understand the composition of resources or the specific elements within resources, e.g. it has no idea what a core is or what a network card is. What it does understand is the experiment intermediate representation (XIR) schema, how that schema exposes properties in a semi-structured way, and how to match experiment specifications to resource specifications. So the extent to which the realization engine is able to find what the experiment calls for is the extent to which the underlying resource models contain sufficient information for doing so.

Materialization

Materialization is the process by which the resources allocated in the realization phase are turned into a ticking, breathing experiment. In this phase all of the node and link specifications from an experiment are delivered to a testbed facility in the form of materialization fragments. These fragments are generic containers of information that are self-describing in terms of the type of information they carry and can be unpacked by testbed facility automation systems.
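As a purely illustrative example, a fragment can be thought of as a typed, versioned envelope around a payload. The field names below are assumptions for the sake of the sketch, not the actual fragment schema.

# Illustrative only: a materialization fragment as a self-describing envelope.
# The field names are assumptions, not the real Merge fragment schema.

import json

fragment = {
    'kind': 'node',          # tells the facility automation which unpacker to use
    'version': 1,
    'payload': {
        'name': 'nodeX',
        'image': 'ubuntu-2004',
        'cores': 4,
        'memory_gb': 8,
        'virt': True,        # realized as a VM slice of a hypervisor
    },
}

print(json.dumps(fragment, indent=2))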

Automation

The prevailing testbed automation system that runs all current Merge testbed facilities is the Cogs. The Cogs take these materialization fragments and turn them into a directed acyclic graph (DAG) of tasks that need to be done to perform the materialization. A pool of replicated workers (each worker is referred to as rex) watches for new task DAGs in the Cogs data store and executes the graphs in an optimally concurrent way.
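The execution pattern is essentially a topological walk of the DAG with a worker pool: launch every task whose prerequisites are complete, wait for something to finish, repeat. A minimal sketch, with placeholder task and dependency types rather than the actual Cogs/rex structures:

# Sketch of concurrent DAG execution. Tasks and deps are placeholders for the
# real Cogs task graph; assumes the graph is acyclic.

from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

def run_dag(tasks, deps, do_task, workers=8):
    """tasks: iterable of task ids; deps: {task: set of prerequisite tasks}."""
    pending = {t: set(deps.get(t, ())) for t in tasks}
    done, futures = set(), {}
    with ThreadPoolExecutor(max_workers=workers) as pool:
        while pending or futures:
            # launch everything whose prerequisites are all satisfied
            for t in [t for t, d in pending.items() if d <= done]:
                futures[pool.submit(do_task, t)] = t
                del pending[t]
            finished, _ = wait(futures, return_when=FIRST_COMPLETED)
            for f in finished:
                done.add(futures.pop(f))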

Rex does not intrinsically understand many of the tasks it performs. For example, a common task is setting up DNS and DHCP for an experiment, or setting up VXLAN/EVPN domains for experiment links. Rex understands neither of these things. What Rex does understand is how to use the APIs of the various testbed subsystems that can do these things. This is a fundamental design principle of the Cogs automation system: the automation system itself focuses on fast and reliable automation, and interacts with external systems through well-defined APIs to implement specific capabilities. This is precisely how virtualization support will enter Merge testbed facilities.

We have already begun work on utilizing the Sandia Minimega technology as a control plane for virtual machines. The idea is that testbed nodes that support virtualization and are currently in a virtual allocation state (some nodes have a dual personality and can be used either as a hypervisor host or as a bare-metal machine) will run Minimega, and the Cogs will use the Minimega API to spawn, configure, and generally control virtual machines.
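As a very rough sketch of what driving Minimega from the automation side could look like, the snippet below shells out to a running minimega instance with the -e flag. The command strings and VM parameters are approximations written from memory of the minimega documentation and should be treated as assumptions rather than the final integration.

# Very rough sketch only. The minimega command strings are approximate and the
# parameters are assumptions; the real integration would be driven by the
# materialization fragments rather than hard-coded values.

import subprocess

def mm(cmd):
    # send a command to an already-running minimega instance
    subprocess.run(['minimega', '-e', cmd], check=True)

def launch_vm(name, vcpus, memory_mb):
    mm(f'vm config vcpus {vcpus}')
    mm(f'vm config memory {memory_mb}')
    mm(f'vm launch kvm {name}')
    mm('vm start all')

launch_vm('nodeX', vcpus=4, memory_mb=8192)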

Implementation

Underneath the covers of automating virtual machine provisioning through the cogs is the fact that these virtual machines must be implemented in a way that honors the constraints laid out by the experiment specifications. This is similar in spirit to the network emulation systems we have: while the automation of network plumbing and general provisioning for network emulation is complex on its own, the implementation of correct network emulations is a field of its own. The node emulation facilities implemented through virtualization have just as much complexity under the hood. One of the architectural aims is to decouple the automation and provisioning of node emulations from the node emulator implementation.

The virtual machines in Merge will need to support advanced constraints and notions of fairness to be viable for rigorous experimentation. The biggest difference in the way testbeds use virtualization, as opposed to most other platforms, is that we want a virtual machine to run with a specific performance profile and not 'as fast as possible up to some limit'. This includes, but is not limited to:

  • CPU bandwidth scheduling: 'Give me a node with 4 cores at 2.2 GHz' (see the cgroup sketch after this list)
  • Inter-component I/O: 'Give me a DDR 2400 memory bus and a PCIe 3.0 bus with 16 lanes'
  • Network cards: 'Give me a NIC with the following capabilities {…}'
  • Time dilation: 'Give me a VM that runs 10x slower than real time'
  • Sender-side emulation for virtual NICs: emulating wireless NICs, emulating optical link dynamics
  • Support for GPUs in passthrough and virtualized modes. The latter is quite new, so it could be fun to explore.
  • Emulation on non-x86 platforms. We can always use QEMU TCG for non-x86 targets, but QEMU/KVM also works great on other platforms with full hardware virtualization support, like ARM and, to some extent, RISC-V.
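As one concrete example of the first bullet, CPU bandwidth can be capped through the cgroup v2 cpu.max interface (quota and period in microseconds). This only bounds aggregate CPU time; holding a specific clock frequency additionally involves the host's frequency governor, which is not shown. A rough sketch, assuming a hypervisor on cgroup v2 and a hypothetical per-VM cgroup path:

# Rough sketch: cap a VM's cgroup at the equivalent of 4 cores x 2.2 GHz on a
# 3.0 GHz host via cgroup v2 cpu.max. The cgroup path is hypothetical and
# frequency pinning itself (governors, turbo) is not handled here.

PERIOD_US = 100_000              # 100 ms scheduling period

def cpu_max(cores, target_ghz, host_ghz):
    # scale the quota: 4 cores at 2.2 GHz is ~2.93 cores worth of a 3.0 GHz host
    quota = int(PERIOD_US * cores * (target_ghz / host_ghz))
    return f"{quota} {PERIOD_US}"

def apply(cgroup_path, cores, target_ghz, host_ghz):
    with open(f"{cgroup_path}/cpu.max", "w") as f:
        f.write(cpu_max(cores, target_ghz, host_ghz))

# e.g. apply('/sys/fs/cgroup/experiment/nodeX', cores=4, target_ghz=2.2, host_ghz=3.0)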

Virtual Network Plumbing

In this post I’ll lay out the VM connectivity cases I’m looking to cover in the initial implementation of virtualization support in Merge. One of the primary implementation details I’ll focus on is which parts of the network are provisioned and managed centrally by the Cogs testbed automation platform, and which parts are managed individually by hypervisors. This distinction essentially boils down to which virtual network identifiers need to be global, and which can be determined locally by a single hypervisor without broader coordination. In the diagrams that follow this distinction is denoted by a horizontal dotted line bisecting the network elements.

The mechanisms for attaching virtual machines to an experiment network in the initial MergeTB release of virtualization capabilities will be the following:

  • virtio
  • vfio
  • PCI passthrough

and their combinations.

Virtio

Virtio is in many ways the simplest model to support from a testbed-level networking perspective. In the absence of experimental constraints that preclude the use of virtio, this is the default mechanism that will be used. It’s also the most flexible as it allows for an arbitrary number of interfaces to be created for a given VM or set of VMs.

I say that this is the simplest model from the testbed-level networking perspective because we can place the VTEPs directly on the hypervisor itself. Thus the hypervisor becomes a BGP peer in the layer-3 testbed underlay network and no special plumbing is required in the core of the testbed network. In the limit, if a testbed is composed entirely of hypervisors, the testbed switching mesh is pure underlay, with zero VLAN, VXLAN, or explicit EVPN configuration.

Global / Local Boundary

For the virtio mechanism, the cogs system will place VXLAN tunnel endpoints (VTEPs) on the hypervisor before the virtual machines are set up. The cogs then send a request to the hypervisor manager to set up the virtual machines; the request contains a mapping between each virtual NIC inside a virtual machine and the VTEP it is to be associated with.

It is the hypervisor manager’s responsibility to create the plumbing between the virtual NIC inside the VM and the associated VTEP that the cogs have put in place on the machine. In this example we have shown a strategy where the hypervisor sets up a bridge and a set of taps, and adds the provided VTEPs to the bridge. This bridge would necessarily be a filtering bridge to prevent the VMs from talking directly over the bridge (this could be allowed in the case that it is explicitly desired, but is unlikely to be the general case for a network testbed, where the links between nodes are commonly emulated).
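To make the global / local split concrete, the sketch below shows one plausible command sequence, driven from Python only for consistency with the other examples: the cogs create the VTEP and join it to the underlay, and the hypervisor manager creates the bridge and tap and wires them to that VTEP. Device names, the VNI, and the local address are made up, and the bridge filtering rules are omitted.

# Illustrative virtio plumbing. Device names, VNI and addresses are placeholders;
# the local strategy is free to change as long as the touch point with the
# testbed remains the VTEP. Filtering rules on the bridge are omitted.

import subprocess

def sh(*args):
    subprocess.run(args, check=True)

# global side (cogs): VTEP on the hypervisor, terminated in the L3 underlay
sh('ip', 'link', 'add', 'vtep100', 'type', 'vxlan',
   'id', '100', 'dstport', '4789', 'local', '10.99.0.7', 'nolearning')
sh('ip', 'link', 'set', 'vtep100', 'up')

# local side (hypervisor manager): filtering bridge + tap for the VM's NIC
sh('ip', 'link', 'add', 'ex0br', 'type', 'bridge')
sh('ip', 'tuntap', 'add', 'dev', 'ex0tap0', 'mode', 'tap')
sh('ip', 'link', 'set', 'ex0tap0', 'master', 'ex0br')
sh('ip', 'link', 'set', 'vtep100', 'master', 'ex0br')
for dev in ('ex0br', 'ex0tap0'):
    sh('ip', 'link', 'set', dev, 'up')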

The nice thing about this design is that it allows the machine-level plumbing, which can have a significant impact on overall performance and the introduction of artifacts, to evolve independently of the testbed-level network plumbing. The stable point is that the touch point with the testbed at large will be a VTEP device.

Vfio

Vfio allows hardware-virtualized segments of a NIC to be passed through to a virtual machine. In some cases this results in an increase in fidelity over a virtio device, as the vfio device can be more representative of the feature set found on physical NICs, and the performance may be more desirable under certain conditions. None of these are hard and fast rules; it depends on the experimental situation at hand and what specific aspects of fidelity are important to maintain.

Vfio is a bit less flexible than virtio, as there are limits on the number of devices that can be supported, and it pushes the entry point into the testbed L3 underlay up to the connecting switch.

When a virtual function (VF) device is created based on a physical device that supports single-root input/output virtualization (SR-IOV), there are a few ways that we can isolate packets onto the appropriate testbed-level virtual network. I’m leaning toward the first option (VST) at this time.

VLAN switch tagging

VF devices on many Mellanox and Intel NICs allow the VF to be transparently tagged with a VLAN ID. This is known as VLAN switch tagging (VST). It makes the VF device act more or less like an access (untagged PVID) port on a switch. On egress from the VF the tag is applied, so all outgoing traffic is tagged and may be handled appropriately upstream. On ingress, the VLAN tag is stripped, so the virtual NIC inside the VM does not see these tags.

TODO: at this time it is not clear to me whether VLAN stacking will work in this context, e.g. if the experiment is using tagged VLAN packets itself, will the tags get stacked QinQ-style, or will they be obliterated?

Spoofcheck

Enabling spoofcheck on the VF device means that the guest inside the virtual machine cannot change the MAC address of the virtual NIC inside the VM; if it does, packets will be dropped. This allows us to strongly tie traffic coming from a specific VM interface to a particular logical interface in the experiment’s network topology model. What this allows us to do is set up the forwarding database (FDB) on the bridge of the connected leaf switch to forward all traffic sourced from this MAC onto the desired VTEP, which in effect puts the traffic in the correct global experiment-level segment. The obvious disadvantage is that changing the MAC is not available to the experimenter. This would almost certainly need to be done within the context of some sort of VLAN at the switch level as well, to prevent MAC hijacking.
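One ingredient of such a scheme, on a Linux-based leaf switch, could be a static VXLAN FDB entry pinning the VM's (now unchangeable) MAC to the VTEP for its segment. This is a simplification of what is described above, and the MAC, device name, and remote VTEP address are placeholders.

# Sketch only: pin a known VM MAC onto a specific VXLAN tunnel on the leaf
# switch. The MAC, device name, and remote VTEP address are placeholders.

import subprocess

vm_mac = '02:00:00:00:00:42'   # stable because spoofcheck is on for the VF
vtep   = 'vtep200'             # VXLAN device for this experiment segment
remote = '10.99.0.9'           # VTEP that terminates the segment remotely

subprocess.run(['bridge', 'fdb', 'append', vm_mac, 'dev', vtep, 'dst', remote],
               check=True)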

Global / Local Boundary

This discussion assumes that VST is the local isolation mechanism being used. In this case the VLAN tag selected must be viable for both the hypervisor and the leaf switch it’s connected to. For this reason, the VF devices and their configuration will be managed by the cogs. Prior to launching any virtual machines, the cogs will calculate the VLAN tags for all needed VFs, create the VF devices on the appropriate hypervisors, and apply the VLAN tags. After the network setup phase, the cogs will send out virtual machine requests to the hypervisor managers that map VM virtual NICs to the appropriate VF devices.
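On a Linux hypervisor, the per-VF configuration the cogs would drive looks roughly like the following. The interface name, VF count, VF index, and VLAN tag are placeholders.

# Sketch of the per-VF setup performed before VM launch. The interface name,
# VF count, VF index and VLAN tag are placeholders.

import subprocess

def sh(*args):
    subprocess.run(args, check=True)

pf, vf, vlan = 'enp59s0f0', 0, 101

# create the VFs on the physical function (SR-IOV)
with open(f'/sys/class/net/{pf}/device/sriov_numvfs', 'w') as f:
    f.write('4')

# apply VLAN switch tagging (VST) and spoofcheck to the VF handed to the VM
sh('ip', 'link', 'set', pf, 'vf', str(vf), 'vlan', str(vlan))
sh('ip', 'link', 'set', pf, 'vf', str(vf), 'spoofchk', 'on')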

Hybrid virtio + vfio

Virtio and vfio should be able to coexist. A point of interest is whether virtio and vfio can exist on the same NIC, and if they can, whether it is a good idea to do so. If yes to both, then the two may be freely intermixed up to the PF cardinality limit for a given NIC. If not, then once either mechanism is put to use on a particular NIC, the physical NIC is pinned to that mode of operation until the virtual NICs it is serving have been dematerialized.

Global / Local Boundary

The hybrid state does not change the global / local boundary; it’s simply spread across two distinct mechanisms.

PCI Passthrough

In some cases use of an entire physical NIC with particular features is required. In this case PCI passthrough may be used.

Similar to the VF case, encapsulation of traffic onto the appropriate testbed-level virtual network is handled at the leaf switch. However, in this case it is not possible for the testbed to enforce a VLAN tag like we can for a VF. Thus the connecting switch port is operated in access mode: all ingress traffic to the switch port is tagged as it goes upstream to the experiment network at large, and all egress traffic is untagged as it goes downstream to the experiment node.
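On a leaf switch running Linux (Cumulus-style), putting the port facing a passthrough NIC into access mode for the experiment's segment could look roughly like this. The port name, VLAN ID, and the name of the VLAN-aware bridge are assumptions.

# Sketch: configure the leaf switch port facing a passthrough NIC as an access
# port. Port name, VLAN ID and bridge name are assumptions; the switch is
# assumed to be Linux-based and managed with iproute2.

import subprocess

def sh(*args):
    subprocess.run(args, check=True)

port, vlan = 'swp12', 101

sh('ip', 'link', 'set', port, 'master', 'bridge')        # VLAN-aware bridge named 'bridge'
sh('bridge', 'vlan', 'del', 'dev', port, 'vid', '1')     # drop the default PVID
sh('bridge', 'vlan', 'add', 'dev', port, 'vid', str(vlan), 'pvid', 'untagged')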

Virtual multiplexing of passthrough devices will not be supported. If testbed-based multiplexing of the device is required, virtio or vfio must be used. More generally, testbed-automated multiplexing of virtual NICs within virtual machines through things like the VLAN sub-interfaces we currently provide for physical devices will not be supported - this is just the wrong way to do things. VLAN multiplexing for physical nodes is a necessary evil.

One additional thought is whether we’ll want to leverage a DPDK-based OVS implementation on the hypervisor nodes. This may be required to maximize performance in virtio NIC configurations that call for >= 10G bandwidth or O(1 us) latencies. I imagine this would likely fall under the “hybrid virtio + vfio” configuration. The reason is that we would still retain the ability to configure an arbitrary number of vNICs on the hypervisor that would be bridged to the local VTEP via the DPDK NICs, but a certain number of physical NICs would need to be reserved by the host for DPDK and so would be unavailable for other purposes, such as direct passthrough or networking for other (possibly bare-metal) experimental endpoints on the node.

To the extent possible I’d like to stick to vanilla Linux networking, recognizing that DPDK and OVS can provide performance advantages compared to traditional approaches, and that today’s kernel has first-class subsystems such as XDP and eBPF which may allow us to get the same benefit with more streamlined systems. I’d like to pick apart the use cases we have and see how these technology alternatives come into play in that context.

Let’s consider the ways in which experiment nodes (physical or virtual) are connected in Merge topologies. At the most basic level it’s a 2x2 matrix.

Link Type

  • Point to point links (P2P)
  • Multipoint links (MPL)

Emulation

  • Direct links
  • Emulated links

When we bring hypervisors into play, there is one more dimension to the matrix:

Locality

  • Colocated
  • Dispersed

Let’s now consider the Link Type vs. Locality matrix within the context of the emulated and non-emulated cases.

Direct

  • Colocated / P2P: Linux Bridge, OVS, BPF redirect map
  • Colocated / MPL: Linux Bridge, OVS, XDP/Click, BPF multi-map redirect
  • Dispersed / P2P: Linux Filtering Bridge + VTEP, OVS, tap/XDP/VTEP
  • Dispersed / MPL: same as Dispersed / P2P

Colocated

P2P links are fairly simple, as no switching behavior is required. In this case direct connections between VMs through an XDP/BPF redirect map are by far the simplest solution. We do this in Raven and it works well. Something that needs to be managed when using virtio interfaces is checksum offloading, as virtio seems to assume it will be connected to a traditional Linux bridge that will correct checksums.
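As a rough illustration of the wiring step (not the BPF program itself), the sketch below assumes a prebuilt XDP object, here called p2p_redirect.o, that redirects frames between two tap devices via a devmap. The tap names, the object file, and the choice to disable offloads on the taps are all assumptions; the right place to adjust checksum offload (guest vs. host) depends on the setup.

# Wiring sketch only: attach a prebuilt XDP redirect program to the taps
# backing two VMs. The BPF object (p2p_redirect.o), tap names, and offload
# settings are assumptions.

import subprocess

def sh(*args):
    subprocess.run(args, check=True)

for tap in ('ex0tap0', 'ex0tap1'):
    # offloads are one knob for the virtio checksum issue noted above; whether
    # to disable them here or in the guest depends on the configuration
    sh('ethtool', '-K', tap, 'tx', 'off', 'rx', 'off')
    # attach the redirect program; it forwards frames to the peer tap via a devmap
    sh('ip', 'link', 'set', 'dev', tap, 'xdp', 'obj', 'p2p_redirect.o', 'sec', 'xdp')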

MPL links are more complicated, as one may want either ‘hub’ behavior or actual switching based on learning or pre-cooked forwarding tables. I think this is the case where OVS may have a performance advantage over a traditional Linux bridge, but I am not sure what that delta looks like today. An alternative here is to use a Click-based bridge with XDP. We could also look into the use of OVS in an XDP mode of operation to avoid DPDK, which is something I know they’ve been working on. For the hub behavior we could implement a simple XDP/BPF multi-map redirect, which is not a big jump from the P2P redirect we already do in Raven. A simple learning/hashing FIB may also be possible here, but requires further investigation.

Dispersed

In the dispersed case we are always pushing packets into VTEPs, so P2P and MPL are really the same problem locally on the hypervisor. I’m inclined to agree that for a large number of interfaces spread across a bunch of virtual machines a Linux filtering bridge may not perform spectacularly (although it’s really been getting a lot better over the last several kernel releases, and it’s worth a look at how it does in 5.6). I really like the solution of mapping the tap devices from the VM directly into the VTEPs through XDP/BPF. This skips the entire kernel network stack and reduces (eliminates?) the need for a BGP/EVPN-managed FIB on the hypervisor, which is great.

Emulated

This is where things get interesting, as we start to toy with the idea of on-hypervisor and even sender-side (read: in the virtual NIC) emulation. The simple answer here is that for the time being, in a world where we are not doing network emulation on the hypervisor and all emulated packets go to an emulator that is elsewhere, the choice of how to do on-hypervisor plumbing decays to the dispersed decision space above, as packets always go off-node through a VTEP.

In a world where we start to consider on-hypervisor emulation, this essentially means we are running Moa on the hypervisor, potentially working in conjunction with a global Moa. This is more or less how the Steam testbed works, albeit in an ad hoc way that has not been formalized and cooked into a fully automated Merge context.

The biggest concern here is managing the resources needed for the emulation of networks vs. the emulation of nodes, and not letting them step on each other. There are lots of options here, including hosting the network emulator itself as a VM and curtailing its resource allocation that way, or using something like Linux cgroups. I’ll stop here and wait for feedback.

  • Colocated / P2P: Linux Filtering Bridge + VTEP, OVS, tap/XDP/VTEP, Local Moa
  • Colocated / MPL: same as Colocated / P2P
  • Dispersed / P2P: same as Colocated / P2P
  • Dispersed / MPL: same as Colocated / P2P