Node emulation and virtualization support

To the extent possible, I’d like to stick to vanilla Linux networking. I recognize that DPDK and OVS can provide performance advantages over traditional approaches, and that today’s kernel has first-class subsystems such as XDP and eBPF that may allow us to get the same benefit with more streamlined systems. I’d like to pick apart the use cases we have and how these technology alternatives come into play in that context.

Let’s consider the ways in which experiment nodes (physical or virtual) are connected in Merge topologies. At the most basic level it’s a 2x2 matrix.

Link Type

  • Point to point links (P2P)
  • Multipoint links (MPL)

Emulation

  • Direct links
  • Emulated links

When we bring hypervisors into play, there is one more dimension to the matrix:

Locality

  • Colocated
  • Dispersed

Let’s now consider the Link Type vs Locality matrix within the context of the non-emulated (direct) and emulated cases.

Direct

          | P2P                                              | MPL
Colocated | Linux Bridge, OVS, BPF redirect-map              | Linux Bridge, OVS, XDP/Click, BPF multi-map-redirect
Dispersed | Linux Filtering Bridge + VTEP, OVS, tap/XDP/VTEP | <~~ same

Colocated

P2P links are fairly simple, as no switching behavior is required. In this case, connecting VMs directly through an XDP/BPF redirect map is by far the simplest solution. We do this in Raven and it works well. One thing that needs to be managed when using virtio interfaces is checksum offloading, as virtio seems to assume it will be connected to a traditional Linux bridge that will correct checksums.
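
The forwarding decision for a P2P link amounts to a static lookup keyed by the ingress interface. The sketch below is a userspace Python model of that lookup only, not an XDP program and not Raven's code; the `RedirectMap` class and the interface indices are hypothetical names chosen for illustration.

```python
from typing import Dict, Optional


class RedirectMap:
    """Userspace model of a P2P redirect map: ingress ifindex -> egress ifindex."""

    def __init__(self) -> None:
        self._peer: Dict[int, int] = {}

    def connect(self, a: int, b: int) -> None:
        """Wire two interfaces together as one bidirectional P2P link."""
        if a == b:
            raise ValueError("a P2P link needs two distinct interfaces")
        for ifindex in (a, b):
            if ifindex in self._peer:
                raise ValueError(f"interface {ifindex} is already part of a link")
        self._peer[a] = b
        self._peer[b] = a

    def forward(self, ingress: int) -> Optional[int]:
        """Return the egress ifindex for a frame, or None to drop it."""
        return self._peer.get(ingress)


# Hypothetical tap interface indices for two colocated VMs.
links = RedirectMap()
links.connect(10, 11)
```

Because each interface has exactly one peer, there is no learning and no flooding: a frame either has a peer to go to or is dropped.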

MPL links are more complicated, as one may want either ‘hub’ behavior or actual switching based on learning or pre-cooked forwarding tables. I think this is the case where OVS may have a performance advantage over a traditional Linux bridge, but I am not sure what that delta looks like today. An alternative here is to use a Click-based bridge with XDP. We could also look into running OVS in an XDP mode of operation to avoid DPDK, which is something I know they’ve been working on. For the hub behavior we could implement a simple XDP/BPF multi-map redirect, which is not a big jump from the P2P redirect we already do in Raven. A simple learning/hashing FIB may also be possible here, but requires further investigation.
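
To make the difference between the two MPL behaviors concrete, here is a small Python model of hub-style flooding and of a learning FIB that falls back to flooding for unknown destinations. It is a sketch of the logic only, assuming integer port numbers and string MAC addresses; `hub_egress` and `LearningFib` are hypothetical names, and a real implementation would also need entry aging and a bounded table.

```python
from typing import Dict, List, Set


def hub_egress(ports: Set[int], ingress: int) -> List[int]:
    """Hub behavior: replicate the frame to every port except the ingress port."""
    return sorted(p for p in ports if p != ingress)


class LearningFib:
    """Model of a learning switch: remember the port each source MAC was seen on."""

    def __init__(self, ports: Set[int]) -> None:
        self.ports = set(ports)
        self.table: Dict[str, int] = {}

    def egress(self, ingress: int, src_mac: str, dst_mac: str) -> List[int]:
        # Learn (or refresh) the location of the sender.
        self.table[src_mac] = ingress
        out = self.table.get(dst_mac)
        if out is None:
            # Unknown destination: fall back to hub-style flooding.
            return hub_egress(self.ports, ingress)
        if out == ingress:
            # Destination sits behind the ingress port: nothing to forward.
            return []
        return [out]


fib = LearningFib({1, 2, 3})
first = fib.egress(1, "aa:aa:aa:aa:aa:aa", "bb:bb:bb:bb:bb:bb")  # flooded
reply = fib.egress(2, "bb:bb:bb:bb:bb:bb", "aa:aa:aa:aa:aa:aa")  # learned
```

Broadcast frames are flooded without special handling, since the broadcast address never appears as a source and so is never learned.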

Dispersed

In the dispersed case we are always pushing packets into VTEPs, so P2P and MPL are really the same problem locally on the hypervisor. I’m inclined to agree that for a large number of interfaces spread across a bunch of virtual machines, a Linux filtering bridge may not perform spectacularly (although it has been getting a lot better over the last several kernel releases, and it’s worth a look at how it does in 5.6). I really like the solution of mapping the tap devices from the VM directly into the VTEPs through XDP/BPF. This skips the entire kernel network stack and reduces (eliminates?) the need for a BGP/EVPN-managed FIB on the hypervisor, which is great.
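
A direct tap-to-VTEP mapping can be pictured as a static table from tap interface to tunnel, followed by VXLAN encapsulation. The Python sketch below models only that table and the 8-byte VXLAN header as defined in RFC 7348; the outer Ethernet/IP/UDP headers are left out, and the `Tunnel` type, the interface index, the VNI and the VTEP address are all hypothetical values for illustration.

```python
import struct
from typing import Dict, NamedTuple, Optional, Tuple

VXLAN_UDP_PORT = 4789        # IANA-assigned VXLAN destination port (RFC 7348)
VXLAN_FLAG_VNI_VALID = 0x08  # the "I" flag: the VNI field is valid


class Tunnel(NamedTuple):
    vni: int
    remote_vtep: str


def vxlan_header(vni: int) -> bytes:
    """Build the VXLAN header: 8 flag bits, 24 reserved, 24-bit VNI, 8 reserved."""
    if not 0 <= vni < 2 ** 24:
        raise ValueError("VNI must fit in 24 bits")
    return struct.pack("!II", VXLAN_FLAG_VNI_VALID << 24, vni << 8)


def encapsulate(
    tap_to_tunnel: Dict[int, Tunnel], tap_ifindex: int, frame: bytes
) -> Optional[Tuple[str, bytes]]:
    """Return (remote VTEP, VXLAN payload) for a frame, or None if the tap is unmapped."""
    tunnel = tap_to_tunnel.get(tap_ifindex)
    if tunnel is None:
        return None
    return tunnel.remote_vtep, vxlan_header(tunnel.vni) + frame


# Hypothetical mapping: tap ifindex 7 belongs to VNI 100, terminated at 10.0.0.2.
table = {7: Tunnel(vni=100, remote_vtep="10.0.0.2")}
```

The table here is filled in ahead of time, which is the sense in which a statically provisioned mapping could stand in for a dynamically managed FIB on the hypervisor.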

Emulated

This is where things get interesting, as we start to toy with the idea of on-hypervisor and even sender-side (read: in the virtual NIC) emulation. The simple answer is that for the time being, in a world where we are not doing network emulation on the hypervisor and all emulated packets go to an emulator that is elsewhere, the choice of how to do on-hypervisor plumbing reduces to the dispersed decision space above, as packets always go off-node through a VTEP.
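
The reduction described here can be written down as a one-line rule. This is only a restatement of the paragraph in Python, assuming all emulation happens off-node; `Locality` and `effective_locality` are hypothetical names.

```python
from enum import Enum


class Locality(Enum):
    COLOCATED = "colocated"
    DISPERSED = "dispersed"


def effective_locality(endpoints: Locality, emulated: bool) -> Locality:
    """Locality case that governs on-hypervisor plumbing for a link.

    Assumes the emulator runs elsewhere, so emulated traffic always
    leaves the hypervisor through a VTEP.
    """
    if emulated:
        return Locality.DISPERSED
    return endpoints
```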

In a world where we start to consider on-hypervisor emulation, this essentially means we are running a Moa instance on the hypervisor that potentially works in conjunction with a global Moa. This is more or less how the Steam testbed works, albeit in an ad-hoc way that has not been formalized and cooked into a fully automated Merge context.

The biggest concern here is managing the resources needed for the emulation of networks vs the emulation of nodes, and not letting them step on each other. There are lots of options here, including hosting the network emulator itself as a VM and curtailing its resource allocation that way, or using something like Linux cgroups. I’ll stop here and wait for feedback.
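
As one illustration of the cgroups option, the sketch below computes the values one would write to the cgroup v2 control files `cpu.max` and `memory.max` to cap an emulator's share of the host. It only builds the strings and does not touch the filesystem; the 4 CPU / 8 GiB split is a made-up example, and the function name is hypothetical.

```python
from typing import Dict


def cgroup_v2_limits(
    cpus: float, memory_bytes: int, period_us: int = 100_000
) -> Dict[str, str]:
    """Values for the cgroup v2 control files that cap CPU time and memory."""
    if cpus <= 0 or memory_bytes <= 0 or period_us <= 0:
        raise ValueError("limits must be positive")
    quota_us = int(cpus * period_us)
    return {
        "cpu.max": f"{quota_us} {period_us}",  # "<quota> <period>" in microseconds
        "memory.max": str(memory_bytes),       # hard memory limit in bytes
    }


# Hypothetical split: give the emulator 4 CPUs' worth of time and 8 GiB of memory.
moa_limits = cgroup_v2_limits(cpus=4, memory_bytes=8 * 2**30)
```

Applying the limits would mean writing each value to the matching file in the emulator's cgroup directory, assuming the unified cgroup v2 hierarchy is mounted on the hypervisor.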

          | P2P                                                         | MPL
Colocated | Linux Filtering Bridge + VTEP, OVS, tap/XDP/VTEP, Local Moa | <~~ same
Dispersed | ^~~ same                                                    | <~~ same