Next Generation Network Interface Specification

Advanced network testbed users require specific capabilities from the network interfaces available to them. This post outlines how we will provide a way to specify the interface details an experiment needs in order to execute successfully.

Given the following topology, let's say that for my pX nodes in the core of the network I have some specific network requirements.

These requirements are:

  • DPDK capable network interfaces
  • At least 8 hardware TX/RX queues on each interface

A sketch of how this can be specified in an experiment topology follows; the new interface constraints are at the bottom of the file.

from mergexp import *

net = Network('triforce')

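# Core (pX) node: a bare-metal host with at least 1 GB of memory, 1 core,
# and 100 GB of disk.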
def px(name):
    return net.node(
        name,
        memory.capacity >= gb(1),
        proc.cores >= 1,
        disk.capacity >= gb(100),
        metal == True
    )

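# Edge (vX) node: at least 1 GB of memory, 4 cores, and 100 GB of disk;
# no metal constraint, so it may be realized as a virtual machine.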
def vx(name):
    return net.node(
        name,
        memory.capacity >= gb(1),
        proc.cores >= 4,
        disk.capacity >= gb(100),
    )

p = [px('p%d'%i) for i in range(3)]

net.connect([p[0], p[1]], link.capacity == mbps(30))
net.connect([p[0], p[2]], link.capacity == mbps(30))
net.connect([p[1], p[2]], link.capacity == mbps(30))

v = [vx('v%d'%i) for i in range(9)]
net.connect([p[0]] + v[0:3], link.capacity == gbps(1))
net.connect([p[1]] + v[3:6], link.capacity == gbps(1))
net.connect([p[2]] + v[6:9], link.capacity == gbps(1))

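# The new part of this sketch: per-interface requirements on the core nodes.
# Each interface (socket) must be DPDK capable and expose at least 8 hardware
# TX/RX queues.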
for machine in p:
    for socket in machine.spec.sockets:
        socket.require(
            link.dpdk == True,
            link.queues >= 8,
        )

experiment(net)

This is clearly not an exhaustive list of interface properties; more will be added over time. The topic for discussion here is the general approach of specifying things this way, and whether there are use cases it may not cover.

Some jumbled thoughts below.

I. There are at least a few levels of abstraction at which users might wish to express constraints/requirements (a rough sketch follows the list below):

  • high level traffic flow characteristics (e.g., x Gbps TCP throughput per flow, M parallel flows per link, no packet coalescing)
  • low level device characteristics (e.g., x Gbps-capable full-duplex link, virtio/vfio driver, disable/enable of offload features, M-queue tx/rx ports, link MTUs)
  • others (e.g., DPDK capability)
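
For concreteness, here is a rough sketch of what those two levels might look like in the mergexp-style syntax above. This is purely illustrative: the flow.* properties and the link.mtu/link.offload/link.driver names are hypothetical and not part of any existing API.

# Hypothetical high-level constraint: describe end-to-end flow behavior and
# let the realization engine derive the device-level settings.
net.connect([p[0], p[1]],
    flow.throughput >= gbps(10),   # sustained per-flow TCP throughput
    flow.parallel >= 4,            # concurrent flows on the link
)

# Hypothetical low-level constraints: state the device characteristics directly.
net.connect([p[0], p[1]],
    link.capacity == gbps(10),
    link.mtu >= 9000,
    link.offload == False,         # no packet coalescing/offload features
    link.driver == 'vfio',
)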

You could imagine – and maybe we are already doing this – modeling the experiment realization as a constraint satisfaction problem. For example, if you want 10 Gbps TCP, you (probably) cannot specify virtio + no packet coalescing + 1500 MTU. This may be challenging to get 100% right, though, because things like CPU frequency (both in the physical nodes and the user's virtual CPUs) also play into the equation.

Eliminating the highest level makes things a bit easier, because we wouldn't need to go through the steps of understanding how low-level constraints map to high-level end-to-end flow characteristics. However, I imagine at least some users would like to operate at that high level, so it would be cool if we could figure out how to do it.

II. Unclear things

  • How do you specify that a set of links – either virtual or physical – should or should not have layer-2 connectivity? Is the assumption that each link is on the same VLAN unless the user specifies otherwise?
  • Can users specify IP subnets?
  • Do users need to specify a virtual->physical topology, or do we instead have a model that is fully virtual? E.g., in this case, you would just say that you have 3 groups of 3 VMs, where intra-group links are 1 Gbps and cross-group links are 30 Mbps.

III. Searchlight/VXLAN related stuff

  • Is the default to map each VLAN:VNI 1:1? This is the easiest to reason about, but it limits us to 4K VNIs
  • Do VTEPs/host bridges have a mechanism to discover traffic endpoints without relying on learning? I imagine this is not something we want to expose to the user, but if we do something like enable MAC learning as the default model, some users would need to tell us not to, because things like asymmetric flows can break learning mechanisms. Such users may want to do things like program FDBs a priori.

Thanks for the feedback.

Yes, we do treat the realization problem as a constraint satisfaction problem; this has been one of the core themes for Merge from the beginning. An experiment is modeled as a set of constraints that define a validity domain, and the realization engine is a constraint satisfaction solver over a network of coupled constraints (a very difficult problem).

The approach we have taken thus far for high-level constraints is library functions that generate the lower-level constraints that the realization engine operates on.
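
As a rough sketch of that pattern (not an existing library function), a hypothetical fast_path helper could expand a single high-level request into the per-interface constraints from the sketch at the top of this thread:

# Hypothetical library helper (illustrative only): expands a high-level
# "fast path" request into the low-level interface constraints that the
# realization engine actually operates on.
def fast_path(node, min_queues=8):
    for socket in node.spec.sockets:
        socket.require(
            link.dpdk == True,
            link.queues >= min_queues,
        )

# A user would then write one high-level call per core node:
for machine in p:
    fast_path(machine)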

A link in a Merge topology description is always a layer-2 segment: every member of a link, whether it's a P2P link or a multi-point link, is in the same L2 broadcast domain. How a link is implemented is an implementation detail of the facility; however, in some cases it does matter, and for those cases there do need to be constraints that ensure validity. For example, if the experiment itself uses VLANs, it should state that on the links it uses, to ensure that those links can carry experiment VLAN traffic (or VXLAN, for that matter), since some configurations, such as provisioning links as VLAN sub-interfaces on nodes, would impede this use case. An example of such a constraint could be

net.connect([a, b], link.VLAN == True)

Yes, see https://www.mergetb.org/docs/routes

I'm not sure I fully grok the question. If by virtual->physical topology you mean that users specify the mapping themselves: they do not.

The realization engine figures out how to map the experiment topology onto the testbed topology, based on the connectivity model of the experiment and the testbed, and on the constraints in the experiment relative to the specifications of the resources in the testbed. The user only needs to say whether they explicitly want a node to be physical or virtual; in the absence of any such constraint, the realization engine will find the "lightest weight" embedding that it can.
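
A minimal sketch of that distinction, reusing the metal property from the example at the top of this thread (the 'router' and 'client' node names are just illustrative):

# Explicitly require a bare-metal embedding, as the px() nodes above do.
router = net.node('router', metal == True)

# No placement constraint: the realization engine is free to choose the
# lightest-weight embedding it can find for this node.
client = net.node('client')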

For virtual machines, the default is to not use VLANs at all in the 802.1Q sense of the word. VTEPs are placed directly on the hypervisor, a VLAN-aware bridge connects the VM taps to the VTEPs, and EVPN then takes over to transit VXLAN packets to their destinations.

For physical machines, we use VLAN access ports when the number of physical interfaces on the node is >= the number of logical interfaces on the experiment node occupying that host. On the leaf switch connected to the testbed node, the VLAN is then pushed into a VXLAN segment, so the 4K (3K in practice) limitation is per leaf switch, which is not that limiting. For switches that do not have hardware support for VXLAN encapsulation, the encap moves up a layer, so the 4K limit applies to the group of leaf switches underneath a fabric switch.

Yes, this is precisely why we use EVPN: to front-load the VXLAN forwarding tables.

OK, that sounds right. I think there are two ways in which this is challenging. It is computationally "hard" to implement the realization once we have the low-level constraints, but even deriving those constraints can be conceptually hard if the user only gives high-level constraints, and those constraints capture high-fidelity characteristics such as the presence of coalescing, protocol-specific throughputs, maximum acceptable inter-packet arrival times, etc. But I like the idea of trying to do this.

OK, thanks, I misunderstood the figure. I assumed that the v nodes were mapped onto the p nodes, but I see that this isn't implied by just having a link.

OK, I assume that the following is also true: every non-member of a link is not part of that link's L2 broadcast domain. If true, I suspect this would obviate the need to carry VLAN traffic in most experiments.