The basic algorithm for realization is:
- Build routing tables between testbed resources
- Figure out which nodes we want to use
- Build the Infranet/Xpnet/Emu Net
- Interface selection
- Path selection
- Endpoint/waypoint configuration
Routing Tables
Routing tables are constructed via breadth-first search starting from a node’s interfaces.
Because a routing table needs the next hop, we find all of a node’s destinations by running a breadth-first search from each of the node’s distance-1 neighbors at once.
With the routing tables, we’re essentially pre-computing all possible neighbor/destination pairs for a given node.
We store (see the sketch below):
- A neighbor’s distance-1 paths
- All possible destinations reachable from a given interface, along with the minimum number of hops to reach each destination interface from that neighbor
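To make the structure concrete, here is a minimal sketch of building such a table, assuming a plain interface-adjacency graph; the names (Iface, Graph, RoutingTable, buildTable) are illustrative, not the actual types in the realization code.

```go
package main

import "fmt"

// Iface identifies an interface on some resource; Graph is adjacency between
// directly connected interfaces. Illustrative only.
type Iface string

type Graph map[Iface][]Iface

// RoutingTable maps a distance-1 neighbor interface (the next hop) to every
// destination interface reachable through it and the minimum hop count.
type RoutingTable map[Iface]map[Iface]int

// buildTable runs one BFS per distance-1 neighbor of the node's interface,
// pre-computing all neighbor/destination pairs as described above.
func buildTable(g Graph, nodeIface Iface) RoutingTable {
	table := RoutingTable{}
	for _, nbr := range g[nodeIface] {
		dists := map[Iface]int{nbr: 1, nodeIface: 0} // don't route back through ourselves
		queue := []Iface{nbr}
		for len(queue) > 0 {
			cur := queue[0]
			queue = queue[1:]
			for _, next := range g[cur] {
				if _, seen := dists[next]; !seen {
					dists[next] = dists[cur] + 1
					queue = append(queue, next)
				}
			}
		}
		delete(dists, nodeIface)
		table[nbr] = dists
	}
	return table
}

func main() {
	// a - sw1 - sw2 - b, expressed per interface for brevity
	g := Graph{
		"a.0":   {"sw1.0"},
		"sw1.0": {"a.0", "sw1.1"},
		"sw1.1": {"sw1.0", "sw2.0"},
		"sw2.0": {"sw1.1", "sw2.1"},
		"sw2.1": {"sw2.0", "b.0"},
		"b.0":   {"sw2.1"},
	}
	fmt.Println(buildTable(g, "a.0"))
}
```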
We store the destinations as TPAs, which are like a 64-bit analogue of an IPv4 address, where:
- bits 0-16 correspond to the facility id
- bits 16-32 correspond to the network the resource belongs under
  - this starts at one of two constants, one for the infranet and one for the xpnet
  - each switch then takes the current value and increments it
- bits 32-40 correspond to the specific interface of that switch
  - these are copied to the other port that the interface is connected to
- bits 40-64 are currently unused; the code suggests they are meant for guests
What this means is that you can compute the shortest path from one node to another by first routing to the switch the destination sits under (by masking off bits 32-64 of the destination TPA and repeatedly taking the interface that has your masked destination as a route), and then, once you reach that switch, routing directly to the node.
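Here is a minimal sketch of the TPA layout and the mask-then-route check, using the bit ranges listed above; the constant and helper names are assumptions for illustration.

```go
package main

import "fmt"

// TPA is a 64-bit testbed address. Field boundaries follow the bit ranges
// described above; the helper names here are illustrative, not the real code.
type TPA uint64

const (
	facilityShift = 0
	networkShift  = 16
	ifaceShift    = 32
	// Masking off bits 32-63 leaves just the facility + network portion,
	// i.e. "the switch the destination sits under".
	switchMask TPA = (1 << ifaceShift) - 1
)

func makeTPA(facility, network, iface uint64) TPA {
	return TPA(facility<<facilityShift | network<<networkShift | iface<<ifaceShift)
}

// sameSwitch reports whether two TPAs live under the same facility + network,
// which is the test used when deciding whether to keep routing toward the
// switch or to route directly to the destination node.
func sameSwitch(a, b TPA) bool {
	return a&switchMask == b&switchMask
}

func main() {
	a := makeTPA(1, 7, 2) // facility 1, network 7, interface 2
	b := makeTPA(1, 7, 5) // same switch, different interface
	c := makeTPA(1, 9, 1) // different network
	fmt.Println(sameSwitch(a, b), sameSwitch(a, c)) // true false
}
```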
THIS IS IMPORTANT AND ANNOYING:
- Because we store interfaces in the routing table, not nodes, your destination has to be an interface that you’ve pre-selected in order for you to use the routing table.
- In particular, it’s really messy to route from one node to another when you don’t care about which interface you use, or when you cannot pre-determine the interface that you should use.
This, in theory, works, and is an optimized representation of storing every possible route to a node under a spine-leaf topology.
But things get complicated if your resource can belong to multiple networks (like on both xpnet and infranet because it’s a regular testbed node) or if you don’t actually have a spine-leaf topology.
It would probably be cleaner to not do these representations and to just have a table of every resource to every other resource. But, that could be messy depending on how multi-facility realization is implemented.
Constructing these routing tables takes very little time (usually a few milliseconds), so we are lazy and recompute them on every realization.
It would be really nice if we did something more general, but whatever.
Node Embedding
Currently, this is done really greedily.
We sort the hosts in this order (see the sketch below):
- “Least” amount of total resources (not available resources)
  - “Least” is in quotations, as we essentially assign a weight to each of:
    - CPU cores
    - Memory
    - Disk
    - Network
  - Sum the weighted values, and take the host with the smallest total
- In a tie, we go by the name of the node
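A minimal sketch of that ordering; the Host struct and the specific weights are made-up assumptions, only the sort-by-weighted-total-then-name shape reflects the description above.

```go
package main

import (
	"fmt"
	"sort"
)

// Host holds total (not available) resources. The struct and the weights
// below are illustrative assumptions, not the real realization types.
type Host struct {
	Name    string
	Cores   int
	MemGB   int
	DiskGB  int
	NetGbps int
}

// weight collapses a host's total resources into one comparable number.
func weight(h Host) float64 {
	return 4.0*float64(h.Cores) + 1.0*float64(h.MemGB) +
		0.1*float64(h.DiskGB) + 2.0*float64(h.NetGbps)
}

// sortHosts orders hosts by "least" total weighted resources, breaking ties by name.
func sortHosts(hosts []Host) {
	sort.Slice(hosts, func(i, j int) bool {
		wi, wj := weight(hosts[i]), weight(hosts[j])
		if wi != wj {
			return wi < wj
		}
		return hosts[i].Name < hosts[j].Name
	})
}

func main() {
	hosts := []Host{
		{"big", 128, 512, 4000, 100},
		{"small-b", 16, 64, 500, 10},
		{"small-a", 16, 64, 500, 10},
	}
	sortHosts(hosts)
	for _, h := range hosts {
		fmt.Println(h.Name) // small-a, small-b, big
	}
}
```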
Then, for each host in that order, we check:
- Whether the experiment guest fits on that host (based on current usage)
- Node constraints:
  - CPU/memory/disk/network constraints
  - Tag constraints (this is new)
  - Hostname constraints
  - TODO: image constraints, but we would need a list of what images each node can run to begin with
- If the guest fits, we use that host; otherwise, we move on to the next host in the list (see the first-fit sketch below)
- When unspecified, we try to place nodes as VMs first, and then try them as bare metal nodes
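And a first-fit sketch over the sorted hosts; the Guest/Host types and the fits check are simplified stand-ins (only cores and memory are modeled, and tag/hostname constraints are omitted).

```go
package main

import "fmt"

// Guest and Host here are illustrative stand-ins for the real realization types.
type Guest struct {
	Name  string
	Cores int
	MemGB int
}

type Host struct {
	Name                 string
	Cores, MemGB         int // total capacity
	UsedCores, UsedMemGB int // current usage from earlier placements
}

// fits checks capacity against current usage; the real code also checks
// disk, network, tag, and hostname constraints.
func fits(g Guest, h Host) bool {
	return h.UsedCores+g.Cores <= h.Cores && h.UsedMemGB+g.MemGB <= h.MemGB
}

// place walks the pre-sorted host list and greedily takes the first host
// the guest fits on, updating that host's usage.
func place(g Guest, hosts []Host) (string, bool) {
	for i := range hosts {
		if fits(g, hosts[i]) {
			hosts[i].UsedCores += g.Cores
			hosts[i].UsedMemGB += g.MemGB
			return hosts[i].Name, true
		}
	}
	return "", false
}

func main() {
	hosts := []Host{
		{Name: "small", Cores: 16, MemGB: 64},
		{Name: "big", Cores: 128, MemGB: 512},
	}
	for _, g := range []Guest{{"a", 8, 32}, {"b", 12, 16}, {"c", 200, 16}} {
		name, ok := place(g, hosts)
		fmt.Println(g.Name, name, ok)
	}
	// a lands on small, b lands on big (small is out of cores), c fits nowhere
}
```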
Infrapod/Emulation Server:
- Round robin: we use the server involved in the fewest materializations (a small sketch follows)
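A tiny sketch of that selection, assuming we track a per-server materialization count; names are placeholders.

```go
package main

import "fmt"

// pickServer returns the candidate infrapod/emulation server that currently
// appears in the fewest materializations. The count map is an illustrative
// assumption about how the load is tracked.
func pickServer(candidates []string, materializations map[string]int) string {
	best := candidates[0]
	for _, c := range candidates[1:] {
		if materializations[c] < materializations[best] {
			best = c
		}
	}
	return best
}

func main() {
	counts := map[string]int{"ifr0": 3, "ifr1": 1, "ifr2": 2}
	fmt.Println(pickServer([]string{"ifr0", "ifr1", "ifr2"}, counts)) // ifr1
}
```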
Surprisingly, even though node embedding is pretty simple, it’s not very efficient: for a 10k-node experiment, it takes about 100 seconds to run.
Link Realization
This is where things get really complicated and is where the bulk of the code lies.
General Network Construction
It’s useful to understand first how we make a link in general.
In general, it’s done by repeatedly taking the shortest path between two guests and properly configuring the path between them.
- Previously, it was done by repeatedly taking the shortest path between two interfaces on the host resources. That works if you assume that, for a given link, the chosen interface can reach all other nodes, but that isn’t necessarily true, because different interfaces on a host can reach different devices.
- As an example, consider A - VM - B in a line. You need both the (A, VM-A) and (VM-B, B) interface pairs in your network path for it to work, but previously the code would only produce (A-VM, VM-A) and (VM-A, B) (since it could only pick one interface per host per link), which is not usable.
- As a result, the code had to be refactored to support entering the host resource itself for the endpoints, which got complicated because the routing tables are not designed to route from node to node.
“Configuring the path” means making the endpoints and waypoints along the path align with each other so that the guests can communicate through it, for example making sure VTEPs have BGP peers and that VXLAN/VLAN ids match each other.
Now, the code for generating this path configuration assumes that you’ve already decided on the specific endpoints you want to use, for example a VXLAN endpoint feeding a trunked VLAN.
So you have to figure out what endpoint you want first, and note that some endpoints are not possible, based on the facility topology and capabilities (this is mostly about the capabilities of the switches, i.e. whether they support VXLAN or not).
- Previously, endpoint selection was somewhat hardcoded; the code has been refactored so that endpoint selection can take in much more context (like the other nodes in the link) to make better decisions.
Configuration now also includes the “virtual interface/bridge” routing when you have virtual machines on the same machine, especially if you have emulation on it as well.
In general, during the construction of a path, if VXLAN is possible, we use it between switches instead of raw VLAN trunking.
For a 10k-node experiment with around 500 links, this took around 5 seconds to execute on the new version.
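To tie the pieces together, here is a hedged sketch of configuring one guest-to-guest path: walk the waypoints of the shortest path and prefer VXLAN between VXLAN-capable switches, falling back to VLAN trunking otherwise. The Device type and the capability flag are assumptions for illustration.

```go
package main

import "fmt"

// Encap is the endpoint type chosen for one segment of the path.
type Encap string

const (
	VXLAN Encap = "vxlan"
	VLAN  Encap = "vlan"
)

// Device is a waypoint on a computed shortest path. The fields are
// illustrative; the real code carries much richer per-device state.
type Device struct {
	Name          string
	IsSwitch      bool
	SupportsVXLAN bool
}

// chooseEncap prefers VXLAN between two VXLAN-capable switches and falls
// back to raw VLAN trunking otherwise, mirroring the rule described above.
func chooseEncap(a, b Device) Encap {
	if a.IsSwitch && b.IsSwitch && a.SupportsVXLAN && b.SupportsVXLAN {
		return VXLAN
	}
	return VLAN
}

// realizePath walks consecutive waypoints of one guest-to-guest path and
// records the encapsulation used on each hop.
func realizePath(path []Device) []string {
	var segments []string
	for i := 0; i+1 < len(path); i++ {
		e := chooseEncap(path[i], path[i+1])
		segments = append(segments, fmt.Sprintf("%s-%s:%s", path[i].Name, path[i+1].Name, e))
	}
	return segments
}

func main() {
	path := []Device{
		{Name: "hostA"},
		{Name: "leaf1", IsSwitch: true, SupportsVXLAN: true},
		{Name: "spine", IsSwitch: true, SupportsVXLAN: true},
		{Name: "leaf2", IsSwitch: true, SupportsVXLAN: false},
		{Name: "hostB"},
	}
	fmt.Println(realizePath(path))
	// [hostA-leaf1:vlan leaf1-spine:vxlan spine-leaf2:vlan leaf2-hostB:vlan]
}
```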
Infranet
The infranet is constructed as a flat network across all experimental nodes.
So, we take the infrapod server and add a link from every experimental node to the infrapod server, all with the same VXLAN/VLAN id.
Previously, VXLAN was the only possible endpoint type on the infrapod server; now VLAN endpoints can be used too, if VXLAN is not supported across the topology or if it’s a single node.
Hosts get untagged traffic for the harbor network and tagged traffic for the infranet. When we use sled, we pass the VLAN id and interface as kernel arguments to set up the link.
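A tiny sketch of that flat-network assembly, with illustrative names and a single shared id:

```go
package main

import "fmt"

// InfranetLink describes one experimental node's attachment to the infrapod
// server on the shared infranet segment. Field names are illustrative.
type InfranetLink struct {
	Node         string
	InfrapodHost string
	ID           int // the same VXLAN/VLAN id is reused for every node on the infranet
}

// buildInfranet adds one link per experimental node to the infrapod server,
// all carrying the same VXLAN/VLAN id, i.e. a flat network.
func buildInfranet(infrapodHost string, nodes []string, id int) []InfranetLink {
	links := make([]InfranetLink, 0, len(nodes))
	for _, n := range nodes {
		links = append(links, InfranetLink{Node: n, InfrapodHost: infrapodHost, ID: id})
	}
	return links
}

func main() {
	fmt.Println(buildInfranet("ifr0", []string{"n0", "n1", "n2"}, 101))
}
```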
Xpnet
Xpnets are constructed in two ways: for regular links and for emulated links.
In general, bare metal guests get untagged traffic while their number of links is less than or equal to their number of physical links; after that, links become tagged. VMs always get untagged traffic; the hypervisor deals with tagging for them.
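A minimal sketch of that tagging rule:

```go
package main

import "fmt"

// tagged reports whether a guest's next link should use tagged (VLAN) traffic.
// Bare metal guests stay untagged until they run out of physical links; VMs
// are always untagged because the hypervisor handles tagging for them.
// linkIndex is 0-based over the links already assigned to this guest.
func tagged(isVM bool, linkIndex, physicalLinks int) bool {
	if isVM {
		return false
	}
	return linkIndex >= physicalLinks
}

func main() {
	// bare metal guest with 2 physical links: links 0 and 1 untagged, link 2 tagged
	for i := 0; i < 3; i++ {
		fmt.Println(i, tagged(false, i, 2))
	}
	fmt.Println("vm", tagged(true, 5, 1)) // always untagged
}
```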
Regular Links
Every pair of guests in the link is connected. VTEPs are placed on hypervisors when possible. (Previously, it was always VTEPs.)
Emulated Links
Every guest in the link is connected to the chosen emulation server; the emulation server bridges them together and applies emulation rules to them. The VTEP is on the emulation server when possible.
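A small sketch contrasting the two shapes: regular links connect every pair of guests directly, emulated links connect every guest to the emulation server. The pair type and names are illustrative.

```go
package main

import "fmt"

// pair is an ordered endpoint pair that the path-construction code would
// then route between and configure. Purely illustrative.
type pair struct{ A, B string }

// regularLinkPairs connects every pair of guests in the link directly.
func regularLinkPairs(guests []string) []pair {
	var ps []pair
	for i := 0; i < len(guests); i++ {
		for j := i + 1; j < len(guests); j++ {
			ps = append(ps, pair{guests[i], guests[j]})
		}
	}
	return ps
}

// emulatedLinkPairs connects every guest to the chosen emulation server,
// which bridges them together and applies the emulation rules.
func emulatedLinkPairs(guests []string, emuServer string) []pair {
	var ps []pair
	for _, g := range guests {
		ps = append(ps, pair{g, emuServer})
	}
	return ps
}

func main() {
	guests := []string{"a", "b", "c"}
	fmt.Println(regularLinkPairs(guests))         // [{a b} {a c} {b c}]
	fmt.Println(emulatedLinkPairs(guests, "emu")) // [{a emu} {b emu} {c emu}]
}
```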
Single Node Notes
When you have a link that spans only a single node, VM traffic doesn’t leave the host and no endpoint is made for it. In the future, that could change with VRFIO.
What was implemented was single node traffic with infrapods/emulation being run by the hypervisor. In those cases, a VETH device is created, with one half attached to the infrapod/exposed to moa, and the other half connected to the VM tap bridge.
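A hedged sketch of what that veth plumbing could look like using iproute2 commands driven from Go; the device, bridge, and namespace names are placeholders, not the ones the real code uses.

```go
package main

import (
	"fmt"
	"os/exec"
)

// run is a thin helper so the sequence of iproute2 commands reads clearly.
func run(args ...string) error {
	out, err := exec.Command("ip", args...).CombinedOutput()
	if err != nil {
		return fmt.Errorf("ip %v: %v: %s", args, err, out)
	}
	return nil
}

func main() {
	// Placeholder names: a veth pair whose "outer" half goes toward the
	// infrapod (here, a network namespace) and whose "inner" half joins the
	// bridge that the VM tap devices are attached to.
	cmds := [][]string{
		{"link", "add", "veth-ifr", "type", "veth", "peer", "name", "veth-vm"},
		{"link", "set", "veth-ifr", "netns", "infrapod-ns"}, // expose to the infrapod side
		{"link", "set", "veth-vm", "master", "vmbr0"},       // attach to the VM tap bridge
		{"link", "set", "veth-vm", "up"},
	}
	for _, c := range cmds {
		if err := run(c...); err != nil {
			fmt.Println("skipping (needs root / real devices):", err)
			return
		}
	}
}
```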
Actually Assigning Usable VLAN/VXLAN Ids
During embedding, placeholder ids are used; actual ids are only chosen when the result is written out as a realization.
This is mostly relevant for the embedding testing code, as the translation between placeholder ids and actual ids is not tested there.
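A minimal sketch of the placeholder-to-real translation at write-out time; the sequential allocator is an assumption made purely for illustration.

```go
package main

import "fmt"

// translateIDs rewrites placeholder VLAN/VXLAN ids to real ones at write-out
// time. The sequential allocator below is an assumption made for illustration;
// it only guarantees that equal placeholders map to equal real ids.
func translateIDs(placeholders []int, firstReal int) map[int]int {
	mapping := map[int]int{}
	next := firstReal
	for _, p := range placeholders {
		if _, ok := mapping[p]; !ok {
			mapping[p] = next
			next++
		}
	}
	return mapping
}

func main() {
	// three links sharing two placeholder ids
	fmt.Println(translateIDs([]int{-1, -2, -1}, 100)) // map[-2:101 -1:100]
}
```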