Multi-facility experiments (aka the "merge" in Merge)

Just getting the meeting notes down. I’ll flesh this out tomorrow.

Very Basic Roadmap:

  1. Multi-site VTE
    • One portal, two phobos-like facilities
    • Update raven and phobos ansible to support this
  2. Infranet Connectivity
    • wireguard connecting facilities. This should use the existing Wireguard service of the facility.
      The realization will be updated to say how many inter-facility endpoints a facility should make. The facilities will then make the endpoints and give the portal the keys (see the sketch after this list).
    • DNS (likely just give all data to all facilities and let them configure it)
  3. Experiment network connectivity
    • exp net switch? Still working out how to update the realization to include this.
  4. Network Emulation
    • Lots of questions here.
  5. Everything else. These will likely happen between or as part of previous steps.
    • How to model things outside a facility - right now all models are internal to a facility. Who creates the models? How do we characterize variable latency/bandwidth links?
    • How do users model this in the experiment model? Minimum link bandwidth constraints?
    • How to tell users about the nature of inter-facility links? Warn them? Assume they know?
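
To make item 2 above a bit more concrete, here is a rough sketch of the shapes that could be exchanged; nothing here is existing portal/facility API and all names are made up:

```go
// Hypothetical sketch only, not existing portal/facility API.
package sketch

// WGEndpointRequest is what the realization could attach per facility: which
// peer facilities this facility should create WireGuard endpoints for.
type WGEndpointRequest struct {
	Facility       string   // facility being asked to create the endpoints
	PeerFacilities []string // one endpoint per peer facility
}

// WGEndpointInfo is what a facility could hand back to the portal after
// creating an endpoint; the portal then forwards it to the peer facility.
// The private key never leaves the facility.
type WGEndpointInfo struct {
	Facility  string // facility that owns this endpoint
	Peer      string // facility this endpoint is meant to reach
	PublicKey string
	Endpoint  string // routable address:port for the tunnel
}
```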

@glawler said:
So I’m looking at multi-facility experiments. The current code very much assumes one facility for realization. So I’m thinking that rather than shoehorn multiple facilities into realization, we just realize on each facility (splitting the resource pool into discrete per-facility lists for each realization). Then store facility connection information in a new MultiRealization data structure. This would be whatever is needed to connect facilities. During materialization, the facility would get its own realization plus the MultiRealization data if it exists. Does this make sense?

The realization engine would have to be updated to understand that some links may be external to the facility and to not freak out about that.

Hmm. Or just update the realization engine to understand that some links will be external. Data about external links is passed between different realizations so the experiment network can be patched together between the different, err, sub-realizations.
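
A rough sketch of what the MultiRealization idea could look like (made-up types and field names; Realization here is just a stand-in for whatever we store per facility today):

```go
// Hypothetical sketch of the MultiRealization idea, not existing code.
package sketch

// Realization stands in for the existing per-facility realization data.
type Realization struct {
	Facility string
	// ... existing embedding/realization data ...
}

// ExternalLink describes one experiment link that spans two facilities and
// whatever is needed to patch the sub-realizations together.
type ExternalLink struct {
	Link      string // link name from the experiment model
	FacilityA string
	FacilityB string
	// tunnel/tagging details (WireGuard peers, VXLAN VNI, ...) would go here
}

// MultiRealization ties the per-facility realizations together. During
// materialization each facility would get its own Realization plus this.
type MultiRealization struct {
	Parts         map[string]*Realization // keyed by facility
	ExternalLinks []ExternalLink
}
```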

@christra care to chime in?

I think the things to weigh here are basically going to be ease of embedding, ease of programming, convenience in reconfiguration of live connection endpoints, and ease of monitoring (to determine real resource availability).

I think a single realization is better too, because otherwise your cross-facility experiment links are going to be a lot more complicated and you still need to coordinate things like VXLAN/VLAN tags.

Basically, the proper way of doing this is to add new portal endpoint types (like a wireguard endpoint type) and logic for how to path across the network (which for the experiment network might be a little bit more complicated because you’d use wireguard as the underlay and VXLAN on top of it or so, but for infranet you can probably just get away with wireguard as layer 3).

If you do that part correctly, then constructing things on top of it (like for network emulation) should not be that bad. This is probably the most complicated part though, as network path construction is orders of magnitude more complicated than anything else in realization/embedding.

I had code in the way-back times that set up multiple infrapods for an experiment for the hierarchical infranet tests; you’d probably need something like that (to put different infranets on different subnets so you can route them via wireguard).
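
As a sketch of the endpoint-type idea above (all names invented, and this glosses over the real pathing code):

```go
// Hypothetical sketch of a new endpoint kind plus cross-facility path segments.
package sketch

type EndpointKind int

const (
	PhysPort  EndpointKind = iota // existing in-facility endpoint kinds ...
	Vtep
	Wireguard // new: an inter-facility tunnel endpoint
)

type Endpoint struct {
	Kind     EndpointKind
	Facility string
	Name     string
}

// Segment is one hop of a realized path. For a cross-facility xpnet link the
// inter-facility hop would be a WireGuard underlay carrying a VXLAN overlay;
// for infranet, the routed (layer 3) WireGuard hop alone could be enough.
type Segment struct {
	A, B     Endpoint
	Underlay string // e.g. "wireguard" for the inter-facility hop
	Overlay  string // e.g. "vxlan" for xpnet, "" for routed infranet
}
```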

In terms of characterizing network performance, I think cross-facility xp links are opt-in only, aka the user has to explicitly state in the model that cross-facility is okay for this link.

I’m not sure if the network performance between facilities should be modelled so much as it should be measured and updated dynamically.

Not sure about this. There is nothing in the modelling language that defines where something is, and I think that is by design. If a user just says give me some links, then cross-facility links are fine. If they say give me links under 10ms, then cross-facility links may not be fine. If the links are not constrained then anything goes. I realize this makes things much more complicated, but it is in the spirit of the model design.
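
In code terms, the constraint-driven view might look something like this (hypothetical types, and assuming the external path numbers come from measurement as suggested above rather than from the model):

```go
// Hypothetical sketch: decide whether a link may be realized across facilities
// based only on its constraints, not on any notion of location in the model.
package sketch

type LinkConstraints struct {
	MaxLatencyMs float64 // 0 means unconstrained
	MinBandwidth uint64  // bits/s, 0 means unconstrained
}

type ExternalPathEstimate struct {
	LatencyMs float64
	Bandwidth uint64 // measured and updated dynamically rather than modelled
}

// ExternalOK reports whether a cross-facility realization could satisfy the
// link's constraints. Unconstrained links: anything goes, including external.
func ExternalOK(c LinkConstraints, e ExternalPathEstimate) bool {
	if c.MaxLatencyMs > 0 && e.LatencyMs > c.MaxLatencyMs {
		return false
	}
	if c.MinBandwidth > 0 && e.Bandwidth < c.MinBandwidth {
		return false
	}
	return true
}
```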

My question is at the code level. There will still be one “realization” as far as the user is concerned. The thought I had was that we have a single-facility realization anyway, and that code is complicated and fragile, so why not call that code once per facility and then patch the results together with a new chunk of realization code. We’d store each realization as we do currently, plus a new meta-realization storage object which puts them together.

This may not be a good idea, but wanted feedback on it.

Why do we need a wireguard endpoint type? We use wireguard now to connect to the infranet yet it is modeled nowhere. I was thinking that the xpnet would connect to some network namespace that had a wireguard connection. The portal would ask the facilities to create the WG interfaces at mtz time much like it does now.

The hesitation I have about that is that external links are going to have much different properties than internal links and that extends not just to network properties, but usage too.

For example, you should not send DoS traffic over an external link. The experimenter needs to know that.

External links are probably going to be rate limited by default, unlike internal links where by default you just get the line rate. External links are going to be much more susceptible to cross-experiment traffic because the link is likely shared with general internet connectivity, meaning that the external link’s performance is much more likely to degrade (regardless of whatever you specify) if someone downloads something, like if someone is provisioning a bunch of VMs.

I don’t think making this invisible, and having the experimenter only know if they care to find out, is good enough.

At the very, very least, you should probably allow the experimenter to explicitly forbid an external implementation of a link because they want to send DoS traffic or the like over it.
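
A tiny sketch of that opt-out, as a per-link property the experimenter could set (names invented; nothing like this exists in the model today):

```go
// Hypothetical sketch of an explicit per-link external-placement policy.
package sketch

type ExternalPolicy int

const (
	ExternalDefault ExternalPolicy = iota // realization decides
	ExternalForbid                        // never place this link across facilities
	ExternalAllow                         // explicitly opt in
)

// MayUseExternal is the check the realization engine would apply per link.
func MayUseExternal(p ExternalPolicy) bool {
	return p != ExternalForbid
}
```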

The tricky thing here for XP links is: where are you putting these network namespaces?

I do not believe we made it a requirement for the infrapod server to be connected onto xpnet, so it wouldn’t make sense to stitch the networks together on the infrapod server.

Ideally, you’d use wireguard to create an inter-facility underlay network across the gateway switches/nodes and then you’d VXLAN/BGP peer across it, which is a little bit more involved than reusing the wireguard tunnel code for infrapods.

Unfortunately for the VXLAN endpoint/BGP peers approach, though, we’d probably have to start assigning ASNs/IPs so that facilities don’t overlap…

So then, the other way would be to create separate wireguard tunnels for each mtz in their own network namespaces, bridged somehow on the gateway switches. Again, you have to create the path from node 1 to facility 1’s gateway device to facility 2’s gateway device to node 2. So you start thinking of wireguard as a thing for canopy to manage instead of something done via a facility API call.
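
Roughly, each cross-facility xpnet link then becomes a chain of hops that something has to lay down end to end; a sketch with invented names:

```go
// Hypothetical sketch of the per-mtz cross-facility path described above.
package sketch

type Hop struct {
	Device string // node NIC, gateway switch port, per-mtz netns bridge, wg interface ...
	Kind   string // "trunk", "bridge", "wireguard", ...
}

// CrossFacilityPath is ordered: node 1 ... facility 1 gateway ... (WireGuard
// tunnel in its own network namespace, bridged on the gateway) ... facility 2
// gateway ... node 2. Every hop has to be created by someone, whether that is
// canopy or a facility API call.
type CrossFacilityPath struct {
	Mtz  string // materialization this tunnel/netns belongs to
	Hops []Hop
}
```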

This is related to why you’d want to extend the current realization stuff instead of trying to stitch things together yourself. The current code is complicated (because network pathfinding is complicated across a variety of conditions), but it’s a better base than essentially trying to do all of the stitching manually yourself. The current code, however, is not particularly fragile, because of all of the unit tests.

Also, if you realize N times and try to stitch them together, it’s going to be more complicated coordinating infranet IP assignment/VLAN/VXLAN assignment than if you had a holistic view to begin with.

Another thing to keep in mind about segmented realizations is how you decide to segment an experiment model. What happens if you have a node allocation where some of the xp links are internal and some of them are external? How are you going to stitch that together?

What if we did?

This is what I was thinking, minus canopy managing WG. If it did, how would we do key exchange between facilities? I was thinking that during mtz, the facility checks if the mtz has other facilities, and if so it creates WG interfaces for each one and gives the keys to the portal in the normal way. The portal then distributes the keys to the other facilities.
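
Sketched out, the exchange could look like this (hypothetical types; the point is only that private keys stay on the facilities and the portal just relays the public halves):

```go
// Hypothetical sketch of mtz-time WireGuard key distribution via the portal.
package sketch

type FacilityWGInfo struct {
	Facility   string // facility that created the interface (keeps the private key)
	Peer       string // facility this interface is meant to reach
	PublicKey  string
	Endpoint   string
	AllowedIPs []string
}

// Distribute groups the announcements by the peer facility that needs them,
// which is all the portal would have to do with this information.
func Distribute(infos []FacilityWGInfo) map[string][]FacilityWGInfo {
	byPeer := make(map[string][]FacilityWGInfo)
	for _, info := range infos {
		byPeer[info.Peer] = append(byPeer[info.Peer], info)
	}
	return byPeer
}
```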

As a general strategy, is there anything wrong with essentially duplicating the infranet setup in the xpnet? Nodes with external links get a connection to an “xppod” (the infrapod equivalent for xp nets). That pod has WG setups for connecting to other facilities. Then VLAN/VXLAN for connections…?

Or even just make XP LANs over the infra switches? Pretend I know nothing…

Then you can use it as an egress point in the same way that you can use any gateway node/switch as an egress point.

Why do key exchange when you can just generate all of the keying information in the portal for everyone to use?

I didn’t want to expose private keys, as a matter of good security practice. I still think we should not have the portal generate private WG keys.

I think Brian ignored that on the facility side though.

The portal still needs to distribute the pub keys and other WG info.

You need to use the XP ports and those aren’t physically connected to the infranet switches.

The infranet differs because it creates a LAN with the infrapod (usually) as the rendezvous point, and external connections into the infranet are done just with wireguard, which is layer 3 only. Layer 3 only is OK for the infranet across sites if you put them on different subnets (for routing reasons).

Since you want L2 connections for xpnet, you’d want to use VXLAN over the wireguard tunnel so you only have to set up 1 wireguard tunnel and 1 BGP peer.
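
To summarize the two cases side by side (illustrative values only, nothing here reflects real addressing):

```go
// Hypothetical sketch contrasting cross-site infranet vs. xpnet attachment.
package sketch

type CrossSiteAttachment struct {
	Network   string   // "infranet" or "xpnet"
	Tunnel    string   // "wireguard" in both cases
	Overlay   string   // "" for routed infranet, "vxlan" for L2 xpnet
	BGPPeers  int      // one peer across the tunnel in the xpnet case
	SiteCIDRs []string // per-site subnets for the routed infranet case
}

var Examples = []CrossSiteAttachment{
	// Infranet: layer 3 only is fine if each site gets its own subnet.
	{Network: "infranet", Tunnel: "wireguard", SiteCIDRs: []string{"172.30.0.0/24", "172.30.1.0/24"}},
	// Xpnet: needs L2, so VXLAN rides over the same tunnel with one BGP peer.
	{Network: "xpnet", Tunnel: "wireguard", Overlay: "vxlan", BGPPeers: 1},
}
```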

Don’t these support multiple VXLANs per port? Can’t we create one for the xp traffic over the infra switch?

Oh, one complicated thing about a node being on both infranet and xpnet: because we currently have two straight-up separate and distinct AS/BGP networks for infranet and xpnet, it’s difficult for a node to be on both of them, so having the infrapod hold both infranet and xpnet BGP peers would be tough.

It’s the same reason why we currently cannot make a switch be both infranet and xpnet for VXLAN.

No, I mean the XP ports aren’t physically connected to the infranet switches; they are physically isolated. How is traffic going to go between them if there’s no physical link between them?