Degrees of Merge Unmanageability

Overview of the problem

For the Merge appliance work, there’s a need to integrate specialized machines or black boxes (like SBCs). This has obvious overlap with the IoT work for SPHERE.

I’ve had some discussion with Mike, and we’ve come up with a spectrum of how difficult it is for a Merge facility to manage a hardware device for use in experiments.

There are a few levels of manageability we’ve come up with. If a device is at one level, it can typically also operate at the levels beneath it.

4. Can we install an operating system?

Currently, for bare metal, we use Sled to manage the operating system.
(For virtual machines, it’s inherently part of the process.)

Without it, we would need to figure out a method or process for cleaning up an installation (user accounts, packages, and so on) without completely reimaging it.

This is the current level of control that Merge assumes for an experimenter node (specifically because of the image constraint).

3. Can we run Foundry or an equivalent?

Foundry brings user accounts, network configuration, and the other material that I would describe as “get to a device from an XDC”. Whether it’s done through one of the implementations of Foundry (there is currently an essentially separate Foundry implementation for TinyCore) or something completely different (like Ansible) is an implementation detail.

If we can SSH into the device, it’s possible to have an implementation of Foundry at the facility level that automatically SSHes into the machine at materialization time to run scripts.
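
To make that concrete, here is a minimal sketch (not actual Foundry code; the address, credentials, and commands are placeholders I made up) of a facility-side step that SSHes into a device at materialization time and runs configuration commands:

```go
// Sketch of a facility-side "foundry-like" step that SSHes into a device at
// materialization time and runs setup commands. Not actual Foundry code; the
// address, credentials, and commands below are placeholders.
package main

import (
	"log"

	"golang.org/x/crypto/ssh"
)

func configureDevice(addr string, cmds []string) error {
	cfg := &ssh.ClientConfig{
		User: "ops",                                      // hypothetical device account
		Auth: []ssh.AuthMethod{ssh.Password("changeme")}, // or keys, depending on the device
		// A real implementation would pin the host key instead of ignoring it.
		HostKeyCallback: ssh.InsecureIgnoreHostKey(),
	}

	client, err := ssh.Dial("tcp", addr, cfg)
	if err != nil {
		return err
	}
	defer client.Close()

	// Run each materialization-time command in its own session, e.g. adding
	// experimenter accounts or writing network configuration.
	for _, cmd := range cmds {
		sess, err := client.NewSession()
		if err != nil {
			return err
		}
		if err := sess.Run(cmd); err != nil {
			sess.Close()
			return err
		}
		sess.Close()
	}
	return nil
}

func main() {
	err := configureDevice("10.0.0.42:22", []string{
		"useradd -m experimenter",
		"hostnamectl set-hostname exp-node-1",
	})
	if err != nil {
		log.Fatal(err)
	}
}
```

A real facility-level implementation would pull the device address and credentials from the facility model rather than hard-coding them.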

2. Does it DHCP?

If it DHCPs, then we can at least control its IP address.
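
As a tiny illustration of what “control its IP address” amounts to (assuming a dnsmasq-style DHCP server on the infrastructure side; the MAC, hostname, and address are made up), the facility mostly just hands the DHCP server a static reservation for the device:

```go
// Sketch: emit a dnsmasq-style static DHCP reservation so a device that
// DHCPs always lands on the address the facility chose for it.
// The MAC address, hostname, and IP below are placeholders.
package main

import "fmt"

func reservation(mac, name, ip string) string {
	return fmt.Sprintf("dhcp-host=%s,%s,%s", mac, name, ip)
}

func main() {
	// e.g. dhcp-host=aa:bb:cc:dd:ee:ff,iot-cam-07,10.99.0.17
	fmt.Println(reservation("aa:bb:cc:dd:ee:ff", "iot-cam-07", "10.99.0.17"))
}
```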

1. Can we get a shell or something similar?

The shell could be a serial console or, in the case of the IoT testbed, an API to interact with the IoT device.

0. Complete black box

The only thing Merge knows about it is that it exists on the experiment network somehow and can theoretically be accessed by nodes if everything beforehand is configured properly (for example, it has a static IP address and you can only send and monitor network traffic to/from it).

Currently, Merge implements experimenter nodes (whether bare metal or VMs) at level 4.

Examples

IoT testbed devices probably operate around level 2.
Arduinos: fail at level 2, as they often lack a network interface; even with a network interface, level 3 probably cannot be fully completed.
Raspberry Pi: with an ARM Sled u-root implementation, probably level 4.

The implication of this is that there’s a desire to be able to model different types of nodes within XIR, both on the facility side and the experimentation side.
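
As a sketch of what that modeling might look like (this is not the real XIR schema, just a made-up illustration of recording a manageability level per node):

```go
// Hypothetical sketch (not the real XIR schema) of attaching manageability
// information to a node model, so both the facility and experiment sides can
// reason about what Merge is allowed to assume about a device.
package model

// Manageability describes what a facility can do with a device.
type Manageability int

const (
	BlackBox  Manageability = iota // level 0: only known to exist on the network
	Shell                          // level 1: serial console, API, or similar
	DHCP                           // level 2: we can control its IP address
	Foundry                        // level 3: accounts, network config, etc.
	OSInstall                      // level 4: we can image it (e.g. via Sled)
)

// Node is a stand-in for a facility-side device entry.
type Node struct {
	Name          string
	Manageability Manageability
}
```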

For the Merge appliance work, we’re currently aiming to implement something nice at level 0, which definitely has overlap with the IoT testbed.

Questions

I’m hoping that I’m not missing anything that a facility would want to manage.
I’m also wondering, from an experimenter’s standpoint, what sort of promises we should be making to begin with.

For a Merge Appliance, it’s generally assumed that the experimenters and the facility operators are close parties – which mostly means that you can get away with a Merge black box implementation as you can actually make the facility operator/experimenter responsible for the things that Merge would otherwise do. But I’m not sure if that works well when the testbed has a larger audience.

For the very far-off multisite materializations, it’s possible that a user could plop down a simple facility, consisting of a server and whatever BYOD they want to plug into an experiment, and maybe model their BYOD as a Merge black box. That would keep the server requirements pretty simple (probably just a Merge-provided service to handle simple MTZ requests, which would likely be little more than canopy and an infrapod).

The NEU IoT testbed will all be at level 0. These are commercial IoT devices. They are going to completely set up the network, putting things on static VLANs. No dynamic networking. Not sure yet how they are going to do WireGuard connections. They will be a “Merge black box” implementation, more or less.

A fun complication for their testbed is that the IoT devices’ command and control can interfere with each other. They use speakers to “talk” to devices. The devices are not isolated from each other - so audio commands for one device can and will be picked up by others. So devices will be realized in groups - ask for one speaker and you get all the devices in the cabinet that go with that speaker. There is nothing in XIR to take that into account, so we’ll have to add it. Some sort of “allocation group”. This would be something like level -1? Black box group.
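
Roughly, an allocation group could look something like this (entirely hypothetical; nothing like it exists in XIR today, and the names are made up) - asking for any member gets you the whole group:

```go
// Hypothetical sketch of an "allocation group": devices that share a command
// and control channel (here, a speaker in a cabinet) and therefore must be
// realized together. Nothing like this exists in XIR today; names are made up.
package model

// AllocationGroup ties a set of devices to a shared, non-isolatable resource.
type AllocationGroup struct {
	Name    string   // e.g. "cabinet-3-speaker"
	Devices []string // device names that are always allocated as a unit
}

// Expand returns every device that must come along when any member of the
// group is requested.
func (g AllocationGroup) Expand(requested string) []string {
	for _, d := range g.Devices {
		if d == requested {
			return g.Devices
		}
	}
	return []string{requested}
}
```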

I would maybe swap levels 1 and 2. Getting a shell is more than just DHCPing.

spitballing… my initial intuition of this is that these concepts should be layered capabilities that we can choose “best control fit” from. our default standpoint is “you give us hardware with model, model determines what can happen”, and it seems like this would be a natural extension of this.

“can we kexec an OS”, “can we install an OS” and “can be a hypervisor” should be separate features. “CANNOT use sled/foundry” can be a separate feature, since we could theoretically be incapable of post-boot config without more tooling, but possibly able to image something (as a bad example, a cumulus switch). perhaps the defaults for fully provisionable machines can be a set of features and we can turn them off for less manageable things

this would mean that ability to dhcp, ability to accept “serial” or terminal commands, availability of ipmitool/redfish, and ability to accept api commands are all features that we could support. we would just have to internally create priority lists to control things based on either operator modeling, suggestions, or a default priority list (e.g. sled/foundry takes precedence over serial control). i think we would also create what amounts to a plugin/drop-in system where the reference implementation supports everything we have encountered and there’s a place to write a spec for anything we have not.
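
roughly what i’m picturing (a made-up sketch, not existing merge code; the feature names and ordering are only illustrative):

```go
// Sketch of the "feature set + priority list" idea: each device advertises the
// control features it supports, and the facility picks the best control fit
// from a default (or operator-supplied) priority order. Names are made up.
package control

type Feature string

const (
	SledFoundry Feature = "sled/foundry"
	Redfish     Feature = "ipmi/redfish"
	SSH         Feature = "ssh"
	Serial      Feature = "serial"
	APIOnly     Feature = "device-api"
	DHCPOnly    Feature = "dhcp"
)

// defaultPriority is the fallback ordering, e.g. sled/foundry takes
// precedence over serial control. Operators could override this per device.
var defaultPriority = []Feature{SledFoundry, Redfish, SSH, Serial, APIOnly, DHCPOnly}

// BestControlFit returns the highest-priority feature a device supports,
// or false if it is effectively a black box.
func BestControlFit(supported map[Feature]bool, priority []Feature) (Feature, bool) {
	if priority == nil {
		priority = defaultPriority
	}
	for _, f := range priority {
		if supported[f] {
			return f, true
		}
	}
	return "", false
}
```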

i could be totally wrong about this, just let me know why :slight_smile:

Interesting discussion. I think we should think a bit about level 0. For me, level 0 means “there is a device powered on and configured, and it has an IP and an OS. I just want to run some software on the device to make the 0.0.0.0/0 route point to an experiment in SPHERE, and I want to be able to reach out to that device from SPHERE (adjusting the experiment’s routing tables).” I would consider this the bare minimum. The device could have a default username/pass. We could run a custom program (SSH? custom server) to allow access to the device from SPHERE.
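
As a sketch of that bare minimum (hypothetical; the gateway address is made up, and a real setup would also need matching routes on the experiment side), the “custom program” could be as small as something that repoints the device’s default route:

```go
// Sketch of a minimal level-0 helper run on the device itself: point the
// default route at an experiment gateway in SPHERE. The gateway address is a
// placeholder.
package main

import (
	"log"
	"os/exec"
)

func main() {
	gw := "10.64.0.1" // hypothetical experiment-side gateway
	// Equivalent to: ip route replace default via 10.64.0.1
	out, err := exec.Command("ip", "route", "replace", "default", "via", gw).CombinedOutput()
	if err != nil {
		log.Fatalf("failed to set default route: %v: %s", err, out)
	}
	log.Printf("default route now points at %s", gw)
}
```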

level 0 is: it has an ip address (somehow). merge ecosystem will have no idea how to control or what to do with it other than connect the ip into an experiment.

i think you are more up at level 1 or 2 when you are talking about “running software on a device”.

i think “levels” are not the right way of thinking about this, but capabilities or features is, mostly because the “levels” get mushy when there are black boxes, white boxes, and fully configurable machines that actually support interfacing in “any of the above” ways. not only that, but a “feature set” rather than levels also allows us to fall back on features that exist without any concept of hierarchy.

there is a chance we could utilize something like the roles system to shoehorn this in with existing code, but unsure whether it’s a good or bad idea to overload that if it was not intended for that kind of use.

I do agree that levels 1 and 2 are not quite “levels” but rather features.
Roles in theory could be used to manage this, but I kinda don’t like how roles are currently implemented, since they expose a specific facility’s implementation to everyone in the same API (does NEU IoT need to know what a rally server is? do we need to know what their “terminal access server” is?). That makes it hard for roles to grow organically while staying scoped to the needs of specific testbeds, but that’s something I’ve been thinking about.

I think this is a valid concern: does a portal “merging” facilities need specific details on the facility? maybe only at realization/materialization time, since realizations are embedded by the portal? does a facility need to know what a rally server is? i don’t really know, but if this is a xir role, you could also say something like ‘why does a facility with only sata drives need to know what an nvme drive is?’ it’s not like we are obfuscating anything when it comes to merge code - it’s all open source. so, where would it matter that the roles “exist” vs “are used”?

in any case, i am not really set on using the roles system unless it seems like it makes sense to use it. if it’s not something that can be shoehorned into roles, then we need another way to specify features, and a way to model it so nobody ever has to manually configure things (unless an operator does that purposefully) once the model is set up.