Degrees of Merge Unmanageability

Overview of the problem

For the merge appliance work, there’s some need to integrate specialized machines or black boxes (like SBCs). This has some obvious overlap with the IoT work for SPHERE.

I’ve had some discussion with Mike, and we’ve come up with a spectrum of how difficult it is for a Merge facility to manage a hardware device for use in experiments.

There are a few levels of manageability we’ve come up with. If a device is at one level, it can typically also operate at the levels beneath it.

4. Can we install an operating system?

Currently, for bare metal, we use Sled to manage the operating system.
(For virtual machines, it’s inherently part of the process.)

Without it, we would need to figure out a method or process for cleaning an installation (user accounts, packages, and so on) without completely reimaging it.

This is the current level of control that Merge assumes for an experimenter node (specifically because of the image constraint).

3. Can we run Foundry or an equivalent?

Foundry brings user accounts, network configuration, and the other material that I would describe as “get to a device from an XDC”. Whether it’s done through one of the implementations of Foundry (there is currently an essentially separate Foundry implementation for tinycore) or something completely different (like Ansible) is an implementation detail.

If we can SSH into the device, it’s possible to have an implementation of Foundry at the facility level that automatically SSHes into the machine at materialization time to run scripts.

2. Does it DHCP?

If it DHCP’s, then we can control its IP address at least.

1. Can we get a shell or something similar?

The shell could be a serial console, or in the case of IoT testbed, an API to interact with the IoT device.

0. Complete black box

The only thing that Merge knows about it is that it exists on the experiment network somehow and can theoretically be accessed by nodes if everything is configured properly beforehand (for example, it has a static IP address and you can only send and monitor network traffic to/from it).

Currently, Merge implements experimenter nodes (whether bare metal or VMs) at level 4.
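To make the ordering concrete, here is a minimal Python sketch of the levels as an ordered enum. Nothing here is existing Merge or XIR code: the `Manageability` type, its member names, and `can_operate_at` are all made up for illustration, and it assumes the strict "each level includes the ones beneath it" reading, which is only typically true.

```python
from enum import IntEnum


class Manageability(IntEnum):
    """Hypothetical encoding of the levels above (names are illustrative)."""
    BLACK_BOX = 0   # only known to exist on the experiment network
    SHELL = 1       # serial console or a device API
    DHCP = 2        # the facility can control its IP address
    FOUNDRY = 3     # Foundry or an equivalent can configure it
    OS_INSTALL = 4  # the facility can install an operating system


def can_operate_at(device_level: Manageability, required: Manageability) -> bool:
    """A device at one level can typically operate at the levels beneath it."""
    return device_level >= required


# Current experimenter nodes (bare metal or VMs) sit at level 4.
experimenter_node = Manageability.OS_INSTALL

print(can_operate_at(experimenter_node, Manageability.FOUNDRY))        # True
print(can_operate_at(Manageability.BLACK_BOX, Manageability.FOUNDRY))  # False
```

An integer ordering like this is the simplest thing that could work, but it bakes in the assumption that the levels are strictly nested.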

Examples

IoT testbed devices probably operate around level 2.
Arduinos: fail at level 2, as they often lack a network interface. Even with a network interface, level 3 probably cannot be fully completed.
Raspberry Pi: With an ARM Sled uroot implementation, probably level 4.

The implication of this is that there’s a desire to model different types of nodes within XIR, on both the facility side and the experimentation side.

For the Merge appliance work, we’re currently aiming to implement something nice at level 0, which definitely overlaps with the IoT testbed.

Questions

I’m hoping that I’m not missing anything that a facility would want to manage.
I’m also wondering, from an experimenter standpoint, what sort of promises we should be making to begin with.

For a Merge Appliance, it’s generally assumed that the experimenters and the facility operators are close parties – which mostly means that you can get away with a Merge black box implementation as you can actually make the facility operator/experimenter responsible for the things that Merge would otherwise do. But I’m not sure if that works well when the testbed has a larger audience.

For the very far off multisite materializations, it’s possible that you could have a user plop down a simple facility, consisting of a server and whatever BYOD they would want to plug into an experiment, and maybe modelling their BYOD as a Merge black box, to make the server requirements pretty simple (probably just a Merge provided simple service to handle simple MTZ requests, which would probably just be like canopy and an infrapod).

The NEU IoT testbed will all be at level 0. These are commercial IoT devices. They are going to completely set up the network, putting things on static VLANs. No dynamic networking. Not sure yet how they are going to do WireGuard connections. They will be a “merge black box” implementation more or less.

A fun complication for their testbed is that the IoT devices’ command and control can interfere with each other. They use speakers to “talk” to devices. The devices are not isolated from each other, so audio commands for one device can and will be picked up by others. So devices will be realized in groups: ask for one speaker and you get all the devices in the cabinet that are with that speaker. There is nothing in XIR to take that into account, so we’ll have to add it. Some sort of “allocation group”. This would be something like level -1? Black Box group.
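A rough Python sketch of what an “allocation group” could mean at realization time: asking for one device pulls in everything that shares its group. This is purely illustrative. The device names, the cabinet names, and the `expand_to_groups` helper are made up, and nothing like this exists in XIR today.

```python
from collections import defaultdict


def expand_to_groups(requested, group_of):
    """Return every device that must be allocated to satisfy `requested`.

    `group_of` maps a device name to its (hypothetical) allocation group.
    Devices without an entry are treated as their own single-member group.
    """
    # Invert the mapping: group name -> all devices in that group.
    members = defaultdict(set)
    for device, group in group_of.items():
        members[group].add(device)

    allocated = set()
    for device in requested:
        group = group_of.get(device)
        if group is None:
            allocated.add(device)
        else:
            allocated |= members[group]
    return allocated


# Made-up inventory: devices that share a cabinet hear the same speaker.
group_of = {
    "speaker-a": "cabinet-1",
    "smart-plug-a": "cabinet-1",
    "camera-a": "cabinet-1",
    "speaker-b": "cabinet-2",
    "thermostat-b": "cabinet-2",
}

print(sorted(expand_to_groups({"speaker-a"}, group_of)))
# ['camera-a', 'smart-plug-a', 'speaker-a']
```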

I would maybe put level 1 at 2 and 2 at 1. Getting a shell is more than just DHCPing.

spitballing… my initial intuition of this is that these concepts should be layered capabilities that we can choose “best control fit” from. our default standpoint is “you give us hardware with model, model determines what can happen”, and it seems like this would be a natural extension of this.

“can we kexec an OS”, “can we install an OS” and “can be a hypervisor” should be separate features. “CANNOT use sled/foundry” can be a separate feature, since we could theoretically be incapable of post-boot config without more tooling, but possibly able to image something (as a bad example, a cumulus switch). perhaps the defaults for fully provisionable machines can be a set of features and we can turn them off for less manageable things

this would mean that ability to dhcp, ability to accept “serial” or terminal commands, availability of ipmitool/redfish, and ability to accept api commands are all features that we could support, and we would just have to internally create priority lists to control things based on either operator modeling, suggestions, or a default priority list (i.e. sled/foundry takes precedence over serial control), and, i think, create what amounts to a plugin/drop-in system where the reference implementation supports everything we have encountered and there’s a place to make a spec for anything we have not.
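a quick Python sketch of the “features plus priority list” idea, just to show the shape. all of the feature names, the default set, and `best_control_fit` are hypothetical. the only ordering taken from the above is that sled/foundry takes precedence over serial; where api control lands in the list is an assumption.

```python
# Hypothetical feature names; a real model would live in XIR.
FULLY_PROVISIONABLE = frozenset({
    "install_os", "kexec_os", "hypervisor",
    "sled_foundry", "dhcp", "serial", "bmc",
})

# Default priority for post-boot control. sled/foundry ahead of serial
# comes from the discussion; placing "api" last is an assumption.
DEFAULT_PRIORITY = ("sled_foundry", "serial", "api")


def best_control_fit(features, priority=DEFAULT_PRIORITY):
    """Return the highest-priority control method the device supports."""
    for method in priority:
        if method in features:
            return method
    return None  # nothing usable: effectively a black box


# Start from the defaults and turn features off for less manageable things.
no_foundry = FULLY_PROVISIONABLE - {"sled_foundry"}
api_only = frozenset({"api"})
black_box = frozenset()

print(best_control_fit(FULLY_PROVISIONABLE))  # sled_foundry
print(best_control_fit(no_foundry))           # serial
print(best_control_fit(api_only))             # api
print(best_control_fit(black_box))            # None
```

an operator-supplied priority list would just be passed as the second argument, and the plugin/drop-in part would be whatever maps each method name to an implementation.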

i could be totally wrong about this, just let me know why :slight_smile: