MergeTB 1.0
This RFC will define all the capabilities that will be in the MergeTB 1.0 release. For MergeTB 1.0 each capability must have end-to-end testing that asserts its functionality. These tests will be hashed out in a separate RFC.
Release Date
1 July 2020
Release Format
MergeTB follows the release channel format of Debian. We will maintain stable, testing and unstable release channels. As we have not yet reached a 1.0 release, there is no stable channel yet. Deployed systems and the collection of latest git semver tags are implicitly the testing release, and developer branches implicitly comprise unstable.
When the initial release is cut, all semver tags across all Merge code repositories will go to v1.0, and packages will be pushed to the stable repo for the first time. At that time the testing and unstable repos will stop accepting ad-hoc CI pushes (the current practice) and will henceforth be governed in accordance with the migration rules, testing and policies currently being defined in the ecosystem RFC.
Moving forward, follow-on major releases in stable and testing will follow a capability driven model (as opposed to a calendar driven model). When a sufficiently motivating set of new capabilities has reached the testing repo, testing will enter a freeze period and acceptance testing for migration of an entire snapshot of the testing channel will begin. Minor releases and bug fixes may migrate through the release channels outside of release windows.
Portal Capabilities
The portal capabilities are the set of functionalities directly accessible to users. They are implemented by a Merge portal system that presides over a collection of testbed facilities. The documentation that follows
- defines each capability
- provides status of each component that fully or partially implements that capability
Experiment Modeling
Experiment modeling is broadly broken down into model expression, validation and reticulation. Each is addressed in more detail below.
Expression
Experiment modeling is the capability of a user to programmatically express an experiment in terms of
- Network topology structure
  - nodes
  - links
  - endpoints that belong to nodes and bind to links
  - point-to-point and multipoint links
- Node configuration
  - OS image
  - Interface IP addresses
- Node characteristics as constraints
  - Number of CPU cores
  - Memory capacity
  - GPU model presence
  - Interface bandwidth capacity
- Link configuration
  - Maximum capacity
  - Latency
  - Loss rate
- Global experiment properties
  - Automatic IP address assignment
  - Automatic routing calculation
Merge v1.0 will ship with the Python 3 based MX library for experiment expression. Merge will track Debian Bullseye to determine the exact Python 3.x release that is supported.
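For orientation, below is a minimal sketch of what an MX model of a two-node experiment might look like. The names used here (`Network`, `node`, `connect`, `proc.cores`, `memory.capacity`, `capacity`, `latency`, `loss`, `addressing`, `routing`, `experiment`) are illustrative assumptions and may not match the shipped MX API exactly; the MX documentation is authoritative.

```python
# Illustrative sketch only -- names below are assumptions, not necessarily the
# exact MX API that ships with Merge v1.0.
from mergexp import *

# Topology with automatic IP addressing and automatic (static) routing.
net = Network('two-node-demo', addressing == ipv4, routing == static)

# Two nodes with constraints on cores, memory capacity, and OS image.
a = net.node('a', proc.cores >= 2, memory.capacity >= gb(4), image == 'debian:11')
b = net.node('b', proc.cores >= 2, memory.capacity >= gb(4), image == 'debian:11')

# A point-to-point link with capacity, latency, and loss characteristics.
net.connect([a, b], capacity == mbps(100), latency == ms(10), loss == 0.01)

# Hand the finished model to MX for compilation to XIR.
experiment(net)
```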
Validation
Modeling also includes the capability for the experimenter to validate an experiment model through
- Compilation
  - Is the expression of the model syntactically and semantically sound?
  - The result of compilation is an eXperiment Intermediate Representation (XIR) file.
- Static analysis
  - Is the experiment fully connected? (Islanded experiments are not supported in v1.0.)
  - Are IP addresses well formed?
- Visualization
  - Basic force-directed graphs of experiments
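The static analyses amount to simple graph and address validation over the compiled XIR. The sketch below illustrates the two v1.0 checks against the illustrative XIR shape shown in the next section; it is not the mcc implementation, and the field names (`nodes`, `endpoints`, `link`, `ip`) are assumptions.

```python
import ipaddress
from collections import defaultdict, deque

def is_fully_connected(xir: dict) -> bool:
    """Treat each link as connecting every node with an endpoint on it and
    check that the resulting graph is a single connected component."""
    on_link = defaultdict(set)                # link id -> node ids on that link
    for node in xir['nodes']:
        for ep in node['endpoints']:
            on_link[ep['link']].add(node['id'])

    adjacency = defaultdict(set)              # node id -> neighboring node ids
    for members in on_link.values():
        for nid in members:
            adjacency[nid] |= members - {nid}

    all_ids = {n['id'] for n in xir['nodes']}
    if not all_ids:
        return True
    seen, frontier = set(), deque([next(iter(all_ids))])
    while frontier:
        nid = frontier.popleft()
        if nid in seen:
            continue
        seen.add(nid)
        frontier.extend(adjacency[nid] - seen)
    return seen == all_ids

def ips_well_formed(xir: dict) -> bool:
    """Every assigned interface address must parse as an IP interface."""
    try:
        for node in xir['nodes']:
            for ep in node['endpoints']:
                if 'ip' in ep:
                    ipaddress.ip_interface(ep['ip'])   # e.g. '10.0.0.1/24'
    except ValueError:
        return False
    return True
```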
XIR (eXperiment Intermediate Representation)
XIR is JSON with an additional set of semantics built in
- The structural model of a network is represented through nodes with endpoints that reference links
- Constraints are represented in a general format with a concrete syntax and semantics
TODO: provide more detail here
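The TODO stands; as a placeholder, the fragment below sketches the general shape implied by the two bullets above: nodes with endpoints that reference links, and constraints in a generic attribute/operator/value form. It is written as a Python dict that mirrors the JSON one-to-one, and the field names are illustrative assumptions, not the normative XIR schema.

```python
# Illustrative only: field names are assumptions, not the XIR schema.
xir_fragment = {
    "nodes": [
        {
            "id": "a",
            "constraints": [
                {"attribute": "proc.cores", "op": ">=", "value": 2},
                {"attribute": "memory.capacity", "op": ">=", "value": "4GB"},
            ],
            "endpoints": [
                {"id": "a.eth0", "link": "link0", "ip": "10.0.0.1/24"},
            ],
        },
        {
            "id": "b",
            "endpoints": [
                {"id": "b.eth0", "link": "link0", "ip": "10.0.0.2/24"},
            ],
        },
    ],
    "links": [
        {
            "id": "link0",
            "constraints": [
                {"attribute": "capacity", "op": "==", "value": "100Mbps"},
                {"attribute": "latency", "op": "==", "value": "10ms"},
            ],
        },
    ],
}
```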
Reticulation
Modeling also includes the capability to automatically augment a model with derived features according to a high level specification. This is referred to as model reticulation. An example is automatically calculating routes across a topology so all nodes can communicate with each other. Reticulators for v1.0 will include:
- IP address assignment
  - Each link gets its own subnet
- Route calculation
  - Routes are calculated exhaustively for all endpoints in the topology
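As a rough sketch of the per-link-subnet rule, the snippet below carves a /24 out of a supernet for each link and hands host addresses to the endpoints on that link. It reuses the illustrative XIR field names from the previous section and is not the mcc reticulator; the supernet and prefix length are arbitrary placeholders.

```python
import ipaddress

def assign_link_subnets(xir: dict, supernet: str = "10.0.0.0/16") -> None:
    """Give each link its own /24 and each endpoint on that link a host address."""
    subnet_pool = ipaddress.ip_network(supernet).subnets(new_prefix=24)
    link_nets = {link['id']: next(subnet_pool) for link in xir['links']}
    host_iters = {lid: net.hosts() for lid, net in link_nets.items()}

    for node in xir['nodes']:
        for ep in node['endpoints']:
            lid = ep['link']
            ep['ip'] = f"{next(host_iters[lid])}/{link_nets[lid].prefixlen}"

    # Route calculation (not shown) would then compute paths exhaustively
    # between all endpoint pairs over the assigned subnets.
```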
Revisions
A user makes an experiment model available to a Merge portal by pushing an experiment revision. The experiment revision history is an immutable stack of experiment versions. For each revision, the high level source code (which is always Python MX in v1.0) and the compiled XIR are tracked by the portal.
Status
| Capability | Component | Status | Test Coverage |
|---|---|---|---|
| Python based modeling | mx | in-production | implicit |
| Model compilation | model service | in-production | implicit |
| Model reticulation | model service, mcc | in-production | some unit testing |
| Model checkers | mcc | planned | none |
| Visualization | tbui | in-production | none? |
Realization
Realization is the capability of taking a user's experiment topology and attempting to find an embedding of that topology into the overall network of resources a portal presides over. The network of resources a portal presides over is the internet formed by interconnecting all the resource networks of the testbed facilities that are managed by the portal. This is referred to as the resource-internet.
Realization may or may not succeed, depending on what the user asks for. For example, if a node with 47 GB of RAM is requested but no such node exists in the resource internet, or the resources that do have that much RAM have already been allocated, the realization will fail.
Realization for Merge v1.0 will have the following capabilities.
- Realize experiment nodes as physical devices
- Realize experiment nodes as virtual machines multiplexed over virtual machine hosts
- Realize experiment interfaces as physical interfaces
- Realize experiment interfaces as virtual interfaces multiplexed onto physical interfaces
- Follow the scarcity model of allocation, which dictates that resources are considered for node-realization in order of increasing scarcity of their overall collection of characteristics; that is, abundant resources are preferred over scarce ones (a sketch of this ordering appears at the end of this section).
  - For example, if there are only 10 nodes available with > 30 GB of RAM, but 200 nodes available with 10 GB of RAM, the more abundant nodes with 10 GB of RAM will be allocated first. (There is a much longer discussion of this with more detail and precision in an MR that should be linked here.)
- When a realization is made, the user can either accept or reject the realization. They have 47 seconds to do so. No response is considered an implicit reject.
- When a realization is made, the resources belong to the owning experiment until explicitly freed.
- Realization shall take no longer than 5 seconds, independent of the size of the experiment or of the resource internet.
- Realization shall not be aware of the specifics of resources, or the specific constraints of experiments, but rather, the realizer understands how to determine if a constraint is satisfied by a resource in a generic way. This way the realization engine, resource pools and experiment definitions can all evolve independently.
A single experiment may have many simultaneous realizations at any given time.
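To make the ordering in the scarcity model concrete, the sketch below groups candidate resources by their characteristic profile and considers the most abundant profiles first. The data shapes and scoring here are illustrative assumptions; the precise definition is the subject of the MR referenced above.

```python
from collections import Counter

def order_candidates(candidates: list[dict]) -> list[dict]:
    """Consider the most abundant kind of resource first, the scarcest last."""
    # Group candidates by a hashable profile of their characteristics
    # (assumed here to be a flat dict of scalar values).
    def profile(r):
        return tuple(sorted(r['characteristics'].items()))

    abundance = Counter(profile(r) for r in candidates)
    # Sort by decreasing abundance of each candidate's profile.
    return sorted(candidates, key=lambda r: -abundance[profile(r)])

# Usage sketch: take the first ordered candidate that satisfies the node's
# constraints (constraint checking itself is generic, per the bullet above).
# chosen = next(r for r in order_candidates(pool) if satisfies(r, constraints))
```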
Materialization
Materialization is the capability of taking a realization and communicating with testbed facilities to provision the resources that underpin that materialization.
Users can
- Create materializations from realizations
- Check on the status of a materialization
- Dematerialize a materialization
- Attach to experiment networks through the experiment virtual private network (xVPN)
There is a 1:[0,1] mapping between realizations and materializations at any given time: a realization is a collection of resources, and a single set of resources cannot be materialized multiple times at once.
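A rough sketch of the user-facing flow against the portal follows. The endpoint paths, identifiers, and status fields below are illustrative assumptions, not the normative MergeAPI surface (see the API section later in this document).

```python
import time
import requests

PORTAL = "https://portal.example.net"            # hypothetical portal URL
TOKEN = "<api token>"                            # hypothetical bearer token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Create a materialization from an existing realization (path is illustrative).
requests.post(f"{PORTAL}/materializations/myproj/exp1/rz1", headers=HEADERS)

# Poll materialization status until the facilities report it ready.
while True:
    status = requests.get(f"{PORTAL}/materializations/myproj/exp1/rz1",
                          headers=HEADERS).json()
    if status.get("state") == "ready":           # illustrative status field
        break
    time.sleep(5)

# Dematerialize, returning facility resources to an unprovisioned state; the
# realization itself remains held until it is explicitly freed.
requests.delete(f"{PORTAL}/materializations/myproj/exp1/rz1", headers=HEADERS)
```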
A concrete protobuf 3 protocol specification for materialization exists. Furthermore, the semantics of how the protocol is carried out between the portal and sites are implicitly defined in code and in the minds of the core Merge development team, but should be spelled out and well defined here.
Of particular concern is defining the failure model for when things go sideways on a testbed facility during a materialization: how is the relevant information propagated, and what actions must the portal take? Again, all of this is implicitly captured in portal code, but should be spelled out here.
Project, User and Workspace Management
The following capabilities will be provided for project, user and workspace management in Merge v1.0
- User accounts are created through an OAuth2 registration flow
- Once a user account is created it must be activated by a portal administrator
- User accounts automatically get a personal project
- User accounts are provisioned with a home directory in `/home/<username>` that may be accessed through XDCs but is independent of the lifetime of any XDC
- Projects are provisioned with a project directory in `/proj/<name>` that may be accessed through XDCs but is independent of the lifetime of any XDC
- Projects, users and experiments are governed according to the declarative portal policy framework.
- Project maintainers can manage their own projects and permission sets without the need for testbed administrators.
API
The MergeAPI is an OpenAPI 2.0 specification. It provides the ability to manage the following objects. In the list below, CRUD (create, read, update, delete) operations are not explicitly listed but are assumed for all objects.
- Users
- Pubkeys
- Projects
- Members
- Experiments
- Source code
- XDCs
- Tokens
- Connections
- Realizations
- Accept / reject
- Materializations
- Status
- Attach
- Sites
- Views
- Certs
- WGD config
- Activation state
- Resources
- List
- Pools
- Sites
- Projects
- Models
- Compile
- Health
TODO: fill in more detail for all the above, each probably needs to be a subsection
XDCs
An eXperiment Development Container, or XDC, is a portal-managed container that can be created on demand by users. The container is accessible through SSH via the portal SSH jump container and through a Jupyter web interface via the portal HTTPS proxy container.
XDCs exist at project scope. When an XDC is created
- the project directory for the owning project is mounted in the XDC at `/proj/<name>`
- the home directories of all users in the project are mounted in the XDC at `/home/<user>`
An XDC can be attached to a materialized experiment by using the `materialization/attach` API endpoint. When an XDC is attached, a WireGuard interface is created on the XDC that provides a secure tunnel to the infrastructure network of the running experiment that the attach was requested for.
XDCs come with a base image that is a configuration parameter of the Merge portal they are hosted on. Users can override this base image with an image of their choosing in the API call to create the XDC. Containers that are used as XDCs must derive from one of the XDC base containers or the request will be rejected.
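A rough sketch of the creation-with-custom-image and attach calls is shown below. The endpoint paths and field names are illustrative assumptions; the OpenAPI spec is authoritative.

```python
import requests

PORTAL = "https://portal.example.net"            # hypothetical portal URL
TOKEN = "<api token>"                            # hypothetical bearer token
HEADERS = {"Authorization": f"Bearer {TOKEN}"}

# Create an XDC in a project, overriding the portal's base image. The image
# must derive from an XDC base container or the request is rejected.
requests.post(f"{PORTAL}/projects/myproj/xdcs",
              json={"name": "devbox",
                    "image": "registry.example.net/myproj/custom-xdc"},
              headers=HEADERS)

# Attach the XDC to a running materialization; on success a WireGuard
# interface appears inside the XDC, tunneled to the experiment's infranet.
requests.post(f"{PORTAL}/projects/myproj/xdcs/devbox/attach",
              json={"materialization": "exp1.rz1"},
              headers=HEADERS)
```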
Deployment
Beyond being a platform as a service (PaaS), the Merge portal is also distributable software. The portal ships as a self-contained Kubernetes cluster. The process of deploying a Merge portal is a matter of
- Creating a complete configuration file for the portal (TODO: specify this)
- Preparing the host cluster host operating systems and networks for Portal installation (TODO: specify this)
- Deploying the Merge Portal onto the cluster
Operations
The Merge Portal provides the following operations capabilities
- The mergectl utility, which sits behind the policy layer and can manage all portal objects administratively.
- A CLI and web dashboard that shows
- Core service health
- Active XDCs
- Resource usage across the resource internet
- Quantity of resources in use per project/user
- Duration of resource allocation
- Active materializations
- Mass storage utilization
- Certificate lifetimes
Facility Capabilities
The facility capabilities are the set of functionalities needed to materialize the space of experiments that can be expressed according to the Experiment Modeling section above.
TODO: There is a lot of temptation to write about the how here, but I’m making an effort to stick to the what with a bit of the why. It seems to me (ry) that a good idea for capturing some of the how is an implementation practices RFC, I’ve seen this style of RFC in the IETF and think it’s useful.
Commander
The MergeTB commander is a delegation authority between a Merge portal and a testbed facility. The commander implements the Merge materialization API with an interface that requires a client TLS certificate. The certificate is generated by a facility administrator and given to a Merge portal through the portal's `site/cert` API endpoint. Once the portal has the certificate, it will use that cert for communications with the commander.
The commander does not actually implement any of the materialization requests that come from the portal. Rather it delegates them to drivers. Every materialization request has a set of resource IDs associated with it. Drivers can register to receive materialization requests keyed on resource ID. When the commander gets a request, it looks up all drivers that have registered for the associated resource IDs and delegates the command to them.
The intent here is that not all drivers may be appropriate to drive all types of resources.
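The delegation logic itself is simple. The following is an abstract sketch of registration keyed on resource ID and fan-out of a request to the registered drivers; the types and names are illustrative, not the commander's actual code.

```python
from collections import defaultdict

class Commander:
    """Abstract sketch of commander delegation: drivers register for resource
    IDs, and incoming materialization requests fan out to matching drivers."""

    def __init__(self):
        self._drivers = defaultdict(set)   # resource id -> registered drivers

    def register(self, driver, resource_ids):
        for rid in resource_ids:
            self._drivers[rid].add(driver)

    def handle(self, request):
        # A request carries the set of resource IDs it touches; every driver
        # that registered for any of those IDs receives the request.
        targets = set()
        for rid in request.resource_ids:
            targets |= self._drivers[rid]
        for driver in targets:
            driver.dispatch(request)
```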
Driver
The MergeTB driver is a daemon that runs on facility infrastructure machines. The purpose of the driver is to take requests from a commander and turn them into an executable graph of tasks. The driver then stores those tasks in the task data store. Tasks are broadly partitioned into three categories.
Notify
Notify tasks come in two flavors
Incoming
The incoming notification indicates that a set of materialization tasks is on its way, and that the needed preparations should be made to perform the follow-on materialization tasks. This includes
- Creating a network enclave on an infrapod server
- Spinning up an infrapod (described below)
- Saving the initial state of the materialization
Node Task
Node tasks can
- Set up a node
  - Image the node with the specified OS
  - Add the specified configuration to the foundryd service in the materialization’s infrapod
  - Place the node on the materialization infrastructure network
- Recycle a node
  - Place the node on the harbor network (a holding materialization for all unallocated nodes)
  - Wipe the current OS to a clean state
  - Go into imaging standby mode
- Reset a node
  - Image the node with a clean OS
- Reboot a node
  - Power cycle the node, soft cycle by default, hard on request.
Link Task
Link tasks can
- Create a virtual link between
  - 2 nodes (point-to-point link, P2P)
  - A group of 2+ nodes (multi-point link, MPL)
  - 2 nodes going through an emulator (eP2P)
  - A group of 2+ nodes going through an emulator (eMPL)
- Destroy virtual links
Details of virtual networks are covered in the Virtual networks section.
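A minimal sketch of how the three task categories and their dependency edges might be represented as an executable graph follows. The type and field names are illustrative; the driver's real task store and task schema are not shown here.

```python
from dataclasses import dataclass, field
from enum import Enum

class TaskKind(Enum):
    NOTIFY = "notify"    # incoming notification / preparation
    NODE = "node"        # set up, recycle, reset, reboot
    LINK = "link"        # create / destroy virtual links

@dataclass
class Task:
    id: str
    kind: TaskKind
    action: str                                   # e.g. "setup", "create-p2p"
    depends_on: list[str] = field(default_factory=list)

# An executable graph: the notify task gates the node tasks, which in turn
# gate the link task (identifiers are placeholders).
tasks = [
    Task("mzn.notify", TaskKind.NOTIFY, "incoming"),
    Task("mzn.node.a", TaskKind.NODE, "setup", depends_on=["mzn.notify"]),
    Task("mzn.node.b", TaskKind.NODE, "setup", depends_on=["mzn.notify"]),
    Task("mzn.link.0", TaskKind.LINK, "create-p2p",
         depends_on=["mzn.node.a", "mzn.node.b"]),
]
```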
Infrapods
Infrapods are a collection of containers placed on the infrastructure network of a materialization that provide services to the nodes in the materialization. The base set of infrapod containers includes
- Nex: a DHCP/DNS container with a gRPC API for configuration. Provides addresses and name resolution for the infranet interfaces of nodes in a materialization.
- Foundry: a node configuration daemon with a gRPC API for configuration. Provides configuration to nodes when they boot up. All Merge OS images come with a foundry client that runs on startup and asks the foundry server for a system configuration to implement. The address of the foundry server is found through a DNS entry `foundry` that is resolved by Nex.
- Moactld: (optional) provides a gRPC interface for controlling the parameters of network emulations. When this container comes up, it is seeded with an emulation ID. This is the network emulation ID that belongs to the materialization. The moactld service will only accept requests for this emulation ID. As the network emulators are a facility level resource, this prevents one user from controlling the emulation parameters of another.
- SledAPI: (harbor only) imaging configuration daemon that provides a gRPC API for configuration. When nodes boot, they boot into the Sledc bootloader. This bootloader asks the Sled API what it should do in terms of imaging the node (more details in imaging section).
- Etcd: this is a container that provides data storage services for the other containers in an infrapod.
An infrapod exists in a single network namespace context; this is called the network enclave of the infrapod. Inside the infrapod are two network interfaces
- ceth0: connects to the infranet of the materialization
- ceth1: connects to the management network of the hosting server
ceth0 provides for node-to-infrapod communications, and ceth1 provides for communications between the testbed automation system and the infrapod. Additionally, the combination of these two interfaces, along with a simple pair of NAT rules and routes, allows nodes inside an experiment to communicate with upstream networks and infrapods to communicate with shared testbed facility resources.
- The first NAT rule translates all traffic from nodes not destined for an address on the infranet subnet to the ceth1 interface address, which kicks it up to the infrapod host. A routing rule is also created that ensures the path for this traffic is over the administrator-specified external interface. On the host, a second NAT rule translates this traffic from the infrapod service IP space to the external IP space.
- The second NAT rule is outside the infrapod on the host, and translates addresses coming from infrapods destined to shared services onto the testbed management network address owned by the infrapod host. This rule takes precedence over the first, so that traffic destined for facility services stays within the facility.
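For concreteness, the sketch below expresses the rules described above as iptables invocations assembled in Python. All subnets, addresses, and interface names are placeholders, and the real rules are installed by the facility automation, not by this snippet.

```python
# Illustrative placeholders only.
INFRANET = "172.30.0.0/16"       # materialization infranet subnet
CETH1_ADDR = "10.99.0.2"         # infrapod address on the ceth1 link
SERVICE_NET = "10.99.0.0/24"     # infrapod service IP space
FACILITY_SVCS = "10.0.10.0/24"   # shared facility services
MGMT_ADDR = "10.0.0.12"          # host's management network address
EXTERNAL_IF = "eno1"             # administrator-specified external interface

# (1) Inside the infrapod's network enclave: node traffic not destined for
#     the infranet is translated to the ceth1 address, pushing it to the host.
enclave_nat = (
    f"iptables -t nat -A POSTROUTING -o ceth1 ! -d {INFRANET} "
    f"-j SNAT --to-source {CETH1_ADDR}"
)

# On the host: translate that traffic from the infrapod service IP space to
# the external IP space, out the administrator-specified external interface.
host_external_nat = (
    f"iptables -t nat -A POSTROUTING -s {SERVICE_NET} -o {EXTERNAL_IF} -j MASQUERADE"
)

# (2) Also on the host, inserted ahead of the rule above so it takes
#     precedence: traffic from infrapods destined to shared facility services
#     is translated onto the host's management address and stays in-facility.
host_services_nat = (
    f"iptables -t nat -I POSTROUTING -s {SERVICE_NET} -d {FACILITY_SVCS} "
    f"-j SNAT --to-source {MGMT_ADDR}"
)
```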
Imaging
Every testbed facility must have at least one imaging server. This server is responsible for implementing the Sled Imaging protocol. As described in the Infrapods section, the bootstrap agent that performs imaging operations communicates with the SledAPI server in the harbor infrapod. When it comes to actually retrieving images, this is done through the Sled imaging server.
- Images are fetched over HTTP through GET requests based on the image name
- Images are stored in /var/img by name
Images are broken up into three pieces
- disk image: what actually goes on the disk
- OS kernel: the operating system kernel that is run (at least initially).
- initramfs: the root file system image used for booting the kernel
The reason for this partitioning is the way the imaging system works. When a node first boots, a preboot execution environment (PXE) agent, or unified extensible firmware interface (UEFI) program loads the Sledc bootloader. This bootloader is a minimal in-memory Linux OS. When it starts, it contacts the Sled API which tells it what to do. Commonly this will be to lay down some OS image on the disk and wait for further instructions. When a materialization happens, further instructions come down. It could be to jump into the operating system laid down on initialization, or to lay down a new image and jump into that.
To jump into the specified OS, Sledc does not reboot. It uses a Linux capability called kernel execute (kexec) to jump directly into the new OS. This saves a lot of time in the reboot cycle. However, it requires that the kernel and the initramfs be available outside the image itself. Note that this process is not limited to Linux; FreeBSD and other operating systems with a proper kernel/initramfs/persistent-disk distinction and well defined boot protocols can be booted the same way. Proprietary operating systems such as Windows can be loaded by treating a bootloader like GRUB as the kernel and then chainloading the OS through its native bootloader.
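To make the kexec step concrete, the fragment below shows the shape of that jump using the standard kexec-tools interface. The kernel, initramfs, and command line paths are illustrative placeholders, and the real Sledc bootloader is not a Python program.

```python
import subprocess

def kexec_into(kernel: str, initramfs: str, cmdline: str) -> None:
    """Jump into the target OS without a firmware reboot, as Sledc does after
    laying down (or reusing) the on-disk image. The kernel and initramfs are
    served separately from the disk image for exactly this reason."""
    subprocess.run(["kexec", "--load", kernel,
                    f"--initrd={initramfs}",
                    f"--command-line={cmdline}"], check=True)
    subprocess.run(["kexec", "--exec"], check=True)

# Example (placeholder paths):
# kexec_into("/var/img/example.kernel",
#            "/var/img/example.initramfs",
#            "root=/dev/sda1 rw")
```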
Virtual Networks
In a MergeTB facility, virtual networks
- Connect all nodes and all services in a materialization through a flat network called the infrastructure network (infranet)
- Connect experiment nodes to each other through experiment links
- Connect experiment nodes to each other through emulated links
The ability of a MergeTB facility to connect a device to a virtual network in the way specified by an experiment topology depends on the capabilities of the device and how those capabilities relate to the entry point types provided by the virtual network apparatus of the testbed. The virtual network entry point types provided by MergeTB in v1.0 are the following.
- Switch VLAN access ports
- Switch VLAN trunk ports
- Switch VXLAN VTEP devices
- Switch VRF isolated VXLAN underlay access
- Hypervisor VTEPs
These entry point types provide virtual network access in different scenarios.
VLAN Access Ports
Used for providing direct virtual network access to the device without the need for network virtualization on the device itself. This is useful in the following situations.
- The device is not capable of network virtualization.
- The experiment deals with the network interface at the kernel level.
- The experiment requires the use of VLAN tags as a part of the experiment (requires that the upstream switching mesh stacks tags correctly)
VLAN Trunk Ports
Used for providing multiple virtual links to a single device. In this case the device is responsible for creating VLAN virtual interfaces that communicate through the trunk port. The MergeTB facility automation software will detect when this is possible and automatically provision the needed Foundry configuration to make this happen. This is useful in the following situations.
- The experiment requires more interfaces on a given node than it has physical interfaces.
Switch VXLAN VTEP devices
Used for providing direct virtual network access to the device without the need for network virtualization on the device itself. This is very similar to the VLAN Access Ports described above. The distinction here is that instead of bridging into a VLAN network, the switch port feeds directly into a VXLAN device, entering into the fully encapsulated VXLAN network overlay. This has the benefit of precluding the need for VLAN stacking when experiments use VLAN tags as part of the experiment, reducing VLAN plumbing complexity in the core of the switching mesh, and making the traffic immediately routable to testbed services, emulators and other testbed facilities for cross-facility materializations.
Switch VRF isolated VXLAN underlay access
Used for pushing VTEPs down to the testbed nodes themselves. This provides a full encapsulation point on the node, which may be required for certain classes of traffic that a VLAN device would interfere with. However, it does require special consideration from the connecting switch, as having a functional VXLAN device means exposing part of the virtual network underlay to the node, so this must be done carefully. VTEPs on user nodes cannot be allowed to connect to other VTEPs that are not a part of their experiment. To enforce this, a virtual routing and forwarding table (VRF) on the connecting switch must be used to ensure that the host VTEP can only communicate with VTEPs in the same materialization.
Hypervisor VTEPs
The hypervisors that are directly controlled by the MergeTB platform are trusted entities that are allowed virtual network underlay access. As such they create VTEPs for the virtual machines they host that are not considered trusted infrastructure.
Platform Support
As of Merge v1.0, the following platforms are supported.
- X86/X86_64 devices that are capable of running Linux
- Raspberry Pi 4 devices (the 4 introduced reasonable network booting)
- NVIDIA Jetson TX2 and later, and Jetson Nano
- QEMU/KVM Hypervisors running on Intel Xeon or AMD Zen platforms
Network Emulation
Network emulation is the capability for users to model links with certain characteristics. In Merge v1.0 the following characteristics are supported on P2P links and MPLs.
- Latency
- Capacity
- Loss
Network emulation is implemented by a centralized emulator and has specific virtual network considerations.
- Each virtual network link must be cut in half. For example, if nodes `A` and `B` are communicating, `A` cannot be on the same virtual link as `B`, otherwise communications could bypass the emulator. Thus a dedicated virtual link is created between each node endpoint and the emulator.
Network emulations can be modified by the user at runtime through the `moactld` container in the infrapod.
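A sketch of what such a runtime adjustment might look like follows, assuming hypothetical generated gRPC stubs (`moactld_pb2`, `moactld_pb2_grpc`) and message fields; the actual moactld protocol definition is authoritative.

```python
import grpc
# Hypothetical generated stubs; the real moactld proto definitions govern.
from moactld_pb2 import UpdateLinkRequest, LinkParameters
from moactld_pb2_grpc import MoactldStub

# The moactld container is reachable inside the materialization by name.
channel = grpc.insecure_channel("moactld:5000")     # illustrative address/port
stub = MoactldStub(channel)

# Degrade the emulated link between nodes a and b at runtime. The service only
# accepts requests for the emulation ID it was seeded with at startup.
stub.UpdateLink(UpdateLinkRequest(
    link="a~b",                                     # illustrative link name
    parameters=LinkParameters(latency_ms=50, capacity_mbps=10, loss=0.01),
))
```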
Network emulators operate on VXLAN VTEP devices using EVPN as a control plane. When an emulation link is set up, VTEPs are established at the node entry points of the network and on the emulator. The emulator then advertises the MAC addresses of all the node endpoints in the emulation on the adjacent VTEPs, e.g. it will advertise `A`'s MAC address on the VTEP that tunnels to `B`. In this way nodes always communicate through the emulator.
The emulator picks packets up off the VTEP devices through XDP so the host kernel's network stack does not interfere with packet handling.
Emulation processes are handled through the Moa network emulation daemon. This daemon is always running on emulation nodes. When an experiment is materialized, the testbed facility automation systems send requests to this daemon over gRPC to initialize and start network emulations. On teardown the facility automation systems ensure that these emulations are stopped and discarded.
Node Virtualization
TODO: After the initial implementation is prototyped, we’ll start to write this section.
Mass storage
TODO: After the initial implementation is prototyped, we’ll start to write this section.
User Console access
TODO: After the initial implementation is prototyped, we’ll start to write this section.
Deployment
Deployment of a Merge testbed facility is a five-phase process. Merge testbeds are model centric; this is true for both the deployment and operation of the facility. Once the model of the testbed exists, everything else is automated based on that model.
The model development kit (MDK) produces stable models that can be incrementally updated in code when needed. Examples include adding or removing resources from the testbed, or fixing a bug in the connectivity model. When this happens, the MDK produces a new JSON model from the source code that can be applied to the testbed without disrupting current operations.
1. Create a model of the facility using the MergeTB testbed modeling framework.
2. Physically construct the network that interconnects the testbed elements.
3. Prepare the host machines and switches such that Ansible playbooks can execute on them.
4. Use merge-confgen to generate a tb-deploy configuration from the model created in (1).
5. Launch the tb-deploy playbook using the configuration generated in (4).
Operations
Merge testbed facilities provide the following operations capabilities.
A command line tool and web interface to
- List and inspect materializations
- List and inspect and manage materialization task graphs
- Dematerialize and rematerialize materializations
- Manage OS images
- Manage, visualize, check and fix virtual networks
- Provide console access to nodes
- Power cycle nodes, hard and soft