MergeTB 1.0 Spec

MergeTB 1.0

This RFC defines the capabilities that will be in the MergeTB 1.0 release. For MergeTB 1.0, each capability must have end-to-end testing that asserts its functionality. These tests will be hashed out in a separate RFC.

Release Date

1 July 2020

Release Format

MergeTB follows the release channel format of Debian. We will maintain stable, testing and unstable release channels. As we have not yet reached a 1.0 release, there is no stable channel yet. Deployed systems and the collection of latest git semver tags implicitly constitute the testing release, and developer branches implicitly comprise unstable.

When the initial release is cut, all semver tags across all Merge code repositories will go to v1.0, and packages will be pushed to the stable repo for the first time. At that time the testing and unstable repos will stop accepting ad-hoc CI pushes, as is the current practice, and will henceforth be governed by the migration rules, testing and policies currently being defined in the ecosystem RFC.

Moving forward, follow-on major releases in stable and testing will follow a capability driven model (as opposed to a calendar driven model). When a sufficiently motivating set of new capabilities has reached the testing repo, testing will enter a freeze period and acceptance testing for migration of an entire snapshot of the testing channel will begin. Minor releases and bug fixes may migrate through the release channels outside of release windows.

Portal Capabilities

The portal capabilities are the set of functionalities directly accessible to users. They are implemented by a Merge portal system that presides over a collection of testbed facilities. The documentation that follows

  • defines each capability
  • provides status of each component that fully or partially implements that capability

Experiment Modeling

Experiment modeling is broadly broken down into model expression, validation and reticulation. Each is addressed in more detail below.


Expression

Experiment modeling is the capability of a user to programmatically express an experiment in terms of

  • Network topology structure
    • nodes
    • links
    • endpoints that belong to nodes and bind to links
    • point-to-point and multipoint links
  • Node configuration
    • OS Image
    • Interface IP addresses
    • Node characteristics as constraints
      • Number of CPU cores
      • Memory capacity
      • GPU model presence
      • Interface bandwidth capacity
  • Link configuration
    • maximum capacity
    • latency
    • loss rate
  • Global experiment properties
    • Automatic IP address assignment
    • Automatic routing calculation

Merge v1.0 will ship with the Python 3 based MX library for experiment expression. Merge will track Debian Bullseye in terms of what exact Python 3.X release is supported.
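The MX API itself is documented elsewhere, but the modeling vocabulary above can be illustrated with a small self-contained sketch. The class and method names below are hypothetical stand-ins, not the MX library API:

```python
# Minimal illustration of the modeling concepts above. Class and method
# names are hypothetical -- this is NOT the MX API itself.

class Endpoint:
    """An interface that belongs to a node and binds to a link."""
    def __init__(self, node, link):
        self.node, self.link = node, link

class Node:
    def __init__(self, name, image=None, **constraints):
        self.name, self.image = name, image
        self.constraints = constraints   # e.g. cores, memory, gpu
        self.endpoints = []

class Link:
    """A point-to-point (2 node) or multipoint (2+ node) link."""
    def __init__(self, *nodes, capacity=None, latency=None, loss=None):
        self.capacity, self.latency, self.loss = capacity, latency, loss
        self.endpoints = []
        for n in nodes:
            ep = Endpoint(n, self)       # endpoint belongs to the node
            n.endpoints.append(ep)       # ... and binds to this link
            self.endpoints.append(ep)

class Network:
    def __init__(self, name):
        self.name, self.nodes, self.links = name, [], []
    def node(self, name, **kw):
        n = Node(name, **kw)
        self.nodes.append(n)
        return n
    def connect(self, *nodes, **kw):
        link = Link(*nodes, **kw)
        self.links.append(link)
        return link

# a two node experiment joined by a single point-to-point link
net = Network('hello-world')
a = net.node('a', image='debian:11', cores=2, memory='4GB')
b = net.node('b', image='debian:11')
net.connect(a, b, capacity='1Gbps', latency='10ms')
```

The real library layers validation, constraint expression and serialization on top of this basic node/endpoint/link structure.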


Validation

Modeling also includes the capability for the experimenter to validate an experiment model through

  • Compilation
    • Is the expression of the model syntactically and semantically sound?
    • The result of compilation is an eXperiment Intermediate Representation (XIR) file.
  • Static analysis
    • Is the experiment fully connected? (Islanded experiments are not supported in v1.0.)
    • Are IP addresses well formed?
  • Visualization
    • Basic force-directed graphs of experiments
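The fully-connected static check reduces to a graph traversal. A sketch, assuming nodes and links are given as plain lists rather than XIR:

```python
from collections import deque

def fully_connected(nodes, links):
    """Static check: is every node reachable from every other node?
    nodes is a list of node names; each link is an iterable of the
    node names it joins (2 for P2P, 2+ for multipoint)."""
    if not nodes:
        return True
    adjacent = {n: set() for n in nodes}
    for link in links:
        members = set(link)
        for a in members:
            adjacent[a] |= members - {a}
    # breadth-first search from an arbitrary node
    seen, frontier = {nodes[0]}, deque([nodes[0]])
    while frontier:
        for m in adjacent[frontier.popleft()]:
            if m not in seen:
                seen.add(m)
                frontier.append(m)
    return len(seen) == len(nodes)

fully_connected(['a', 'b', 'c'], [('a', 'b'), ('b', 'c')])   # True
fully_connected(['a', 'b', 'c'], [('a', 'b')])               # False: c is islanded
```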

XIR (eXperiment Intermediate Representation)

XIR is JSON with an additional set of semantics built in

  • The structural model of a network is represented through nodes with endpoints that reference links
  • Constraints are represented in a general format with a concrete syntax and semantics

TODO: provide more detail here
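Pending that detail, the two conventions above can be illustrated with a small example. This is an illustrative shape only, not the actual XIR schema; the constraint syntax shown is a placeholder:

```python
import json

# Illustrative only: nodes carry endpoints, and each endpoint references
# a link by id. The constraint representation here is a placeholder.
xir = {
    "nodes": [
        {"id": "a",
         "endpoints": [{"id": "a.eth0", "link": "l0"}],
         "constraints": [{"key": "memory", "op": ">=", "value": "4GB"}]},
        {"id": "b",
         "endpoints": [{"id": "b.eth0", "link": "l0"}],
         "constraints": []},
    ],
    "links": [
        {"id": "l0", "props": {"capacity": "1Gbps", "latency": "10ms"}},
    ],
}
print(json.dumps(xir, indent=2))
```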


Reticulation

Modeling also includes the capability to automatically augment a model with derived features according to a high level specification. This is referred to as model reticulation. An example is automatically calculating routes across a topology so that all nodes can communicate with each other. Reticulators for v1.0 will include:

  • IP address assignment
    • Each link gets its own subnet
  • Route calculation
    • Routes are calculated exhaustively for all endpoints in the topology
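The per-link subnet rule can be sketched with the standard library's ipaddress module. The 10.0.0.0/16 parent block and /24 per-link subnets below are assumptions for illustration, not the portal's actual addressing scheme:

```python
import ipaddress

def assign_addresses(links, parent="10.0.0.0/16", new_prefix=24):
    """Reticulate IP addresses: each link gets its own subnet, and each
    endpoint on the link gets an address within it. links is a list of
    endpoint-name lists. Returns {endpoint: address}."""
    subnets = ipaddress.ip_network(parent).subnets(new_prefix=new_prefix)
    assignment = {}
    for link, subnet in zip(links, subnets):
        hosts = subnet.hosts()
        for endpoint in link:
            assignment[endpoint] = str(next(hosts))
    return assignment

addrs = assign_addresses([["a.eth0", "b.eth0"], ["b.eth1", "c.eth0"]])
# a.eth0 and b.eth0 share 10.0.0.0/24; b.eth1 and c.eth0 share 10.0.1.0/24
```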


Revisions

A user makes an experiment model available to a Merge portal by pushing an experiment revision. The experiment revision history is an immutable stack of experiment versions. For each revision, the high level source code (which is always Python MX in v1.0) and the compiled XIR are tracked by the portal.


Capability               Component           Status          Test Coverage
Python based modeling    mx                  in-production   implicit
Model compilation        model service       in-production   implicit
Model reticulation       model service, mcc  in-production   some unit testing
Model checkers           mcc                 planned         none
Visualization            tbui                in-production   none?


Realization

Realization is the capability of taking a user’s experiment topology and attempting to find an embedding of that topology into the overall network of resources a portal presides over. That network is the internet formed by interconnecting the resource networks of all the testbed facilities managed by the portal, and is referred to as the resource-internet.

Realization may or may not succeed, depending on what the user asks for. For example, if a node with 47 GB of RAM is requested but no such node exists in the resource-internet, or the resources that do have that much RAM have already been allocated, the realization will fail.

Realization for Merge v1.0 will have the following capabilities.

  • Realize experiment nodes as physical devices
  • Realize experiment nodes as virtual machines multiplexed over virtual machine hosts
  • Realize experiment interfaces as physical interfaces
  • Realize experiment interfaces as virtual interfaces multiplexed onto physical interfaces
  • Follow the scarcity model of allocation, which dictates that the order in which resources are considered for node-realization is inversely proportional to their overall scarcity as a collection of characteristics.
    • For example, if there are only 10 nodes available with > 30 GB of RAM, but 200 nodes available with 10 GB of RAM, the more abundant nodes with 10 GB of RAM will be allocated first. (There is a much longer discussion of this with more detail and precision in an MR that should be linked here)
  • When a realization is made, the user can either accept or reject the realization. They have 47 seconds to do so. No response is considered an implicit reject.
  • When a realization is made, the resources belong to the owning experiment until explicitly freed.
  • Realization shall take no longer than 5 seconds, independent of the size of the experiment or of the resource-internet.
  • Realization shall not be aware of the specifics of resources, or the specific constraints of experiments, but rather, the realizer understands how to determine if a constraint is satisfied by a resource in a generic way. This way the realization engine, resource pools and experiment definitions can all evolve independently.

A single experiment may have many simultaneous realizations at any given time.
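The generic constraint matching and scarcity ordering described above can be sketched as follows. The (key, op, value) constraint format is an assumption for illustration, not the realizer's actual representation:

```python
import operator

OPS = {">": operator.gt, ">=": operator.ge, "<": operator.lt,
       "<=": operator.le, "==": operator.eq}

def satisfies(resource, constraints):
    """Generic constraint check: the realizer needs no knowledge of what
    'memory' or 'cores' mean, only how to compare attribute values.
    Constraints are (key, op, value) tuples -- an assumed format."""
    return all(key in resource and OPS[op](resource[key], value)
               for key, op, value in constraints)

def candidates(resources, constraints):
    """Matching resources, most abundant characteristic-sets first, per
    the scarcity model of allocation."""
    def abundance(r):
        # scarcity of a resource ~ how few resources share its characteristics
        return sum(1 for other in resources
                   if all(other.get(k) == v for k, v in r.items()))
    matching = [r for r in resources if satisfies(r, constraints)]
    return sorted(matching, key=abundance, reverse=True)

pool = [{"memory": 32}] * 2 + [{"memory": 10}] * 5
picks = candidates(pool, [("memory", ">=", 10)])
# the abundant 10 GB nodes are considered before the scarce 32 GB nodes
```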


Materialization

Materialization is the capability of taking a realization and communicating with testbed facilities to provision the resources that underpin that realization.

Users can

  • Create materializations from realizations
  • Check on the status of a materialization
  • Dematerialize a materialization
  • Attach to experiment networks through the experiment virtual private network (xVPN)

There is a 1:[0,1] mapping between realizations and materializations at any given time: because a realization is a collection of concrete resources, a single set of resources cannot be materialized multiple times at once.

A concrete protobuf 3 protocol specification for materialization exists. Furthermore, the semantics for how the protocol is carried out between the portal and sites are implicitly defined in code and in the minds of the core Merge development team, but should be spelled out and well defined here.

Of particular concern is defining the failure model for when things go sideways on a testbed facility during a materialization: how is the relevant information propagated, and what are the specified actions the portal must take? Again, all of this is implicitly captured in portal code, but should be spelled out here.

Project, User and Workspace Management

The following capabilities will be provided for project, user and workspace management in Merge v1.0

  • Users accounts are created through an OAuth2 registration flow
  • Once a user account is created it must be activated by a portal administrator
  • User accounts automatically get a personal project
  • User accounts are provisioned with a home directory in /home/<username> that may be accessed through XDCs but is independent of the lifetime of any XDC
  • Projects are provisioned with a project directory in /proj/<name> that may be accessed through XDCs but is independent of the lifetime of any XDC
  • Projects, users and experiments are governed according to the declarative portal policy framework.
    • Project maintainers can manage their own projects and permission sets without the need for testbed administrators.


MergeAPI

The MergeAPI is an OpenAPI 2.0 specification. It provides the ability to manage the following objects. In the list below, CRUD (create, read, update, delete) operations are not explicitly listed but are assumed for all objects.

  • Users
    • Pubkeys
  • Projects
    • Members
  • Experiments
    • Source code
  • XDCs
    • Tokens
    • Connections
  • Realizations
    • Accept / reject
  • Materializations
    • Status
    • Attach
  • Sites
    • Views
    • Certs
    • WGD config
    • Activation state
  • Resources
    • List
  • Pools
    • Sites
    • Projects
  • Models
    • Compile
  • Health

TODO: fill in more detail for all the above, each probably needs to be a subsection
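As one illustration of how the object hierarchy composes, endpoint paths might nest as follows. The route names below are hypothetical placeholders, not the actual MergeAPI routes:

```python
# Hypothetical path composition for the object hierarchy above; the
# real routes are defined by the MergeAPI OpenAPI 2.0 specification.
def path(*segments):
    return "/" + "/".join(segments)

# read a project member (CRUD read)
member = path("projects", "murphy", "members", "olive")
# accept a realization of an experiment
accept = path("projects", "murphy", "experiments", "exp0",
              "realizations", "rlz0", "accept")
```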


XDCs

An eXperiment Development Container, or XDC, is a portal managed container that can be created on demand by users. The container is accessible through SSH via the portal SSH jump container, and through a Jupyter web interface via the portal HTTPS proxy container.

XDCs exist at project scope. When an XDC is created

  • the project directory for the owning project is mounted in the XDC at /proj/<name>
  • the home directory of each user in the project is mounted in the XDC at /home/<user>

An XDC can be attached to a materialized experiment by using the materialization/attach API endpoint. When an XDC is attached a Wireguard interface is created on the XDC that provides a secure tunnel to the infrastructure network of the running experiment that the attach was requested for.

XDCs come with a base image that is a configuration parameter of the Merge portal they are hosted on. Users can override this base image with an image of their choosing in the API call to create the XDC. Containers that are used as XDCs must derive from one of the XDC base containers or the request will be rejected.


Deployment

Beyond being a platform as a service (PaaS), the Merge portal is also distributable software. The portal ships as a self-contained Kubernetes cluster. The process of deploying a Merge portal is a matter of

  • Creating a complete configuration file for the portal (TODO: specify this)
  • Preparing the cluster host operating systems and networks for portal installation (TODO: specify this)
  • Deploying the Merge Portal onto the cluster


Operations

The Merge Portal provides the following operations capabilities

  • The mergectl utility, which sits behind the policy layer and can manage all portal objects administratively.
  • A CLI and web dashboard that shows
    • Core service health
    • Active XDCs
    • Resource usage across the resource internet
      • Quantity of resources in use per project/user
      • Duration of resource allocation
    • Active materializations
    • Mass storage utilization
    • Certificate lifetimes

Facility Capabilities

The facility capabilities are the set of functionalities needed to materialize the space of experiments that can be expressed according to the Experiment Modeling section above.

TODO: There is a lot of temptation to write about the how here, but I’m making an effort to stick to the what with a bit of the why. It seems to me (ry) that a good idea for capturing some of the how is an implementation practices RFC, I’ve seen this style of RFC in the IETF and think it’s useful.


Commander

The MergeTB commander is a delegation authority between a Merge portal and a testbed facility. The commander implements the Merge materialization API with an interface that requires a client TLS certificate. The certificate is generated by a facility administrator and given to a Merge portal through the portal’s site/cert API endpoint. Once the portal has the certificate, it will use that cert for communications with the commander.

The commander does not actually implement any of the materialization requests that come from the portal. Rather it delegates them to drivers. Every materialization request has a set of resource IDs associated with it. Drivers can register to receive materialization requests keyed on resource ID. When the commander gets a request, it looks up all drivers that have registered for the associated resource IDs and delegates the command to them.

The intent here is that not all drivers may be appropriate to drive all types of resources.
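The registration and delegation scheme can be sketched as follows. This is a minimal in-memory sketch, not the commander's actual implementation:

```python
from collections import defaultdict

class Commander:
    """Delegates materialization requests to registered drivers. A
    minimal in-memory sketch of the keying scheme described above."""
    def __init__(self):
        self.registry = defaultdict(set)   # resource id -> drivers

    def register(self, driver, resource_ids):
        for rid in resource_ids:
            self.registry[rid].add(driver)

    def delegate(self, request):
        # fan out to the union of all drivers registered for any of the
        # request's resource ids
        drivers = set()
        for rid in request["resources"]:
            drivers |= self.registry[rid]
        for driver in drivers:
            driver.handle(request)
        return drivers

class Recorder:
    """Stand-in driver that records the requests delegated to it."""
    def __init__(self):
        self.requests = []
    def handle(self, request):
        self.requests.append(request)

commander, node_driver, emu_driver = Commander(), Recorder(), Recorder()
commander.register(node_driver, ["node1", "node2"])
commander.register(emu_driver, ["emu0"])
commander.delegate({"resources": ["node1", "emu0"]})  # both drivers handle it
```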


Driver

The MergeTB driver is a daemon that runs on facility infrastructure machines. The purpose of the driver is to take requests from a commander and turn them into an executable graph of tasks. The driver then stores those tasks in the task data store. Tasks are broadly partitioned into three categories.


Notify Task

Notify tasks come in two flavors


The incoming notification indicates that a set of materialization tasks are incoming, and to make the needed preparations to perform follow-on materialization tasks. This includes

  • Creating a network enclave on an infrapod server
  • Spinning up an infrapod (described below)
  • Saving the initial state of the materialization

Node Task

Node tasks can

  • Setup a node
    • Image node with specified OS
    • Add the specified configuration to the foundryd service in the materialization’s infrapod
    • Place on materialization infrastructure network
  • Recycle a node
    • Place on the harbor network (a holding materialization for all unallocated nodes)
    • Wipe current OS to clean state
    • Go into imaging standby mode
  • Reset a node
    • Image node with clean OS
  • Reboot a node
    • Power cycle the node, soft cycle by default, hard on request.

Link Task

Link tasks can

  • Create a virtual link between
    • 2 nodes (point-to-point link, P2P)
    • A group of 2+ nodes (multi-point link MPL)
    • 2 nodes going through an emulator (eP2P)
    • A group of 2+ nodes going through an emulator (eMPL)
  • Destroy virtual links

Details of virtual networks are covered in the Virtual networks section.


Infrapods

Infrapods are a collection of containers placed on the infrastructure network of a materialization that provide services to the nodes in the materialization. The base set of infrapod containers includes

  • Nex: a DHCP/DNS container with a gRPC API for configuration. Provides addresses and name resolution for the infranet interfaces of nodes in a materialization.
  • Foundry: a node configuration daemon with a gRPC API for configuration. Provides configuration to nodes when they boot up. All Merge OS images come with a foundry client that runs on startup and asks the foundry server for a system configuration to implement. The address of the foundry server is found by a dns entry foundry that is resolved by Nex.
  • Moactld: (optional) provides a gRPC interface for controlling the parameters of network emulations. When this container comes up, it is seeded with an emulation ID: the network emulation ID that belongs to the materialization. The moactld service will only accept requests for this emulation ID. As the network emulators are a facility level resource, this prevents one user from controlling the emulation parameters of another.
  • SledAPI: (harbor only) imaging configuration daemon that provides a gRPC API for configuration. When nodes boot, they boot into the Sledc bootloader. This bootloader asks the Sled API what it should do in terms of imaging the node (more details in imaging section).
  • Etcd: this is a container that provides data storage services for the other containers in an infrapod.

An infrapod exists in a single network namespace context, called the network enclave of the infrapod. Inside the infrapod are two network interfaces

  • ceth0: connects to the infranet of the materialization
  • ceth1: connects to the management network of the hosting server

The ceth0 interface provides for node-to-infrapod communications, and ceth1 provides for communications between the testbed automation system and the infrapod. Additionally, the combination of these two interfaces, along with a simple pair of NAT rules and routes, allows nodes inside an experiment to communicate with upstream networks and infrapods to communicate with shared testbed facility resources.

  • the first NAT rule translates all traffic from nodes not destined for an address on the infranet subnet to the ceth1 interface address, which kicks it up to the infrapod host. A routing rule is also created to ensure that the path for this traffic is over the administrator specified external interface. On the host, a second NAT rule translates this traffic from the infrapod service IP space to the external IP space.

  • the second NAT rule is outside the infrapod on the host, and translates addresses coming from infrapods destined to shared services onto testbed management network addresses owned by the infrapod host. This rule takes precedence over the first, so that traffic destined for facility services stays within the facility.
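As a rough sketch, the two rules might look like the following iptables invocations. Every subnet, address and interface name here is an assumption for illustration, not the facility automation's actual output:

```python
# Illustrative iptables rules for the two NATs described above. All
# subnets, addresses and interface names are assumptions.
INFRANET   = "172.30.0.0/16"   # materialization infranet subnet
CETH1_ADDR = "10.99.0.2"       # infrapod side address on ceth1
SVC_NET    = "10.99.0.0/16"    # infrapod service IP space
MGMT_ADDR  = "10.0.1.5"        # host address on the management network
FACILITY   = "10.0.0.0/16"     # shared facility service addresses

# rule 1, inside the enclave: node traffic not bound for the infranet is
# translated to the ceth1 address, kicking it up to the infrapod host
enclave_rule = (f"iptables -t nat -A POSTROUTING ! -d {INFRANET} "
                f"-j SNAT --to-source {CETH1_ADDR}")

# rule 2, on the host: traffic from infrapods to shared facility
# services is translated to the management address; inserted first (-I)
# so it takes precedence over the external-bound translation
host_rules = [
    (f"iptables -t nat -I POSTROUTING -s {SVC_NET} -d {FACILITY} "
     f"-j SNAT --to-source {MGMT_ADDR}"),
    f"iptables -t nat -A POSTROUTING -s {SVC_NET} -o eth0 -j MASQUERADE",
]
```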


Imaging

Every testbed facility must have at least one imaging server. This server is responsible for implementing the Sled imaging protocol. As described in the Infrapods section, the bootstrap agent that performs imaging operations communicates with the SledAPI server in the harbor infrapod. When it comes to actually retrieving images, this is done through the Sled imaging server.

  • Images are fetched over HTTP through GET requests based on the image name
  • Images are stored in /var/img by name

Images are broken up into three pieces

  • disk image: what actually goes on the disk
  • OS kernel: the operating system kernel that is run (at least initially).
  • initramfs: the root file system image used for booting the kernel

The reason for this partitioning is the way the imaging system works. When a node first boots, a preboot execution environment (PXE) agent, or unified extensible firmware interface (UEFI) program loads the Sledc bootloader. This bootloader is a minimal in-memory Linux OS. When it starts, it contacts the Sled API which tells it what to do. Commonly this will be to lay down some OS image on the disk and wait for further instructions. When a materialization happens, further instructions come down. It could be to jump into the operating system laid down on initialization, or to lay down a new image and jump into that.

To jump into the specified OS, Sledc does not reboot. It uses a Linux capability called kernel execute (kexec) to jump directly into the new OS. This saves a lot of time in the reboot cycle. However, it requires that the kernel and the initramfs be available outside the image itself. Note that this process is not limited to Linux: FreeBSD and other operating systems with a proper kernel/initramfs/persistent-disk distinction and well defined boot protocols can be loaded the same way. Proprietary operating systems such as Windows can be loaded by considering a bootloader like GRUB to be the kernel and then chainloading the OS through its native bootloader.
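Assuming the three image pieces are already on disk, the jump amounts to a kexec load followed by an execute. A sketch using kexec-tools syntax, with illustrative paths and kernel command line:

```python
import shlex

def kexec_commands(kernel, initramfs, cmdline):
    """Build the kexec-tools invocations for jumping into a newly laid
    down OS without a firmware reboot. Paths are illustrative."""
    load = (f"kexec -l {shlex.quote(kernel)} "
            f"--initrd={shlex.quote(initramfs)} "
            f"--append={shlex.quote(cmdline)}")
    return [load, "kexec -e"]   # stage the new kernel, then jump into it

cmds = kexec_commands("/var/img/debian-11.kernel",
                      "/var/img/debian-11.initramfs",
                      "root=/dev/sda2 console=ttyS0")
```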

Virtual Networks

In a MergeTB facility virtual networks

  • Connect all nodes in a materialization, and all services in a materialization through a flat network called the infrastructure network (infranet).
  • Connect experiment nodes to each other through experiment links
  • Connect experiment nodes to each other through emulated links

The ability for a MergeTB facility to connect a device to a virtual network in the way specified by an experiment topology, depends on the capabilities of the device and how those capabilities relate to the entry point types provided by the virtual network apparatus of the testbed. The virtual network entry point types provided by MergeTB in v1.0 are the following.

  • Switch VLAN access ports
  • Switch VLAN trunk ports
  • Switch VXLAN VTEP devices
  • Switch VRF isolated VXLAN underlay access
  • Hypervisor VTEPs

These entry point types provide virtual network access in different scenarios.

VLAN Access Ports

Used for providing direct virtual network access to the device without the need for network virtualization on the device itself. This is useful in the following situations.

  • The device is not capable of network virtualization.
  • The experiment deals with the network interface at the kernel level.
  • The experiment requires the use of VLAN tags as a part of the experiment (requires that the upstream switching mesh stacks tags correctly)

VLAN Trunk Ports

Used for providing multiple virtual links to a single device. In this case the device is responsible for creating VLAN virtual interfaces that communicate through the trunk port. The MergeTB facility automation software will detect when this is possible and automatically provision the needed Foundry configuration to make this happen. This is useful in the following situations.

  • The experiment requires more interfaces on a given node than it has physical interfaces.

Switch VXLAN VTEP devices

Used for providing direct virtual network access to the device without the need for network virtualization on the device itself. This is very similar to the VLAN Access Ports described above. The distinction here is that instead of bridging into a VLAN network, the switch port feeds directly into a VXLAN device, entering into the fully encapsulated VXLAN network overlay. This precludes the need for VLAN stacking when experiments use VLAN tags as a part of the experiment, reduces VLAN plumbing complexity in the core of the switching mesh, and makes the traffic immediately routable to testbed services, emulators and other testbed facilities for cross-facility materializations.

Switch VRF isolated VXLAN underlay access

Used for pushing VTEPs down to the testbed nodes themselves. This provides a full encapsulation point on the node, which may be required for certain classes of traffic that a VLAN device would interfere with. However, it requires special consideration from the connecting switch: a functional VXLAN device means exposing part of the virtual network underlay to the node, so this must be done carefully. VTEPs on user nodes cannot be allowed to connect to other VTEPs that are not a part of their experiment. To enforce this, a virtual routing and forwarding table (VRF) on the connecting switch must be used to ensure that the host VTEP can only communicate with VTEPs in the same materialization.

Hypervisor VTEPs

The hypervisors that are directly controlled by the MergeTB platform are trusted entities that are allowed virtual network underlay access. As such they create VTEPs for the virtual machines they host that are not considered trusted infrastructure.

Platform Support

As of Merge v1.0, the following platforms are supported.

  • X86/X86_64 devices that are capable of running Linux
  • Raspberry Pi 4 devices (the 4 introduced reasonable network booting)
  • Nvidia Jetson TX2+ and Jetson Nano
  • QEMU/KVM Hypervisors running on Intel Xeon or AMD Zen platforms

Network Emulation

Network emulation is the capability for users to model links with certain characteristics. In Merge v1.0 the following characteristics are supported on P2P links and MPLs.

  • Latency
  • Capacity
  • Loss

Network emulation is implemented by a centralized emulator and has specific virtual network considerations.

  • Each virtual network link must be cut in half. For example, if nodes A and B are communicating, A cannot be on the same virtual link as B; otherwise communications could bypass the emulator. Thus a dedicated virtual link is created between each node endpoint and the emulator.
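The cut described above can be sketched as a simple transformation over a link's endpoint set. Endpoint and emulator names here are illustrative; the actual realization works over XIR:

```python
def cut_through_emulator(endpoints, emulator="moa0"):
    """Replace a single emulated link with a dedicated virtual link
    between each endpoint and the emulator, so that no two nodes ever
    share a segment that bypasses emulation. Names are illustrative."""
    return [(endpoint, emulator) for endpoint in endpoints]

# an emulated multipoint link between a, b and c becomes three spokes
spokes = cut_through_emulator(["a.eth0", "b.eth0", "c.eth0"])
# [('a.eth0', 'moa0'), ('b.eth0', 'moa0'), ('c.eth0', 'moa0')]
```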

Network emulations can be modified by the user at runtime through the moactld container in the infrapod.

Network emulators operate on VXLAN VTEP devices using EVPN as a control plane. When an emulation link is set up, VTEPs are established at the node-entry-points of the network and on the emulator. The emulator then advertises the MAC addresses of all the node-endpoints in the emulation on adjacent VTEPs, e.g. it will advertise A’s MAC address on the VTEP that tunnels to B. In this way nodes always communicate through the emulator.

The emulator picks up packets off the VTEP devices through XDP, so the host kernel’s network stack does not interfere with packet handling.

Emulation processes are handled through the Moa network emulation daemon. This daemon is always running on emulation nodes. When an experiment is materialized, the testbed facility automation systems send requests to this daemon over gRPC to initialize and start network emulations. On teardown the facility automation systems ensure that these emulations are stopped and discarded.

Node Virtualization

TODO: After the initial implementation is prototyped, we’ll start to write this section.

Mass storage

TODO: After the initial implementation is prototyped, we’ll start to write this section.

User Console access

TODO: After the initial implementation is prototyped, we’ll start to write this section.


Deployment

Deployment of a Merge testbed facility is a five phase process. Merge testbeds are model centric; this is true for both the deployment and operation of the facility. Once the model of the testbed exists, everything else is automated based on that model.

The model development kit (MDK) produces stable models that can be incrementally updated in code when needed. Examples include adding or removing resources from the testbed, or fixing a bug in the connectivity model. When this happens, the MDK produces a new JSON model from the source code that can be applied to the testbed without disrupting current operations.

  1. Create a model of the facility using the MergeTB testbed modeling framework.
  2. Physically construct the network that interconnects the testbed elements.
  3. Prepare the host machines and switches such that Ansible playbooks can execute on them.
  4. Use merge-confgen to generate a tb-deploy configuration from the model created in (1)
  5. Launch the tb-deploy playbook using the configuration generated in (4)


Operations

Merge testbed facilities provide the following operations capabilities.

A command line tool and web interface to

  • List and inspect materializations
  • List and inspect and manage materialization task graphs
  • Dematerialize and rematerialize materializations
  • Manage OS images
  • Manage, visualize, check and fix virtual networks
  • Provide console access to nodes
  • Power cycle nodes, hard and soft

Great progress on the draft!

So I’m assuming the intent here is to develop a technical specification rather than a set of “product” requirements right? Based on our conversation yesterday with Mike and Steve, that seems to be what was being asked for.

With that in mind, here’s some initial feedback based on a quick read through:

  1. It would be good to add a preamble that clearly states a high level summary of what MergeTB is supposed to be.
    • Include a blurb like: “MergeTB is an extensible platform-as-a-service (PaaS) for testbeds. It can be run on-premises as well as off.”
    • Lay out at a 50,000 ft level of abstraction what the constituent moving parts of MergeTB are, e.g. it consists of portals, facilities etc.
    • Add any additional information you feel would help anchor the core concept in the reader’s mind.
    • This will help provide context for a reader as they read the more detailed parts of the document. e.g. “why is this guy talking about this stuff, oh that’s right, this is a PaaS”
    • The preamble should probably go into the section hierarchy somewhere like this:
      # MergeTB 1.0
      # Release Date
      # Overview
  2. More clearly delineate between the functional and non-functional characteristics
    • While the “contract” for how MergeTB software and deployments will be delivered and operated is important to specify in this document, right now you’ve commingled these non-functional concerns with the functional ones, like “a portal lets you make an XDC as a jump box for exploring and managing your experiment(s)”
    • Perhaps, you could switch the sections from this:
      # Release Format
      # Portal Capabilities
      ## Deployment
      ## Operations 
      # Facilities Capabilities
      ## Deployment
      ## Operations 
    • To something like this?
      # Release Format
      # Portal Capabilities
      ## Deployment
      ## Operations 
      # Facilities Capabilities
      ## Deployment
      ## Operations 
      # Releases
      ## General software
      ## Images
      # Portals Deployments
      ## Standard
      ## Distributable 
      # Facilities Deployments 
      # Portals Operations
      # Facilities Operations
    • Goal is to make a more clear distinction between the specification of capabilities vs releases formats vs deployments vs operations
  3. For the Facilities related stuff, it might be helpful to more explicitly call out the “here’s the minimum technical obligations of a facility vendor” versus “here’s what our current facility vendors are delivering”. Especially for things like how to deploy facilities and/or operate them

Thanks for the feedback. I agree that there is tons of muddling of concerns here and it will be a long road to factor things in a way that makes sense as a specification for all concerns involved.

On the specific note about non-functional vs functional, the additional bit of complexity here is that we need to define the spec for multiple audiences (perhaps multiple specs is best then?). The audiences are

  • experimenters
  • operators / administrators
  • developers

And I think what’s functional vs non-functional in some circumstances is relative to the audience?

Good questions.

Ideally we’d have several different documents, aimed at differing audiences:

  • Bills of lading for 3rd party bureaucracies
    • Primarily for auditors
    • Specifications of what MergeTB is as a “product” in their minds
    • Structured to match any statements of work etc.
    • More focused on contractual obligations than being a user manual or developer guide
  • User Manual(s)
    • Primarily for operators/administrators/experimenters/auditors
    • Either one document with different sub-sections per audience, or several different documents
      • For a single document model, have an “Administration” chapter as well as an “Experimentation” chapter
      • For a multi-document model, have a “User’s Manual”, an “Administrator’s Manual” etc.
    • Focus is on usage of the product’s various capabilities from the “end-user” perspective
    • Not a technical specification used to guide developer efforts or accreditation etc.
  • Classic "business requirement(s)"
    • Typical audiences are developers and product/project managers for the development team(s)
    • Intent is to enumerate and prioritize the business needs in a way that can be used to keep development and business stakeholders aligned
  • Classic "technical specification(s)"
    • Typical audiences are developers or advanced users/operators/administrators and maybe auditors
    • Enumerates what the product is supposed to do from an engineering perspective.
    • At a minimum, provides the cross-cutting lore for the architecture and high level design of the constituent parts of the product.
  • Sales pitches or proposals
    • Aimed at potential customers etc.
    • Should distill the product capabilities etc. but not be as detail-oriented as a classic technical specification or requirements document
  • Scholarly publications
    • Aimed at academic/research publications
    • Should lens the other product documents into a form suitable for the academic topic the publication is addressing, while adding any necessary research info etc.

All of the above documents would need to align with each other, and probably share some duplicate content and/or reference each other. And they’d all need to be versioned in some manner to match the release cadence(s) of the technology they are describing.

But given our current constraints etc., I would do one of the following:

  • Write a developer-centric tech spec that can be used to inform a “bills of lading” document for the Merge 1.0
  • Write user manual(s) for operators and experimenters, and then use them to inform a “bills of lading” document for Merge 1.0

I would not try to write a document that is both a developer tech spec and a user guide. You’ll just make yourself miserable and end up frustrating both audiences.

All that being said, if you need to brain dump into a scratch document and then refine it into one of those, then go for it. Just as long as your end goal isn’t trying to be too many kinds of documents.

Just to follow up on my last comment.

I realized I hadn’t addressed your functional/non-functional concerns specifically enough.

In a classic requirements doc or tech spec, those could be called out using that kind of terminology.

But for other kinds of docs, like user or administrator guides, those characteristics get expressed a different way.

For example, a “User Manual” could go into depth on the functional capabilities, while an “Admin Guide” could reference the user manual for those, and then elaborate on the “non-functional” details of operations and deployments. But because each doc was specialized to either functional or non-functional concerns, you might not have to call out that distinction within the docs themselves.

Admittedly the functional/non-functional distinction becomes challenging when we’re delivering tools for folks to build things, as well as built things in and of themselves.

Ok. I think I’ve gotten a few things put together that are starting to smell like victory. Everything is under the Validation GitLab group. Here is the rundown.

  • Specs: contains the Merge specifications in AsciiDoc format. I went with this format for one primary reason: good support for consistent numbering across an entire document. Another nice thing that comes with this repo is that I used the asciidoctor Ruby API to automatically generate a JSON file containing all section numbers (with full nesting) and all numbered lists. This is for other tools to ingest to determine test coverage. Both the HTML format of the docs and the section metadata in JSON are produced by CI as artifacts.

  • TestPlan: contains the test plans that cover the specs. The test plans are compiled into a site using Docusaurus 2, and a current build is available. It’s very much just a starting point, but I’d like to call attention to the following things.

    • The Readme contains a formal way to specify test plans that includes spec coverage, plan description, pipeline linkage, and test source linkage.
    • The TestPlan site acts both as documentation for the plans and as a dashboard for test status.
    • See an example test plan
  • Tests: contains the tests that implement the test plans. The tests are written in Go and follow a simple folder structure that mirrors the test plan structure. Each folder may contain a suite of tests corresponding to a test plan. Each suite of tests is run by an individual CI job, and the test plan links to the pipeline output of that job.

Some future work / needed improvements have already been identified in the TestPlan Issues.

Ok, I will take a look. :+1:

For the “audience at home”, this approach is the outcome of a side-band mattermost/zoom conversation that @ry and I had last week.

Idea is to start with a minimal viable set of docs for specifying and verifying MergeTB as a whole from an engineering perspective. Then cross reference those docs to plans and mechanisms for verifying how well any particular edition of MergeTB fulfills them.

And to do this in a way that is the most ergonomic for contributors while retaining a decent amount of rigor.

Side note, the intent is to get practical, relatively cheap scaffolding in place to facilitate this kind of workflow. We don’t want to invent DOORS for example. :slight_smile: And we’ll re-use existing stuff (like gotests etc.) as much as possible.

I finished a first pass through the Specs repository, and I think it’s a good direction to go.

And I like the integration you have from the asciidoc sections to the dashboard in the TestPlan repository.

I did think of one or more refinements which would be nice from an ease-of-use perspective, and I will add them as issues for later discussion.

As part of that, I created a merge-request to add the capability for building the docs locally (in a way that matched the CI pipeline):

Hi guys, some high level comments.

  1. Why 47 seconds? It’s the kind of number an end user is going to fixate on, as opposed to something more user-friendly like 60 or 90.
  2. I suggest that for failure modes, you should map out fault and response trees in order to constrain the number of failures that will require administrator intervention. FMEA in particular on the portal/testbed interaction will help reduce the number of midnight calls we have to field.
  3. We’ll need some administrator documentation.
  4. What’s the access control model for the API?

Regarding events, I would like the following available:

  1. Account created
  2. Account enabled
  3. User logon
  4. XDC connection
  5. File access/XDC/interval (e.g., XDC x read 2 GB in the last 2 hours). Can be a fixed interval.
  6. Heartbeat signals (messages that a process is alive)
  7. Account disabled
  8. Logon failure
  9. XDC Creation
  10. XDC Destruction
  11. Materialization
  12. Dematerialization

I’m going to use EFK for monitoring, so I can snarf syslog.

Thanks for the feedback @mpcollins, and note that the content of the spec is moving from this forum to the Specs repo.

Why 47 seconds?

Watch more Star Trek.

Fault and response trees

Yes, having this level of rigor for failure modes is eventually where we want to be.

Administrator documentation

As a part of the 1.0 specs, specifications for administrative functionality for both the portal and the facility will be provided along with tests that assert the specs are in fact true. Documentation will in turn be derived from those tests.

Access control model for API

This is called the policy layer; it’s somewhat documented here, but it will be formally specified in the 1.0 spec.


This would likely require the introduction of an auditing layer, similar to the policy layer architecturally.