First-class storage objects and artifacts

This post proposes support for storage devices and artifacts as first-class entities in Merge.

Conceptual Model

The image below illustrates the conceptual model.

Let’s highlight the key components:

Dedicated storage resources

The site supports a notion of first-class mass storage devices (a rough access sketch follows this list). These include:
- Block devices: i.e., the type of device that typically presents as /dev/sd[a,b,c,...] in Linux (e.g., a hard drive). Users can partition them, format them with filesystems, etc., or simply leave them as raw storage devices. These are mounted via iSCSI and, at least initially, will only be available to virtual machines. Bare-metal support is possible but may require BIOS/UEFI integration.
- Filesystems: e.g., NFS or ZFS. These are network-mounted somewhere like /nfs/... on the desired experiment nodes. Available to both bare-metal nodes and VMs.
- Objects: e.g., S3 buckets. These are accessible via MinIO using the S3 storage API. Available to both bare-metal nodes and VMs.
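
As a rough illustration (not an actual Merge workflow), here is a minimal sketch of how an experiment node might consume the filesystem and object variants once they are provisioned; block devices are attached to VMs via iSCSI by the infrastructure, so they are not shown. The endpoint address, credentials, export path, and bucket name are all placeholders.

    # Hypothetical sketch: consuming site storage from an experiment node.
    # Endpoint, credentials, and paths are placeholders, not real Merge values.
    import subprocess
    import boto3

    # Filesystem storage: mount an NFS export served over the infranet.
    subprocess.run(
        ["mount", "-t", "nfs", "stor0.infranet:/exports/exp-data", "/nfs/exp-data"],
        check=True,
    )

    # Object storage: talk to MinIO through the standard S3 API.
    s3 = boto3.client(
        "s3",
        endpoint_url="http://stor0.infranet:9000",    # placeholder MinIO endpoint
        aws_access_key_id="EXPERIMENT_KEY",           # placeholder credentials
        aws_secret_access_key="EXPERIMENT_SECRET",
    )
    s3.download_file("exp-bucket", "inputs/dataset.csv", "/tmp/dataset.csv")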

Network accessibility

All mass storage devices are accessible over the Merge infranet. Raw storage devices should likely be provisioned on a dedicated storage node or storage cluster (e.g., stor0 above), but could be hosted on an infrastructure server (e.g., ifr0) to reduce cost and cabling needs.

Artifacts

Artifact archival

While mass storage resources live on the site, artifacts are stored on the portal. Storage devices can be made into artifacts via the archival process (we can consider a different name). Archival is explicitly requested by a user through the Merge API. It involves packaging up the storage device's contents, creating a compressed tarball, and transferring it to the Merge portal, where it is placed in a dedicated storage volume on the portal storage cluster.
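
As a minimal sketch of the packaging step, assuming the archival agent can see the storage contents at a local path (the paths and naming scheme here are placeholders, and the real agent would also record metadata and stream the result to the portal):

    # Hypothetical sketch of the archival packaging step.
    import tarfile

    def package_storage(contents_dir: str, artifact_id: str) -> str:
        """Package a storage device's contents into a compressed tarball."""
        tarball = f"/var/tmp/{artifact_id}.tar.gz"
        with tarfile.open(tarball, "w:gz") as tar:
            tar.add(contents_dir, arcname=artifact_id)
        return tarball

    # e.g., package_storage("/mnt/exp-volume", "botnet.discern-v3")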

Artifact contents

Artifacts consist of data and metadata; a rough sketch of the metadata representation follows the two lists below.

Artifact data includes:

  • Zero or more site storage devices
  • Zero or more directories uploaded from experiment nodes
  • Zero or more directories uploaded from an XDC

Artifact metadata includes:

  • Version number
  • Creation date
  • Access control policy (i.e., who is allowed to access it?)
  • Deployment metadata:
    • Merge Experiment ID (e.g., botnet.discern)
    • Scripts to facilitate deployment/experiment provisioning
      • e.g., Jupyter notebook
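
A rough sketch of how this metadata might be represented (field names are illustrative only, not a settled schema):

    # Illustrative only: one possible shape for artifact metadata.
    from dataclasses import dataclass, field
    from datetime import datetime
    from typing import List

    @dataclass
    class DeploymentMetadata:
        experiment_id: str                                 # e.g., "botnet.discern"
        scripts: List[str] = field(default_factory=list)   # e.g., Jupyter notebooks

    @dataclass
    class ArtifactMetadata:
        version: int
        created: datetime
        access_policy: List[str]            # who is allowed to access the artifact
        deployment: DeploymentMetadata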

Artifact deployment

First-class objects can be deployed automatically. When a user deploys an artifact, we check the artifact metadata for the experiment ID. The experiment XIR tells us which types of devices are present and which site they must live on. Merge then realizes and materializes a version of that experiment, and the site provisions the storage resources. The site may have them cached locally in the artifact cache, but otherwise must download them from the artifact store on the portal.
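
A sketch of that deployment flow, with the helper functions stubbed out as stand-ins for whichever Merge components actually perform each step:

    # Hypothetical control flow for automated artifact deployment.
    # The helpers below are stand-ins, not existing Merge APIs.

    def fetch_experiment_xir(experiment_id): ...            # model service stand-in
    def realize(xir): ...                                    # realization stand-in
    def materialize(realization): ...                        # materialization stand-in
    def site_artifact_cache_has(device): return False        # site artifact cache stand-in
    def download_from_portal_artifact_store(device): ...     # portal artifact store stand-in
    def provision_storage(materialization, device): ...      # site storage provisioning stand-in

    def deploy_artifact(artifact):
        # Look up the experiment tied to this artifact.
        experiment_id = artifact.metadata.deployment.experiment_id

        # The experiment XIR tells us which device types are needed and on which site.
        xir = fetch_experiment_xir(experiment_id)

        # Realize and materialize a version of that experiment.
        materialization = materialize(realize(xir))

        # The site provisions storage, preferring its local artifact cache.
        for device in artifact.data.storage_devices:
            if not site_artifact_cache_has(device):
                download_from_portal_artifact_store(device)
            provision_storage(materialization, device)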

Unstructured data can also be included in artifacts. This includes arbitrary directories from experiment nodes or XDCs. We don't deploy these automatically, since we don't have enough information about where they came from in the first place (assuming they were simply uploaded via something like mrg create artifact --upload <path to directory>). Note, however, that a user could write a Jupyter notebook or script that takes this unstructured data and deploys it through any process they like.
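
For example, a user's deployment notebook or script could push an uploaded directory wherever it belongs, say via scp from an XDC. The node names and paths here are made up; the point is that placement of unstructured data is user-defined:

    # Hypothetical user-side deployment of unstructured artifact data from an XDC.
    import subprocess

    NODES = ["a", "b", "c"]                   # experiment node names from the user's topology
    SRC = "/home/user/artifact/config-dir"    # directory pulled down from the artifact

    for node in NODES:
        subprocess.run(["scp", "-r", SRC, f"{node}:/etc/myapp/"], check=True)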

I like it. Overall it's a nice, clean design. I'd add the creator's username to the artifact's metadata, and possibly their email address, so people can ask questions if any arise during deployment.


Looks good overall.

Do you see deployment of an artifact being post-materialization? Will there be a mrg deploy... command?

And/or will the deployment be included in the experiment model?

Will users deploy (mount) to specific nodes or to all nodes?

Will there be support to deploy multiple artifacts to an experiment?

Support for exporting artifacts would be useful as well I think. Let a researcher export artifacts that can then be incorporated into their containers for analysis or for non-Merge-based demos.

All good questions.

At a high level I view an artifact as mostly a snapshot of an experiment at some revision. While currently we view an experiment mostly as just a topology, we could think of it more holistically as:

  1. An environment. Some of this is static (typically including topology, network configuration, and OS selection), while some could be dynamic – e.g., changing link rates via Moa to emulate a mobile network or orbital system.
  2. A set of inputs to the experiment – e.g., datasets, traffic configurations, etc.
  3. A set of outputs generated by the experiment

An artifact could capture any or all of the above characteristics. The post touches mostly on inputs and outputs through its focus on storage types, but Ansible/Jupyter/etc. to administer the runtime environment are definitely part of it.

Do you see deployment of an artifact being post-materialization? Will there be a mrg deploy... command?

And/or will the deployment be included in the experiment model?

Not a requirement, but the more information in the model the better. This lets Merge automate more of the work for you – e.g., if you tell us which storage devices hold relevant data, we can archive them automatically.

Deployment of an artifact involves materialization and post-materialization work. Materialization because the artifact may be tied to a particular realization of resources (e.g., mapping of nodes to specific device types on sites), and post-materialization for the runtime orchestration components.

We could consider something like mrg artifact deploy ... as a one-liner to orchestrate the whole process.

Deployment can have manual steps too, since I don't think we want to force a certain operational model on users. That is, we can give them things we think will help, but if they prefer to manually say "archive this dataset from this directory on nodeA", we should support that too.

Will users deploy (mount) to specific nodes or to all nodes?

Totally up to the user and how they define their model and (if provided) their manual deployment scripts.

Will there be support to deploy multiple artifacts to an experiment?

Given the model so far, this wouldn't be straightforward. Users could of course download artifacts into an XDC and deploy them manually on top of an existing experiment, in cases where that makes sense. But it's hard to see how Merge would automate much of that.

Support for exporting artifacts would be useful as well I think. Let a researcher export artifacts that can then be incorporated into their containers for analysis or for non-Merge-based demos.

Absolutely. mrg artifact download ... would pull the whole artifact down.

One question is whether we want to develop something like a manifest that could be defined with an experiment (and possibly sit at a well-defined location in the experiment git repo). The manifest could include all of the relevant information to set up, run, and capture results of the experiment – and by extension, would include most or all of the information needed to create an artifact from it.
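
As a strawman, such a manifest might look something like the following (expressed here as a Python dict purely for illustration; the keys and paths are not a defined format):

    # Strawman experiment manifest; keys and paths are illustrative only.
    manifest = {
        "experiment": "botnet.discern",
        "revision": "HEAD",
        "setup": ["ansible/site.yml"],                        # provisioning / runtime orchestration
        "run": ["notebooks/run.ipynb"],                       # how to execute the experiment
        "inputs": {"datasets": ["s3://exp-bucket/inputs/"]},  # storage holding inputs
        "outputs": {"results": ["/nfs/exp-data/results/"]},   # what to archive into the artifact
    }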

This sounds like the Merge CI idea, which I’ve always liked.

Maybe I misunderstand the model. If an artifact can be a block device, can Merge not mount multiple block devices? What is in the model that would not allow multiple artifacts to be deployed?

If the block device was just that – a device with some data, not tied to any node or experiment – then that would work. So people could publish artifacts that are just sets of data living on a device; that actually makes some sense.

What is the case where that doesn’t make sense?

Well, for the general idea of experiment-as-artifact, you wouldn't be able to deploy multiple experiments into one experiment.

For artifacts that are just data, not full-blown experiments, it makes sense.

An interesting possibility for an experiment-as-artifact (at least a VM-based one) is the archival of the VMs' overlays with the possibility of redeploying them. If we could do the same for XDCs, we might be able to fully capture an experiment, including all added customization.

That's an interesting point. If we have support for node snapshotting, a user can snapshot their nodes and have the snapshotted images be part of the artifact definition. Certainly that is consistent with the overall architecture and goals of the artifact.
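
For the VM case, a sketch of what capturing a node's disk state might look like using qemu-img, assuming qcow2 overlays backed by a shared base image (paths are placeholders, and the real flow would depend on the site's hypervisor setup):

    # Hypothetical snapshot capture for a qcow2-based VM node.
    # Paths are placeholders; the overlay's backing (base) image is read
    # automatically by qemu-img when flattening.
    import subprocess

    OVERLAY = "/var/lib/vms/nodeA/overlay.qcow2"   # per-node overlay (placeholder)
    SNAPSHOT = "/var/tmp/nodeA-snapshot.qcow2"

    # Flatten the overlay and its backing chain into a standalone image that
    # can be archived as part of the artifact and redeployed later.
    subprocess.run(["qemu-img", "convert", "-O", "qcow2", OVERLAY, SNAPSHOT], check=True)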