What's New in MergeTB v1

Welcome to 2021 and MergeTB v1. This post goes over what’s new in Merge v1 from an architecture, design, and development perspective. To get a sense for what an experimenter’s workflow looks like in v1, take a look at this guided tour through a hello world experiment.

We’ll start with a quick overview of what is new or changed, and then dive into how things plug into development and operations.

Overview of v1 Changes

  • Experiments as Repositories: In Merge v0.9 we approximated a principled versioned workflow with the experiment push/pull semantics. In v1, Git is fully integrated into the portal and experiments are defined as whole versioned repositories, not just a single file.

  • Leveraging K8s Ingress: In Merge v0.9 we exposed public services as K8s external services and used a standalone HA-proxy/keepalived setup to route traffic into the K8s cluster. In v1 all of this is done by the K8s Ingress mechanism, obviating the need for a standalone external ingress mechanism.

  • Single Integrated Portal API: In v0.9 we had a gRPC API for each core service, an OpenAPI 2.0 spec implementation for experimenter-facing capabilities, and a bunch of code that manually translated data structures and call semantics between the two. In v1 we leverage gRPC-Gateway to act as a single source for data structures and API semantics, with a generated OpenAPI 2.0 spec and implementation that automatically bridges between REST and gRPC. This reduces complexity in both the portal itself and its clients, and lets us offer both a REST and a gRPC API as public endpoints.

  • Scalable Object Storage: We’ve been talking about this for a long time, and now it’s finally here. In v0.9 we stored everything in Etcd. Etcd really does not like storing values much over 1.5 MiB, and when experiments get large and complex, compiled models can easily reach into the 100s of MiB. This has resulted in total lockups of Etcd in production. In v1 we use MinIO for all object types that have the possibility to become large.

  • Reconciler Architecture: In v0.9 you could consider the portal design a centralized orchestration system, where the core services were directly controlled by the API and had no real autonomy of their own. In v1 we’ve adopted the reconciler architecture, where the API simply communicates the desired state on behalf of user requests, and the core services autonomously observe desired state and current state and continuously drive the current state toward the desired state. To quote the above link: “Because all action is based on observation rather than a state diagram, reconciliation loops are robust to failures and perturbations: when a controller fails or restarts it simply picks up where it left off”. The Cogs already work somewhat in this way in v0.9; in v1, much more so.

  • Protocol Buffers based XIR: In v0.9 XIR was embedded in JSON. This is problematic for two primary reasons: 1) the schema is implicit in code, and 2) the serialization is space inefficient. The impact of (1) is that there is no real structure around what defines XIR objects, and consistency between subsystems is left up to developers working out what data structures should look like from the code that generates them. The impact of (2) is massively inefficient representations for both interchange and storage. Moving to Protocol Buffers solves both of these problems.

  • Virtualization Support: We now support virtualized nodes: in expression through the Python MX library, in interchange through XIR, and in implementation through the Cogs’ QEMU/KVM integration.

  • Experiment Mass Storage: Experimenters can now define storage volumes, either statically or ephemerally, in experiments and attach them to devices. Mass storage is available either as file systems or block devices. This system, created by @lincoln, is underpinned by Ceph and goes by the development name Rally.

  • Experiment Orchestration: Experience from our first few years of deployment with Merge, and heavy use of Ansible as a makeshift experiment orchestrator, has shown that while Ansible and the general ecosystem of tools it exists within are great at configuration management, they are not orchestration tools. To this end @alefiya has been working on the Orchid Orchestrator as the next evolution of Magi.

  • Certificate Based SSH: SSH keys used in a Merge context are now generated by the Merge portal and take a certificate based approach. This solves several problems that have plagued users. First, when jump containers roll, or an XDC name is reused, or a node name is reused, SSH will complain about shifting hosts. Second, generating and submitting SSH keys, and understanding how SSH works in general, has been an obstacle for many users; by having the portal set things up in a way that is guaranteed to work on behalf of users, much of this pain is taken away from users and operators alike.

  • Self Contained Authentication: In v0.9 we relied on Auth0 as both an identity provider and an authentication provider. In v1, spearheaded by @glawler, we have integrated the Ory identity infrastructure into the Portal. This solves several problems: 1) streamlined registration, which was a point of confusion for users in v0.9 who had to both register with Auth0 and initialize with MergeTB, 2) support for fully isolated operations, 3) support for OAuth2 delegation flows to integrate with other services such as GitHub, and 4) full control over token issuance to create more natural workflows for CLI clients and OAuth2 server based clients.

  • Self Contained Registry: The portal now comes with a built-in registry. We got this by switching from vanilla K8s to OKD. This allows the images in the portal to operate on a push-in model rather than a pull-from model, which gives the portal a self contained installation and also lets it act as a container provider for facilities, further bolstering capabilities for walled garden environments.

  • Self Contained Installers: Both the portal and facility technology stacks now come with a single, standalone, self-contained installer.

  • Cyber Physical Systems Support: Experimenters can now describe physical objects and a set of differential-algebraic equations, define connections between object variables, define sensors and actuators that connect physical objects to cyber ones and use the simulation API to control a CPS experiment.

  • Organizations: The portal now has the concept of an Organization. If a user or project joins an organization, they are delegating authority to manage their account or project to the maintainers of that organization. Organization maintainers can initialize and freeze users within their organizations. This is initially in support of educational use cases for labs and classrooms, but may also turn out to be useful for large labs or companies or government groups using Merge.

Navigating Changes

Experiments as Repositories

Experiments as repositories is implemented as a portal core service under the reconciler architecture. The Git Core Service basically does the following things.

  • Observes when a new experiment is created through an etcd watcher.
  • Creates a git repository in a K8s persistent volume using the excellent go-git library.

This is accomplished through the reconciler. The Git Core Service is also responsible for observing new commits that come into the repository and creating observable etcd events for other reconcilers to pick up. The primary example of this is when a new commit is pushed: the Model Core Service needs to observe that a new revision is available and compile that revision. The Git Core Service accomplishes this by installing a post-receive hook in each Git repository it creates. On each push, Git runs this hook, which places a new revision in etcd that others can observe via notification.
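
For illustration, here is a minimal sketch of the repository-creation step using go-git. The /repos/<project>/<experiment> path layout and the git-notify helper invoked by the hook are assumptions, not the service’s actual code.

// A minimal sketch of creating an experiment repository with go-git.
// The repo path layout and the git-notify helper are assumptions.
package gitsvc

import (
	"fmt"
	"os"
	"path/filepath"

	git "github.com/go-git/go-git/v5"
)

// hypothetical hook body: surfaces the newly pushed revision so other
// reconcilers (e.g. the Model Core Service) can observe it via etcd
const postReceiveHook = `#!/bin/sh
exec /usr/local/bin/git-notify "$@"
`

func createExperimentRepo(project, experiment string) error {
	repoPath := filepath.Join("/repos", project, experiment)

	// bare repository on the K8s persistent volume
	if _, err := git.PlainInit(repoPath, true); err != nil {
		return fmt.Errorf("init %s: %w", repoPath, err)
	}

	// install the post-receive hook that runs on every push
	hook := filepath.Join(repoPath, "hooks", "post-receive")
	return os.WriteFile(hook, []byte(postReceiveHook), 0755)
}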

The Git service as a whole is made available to users through a K8s Ingress object that is put in place by the installer. The Git Core Service runs an HTTP service that uses HTTP Basic Authentication as supported by Git clients. This allows tokens to be provided as user credentials in the same way GitHub and GitLab support client requests, such as

git clone https://<token>@git.mergetb.net/<project>/<experiment>

When the token is received by the Git Core Service, it validates the token with the internal Ory identity infrastructure and decides whether to allow or deny the request. Tokens are available to users through the new Merge command line tool:

mrg whoami -t

Leveraging K8s Ingress

The semantics of K8s ingress are delightfully simple: define an external domain to listen on and plumb that listener to some service inside K8s. The installer is responsible for setting up Ingress objects. Currently there are 4 Portal Ingress objects.

  • api.mergetb.net: Forwards REST API traffic to the Portal apiserver.
  • grpc.mergetb.net: Forwards gRPC API traffic to the Portal apiserver.
  • git.mergetb.net: Forwards Git HTTPS traffic to the Portal Core Git Service.
  • auth.mergetb.net: Forwards the public Ory APIs to the Portal’s internal identity infrastructure.
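
As a rough illustration of what the installer puts in place, here is a minimal client-go sketch for the API Ingress; the namespace, backend service name, and port are assumptions, not the installer’s actual values.

// A minimal sketch of creating the api.mergetb.net Ingress with client-go.
// Namespace "merge", service "apiserver", and port 443 are assumptions.
package installer

import (
	"context"

	networkingv1 "k8s.io/api/networking/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

func createAPIIngress(ctx context.Context, cs *kubernetes.Clientset) error {
	pathType := networkingv1.PathTypePrefix
	ing := &networkingv1.Ingress{
		ObjectMeta: metav1.ObjectMeta{Name: "api", Namespace: "merge"},
		Spec: networkingv1.IngressSpec{
			Rules: []networkingv1.IngressRule{{
				Host: "api.mergetb.net",
				IngressRuleValue: networkingv1.IngressRuleValue{
					HTTP: &networkingv1.HTTPIngressRuleValue{
						Paths: []networkingv1.HTTPIngressPath{{
							Path:     "/",
							PathType: &pathType,
							Backend: networkingv1.IngressBackend{
								Service: &networkingv1.IngressServiceBackend{
									Name: "apiserver", // assumed service name
									Port: networkingv1.ServiceBackendPort{Number: 443},
								},
							},
						}},
					},
				},
			}},
		},
	}
	_, err := cs.NetworkingV1().Ingresses("merge").Create(ctx, ing, metav1.CreateOptions{})
	return err
}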

Because this obviates the need for an external HA-Proxy instance, we can completely install and run a portal on a single VM. I’m currently doing this using OKD’s Code Ready Containers.

note: OKD ingress is not completely working for reencrypt routes. Reencrypt routes are nice because they let admins manage certificates at the edge (Ingress) for replicated services, but still maintain encryption within the K8s cluster. I have a PR open which addresses the issues I came across.

Single Integrated Portal API

The Merge Portal as a whole now has one comprehensive API. In the gRPC API definition you will find many of the core service APIs organized as gRPC services, such as workspace. Also notice that every RPC call definition has HTTP options specified. This is how we tell the gRPC-Gateway generator to generate corresponding REST API code for us. This top-level API object (which is poorly named xp.proto at the moment) only contains the form and structure of the API in terms of RPC calls. The data structure definitions are captured in a set of supporting protobuf definitions organized by service.

This set of protobufs is compiled in 2 phases.

Implementing the gRPC service is business as usual and exactly the same as we’ve been doing for all of our gRPC services up to this point.

Implementing the REST service using the generated gateway code is very similar to the gRPC flow.
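
For example, standing up the REST side looks roughly like the sketch below. The generated package path, the RegisterWorkspaceHandlerFromEndpoint name, and the apiserver address are assumptions standing in for whatever the xp.proto services actually generate.

// A minimal sketch of serving the REST API through the generated gateway.
// The pb package path and handler registration name are assumptions.
package main

import (
	"context"
	"log"
	"net/http"

	"github.com/grpc-ecosystem/grpc-gateway/v2/runtime"
	"google.golang.org/grpc"

	pb "example.org/mergetb/api/portal/v1" // hypothetical generated package
)

func main() {
	ctx, cancel := context.WithCancel(context.Background())
	defer cancel()

	// the gateway mux translates REST requests into gRPC calls
	mux := runtime.NewServeMux()
	opts := []grpc.DialOption{grpc.WithInsecure()}

	// point the gateway at the gRPC apiserver (address is an assumption)
	if err := pb.RegisterWorkspaceHandlerFromEndpoint(ctx, mux, "localhost:6000", opts); err != nil {
		log.Fatal(err)
	}

	log.Fatal(http.ListenAndServe(":8080", mux))
}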

A New Merge CLI App

The availability of a public gRPC API dramatically reduces the complexity of implementing the Merge CLI application. We can now leverage the generated gRPC client code and data structures directly in the implementation of this CLI application. This code has been started in the ry-v1 branch.
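
A minimal sketch of what a CLI subcommand built on the generated client might look like; the generated package, the Workspace client, and the GetProjects call are illustrative assumptions rather than the actual API surface.

// A minimal sketch of a CLI command using the generated gRPC client.
// The pb package and its Workspace/GetProjects names are assumptions.
package main

import (
	"context"
	"crypto/tls"
	"fmt"
	"log"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials"

	pb "example.org/mergetb/api/portal/v1" // hypothetical generated package
)

func main() {
	// dial the public gRPC endpoint over TLS
	creds := credentials.NewTLS(&tls.Config{})
	conn, err := grpc.Dial("grpc.mergetb.net:443", grpc.WithTransportCredentials(creds))
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	cli := pb.NewWorkspaceClient(conn)
	resp, err := cli.GetProjects(context.Background(), &pb.GetProjectsRequest{})
	if err != nil {
		log.Fatal(err)
	}
	for _, p := range resp.Projects {
		fmt.Println(p.Name)
	}
}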

Scalable Object Storage

We are now using MinIO to store object types that have the potential to be large. For the portal this includes

  • Compiled Experiment XIR
  • Realizations
  • Facility XIR
  • Views

MinIO is deployed by the Portal Installer as a K8s pod and service. A MinIO client is available from the Portal Storage Library that uses access keys generated by the Portal Installer and provided to pods through plumbed environment variables.

Here are a few examples of MinIO usage in my working branch of the Portal.
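
In lieu of links, here is a minimal sketch of the PutObject pattern with minio-go; the endpoint, bucket, object key, and environment variable names are assumptions rather than the Portal Storage Library’s actual values.

// A minimal sketch of storing a compiled model in MinIO with minio-go.
// Endpoint, bucket, object key, and env var names are assumptions.
package main

import (
	"bytes"
	"context"
	"log"
	"os"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
)

func main() {
	// access keys are plumbed into the pod as environment variables
	mc, err := minio.New("minio:9000", &minio.Options{
		Creds: credentials.NewStaticV4(os.Getenv("MINIO_ACCESS"), os.Getenv("MINIO_SECRET"), ""),
	})
	if err != nil {
		log.Fatal(err)
	}

	xir := []byte{ /* compiled experiment XIR blob */ }
	_, err = mc.PutObject(
		context.Background(),
		"models",       // bucket
		"proj/exp/rev", // object key
		bytes.NewReader(xir), int64(len(xir)),
		minio.PutObjectOptions{ContentType: "application/octet-stream"},
	)
	if err != nil {
		log.Fatal(err)
	}
}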

Reconciler Architecture

The reconciler architecture as implemented in the Merge portal is the idea that

  1. The apiserver defines what the aggregate target state of the Portal is by updating Etcd and to a certain extent MinIO with target states in response to API calls.
  2. A collection of services reacts to target state updates and drives the actual state of underlying systems to the target state.
  3. When a service starts up, it observes the current state of the elements it presides over, reads the target state, and drives any mismatch toward the target. Then it goes into reactive mode as described in (2).
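
Concretely, a reconciler’s main loop looks roughly like this sketch; the /experiments/ key prefix and the reconcile helper are hypothetical stand-ins for the real keys and drive logic.

// A minimal sketch of the observe/drive loop using the etcd v3 client.
// The key prefix and reconcile helper are hypothetical.
package reconciler

import (
	"context"
	"log"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func run(ctx context.Context, etcd *clientv3.Client) {
	// 1. startup: read the current target state and drive toward it
	resp, err := etcd.Get(ctx, "/experiments/", clientv3.WithPrefix())
	if err != nil {
		log.Fatal(err)
	}
	for _, kv := range resp.Kvs {
		reconcile(kv.Key, kv.Value)
	}

	// 2. reactive mode: drive every subsequent change in target state
	for wr := range etcd.Watch(ctx, "/experiments/", clientv3.WithPrefix()) {
		for _, ev := range wr.Events {
			reconcile(ev.Kv.Key, ev.Kv.Value)
		}
	}
}

// reconcile compares target state to actual state and closes the gap (elided)
func reconcile(key, value []byte) { /* ... */ }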

Portal Services that currently implement the reconciler architecture are

  • git
    • Observes: Experiment create/delete
    • Drives: Git repository create/delete
    • Notifies: Git push events
  • model
    • Observes: Git push events
    • Drives: Model compilation and XIR data management
    • Notifies: Experiment compilation events
  • realize
    • Observes: Realization requests
    • Drives: Model embedding and resource allocation
    • Notifies: Realization completion events

Portal services that still need to be implemented are

  • xdc
    • Observes: XDC requests
    • Drives: K8s XDC pods and services
    • Notifies: XDC availability
  • mergefs
    • Observes: Experiment and Project create/delete
    • Drives: Mergefs users and groups
    • Notifies: Mergefs user/project readiness
  • credential manager
    • Observes: Experiment and Project create/delete
    • Drives: SSH key provisioning
    • Notifies: SSH key availability
  • materialization
    • Observes: Materialization requests
    • Drives: Multi-site materializations and route reflectors
    • Notifies: Materialization state

Protocol Buffers based XIR

In v0.9 there was a generalized XIR model that both experiments and facilities crammed into. In v1 this model is flipped on its head: we start with well defined experiment and facility models, and provide the ability to lift those models into a generalized network representation. The following represents the model graphically.

Here the generalized network model is composed of

  • Devices
  • Interfaces that belong to devices.
  • Edges that belong to connections and plug into interfaces.
  • Connections that connect devices through interfaces and edges.

Experiment network models are composed of

  • Nodes
  • Sockets that belong to nodes
  • Endpoints that belong to links and plug into sockets
  • Links that connect nodes through sockets and endpoints

Resource network models are composed of

  • Resources
  • Ports that belong to resources
  • Connectors that belong to cables and plug into ports
  • Cables that connect resources through ports and connectors

Physical network models are composed of

  • Phyos (physical objects)
  • Variables that belong to phyos
  • Couplings that belong to bonds and attach to variables
  • Bonds that connect phyos through variables and couplings

Mechanically, the XIR data objects are defined in the core.proto protobuf file, and there are libraries surrounding these generated data structures for Go and Python. At the current time the Go library is geared toward describing testbed facilities and the Python library toward describing experiments.
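
Purely as illustration, the structural relationships of the generalized model can be pictured as the following Go types; the real generated types from core.proto will differ in naming and detail.

// Illustrative only: the generalized network model as plain Go structs.
// The actual types are generated from core.proto.
package xir

// A Device owns a set of Interfaces.
type Device struct {
	Name       string
	Interfaces []*Interface
}

type Interface struct {
	Name string
}

// A Connection owns a set of Edges; each Edge plugs into an Interface,
// which is how Connections tie Devices together.
type Connection struct {
	Name  string
	Edges []*Edge
}

type Edge struct {
	Interface *Interface
}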

Virtualization Support

The initial cut of virtualization support is up and working using Minimega from Sandia. Minimega is a great technology; however, it’s not designed to operate as a long lived daemon. It’s designed to spin up and tear down with experiments. This is creating friction points in trying to implement a reliable virtualization service. Discussion revolving around the general implementation of virtualization can be found at the following link

The basic design of the current Minimega system is to install Minimega on each infrapod server as a head node. Head nodes do not actually spawn VMs, but expose the Minimega CLI API (which is another friction point; something like gRPC would be nice) to the Cogs to control machines that are imaged with a hypervisor/Minimega image. In this way Rex can use the Minimega API to manage mesh state with other Minimega nodes over the testbed management network and deploy virtual machines. If a hypervisor or VM process crashes, it’s entirely up to the Cogs to detect and reconstitute the hypervisor or VM state once it becomes possible to do so again.

We have our own branch of Minimega. The primary modification in that repo is support for vanilla Linux bridging to avoid having to use OpenVSwitch.

I’m going to try my best to stick with Minimega, but if it turns out not to be a great fit (not because Minimega is not great, just that we are covering different use cases and this may be a square peg for a round hole), then I’ll implement a simple QEMU/KVM reconciler daemon that integrates with the Cogs’ etcd.

Experiment Mass Storage

Rally is on its way to being integrated with my v1 branches thanks to the efforts of @lincoln. I’ve not had enough hands-on time with Rally yet to provide any meaningful development guidance here.

Experiment Orchestration

Coming soon!

Certificate Based SSH

This is on the TODO list; it should take the form of a reconciler in the portal as outlined in the Reconciler Architecture section. Based on my reading of how user SSH certificates are commonly deployed, there are two paths here. Some references here

  1. Issue short lived (1 day) keys from the API to users in exchange for an authentication token + short-lived/single use certs to tools through JWKs.
  2. Create long lived (permanent but revocable) keys when users are created from a reconciler.

For host SSH certificates we have two targets we need to provision certs for

  1. XDCs
  2. Testbed nodes

XDC host certificates should be generated by the XDC reconciler. The testbed node certificates should be generated by the materialization reconciler.
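
A minimal sketch of what a portal-side signer for option (1) could look like, using golang.org/x/crypto/ssh; the CA signer, principal naming, and validity window are assumptions, not a settled design.

// A minimal sketch of issuing a short lived (1 day) user certificate,
// signed by a portal CA key. Principal naming is an assumption.
package sshca

import (
	"crypto/rand"
	"time"

	"golang.org/x/crypto/ssh"
)

func signUserCert(ca ssh.Signer, pubKey ssh.PublicKey, user string) (*ssh.Certificate, error) {
	cert := &ssh.Certificate{
		Key:             pubKey,
		CertType:        ssh.UserCert,
		KeyId:           user,
		ValidPrincipals: []string{user},
		ValidAfter:      uint64(time.Now().Unix()),
		ValidBefore:     uint64(time.Now().Add(24 * time.Hour).Unix()),
		Permissions: ssh.Permissions{
			Extensions: map[string]string{"permit-pty": ""},
		},
	}
	// sign with the portal's CA key so hosts that trust the CA accept it
	if err := cert.SignCert(rand.Reader, ca); err != nil {
		return nil, err
	}
	return cert, nil
}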

Self Contained Authentication

Self contained authentication has been in the works for quite some time, thanks to the efforts of @glawler. Here I’ll outline how things are currently integrated.

The Portal Installer is responsible for provisioning the Ory Identity Infrastructure inside the Portal K8s cluster.

When a request comes into the portal, the vast majority of the RPC service endpoints implement an authentication check that extracts a token from the gRPC context metadata and checks that token with the internal Ory Kratos identity server.
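
As a rough sketch, that check can be expressed as a gRPC unary interceptor that forwards the token to the Kratos whoami endpoint; the Kratos hostname, port, and header handling shown here are assumptions rather than the portal’s actual code.

// A minimal sketch of the token check as a gRPC unary interceptor.
// The Kratos URL and header usage are assumptions.
package auth

import (
	"context"
	"net/http"

	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/metadata"
	"google.golang.org/grpc/status"
)

func authInterceptor(
	ctx context.Context,
	req interface{},
	info *grpc.UnaryServerInfo,
	handler grpc.UnaryHandler,
) (interface{}, error) {

	// pull the token out of the gRPC context metadata
	md, ok := metadata.FromIncomingContext(ctx)
	if !ok || len(md.Get("authorization")) == 0 {
		return nil, status.Error(codes.Unauthenticated, "no token")
	}
	token := md.Get("authorization")[0]

	// ask Kratos who this session token belongs to
	hreq, _ := http.NewRequestWithContext(
		ctx, "GET", "http://kratos:4433/sessions/whoami", nil)
	hreq.Header.Set("X-Session-Token", token)

	resp, err := http.DefaultClient.Do(hreq)
	if err != nil || resp.StatusCode != http.StatusOK {
		return nil, status.Error(codes.Unauthenticated, "invalid token")
	}
	resp.Body.Close()

	// token is valid, let the RPC proceed
	return handler(ctx, req)
}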

Login for API clients is now built directly into the API. Login for web clients should use the Ory API directly, which is exposed by a dedicated Portal Ingress as described in the Leveraging K8s Ingress section.

We are using Ory Kratos v0.5, which has excellent documentation.

At the current time, we have only integrated Ory Kratos and not Ory Hydra, which means we cannot support OAuth delegation flows from external entities such as GitHub and GitLab. I view this as absolutely critical to support, but it probably won’t make it into the first v1 release. We do need to do some study to see whether we can implement full OAuth support within the v1 release without breaking the current Kratos mechanisms.

Self Contained Registry

We essentially got this for free by using OKD. OKD comes with a built-in registry out of the box, with integrated OAuth token flows for managing the container registry as well as integration with K8s RBAC for segregating parts of the registry in different K8s namespaces (what OKD calls projects) to different clients. This is a really nice system, and it is what the Portal Installer uses to push containers into the Portal at install time.

The idea here is also for Cogs infrapods to pull containers from this Portal based registry for walled garden environments.

Self Contained Installers

The installer for the Portal is a thing and it works.

It assumes that OKD/K8s has already been set up, and requires that the user provide kubectl credentials and a basic configuration that describes how the portal should be set up and provides specific keys for things like registry access.

On install the Portal installer will also dynamically generate some configuration, located in the .conf folder within the installer’s working directory. The generated config is in YAML and looks like the following.

auth:
  cookiesecret: BpqacfbmjiQE93NW146ZYA2CP50v8Gt7
  postgrespw: CX4Oh5t0D7imZ381VcrFK2GdHoJs9p6j
minio:
  access: gnnRpkm9RNvTNqxuUMCofimQCouPHKFN
  secret: XErHLFaf4uta6q48um6aybxGUYmrvgdy
oauth:
  salt: 6BIyTWvd18Lh5s4mOSEGwb70Mn92P3Nj
  systemsecret: NQpSsxMH4Tv1g82hoi9F76DmlJKP35O0
merge:
  opspw: 4c9k8TUv2fj01Rq6PH5sdY7CzVJQibN3

As you can see, these are all generated passwords. They can be changed later, but this was deemed a better starting point than default passwords.

Organizations

This has yet to be implemented; how it will fit into the overall picture looks like this.

Cyber Physical Systems Support

CPS support has the following

Hills Left to Climb

Portal @glawler

  • Certificate Based SSH (some notes here): @glawler for user and XDC management, @ry for testbed node management; work together on the CA server pod/service/ingress.
  • Reconcilers to implement
    • XDC
    • Mergefs
    • Materialization to be done in conjunction with Facility tasks below @ry
  • Filling out the Portal Apiserver Implementation
    • Implementing remaining unimplemented methods
    • Completing policy layer integration
  • Rally Integration
  • TBUI v1
  • Finishing up loose ends in the interconnect planner @ry

Facility @ry

  • Squash down the Commander + Driver interface similar to what was done with the Portal apiserver
  • Factor the reconcile library out of the portal and use for facilities
    • This implies an architectural change to the cogs similar to what was done in the portal
      • The facility apiserver pushes state
      • Individual cogs observe target and resident state, drive toward the target and emit events.
  • MinIO integration similar to Portal
  • Canopy v1 based on reconciler approach
    • The problem with Canopy in its current form is that it does not do any state tracking and relies on the Cogs to tell it what to do at all times. This means that if a switch reboots, the config will be lost and an admin has to explicitly go in and reconfigure the switch. The Cogs have tools to make this easy, but from an architecture design perspective this is not a great situation.
  • Sort out whether to keep moving forward with Minimega or create a simple hypervisor reconciler controller.
  • User accessible console subsystem.

Testing, testing, testing

Merge v1 will not be cut until the following conditions are met.

  • The items identified above are complete.
  • A code freeze has taken place.
  • Comprehensive and reliable integration tests exist.