Near Term Stats Design in the Merge Portal

This is a brief post summarizing today's design discussion with Brian, Joe, and Geoff. We need a moderately quick way to implement the following statistics for Merge.

  • Materialization start and end events
  • Realization (resource reservation) and relinquish (resource release) events
  • New project, experiment, organization, and user events
  • Delete project, experiment, organization, and user events
  • Project and organization membership change events

The short-term design is to 1) augment the existing stats service (which already publishes some of these events) and 2) capture the currently published data via a Prometheus instance running inside the portal Kubernetes namespace.

The current stats service is a minimal reconciler: it only watches for materialization events. In addition, it periodically collects data, counting things such as experiments, physical nodes reserved, and virtual nodes reserved. This effort will expand the reconciliation to watch the create and delete events for projects, experiments, organizations, and users. This data will be put into Prometheus metrics that increase or decrease on create or delete events. Because the values can go down, these need to be gauges rather than counters, which in Prometheus can only increase. Since these operations do not trigger many chained events, we will assume each one succeeds.
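
As a rough illustration of the create/delete bookkeeping described above, here is a minimal Python sketch using only the standard library. It is not the stats service's actual code: the `Gauge` class is a tiny stand-in for a Prometheus gauge (a value that can move in both directions), and the metric name `merge_portal_objects`, the `kind` label, and the `handle_event` function are all hypothetical names chosen for the example. The `render` method emits the Prometheus text exposition format that a scraper would read.

```python
import threading
from collections import defaultdict


class Gauge:
    """Tiny stand-in for a Prometheus gauge: a value that can go up or down."""

    def __init__(self, name, help_text):
        self.name = name
        self.help_text = help_text
        self._values = defaultdict(int)  # one value per "kind" label
        self._lock = threading.Lock()

    def add(self, kind, delta):
        with self._lock:
            self._values[kind] += delta

    def value(self, kind):
        with self._lock:
            return self._values[kind]

    def render(self):
        """Render the current values in the Prometheus text exposition format."""
        lines = [
            f"# HELP {self.name} {self.help_text}",
            f"# TYPE {self.name} gauge",
        ]
        with self._lock:
            for kind in sorted(self._values):
                lines.append(f'{self.name}{{kind="{kind}"}} {self._values[kind]}')
        return "\n".join(lines) + "\n"


# Object kinds the expanded reconciler would watch (taken from the post).
WATCHED_KINDS = {"project", "experiment", "organization", "user"}


def handle_event(gauge, kind, action):
    """Apply one create/delete event; return True if the event was counted.

    Events are assumed to succeed, so no follow-up status is checked.
    """
    if kind not in WATCHED_KINDS:
        return False
    if action == "create":
        gauge.add(kind, 1)
    elif action == "delete":
        gauge.add(kind, -1)
    else:
        return False
    return True
```

For example, after `handle_event(g, "project", "create")` on a fresh gauge `g`, `g.render()` contains the sample line `merge_portal_objects{kind="project"} 1`. A real implementation would more likely use an existing Prometheus client library than a hand-rolled class like this one.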

Given the changes above, we will have point-in-time data for the desired stats. Standing up a standard Prometheus server within the portal k8s namespace that scrapes the published data will show how the data changes over time and, more importantly, will store it. (Currently the data is published to an endpoint, but no one is capturing it.) The data store will be the existing Merge data store used by the portal (and will thus be folded into the existing backups).

To display the data, we will write scripts or a simple web front end. The scripts will query the Prometheus server and dump the data as tables on stdout. The web front end will query the Prometheus server and display the data in a number of charts, likely using React + PatternFly and the Prometheus JavaScript package. The scripts will likely be done first to debug and confirm the proof of concept for this design.
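
A table-dumping script of the kind described above might look roughly like the following sketch. It uses the Prometheus HTTP API's instant-query endpoint (`/api/v1/query`) and assumes the query returns an instant vector. The server address, the metric name `merge_portal_objects`, and the `kind` label are assumptions made for the example, not names from the existing stats service.

```python
import json
import urllib.parse
import urllib.request

# Assumption: address of the Prometheus server as seen by the script.
PROMETHEUS_URL = "http://localhost:9090"


def instant_query(expr, base_url=PROMETHEUS_URL):
    """Run an instant query and return the list of samples in the result."""
    url = f"{base_url}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as resp:
        payload = json.load(resp)
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error')}")
    return payload["data"]["result"]


def format_table(result, label="kind"):
    """Format instant-vector samples as a two-column plain-text table."""
    # Each sample looks like {"metric": {...labels...}, "value": [ts, "string"]}.
    rows = sorted((s["metric"].get(label, ""), s["value"][1]) for s in result)
    width = max([len(label)] + [len(name) for name, _ in rows])
    lines = [f"{label.upper():<{width}}  VALUE"]
    lines += [f"{name:<{width}}  {value}" for name, value in rows]
    return "\n".join(lines)
```

With a server running, `print(format_table(instant_query("merge_portal_objects")))` would print one row per label value. Keeping the formatting separate from the HTTP call makes it possible to check the table output against a canned response without a live server.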

The Prometheus server installation may or may not be added to the existing Portal Helm/Ansible installation. If so, it will likely be only a short-term update before writing a more extensive stats service update.

GitLab issue for tracking this task: Collect statistics from stats service and export through prometheus (#301), MergeTB / Portal / services