Near Term Stats Design in the Merge Portal

This is a brief post summarizing the design discussion we had today with Brian, Joe, and Geoff. We need a moderately quick way to implement the following statistics for Merge:

  • Materialization start and end events
  • Realization (resource reservation) and relinquish events (resource release)
  • New project, experiment, organization, and user events
  • Delete project, experiment, organization, and user events
  • Project and organization membership change events

The short-term design is to 1) augment the existing stats service (which already publishes some of these events), and 2) capture the currently published data via a Prometheus instance running inside the portal Kubernetes namespace.

The current stats service is a minimal reconciler: it only watches for materialization events. It also collects periodic data, counting things like experiments, physical nodes reserved, and virtual nodes reserved. This effort will expand the reconciliation to also watch create and delete events for projects, experiments, organizations, and users. The data will be stored in Prometheus gauge metrics, which increase on create events and decrease on delete events (Prometheus counters can only increase, so gauges are the appropriate metric type here). Since these events do not have many chained events, we will assume they are successful.

Given the changes above, we will have point-in-time data for the desired stats. Standing up a standard Prometheus server within the portal k8s namespace that scrapes the published data will give us data change over time and, more importantly, it stores the data. (Currently the data is published to an endpoint, but no one is capturing it.) The data store will be the existing Merge data store used by the portal (and will thus be folded into the existing backups).
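For reference, the scrape job for the stats service might look roughly like the following. The job name, target service, port, and interval are hypothetical placeholders, not the portal's actual configuration.

```yaml
# Hypothetical Prometheus scrape job for the portal stats service;
# the service name and port are placeholders.
scrape_configs:
  - job_name: portal-stats
    scrape_interval: 30s
    static_configs:
      - targets: ['stats-service.portal.svc:8080']
```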

To display the data, we will write scripts or a simple web front end. The scripts will query the Prometheus server and dump the data as tables on stdout. The web front end will query the Prometheus server and display the data in a number of charts, likely using React + PatternFly and the Prometheus JavaScript package. The scripts will likely be done first, to debug and confirm the proof of concept for this design.
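A stdout "dump tables" script of the kind described above could be sketched as follows. The query, server address, and environment variable are hypothetical placeholders; the response shape is Prometheus's documented `/api/v1/query` instant-vector JSON, with parsing split out so it can be exercised without a live server.

```go
package main

import (
	"encoding/json"
	"fmt"
	"io"
	"net/http"
	"net/url"
	"os"
)

// queryResponse mirrors the relevant parts of the Prometheus
// /api/v1/query JSON response for an instant vector.
type queryResponse struct {
	Status string `json:"status"`
	Data   struct {
		Result []struct {
			Metric map[string]string `json:"metric"`
			Value  []interface{}     `json:"value"` // [timestamp, "value"]
		} `json:"result"`
	} `json:"data"`
}

// FormatTable turns a raw /api/v1/query response body into simple
// "metric<TAB>value" rows for stdout.
func FormatTable(body []byte) (string, error) {
	var r queryResponse
	if err := json.Unmarshal(body, &r); err != nil {
		return "", err
	}
	out := ""
	for _, s := range r.Data.Result {
		if len(s.Value) == 2 {
			out += fmt.Sprintf("%s\t%v\n", s.Metric["__name__"], s.Value[1])
		}
	}
	return out, nil
}

func main() {
	// PROM_URL is a placeholder for the in-cluster Prometheus address,
	// e.g. http://prometheus.portal.svc:9090
	server := os.Getenv("PROM_URL")
	if server == "" {
		fmt.Println("set PROM_URL to the Prometheus server address")
		return
	}
	q := url.QueryEscape("merge_projects") // hypothetical metric name
	resp, err := http.Get(server + "/api/v1/query?query=" + q)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	defer resp.Body.Close()
	body, _ := io.ReadAll(resp.Body)
	table, err := FormatTable(body)
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
	fmt.Print(table)
}
```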

The Prometheus server installation may or may not be added to the existing Portal Helm/Ansible installation. If so, it will likely be only a short-term update before a more extensive stats service update is written.

GitLab issue tracking this task: Collect statistics from stats service and export through prometheus (MergeTB / Portal / services #301).

I have no operator qualms about a built-in Prometheus, as long as the Prometheus and stats data source can be scraped from inside the bastion.

Would this Prometheus take the place of any other operational-level Prometheus, or would it be strictly for portal stats?

What are the limits on the Prometheus data? Will it exist for the lifetime of the portal instance? How large can it be expected to get, or does it not matter (i.e., should this data be preserved at any cost in storage)?