This is a brief post about the design discussion held today with Brian, Joe, and Geoff. We need a moderately quick way to collect the following statistics for Merge:
- Materialization start and end events
- Realization (resource reservation) and relinquishment (resource release) events
- Project and organization membership change events
The short-term design is to (1) augment the existing stats service, which already publishes some of these events, and (2) capture the published data with a Prometheus instance running inside the portal Kubernetes namespace.
The stats service is a minimal reconciler: it only watches for materialization events. It also collects periodic data, counting things such as experiments, physical nodes reserved, and virtual nodes reserved. This effort will extend the reconciliation to watch the create and delete events for projects, experiments, organizations, and users. These values will be stored in Prometheus gauges, which are incremented on create events and decremented on delete events (a Prometheus counter can only increase, so it cannot represent a value that drops on delete). Since these events do not trigger many chained events, we will assume they succeed.
Given the changes above, we will have point-in-time values for the desired statistics. Standing up a standard Prometheus server within the portal Kubernetes namespace to scrape the published data will give us the change in those values over time and, more importantly, will store them. (Currently the data is published to an endpoint, but nothing captures it.) The storage will be the existing Merge data store used by the portal, so the data will be included in the existing backups.
The Prometheus server installation may or may not be added to the existing portal Helm/Ansible installation. If it is, it will likely be only a short-term update before a more extensive stats service update is written.
GitLab issue tracking this task: Collect statistics from stats service and export through prometheus (#301), in MergeTB / Portal / services.