Facility-side operational hardware troubleshooting

Overview

Purpose

This post discusses an effective operational troubleshooting workflow from the point of view of a facility operator who does not have portal operator/admin access. One pain point in particular is covered in detail: finding out which materializations are on a node and how heavily its resources are currently being used.

Intended audience

The MergeTB software devops team, as well as all Merge portal and facility operators.

Definitions

Merge portal: The SPHERE/ModDeter portal that is currently functional, or any other implementation of the same.
Merge/MergeTB: The codebase that drives the Merge portal.
Facility: A remote infrastructure that is managed by a Merge portal.
Facility operator: The contact point for remote facility infrastructure managed through a Merge portal. Typically has administrative access to some levels of the infrastructure, but is not guaranteed to have full access to every layer.
Portal operator: The contact point for a Merge portal. Typically has at least administrative access to the portal microservices and the infrastructure that runs them, but is not guaranteed to have full access to every layer.
Realization: Reservation of facility resources through API calls to a facility from a Merge portal, within constraints specified by a portal user.
Materialization: Operational instantiation of reserved resources on a facility.
Nodes: Compute resources available within a facility, intended for allocation by realization.
Infrapod host(s): A facility-side node whose allocable resources are split amongst realized experiments. Typically handles basic network services such as NTP, DNS, and gateway routing to the outside world.

Troubleshooting workflow on hardware errors

Problem description

The facility operator has received a warning or critical message from monitoring tools about the state of a node within a facility. The operator must determine how heavily the node is allocated and which of its resources are most likely in active use, or at minimum who has materialized on the node, so that end-users can be notified whether any action will be required.

Example problem

There is a bad stick of RAM on a node (uncorrectable ECC error).

Operator troubleshooting workflow

  1. Identify the problem node, and use operational procedures to check the System Event Log (or other error sources) without altering the node's characteristics.
  2. Use the Merge software (mars/mrs: facility-side client; merge/mrg: portal-side client) to determine how heavily the node is utilized and which end-users will be affected.
    a. mrs list metal | grep <nodename>
    b. mrs list vm | grep <nodename>
    c. (Assuming VMs are in use) Determine when the materializations using the node were created and last updated, to estimate current usage before contacting end-users. For the example node md05:
    • mrs list vm | grep md05 | awk '{print $2}' | sort -u | xargs echo '^List|^Mat|^==' | tr '[:blank:]' '|' | xargs -I{} bash -c "mrg list mat --filter all | grep -E '{}'"
      As you can see, this is a rather complicated operational leap from mrs list | grep. It is necessary only because the mrs tool does not report these dates or times, and because the mrg list commands, the only class of request that shows materialization creation and update times, accept no regex or match filter.
      Example output of the command above:
      List materializations
      =====================
      Materialization                                  Metal    VMs    Links    Created                         Last Updated                    Status
      harrisonhw8r.harrisonhw8.se3210ah                0        1      1        Sat Nov 18 21:02:12 UTC 2023    Thu Apr 25 03:04:24 UTC 2024    Success
      ingress.firstexp.georgegg                        0        1      2        Wed Oct  4 00:47:43 UTC 2023    Thu Apr 25 03:02:48 UTC 2024    Success
      nattest1.imaging.mpcollins                       0        6      6        Wed Apr 24 18:28:35 UTC 2024    Thu Apr 25 03:10:39 UTC 2024    Success
      ne.test.litols0816                               0        3      2        Fri Dec  8 19:50:04 UTC 2023    Thu Apr 25 03:02:48 UTC 2024    Success
      ne6.test1.litols0816                             0        2      2        Fri Dec  8 20:31:17 UTC 2023    Thu Apr 25 03:02:48 UTC 2024    Success
      real.bufferoverflow.cos356ao                     0        1      1        Fri Apr 26 15:43:58 UTC 2024    Fri Apr 26 15:46:17 UTC 2024    Success
      real.bufferoverflow.cos356ce                     0        1      1        Fri Apr 26 00:32:02 UTC 2024    Fri Apr 26 00:33:17 UTC 2024    Success
      real.bufferoverflow.cos356dm                     0        1      1        Fri Apr 26 01:22:28 UTC 2024    Fri Apr 26 01:23:43 UTC 2024    Success
      real.bufferoverflow.cos356eo                     0        1      1        Thu Apr 25 17:05:15 UTC 2024    Thu Apr 25 17:06:33 UTC 2024    Success
      real.dnsmitm.usc430bc                            0        4      3        Sat Apr 20 02:45:51 UTC 2024    Thu Apr 25 03:05:54 UTC 2024    Success
      real.dnsmitm.usc430by                            0        4      3        Wed Apr 24 21:09:43 UTC 2024    Thu Apr 25 03:12:10 UTC 2024    Success
      real.dnsmitm.usc430ck                            0        4      3        Fri Apr 26 06:43:41 UTC 2024    Fri Apr 26 06:44:55 UTC 2024    Success
      real.exp76938.testuser                           0        4      3        Fri Apr 26 19:21:45 UTC 2024    Fri Apr 26 19:21:55 UTC 2024    Success
      real.firewall.umdsecam                           0        2      2        Thu Apr 25 22:10:51 UTC 2024    Thu Apr 25 22:12:09 UTC 2024    Success
      real.firewall.umdsecbe                           0        2      2        Wed Apr 24 20:24:13 UTC 2024    Thu Apr 25 20:15:26 UTC 2024    Success
      real.firewall.umdsecbp                           0        2      2        Thu Apr 25 23:23:50 UTC 2024    Thu Apr 25 23:25:08 UTC 2024    Success
      real.firewall.umdsecbr                           0        2      2        Thu Apr 25 22:12:28 UTC 2024    Thu Apr 25 22:13:49 UTC 2024    Success
      real.intro.snail                                 0        1      1        Tue Nov 28 23:57:28 UTC 2023    Thu Apr 25 03:04:25 UTC 2024    Success
      real.pathname.cos356dg                           0        1      1        Fri Apr 26 02:39:22 UTC 2024    Fri Apr 26 02:44:16 UTC 2024    Success
      real.sqli.cos356cg                               0        1      1        Thu Apr 25 22:41:27 UTC 2024    Thu Apr 25 22:42:45 UTC 2024    Success
      real.sqli.cos356cn                               0        1      1        Fri Apr 26 16:54:01 UTC 2024    Fri Apr 26 16:55:19 UTC 2024    Success
      real.sqli.cos356cq                               0        1      1        Fri Apr 26 16:28:02 UTC 2024    Fri Apr 26 16:29:16 UTC 2024    Success
      real.sqli.cos356dy                               0        1      1        Wed Apr 24 18:08:23 UTC 2024    Thu Apr 25 03:10:23 UTC 2024    Success
      real.synflood.cos356ca                           0        4      3        Fri Apr 26 18:27:48 UTC 2024    Fri Apr 26 18:35:11 UTC 2024    Success
      real.synflood.usc430an                           0        4      3        Sat Apr 20 03:53:54 UTC 2024    Thu Apr 25 03:06:03 UTC 2024    Success
      real.synflood.usc430bh                           0        4      3        Wed Apr 24 12:44:15 UTC 2024    Thu Apr 25 03:12:29 UTC 2024    Success
      real.xss.umdsecao                                0        2      2        Wed Apr 24 17:26:39 UTC 2024    Thu Apr 25 20:33:48 UTC 2024    Success
      realize1.probability.socialkeyrecovery           0        6      2        Thu Apr  4 07:25:35 UTC 2024    Thu Apr 25 03:09:26 UTC 2024    Success
      realizesw1.swisspost.evotingsystems              0        4      2        Thu Mar 28 07:54:42 UTC 2024    Thu Apr 25 03:04:33 UTC 2024    Success
      v1.testbotenv.discern                            0        7      7        Thu Jan 18 06:20:30 UTC 2024    Thu Apr 25 03:02:48 UTC 2024    Success
      
  3. Notify end-users who own the materializations that their experiments may be disrupted during a scheduled node downtime.
  4. Set the node into the NoAlloc state, to prevent new realizations from reserving resources on the node.
    a. mrs portal alloc add moddeter NoAlloc <nodename>
  5. Remediate the hardware issue (example problem: remove/replace the bad stick of RAM) during an appropriate downtime window.
  6. Bring the hardware back up, and verify that the errors found in step 1 are de-asserted or cleared via operational tools/monitoring.
  7. Verify that user materializations are up and/or reconciled on the node.
  8. Take the node out of the NoAlloc state, to allow resource allocations to continue on the node.
    a. mrs portal alloc delete moddeter NoAlloc <nodename>
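
The lookup in step 2c can be made a little easier to read and reuse by splitting it into a small helper. The following is only a sketch, not part of the Merge tooling: filter_mats is a hypothetical function name, and it assumes (as the one-liner in step 2c does) that column 2 of mrs list vm holds the materialization name and that the header lines of mrg list mat begin with "List", "Materialization", or "==". Unlike the grep -E version, it compares names literally, so a "." in a name is not treated as a wildcard.

```shell
#!/usr/bin/env bash
# filter_mats NAMES_FILE   (hypothetical helper, not a Merge command)
#   stdin:      text shaped like the output of `mrg list mat --filter all`
#   NAMES_FILE: one materialization name per line
# Prints the listing header plus the rows whose first column is exactly
# one of the wanted names.
filter_mats() {
  awk -v names="$1" '
    BEGIN {
      # Load the wanted names into a lookup table.
      while ((getline line < names) > 0) {
        if (split(line, f, " ") > 0) wanted[f[1]] = 1
      }
      close(names)
    }
    /^(List|Materialization|==)/ { print; next }   # keep header lines
    ($1 in wanted) { print }                        # keep exact name matches
  '
}

# Assumed usage (column 2 of `mrs list vm` taken to be the materialization):
#   mrs list vm | grep <nodename> | awk '{print $2}' | sort -u > /tmp/names
#   mrg list mat --filter all | filter_mats /tmp/names
```

Keeping the name list in a file also leaves a record of exactly which materializations were considered affected when the end-users were notified.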

Suggestions for workflow

  • Consider whether it should be possible to filter a “listing” based on a regex.
  • Consider whether the mrs tool should be able to show “created/updated” type data for allocations/reservations of facility-side resources without querying the portal.
  • Other suggestions?
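
Until something like the suggestions above exists, the Created and Last Updated columns of the portal listing can at least be normalized client-side so that plain sort orders them chronologically. This is a sketch under assumptions: mats_by_age is a hypothetical helper name, and the field positions are taken from the example output shown earlier (name, three counts, then two six-field UTC timestamps such as "Wed Oct  4 00:47:43 UTC 2023", then the status).

```shell
#!/usr/bin/env bash
# mats_by_age   (hypothetical helper, not a Merge command)
#   stdin: text shaped like the output of `mrg list mat --filter all`
# Prints "<created> <last updated> <name>" per materialization, oldest
# creation first, with timestamps rewritten as YYYY-MM-DDTHH:MM:SSZ.
mats_by_age() {
  awk '
    BEGIN {
      n = split("Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov Dec", m, " ")
      for (i = 1; i <= n; i++) month[m[i]] = i
    }
    function iso(mon, day, hms, year) {
      return sprintf("%04d-%02d-%02dT%sZ", year, month[mon], day, hms)
    }
    # Data rows have 17 whitespace-separated fields; header lines do not.
    NF >= 17 && ($6 in month) && ($12 in month) {
      print iso($6, $7, $8, $10), iso($12, $13, $14, $16), $1
    }
  ' | LC_ALL=C sort
}
```

Piping the filtered listing for a node through this helper puts long-lived materializations at the top, which gives a rough first ordering of which owners to contact, though neither timestamp says whether anyone is actually using the experiment.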

Another suggestion might be to list the owner(s) of an active materialization, along with when they last logged in to the portal or to ssh-jump. By owner(s) I also mean whoever realized it, since I think an experiment can be realized or materialized by anyone in a project.
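
Owner and login data are not visible in the listing, so the closest approximation available today is to group the affected materializations by project. The sketch below assumes that materialization names have the form <realization>.<experiment>.<project>, which is how the names in the example output appear to be structured; mats_per_project is a hypothetical helper name, and the project is not necessarily the person who realized the experiment.

```shell
#!/usr/bin/env bash
# mats_per_project   (hypothetical helper, not a Merge command)
#   stdin: text shaped like the output of `mrg list mat --filter all`
# Prints "<count> <project>" for each project, sorted by project name.
# Assumption: the project is the last dot-separated part of the name.
mats_per_project() {
  awk '
    {
      n = split($1, part, "[.]")
      # Header and blank lines have fewer than three parts and are skipped.
      if (n >= 3) count[part[n]]++
    }
    END { for (p in count) print count[p], p }
  ' | LC_ALL=C sort -k2
}
```

The result is a short list of projects to contact rather than a list of individual users, which is a workaround rather than the per-owner view suggested above.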