Overview
Purpose
The purpose of this post is to discuss an effective operational troubleshooting workflow from the point of view of a facility operator without portal operator/admin access. It focuses on one particular pain point: finding out which materializations are using a node and how its resources are currently utilized.
Intended audience
The MergeTB software devops team, as well as all Merge portal and facility operators.
Definitions
Merge portal: The SPHERE/ModDeter portal that is currently functional, or any other implementation of the same.
Merge/MergeTB: The codebase that drives the Merge portal.
Facility: A remote infrastructure that is managed by a Merge portal.
Facility operator: The contact point for remote facility infrastructure managed through a Merge portal. Typically has administrative access to some levels of the infrastructure, but not guaranteed to have 100% access to all layers.
Portal operator: The contact point for a Merge portal. Typically has at least administrative access to the portal microservices and the infrastructure that runs them, but not guaranteed to have 100% access to all layers.
Realization: Reservation of facility resources through API calls to a facility from a Merge portal, within constraints specified by a portal user.
Materialization: Operational instantiation of reserved resources on a facility.
Nodes: Compute resources available within a facility, intended for allocation by realization.
Infrapod host(s): A facility-side node whose allocable resources are split amongst realized experiments. Typically handles basic network services such as NTP, DNS, and gateway routing to the outside world.
Troubleshooting workflow on hardware errors
Problem description
The facility operator has received a warning or critical message from monitoring tools regarding the state of a node within a facility. The facility operator must determine how much of the node is allocated, and which resources on the node are most likely actually in use or, at minimum, who has materialized on the node, so that end-users can be notified whether any action will be required of them.
Example problem
There is a bad stick of RAM on a node (uncorrectable ECC error).
Operator troubleshooting workflow
- Identify the problem node, and use operational procedures to check the System Event Log or other error sources without altering node characteristics.
- Use the Merge software (mars/mrs: facility-side client; merge/mrg: portal-side client) to determine how utilized the node is and which end-users will be affected.
  a. mrs list metal | grep <nodename>
  b. mrs list vm | grep <nodename>
  c. (Assuming VMs are in use) Determine when the materializations using the node were created and materialized, to estimate current usage before contacting end-users. In this example the node name is md05:
     mrs list vm | grep md05 | awk '{print $2}' | uniq | xargs echo ^List\|^Mat\|^== | tr [:blank:] \| | xargs -I{} bash -c "mrg list mat --filter all | grep -E '{}'"
As you can see, this is a rather complicated operational leap from mrs list | grep. It is necessary only because the mrs tool cannot report these dates and times, and because the mrg list commands, which are the only class of request that shows materialization creation and update times, do not accept a regex or match filter.
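To make the construction easier to follow, the pattern-building half of the pipeline can be run on its own. The sketch below is illustrative only: the two materialization names are borrowed from the example output and stand in for whatever column 2 of mrs list vm holds for the node, and quoting has been added around the tr and echo arguments, which does not change the result.

```shell
# Stand-in for: mrs list vm | grep md05 | awk '{print $2}' | uniq
# (hypothetical sample names; the real list comes from the facility)
names='real.intro.snail
real.xss.umdsecao'

# Same two stages as the one-liner: xargs joins the header anchors and the
# names with spaces, then tr turns every blank into a regex alternation bar.
pattern=$(printf '%s\n' "$names" | xargs echo '^List|^Mat|^==' | tr '[:blank:]' '|')

echo "$pattern"
# prints: ^List|^Mat|^==|real.intro.snail|real.xss.umdsecao
```

The resulting string is what the final grep -E receives. Note that the names are neither anchored nor escaped, so each dot matches any character and a name can also match as a substring of a longer one.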
Example output of the complex command construction:

List materializations
=====================
Materialization                         Metal  VMs  Links  Created                       Last Updated                  Status
harrisonhw8r.harrisonhw8.se3210ah       0      1    1      Sat Nov 18 21:02:12 UTC 2023  Thu Apr 25 03:04:24 UTC 2024  Success
ingress.firstexp.georgegg               0      1    2      Wed Oct 4 00:47:43 UTC 2023   Thu Apr 25 03:02:48 UTC 2024  Success
nattest1.imaging.mpcollins              0      6    6      Wed Apr 24 18:28:35 UTC 2024  Thu Apr 25 03:10:39 UTC 2024  Success
ne.test.litols0816                      0      3    2      Fri Dec 8 19:50:04 UTC 2023   Thu Apr 25 03:02:48 UTC 2024  Success
ne6.test1.litols0816                    0      2    2      Fri Dec 8 20:31:17 UTC 2023   Thu Apr 25 03:02:48 UTC 2024  Success
real.bufferoverflow.cos356ao            0      1    1      Fri Apr 26 15:43:58 UTC 2024  Fri Apr 26 15:46:17 UTC 2024  Success
real.bufferoverflow.cos356ce            0      1    1      Fri Apr 26 00:32:02 UTC 2024  Fri Apr 26 00:33:17 UTC 2024  Success
real.bufferoverflow.cos356dm            0      1    1      Fri Apr 26 01:22:28 UTC 2024  Fri Apr 26 01:23:43 UTC 2024  Success
real.bufferoverflow.cos356eo            0      1    1      Thu Apr 25 17:05:15 UTC 2024  Thu Apr 25 17:06:33 UTC 2024  Success
real.dnsmitm.usc430bc                   0      4    3      Sat Apr 20 02:45:51 UTC 2024  Thu Apr 25 03:05:54 UTC 2024  Success
real.dnsmitm.usc430by                   0      4    3      Wed Apr 24 21:09:43 UTC 2024  Thu Apr 25 03:12:10 UTC 2024  Success
real.dnsmitm.usc430ck                   0      4    3      Fri Apr 26 06:43:41 UTC 2024  Fri Apr 26 06:44:55 UTC 2024  Success
real.exp76938.testuser                  0      4    3      Fri Apr 26 19:21:45 UTC 2024  Fri Apr 26 19:21:55 UTC 2024  Success
real.firewall.umdsecam                  0      2    2      Thu Apr 25 22:10:51 UTC 2024  Thu Apr 25 22:12:09 UTC 2024  Success
real.firewall.umdsecbe                  0      2    2      Wed Apr 24 20:24:13 UTC 2024  Thu Apr 25 20:15:26 UTC 2024  Success
real.firewall.umdsecbp                  0      2    2      Thu Apr 25 23:23:50 UTC 2024  Thu Apr 25 23:25:08 UTC 2024  Success
real.firewall.umdsecbr                  0      2    2      Thu Apr 25 22:12:28 UTC 2024  Thu Apr 25 22:13:49 UTC 2024  Success
real.intro.snail                        0      1    1      Tue Nov 28 23:57:28 UTC 2023  Thu Apr 25 03:04:25 UTC 2024  Success
real.pathname.cos356dg                  0      1    1      Fri Apr 26 02:39:22 UTC 2024  Fri Apr 26 02:44:16 UTC 2024  Success
real.sqli.cos356cg                      0      1    1      Thu Apr 25 22:41:27 UTC 2024  Thu Apr 25 22:42:45 UTC 2024  Success
real.sqli.cos356cn                      0      1    1      Fri Apr 26 16:54:01 UTC 2024  Fri Apr 26 16:55:19 UTC 2024  Success
real.sqli.cos356cq                      0      1    1      Fri Apr 26 16:28:02 UTC 2024  Fri Apr 26 16:29:16 UTC 2024  Success
real.sqli.cos356dy                      0      1    1      Wed Apr 24 18:08:23 UTC 2024  Thu Apr 25 03:10:23 UTC 2024  Success
real.synflood.cos356ca                  0      4    3      Fri Apr 26 18:27:48 UTC 2024  Fri Apr 26 18:35:11 UTC 2024  Success
real.synflood.usc430an                  0      4    3      Sat Apr 20 03:53:54 UTC 2024  Thu Apr 25 03:06:03 UTC 2024  Success
real.synflood.usc430bh                  0      4    3      Wed Apr 24 12:44:15 UTC 2024  Thu Apr 25 03:12:29 UTC 2024  Success
real.xss.umdsecao                       0      2    2      Wed Apr 24 17:26:39 UTC 2024  Thu Apr 25 20:33:48 UTC 2024  Success
realize1.probability.socialkeyrecovery  0      6    2      Thu Apr 4 07:25:35 UTC 2024   Thu Apr 25 03:09:26 UTC 2024  Success
realizesw1.swisspost.evotingsystems     0      4    2      Thu Mar 28 07:54:42 UTC 2024  Thu Apr 25 03:04:33 UTC 2024  Success
v1.testbotenv.discern                   0      7    7      Thu Jan 18 06:20:30 UTC 2024  Thu Apr 25 03:02:48 UTC 2024  Success
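Until the tools offer native filtering, the same lookup can be split into two small shell functions that are easier to read and reuse. This is only a sketch under the same assumption the one-liner makes, namely that column 2 of mrs list vm output is the materialization name. The function names mats_on_node and filter_mats are hypothetical helpers, not part of mrs or mrg.

```shell
# Read `mrs list vm` output on stdin and print, once each, the
# materialization names (assumed to be column 2) on lines mentioning the node.
mats_on_node() {
    awk -v node="$1" 'index($0, node) { print $2 }' | sort -u
}

# Read `mrg list mat --filter all` output on stdin. Keep the header lines and
# only those rows whose first column exactly equals a name listed in file $1.
filter_mats() {
    awk 'FILENAME == ARGV[1] { want[$1] = 1; next }
         /^(List|Mat|==)/ || ($1 in want)' "$1" -
}

# Intended usage (not run here, since it needs both clients):
#   mrs list vm | mats_on_node md05 > /tmp/mats.txt
#   mrg list mat --filter all | filter_mats /tmp/mats.txt
```

Comparing the first column for equality, rather than building a regex, avoids the false matches that unescaped dots and unanchored names can produce with grep -E. Like the original grep, mats_on_node still matches the node name anywhere on the line.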
- Notify end-users who own the materializations that their experiments may be disrupted during a scheduled node downtime.
- Set the node into a NoAlloc state, to disallow new realizations from reserving resources on the node:
  mrs portal alloc add moddeter NoAlloc <nodename>
- Remediate the hardware issue (Example problem: Remove/replace bad stick of RAM) during an appropriate downtime window.
- Bring hardware up, and verify errors in step 1 are de-asserted or cleared via operational tools/monitoring.
- Verify that user materializations are up and/or reconciled on the node.
- Take the node out of the NoAlloc state, to allow resource allocations to continue on the node:
  mrs portal alloc delete moddeter NoAlloc <nodename>
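Because the NoAlloc step is applied before the maintenance and reverted after it, the two commands can be wrapped in one helper so that the pair is harder to mistype. This is a minimal sketch, not part of mrs: node_noalloc and DRY_RUN are hypothetical names, and it assumes that moddeter in the commands above is the portal name.

```shell
# Hypothetical wrapper around the two commands from the workflow.
# usage: node_noalloc add|delete <portal> <nodename>
# With DRY_RUN set to a non-empty value, the command is printed, not run.
node_noalloc() {
    if [ "$#" -ne 3 ]; then
        echo "usage: node_noalloc add|delete <portal> <nodename>" >&2
        return 2
    fi
    case "$1" in
        add|delete) ;;
        *) echo "node_noalloc: first argument must be add or delete" >&2
           return 2 ;;
    esac
    # ${DRY_RUN:+echo} expands to "echo" only when DRY_RUN is non-empty.
    ${DRY_RUN:+echo} mrs portal alloc "$1" "$2" NoAlloc "$3"
}
```

For example, running DRY_RUN=1 node_noalloc add moddeter md05 shows the command that would be issued before the downtime, and node_noalloc delete moddeter md05 reverts it afterwards.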
Suggestions for workflow
- Consider whether it should be possible to filter a “listing” based on a regex.
- Consider whether the mrs tool should be able to show “created/updated” data for allocations/reservations of facility-side resources without querying the portal.
- Other suggestions?