Initial Chaos Testing Fixes for VTE shutdown

This post is a short write-up documenting the use case where we are either chaos testing at the most basic level, or shutting down and power-cycling every node in the VTE.

Part of this post will enumerate some of the initial issues (that we will hopefully address soon), and the other part will go into short-term solutions.

The brains of the testbed is the portal, but the heart is the site (commander & cogs), which takes actions from the portal and implements an experiment. In the VTE, the core of the site is the cmdr node, which operates as commander, driver, rex, etcd, etc. When the cmdr goes down (or, on the dcomp testbed, when the site nodes go down), we lose a lot of ephemeral data that really does not have to be ephemeral. The biggest issues are the network settings (set by canopy) and the infrapods (managed by containerd and configured by the cogs). During a power outage event, we lose the core services which live in the infrapod and all of the network state that allowed a materialization to operate.
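
As a rough illustration of what that ephemeral state looks like on the cmdr node, the commands below poke at containerd and the kernel networking state. This is only a sketch: the containerd namespace used for infrapods varies by deployment, so <infrapod-namespace> is a placeholder, not a real name.

# list containerd namespaces (infrapods are managed through containerd)
sudo ctr namespaces list

# list the containers in a given namespace; <infrapod-namespace> is a placeholder
sudo ctr -n <infrapod-namespace> containers list

# network state set up by canopy: vxlan/vtep interfaces and network namespaces
ip -d link show type vxlan
ip netns list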

Power outage fixes for cmdr & storX

This solution does not attempt to ‘repair’ or ‘recover’ the state, but instead writes new data.

  1. The first step is to replay some of the cog tasks. The reasoning here is that because the cmdr went down, we’ve lost not only our materialization infrapods but also the special materialization for the harbor network. Start with the harbor task to re-create the harbor, then reset any mzids that were configured prior to shutdown (a worked example follows the commands below). Make sure to go in order, where order means only resetting a task once its dependencies have already been completed.
cog list tasks --all
cog reset <taskid>
...
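
A hedged example of what that replay can look like, using only the two commands above; the harbor and mzid task ids are placeholders that would come from the cog list tasks --all output:

# find the harbor task and the per-mzid tasks in the output
cog list tasks --all

# replay the harbor task first...
cog reset <harbor-taskid>

# ...then each materialization task whose dependencies have already completed
cog reset <mzid-taskid-1>
cog reset <mzid-taskid-2>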

Once that is complete, all that is left in the cog list tasks queue are NodeSetup tasks. This is because the storage nodes have also gone down, and the bgp network and sled containers need to be repaired as well (steps 2 and 3 below).

  2. The next step is to repair our bgp network. We need to re-advertise each of the storage nodes based on the model (integrated/prototypes/site/xir/tb-deploy-vars.json); a filled-in example for stor0 follows the excerpt below.
# create new origin
gobgp global rib add {{ data.serviceTunnelIP }}/32 origin igp

# advertise layer 2
gobgp global rib add macadv `ip -j link show | jq -r '.[4].address'` 0.0.0.0 etag 0 label 2 rd "{{data.serviceTunnelIP}}:2" rt "{{data.bgpAS}}:2" encap vxlan nexthop {{data.serviceTunnelIP}} origin igp -a evpn

# advertise layer 3
gobgp global rib add multicast {{data.serviceTunnelIP}} etag 0 rd "{{data.serviceTunnelIP}}:2" rt "{{data.bgpAS}}:2" encap vxlan origin igp pmsi ingress-repl 2 {{data.serviceTunnelIP}} nexthop {{data.serviceTunnelIP}} -a evpn

data.serviceTunnelIP and data.bgpAS come from integrated/prototypes/site/xir/tb-deploy-vars.json, excerpt below:

    "stor0": {
      "Net": {
        "vtepIfx": "eth1",
        "mtu": 9216,
        "vtepMtu": 9166,
        "serviceTunnelIP": "10.99.0.20",
        "bgpAS": 64720
      }
    },
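
For concreteness, here is what those templates expand to for stor0, using the serviceTunnelIP and bgpAS values from the excerpt above. This assumes the commands are run on stor0 itself, so the backticked MAC lookup resolves against stor0's interfaces (the .[4] index is taken verbatim from the template and may differ per host).

# create new origin (stor0)
gobgp global rib add 10.99.0.20/32 origin igp

# advertise layer 2 (stor0)
gobgp global rib add macadv `ip -j link show | jq -r '.[4].address'` 0.0.0.0 etag 0 label 2 rd "10.99.0.20:2" rt "64720:2" encap vxlan nexthop 10.99.0.20 origin igp -a evpn

# advertise layer 3 (stor0)
gobgp global rib add multicast 10.99.0.20 etag 0 rd "10.99.0.20:2" rt "64720:2" encap vxlan origin igp pmsi ingress-repl 2 10.99.0.20 nexthop 10.99.0.20 -a evpn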

Run gobgp global rib to see the current routing information base and verify neighbors, and gobgp global rib -a evpn to verify the advertisements. If you don’t know what you are looking for in those entries, you can verify connectivity on stor0-3 by pinging the overlay addresses (172.30.0.X), or on cmdr by entering the harbor namespace (sudo ip netns exec main.harbor.spineleaf /bin/bash) and then running ping 172.30.0.X, where the address is 172.30.0.1 on cmdr, 172.30.0.3 on stor0, 172.30.0.4 on stor1, and so on.
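
Putting those checks together, a verification pass might look like the following sketch; the overlay addresses are taken from the mapping above, and the pings are run from cmdr:

# check the routing information base and the evpn advertisements
gobgp global rib
gobgp global rib -a evpn

# from cmdr, ping the storage nodes' overlay addresses inside the harbor namespace
sudo ip netns exec main.harbor.spineleaf ping -c 3 172.30.0.3   # stor0
sudo ip netns exec main.harbor.spineleaf ping -c 3 172.30.0.4   # stor1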

  3. The final repair step to get the materializations working again is to restart the docker containers on the stor0-3 hosts. The two containers are sledd and slednginx, which will probably show up as Exited due to the power outage. To restart, run sudo docker restart sledd (and the same for slednginx), or restart using the container id (sudo docker ps -a).
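
A minimal sketch of that restart, run on each of the stor0-3 hosts and assuming the container names match the image names (otherwise substitute the container ids from the listing):

# find the exited sled containers
sudo docker ps -a

# restart them by name
sudo docker restart sledd
sudo docker restart slednginx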

  4. Now that the containers are back online and the bgp network has been repaired, we can delete the task errors with cog delete task-errors <taskid>, and the nodes should sled properly.
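
As a final hedged example, the cleanup is just the cog commands already shown, applied to each errored NodeSetup task; the task ids below are placeholders from the cog list tasks --all output:

# find the NodeSetup tasks that errored out
cog list tasks --all

# clear the recorded errors for each one; the nodes should then sled
cog delete task-errors <nodesetup-taskid-1>
cog delete task-errors <nodesetup-taskid-2>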