Notes on battles with stale EVPN advertisements

I’ve noticed a large buildup of stale EVPN routes on one of the testbed facilities we manage. This is a collection of my notes, hopefully ending with a solution.

Failing to delete routes through GoBGP

When I first noticed the stale routes, my instinct was to delete them with GoBGP.

gobgp global rib -a evpn del multicast 10.99.1.3 etag 0 rd 10.99.1.3:12

However, it appears that GoBGP will only withdraw a route that it originated itself. The following warning in the gobgpd logs points to this.

WARN[0006] No matching path for withdraw found, may be path was not installed into table  Key="[type:multicast][rd:10.99.1.3:12][etag:0][ip:10.99.1.3]" Path="{ [type:multicast][rd:10.99.1.3:12][etag:0][ip:10.99.1.3] | src: local, nh: 0.0.0.0, withdraw }" Topic=Table

Where the key text is

No matching path for withdraw found, may be path was not installed into table

This is a roundabout way of saying the path came from a peer rather than from the local GoBGP instance. The peer that advertised it can be found by querying GoBGP with the --json flag. This dumps a lot of data, so it's best to redirect the output to a file and inspect it afterwards.

gobgp global rib -a evpn --json > out
"[type:macadv][rd:10.99.1.3:12][etag:0][mac:00:08:a2:0d:dc:ab][ip:<nil>]": [
  ------ snip --------
  {
    "source-id": "10.99.1.2",
    "neighbor-ip": "fe80::526b:4bff:fe8e:9e70"
  }
],

So this tells us that our problem lies with the router 10.99.1.2. This router is a Cumulus switch running FRR, so the next stage of our saga will go there.
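With a large dump, reading the file by eye gets tedious, so the lookup can be scripted. The following is a minimal sketch, assuming the dump is a single JSON object keyed by the bracketed route string, with each value a list of path objects that carry "source-id" as in the snippet above. The helper names (parse_key, load_rib, paths_by_source) are made up for illustration, and the "unknown" label for paths without a "source-id" is my own placeholder.

```python
import json
import re
from collections import Counter

# Matches one "[name:value]" field of a route key such as
# "[type:multicast][rd:10.99.1.3:12][etag:0][ip:10.99.1.3]".
# The name ends at the first colon, so values may contain colons.
KEY_FIELD = re.compile(r"\[([^:\]]+):([^\]]*)\]")


def parse_key(key):
    """Split a bracketed route key into a dict of its fields."""
    return dict(KEY_FIELD.findall(key))


def load_rib(path):
    """Load a dump written by `gobgp global rib -a evpn --json > out`."""
    with open(path) as f:
        return json.load(f)


def paths_by_source(rib, rd=None):
    """Count paths per "source-id", optionally for a single RD only.

    Assumes `rib` maps route keys to lists of path dicts, as in the
    snippet above. Paths without a "source-id" are counted as "unknown".
    """
    counts = Counter()
    for key, paths in rib.items():
        if rd is not None and parse_key(key).get("rd") != rd:
            continue
        for path in paths:
            counts[path.get("source-id", "unknown")] += 1
    return counts
```

Calling paths_by_source(load_rib("out"), rd="10.99.1.3:12") then gives a per-router count of who is advertising routes for that RD, which makes it easy to see whether one peer accounts for most of the stale entries.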

Investigating stale routes on Cumulus switches running FRR

Hopping onto the router with id 10.99.1.2, we see the following

# net show bgp evpn route rd 10.99.1.3:12 mac 00:08:a2:0d:dc:ab
BGP routing table entry for 10.99.1.3:12:[2]:[0]:[0]:[48]:[00:08:a2:0d:dc:ab]
Paths: (1 available, best #1)
  Advertised to non peer-group peers:
  swp7 swp9 xf0(xf0) xf1(xf1) xf2(xf2) xf3(xf3) xf4(xf4)
  Route [2]:[0]:[0]:[48]:[00:08:a2:0d:dc:ab] VNI 693
  64803
    10.99.1.3 from xf0(xf0) (10.99.1.3)
      Origin IGP, valid, external, bestpath-from-AS 64803, best
      Extended Community: RT:64803:693 ET:8
      AddPath ID: RX 0, TX 1452543
      Last update: Tue Apr 28 11:55:34 2020

which tells us that the route came from 10.99.1.3.

Hopping onto the router with id 10.99.1.3, we see the following

xf0:$ net show bgp evpn route rd 10.99.1.3:12 mac 00:08:a2:0d:dc:ab
% Network not in table 

which appears to mean that the router the route supposedly came from no longer has it, which is … odd.

On 10.99.1.2, doing a ‘hard clear’ of the BGP session with xf0 got rid of this particular stale route. Here xf0 is the name of the 10.99.1.3 router.

$ vtysh
% enable
% clear bgp l2vpn xf0

Alas, the stale routes turn out to be a known bug in the version of Cumulus we are running.

https://frrouting.slack.com/archives/C58SZTP39/p1588170762254000