A refresher: last week, I thought that Dance’s Etcd accesses were what was causing Etcd to “time out”, so to speak, so I implemented it using only a single key and using Minio as the actual configuration backend.
This week: I deployed the code (the code itself worked well), but it only kind of helped. I could get 4000 nodes working, but only if I manually stopped and started reconcilers at specific moments. After further investigation, it looks like a network bottleneck – past a certain number of nodes, the packet rate on the single infrapod server is so high that loopback pings on it show 100% loss. If loopback pings don’t work, I can’t imagine Etcd accesses working. I was able to get 5x1000-node experiments up and working, but 1x5000 wasn’t working, so I suspect it’s related to the big /16 of ARP traffic.
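For reference, the symptom was literally just `ping 127.0.0.1` reporting 100% loss while the packet counters on the server climbed. Below is a rough sketch of the kind of check I mean – not part of the deployed code, just reading the counters from `/proc/net/dev`; the interface name is a placeholder:

```python
import time

def packet_counts(iface: str) -> tuple[int, int]:
    """Return (rx_packets, tx_packets) for iface, parsed from /proc/net/dev."""
    with open("/proc/net/dev") as f:
        for line in f:
            name, sep, rest = line.partition(":")
            if sep and name.strip() == iface:
                fields = rest.split()
                # columns after the colon: rx bytes, rx packets, ... then tx bytes, tx packets, ...
                return int(fields[1]), int(fields[9])
    raise ValueError(f"interface {iface!r} not found")

def pps(iface: str, interval: float = 1.0) -> float:
    """Approximate combined rx+tx packets per second over a short interval."""
    rx0, tx0 = packet_counts(iface)
    time.sleep(interval)
    rx1, tx1 = packet_counts(iface)
    return ((rx1 - rx0) + (tx1 - tx0)) / interval

if __name__ == "__main__":
    # "eno1" is a placeholder; substitute the infrapod server's real interface.
    print(f"{pps('eno1'):.0f} packets/sec")
```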
To divide up the big ol’ /16, I think we can get further while still keeping all of the infrapods on the infrapod server, but I’m unsure whether we can reach 10000 nodes using only 1 infrapod server. Next week, I’d like to try Nx200 to see how high N can go, which would tell us whether the problem is ARP/BUM traffic scaling badly. If we can fit 10000 nodes as 50 realizations of 200 nodes each, then we’re probably fine with 1 infrapod server.
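For scale, a quick back-of-the-envelope comparison of the broadcast-domain sizes involved (plain arithmetic, not anything from the code):

```python
# A flat /16 is one huge ARP/BUM domain, while a /24 per subnet infrapod caps
# each broadcast domain near the ~200-node target.
flat_domain = 2 ** (32 - 16) - 2    # 65534 usable hosts in a single /16
per_subnet = 2 ** (32 - 24) - 2     # 254 usable hosts in each /24
realizations = 10000 // 200         # 50 realizations of 200 nodes = 10000 total
print(flat_domain, per_subnet, realizations)  # 65534 254 50
```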
As for how the hierarchy would be implemented, the first thing I thought of was:
- 1 “subnet” infrapod per ~200 nodes
- 1 “main” infrapod that routes to each subnet infrapod
So, this would look like:
172.30.(Infrapod #).(Node #).
With multiple facilities (which we’ll need for ADES eventually), it’d look like:
172.(30 + Facility #).(Infrapod #).(Node #),
with another infrapod router that can go to each mtz’s main facility infrapod.
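For concreteness, here’s a minimal sketch of what that address assignment could look like (the helper name and bounds are hypothetical – nothing like this exists in the code yet):

```python
# Purely illustrative sketch of the proposed addressing scheme.
def node_address(infrapod: int, node: int, facility: int = 0) -> str:
    """172.(30 + facility).(infrapod).(node); facility 0 is the single-facility case."""
    assert 0 <= facility <= 225 and 0 <= infrapod <= 255 and 1 <= node <= 254
    return f"172.{30 + facility}.{infrapod}.{node}"

# Node 7 behind subnet infrapod 3:
print(node_address(infrapod=3, node=7))               # 172.30.3.7
# Same node in facility 2 of a multi-facility mtz:
print(node_address(infrapod=3, node=7, facility=2))   # 172.32.3.7
```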
If infrapods can be anywhere, there could be an advantage to putting them on hypervisors or switches instead – then a node’s internet NAT path could be (node → hypervisor → infra switch → gateway), which could alleviate the Minio congestion that we see on Lighthouse when internet usage across the testbed is high (like when someone is provisioning their materialization). It’d also be useful for a single-node facility.
This is the simplest first thing I thought of. I wrote this to ask if someone else has a better idea (that isn’t too complicated; I don’t think I’d be able to implement a bug-free VRF network in 6 weeks).