10000-Node Scaling -- We're Going To Need a Hierarchical Infranet

A refresher: last week, I thought Dance's Etcd accesses were what was causing Etcd to time out, so I reimplemented the configuration store to use only a single Etcd key, with Minio as the actual configuration backend.

This week: I deployed the code (the code itself worked well), but it only partly helped. I could get 4000 nodes working, but only if I manually stopped and started reconcilers at specific moments. After further investigation, it looks like a network bottleneck: past a certain number of nodes, the packet rate on the single infrapod server is so high that loopback pings on it see 100% loss. If loopback pings don't work, I can't imagine Etcd accesses working. I was able to bring up 5x1000-node experiments, but 1x5000 wasn't working, so I suspect it's related to the ARP traffic on the big /16.

If we divide the big ol' /16, I think we can get further even with all of the infrapods on one infrapod server, but I'm unsure whether we can reach 10000 with only one. Next week, I'd like to try Nx200 to see how high N can go, which would tell us whether the problem is ARP/BUM traffic scaling badly. If we can fit 10000 nodes as 50 realizations of 200 nodes each, then we're probably fine with 1 infrapod server.
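To make the broadcast-domain arithmetic behind the Nx200 plan explicit (numbers are the ones from the text, nothing new):

```shell
# Back-of-the-envelope sizing: one flat /16 vs. N realizations of 200 nodes.
nodes=10000
per_subnet=200
flat_hosts=$(( 65536 - 2 ))                          # usable hosts in one flat /16
subnets=$(( (nodes + per_subnet - 1) / per_subnet )) # realizations needed

echo "realizations needed: $subnets"                 # -> 50
echo "flat /16 broadcast domain: $flat_hosts hosts"  # -> 65534
echo "per-realization broadcast domain: $per_subnet hosts"
```

Every node in the flat /16 sees ARP/BUM traffic from up to 65534 peers; after the split, each node only sees it from ~200.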

As for how the hierarchy would be implemented, the first thing I thought of was:

  • 1 “subnet” infrapod per ~200 nodes
  • 1 “main” infrapod that routes to each subnet infrapod

So, this would look like:
172.30.(Infrapod #).(Node #).

With multiple facilities (which we’ll need for ADES eventually), it’d look like:
172.(30 + Facility #).(Infrapod #).(Node #),
with another infrapod router that can go to each mtz’s main facility infrapod.
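Concretely, the mapping could be sketched like this (the helper name is made up; the octet layout is the one proposed above):

```shell
# Hypothetical helper: map (facility, infrapod, node) to the proposed
# hierarchical infranet address. Octet layout follows the plan above.
infranet_addr() {
    local facility=$1 infrapod=$2 node=$3
    printf '172.%d.%d.%d\n' "$((30 + facility))" "$infrapod" "$node"
}

infranet_addr 0 3 42   # facility 0, infrapod 3, node 42 -> 172.30.3.42
infranet_addr 1 3 42   # same node in facility 1         -> 172.31.3.42
```

One limitation worth noting: with one octet per level, this caps out at ~254 nodes per subnet infrapod and ~254 subnet infrapods per facility, which fits the ~200-node-per-infrapod plan.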

If infrapods can be anywhere, there could be an advantage to putting them on hypervisors or switches instead – then a node’s internet NAT path could be (node → hypervisor → infra switch → gateway), which could alleviate the Minio congestion we see on Lighthouse when internet usage across the testbed is high (like when someone is provisioning their materialization). It’d also be useful for a single-node facility.
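For illustration, the per-hypervisor NAT leg of that path would just be a standard masquerade rule; the interface name and subnet below are placeholders, not anything that exists today:

```shell
# Hypothetical NAT rule on a hypervisor: masquerade its local infranet
# nodes (one /24 from the hierarchy above) out the uplink, so internet
# traffic takes node -> hypervisor -> infra switch -> gateway instead
# of funneling through the single infrapod server.
iptables -t nat -A POSTROUTING -s 172.30.3.0/24 -o uplink0 -j MASQUERADE
```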

This is the simplest first thing I thought of. I wrote this to ask whether someone else has a better idea (that isn’t too complicated; I don’t think I’d be able to implement a bug-free VRF network in 6 weeks).

It is concerning that high traffic on the infrapod VTEP causes non-responsiveness on the loopback device. The fact that ARP floods on a VXLAN network can cause host services to die is a major performance-isolation problem. Breaking that VXLAN network into multiple networks with a router pod sounds like a nice architecture to me and will definitely reduce ARP traffic, so it’s definitely worth pursuing IMO. But I think we should also try to figure out how to provide better isolation on the infrapod hosts.

Have you tried tuning the sysctl values around neighbor-table garbage collection (gc_thresh1/2/3, gc_stale_time, etc.)?
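For reference, these are the knobs in question; the values below are illustrative guesses, not recommendations – they’d need to be sized against the actual neighbor-table occupancy on the infrapod server:

```shell
# Neighbor (ARP) table GC sysctls -- illustrative values only.
# gc_thresh3 is the hard cap on entries; above gc_thresh2 the kernel
# garbage-collects aggressively; below gc_thresh1 no GC runs at all.
sysctl -w net.ipv4.neigh.default.gc_thresh1=4096
sysctl -w net.ipv4.neigh.default.gc_thresh2=16384
sysctl -w net.ipv4.neigh.default.gc_thresh3=32768
sysctl -w net.ipv4.neigh.default.gc_stale_time=120

# Compare the thresholds against current occupancy:
ip -4 neigh show | wc -l
```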

Just an FYI: I believe we were originally, at minimum, going to use the VRF + subnet architecture to wall off some of the crazy ARP traffic on networks that spanned large subnets or multiple switches. The router architecture sounds generally like a good idea, as long as it can be serviced by someone who is unfamiliar with routers/routing or the Merge internals.