Wireguard and NAT traversal

Introduction

This post is for discussing the issue of NAT traversal for WireGuard. In the existing system, WireGuard is used to connect XDCs to facility gateways, which host infrapods. These infrapods have WireGuard interfaces that connect to the infranet - and thus to the experiment nodes. The XDCs have their own WireGuard interfaces, created at XDC attach time. This is a standard WireGuard connection.

The Merge portal hosts a WireGuard key exchange service. When a new WireGuard interface is created on an XDC or a facility, the private key stays on the local WireGuard interface and the public key is sent to the portal. The portal then distributes this peer information (public key, initial client IP address and port, and the allowed addresses for that client in the tunnel) to all entities that need it for the given materialization. If a facility key is given, all attached XDCs get the peer information. If an XDC key is given, all facilities in the materialization get the peer information.

Now this works well for all the reasons that WireGuard works well: minimal setup (a single key and peer-information exchange), client roaming, minimal attack surface. Once all peer information is distributed, WireGuard waits for an initial connection from either side, notes where the first authenticated packet comes from, and sets that as the response address and port for packets back to that client. The client does the same when the other side responds. So at least one end needs a routable endpoint for the initial connection. This breaks when both sides are NATted, as neither side has a well-known endpoint.
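For concreteness, here’s a rough sketch of what configuring one side of such a tunnel looks like with the wgctrl Go library. This is illustrative only, not the actual Merge code; the device name, addresses, and keys are placeholders. The point is that only the side that knows a routable endpoint for its peer sets Endpoint; the other side leaves it unset and learns it from the first authenticated packet.

```go
// Rough sketch (not the actual Merge code) of configuring one side of a
// standard WireGuard tunnel with golang.zx2c4.com/wireguard/wgctrl.
// Device name, addresses, and keys below are placeholders.
package main

import (
	"log"
	"net"
	"time"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

func main() {
	client, err := wgctrl.New()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	// In Merge this public key would come from the portal's key exchange;
	// here we generate one so the sketch is self-contained.
	peerPriv, err := wgtypes.GeneratePrivateKey()
	if err != nil {
		log.Fatal(err)
	}
	peerKey := peerPriv.PublicKey()

	keepalive := 25 * time.Second
	_, tunnelNet, _ := net.ParseCIDR("10.99.0.0/24") // allowed addresses inside the tunnel

	// The XDC (initiating) side sets Endpoint to the facility's well-known
	// address. The facility side would configure the same peer *without*
	// an Endpoint and learn where to reply from the first authenticated
	// packet -- which is exactly what breaks when neither side is routable.
	peer := wgtypes.PeerConfig{
		PublicKey:                   peerKey,
		Endpoint:                    &net.UDPAddr{IP: net.ParseIP("203.0.113.10"), Port: 51820},
		AllowedIPs:                  []net.IPNet{*tunnelNet},
		PersistentKeepaliveInterval: &keepalive, // keeps the NAT mapping open from behind a NAT
	}

	if err := client.ConfigureDevice("wg0", wgtypes.Config{Peers: []wgtypes.PeerConfig{peer}}); err != nil {
		log.Fatal(err)
	}
}
```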

This is generally fine for Merge as it exists today. The facilities all have well-known endpoints and the connection is driven by XDCs - XDCs make first contact. This means the facility knows how to route back to XDCs regardless of where they are.

But we are introducing bring-your-own-device (BYOD) to Merge. This is generally imagined to be a single node or small cluster that does not have a public or well-known endpoint. BYOD will, we think, use multi-facility functionality to materialize the device into an experiment that is mostly running on another facility. So two endpoints may be NATted - the XDCs and the BYOD. At least one of them will need to learn the other’s NATted endpoint to be able to make first contact.

So how to do that?

As this case is not thought to be that prevalent (is this true?), I think the solution should not disrupt the existing system too much. We do not want to re-architect a working system for an edge case that won’t happen too often, especially a user-facing part of the system, which the WireGuard service is. We want it to be as robust and simple as possible.

Chris’ Suggestion

Chris has suggested a hub and spoke: the portal will host one (or more?) wireguard-pods that will have well-known endpoints. All WireGuard traffic for a materialization (XDC <=> facilities) will flow through this pod or pods. The traffic between these pods and the XDCs will not be encrypted, so XDCs will not need WireGuard keys. Please let me know if this is not correct.

The upside of this approach is, first of all, that it solves the double NAT problem. And as a nice side effect, it simplifies key exchange, as each materialization will only need one key for the portal (the wireguard pod) and one key for each facility. The downsides of this approach (according to me) are: 1) a single point of failure: if the pod goes down or has network issues, all connections are broken; 2) key distribution doesn’t really need simplification, as it’s not a complex protocol; 3) it creates a dependency that doesn’t exist in WireGuard, which is very robust to changing network conditions, and introducing a hub into a peer-to-peer connection adds complexity and possible network latency; and 4) it breaks “local XDC” connections (XDCs that run on user machines, not hosted in the portal).

Chris has rightly said that if we do not go with this, well, we need to go with something. BYOD will generally be behind a NAT.

Another Suggestion

I don’t really have another suggestion. :slight_smile: Although after a very small bit of looking around I have found that others have come up with solutions for this. Or at least thought about it. This is not surprising as many people use wireguard and many people are behind NATs.

Here’s a write-up for one: WireGuard Endpoint Discovery and NAT Traversal using DNS-SD | Jordan Whited. First it talks about using STUN (Session Traversal Utilities for NAT) to punch through a NAT. That may work. Then it also suggests a NAT traversal broker that distributes peer addressing information. At first glance this looks like a pretty good solution. It could likely be integrated easily into the existing wireguard service on the portal, so there would be minimal new “things” needed. When a WireGuard connection is created, the client (facility) already gives its keying information to the portal as normal. It could, in addition, make a second call to this address broker. The broker would note the (NATted or not) endpoint for the facility. This endpoint would be added to the existing wireguard enclave information that is distributed, so all XDCs (and other facilities) would get the (NATted or not) endpoint of the facility. We’ll need to look into this a bit more, but on the face of it, it seems reasonable to me. This API call / broker only extends an existing system and integrates nicely. No real re-architecting is needed, and it does not route all traffic through a single point, so the robustness inherent in WireGuard is kept.
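To make the broker idea a bit more concrete, here’s a hypothetical sketch of the “note the endpoint” part. Everything in it (the port, the message format, the in-memory registry) is made up for illustration; the real version would feed the observed endpoint into the peer information the portal already distributes.

```go
// Hypothetical sketch of the "address broker" idea: a small UDP listener,
// colocated with the portal's wireguard service, that records the
// (possibly NATted) source address it observes for each client.
package main

import (
	"log"
	"net"
	"sync"
)

type endpointRegistry struct {
	mu        sync.Mutex
	endpoints map[string]*net.UDPAddr // keyed by wireguard public key sent by the client
}

func main() {
	reg := &endpointRegistry{endpoints: make(map[string]*net.UDPAddr)}

	conn, err := net.ListenUDP("udp", &net.UDPAddr{Port: 4600}) // hypothetical broker port
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	buf := make([]byte, 256)
	for {
		n, src, err := conn.ReadFromUDP(buf)
		if err != nil {
			log.Println(err)
			continue
		}
		pubkey := string(buf[:n]) // client sends its public key; src is its NATted-or-not endpoint

		reg.mu.Lock()
		reg.endpoints[pubkey] = src
		reg.mu.Unlock()

		// Echo the observed endpoint back so the client also learns how it
		// appears from the outside (roughly what a STUN binding does).
		if _, err := conn.WriteToUDP([]byte(src.String()), src); err != nil {
			log.Println(err)
		}
	}
}
```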

Edit: and if we can get the GRPC system to give us the packet used to send the public key, we will already have the (NATted or not) address for the facility, so no second API call would be needed. I’m not sure GRPC will give us that, though. Worth looking into anyway.
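For reference, if the key registration runs over gRPC in Go, the handler can read the caller’s apparent address from the request context with the standard google.golang.org/grpc/peer package. The service, message, and store names below are invented, and note that this yields the address of the TCP/HTTP2 connection as seen by the portal, which is not necessarily the same NAT mapping the WireGuard UDP traffic will get.

```go
// Sketch only: reading the caller's apparent address inside a gRPC handler.
// Types and names here stand in for the real portal wireguard service.
package wgkeys

import (
	"context"
	"log"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/peer"
	"google.golang.org/grpc/status"
)

// Placeholder request/response types standing in for the real protobuf messages.
type RegisterKeyRequest struct{ PublicKey string }
type RegisterKeyResponse struct{}

type endpointStore interface {
	SetObservedEndpoint(publicKey, endpoint string)
}

type wgKeyService struct{ store endpointStore }

// RegisterKey shows how a handler can recover the caller's apparent address
// from the request context via grpc/peer.
func (s *wgKeyService) RegisterKey(ctx context.Context, req *RegisterKeyRequest) (*RegisterKeyResponse, error) {
	p, ok := peer.FromContext(ctx)
	if !ok {
		return nil, status.Error(codes.Internal, "no peer information in context")
	}

	// p.Addr is the observed (NATted or not) source address of the caller's
	// connection -- not necessarily the mapping its wireguard UDP traffic gets.
	observed := p.Addr.String()
	log.Printf("public key %s registered from %s", req.PublicKey, observed)

	// Fold the observed endpoint into the peer info the portal distributes.
	s.store.SetObservedEndpoint(req.PublicKey, observed)
	return &RegisterKeyResponse{}, nil
}
```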

Anyway…thoughts?

Maybe all we need is a NAT hole punch: WireGuard NAT Traversal Made Easy » Nettica

Seems a little hacky but may be enough. I suppose a more robust solution would be better and more reliable. Just putting this one out there for discussion.

I think Tailscale and Headscale also support NAT punch-through. We could just move to that and get rid of the home-grown wireguard service we use now. Presumably the code is better (more rigorously tested) and better documented.

It may also be useful to know when a facility is NATted. Maybe we need a NATTED_GATEWAY role in xir to describe this. Then we can handle this case separately from a standard gateway.

A related question: how is a portal going to communicate to facilities behind NATs? Currently most comms between the two are portal initiated.

It’s important to note that we are now in a double NAT situation, with our pods (both our services and our XDCs) being behind a NAT (obviously, they don’t each have their own public IP). Most of the time, I’d say that you only deal with single NAT, which is a lot simpler to deal with (just connect to the public one), and this is currently a requirement for a facility – to have public endpoints for the infrapod servers. This is a burden on facility ops, even for us. I remember Lincoln recently set up the CPS testbed with the infrapod server originally behind a NAT. So, implementing NAT’d facilities means that it’s possible for NAT’d facilities to become the default case, rather than an edge case, going forward.

If you can configure your own router (which I can do at home), then it’s really easy to manually UDP hole punch / port forward through your NAT for your own WireGuard network, so you really only deal with a single NAT situation. But this gets a lot more complicated when we have dynamic WireGuard networks (as we do, since each mtz has at least one wireguard network) and when you cannot configure your router (as the CSU people cannot), so this is where we start going into the other methods.

STUN servers don’t do anything on their own (as linked to in the article); they just tell you your apparent external IP:port and what kind of NAT you have. Integrating one would basically mean periodically asking a STUN server what our eIP:ePort is, telling the portal what it is, hoping that works (it doesn’t for all NAT cases, since it’s possible for the IP address used to communicate with the portal to be different from the IP address used to talk to a peer, aka NAT type 4), and having the portal update every WireGuard config to use the periodically reported eIP:ePort. While this can work (in a complicated way), it’d still need to be supplemented with other methods.
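For what it’s worth, the STUN query step itself is small. Here’s a sketch assuming the pion/stun library and a public STUN server (neither of which we use today): it asks for the apparent eIP:ePort of a socket, which is what would have to be reported to the portal periodically, and which still doesn’t help for NAT type 4.

```go
// Rough sketch of the STUN step only, assuming github.com/pion/stun and a
// public STUN server. Not part of Merge today.
package main

import (
	"log"

	"github.com/pion/stun"
)

func main() {
	c, err := stun.Dial("udp", "stun.l.google.com:19302")
	if err != nil {
		log.Fatal(err)
	}
	defer c.Close()

	// A STUN binding request; the response carries the XOR-mapped
	// (external) address of this client as the server sees it.
	msg := stun.MustBuild(stun.TransactionID, stun.BindingRequest)
	if err := c.Do(msg, func(res stun.Event) {
		if res.Error != nil {
			log.Fatal(res.Error)
		}
		var xorAddr stun.XORMappedAddress
		if err := xorAddr.GetFrom(res.Message); err != nil {
			log.Fatal(err)
		}
		// This eIP:ePort is what would be reported to the portal periodically
		// so it can update the distributed peer info. For symmetric NAT
		// (type 4), a peer would still see a different mapping.
		log.Printf("external endpoint: %s:%d", xorAddr.IP, xorAddr.Port)
	}); err != nil {
		log.Fatal(err)
	}
}
```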

The other thing you see around is some sort of relay server (like TURN or DERP), which is pretty much what this implementation is: we have an endpoint accessible to both the NAT’d facility side and whatever NAT’d pod we want to connect to, and we bridge the two there. The main cost of a relay server is that if it is far away from both endpoints, you incur a higher bandwidth cost. Since the vast majority of our XDCs are in the portal, there really isn’t a bandwidth cost, since the relay server would be colocated with portal XDCs.

There’s also nothing stopping us from implementing p2p access for local XDCs, as there’s nothing stopping the infrapod server from continuing to host WireGuard connections, other than code complexity. Of course, you’d still probably use the relay pod in the portal for NAT’d facilities anyway.

There’s also nothing, other than code complexity, stopping you from running a separate relay server for each individual XDC (which is basically what the current system does) if you want to avoid “a single point of failure.” It’s important to note, though, that failures tend to be colocated, so dodging whether a particular pod is or is not working doesn’t mean much in the grand scheme of things if a portal worker itself is broken. How often have we encountered a case where just a single pod was broken and other pods were unaffected? It’s also important to note, as you’ve correctly stated, that it would be 1 wg-pod per mtz, meaning that if we do have a single pod failure, it’s broken for that mtz. For a lot of mtzs, this is already true, as a lot of mtzs only have 1 XDC attached to them to begin with.

We are not going to run Tailscale, the biggest reason being that it would not work in an airgapped environment, as Tailscale uses their own control servers for key/route distribution and the like. The open source control server implementation (Headscale) only supports 1 network, and we need 1 network per mtz, maybe more. Also, there’s the unknown complexity of integration/credentials/ACLs, cost, etc. Tailscale itself uses the previously mentioned methods to deal with NAT anyway. I also do not know if Tailscale was designed for very dynamic creation of networks, since most VPNs are “initialize once, add devices when you need to,” instead of our very dynamic setup.

Part of the reason for this design is dissatisfaction with the current robustness of XDC attachment. I do not accept “try detaching/reattaching your XDC” when, despite status and everything looking fine, the wireguard connection does not work. It was awkward explaining that to my classmates for projects, and it was even more awkward when working with someone on another project where I would have to manually do that myself whenever they encountered issues with their XDC (they did not have the competency to do it themselves).

While the current protocol is not that complicated (although we have a dance of keys that write other keys as a result), we still have a number of failure cases that are not currently handled well, and which do get complicated:

  • reboot of the portal
  • reboot of the infrapod server
  • reboot of the actual infrapod
  • desync of keys for whatever reason (how are we supposed to check and resync this? this can happen with connectivity issues between the portal and the facility, and it is much more likely to happen than general state instability, as you attach/detach and mat/demat much more frequently)
  • key exchange after a race condition of demat/remat
  • attachment before the XDC is initialized (which adds a janky hard status, with its own status issues when the status tree isn’t accurate)
  • wg interface deletion (which can cause aforementioned status issues)

Is the current design going to scale well with multi-facility? If any facility has an issue, is the answer going to be “detach/reattach”?

Considering that everything else for a facility is considerably more complicated (the coordination of networking, infrapods, VMs, etc.) and is somehow more robust (it’s pretty rare that demtz/mtz will solve your issues; usually there’s an underlying issue), XDC attachment should be at least as robust. It would also be really nice to use the reconciler package directly to set up wireguard tunnels, rather than indirectly through xdcd, podwatch, and the like.

We’re also on our 2nd (which was Bryan) or 3rd cleanup (I saw Yuri did some stuff in there; I don’t know the extent) of XDC attachment, and you have to start to wonder if it’s the design that’s actually complicated. Handling/testing for failure is how you get robustness, but it also gets complicated and expensive. It would just be really, really nice if we had a simpler design in the first place to avoid these failure conditions.

The idea is the same: have a wg-pod/node for a facility (probably just 1 pod/node/network for all NAT’d facilities). On the pod side, route all NAT’d facility traffic through that pod or node. On the facility side, just connect to the pod or node.

In some sense, this is a generalization of XDC attachment, where instead of supporting only XDCs, we support any arbitrary pod as well.
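To make the relay idea a bit more concrete, here’s a hypothetical sketch of what the wg-pod’s peer table could look like, again with the wgctrl Go library. The device name, keys, and prefixes are placeholders, and the actual deployment details are TBD as noted below. Each spoke (NAT’d facility, XDC/pod) is a peer with only its own in-tunnel addresses in AllowedIPs; no Endpoint is set because both spokes dial the relay’s well-known endpoint first, and IP forwarding on the pod lets traffic hairpin through it.

```go
// Hypothetical sketch of a relay ("wg-pod") peer table using wgctrl.
// Keys and prefixes are placeholders; in practice they come from the
// existing key exchange. IP forwarding must be enabled on the pod so
// spoke-to-spoke traffic can hairpin through the relay interface.
package main

import (
	"log"
	"net"

	"golang.zx2c4.com/wireguard/wgctrl"
	"golang.zx2c4.com/wireguard/wgctrl/wgtypes"
)

// mustPublicKey stands in for a public key received from the portal.
func mustPublicKey() wgtypes.Key {
	k, err := wgtypes.GeneratePrivateKey()
	if err != nil {
		log.Fatal(err)
	}
	return k.PublicKey()
}

func main() {
	client, err := wgctrl.New()
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	listenPort := 51820                                // the relay's well-known UDP port
	_, facilityNet, _ := net.ParseCIDR("10.99.1.0/24") // NAT'd facility's in-tunnel addresses
	_, xdcNet, _ := net.ParseCIDR("10.99.2.0/24")      // XDC/pod in-tunnel addresses

	cfg := wgtypes.Config{
		ListenPort:   &listenPort,
		ReplacePeers: true,
		// Neither spoke gets an Endpoint: both are behind NAT and make first
		// contact with the relay, which learns their mappings from the
		// authenticated packets. Keepalives would be configured on the spoke
		// side to hold those mappings open.
		Peers: []wgtypes.PeerConfig{
			{PublicKey: mustPublicKey(), AllowedIPs: []net.IPNet{*facilityNet}},
			{PublicKey: mustPublicKey(), AllowedIPs: []net.IPNet{*xdcNet}},
		},
	}

	// Assumes the relay interface already exists with its private key set.
	if err := client.ConfigureDevice("wg-relay0", cfg); err != nil {
		log.Fatal(err)
	}
}
```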

Of course, the specific details of how this would be implemented (like how the deployment is configured and how to actually do these things) are TBD, since I would need to get a working implementation first.

How is this double NAT with respect to wireguard? Only XDCs have wireguard connections. We may have that if we go with the WG pod idea, but we do not have that now.

Why and which methods?

Lincoln just assumed a NATted environment and had to update it to not NAT. How does this imply that facilities will be NATted going forward as a default case? Do you mean that if we implement WG pods, then this will be the default case?

How often have we encountered where just a single pod itself was broken and other pods were unaffected?

Nearly all XDC attach issues are this. A single XDC pod is in a state that breaks the attachment for some reason. I agree that this code needs to be made bulletproof; having multiple developers touch this code is not a recipe or guarantee for that. A full rewrite would be nice now that we have a better understanding of how things work and are used.

How is the current system a relay server? All connections are peer-to-peer; they are standard WG connections.

It was not clear to me given your description that this was the case. I do like this better as a single pod error only brings down the wg connections for a single mtz.

This is ok by me. I was just throwing things out there. @lincoln is a fan and has experience with it I think so maybe he can chip in here.

This may be true. In this proposed design we’d not need WG at the XDC, which is a win.

The design is simple. The implementation could use some work. There are a fair number of things that have to happen for the attach to work.

You mean with the absence of XDC WG interfaces, right? The wg pod would be reconciled. XDC networking would still need to be dynamically updated. I assume this needs to be done within the XDC, so xdcd would still need to be used. Podwatch will still be needed to handle the case of attached XDCs getting restarted.

Any of these cases is complicated and will be difficult to handle well, I’d say. Dynamic tunnels across multiple independent systems are not easy. I do think the current implementation handles some of these fine. There has not been a lot of testing, though.

Chris, in general my concern with this whole process is that there was no process. You made a post on MM, then at the end of the day you made another one saying you were going to implement this and deploy it over the break.

That is not an acceptable way to introduce new things into a system that everyone needs to maintain and triage when something breaks. And I think, in general, given the history, it will be me and Joe mostly needing to do triage when things go wrong. And they will go wrong, especially in the wireguard system. This is the most complicated user-facing part of the system. Everyone uses it for each experiment. It will break sometimes. So declaring you’re changing how it works without a review is not great.

…and run the GRPC traffic over this? Would these pods just always exist for each facility?

To answer the last two things, all I have for you right now is that the stage I’m in is trying things out. I’m not at the stage of writing code within the Merge services right now, just trying out potential k8s deployment/daemonset configurations with a virtualized k8s test environment (once I get one up and running), since the networking is detailed enough to require testing it that way.

(I’ll probably end up doing some variant of it on Merge itself, since I realized that I really have to take into account how we ingress into the portal itself, so I would feel uncomfortable testing only on Minikube, and we happen to have developed something that allows for this exact testing.)

There are a lot of potential implementations, though. The core idea remains: use a “relay” pod to allow for double NAT. I remain convinced that nothing else, like STUN servers or Tailscale, is really viable or as simple. Either way, this is not so much input to the process as an explanation of why those aren’t viable.

The devil is really in the details. The current implementation I’m playing around with might even allow us to (inefficiently) run wireguard over wireguard without any changes to the XDCs themselves, which is not something that I was expecting to be able to do (hence the need for the design to include how XDCs will connect to facilities). Maybe that will work, maybe it won’t, so I still need a plan on how to deal with XDCs. Even the previous design allowed for seamless backwards compatibility too, as in, doing this for NAT’d facilities only, or only applying the new changes to new XDCs. Maybe the code to allow that is simple. Maybe the code to allow that is a mess. I don’t know yet. This is only a couple of days later, too; there’s still a lot I have to figure out before committing to an actual implementation, which is what will actually determine what will change in terms of code, features, backwards compatibility, and so on. I’ll ask for input on these details of what you would all prefer as I get closer to an actual implementation, but I think I’m at the level of “MM post of what I’m thinking.”

As for what “deploy over the break” means: I had meant that if the actual implementation exists by then, and it would be better as a backwards-incompatible change, then we would include it in the upgrade we already have planned over the break. Not that I was literally going to edit the live portal to deploy it. I usually don’t touch the live portal unless things are absolutely on fire and need a hotfix.

At least historically, I have tried to maintain backwards compatibility (maybe too much, as you’ve cleaned up the old deprecated status APIs) and to avoid changing too much code.

The situation that we would be in with regards to NAT’d facilities is that we have NAT on two sides: the XDCs (and whatever pods we have) are behind a NAT, and, by definition, the NAT’d facility is behind a NAT. I pointed this out because now you’re looking at a very specific class of techniques used to address NAT, and at how we (and other people) have avoided dealing with it.

I explained in the parens, but NAT type 4 doesn’t work with STUN servers, due to the eIP:ePort being potentially different as presented to the STUN server and as presented to the thing you actually want to connect to.

As for method, it’s relay servers.

Specifically, if we allow for NAT’d facilities, it’s possible that, by default, facility operators will choose to run their facility behind a NAT because it’s simpler, easier, and more secure to set up (since you don’t have to expose anything publicly).

But the point was to point out that NAT’d facilities may or may not be an edge case after implementation.

Is it the pod itself, as in, the underlying container and network namespace? Or is it that the configuration of what the pod is supposed to be is messed up? Or is it that the pod failed to apply the configuration? If it’s the last two cases (it’s pretty rare that k8s itself screws up), then simplifying the configuration would help (either by having 1 configuration that should work or by only having to ever apply 1 configuration).

I assumed that the reason why we don’t use the normal reconciliation package method on an XDC directly is that, since it’s user facing, it probably shouldn’t be connected to etcd (otherwise maybe a user could change etcd).
At least for wg-pod, you could run a normal reconciler since users don’t have access to its internals. And if using a normal reconciler, you automatically have “on startup/ensure, reconcile what you need to do,” which would handle pod restarts.
xdcd is not a foregone conclusion either. I think it’s possible to set up route rules automatically on the host without touching the XDC itself, or within the XDC’s network namespace directly. I think it’s also possible to configure dance so that, if DNS requests are coming from an XDC, it provides the correct DNS search paths, so we wouldn’t have to update those either. I’d have to experiment, though.
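As a sketch of the kind of thing I mean, using the vishvananda/netlink and netns libraries (an assumption; none of this exists in Merge, and the netns path, interface name, and prefixes are placeholders): a netlink handle opened on the pod’s network namespace lets the host add routes inside the XDC without running anything in the XDC itself.

```go
// One possible mechanism for host-side route configuration of an XDC pod.
// Everything here is a placeholder sketch of the technique, not Merge code.
package main

import (
	"log"
	"net"

	"github.com/vishvananda/netlink"
	"github.com/vishvananda/netns"
)

func main() {
	// Network namespace of the XDC pod as exposed on the host
	// (path is hypothetical; k8s/containerd expose these differently).
	ns, err := netns.GetFromPath("/var/run/netns/xdc-example")
	if err != nil {
		log.Fatal(err)
	}
	defer ns.Close()

	// A netlink handle scoped to that namespace; route changes made
	// through it apply inside the XDC, not on the host.
	h, err := netlink.NewHandleAt(ns)
	if err != nil {
		log.Fatal(err)
	}
	defer h.Close()

	eth0, err := h.LinkByName("eth0") // the pod's primary interface
	if err != nil {
		log.Fatal(err)
	}

	_, infranet, _ := net.ParseCIDR("172.30.0.0/16") // example infranet prefix
	route := &netlink.Route{
		LinkIndex: eth0.Attrs().Index,
		Dst:       infranet,
		Gw:        net.ParseIP("10.244.0.1"), // example next hop toward the wg-pod
	}
	if err := h.RouteAdd(route); err != nil {
		log.Fatal(err)
	}
}
```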

Right, which is why I’d like to dodge at least some of them with a “cleaner” design.

Note, typing these and design thoughts up takes a lot of time to do well. I am a little bit stretched for time right now. So forgive me if, while in a rush to type things, I phrase things in a way that gets misunderstood.

I have to read this later, but since a portal must have a public IP and facilities do not always have a public IP, what was the reason we could not just host a facility-named WireGuard endpoint on the k8s cluster and run WireGuard as a “server” to relay/stitch things together, while also making the keying a little more robust and less prone to de-sync issues by just accepting that there will be a place that stores the keys or something?