Bringing SR/IOV device virtualization to Merge

bkocolos · February 3, 2023, 7:19pm

Our current Merge testbeds provide 2 methods for node and network synthesis:

user nodes mapped to bare metal machines with typically very high bandwidth network devices (40/100/200 Gbps)
user nodes mapped to virtual machines with software emulated devices (typically virtio-net-pci)

These two approaches represent the two extreme points on the resource efficiency vs. network performance design space. Bare metal machines deliver line rate network performance, but it is highly resource inefficient to use bare metal materializations, especially on the high resource dense hardware we’ve procured for ModDeter. On the other hand, VMs with virtio NICs allow us to achieve very high node utilization and support many concurrent materializations, but virtio networks are not high performant; they struggle to reliably transit much more than 1M packets per second, and performance is subject to many sources of overhead from the underlying platform given that virtio processing is very CPU intensive

SR/IOV provides an opportunity to achieve a more balanced tradeoff between testbed efficiency and network performance by subdividing the resources of modern NICs and assigning them directly to VMs. Device virtual functions (VFs) are exposed to the hypervisor OS as separate PCIe devices that can be directly assigned to VMs. This allows device drivers (e.g., mlx5) in guest operating systems to directly map device resources through the IOMMU, which completely removes the host OS network stack from the dataplane.

Linux based systems provide SR/IOV support through the vfio subsystem. The general process for mapping SR/IOV VFs into a guest VM is:

Enable VF in the BIOS. There are several things required:
- VT-d extensions enabled
- PCIe AER (advanced error reporting) enabled
- PCIe ACS (access control services) enabled
Possibly need to update device firmware to expose the maximum number of VFs
Configure IOMMU support in the host OS via kernel command line arguments
- Specific options may vary by chipset; however, typically this is something like:
  - iommu=pt
  - intel_iommu=on (Intel), or amd_iommu=on (AMD)
Unbind the native host device driver (e.g., mlx5) from the device VF
Bind the device VF to the vfio driver
(see some useful docs here: PCI passthrough via OVMF - ArchWiki)

To bring this capability to Merge, we need several pieces:

XIR needs a way to describe the number of VF devices supported by a device PF (physical function; i.e., a port)
The user-facing MX needs to be extended to allow users to request VF-based virtualization in their experiment models
Facility models need to declare capability and VF counts in their XIR models, along with any known performance implications (we can punt on the performance part to start; this needs detailed characterization)
The portal realization service needs to track how many VFs are currently allocated/available on each PF
Mariner needs to be updated to allocate vfio-pci devices, map them to the underlying PF(s) allocated for the VM, types and pass them through to the VM

We take these pieces one by one below

XIR extensions for VFIO

XIR’s NICModel already has a type for VF devices, and the user-facing NICModelConstraint is defined as well
We need a way to describe VFs in physical XIR Ports, so facility NICs can be modeled accurately
We also need a way to describe whether VLAN switch tagging (VST) is implemented by the NIC
- This should likely be a field of the xir.NIC
- However, we could also simply add the xir.NICModel to the xir.NIC type, and then use a helper function that automatically determines whether VST is implemented for each of the supported NICModels

MX extensions for VFIO

We need a new PortConstraint helper for MX
v0.3/mx/mergexp/constraint.py · 3402929ef27de7439a2b11909a1b29190cfc2e9b · MergeTB / xir · GitLab

Facility modeling

The XIR builder library (used for facility modeling) needs to be extended with new constructors to declare VF ports and whether their NICs support VST. These should capture the details required to determine how those ports are named and exposed to the hypervisor OS
- A relevant side note here: it would be great to make XIR smart enough to name devices based on their biosdevname which includes their address on the PCIe bus

Portal realization

The realization service provisioning code needs to map user requests for VF NICs (found via xir.NICSpec->Ports->Model) to underlying physical NICs with VF availability.
When found, the resulting xir.PortAllocation (found via xir.ResourceAllocation->NICs->Alloc->Alloc->]) needs to be updated to account for the VF allocation. xir.PortAllocation thus needs to be extended with a count for number of VFs
We already check whether a physical port has sufficient capacity to support the capacity constraint of the user: pkg/realize/sfe/provision.go · main · MergeTB / Portal / services · GitLab
- This needs to be extended to determine if the physical port has available VFs to support the allocation when VF is requested
The final piece of complexity here is setting up the VLAN aspects to ensure that packets that leave the VF and hit the testbed switch fabric are properly tagged. Ryan pointed out previously that some devices support a feature VLAN switch tagging (VST), whereby the NIC transparently tags egress packets so that they arrive tagged at the leaf switch. This allows the VF to behave like a VLAN access port. In situations where VST support is not present, we have a decision to make as to how to isolate traffic. It may be that we need to trunk the VLAN all the way from the leaf switch into the VM itself. This is undesirable, but similar to how we provision link isolation for bare metal materializations

Mariner

Mariner currently assumes that all NIC allocations are virtual. It needs to be updated to support VF based allocations
service/mariner/vm.go · fcba4af0d01b1171ff90bdafc9f10974d308a3ca · MergeTB / Facility / Mars · GitLab
- What this function is doing is translating the set of NIC requests from the user (as represented in spec *xir.NICsAllocation) to a set of virtual PCI devices and storing those devices in the topo PCINicTopology)
Mariner puts all virtio NICs on a bridge device. vfio-pci NICs are not placed on a bridge device
Mariner will also need to set VST IDs and enable MAC spoofchecking, when available:
- https://docs.nvidia.com/networking/pages/viewpage.action?pageId=19801751#SingleRootIOVirtualization(SRIOV)-VLANGuestTagging(VGT)andVLANSwitchTagging(VST)
- Possibly VST and spoofchecking should be done by canopy instead of Mariner.

yuri · February 13, 2023, 5:36pm

Links to related issues, will add more later:

xir

yuri · February 19, 2023, 6:36pm

sfe

yuri · February 24, 2023, 8:37pm

mars

yuri · March 13, 2023, 10:00pm

api

jdbarnes · April 26, 2024, 9:09pm

Hi @yuri
Did you ever look into using switchdev mode as supported by the mlx5_core driver?
I saw this recently and thought it was interesting.
https://docs.kernel.org/5.19/networking/device_drivers/ethernet/mellanox/mlx5.html#mlx5-subfunction
If we would rather move this to a new thread, that would be fine with me. Sorry for reviving a dead thread.
new link to the sriov docs where vgt/vgt+ and vst are listed.
https://docs.nvidia.com/networking/display/mlnxofedv590560125/single+root+io+virtualization+(sr-iov)

bkocolos · May 8, 2024, 3:34am

Yuri can reply, but we did not look closely at switchdev, despite the attractive benefits that it provides re: integration with canopy via netlink.