Converged infrastructure design for SPHERE facilities components

To simplify purchasing, rather than designing a server system for each individual role, we will specify one “best fit” system that can be adjusted if necessary (e.g. more disks) or scaled out.
My view is that the 12 memory channels of the standard Zen architecture motherboards are probably wide enough for most scale-out architectures. I think either 32 or 48 cores will be sufficient for most purposes (assuming we will scale OUT rather than UP). I am looking at the Zen5 CPUs, which have significantly faster clock, bus, and memory transfer speeds, so that the same system can be truly multi-purpose without any modified components.
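As a rough sanity check on the memory claim, the theoretical peak bandwidth can be worked out from the channel count and transfer rate. This is a minimal sketch and not a benchmark: it assumes a 64-bit (8-byte) data path per DDR5 channel and one DIMM per channel, and uses the 5600 MT/s and 192GB figures from the example configuration in this proposal. The function names are made up for illustration.

```python
def peak_memory_bandwidth_gb_s(channels: int, mt_per_s: int,
                               bytes_per_transfer: int = 8) -> float:
    """Theoretical peak bandwidth in GB/s (decimal units).

    channels * mega-transfers per second * bytes per transfer gives MB/s;
    dividing by 1000 converts to GB/s. Assumes 8 bytes per transfer
    (64-bit channel), which is an assumption, not a measured value.
    """
    return channels * mt_per_s * bytes_per_transfer / 1000


def dimm_size_gb(total_gb: int, channels: int,
                 dimms_per_channel: int = 1) -> float:
    """DIMM size needed to populate every channel evenly."""
    return total_gb / (channels * dimms_per_channel)


if __name__ == "__main__":
    # 12 channels at 5600 MT/s, 192GB total
    print(peak_memory_bandwidth_gb_s(12, 5600))
    print(dimm_size_gb(192, 12))
```

Under those assumptions this works out to roughly 537.6 GB/s theoretical peak per socket, with 16GB DIMMs filling all 12 channels. Sustained bandwidth will be lower in practice.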
An example:

  1. infrapod: 32/48-core Zen5, max clock 4.8GHz, 192GB memory at 5600 MT/s, 2x 7TB NVMe (raid1 + 3 partitions)
  2. storage: 32/48-core Zen5, max clock 4.8GHz, 192GB memory at 5600 MT/s, 2x M.2, 6 or 8x 7TB NVMe (raid6)
  3. emulator: 32/48-core Zen5, max clock 4.8GHz, 192GB memory at 5600 MT/s, 2x NVMe (raid1 + 3 partitions)
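To make the disk layouts above concrete, here is a small sketch of the raw usable capacity each RAID layout would give. It uses the standard rules (raid1 yields one drive's worth, raid6 yields the drive count minus two) and ignores filesystem and partitioning overhead. The helper name `usable_tb` is hypothetical, and the emulator is left out because its drive size is not specified.

```python
def usable_tb(level: str, drives: int, drive_tb: float) -> float:
    """Raw usable capacity in TB for a RAID set, before filesystem overhead."""
    if level == "raid1":
        if drives < 2:
            raise ValueError("raid1 needs at least 2 drives")
        # A mirror stores one copy per drive, so capacity is one drive.
        return drive_tb
    if level == "raid6":
        if drives < 4:
            raise ValueError("raid6 needs at least 4 drives")
        # Two drives' worth of space goes to parity.
        return (drives - 2) * drive_tb
    raise ValueError(f"unsupported RAID level: {level}")


if __name__ == "__main__":
    print("infrapod raid1, 2x 7TB:", usable_tb("raid1", 2, 7), "TB")
    print("storage raid6, 6x 7TB:", usable_tb("raid6", 6, 7), "TB")
    print("storage raid6, 8x 7TB:", usable_tb("raid6", 8, 7), "TB")
```

So the storage role would offer 28TB with 6 drives or 42TB with 8, which is the kind of "adjust if necessary" change the common chassis is meant to absorb.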

Any scaling would happen by ordering another node.
This means the network architecture needs some headroom designed in up front, in case a scale-out is required.
Scale-up is still possible by purchasing simple components if truly necessary (CPU, RAM, NVMe).

Options explored in the GPU facility design (external links may be broken):

If the purpose of this is to rubber-stamp something that can become any component, I think the Gigabyte is the best price and smallest size, and is still relatively flexible on OCP/PCIe cards. The 1U Supermicro is the most flexible chassis, but we must buy an AIOM to get a “management” network on copper. This is not a huge issue unless we need 3 types of NIC port (which we may not need for core infrastructure?).

https://www.amd.com/en/products/processors/server/epyc/9005-series.html#specifications (list of all 9005/Zen5 Turin CPUs)

Based on this table of Zen5 (9005) CPUs, I would select the 9375F for general use. With 32 cores, a max clock of 4.8GHz, and a 320W rating, it is sufficient for most workloads at a reasonable power envelope.

Caveat: this design may not apply to ops hosts specifically, as those have different/lesser requirements.

Please discuss. Feedback welcome.

https://gitlab.com/mergetb/ops/sphere-issues/-/issues/62 (gitlab issue for this design)