One of the problems we originally encountered with XDCs was that a user could easily saturate a portal server’s resources. This created a situation where XDCs colocated with a resource hog could not perform work due to lack of resources.
The solution to the problem was to introduce resource limits. However, this has come with its own issues.
Software running inside a container is generally not aware of the container's resource limits, so it can easily run into the OOM killer with no real way to self-police.
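As a rough sketch of what self-policing could look like: if the cgroup filesystem happens to be visible inside the XDC (an assumption I have not verified), a process could read its own memory limit from it. The two paths are the conventional cgroup v2 and v1 locations, and the helper names `parse_limit` and `read_memory_limit` are just illustrative.

```python
from pathlib import Path
from typing import Iterable, Optional

# Conventional cgroup mount points. Whether they are visible inside an XDC
# is an assumption, not something confirmed for the portal.
CGROUP_V2_LIMIT = Path("/sys/fs/cgroup/memory.max")
CGROUP_V1_LIMIT = Path("/sys/fs/cgroup/memory/memory.limit_in_bytes")

# cgroup v1 reports "no limit" as a very large number rather than a keyword.
UNLIMITED_THRESHOLD = 1 << 60


def parse_limit(raw: str) -> Optional[int]:
    """Return the limit in bytes, or None when the cgroup is unlimited."""
    value = raw.strip()
    if value == "max":  # cgroup v2 spelling of "unlimited"
        return None
    limit = int(value)
    return None if limit >= UNLIMITED_THRESHOLD else limit


def read_memory_limit(
    paths: Iterable[Path] = (CGROUP_V2_LIMIT, CGROUP_V1_LIMIT),
) -> Optional[int]:
    """Try each candidate file in order; None means no limit could be found."""
    for path in paths:
        try:
            return parse_limit(path.read_text())
        except (OSError, ValueError):
            continue  # file missing, unreadable, or not a number
    return None
```

Calling `read_memory_limit()` returns the limit in bytes, or `None` when the files are missing or the cgroup is unlimited, so a tool could fall back to conservative defaults in that case.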
Running Ansible playbooks is one of the primary things users do on XDCs, and a playbook configuring an inventory of dozens or hundreds of nodes can easily consume a significant amount of memory through forking. Forking is often required to get reasonable performance from a playbook.
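If the limit were known, a user could at least size the fork count to fit it. Here is a minimal sketch of that arithmetic; the per-fork memory figure and the reserve are made-up inputs that would have to be measured for a real playbook, and `max_forks` is a hypothetical helper, not part of Ansible.

```python
MIB = 1024 * 1024


def max_forks(
    memory_limit_bytes: int,
    per_fork_bytes: int,
    host_count: int,
    reserve_bytes: int = 512 * MIB,
) -> int:
    """Estimate how many Ansible forks fit under a memory limit.

    per_fork_bytes and reserve_bytes are assumptions the caller supplies;
    they are not values that Ansible or the XDC reports.
    """
    if per_fork_bytes <= 0:
        raise ValueError("per_fork_bytes must be positive")
    # Keep some memory back for the shell, the control process, and so on.
    usable = max(memory_limit_bytes - reserve_bytes, 0)
    by_memory = usable // per_fork_bytes
    # Never more forks than hosts, and always at least one so the run proceeds.
    return max(1, min(host_count, by_memory))
```

For example, a 4 GiB limit with an assumed 200 MiB per fork gives `max_forks(4096 * MIB, 200 * MIB, 100)`, and the result can be passed to `ansible-playbook --forks`.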
How many physical machines are there for the XDCs? Perhaps each team gets a dedicated physical system and then they only hurt themselves?
Or, if there isn’t enough hardware, can you impose limits across groups of containers, so that each team gets a set amount of memory across all of their XDCs?
Another thought is that when creating an experiment, the user specifies how much RAM they want and for how long. The XDCs would then have an expiration time so that the resources aren’t tied up forever.
It would also be really helpful if there were a way for users to be notified of an out-of-memory error. Normally I would tail the system log or look at dmesg, but neither of those options is available on the XDCs.
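One possible stopgap, sketched below, is to watch the cgroup's own event counters instead of the system log. This assumes cgroup v2 and that `/sys/fs/cgroup/memory.events` is readable from inside the XDC, neither of which I have confirmed; the function names are illustrative.

```python
from pathlib import Path
from typing import Dict, Optional

# cgroup v2 event counters for the current cgroup. Visibility of this file
# inside an XDC is an assumption.
MEMORY_EVENTS = Path("/sys/fs/cgroup/memory.events")


def parse_memory_events(raw: str) -> Dict[str, int]:
    """Parse 'name count' lines (e.g. 'oom_kill 2') into a dict."""
    events: Dict[str, int] = {}
    for line in raw.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1].isdigit():
            events[parts[0]] = int(parts[1])
    return events


def new_oom_kills(previous: int, raw: str) -> int:
    """Return how many OOM kills happened since the previous reading."""
    current = parse_memory_events(raw).get("oom_kill", 0)
    return max(current - previous, 0)


def read_oom_kill_count(path: Path = MEMORY_EVENTS) -> Optional[int]:
    """Return the cumulative oom_kill counter, or None if the file is unreadable."""
    try:
        return parse_memory_events(path.read_text()).get("oom_kill", 0)
    except OSError:
        return None
```

A small background loop could call `read_oom_kill_count()` every few seconds and print a message to the user's terminal whenever the counter increases.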