Links are in replies, since the forum software says that I’m a new user and new users are only allowed two links per post.
How allocation tables are implemented:
Start with a ResourceAllocation (#1)
This stores information about the host resource (id, facility, etc), the experiment node this is for (mzid, node_id), and the resources it consumes (cpu, memory, nics, disks).
Multiple of these can be aggregated together into a single object (#2)
In a portal, we can have a list of these ResourceAllocationList (#3, yeah, this is in the api, not xir)
Then, we have AllocationTables (#4) groups ResourceAllocations by the Resource’s ID (like buoy000, for example).
We store AllocationTables in Etcd by writing a bunch of ResourceAllocationLists into Etcd, grouped/keyed by the host resource (#5), the documentation at the top of the file is wrong, mzid is not included)
These are stored unaggregated, i.e., for each experiment node, we record that experiment node X takes its resources on host resource Y.
Later, we read the AllocationTable’s ResourceAllocations by at some point, going through each resource, and looking in the map for a ResourceAllocation that matches the resource ID
During the realization, we add things to the local in-memory allocation table, and afterwards, we write the whole thing to etcd
During relinquish, make a new AllocationTable, read the entire AllocationTable, and copy entires from the old table and the new table if they DO NOT have the rel’d mzid.
The current limitations this implementation:
Keys assume a single facility and isn’t trivially retrofittable for multiple facilities
Etcd really doesn’t like how big the Resource Allocation Lists are
Probably should use the merge internal storage apis instead of implementing this via “raw”
Pretty sure it’s the main reason why we can’t do realizations in parallel, as the realization algorithms assume that we have exclusive access to the entire AllocationTable (stored via many etcd keys).
Current Options:
The Etcd option:
To solve Etcd value size, instead of storing a ResourceAllocationList, store a single ResourceAllocation
This either means storing each experiment node’s ResourceAllocation individually or aggregating them in some way
To solve Etcd keying, store them by {mzid}/{facility}/{resource} or {mzid}/{facility}/{resource}/{node} or some way
Leaning towards {mzid}/{facility}/{resource} because I don’t think we need the experiment node information per allocation (we operate at the mzid level for realization/relinquish, not the node level, unless we want some partial mtz/rel capabilities) and from previous experience, I’m skeptical of storing many, many, many etcd keys
The Minio option:
As far as I’m aware, Minio is best at storing big things that aren’t changed very often after being written once, like experiment or facility models.
Otherwise, you have to implement some sort of parallel access control (probably via etcd) anyways.
As far as I’m aware, it’s a worse at storing a lot of small keys that are a part of 1 logical object (does it have common prefix read?), so this would tend towards a single big file that contains all of the allocation information
A single big file (for example, just writing 1 big ResourceAllocationList or 1 per mzid) would work, but I’m unsure on how robust it is in parallelism (like individual locks on a resource/mzid) or if you’re totally hosed if the single big file gets corrupted somehow
Might as well go to a proper database if you want a single file
The database option:
I don’t know if we have any “merge-ready” database choices or integration, so definitely a fair bit of research and implementation work.
Parallelism notes:
Our current heuristic-driven algorithm of “try nodes in a specified order and die if it doesn’t work” can be fairly easily parallelized if individual resources can be locked.
A more complicated algorithm may not be parallelized easily (like bonsai in v0.9 https://gitlab.com/mergetb/shadow/portal/services/-/blob/master/pkg/realize/bonsai.go , which is kind of like MST + rollback if any choice results in an non-existing MST), since you’re exploring the entire space of possible embeddings in a dynamic programming way, but I’m not sure how well it would work if the space of possible embeddings can change in the background while exploring it.
so resourceallocation (RA) and lists of resourceallocations (RAL) are generated during realization, and mapped into an allocationtable(AT)? or just copied in (since resourceallocations are keyed by host resource?)? – underneath a RZID (realization id) as a new key?
is the AT a constant thing for the whole facility/portal, or is an AT created for each RZ?
also please correct me if i am wrong, but “keying” is simply the act of placing things into etcd under a certain prefix? or does it entail something else as well?
I was double checking to make sure I understood your question – it also seemed semi-plausible that you were asking how things would work under the implementation that I’m most likely doing.
Each of those entries is a RAL and during execution, they’re aggregated into a map that looks like
# right before rlz
alloctable["buoy000"] = /alloc/resource/buoy000
alloctable["buoy001"] = /alloc/resource/buoy001
When saving the allocation table, since that’s not stored in etcd, the map is “disassembled” back into RAL’s and that’s what’s stored in Etcd.
None of these resource allocation stuff are ever keyed by RZID, but in the RA data structure itself, there’s an entry for RZID.
As a result of this implementation, depending on technicalities, you can see that there’s either 0 Allocation Tables (since it’s not directly stored in etcd) or 1 Allocation Table (since all of those entries are used to assemble 1 Allocation Table)
so it looks like facility is embedded in resources in the data structure what am i missing here? just that the AT will be humongous if there are over 1 facility worth of nodes/virtual nodes and require a lot of processing?
There is a field in the RA that notes which facility that it’s on.
The lack of using the facility as part of the key is part of the issue though, since it’s not checked right now. So, if nodes share the id, the rlz algorithm will consider things allocated on one of them to be allocated on both of them. For example, if on lighthouse/buoy000 is filled, then you can’t allocate anything on moddeter/buoy000.
ok so is the only real issue basically the node naming/namespace? is this issue something that would exhibit only when attempting to manage two facilities with the same node names or one facility with two portals?
i am not a fan of the allocationtable design right now just because it seems a little crazy to me to build a giant map of every node (across all facilities) and then mark how much resource is used in each one, especially as facilities are “merged” under the portal and that AT gets to be a bit larger, or multiple ATs need to be joined by a resource that’s a link of the approximate shape/size of the pipes between facilities.
is the AT sparsely populated (if only experiments using resource on buoy000/buoy001 are realized, is buoy002 in the AT?)
also, how does the AT account for VMs? does it use the concept of MTU or just assume “available cpu == can use the node for a VM” i am guessing there is some calculation to account for constraints of any of the resources documented in the model (cpu, ram, disk, net bw) and that if a constraint is specified that’s beyond the limit of a given node, it will begin allocating another node?
That’s not the only issue, but it is the most obvious one when having two facilities with non-unique node names. I don’t think we’re going to have a single facility managed by two portals. (The facility just does what it’s told to do by the portal, so if two portals have conflicting orders, that sounds complicated.)
It is sparsely populated, but I’m not sure how you would avoid writing down how much of a resource is used for each facility resource. It seems identical to me in total size if you have 1 giant allocation table or N allocation tables, one for each facility (and I dunno where to put inter-facility connections in that model). There are optimizations you can do, like not loading the entire table if you’re only operating on one facility, but at the end of the day, there’s 1 giant table. It doesn’t have to be stored in RAM necessarily, but I can’t really see a facility’s allocation table (if aggregated by resources used per host) being more than 50 MB. Of course, being in the >5 MB range can be an issue for Etcd, but that’s more of a function of Etcd. Also, not having everything relevant in RAM sounds like it would make realization trickier for more complicated algorithms.
Basically, if the experiment node can fit on the host (enough CPU, enough RAM, enough disk, etc. etc), that host is valid to allocate on. In theory, it’s kind of tricky because some of these decisions depend your placement of other nodes. In practice, with a largely homogenous testbed with large pipes between nodes, it’s not that big of a deal. But with multiple facilities, it will be something to revisit.
One thing I guess maybe I am missing the boat on is the “size” issue. Etcd stores everything in a path with a key and tiny value right? So is the problem that we are trying to read too much at once and not relying on parallelism or CPU processing (to construct the AT in ram) enough? If the issue is reads then maybe the solution is a read replica/follower or two that can be abused for reading?