We want to add the concept of future realizations to the Merge code.
Since realizations already include a duration, all we’d need to add to the realization API is a start argument.
We could also express this as a constraint in the model itself, like so:

from mergexp import *

net = Network('foo', addressing==ipv4, routing==static)
n1 = net.node('one')
n2 = net.node('two')
net.connect([n1, n2])

experiment(net, duration==weeks(2))

Here duration==weeks(2) (a hypothetical constraint and helper; neither exists in mergexp today, and the original note wrote it as duration==2w, which isn't valid Python) means the experiment needs to stay materialized for two weeks.
Whether the constraint is given in the model or in the API call (mrg realize --duration=2w ...), the realization engine would add it to the list of constraints when realizing.
We would like to start with just node availability, with other constraints added later. So the realization engine only checks that nodes are available at the requested time. If other constraints fail at actual realization time, users would need to contact ops to see if they can free the missing resources so the user can re-realize.
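At this level, the node-availability check is just interval overlap: a node can be promised for a requested window only if no existing reservation on it intersects that window. A minimal sketch of that logic in Python (names here are illustrative, not actual Merge API):

```python
from datetime import datetime, timedelta

def overlaps(a_start, a_end, b_start, b_end):
    # Half-open intervals [start, end) overlap iff each starts before
    # the other ends.
    return a_start < b_end and b_start < a_end

def node_free(reservations, start, duration):
    # reservations: (start, end) windows already booked on this node
    end = start + duration
    return not any(overlaps(start, end, s, e) for s, e in reservations)

# Example: a node booked Monday through Thursday; Friday is still free.
mon = datetime(2024, 1, 1)
booked = [(mon, mon + timedelta(days=4))]
```

Using half-open windows means back-to-back reservations (one ending exactly when the next starts) don't conflict, which is what you want for a calendar.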
Would the portal auto-realize at the given time? Not sure. I think it should be user-driven. What if the user decides they no longer need the resources? If the portal auto-realizes, we'd be tying up resources that no one is using.
How the realization engine satisfies this new constraint is left as an exercise for the reader. (Please show all work and include in the comments.)
We would also add an API that exposes current and future resource reservations, so we can add a calendar of sorts to launch and users can see, more or less, when the facilities are in use.
Option 1: Explicit selection of a time window by a user (subject to some constraints) for the particular set of resources needed for their realization, using a shared calendar. During this window the user can realize models using these resources, and the resources are guaranteed to be available. This might look like:
a) I mark on the calendar that next week I'm using GPUs 1, 2, and 5 from Monday to Thursday.
b) Come Monday, I’m notified that my reservation is available and I can start using it (or maybe better, but more work: a model that I submitted during reservation is automatically realized with a given expiration date).
c) On Thursday my realization expires (with a chance to extend it, if nobody else has applied for Friday).
Option 2: Some kind of scheduler that you submit your models to; it uses constraint satisfaction to find when it can realize them. If you're OK with the proposed time window, it queues your model and realizes it automatically at that time, or possibly sooner if resources become available. For this to work, everything needs to go through the scheduler, and it must make certain that no concurrently realized model is starved of anything. This may or may not use a clever constraint-satisfaction algorithm; we can continue using a greedy strategy + FIFO.
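The greedy + FIFO strategy can be sketched without any clever CSP: walk the queue in order and give each model the earliest window in which its node count fits. A toy sketch (plain node counts only; all names are invented, and real constraints are far richer):

```python
from datetime import datetime, timedelta

def usage_at(reservations, t):
    # Total nodes in use at instant t; reservations are (start, end, count).
    return sum(c for s, e, c in reservations if s <= t < e)

def fits(reservations, total, start, end, count):
    # Usage only changes at reservation starts, so checking the window's
    # start plus every reservation start inside it covers the whole window.
    points = [start] + [s for s, e, c in reservations if start < s < end]
    return all(usage_at(reservations, p) + count <= total for p in points)

def earliest_start(reservations, total, count, duration, now):
    # Greedy: usage only drops when a reservation ends, so the earliest
    # feasible start is either "now" or some reservation's end time.
    for t in sorted({now} | {e for s, e, c in reservations if e > now}):
        if fits(reservations, total, t, t + duration, count):
            return t

# Example: 3 nodes total, 2 of them booked for the first four days.
now = datetime(2024, 1, 1)
booked = [(now, now + timedelta(days=4), 2)]
```

A one-node request fits immediately; a two-node request gets pushed to the end of the existing booking. The FIFO part is just applying this to each queued model in submission order, charging each accepted window against the table before moving on.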
The first option seems a bit simpler to do while the second might be a longer term goal. I think portal/facility changes needed for #1 are also needed for #2 (except for GUI).
Update: Juice is a great concept, but it sounds like you cannot self-host the controller, so while they do have a free version, I don't think we can use it in MERGE.
Original message:
Also, since this is motivated by GPUs, this link may be of interest: Welcome to Juice | Juice
(although CPS’s will likely also be a point of contention)
If I were to implement option 1, I think the easiest thing to do (both in the short and long term) would be to update realizations with a start time.
Since allocations (proto/portal/v1/realize_types.proto · main · MergeTB / api · GitLab) are already keyed per realization, the first thing I would do is implement GetResources(start, finish time.Time) as an API call. This would get you all allocations from realizations whose start time is between the two time points. This is, of course, pretty easy to check, since it's literally start <= realization.start <= finish, somewhere around pkg/realize/alloc.go · main · GitLab: you just filter each element of the resource allocation lists by whether its realization falls within that time period.
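In Python terms (the real code would be Go in pkg/realize/alloc.go, and these record shapes are invented for illustration), the filter really is a one-liner:

```python
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Realization:
    name: str
    start: datetime  # the proposed new field

@dataclass
class Allocation:
    realization: Realization
    resource: str

def get_resources(allocs, start, finish):
    # Keep allocations whose owning realization starts inside the window:
    # literally start <= realization.start <= finish.
    return [a for a in allocs if start <= a.realization.start <= finish]

# Example: two realizations, only one starting inside the queried week.
jan = datetime(2024, 1, 1)
feb = datetime(2024, 2, 1)
allocs = [
    Allocation(Realization("expA", jan), "node1"),
    Allocation(Realization("expB", feb), "node2"),
]
```

(One open question this glosses over: filtering on a realization's *start* alone misses long-running realizations that started before the window but are still holding resources inside it; a fuller check would compare the window against [start, start+duration).)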
This would allow users themselves to check and “estimate” if there’s enough resources at a specific time. Automatic scheduling (aka, realize at the first available time) can be punted for later.
With that function, you can just pass in a different AllocationTable during embedding and it would 99% work “as is” without a lot of modifications.
The two main exceptions are that vlan/vxlan allocation tables are stored somewhere else, and emulation/infrapod node use is effectively hard-coded. Both would need to be associated per realization so that filtering by time period works, but otherwise they work on the same principle.
I’d leave updating the “constraint solver” as something to do in the future.
There's no such thing as "automatic" realization here, since everything is already done and accounted for: assuming the current time is between its start and end time, you should just be able to materialize it; it's a complete realization already.
Surprisingly, I think this is less code than manually and separately keeping track of node request use, as Yuri suggested.
I also want to say that opportunistic realizations would be useful too. Future realizations guarantee that you'll have the resources at a specific time, while queued opportunistic realizations would grab resources as soon as they happen to free up.
Consider that we have a node reserved for a week, with another reservation the following week. Now suppose the first user finishes their work early and relinquishes early: there's now a free period which did not exist before.
Opportunistic realization here means that when something is relinquished, we try to realize each item in the queue; whatever now fits, we realize and send out an email.
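Sketched in Python (try_realize and notify stand in for the real engine and mailer, which don't exist in this form), the relinquish hook is a single pass over the FIFO queue:

```python
from collections import deque

def on_relinquish(queue, try_realize, notify):
    # Resources just freed up: walk the queue once; anything that now
    # fits is realized immediately and its owner is emailed.
    still_waiting = deque()
    while queue:
        req = queue.popleft()
        if try_realize(req):
            notify(req)          # e.g. the "your realization is up" email
        else:
            still_waiting.append(req)
    queue.extend(still_waiting)  # failed requests keep their FIFO order

# Toy run: 2 GPUs just freed; first request takes both, second must wait.
free = {"gpu": 2}
def try_realize(req):
    if req["gpu"] <= free["gpu"]:
        free["gpu"] -= req["gpu"]
        return True
    return False

notified = []
queue = deque([{"id": "exp-a", "gpu": 2}, {"id": "exp-b", "gpu": 1}])
on_relinquish(queue, try_realize, lambda r: notified.append(r["id"]))
```

Note the design choice baked in here: a failed request doesn't block the rest of the queue, so a small request behind a large one can still slip through (classic backfill trade-off: better utilization, but large requests can wait longer).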
Future reservations are probably better when you have a "large" experiment that you want up or you want guaranteed access within a window, while opportunistic realizations are probably better if you have something small that you want to do quickly.
If we have Merge CI, this would be very powerful, as users could automatically run and relinquish their experiment when time permits for nodes with contention, like GPU nodes.
Oh yeah, I meant to mention this during the meeting: a realization queuing mode. Instead of a start date, start=queue. Then we mail the user when the realization is actually applied. Ideally we'd also have some estimation API around the reservation queue: estimated time to realization, number in queue, etc.
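A naive first cut of that estimation API can be computed from the queue alone: position is just the index, and a rough ETA assumes the requests ahead of you run one at a time for their requested durations (a crude serial-execution assumption; all names here are invented):

```python
from datetime import datetime, timedelta

def queue_estimate(queue, req_id, now):
    # Position in queue plus a serial-execution ETA. Real contention and
    # parallelism would make this estimate rough at best.
    ahead = []
    for r in queue:
        if r["id"] == req_id:
            break
        ahead.append(r)
    eta = now + sum((r["duration"] for r in ahead), timedelta())
    return {"position": len(ahead) + 1, "estimated_start": eta}

# Example: three queued requests; "c" is third in line.
now = datetime(2024, 1, 1)
queue = [
    {"id": "a", "duration": timedelta(days=2)},
    {"id": "b", "duration": timedelta(days=1)},
    {"id": "c", "duration": timedelta(days=3)},
]
```

Since opportunistic realization means requests can also jump ahead when resources free up, this is an upper-bound-ish guess, which is probably fine for a "number in queue / estimated wait" display.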