Facilitating large transactions through etcd

bkocolos · February 5, 2024, 10:19pm

The etcd API provides a notion of a Transaction to facilitate use in distributed applications. Transactions provide several key features:

They allow updates to multiple keys to be performed atomically - e.g., assume there are two keys, A and B, which have values a0 and b0. If the transaction updates those values to a1 and b1 respectively, a reader that attempts to read A and B concurrently will either see values a0/b0 or a1/b1 - i.e., never a combination of old and new values
They allow for optimistic concurrency control (OCC) through If/Then/Else constructs, which obviate the need for explicit locks around concurrent updates to the same key space.

One place in which Merge uses transactions is on the facility when handling materialize API requests. When the apiserver receives a request, it constructs a transaction consisting of updates to the set of keys that are watched by the various site reconcilers. The transactional semantics here are especially important; i.e., it is assumed that if a watched key is updated, the reconciler that wakes up to handle it may assume that all other aspects of the etcd key space for that materialization have also been updated.

However, there is a challenge we need to deal with, which is that etcd places a limit on the number of keys that can be involved in a transaction. We can increase the limit by some amount, but in doing so we will eventually push it into an operating range that etcd simply cannot deal with.

Chris has a recent MR that works around this limit by splitting transactions - i.e., breaking the key-space into smaller segments and enforcing transactional semantics within (but not across) those individual segments. This gets around bulk txn limits, but breaks the full-txn atomic property that we rely on with our use of transactions.

Thus, we have 2 options: (1) remove our reliance on transactional semantics, or (2) find some other way to implement transactions without relying on etcd.

My initial preference is for (2). One way to achieve this is with reader-writer locks – i.e., moving away from OCC in favor of explicit locking. Etcd provides read-write locks as an experimental feature: recipe package - go.etcd.io/etcd/client/v3/experimental/recipes - Go Packages

The basic idea is as follows:

All txn writers must acquire a write-lock. Once acquired, they can non-atomically write to their key space (i.e., through a split transaction, or any other means). After all keys of the transaction are written, they release
All txn readers must acquire a read-lock. Once acquired, they can read anywhere in the transactions’ key space, and transactional semantics are ensured. Once they are finished reading, they release

Write locks are mutually exclusive with writes and reads, while read locks are only mutually exclusive with writes. That is, any number of concurrent reads are permitted, as long as there is no write. This aligns well with the read/write mode that we typically use in Merge, which is:

apiserver: single writer
reconcilers: many readers

It may also be that option (1) is preferable – do not build reconcilers to expect any sort of transactional semantics. This may be more in line with the idea of the reconciler model, which is that each reconciler only cares about state that is keyed under the prefix it is watching. We’d need some code updates to achieve this, because I know of at least one situation where a reconciler assumes that some keys outside of its immediate keyspace exist, but it might ultimately be the better option

christra · February 5, 2024, 10:55pm

How global is the lock for the RWMutex you are envisioning?
As in, is there a single RWMutex for etcd and you have to opt out? (Like status reads/writes lacking transactional semantics to begin with.) Or something on the level of per mtz?

jdbarnes · February 5, 2024, 11:07pm

wait are we already updating over 128 keys at a time for a single operation? what is causing that?

jdbarnes · February 5, 2024, 11:11pm

after just stupidly clicking through, am I correct in my assumption that this is specifically targeted at prefixes & keys underneath specific prefixes (https://github.com/etcd-io/etcd/blob/client/v3.5.12/client/v3/experimental/recipes/rwmutex.go#L25) ? if so, then hopefully our large transactions are under a specific prefix?

bkocolos · February 5, 2024, 11:28pm

The number of keys involved in a txn is roughly proportional to how many nodes and links are in your mtz. So larger mtzs can run into this issue

bkocolos · February 5, 2024, 11:29pm

Ideally yes. However currently I don’t think we’re doing any grouping to facilitate this. We could though – e.g., all keys pertaining to a specific mtz could be prefixed by the mtz

bkocolos · February 5, 2024, 11:30pm

I guess the right answer would be mtz-prefixed RWMutex. So, /materialization/<mzid> possibly. This may require some re-working of how we configure the keyspace

christra · February 6, 2024, 12:00am

I think the implementation uses a key prefix (to keep track of the different readers and writers), but the actual lock key name doesn’t have anything to do with the actual keys that you write.

jdbarnes · February 6, 2024, 12:02am

if the “thing” we are updating has so many subkeys under it that we have to break up “edits” into locked transactions to maintain atomicity, why are we not keying a ‘version’ into this so we can just point at the correct ‘version’ with infinite subkeys or something? i might be oversimplifying this in my head i suppose, but let me try to re-state the problem and check me.

“Thing M” has 129 associated keys and versions Mv0 and Mv1. at minimum we need 2 transactions to update M, T1 and T2, but of a group of reconcilers R0…Rn, R129 reads a partial update between T1 and T2 where key 129 is “wrong”, so now we have some kind of intermediary broken state where R129 is on Mv0 and the rest of R0…Rn are on Mv1…

So, I guess what is the problem if R129 will eventually re-read key 129 and update to Mv1?

If this is actually a problem because are reconciler machine does not truly reconcile every component cleanly without regard to previous state, then maybe we should make some kind of prefix version grouping like Mv0/keys[129] and Mv1/keys[129] and we should point the reconcilers to the correct one once “all udpates are done to the new prefix” rather than locking up a bunch of DB operations?

I guess the right answer would be mtz-prefixed RWMutex. So, /materialization/ possibly. This may require some re-working of how we configure the keyspace

I think somehow using 2 revision prefixes and switching a pointer to the “updated one” on a finished update would probably work here without doing locks.

christra · February 6, 2024, 12:07am

I think you could just use a single global lock:

All reads take a global read lock
All segmented writes take the global write lock
Single segmented writes can take a global read lock, I think these can be done concurrently with reads (due to txn atomically already)

So segmented writes halt the system, but I think that’s fine because there’s not a whole lot of them that need transactional semantics anyways (read: mtz write).

bkocolos · February 6, 2024, 12:23am

In your example, the problem is that not all reconcilers are keyed on key 129. e.g., R0 might have been keyed on key 0, but then subsequently also reads 129 (because it implicitly assumed that every key written as part of thing M was written atomically. Given it doesn’t watch key 129, it won’t see the eventual update.

I realize now that it may be poor design to have reconcilers depend on keys they don’t watch. Maybe the solution here is to reorganize the key space and make sure all needed state is available under a single prefix for each reconciler. That may mean duplicating some state in cases where multiple different reconcilers need the same information, but that may be a reasonable tradeoff to make

christra · February 6, 2024, 12:31am

I would agree that having “all info needed to reconcile” in the key itself is generally preferable (assuming it fits in etcd).

I remember doing similar things with storing “to be reconciled” stuff in minio, but having external data means that someone’s responsible for cleaning it up, which can get sketchy with offline reconcilers and you want to ensure that the previous data is available for use for offline reconcilers.

jdbarnes · February 6, 2024, 12:32am

if the reconciler is dependent on key 0 and key 129 but only watches key 0, that seems to be an issue… it should probably be watching both or tracking both somehow … but i guess if the reconciler is only aware of the transaction number in etcd, it would have no concept of which is the ‘latest’ or ‘correct’ when 1 of them updates and the other does not unless you embed that information in the key’s value (or the key location/name).

in my own mind, the point of a reconciler architecture is that it essentially should not depend on a previous state to function. there should be a defined state that is reached (eventually) and if there is some kind of time constraint for the eventuality, that has to be optimized somehow independent of the architecture.

i do have a feeling i am beginning to understand why k8s uses a randomized string suffix on entities though.