2. State-based CRDT
Date: 2025-02-27
Status
Accepted
Context
When distributing state within the cluster, we need a deterministic outcome when merging remote state with local state, regardless of the order of operations. For this, we will require some form of Conflict-free Replicated Data Type (CRDT).
CRDTs fall within one of two categories:
- state-based
- operation-based
Both rely upon a merge function used to reconcile remote state with local state. It is required to be associative, commutative and idempotent, ensuring an eventually consistent outcome.
Where they differ is how they replicate state changes.
State-based CRDTs are straightforward but bandwidth heavy, requiring us to ship the entirety of the local state within a node for merging with the local state of another node.
Operation-based CRDTs on the other hand are more sophisticated, relying upon an operation log that captures the changes being made to local state. It is the operations themselves which are replicated and then applied remotely.
See here for more information on CRDTs and their implementation.
Decision
We have decided to implement a state-based CRDT as it is simpler to implement and easier to reason about.
An operation-based CRDT would require more complicated logic to keep the log history a manageable size, involving constructs such as vector clocks as well as a more complex replication protocol.
Consequences
We will have to ship more state across the network. However, Data Mesher is not intended to store large amounts of data, but instead things such as hostnames and application settings.
Therefore, the increased bandwidth is considered an acceptable trade-off.
The frequency with which we exchange state between nodes will have to be balanced against the size of the state. This will affect the speed with which changes propagate and determine how quickly the cluster converges.