Strong Consistency Claims in Distributed Graph Databases

When is strongly consistent not strongly consistent? When it’s not consistently used.

Motivation

For several years, I’ve been working with Paul Ezhilchelvan, Isi Mitrani, and Jack Waudby from Newcastle University on consistency models specifically for graph databases, trying to understand whether consistency models for graph databases differ from those for other database types and, if so, how we can preserve correctness while remaining useful for demanding production systems. This post is a summary of the work we’ve done, written for a technical generalist audience, and I hope you enjoy reading it.

The work has been both fun and fruitful. Our focus has been on edge-distributed graph databases. In this kind of system, each node (or vertex) resides on a single shard, while relationships (or edges) can connect nodes both within that shard and across shards over the network. This results in a model where no denormalisation of the data model is necessary.

Nodes representing Karl and Rosa are on different servers, and there is a FOLLOWS relationship/edge between them which crosses the network.

During our collaboration, we’ve identified an important safety property called Reciprocal Consistency. Put simply, if the record that stores the Karl node on Server 1 has an entry for an outgoing FOLLOWS relationship to the Rosa node on Server 2, then the record representing the Rosa node should have an entry for an incoming FOLLOWS relationship from the Karl node. This stems from the labelled property graph model, where relationships can be traversed in either direction. Generally, the nodes at the head and tail of a relationship hold reciprocal information, and so must be reciprocally consistent.
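
To make the invariant concrete, here is a minimal sketch in Python. The record layout (each node record keeping its own view of incoming and outgoing relationship entries) is a simplifying assumption for illustration, not how any particular storage engine lays out its data.

```python
from dataclasses import dataclass, field


@dataclass
class NodeRecord:
    node_id: str
    shard: int
    outgoing: set = field(default_factory=set)  # entries of (rel_type, target_node_id)
    incoming: set = field(default_factory=set)  # entries of (rel_type, source_node_id)


def reciprocally_consistent(a: NodeRecord, b: NodeRecord) -> bool:
    """True when every half-edge between a and b is mirrored on the other record."""
    out_a = {t for (t, dst) in a.outgoing if dst == b.node_id}
    in_b = {t for (t, src) in b.incoming if src == a.node_id}
    out_b = {t for (t, dst) in b.outgoing if dst == a.node_id}
    in_a = {t for (t, src) in a.incoming if src == b.node_id}
    return out_a == in_b and out_b == in_a


# Karl on Server 1 follows Rosa on Server 2: both halves of the edge exist.
karl = NodeRecord("karl", shard=1, outgoing={("FOLLOWS", "rosa")})
rosa = NodeRecord("rosa", shard=2, incoming={("FOLLOWS", "karl")})
assert reciprocally_consistent(karl, rosa)

# Lose the incoming half on Rosa's record and the invariant is violated.
rosa.incoming.clear()
assert not reciprocally_consistent(karl, rosa)
```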


If you’re not familiar with graphs, an analogy is to think of foreign key relationships in relational databases. Foreign key constraints are used to prevent actions that would destroy links between tables and render the model incorrect. A foreign key constraint enforces integrity between a parent table and a child table: the foreign key column in the child table must reference an existing value in the parent table, otherwise the constraint is violated (and the transaction causing the update will be rejected). The difference with reciprocal consistency for graphs is that both records are, in effect, parent and child to each other, ensuring the integrity of the relationship between them.


Maintaining reciprocal consistency isn’t well addressed in the literature. Some distributed relational databases, like YugabyteDB, preserve foreign key constraints by using strongly consistent (serializable) transaction protocols only where strictly necessary for correctness and using less expensive (snapshot) protocols everywhere else.

In the case of graph databases, the equivalent of foreign key constraints isn’t a special case. Relationships in a graph are plentiful, and the chance of any individual relationship being distributed across servers (or shards) is around 30% even with a good partitioning algorithm like Hugo Firth’s Taper.

To preserve correctness in an edge-distributed graph database we could choose to fall back to a conservative transaction protocol (like two-phase commit), but we know that, as a blocking protocol, it will harm performance. If we use cheaper models like eventual consistency, we might lose reciprocal consistency. In fact, we found that eventually-consistent edge-distributed graph databases will corrupt data under normal operation, to the extent that a large, busy production database may be rendered useless within a matter of months.

But this article isn’t really about eventual consistency. Although we started with eventual consistency in mind, our enquiries led to a more fundamental question: if eventual consistency leads to corruption, are so-called strongly consistent systems immune from corruption for edge-distributed graph databases? Given that question, we’ve lately focussed on system designs that claim to be strongly consistent (using words like ACID) but which nonetheless fail to uphold reciprocal consistency, causing corruption in normal (no-fault) operation.

Corruption by design is a shocking (and fascinating) result in a strongly consistent system. I’ll dig much deeper into the root cause in a moment. But first a little history as to how we found ourselves here.

The trouble with eventually-consistent graph databases

A few years ago, when the (now abandoned) TitanDB was launched, I recall thinking that its design was problematic from both a safety and a performance viewpoint. For performance, the problems were around locality: TitanDB spreads bits of the graph across servers to boost write performance, with little regard to how the data will be queried - the expectation is that most queries will have to cross servers to complete, adding complexity and latency. A slow database might still be a safe database, but TitanDB wasn’t safe either: in allowing uncoordinated writes across servers, race conditions could occur which would leave the database in an inconsistent state.

This is partially understandable. The primary motivation for TitanDB, its successor JanusGraph, and other similar graph databases was scalability. I think it was a rational decision: the market places a premium on scalability, and scalability is non-trivial if you also want safety and performance. What I think also helped is that the market tended to assume that features like performance and safety were innately part of anything called a database.

Of course, scale does not necessarily mean high performance in absolute terms. Scale means you can add more computers to a system and perform more processing relative to fewer computers in the same system. That’s often why Neo4j on a laptop can humiliate scale-out graph databases running on many powerful servers.

In our work we were less concerned about performance, and much more about the safety properties of eventually-consistent graph databases. You probably already know that eventual consistency provides a straightforward guarantee for a single replica set: it will converge to the same value eventually (perhaps even before the next read, in good systems). The value converged upon is influenced by the chosen concurrency control mechanism (e.g. Last Writer Wins versus logical clocks), but the guarantee is strictly for one replica set/shard. This might be fine for some data types, but we strongly suspected that it is not fine for edge-distributed graphs. Spoiler: it isn’t.

Sources of corruption in eventually-consistent graph databases

Eventually consistent databases scale because they avoid costly, centralised operations. Because of that, there’s no coordination or ordering across the database, just local concurrency controls on each server. The concurrency control might be simple and cheap like Last-Writer-Wins (LWW) or it might be a fancy merge function, but whatever mechanism is chosen, it acts locally on each server. Fortunately, anti-entropy functions are available in most databases so that divergence between replicas within a set can be repaired (though such functions tend not to be used aggressively in case they inhibit performance).

Even with anti-entropy functionality, an eventually consistent distributed graph database has two primary sources of potential corruption:

  1. The current transaction happens to read from a replica which is stale and uses that stale data for its updates. This is more complex than the same situation in a KV, document, or column store because graph database queries touch arbitrary keys, so undoing the logical corruption is impractical and typically impossible. However, this is solvable with approaches like quorum reads, though they slow down the operation of the database considerably (see the sketch after this list).
  2. Two transactions contend on a single relationship/edge which happens to span replica sets (shards from here on) and are applied in a mutually inconsistent order. This is also solvable, but it requires some mechanism to impose ordering, which is expensive and completely undermines eventual consistency, so nobody does it because it effectively implies transactions.
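
Here is a toy sketch of the first source, a write derived from a stale read. The two-replica shard and the follower_count property are illustrative assumptions, not taken from any particular database.

```python
# Two replicas of the same shard; the second is lagging behind the first.
primary = {"rosa": {"follower_count": 100}}  # up to date
stale = {"rosa": {"follower_count": 90}}     # missing ten recent increments

# A transaction happens to be routed to the stale replica...
count = stale["rosa"]["follower_count"]

# ...derives a new value from what it read, and writes it back everywhere.
new_count = count + 1
primary["rosa"]["follower_count"] = new_count
stale["rosa"]["follower_count"] = new_count

# The replicas now agree, but on a value that silently discards ten earlier
# updates: 91 rather than 101. In a graph, the value written back could just
# as easily be one half of a relationship record, at which point undoing the
# damage means chasing arbitrary keys around the graph.
print(primary["rosa"]["follower_count"])  # 91
```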

The TitanDB team were rigorous enough to call out this problem, but the language they used downplays its significance. The documentation uses phrases like “temporary inconsistency” rather than “corruption that might be repaired, but might also cause further irreparable corruption”, and “half edges” rather than “corrupted edge records”. While I don’t think this mollifying language has directly led to a slew of copycat databases, it certainly set a tone of ignoring safety properties (which is still followed by other databases in the family).

Quantifying corruption in no-fault systems

It’s unsettling to go looking for data corruption in databases operating in a fault-free environment. The word “database” is closely associated with the notion of safety - we expect to be able to retrieve what we’ve stored.

Our starting point was the TitanDB developers’ prior tacit admission of a possible technical error case. We didn’t know, however, whether it would be a problem in real systems running in production for a long time. We could just have run the database with a representative workload for many months, but waiting around in real time to see whether corruption occurs is expensive in both time and infrastructure bills.

Instead, I asked academics Paul Ezhilchelvan and Isi Mitrani for help. In a way it was a kind of homecoming for me, being able to work with two of the lecturers who’d taught me Computer Science as an undergraduate (a gruelling task no doubt). On that note I’ll point out that any errors or misunderstandings are mine alone since Paul and Isi are outstanding computer scientists.

We decided to model the effects of failing reciprocal consistency in a busy, long-lived, eventually-consistent distributed graph database. We chose a scale-free graph as the basis for our model, and informally thought of it as a social network: a few nodes with very high degree representing popular celebrities, the next few layers with increasing frequency and decreasing degree representing various levels of sub-celebrity influence, scaling all the way down to many nodes with very low degree representing most ordinary people.

Our model treats a graph database as a collection of replica sets, or shards, each holding some fraction of the overall graph. It has to be distributed, of course, so that we can elicit the mechanical failure modes of edge-distributed graph databases. In our model, two connected nodes will usually be held on a single shard, and by implication so too will the relationship joining them. However, according to research on graph partitioning algorithms by my colleague Hugo Firth, a good partitioning still results in around 30% of relationships connecting nodes that are hosted on different shards - these relationships are said to be distributed.
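
To get a feel for how many relationships end up distributed, here is an illustrative Python sketch. It assumes the networkx library and uses naive modulo placement rather than a locality-aware partitioner like Taper, so the distributed fraction it reports is considerably worse than 30%.

```python
import networkx as nx

NUM_SHARDS = 4

# A scale-free topology, standing in for our informal social network.
graph = nx.barabasi_albert_graph(n=100_000, m=3, seed=42)

def shard_of(node_id: int) -> int:
    return node_id % NUM_SHARDS  # naive placement with no locality awareness

distributed = sum(1 for u, v in graph.edges() if shard_of(u) != shard_of(v))
print(f"{distributed / graph.number_of_edges():.0%} of relationships cross shards")
# With four shards and naive placement this is roughly 75%; every such
# relationship is a pair of reciprocal records that must be kept consistent
# across the network.
```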

The model allows for a write after a stale read from a replica as one source of corruption, and for mutually out-of-order processing of updates to relationship records that span shards as the other primary source. Out-of-order updates have a chance of being overwritten correctly before they’re ever seen, but they also have a chance that the stale side of the relationship is read and used to compute future incorrect updates, at which point it becomes impossible to repair the data in the general case.

In our model, relationship records which cross shards are physically implemented as two reciprocal records, one on each shard (a common real-world implementation strategy). That is, if shard A has a node representing Rosa with an outgoing FOLLOWS relationship to a node representing Karl on shard B, then the Karl node on shard B should have an incoming FOLLOWS from Rosa on shard A.

// TODO: picture needed

Equally any properties on that relationship must be identical whether the relationship is traversed head to tail or tail to head.

// TODO: picture needed

We term this reciprocal consistency, and it is an invariant that must hold for data to be safely stored in the database. Any deviation from a reciprocally consistent state is corruption, and that corruption has a chance to spread through the graph.

Using the model, we showed that for a scale-free graph with billions of nodes, a workload of around 10,000 TPS, and some conservative estimates for the likelihood of transactions being applied mutually out of order, the system would be garbage (defined as 10% corruption) in around 12-18 months.

Since the model was computationally expensive (many hours of compute time were needed for each minor parameter change), we used it to calibrate a computationally cheap numerical approximation which you can parameterise to see how your eventually consistent graph database might behave in production.
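
To show the shape of such a calculation (and emphatically not the published approximation), here is a deliberately crude back-of-envelope sketch. Every parameter is a placeholder, in particular the race probability and the spread factor, which you would replace with estimates for your own workload.

```python
SECONDS_PER_MONTH = 30 * 24 * 3600

total_relationships = 10_000_000_000  # a scale-free graph with billions of nodes
tps = 10_000                          # sustained write transactions per second
p_distributed = 0.30                  # fraction of relationships spanning shards
p_out_of_order = 1e-5                 # chance a cross-shard update races (assumed)
spread_factor = 2.0                   # corrupt writes seeded per corrupt record read per month (assumed)

corrupt = 0.0
months = 0
while corrupt / total_relationships < 0.10:
    # fresh corruption from races on distributed relationships this month
    fresh = tps * p_distributed * p_out_of_order * SECONDS_PER_MONTH
    # existing corruption that is read back and used to compute further bad writes
    seeded = corrupt * spread_factor * (1 - corrupt / total_relationships)
    corrupt += fresh + seeded
    months += 1

print(f"~{months} months until 10% of relationships are corrupt (toy parameters)")
```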

It’s not about eventual consistency, it’s about uncoordinated updates

Our suspicion going into this work was that eventual consistency was the problematic factor. Clearly, if you can read a stale value and write back to the database based on that stale value, you’re going to get logical corruption. But that can be mitigated, albeit at a cost, in competent eventually consistent systems.

So our focus on eventual consistency wasn’t wrong, but it wasn’t quite right either. In fact, we found that the consistency model for replication within a shard was less important than how updates are delivered across shards.

With hindsight, eventually consistent databases were the obvious place to start because they have no coordination at all, and so by implication no consistent ordering of arrival and processing at the receiving shards. What’s surprising, though, is that if we make the model more generous, we still get corruption.

Let’s assume that the replicas in a shard converge immediately and that faults never occur. This is an impossible, magical replication algorithm, and still the database would suffer corruption.

The problem is that reciprocal consistency is undermined by race conditions in the presence of concurrent, contended transactions. While there are fewer opportunities to read stale data within a single shard (because we assumed instant convergence), it’s still possible for transactions to be delivered mutually out of order to a pair of shards.

Let’s play this out. Consider two transactions T1 and T2 which both update a common cross-shard relationship R in some mutually incompatible manner. Drawing on the TitanDB example, this could be where one transaction tries to delete R while the other tries to update it.

There are two arrival orders in which these transactions achieve a consistent outcome and two in which they may not, and since we don’t have any coordination logic to impose an ordering (on the basis that it would inhibit scalability), we have to rely on natural arrival order.

Shard 1    Shard 2    Consistent
T1->T2     T1->T2     Yes
T2->T1     T2->T1     Yes
T2->T1     T1->T2     Maybe
T1->T2     T2->T1     Maybe

Corruption can occur when T1 and T2 are processed in different orders on different shards for operations that don’t commute. In such cases reciprocal consistency will have been violated.

For example, if T1 adds property since:2010 to the FOLLOWS relationship between Rosa and Karl while concurrently T2 adds property since:2015 to the FOLLOWS relationship between Rosa and Karl, then we have a problem. Ideally we’d like the values to converge to one of these, but they never will because there is no protocol to enforce consistency across the shards.

Shard 1 (T1->T2)                            Shard 2 (T2->T1)
(Rosa)-[:FOLLOWS {since:2010}]->(remote)    (remote)-[:FOLLOWS {since:2015}]->(Karl)

This is a type of non-deterministic corruption where one shard holds the “correct” record, as seen in a strictly serializable history, and the other shard holds the “incorrect” record. In this case the two transactions with inconsistent arrival order do not commute: each sets a different value for the same property on the relationship, so the order in which they are applied determines which value survives on each shard.

It’s possible that a subsequent write might immediately correct this error with no side effects, and it’s possible that the shard with the “correct” record will be read. But it’s also possible that the shard with the “incorrect” record will be read, and those incorrect values used to seed further writes, propagating the corruption around the graph. In general, there is no recovery from this corruption once it spreads. It’s often a coin toss as to which record is the correct one in the instant after it is written.
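
It takes very little machinery to reproduce this race. The sketch below assumes each shard applies updates to its half of the Rosa-FOLLOWS-Karl edge strictly in local arrival order (perfect, instantly convergent local processing), with no cross-shard protocol to make those orders agree. The “add property, first add sticks” semantics is just one illustrative choice; under last-writer-wins the winners flip, but the two halves still disagree.

```python
def apply_in_arrival_order(updates):
    """A shard's half-edge record after applying 'add property' updates in order."""
    half_edge = {}
    for update in updates:
        for prop, value in update.items():
            half_edge.setdefault(prop, value)  # adding an existing property is a no-op
    return half_edge

t1 = {"since": 2010}  # T1 adds since:2010 to the FOLLOWS relationship
t2 = {"since": 2015}  # T2 adds since:2015 to the same relationship

shard1_half = apply_in_arrival_order([t1, t2])  # Rosa's half sees T1 first
shard2_half = apply_in_arrival_order([t2, t1])  # Karl's half sees T2 first

print(shard1_half, shard2_half)  # {'since': 2010} {'since': 2015}
assert shard1_half != shard2_half  # the reciprocal records have diverged
```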

I’ll reiterate that: without any hardware or software failures, your data will be irreversibly garbage within its production lifetime. That’s a shocking result for a class of technology that has a degree of market buy-in.

Our work has now been peer reviewed and published several times, with the most recent analysis appearing in the journal Queueing Models and Service Management.

When is strongly consistent not strongly consistent? When it’s not consistently applied.

Other entrants to the graph database market must also have seen the risks of eventual consistency for graphs and opted to build their offerings atop strongly consistent models to avoid these pitfalls. The approach I’ve seen in two databases (one closed source, one open) is to build strong guarantees into the replica sets (shards) that store the data. This is a very good starting point and is used in their marketing materials to give a sense of safety. Words like “ACID” and “strongly consistent” are used frequently to convey the safety properties of the systems.

But all is not well here. In fact, despite having better guarantees for consistency within a single replica set than eventually consistent databases, the same global ordering problem haunts these newer databases.

These systems enforce consistency within a shard, but anything goes between shards: with no protocol to order contending transactions across shards, the same mutually out-of-order delivery described above can still violate reciprocal consistency and corrupt the graph, whatever the marketing says about ACID.

Acknowledgements

The research work was made possible by a continuing joint effort between Newcastle University and Neo4j. Many thanks to my academic collaborators Paul Ezhilchelvan, Isi Mitrani, and Jack Waudby for entertaining my hunches and putting them on a solid academic foundation.

My colleague Hugo Firth (an ex-Newcastle Ph.D., like me) provided a great deal of valuable feedback on this piece, as well as day-to-day sensible discussions on reliability in distributed systems.

Like this? Want to get more involved?

I’ll make a final pitch and remind you that Neo4j are hiring. So if you want to work with Hugo and me, then take a look at our open roles.
