The graph database space is a peculiar corner of the NOSQL universe. In general, the NOSQL movement has pushed towards simpler data models with more sophisticated computation infrastructure compared to traditional RDBMS. In contrast, graph databases like Neo4j actually provide a far richer data model than a traditional RDBMS and a search-centric rather than compute-intensive method for data processing.
Strangely the expressive data model supported by graphs can be difficult to understand amid the general clamour of the simpler-is-better NOSQL movement. But what makes this doubly strange is that other NOSQL database types can support limited graph processing too.
This strange duality, where non-graph stores can be used for limited graph applications, was the subject of a thread on the Neo4j mailing list, which was the inspiration for this post. In that thread, community members discussed the value of using non-graph stores for graph data, particularly since prominent Web services are known to use this approach (like Twitter’s FlockDB). But as it happens, the use-case for those graphs tends to be relatively shallow – “friend” and “follow” relationships and suchlike. In those situations, it can be a reasonable solution to have information in your values (or document properties, columns, or even rows in a relational database) to indicate a shallow relation, as we can see in this diagram:
At runtime, the application using the datastore (remember: that’s code you typically have to write) follows the logical links between stored documents and creates a logical graph representation. This means the application code needs to understand how to create a graph representation from those loosely linked documents.
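To make that reification step concrete, here is a minimal Python sketch. The document layout, the `follows` field and the `reify_graph` helper are all hypothetical, invented for illustration rather than taken from any particular store. The point is simply that the application has to know which field encodes the links.

```python
# Hypothetical documents as they might sit in a KV or document store:
# each one carries the ids of related documents in a plain list.
documents = {
    "alice": {"name": "Alice", "follows": ["bob", "carol"]},
    "bob": {"name": "Bob", "follows": ["carol"]},
    "carol": {"name": "Carol", "follows": []},
}


def reify_graph(docs, link_field):
    """Build an adjacency map by following the id references in each document."""
    graph = {}
    for doc_id, doc in docs.items():
        # The application must know which field holds the links, and it
        # drops references to documents that are not in the store.
        graph[doc_id] = [t for t in doc.get(link_field, []) if t in docs]
    return graph


graph = reify_graph(documents, "follows")
print(graph["alice"])
```

If the link field is renamed or a second kind of link is added, both the stored documents and `reify_graph` have to change together.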
If the graphs are shallow, this approach can work reasonably well. Twitter’s FlockDB is an existence proof of that. But as relationships between data become more interesting and deeper, this is an approach that rapidly runs out of steam. This approach requires graphs to be structured early on in the system lifecycle (design time), meaning a specific topology is baked into the datastore and into the application layer. This implies tight coupling between the code that reifies the graphs and the mechanism through which they’re flattened in the datastore. Any structural changes to the graph now require changes to the stored data and the logic that reifies the data.
Neo4j takes a different approach: it stores graphs natively and so separates application and storage concerns. That is, where your documents have relationships between them, that’s the way they’re stored, searched, and processed in Neo4j, even if those relationships are very deep. In this case, the logical graph that we reified from the document store can be natively (and efficiently) persisted in Neo4j.
What’s often deceptive is that in some use-cases, projecting a graph from a document or KV store and using Neo4j might begin with seemingly similar levels of complexity. For example, we might create an e-commerce application with customers and items they have bought. In a KV or document case we might store the identifiers of products our customers had bought inside the customer entity. In Neo4j, we’d simply add relationships like PURCHASED between customer nodes and the product nodes they’d bought. Since Neo4j is schema-less, adding these relationships doesn’t require migrations, nor should it affect any existing code using the data. The next diagram shows this contrast: the graph structure is explicit in the graph database, but implicit in a document store.
Even at this stage, the graph shows its flexibility. Imagine that a number of customers bought a product that had to be recalled. In the document case we’d run a query (typically using a map/reduce framework) that grabs the document for each customer and checks whether a customer has the identifier for the defective product in their purchase history. This is a big undertaking if each customer has to be checked, though thankfully because it’s an embarrassingly parallel operation we can throw hardware at the problem. We could also design a clever indexing scheme, provided we can tolerate the write latency and space costs that indexing implies.
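As a rough sketch of the document-store side of the recall, assuming a hypothetical layout where each customer document holds a `purchases` list of product ids, the check boils down to a full scan (shown sequentially here, though each document could be checked in parallel):

```python
# Hypothetical customer documents; the field names are assumptions.
customers = {
    "c1": {"name": "Ann", "purchases": ["p1", "p2"]},
    "c2": {"name": "Ben", "purchases": ["p3"]},
    "c3": {"name": "Cat", "purchases": ["p2"]},
}


def customers_to_notify(customer_docs, product_id):
    """Inspect every customer document once and keep those holding the product id."""
    return sorted(
        customer_id
        for customer_id, doc in customer_docs.items()
        if product_id in doc["purchases"]
    )


affected = customers_to_notify(customers, "p2")
print(affected)
```

The work grows with the total number of customers, not with the number of people who actually bought the recalled product.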
With Neo4j, all we need to do is locate the product (by graph traversal or index lookup) and look for incoming PURCHASED relations to determine immediately which customers need to be informed about the product recall. Easy peasy!
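The shape of that lookup can be sketched with a toy in-memory graph. To be clear, this is not Neo4j's API: `ToyGraph`, `relate` and `incoming_from` are hypothetical names, and the only idea being illustrated is that relationships are reachable from both of their end nodes.

```python
from collections import defaultdict


class ToyGraph:
    """Toy stand-in for a graph store; not Neo4j's API."""

    def __init__(self):
        self.outgoing = defaultdict(list)  # node -> [(rel_type, target)]
        self.incoming = defaultdict(list)  # node -> [(rel_type, source)]

    def relate(self, source, rel_type, target):
        # Record the relationship at both ends so it can be followed either way.
        self.outgoing[source].append((rel_type, target))
        self.incoming[target].append((rel_type, source))

    def incoming_from(self, node, rel_type):
        """Return the sources of incoming relationships of the given type."""
        return [src for rt, src in self.incoming[node] if rt == rel_type]


shop = ToyGraph()
shop.relate("ann", "PURCHASED", "headphones")
shop.relate("ben", "PURCHASED", "decks")
shop.relate("cat", "PURCHASED", "headphones")

recall_list = shop.incoming_from("headphones", "PURCHASED")
print(recall_list)
```

Here the work is proportional to the number of relationships on the product node, and no other customer is touched.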
As the e-commerce solution grows, we want to evolve a social aspect to shopping so that customers can receive buying recommendations based on what their social group has purchased. In the non-native graph store, we now have to encode the notion of friends and even friends of friends into the store and into the logic that reifies the graph. This is where things start to get tricky, since now we have a deeper traversal from a customer to customers (friends) to customers (friends of friends) and then into purchases. What initially seemed simple is now starting to look dauntingly like a fully fledged graph store, albeit one we have to build.
Conversely, in the Neo4j case, we simply use the FRIEND relationships between customers. For recommendations we traverse the graph across all outgoing FRIEND relationships (limited to depth 1 for immediate friends, or depth 2 for friends-of-friends), and then follow outgoing PURCHASED relationships to see what they’ve bought. What’s important here is that it’s Neo4j that handles the hard work of traversing the graph, not the application code, as we can see in the diagram above.
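For a feel of what that depth-limited traversal involves, here is a small Python sketch over plain dictionaries. The data and the `recommendations` function are hypothetical, and this is the sort of code a graph database spares us from writing rather than a picture of how Neo4j does it internally.

```python
from collections import deque

# Hypothetical FRIEND and PURCHASED relationships, keyed by customer.
friend = {"ann": ["ben"], "ben": ["cat"], "cat": ["dan"], "dan": []}
purchased = {
    "ann": ["book"],
    "ben": ["headphones"],
    "cat": ["decks"],
    "dan": ["amp"],
}


def recommendations(customer, max_depth):
    """Collect products bought by people within max_depth FRIEND hops."""
    seen = {customer}
    queue = deque([(customer, 0)])
    products = set()
    while queue:
        person, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand beyond the depth limit
        for other in friend.get(person, []):
            if other not in seen:
                seen.add(other)
                products.update(purchased.get(other, []))
                queue.append((other, depth + 1))
    # Do not recommend what the customer already owns.
    return products - set(purchased.get(customer, []))


print(recommendations("ann", 1))
print(recommendations("ann", 2))
```

Going from friends to friends-of-friends is only a change of `max_depth`; nothing about the stored data has to be reshaped.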
But there’s much more value the e-commerce site can derive from this data. Not only can social recommendations be driven by close friends, but the e-commerce site can also start to look for trends and base recommendations on them. This is precisely the kind of thing that supermarket loyalty schemes do with big iron and long-running SQL queries – but we can do it on commodity hardware very rapidly using Neo4j.
For example, one set of customers that we might want to incentivise are those people who we think are young performers. These are customers that perhaps have told us something about their age, and we’ve noticed a particular buying pattern surrounding them – they buy DJ-quality headphones. Often those same customers buy DJ-quality decks too, but there’s a potentially profitable set of those customers that – shockingly – don’t yet own decks (much to the gratitude of their flatmates and neighbours I suspect).
With a document or KV store, looking for this pattern by trawling through all the customer documents and projecting a graph is laborious. But matching these patterns in a graph is quite straightforward and efficient – simply by specifying a prototype to match against and then by efficiently traversing the graph structure looking for matches.
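As an illustration of matching such a prototype, the sketch below starts at the headphones product, follows PURCHASED backwards to its buyers, and keeps those who are young and have no PURCHASED relationship to decks. The data, the product names, the `match_prospects` helper and the age cut-off of 30 are all assumptions made up for the example.

```python
# Hypothetical PURCHASED relationships and customer ages.
purchases = [
    ("ann", "dj_headphones"),
    ("ann", "dj_decks"),
    ("ben", "dj_headphones"),
    ("cat", "dj_decks"),
    ("dan", "dj_headphones"),
]
ages = {"ann": 22, "ben": 24, "cat": 31, "dan": 58}

# Index the relationships in both directions, as a graph would.
bought = {}
buyers = {}
for customer, product in purchases:
    bought.setdefault(customer, set()).add(product)
    buyers.setdefault(product, set()).add(customer)


def match_prospects(has_product, lacks_product, max_age):
    """Match: customer PURCHASED has_product, has no PURCHASED to lacks_product, and is young."""
    return sorted(
        c
        for c in buyers.get(has_product, set())
        # Customers with no known age are treated as not matching.
        if lacks_product not in bought[c] and ages.get(c, max_age + 1) <= max_age
    )


prospects = match_prospects("dj_headphones", "dj_decks", 30)
print(prospects)
```

Only customers connected to the headphones node are ever examined, which is what makes the match cheap compared with trawling every customer document.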
This shows a wonderful emergent property of graphs – simply store all the data you like as nodes and relationships in Neo4j and later you’ll be able to extract useful business information that perhaps you can’t imagine today, without the performance penalties associated with joins on large datasets.
In these kinds of situations, choosing a non-graph store for storing graphs is a gamble. You may find that you’ve designed your graph topology far too early in the system lifecycle and lose the ability to evolve the structure and perform business intelligence on your data. That’s why Neo4j is cool – it keeps graph and application concerns separate, and allows you to defer data modelling decisions to more responsible points throughout the lifetime of your application.
So if you’re fighting with graph data imprisoned in Key-Value, Document or relational datastores, then try Neo4j.