There’s recently been a great deal of discussion on the subject of graph processing. For those of us in the graph database space, this is an exciting development since it reinforces the utility of graphs as both a storage and a computational model. Confusingly, however, graph processing platforms are often mistakenly conflated with graph databases because the two share a data model, yet each kind of tool addresses a fundamentally different problem.
For example, graph processing platforms like Google’s Pregel achieve high aggregate computational throughput by adopting the Bulk Synchronous Parallel (BSP) model from the parallel computing community. Pregel supports large-scale graph processing by partitioning a graph across many machines and allowing those machines to compute efficiently at vertices using localised data. Localised information is exchanged only during synchronisation phases (cf. the BSP model). This gives Google the ability to process huge volumes of interconnected data, albeit at relatively high latencies, to gain greater business insight than with traditional (non-graph-optimised) map-reduce approaches.
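The vertex-centric BSP style can be illustrated with a toy sketch. This is not Pregel’s actual API, just a minimal Python rendering of the idea: in each superstep every vertex computes using only its own value and incoming messages, and messages are delivered to neighbours only at the synchronisation barrier between supersteps. The example propagates the maximum value through each connected component.

```python
def pregel_max(graph, values):
    """Toy BSP computation: graph maps each vertex to its neighbours
    (symmetric for an undirected graph); values maps vertex -> number.
    On return, every vertex holds the maximum value in its component."""
    values = dict(values)
    # Superstep 0: every vertex messages its neighbours with its value.
    inbox = {v: [] for v in graph}
    for v in graph:
        for n in graph[v]:
            inbox[n].append(values[v])
    # Subsequent supersteps: compute locally; messages queue in the
    # outbox and are delivered only at the synchronisation barrier.
    while any(inbox.values()):
        outbox = {v: [] for v in graph}
        for v, msgs in inbox.items():
            if msgs and max(msgs) > values[v]:
                values[v] = max(msgs)              # local computation
                for n in graph[v]:
                    outbox[n].append(values[v])    # held until the barrier
        inbox = outbox                             # synchronisation phase
    return values

graph = {"a": ["b"], "b": ["a", "c"], "c": ["b"]}  # undirected, symmetric
print(pregel_max(graph, {"a": 1, "b": 5, "c": 2}))  # every vertex -> 5
```

The point of the model is that all communication is batched at the barriers, which is what lets many machines compute in parallel without fine-grained coordination.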
Sadly, few of us have Google-scale resources at our disposal to invent novel platforms on demand. In enterprise-scale scenarios, Hadoop (incidentally an implementation of Google’s earlier map-reduce framework) has become a popular platform for batch processing large volumes of data. Like Pregel, Hadoop is a high-latency, high-throughput processing tool that optimises computational throughput by processing large volumes of data in parallel outside the database.
Unlike Pregel, Hadoop is a general-purpose framework. While it can be used for graph processing, it isn’t optimised for that purpose, nor are its underlying storage mechanisms, HDFS (a distributed file system) and HBase (a distributed tabular database designed for large numbers of rows and columns), graph-oriented in nature (though interestingly the Ravel Golden Orb platform claims to add a Pregel-like programming model on top of Hadoop).
What Pregel and Hadoop have in common is their tendency towards the data analytics (OLAP) end of the spectrum, rather than being focussed on transaction processing. This is in stark contrast to graph databases like Neo4j which optimise storage and querying of connected data for online transaction processing (OLTP) scenarios – much like a regular RDBMS, only with a more expressive and powerful data model. We can visualise these differing capabilities easily as in the figure below:
In this breakdown, Pregel is positioned firmly in the OLAP graph processing space, much as Hadoop is positioned in the general-purpose OLAP space (though closer to the OLTP axis because of recent advances in so-called real-time Hadoop). Relational databases are positioned as general-purpose OLTP engines that can be somewhat adapted to OLAP needs. Neo4j has strong graph affinity and is designed primarily for OLTP scenarios, though as a native graph database with strong read-scalability, it can also be suited to OLAP work.
However, the Hadoop community continues to foster innovation in the area of graph processing, and there are regular announcements about how Hadoop can be adapted towards solving graph problems. Recently, Daniel Abadi publicised work from his team at Yale University on solving graph problems more efficiently with Hadoop.
This work is novel empirical science and presents an important observation: by skilfully partitioning data in HBase to exploit locality, (graph) computational throughput in Hadoop can be substantially increased. And yet for casual observers of the NOSQL community, this is easily misread as heralding the demise of graph databases, which appear to have much more modest throughput. However, I don’t believe this is a valid comparison:
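The intuition behind locality-aware partitioning is easy to demonstrate with a toy model (the graph and placements below are made up, and this greatly simplifies the Yale work): edges whose endpoints land on different machines cost network traffic, so a layout that respects graph structure cuts far fewer edges than naive hash partitioning.

```python
def cut_edges(edges, placement):
    """Count edges whose endpoints are placed on different machines."""
    return sum(1 for u, v in edges if placement[u] != placement[v])

def clique(vertices):
    """All undirected edges among a set of vertices."""
    return [(u, v) for u in vertices for v in vertices if u < v]

# Two tightly-knit clusters, {0..3} and {4..7}, joined by one bridge edge.
edges = clique(range(4)) + clique(range(4, 8)) + [(3, 4)]

hash_placement = {v: v % 2 for v in range(8)}       # ignores structure
locality_placement = {v: v // 4 for v in range(8)}  # respects the clusters

cut_hash = cut_edges(edges, hash_placement)       # 9 cross-machine edges
cut_local = cut_edges(edges, locality_placement)  # 1: only the bridge edge
```

With hash partitioning, most edges inside each cluster straddle machines; with the structure-aware layout, only the single bridge edge does. Less cross-machine traffic per superstep or per job is exactly where the throughput gain comes from.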
- Hadoop is a batch processing framework and operates at high latencies compared to graph databases (even real-time Hadoop involves latencies of seconds, compared to the millisecond scale at which Neo4j operates). The work done to improve graph processing through data locality means that batches will execute more efficiently, and so throughput will be higher (or similar throughput will be achievable with fewer computational resources). Yet latency will remain comparatively high, so this approach is unlikely to be well-suited to the on-demand (OLTP) processing that is the mainstay of most applications, where latency is more helpfully measured in milliseconds. Instead it is likely to remain firmly in the OLAP domain for the foreseeable future.
- For generating regular reports from a data warehouse or pre-computing results, batch processing can be a sensible strategy, especially if it can be made efficient by laying out data carefully. That efficiency comes at a cost: data has to be denormalised within HBase, widening the cognitive gap between your data and how it is represented for processing. Conversely, Neo4j works in OLAP scenarios exactly as it does in OLTP scenarios – your OLTP database is your OLAP database (usually a read slave, with the same data model). This means Neo4j needs no denormalisation or special processing infrastructure, and large read queries like reporting jobs scale very well even under heavy and unpredictable online loads.
- Batch-oriented approaches are best suited to workloads where data can be read and processed outside the database rather than manipulated in place. That is, efficiently processing static graph-like data (or triples) not only requires careful placement of data in HBase, but practically rules out mutating the graph during processing. In contrast, Neo4j supports in-place graph mutation, which is a more powerful tool for real-time Web analytics than (even efficiently processed) batches.
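The batch-versus-in-place distinction can be sketched in a few lines. This is a deliberately contrived illustration, not Neo4j’s API: the tiny adjacency-dict “database” stands in for a graph store, a traversal stands in for an analytic read, and a batch pipeline is modelled as computing over a snapshot exported before processing began.

```python
from copy import deepcopy

def reachable(graph, start):
    """Simple traversal, standing in for an analytic read query."""
    seen, stack = set(), [start]
    while stack:
        v = stack.pop()
        if v not in seen:
            seen.add(v)
            stack.extend(graph.get(v, []))
    return seen

db = {"a": ["b"], "b": [], "c": []}   # our toy "live" graph store

snapshot = deepcopy(db)               # batch world: export, then process
db["b"].append("c")                   # an OLTP write arrives mid-flight

batch_result = reachable(snapshot, "a")  # {'a', 'b'} -- stale snapshot
live_result = reachable(db, "a")         # {'a', 'b', 'c'} -- sees the write
```

The batch result is correct for the moment the snapshot was taken, but blind to everything since; the in-place query reflects the mutation immediately, which is what makes mutation-during-analysis possible.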
Bringing all of these sentiments together, it’s clear that we’re looking at two different tools for two different sets of problems. The Hadoop-based solution offers batch-oriented processing at high throughput, with correspondingly high latency and substantial denormalisation. The Neo4j approach emphasises native graph OLTP processing with real-time OLAP: more modest throughput at very low (millisecond) latency, and since work happens in the database, the data is always consistent.
So if you need OLTP and deep, OLAP-style insight in near real-time at enterprise scale, then Neo4j is a sensible choice. For niche problems where you can afford high latency in exchange for higher throughput, graph processing platforms like Pregel, or batch frameworks like Hadoop, could be beneficial. But it’s important to understand that they are not the same.