As I’ve discussed before, graph databases like Neo4j can lack the same predictability in terms of scaling when compared to other kinds of NoSQL stores (that’s the cost of a rich data model). But with a little thought, we’ve seen how both cache sharding and the application of domain-specific strategies can help to improve throughput and increase the dataset size that Neo4j can store and process.
Those blog posts triggered some very useful discussion on the Neo4j mailing list, with several community members adding their own thoughts and experiences. In particular, Mark Harwood suggested a simple heuristic for deciding on a scaling strategy, which I thought was so useful that I’d share it (with his permission) here.
- Dataset size: Many tens of Gigabytes
Strategy: Fill a single machine with RAM
Reasoning: Modern servers contain a lot of RAM, on the order of 128GB for a typical machine. Since Neo4j likes to use RAM to cache data, where O(dataset) ≈ O(memory) we can keep all the data cached in memory and operate on it at extremely high speed.
Weaknesses: Still need to cluster for availability; write scalability limited by disk performance.
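To make the O(dataset) ≈ O(memory) reasoning concrete, here’s a back-of-envelope sizing check. The numbers are illustrative assumptions, not Neo4j’s actual record formats or defaults:

```python
# Back-of-envelope check: does the dataset fit in one machine's RAM?
# All figures below are illustrative assumptions.
dataset_gb = 80          # "many tens of gigabytes"
machine_ram_gb = 128     # a typical well-specified server
headroom = 0.75          # leave room for the JVM, OS, and working memory

fits_in_memory = dataset_gb <= machine_ram_gb * headroom
print(fits_in_memory)  # 80 <= 96, so a single machine suffices here
```

When the check fails, you’ve crossed into the next tier of the heuristic.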
- Dataset size: Many hundreds of Gigabytes
Strategy: Cache sharding across a replicated cluster
Reasoning: The dataset is too big to hold entirely in RAM on a single server, but small enough to replicate on disk across machines. Cache sharding improves the likelihood of hitting a warm cache, which maintains high performance. The cost of a cache miss is not catastrophic (a local disk read), and can be mitigated by replacing spinning disks with SSDs.
Weaknesses: Consistent routing requires a router on top of the Neo4j infrastructure; the write master in the cluster is a limiting factor in write scalability.
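The consistent-routing piece of cache sharding can be sketched in a few lines. This is a minimal illustration, not a Neo4j API: it assumes each request carries a stable session or user key, and the class and replica names are hypothetical:

```python
import hashlib

class CacheShardingRouter:
    """Route each request to the same replica so its cache stays warm.

    Every replica holds the full dataset on disk; routing only decides
    which replica's cache gets warmed for a given portion of the graph.
    """

    def __init__(self, replicas):
        self.replicas = list(replicas)

    def route(self, session_key: str) -> str:
        # Hash the session key so the same client consistently lands on
        # the same replica, maximising the chance of a warm-cache hit.
        digest = hashlib.md5(session_key.encode()).hexdigest()
        index = int(digest, 16) % len(self.replicas)
        return self.replicas[index]

router = CacheShardingRouter(["neo4j-1:7474", "neo4j-2:7474", "neo4j-3:7474"])
assert router.route("user-42") == router.route("user-42")  # stable routing
```

Note that a cache miss here is cheap (a local disk read on the chosen replica), which is what makes this tier workable without true sharding.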
- Dataset size: Terabytes and above
Strategy: Domain-specific sharding
Reasoning: At this scale, the dataset is too big for a single memory space and too big to practically replicate across machines, so sharding is the only viable alternative. However, given there is no perfect algorithm (yet) for arbitrary sharding of graphs, we rely on domain-specific knowledge to predict which nodes should be allocated to which machines.
Weaknesses: Not all domains may be amenable to domain-specific sharding.
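As a hypothetical example of using domain knowledge in place of an arbitrary graph-partitioning algorithm, consider a domain where most traversals stay within a geographic region. The mapping and node shape below are invented for illustration only:

```python
# Hypothetical domain-specific sharding: in a geographic domain we can
# predict that most traversals stay within a region, so we allocate
# nodes to shards by region rather than by an arbitrary hash.
REGION_TO_SHARD = {
    "europe": "shard-a",
    "north-america": "shard-b",
    "asia": "shard-c",
}

def shard_for(node: dict) -> str:
    # Fall back to a default shard for unmapped regions; cross-shard
    # relationships still exist, but the domain makes them rare.
    return REGION_TO_SHARD.get(node["region"], "shard-a")

assert shard_for({"id": 1, "region": "europe"}) == "shard-a"
```

The point of the heuristic’s weakness line is exactly this: the approach only works when the domain offers a predicate like `region` that keeps most traversals shard-local.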
As an architect, I really like this heuristic. It’s easy to find where I am on the scale and plan a Neo4j deployment accordingly. It also provides quite an exciting pointer towards the future – while I think most enterprise-scale deployments are currently in the tens to hundreds of gigabytes range, there are clearly applications out there for connected data that do – or will – require more horsepower.
Neo4j 2.0 will address these challenges, and it’s going to be a fun ride. But until then, I hope you’ll find this heuristic as helpful as I did.