Distributed Systems and Fault Tolerance Engineering Jobs at Neo4j

It’s been a while since I last blogged because I’ve been heads-down with my team developing a novel clustering architecture and consistency model for Neo4j. The new cluster architecture is designed to provide data safety and scale for graph workloads while providing a simpler programming model to the user based on causal consistency.

In essence what we’ve developed is a database cluster architecture that uses the Raft consensus protocol to ensure data is safely replicated and asynchronous replication to read-replicas to enable massive scale, all wrapped in a user-facing protocol that means you will always be able to read the effects of your writes – causal consistency. To learn more, here is a recent talk from a Neo4j 3.1 webinar (video / slides).

The productivity leap enabled by causal consistency versus eventual consistency is amazing, and we’re proud of what we’ve achieved with Neo4j Causal Clustering. But we’re not resting there. We’re now building beyond Causal Clustering to a homogeneous, large-scale distributed database for native Peta-scale graph data.

We have several openings on my team for people with an interest in distributed systems and fault tolerance. Our goal is to make Neo4j the default mission-critical, fault-tolerant data platform for all kinds of systems. So if you’re interested in things like transactions, consensus, quorums, graph partitioning, fault injection, or know your partial ordering from your total, then we should talk.

The roles are based in (central) London, UK or Malmö, Sweden. If you’d like to do some amazing computing science for your day job then please get in touch, I’d love to work with you and I think you’d enjoy working with us very much!

Plugin Hybrid Test Drives: Electric Boogaloo

In my last post, I discussed the BMW i3 and why I’d never buy one. And while I really enjoyed that car, I bemoaned the oddly large-ish sizing of battery capacity to the thimble-sized petrol tank (in the range extender model).

Since I’m still looking for a vehicle that can do short commute/school run jobs cheaply and cleanly (are you listening VW?), I still want something with more electric capability than my middle-aged non-plugin Prius.

So, I set off to test some popular PHEVs (Plug-in Hybrid Electric Vehicles) including the Golf GTE, Audi A3 e-tron, and the Mitsubishi Outlander PHEV. To try to keep some level of objectivity, I test-drove all three of them on the same day, with broadly the same mixture of driving conditions.


Let’s start with the Golf GTE. It looks like a Golf which I find off-putting since Golfs have a less than benign reputation. Also Golf itself is a dull game. Not a great start. VW’s design aesthetic definitely seems to be that the future looks a lot like the past – maybe Theresa May was a design consultant. Anyway…

The Golf GTE is a quick vehicle. It has an all-electric mode (quoted 30 mile range), an auto hybrid mode, and a power mode that mixes electrical and petrol motors for antisocial driving thrills (I’m not a fan). My test-drive was mostly on petrol because the dealer (VW Guildford) seemingly couldn’t be bothered to charge the battery before the test drive.


Top tip for showrooms: when someone is looking at dropping £30k + finance on a car, not charging the battery and failing to demonstrate a key selling point is a sure way to alienate the customer.

Testament to the car’s low-speed power, at my first junction I caused a wheel-spin. I hadn’t done that since I was 17 in a battered Ford Fiesta. Rather than be impressed with the power that can be sent to the front wheels, I was surprised at the poor traction control. The Golf made me drive like a Golf driver. The machines have already won.

Anyway, off we set mixing A-road and residential conditions and the car was quite satisfactory (as any new vehicle tends to feel). But the retro, rotary, analogue dials on the car didn’t please me at all. I don’t want to have to study a dial to figure out my speed because that’s a distraction from the road. I think VW have made quite an error in not emphasising the car’s modernity. My middle-aged Prius has a digital speedometer and GPS in a heads-up display like a jet fighter. A car 5 years newer could at least manage parity.

Remarkably VW thinks it’s essential to fit a CD player in a car in 2015, in an electric vehicle. That kerb weight isn’t needed for the kind of buyer that buys electric: we carry phones for that task in the early 21st century. The Phil Collins brigade are much happier, I suspect, with dirty diesels and ironed jeans. I rapidly came to the conclusion that the VW Golf GTE isn’t for me.

Audi A3 e-tron

The first thing to know about the Audi setup is that getting a test drive is like pulling teeth. My wife and I both had numerous attempts to get a test drive from a local (and not so local) showroom and were thwarted, ignored, fobbed off, lied to and variously dropped by the Audi call centre team. Audi showrooms are being let down terribly by their call centre.

The second thing to know is that the Audi brand is practically as toxic as Golf. As a cyclist, I know that if someone is going to overtake me dangerously, it’s going to be an Audi. As a motorist, I know that the impatient man behind me is driving an Audi. They are the Stella Artois of car brands: needlessly costly and consumed by yobs.

The third thing you should know about the Audi A3 e-tron (which should be the main thing if it wasn’t for the poor customer service and branding) is that it’s the same powertrain as the Golf GTE, thanks to VW being Audi’s parent company.

Like the Golf it has a quoted 30 mile range on electric power only, plenty for the daily commute or school run. But the car feels qualitatively different. The aesthetics are similar to any other A3 that you might see through your rear-view mirror far too close to your bumper: reasonably inoffensive. Handling is nice, with tuneable settings for ride/handling and a helpful  feature that automatically stiffens things up as you drive faster. Front and back the ride and interior space are comfortable and pleasant to be in. Despite my reservations about Audi, it really was a car I generally cared for.

The downside is that it’s quite a large vehicle for my purposes as a second car, approaching the size of my family-friendly Prius. Great if we did need to go on longer journeys, slightly less good for pottering around town though it would be unfair to label it un-agile. Ultimately I was left rather impressed with the vehicle, but desperately underwhelmed by Audi pre-sales service. Still, all things considered it was a strong contender.

Mitsubishi Outlander PHEV

The Mitsubishi is a real outlier in my cohort. It’s an SUV for a start (and I’m looking for a city car) with two electric motors and a single petrol engine arranged in parallel rather than the sequential drive train in the Golf and A3. Unlike the other cars I’ve tried it doesn’t have an all-electric mode that a user can engage. Instead power sources are selected for you by the drive computer, which means in practice most situations are battery-powered unless you really floor it. The one caveat is that the driver can choose not to use battery power, for example when driving on motorways where petrol power is relatively efficient.

The driving experience wasn’t at all bad. The electric motors felt punchy at low speeds, and the handling was on a par for similarly sized SUVs (I owned an X-Trail ages back, it felt similar: a little soft). The gadgets were great: 360 degree parking cameras make reversing like a video game, and I can’t help but be impressed by the software that joins the camera feeds.

On the downside, the cabin is far more spartan and workaday than the other vehicles, even on the more expensive models. The backseat area is downright plain with no real attention to comfort being lavished there (not even a charging point for a phone, or an AC vent). On the plus side, the boot space was enormous, and I could imagine storing all my family’s kit in there without running out of space.

But the big negative is that we already have a family car. And so what finally ruled out the Mitsubishi is that I didn’t want another big family car, but a runaround for the city. Despite the fact that it’d be on electric power practically all of the time, it was the size of the Mitsubishi that ultimately counted against it.

So what did I opt for?

You made it this far in what has been the longest blog post I’ve written in ages. So let me recap:

  • the Golf GTE is unappealing;
  • the Mitsubishi Outlander PHEV is too big;
  • the Audi A3 e-tron is nice, but Audi provided shockingly poor service.

So what did I buy? I bought (ok, financed) the BMW i3 (REX).

Yup, I bought a glorified milk float with a motorbike engine instead of an alternator.

“But you said it was all wrong with its batteries and petrol and what not!”

And I did. But since we have a petrol hybrid already for longer journeys, it’s really unlikely we’d need the i3 for motorway duty and as a city car it’s a real pleasure. I adore the science-project feel of the car and enjoy the handling. I’m not really into cars or driving, but I enjoyed rather than endured the test drive. That was quite unexpected.

Yes, the batteries are the wrong size, but it means I’ll charge it every couple of weeks rather than every couple of days. By the time the finance is finished on this car, the next version will have way better battery range and I might just opt into that one too.

I’ll take delivery soon, and I’ll post my thoughts as an owner.

Next challenge: learn how to ground-mount and wire more solar panels to fuel the car!

Why I love the BMW i3 (and will never buy one.)

I’m looking for a runaround for school drop-offs (such is my exciting life). I’ve discounted small petrol cars (mechanically complex, inefficient for small journeys, unexciting, yesteryear’s tech), and am looking at small fully electric or plugin hybrid cars.

To that end, I recently test drove a BMW i3 from BMW Guildford (where I live now). The i3 is one of a handful of battery or small plugin hybrid cars that I’m considering (which would sit alongside an ageing Prius in the driveway).

Customer Service

The BMW staff at BMW Guildford were super nice and promptly arranged a good test drive with some A-roads, some winding little roads and so on. As a side note: they were immeasurably better than the neighbouring Audi Guildford who, after several unreturned phone calls, have still not even bothered to set up a test drive for an A3 e-tron.

Shut up and take my money indeed!


But I digress. I rather like the looks of the BMW. To some it’s an abominable science project but I rather like that it’s a car that wears its technology loud and proud.

I really enjoyed driving the BMW. Though it’s no slouch in town, it’s not a warp-speed car like the Tesla Model S, but the controls are simple, intuitive (though the first few minutes were full of kangaroo-ing!) and worry free. Its turning circle is amazing: it basically drives like a luxury high-speed dodgem!

Putting your foot to the floor gives a very competent amount of acceleration, and moving from a standing start was sure footed and smooth. The instrument panel for driving is sensible and accessible. The auxiliary panel could do with some Jony Ive love though: it’s full of pointless energy production/consumption stuff that detracts from the dodgem-like simplicity of actually driving the vehicle.

At the end of the test drive of around 6 miles, I adored the vehicle and totally wanted it. I also knew that I would never, ever buy one.

Paradox of Range

The BMW i3 has an 80 mile quoted range on battery power, and if you add the range extender (which is a little petrol generator in the boot area) you can get another 80 miles out of the vehicle before coming to a halt. This is the exact sweet spot of suck.

For town cars, most driving is a handful of miles per day. That’s the attraction of an electric vehicle: they cover a handful of miles really efficiently. My drive in the car would be 6-8 miles per day, nothing really. And yet the car can go for 80 miles: 10x more than I need for a school run.

But: if I choose to commute into London, suddenly 80 miles leaves me nervous. It’s around a 60 mile round-trip from my house to the Neo4j office in central London. While the i3 would be exempt from congestion charging, if I couldn’t hook up to a charger while I parked I would be very nervous on the way home. If I add the range extender then my nerves dissipate for the commute but I’m chewing through petrol, which I don’t like.

By contrast the more modern plugin hybrids are generally quoted as having a 30 mile range (Plug-in Prius is lower until next year’s model comes out). This means I can still do a school drop off on electrical power (easily 90% of my planned trips), but I don’t have any range anxiety about longer trips.

The BMW i3 with the range extender is – from a logical perspective – still just a plugin hybrid despite BMW’s protestations to the contrary (the drive train is purely electrical). But it just has too small a petrol tank. If I wanted to take the BMW on a trip to see family or friends, suddenly I’m stuck with a car that is great for the first hour, will continue on petrol for the second hour (assuming range extender) and then makes me stop for petrol around every hour or so after that which is super inconvenient even for a small number of trips (probably less than 6 per year).

Final thoughts

I love the BMW i3. If I had infinite money, then I’d buy one just for doing shopping and mooching around my small home town and embarrassing the neighbours with their ghastly 4×4 VW smog machines. But the i3’s range is far too generous for most days when I don’t need it, and far too constrained for the 1% of journeys when I need hundreds rather than 10s of miles range. So my search continues: I’ll now look at the Audi A3 e-tron (assuming the local garage ever shuts up and takes my money), and maybe the Mitsubishi Outlander PHEV (though it’s rather large for my liking). Stay tuned.

Airport Duty Free VAT: The Scandal Continues

(I haven’t blogged in ages, so here’s a really dull post about corporate malfeasance and VAT to get you back in to the swing of things!)

There’s been a bit of a scandal lately in UK airports where retailers have been pocketing the duty-free savings on their sales rather than passing on savings to customers. While the retailers have tried to justify this, the practice of demanding boarding cards so that the retailer can keep the VAT part of the sale has raised hackles and forced retailers to change their practices.

Rather naively, I’d thought that “changing their practices” meant “stop shovelling VAT money from customers into their own pockets” but my experience today suggested I was quite wrong. It really means “keep taking the VAT money, but leave the customer none the wiser about it.”

While passing through Heathrow airport today (on my way to HPTS), I bought a pair of cheapy headphones and an iPhone battery backup from Dixons – a large UK electrical retailer and popular airport haunt for those looking for a tech fix.

At the checkout I noticed that no-one was being asked for boarding passes. In light of the scandal, I suspect lots of passengers are wise to this. Instead though I watched all the customers each be asked for their destination which they freely gave as if part of a friendly conversation.

But knowing your destination is just a surrogate for a boarding pass as I suspected and was soon to be confirmed. Once the retailer knows your destination, they can carry on the same VAT swindle as before so when asked I declined to give my destination.

A colleague of my server then tried to commit a rather unsavoury action by telling my server that he should enter “Zurich” as my destination since it’s not in a European Union country and so the VAT would accrue to Dixons. My server, to his credit, put my destination as Heathrow – the location of the shop.

I’m grateful to that server for his integrity. However I suspect that, because I pointedly refused to give my destination, he correctly assumed that I would kick up a stink and he didn’t need the hassle. I suspect that he’ll have to ask the same question to many more people today and most of them will politely reveal destination information, facilitating tax money leaving the public purse and being deposited in Dixons’ privatised hands.

I wonder how many thousands of pounds Dixons will take from the UK exchequer today? And tomorrow…

New Releases of REST in Practice and Graph Databases

Over the course of my career, I’ve co-penned three technical books. While the one on WS-* isn’t going to make a massive comeback any day now, the books on REST and Neo4j are doing rather well.

The O’Reilly book “Graph Databases” by Ian Robinson, Emil Eifrem and me was first published back in 2013 as we were working towards the monumental Neo4j 2.0 release. In the elapsed time, Neo4j has continued to grow and evolve, meaning the book is no longer an accurate reflection of contemporary use. At the start of the year, we hit the keyboard and, with the help of the awesome Neo4j community team (thanks especially to Michael Hunger), rewrote the book to be completely up to date with Neo4j 2.x, including new schema indexes and Cypher language improvements. Right now you can get a full, free ebook version of the 2nd edition by signing up at http://graphdatabases.com.

And as if one wasn’t enough, O’Reilly will now release a second printing of REST in Practice by me, Savas Parastatidis, and Ian Robinson (yes, there is a graph forming here). While REST in Practice isn’t technically a second edition (much of the content has stood the test of time) we’ve taken the opportunity to smarten it up a bit and fix up all the errata submissions. Many thanks to those of you that took the time to submit those!

So head on over to my books page and grab yourself a headful of words!

Getting Started with Neo4j 2.0

With the recent release of Neo4j 2.0, it’s a great time to get familiar with graphs and graph databases. Neo4j is quite different from relational databases, and it’s also quite different from most of the other NoSQL databases. The reason for its differences is that it addresses more complex challenges with interconnected (joined-up) data. This makes Neo4j ideal for high fidelity modelling and high-performance querying of rich, real-world domains.

In this article we’ll work through some retail recommendations problems with Neo4j and see how we can use graphs to store and query complex interconnected data for fun and profit.

What’s Neo4j?

Aside from being strangely named software, Neo4j is an enterprise-grade, open source, ACID transactional, graph database. First deployed in 2003, Neo4j is by far the leader in the graph database world with a vibrant community, active development, and huge numbers of successful commercial and open source deployments.

As a graph database, its data model is radically different from relational databases, and although it is considered to be part of the NoSQL space, its data model is also radically different from denormalised aggregate stores. In Neo4j, data is stored in nodes that hold key-value pairs. In turn, those nodes are linked through relationships. There can be many relationships between nodes, and each of them is named and directed (always with a start and end node). Although these are simple tools to understand, they can be used to create very expressive data models where the semantic glue between entities (nodes) is a first-class citizen like the entities themselves. This model is known as a property graph and a simple example is shown in Figure 1.


Figure 1 The Property Graph Data Model

The property graph shown in Figure 1 depicts a small but typical subgraph where nodes (representing things) are connected by relationships that set the semantic context for those things. In this case we can see that Alice loves Bob, and that Bob loves Alice as well as cookies. Equally property graphs could model unrequited love where Alice loves Bob, but there is no reciprocal relationship (how sad!). At runtime however relationships can be traversed in either direction so we can equally ask questions like “Who does Alice love?” as well as “Who loves Alice?” with no computational penalty.
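As a minimal sketch of those two questions, written in the Cypher query language that we meet later in this article, and assuming each node in Figure 1 carries a name property and the relationships are typed LOVES (both names are assumptions for illustration):

```cypher
// Who does Alice love? Follow LOVES relationships out of Alice.
MATCH (alice {name: 'Alice'})-[:LOVES]->(loved)
RETURN loved.name;

// Who loves Alice? Follow LOVES relationships into Alice.
MATCH (alice {name: 'Alice'})<-[:LOVES]-(admirer)
RETURN admirer.name;
```

The two statements are run separately, and the only thing that changes between them is the direction of the arrow.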

However, our modelling tools don’t stop here. Although the property graph model has been stable for over a decade, the Neo4j team continues to innovate in the graph space. In particular, Neo4j 2.0 has added a new twist to the classic property graph model in the form of labels.

Labels allow the modeller to “tag” nodes with names, allowing us to identify similar kinds of nodes like “customer” or “product” to the database. In turn queries can take advantage of that extra data to make them clearer (and faster too). Labels also allow us to specify structural constraints and indexing (for performance) on the graph in a way that is sympathetic rather than external to the domain.
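For example, here is a sketch of label-based indexing and constraints in Neo4j 2.x syntax (later Neo4j versions use a different form of these statements). It assumes Customer and Product nodes that each carry a name property, which is an illustrative choice rather than a requirement of the model:

```cypher
// Index Customer nodes by name to speed up lookups such as {name: 'Alice'}.
CREATE INDEX ON :Customer(name);

// Structural constraint: no two Product nodes may share the same name.
CREATE CONSTRAINT ON (p:Product) ASSERT p.name IS UNIQUE;
```

Both statements are expressed in terms of labels and properties from the domain itself, which is what makes them sympathetic to the model.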

An example with labelled nodes is shown in Figure 2 where we can clearly see that Alice and Bob are the only people who love each other, and that Bob’s other love – cookies – isn’t an interloper but just a product for a tasty teatime snack.


Figure 2 The Labelled Property Graph Data Model supports taxonomies of nodes

From even these simple examples, it’s clear that the property graph model is whiteboard friendly: in fact, typically what you draw is what you store. However the model also supports efficient queries without having to reduce the expressivity of the domain (e.g. through denormalisation) even when the domain itself is messy like the real world. To demonstrate that, let’s start to work directly with Neo4j.

Downloading Neo4j 2.0

The first thing to do is download and install the database. The latest stable version is the newly minted Neo4j 2.0. A comprehensive guide for downloading, installing and running Neo4j on Linux, Mac, and Windows is available at http://www.neo4j.org/download and the only prerequisite is Java 7.

Once we have the database installed, it’s a simple command line or double-click to get it running and then we’re all set to pour data into it and get insight through graph queries. So let’s start by introducing the domain.

Using Neo4j for Retail Recommendations

While property graphs are a very general-purpose data model, to kick-start our learning with Neo4j 2.0 we’ll use retail as our example domain since it’s simple to understand and a very popular use-case (especially when it comes to product recommendations).

Firstly let’s get the basics out of the way: how do we model an individual customer and the product purchases they make over time? This scenario is very illustrative of graph modelling since it covers three discrete dimensions: time, customers, and products, which can be seamlessly integrated when modelled as a graph and later any or all of the dimensions can be intertwined when querying.


Figure 3 Purchase history for a customer

The diagram in Figure 3 shows how we can use a simple linked list of shopping baskets connected by NEXT relationships to create a purchase history for the customer. In that graph we see that the customer has visited three times, saved their first purchase for later (the SAVED relationship between customer and basket nodes) and ultimately bought one basket (indicated by the BOUGHT relationship between customer and basket node) and is currently assembling a basket, shown by the CURRENT relationship that points to an active basket at the head of the linked list. It’s important to understand this isn’t a schema or ER-diagram but represents actual data for a single customer. A real graph of many such customers will be huge (far too big to draw in this article) but exhibit the same kind of structure.

In graph form, it’s easy to figure out the customer’s behaviour: they became a potential new customer but failed to commit to buying toothpaste, then came back one day later and bought toothpaste, bread, and butter. Finally the customer settled on buying bread and butter in their next purchase – which is a repeated pattern in their purchase history we could ultimately use to serve them better.

Now we understand a simple labelled property graph model, it’s time to push it into Neo4j. The primary way of managing data in Neo4j 2.0 is the Cypher query language, which is used to query, alter, label, and index the graph. While the programmatic APIs from Neo4j 1.x remain (and are used to express arbitrary graph algorithms, or enable close-to-the-metal programming) Cypher is often the best choice and is definitely the right place for us to start learning. The first thing we need to do is load our customer graph into Neo4j via Cypher as in Listing 1.

// The customer
CREATE (alice:Customer {name: 'Alice'})
// Purchase history: a linked list of baskets joined by NEXT relationships
CREATE (b1:Basket {id: 'a332deb', date: 20130928})
CREATE (b1)-[:NEXT]->(b2:Basket {id: 'a332deb', date: 20130929})
CREATE (b2)-[:NEXT]->(b3:Basket {id: 'ffda309', date: 20131010})
// How the customer relates to each basket
CREATE (alice)-[:SAVED]->(b1)
CREATE (alice)-[:BOUGHT]->(b2)
CREATE (alice)-[:CURRENT]->(b3)
// Products and the baskets they appear in
CREATE (toothpaste:Product {name: 'Toothpaste'})
CREATE (toothpaste)-[:IN]->(b1)
CREATE (toothpaste)-[:IN]->(b2)
CREATE (bread:Product {name: 'Bread'})
CREATE (bread)-[:IN]->(b2)
CREATE (butter:Product {name: 'Butter'})
CREATE (butter)-[:IN]->(b2)
CREATE (bread)-[:IN]->(b3)
CREATE (butter)-[:IN]->(b3)

Listing 1 Building a purchase history for a customer

Usually it’s an application that generates the kind of Cypher statements we see in Listing 1, but writing it out longhand works pretty well too for learning purposes. Cypher’s philosophy is to draw ASCII-art pictures of the graph for both describing structures for storage and looking for patterns. In Cypher, nodes are represented as (node) while relationships take the form of arrows like -[:LOVES]-> while properties on either nodes or relationships use a JSON-like map syntax such as {name: 'Alice', age: 33}.

To create a customer, we use the syntax CREATE (alice:Customer {name: 'Alice'}) which creates a node with the label Customer and populates that node with a property with key name and value Alice. It then introduces an identifier for that node, alice, into the scope of the current query. The identifier alice is then used to build out connections to other nodes created in the same way. For instance CREATE (alice)-[:BOUGHT]->(b2) creates an outgoing BOUGHT relationship from the node representing Alice to the basket node identified as b2. Though each statement is itself rather simple, in the aggregate we can create very sophisticated and large graph structures with these primitives.
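Once Listing 1 has been loaded, one way to check the result is to read the purchase history back. The following is only a sketch: it starts at the basket Alice is currently assembling and walks the NEXT linked list backwards, collecting the products in each basket along the way. It relies on nothing beyond the labels, relationship types and properties that Listing 1 creates.

```cypher
// Find the head of Alice's basket list via the CURRENT relationship.
MATCH (c:Customer {name: 'Alice'})-[:CURRENT]->(head:Basket)
// Walk zero or more NEXT relationships backwards to reach every earlier basket,
// and pick up the products that are IN each one.
MATCH (head)<-[:NEXT*0..]-(b:Basket)<-[:IN]-(p:Product)
// One row per basket date, oldest first, with that basket's product names.
RETURN b.date AS date, collect(p.name) AS products
ORDER BY date
```

The *0.. lower bound matters here: it lets the pattern match the current basket itself as well as the baskets that precede it.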

Now that we have a graph of customers, and the past products they’ve bought we can think about recommendations to influence their future buying behaviour. By far the simplest recommendation we can make is to show popular products across the store. This is trivial in Cypher as we can see in Listing 2.

MATCH (customer:Customer)-[:BOUGHT]->(:Basket)<-[:IN]-(product:Product)
RETURN product, count(product) ORDER BY count(product) DESC LIMIT 5

Listing 2 Query for the 5 most popular products

The Cypher query in Listing 2 showcases much about Cypher. Firstly the MATCH clause shows how ASCII-art is once again used to declare the graph structure that we’re looking for. In this case it can be read as “customers who bought a basket that had a product in it” except since baskets aren’t particularly important for this query we’ve left them anonymous: the pattern (:Basket) has a label but no identifier. Then we RETURN the data that matched the pattern, and operate on it with some (familiar-looking) aggregate functions. That is, we return the node representing the product(s) and the count of how many product nodes matched, then order by the number of nodes that matched in a descending fashion, limiting to the top 5 which gives us the most popular products in the purchasing data as shown in Listing 3.

| product                                     | count(product) |
| Node[69]{sku:11122398,desc:"whole milk"}    | 5              |
| Node[68]{sku:75620009,desc:"unsliced loaf"} | 4              |
| Node[70]{sku:32986438,desc:"tagliatelle"}   | 3              |
| Node[66]{sku:95645311,desc:"butter"}        | 3              |
| Node[71]{sku:32405923,desc:"tea bags"}      | 2              |

Listing 3 Results of the query for the 5 most popular products

There are a couple of downsides to the query in Listing 2 though. Firstly it’s not really contextualized by the customer, but by all customers and so isn’t very accurate for any given individual, though it might be very useful for supply chain management.  Secondly since it matches the purchasing behaviour of all customers, it’s not likely to be a sub-millisecond query – the kind of thing that we’d put between a Web request and response.

We can do better though, and without much additional work. The next simplest recommendation we can make is simply to show historically popular purchases that the customer has made themselves as we see in Listing 4.

MATCH (customer:Customer {name: 'Alice'})-[:BOUGHT]->(:Basket)<-[:IN]-(product:Product)
RETURN product, count(product) ORDER BY count(product) DESC LIMIT 5

Listing 4 A query to retrieve a customer’s favourite products

The only difference in Listing 4 compared to Listing 2 is the inclusion of a constraint on the customer node that it must contain a key name and a value Alice. This is actually a far better query from the customer’s point of view, and it’s faster too: the database has less work to do because it rules out most of the graph (all the parts that Alice isn’t involved in).

Of course in an age of social selling, it’d be even better to show the customer popular products in their social network rather than just their own purchases since this strongly influences buying behaviour. As you’d expect adding a social dimension to a Neo4j graph is easy. To that end, in Listing 5 we add the social connections, query those connections in Listing 6 and get the substantially more useful results in Listing 7.

CREATE (alice)-[:FRIEND]->(bob)
CREATE (alice)-[:FRIEND]->(dora)
CREATE (bob)-[:FRIEND]->(alice)
CREATE (bob)-[:FRIEND]->(dora)
CREATE (charlie)-[:FRIEND]->(dora)
CREATE (dora)-[:FRIEND]->(alice)
CREATE (dora)-[:FRIEND]->(bob)
CREATE (dora)-[:FRIEND]->(charlie)

Listing 5 Adding friends in a social network for customers
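The statements in Listing 5 only connect existing customers if the identifiers alice, bob, charlie and dora are already bound in the same query. Run on their own, they would create fresh, empty nodes instead. A minimal sketch of binding them first, assuming customers named Bob, Charlie and Dora have already been created in the same way as Alice in Listing 1 (those three nodes are an assumption, since the listings here do not show them being created):

```cypher
// Bind the four customers first; Bob, Charlie and Dora are assumed to exist already.
MATCH (alice:Customer {name: 'Alice'}),
      (bob:Customer {name: 'Bob'}),
      (charlie:Customer {name: 'Charlie'}),
      (dora:Customer {name: 'Dora'})
// With the identifiers bound, CREATE links the existing nodes rather than making new ones.
CREATE (alice)-[:FRIEND]->(bob)
CREATE (dora)-[:FRIEND]->(charlie)
```

The remaining FRIEND relationships from Listing 5 can be appended as further CREATE clauses in the same statement.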

MATCH (customer:Customer {name: 'Alice'})-[:FRIEND*1..2]->(friend:Customer)
WHERE customer <> friend
WITH DISTINCT friend
MATCH (friend)-[:BOUGHT]->(:Basket)<-[:IN]-(product:Product)
RETURN product, count(product) ORDER BY count(product) DESC LIMIT 5

Listing 6 Cypher query for popular products bought by friends and friends of friends in a social network

To retrieve the purchased products of both direct friends and friends-of-friends we use the Cypher WITH clause to divide the query in Listing 6 into two logical parts, piping results from the first part into the second.

In the first part of the query, we see the familiar syntax where we find the current customer (Alice) and traverse the graph matching for either Alice’s direct friends or their friends (her friends-of-friends).

This is a straightforward query since Neo4j supports a flexible path length notation that we see connecting a customer to friends at either depth one or depth two like so: -[:FRIEND*1..2]->. In this case we get all friends (depth one) and friends-of-friends (depth two) but the notation can be written for any minimum and maximum depth.
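To illustrate, here are a few variations on the path length notation, written as a sketch against the same Customer and FRIEND structure. Each statement is run separately, and the depth bounds are written literally in the query text:

```cypher
// Direct friends only (depth exactly one).
MATCH (:Customer {name: 'Alice'})-[:FRIEND]->(friend:Customer)
RETURN DISTINCT friend;

// Depth two to three: friends-of-friends and their friends.
MATCH (:Customer {name: 'Alice'})-[:FRIEND*2..3]->(friend:Customer)
RETURN DISTINCT friend;

// Anyone within three hops; with no lower bound given, the minimum depth is one.
MATCH (:Customer {name: 'Alice'})-[:FRIEND*..3]->(friend:Customer)
RETURN DISTINCT friend;
```

Because friendships can form cycles, the deeper variants may lead back to Alice herself, so in practice they need the same kind of filtering as the main query.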

However the overall structure is not sufficient to arrive at a correct result in this case. In matching we must take care not to include Alice herself in the results, which is possible because Alice is a friend of Bob who is a friend of Alice, making her a friend-of-friend to herself! It is the WHERE clause on the second line which enforces this by ensuring there is only a match when the customer and candidate friend are not the same node.

Furthermore, we don’t want to get duplicate friends-of-friends that are also direct friends (or vice-versa since the outcome is the same). This can happen because Bob is both a direct friend and a friend-of-friend through Dora. Using the DISTINCT keyword ensures that we don’t get duplicate results from equivalent pattern matches.
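The depth bounds in the path notation can be varied independently. As a rough sketch against the same sample data (the identifiers fof and other are illustrative, not from the original listings), the first query below keeps only friends-of-friends and the second widens the search to between two and four hops. They are separate statements, to be run one at a time:

```cypher
// Friends-of-friends only: exactly two FRIEND hops from Alice
MATCH (customer:Customer {name: 'Alice'})-[:FRIEND*2]->(fof:Customer)
WHERE customer <> fof
RETURN DISTINCT fof.name

// Anyone between two and four FRIEND hops from Alice
MATCH (customer:Customer {name: 'Alice'})-[:FRIEND*2..4]->(other:Customer)
WHERE customer <> other
RETURN DISTINCT other.name
```

In both cases the WHERE clause and DISTINCT play the same roles as in Listing 6: excluding Alice herself and collapsing customers reached by more than one path.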

Once we have the friends and friends-of-friends of the customer, the WITH clause allows us to pipe the results from the first part of the query into the second. In the second half of the query, we’re back in familiar territory, matching against customers (the friends and friends-of-friends) who bought products and ranking them by sales (the number of bought baskets each product appeared in). Running the query over our sample data set reveals that whole milk is the most popular product amongst this social group as we see in Listing 7. Since Alice hasn’t bought any whole milk, she should probably add some to her basket in case any of her friends visit.

| product                                      | count(product) |
| Node[155]{sku:11122398,desc:"whole milk"}    | 5              |
| Node[154]{sku:75620009,desc:"unsliced loaf"} | 4              |
| Node[156]{sku:32986438,desc:"tagliatelle"}   | 3              |
| Node[157]{sku:32405923,desc:"tea bags"}      | 2              |
| Node[152]{sku:95645311,desc:"butter"}        | 1              |

Listing 7 Popular products bought by friends and friends of friends

From here it’s easy to see how we can add other dimensions to the query. We can use spatial information about the customer’s address, their social demographic, the store’s supply chain and stock situation and many other things. This seamless blending of multidimensional data is the key strength of modelling with property graphs, and the key strength of Neo4j is that it makes working with property graphs pleasant (because of Cypher), fast (because of index-free adjacency), and reliable (because of ACID transactions and HA clustering).

Declarative Indexing

If you’ve used Cypher in Neo4j 1.x, you’ll have noticed that the START clause has been missing from the queries in this article. It’s still there if you need it (e.g. when you need to access legacy indexes for integration with other data stores), but in Neo4j 2.0 finding starting nodes for a query is handled declaratively by the database rather than by imperatively calling out to an index.

This means we can declare indexes on certain labels and have Neo4j use those indexes, where appropriate, to speed up queries. Selecting a starting node is a typical example, but the Neo4j query planner is free to use indexes anywhere that a scan might otherwise be used.

To declare an index we simply use a little Cypher, as we can see in Listing 8.

CREATE INDEX ON :Customer(name)

Listing 8 Creating an index of Customers based on their names

The index creation clause is simple to understand: it needs a label like :Customer and a property associated with that label to index on, like name. Issuing this command causes Neo4j to start to build out an index of named customers in the background that will become available to the query planner once it completes. The index will also be automatically managed in future as nodes (and relationships) are added, removed, and changed. Without indexes queries will still work (of course!), but with indexes, discrete access to indexed nodes (like starting nodes) will be substantially quicker. It is also less code to write.
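Normally the planner picks up the index on its own, so nothing further is needed. As a hedged sketch of the two other things an operator might want to do, the first statement below gives the planner an explicit hint to use the index from Listing 8 and the second removes that index again. These are separate statements, and the hint is optional rather than required:

```cypher
// Explicitly hint that the :Customer(name) index should be used
MATCH (customer:Customer)
USING INDEX customer:Customer(name)
WHERE customer.name = 'Alice'
RETURN customer

// Drop the index once it is no longer wanted
DROP INDEX ON :Customer(name)
```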

Ultimately this means that operations folks don’t need to be able to code to control the performance of their Neo4j instances at runtime. Instead all they need is a basic grasp of Cypher, and they are all set.


Now that we have labels in Neo4j 2.0, we have an interesting opportunity to begin to apply pragmatic governance to the data model. While Neo4j is traditionally at the “schema-less” end of the database spectrum, the addition of constraints can move us towards the valuable centre ground of “less schema.” In essence this means that we can have light-schema where we need to, and allow organic growth of the graph where we don’t.

In Neo4j 2.0 we see the first such constraint in the form of uniqueness. For example, to create or drop a uniqueness constraint on a product’s SKU (Stock-Keeping Unit) we can use the Cypher in Listing 9.
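A minimal sketch of those two statements in Neo4j 2.0 syntax, assuming the Product label and the sku property seen in the earlier listings (the identifier product is only illustrative):

```cypher
// Enforce that no two Product nodes share the same sku
CREATE CONSTRAINT ON (product:Product) ASSERT product.sku IS UNIQUE

// Remove the constraint again
DROP CONSTRAINT ON (product:Product) ASSERT product.sku IS UNIQUE
```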


Listing 9 Creating and dropping uniqueness constraints

Future versions of Neo4j 2.x will build more useful constraints. For example, we could imagine implementing property constraints such that any node labelled Person must have a first_name and last_name property. Or we could go a step further and introduce structural constraints whereby any node labelled, say, Householder must be connected via a LIVES_AT relationship to another node labelled Address.

Importantly, this is all accessible via the Cypher query language (as well as the programmatic APIs) making it available to operators and developers.


In this article we’ve seen how Neo4j 2.0 and the new version of the Cypher query language can be used to store and query a range of retail data from product catalogue to customer purchases. We also saw how straightforward it was to quickly gain insight from that data, despite the domain being highly and intricately connected.

Neo4j makes it easy to use graphs to analyse data in near real-time for online systems and is widely deployed in large enterprises and start-ups alike. If all this has piqued your interest, Neo4j 2.0 is ready to download, and with the recent release of the O’Reilly book “Graph Databases” by Ian Robinson, me, and Emil Eifrem (a full free eBook version is available at http://graphdatabases.com) this is a great time to get acquainted with Neo4j, Cypher and graphs.

(with thanks to Mark Washeim for spotting a copy-n-paste error in the data set)

My New Book on Graph Databases

Over the last few months as well as working on the day job developing Neo4j, Ian Robinson, Emil Eifrem and I have been working on writing a new book for O’Reilly that showcases the expressive power and technical capabilities of graph databases. I’m really happy to announce that the fruits of that labour have now been released as a full, free early-access eBook, aptly named Graph Databases.

The cover of the book is an octopus, chosen because it looks a bit like a node in a graph connected via some relationships. The book’s available for free download from the GraphDatabases.com site, so head on over there and get downloading.

Neo4j: Facebook GraphSearch for the Rest of Us

The recent big announcement from Facebook was a search platform that provides answers to contextual questions. They’ve called it “Facebook Graph Search” which is a pretty big deal for those of us into graph computing since it moves the notion of graphs from an interesting niche to centre stage for many developers.

The key aspect of Facebook’s Graph Search – at least from an external perspective – is that the quality of the answers stems from querying data and relationships, enabling applications to reason about multiple overlapping facts, and in turn enabling both discrete (path-centric) and probabilistic (weighted) information to be returned. Importantly, the ability to, for example, find friends from a particular city who like a certain food or movie genre is an important move beyond the large but simple graph of friends and likes and into rich semantic data. Naturally Facebook have applied this technology first on their core domain – social – and their platform now supports applications that connect us even better.

But Facebook isn’t alone in needing better insight into its domain. Whether our domains consist, like Facebook’s, of users or whether it’s widgets, data centres, or protein interactions, technologies like Graph-Search could be hugely valuable for the rest of us too. And yet so few of us can hope to operate at the same operational and intellectual scale as Facebook. Hiring in world-class graph boffins to build and run a platform like Graph Search is a non-starter for practically everyone except the handful of global technology giants.

And yet the technology exists in the form of graph databases like Neo4j for business IT solutions and Web applications and APIs to take advantage of graph data. To demonstrate that point, let’s see how we could implement Facebook’s current social Graph-Search features atop Neo4j. It won’t quite have the same operational scale of the Facebook implementation, since Neo4j has only(!) been used to store around half the Facebook graph to-date, but it’ll show how Neo4j provides those same powerful graph features for the rest of us.

Let’s start at the beginning, by solving the kind of question that Facebook posed when it announced Graph Search: find all of the Sushi restaurants in New York that my friends like. However, since I don’t live in New York, nor am I keen on sushi, I’m going to localise it to my conditions here in London. Let’s see what happens when I try to find the curry houses in Southwark (the borough where I live) which my friends like.

Parsing the previous sentence carefully reveals the interesting intersection of several domains: social, geospatial, taxonomical, and so on. Each can be considered both as a graph in isolation and composed within a larger graph. The first domain is the familiar social graph where I’m simply looking for my friends, and (thankfully) find a few by following the relationships marked FRIEND to other people.


Note that here I’m using the diagrammatic shortcut of double-ended relationships to show reciprocal friendships. However in real life relationships aren’t always reciprocal, and so in Neo4j we explicitly model this with two relationships, which is expressed as (Jim)-[:FRIEND]->(Simon) and (Simon)-[:FRIEND]->(Jim) in Cypher, Neo4j’s query language.
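As a small sketch of that modelling choice, a reciprocal friendship can be created as two directed relationships in one statement. The name property matches the lookup used in the query later in this post; the rest of the data is illustrative:

```cypher
// One FRIEND relationship in each direction makes the friendship reciprocal
CREATE (jim {name: 'Jim'}),
       (simon {name: 'Simon'}),
       (jim)-[:FRIEND]->(simon),
       (simon)-[:FRIEND]->(jim)
```

Leaving out either of the two relationships would model a one-sided friendship instead.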

The next aspect to consider is the curry houses themselves which can be easily represented by the following graph:


We can see that these restaurants are indeed curry houses since they’re connected to a node representing that food category via CUISINE relationships. This arrangement acts as a kind of simple index or tag cloud, allowing us to find all the establishments offering a particular kind of cuisine (Neo4j’s indexes work here too, but I want to stay explicitly in the graph for now). It’s noteworthy that the restaurant “Indian Mischief” is easily identifiable as a vegetarian Indian restaurant by being connected to two such category nodes.
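Using category nodes this way, finding restaurants that belong to two categories at once is just a pattern with two CUISINE relationships. A sketch in the same Cypher 1.x style as the full query later in this post, assuming the vegetarian category node is auto-indexed under cuisine='Vegetarian' (that key is a guess, not taken from the data set):

```cypher
START indian = node:node_auto_index(cuisine='Indian'),
      vegetarian = node:node_auto_index(cuisine='Vegetarian')
// A restaurant must be connected to both category nodes to match
MATCH restaurant-[:CUISINE]->indian, restaurant-[:CUISINE]->vegetarian
RETURN restaurant
```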

Now things get more interesting when we bring those domains together to see the curry houses my friends like, itself being easily expressed with the LIKES relationship between the people and the restaurants they’ve enjoyed. In this case we can see that both Kath and I like Tandoori Nights, while Martin and Simon both like Babur (hat tip to the real-life Martin Fowler – co-author of NoSQL Distilled – and Simon Stewart – a Facebook employee no less – for recommending Babur), while nobody explicitly likes Indian Mischief (I can’t attest to it, I’ve never eaten there but it looks nice from the outside).


Finally we have the geospatial aspects of the domain, something you might not immediately recognise as a graph. However geospatial data is easily represented as a tree, and in our example we can see that our target restaurants are in the desired borough (though in different neighbourhoods). Furthermore if I’d used an R-Tree (the canonical structure for spatial data, see: http://en.wikipedia.org/wiki/R-tree) to represent bounding boxes rather than using a simple hierarchy, we’d actually see these neighbourhoods are close by.
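Because the hierarchy is just more graph, listing what sits inside a borough is a short traversal. A sketch, assuming the same IN relationships and auto-index lookup as the full query later in this post, with neighbourhood as an illustrative identifier:

```cypher
START southwark = node:node_auto_index(borough='Southwark')
// Two IN hops: place -> neighbourhood -> borough
MATCH place-[:IN]->neighbourhood-[:IN]->southwark
RETURN place, neighbourhood
```

This returns anything located two levels below the borough, so in a richer graph it would need a further pattern (such as a CUISINE relationship) to narrow the results to restaurants.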


Finally we can bring the whole domain together ready to query it for detailed insight.


Finding Indian restaurants in Southwark which my friends like is a trivial Cypher query:

START jim = node:node_auto_index(name='Jim'), 
      southwark = node:node_auto_index(borough='Southwark'),
      indian = node:node_auto_index(cuisine='Indian')
MATCH jim-[:FRIEND]->friend-[:LIKES]->restaurant-[:IN]->()-[:IN]->southwark, restaurant-[:CUISINE]->indian
WHERE friend-[:FRIEND]->jim
RETURN restaurant

The START clause looks up some well-known nodes in the graph to act as starting points for our query. Under the covers it uses Neo4j indexing, but we can conveniently think of it as a naming server here, where it’s used to remember the node representing me (that is name='Jim'), my borough (borough='Southwark') and the type of cuisine I’m interested in (cuisine='Indian'). Each of these start nodes is bound to a name so they can be referred to in the rest of the query.

The MATCH clause is the heart of any Cypher query. In here we describe, using ASCII art, the patterns that we want the database to discover. In this MATCH clause, I’m firstly asking to match jim-[:FRIEND]->friend which reads “match from the bound node jim those other nodes connected by an outgoing FRIEND relationship and bind them to the identifier friend.” Following on I’m then asking to match where any friend of mine LIKES a restaurant somewhere in Southwark. That’s expressed as friend-[:LIKES]->restaurant-[:IN]->()-[:IN]->southwark where friend and restaurant are nodes matched in the graph, and the identifier southwark is bound to a particular node in the START clause. The other interesting piece of syntax in there is the empty brackets () which indicate an anonymous node that we’re not interested in naming for this query. The reason we use that syntax here is that we’re interested in any neighbourhood in Southwark, but aren’t interested in knowing the specifics, meaning we don’t bother to name nodes that will match on neighbourhoods in the graph. Finally in the second part of the MATCH clause (after the comma) we specify that the restaurant matched must serve Indian food.

Remember that Neo4j is a schemaless database. In this example I have safely inferred that certain nodes represent boroughs, restaurants, people and so on by the way they connect to the graph. However to be totally certain, it can be wise in some situations to check the contents of nodes too, particularly where the relationships in the domain tend to be very homogeneous.

Following the MATCH clause we have a WHERE filter that ensures we only accept recommendations from people who reciprocate our friendship. Since friendship is otherwise unilateral, this seems a sensible thing to do – it might be unwise to go to places recommended by people we like but who don’t reciprocate.

So far, so good. By adding an engaging user interface that understands natural language (like the one my colleague Max de Marzi wrote in a weekend), we could declare functional equivalence to the fundamental Facebook Graph Search functionality. And yet with graphs it’s so very easy to continue adding dimensions into our data and support increasingly sophisticated query functionality just like Graph Search.

Since graphs allow us to explore any number of dimensions in a domain separately or together they provide exceptional expressive power and insight. Once you’re hooked on that expressive power, it’s easy to see how you can go so much further with relatively little effort. We can trivially extend the Facebook examples to encompass other facts encoded in the graph. Whether that’s musical preference, language, job history or any other facet that we choose to store in the graph, we can query that structure efficiently and gain great insight.

Now, why not try it yourself? Installing Neo4j is just a minute’s job and is only a click away.

Neo4j Koans update

Over the last couple of years, Ian Robinson and I (along with a little help from friends in the community) have been building a set of koans for learning Neo4j. These koans follow the same mantra as the Ruby Koans, where you’re given a set of failing unit tests and as you make them green you learn something. For the Ruby Koans that learning is the Ruby language and idioms, for the Neo4j Koans you learn the Neo4j APIs and Cypher query language as a side-effect of making the unit tests go green.

Oh, and you learn lots about the TV show “Doctor Who” too since that’s the (lovingly and extensively curated) data set that underlies the tests. But we’ve now decided to update the Neo4j Koans so if you make use of them, read on.

The Koans themselves along with all the scripts required to turn them from “teacher’s version” into “students’ version” and the associated dependency management and tool download actions are staying in the public repository. These will form the basis for chapters for our next book project (since our current book project Graph Databases (O’Reilly) is nearing completion). However all the teaching materials (and in particular that big, slow-to-clone PowerPoint deck) are going away since the teaching materials will be focussed in the book itself rather than as slides. The deck will – of course – remain in version control history but over time it will become inconsistent with the koans so be careful if you use historical materials with current koans.

We don’t think this is going to inconvenience too many people, but if it’s going to affect you then don’t hesitate to reach out to me.


jimwebber.org is live again

After a decade or so of being somewhat surreptitiously hosted by the formidable folks at Newcastle University (thanks guys, you’re awesome) my blog site finally had to move to a proper commercial provider.

Following his own lead, I’ve retired Savas‘s trusty PBlog (ASP.NET!) engine and moved to WordPress mostly so that I can ask him for tech support when I can’t do mod rewrite syntax :-)

Thanks to those of you (especially on Twitter) that took time to tell me my blog was down; this is my hat tip to you to say it’s back up again. My apologies to those of you who’ve commented on my blog over the years: those comments are still languishing in a relational database (ha!) and I’ll port ’em over on my next sick day.