Getting Started with Neo4j 2.0

With the recent release of Neo4j 2.0, it’s a great time to get familiar with graphs and graph databases. Neo4j is quite different from relational databases, and it’s also quite different from most of the other NoSQL databases. The reason for these differences is that it addresses more complex challenges with interconnected (joined-up) data. This makes Neo4j ideal for high-fidelity modelling and high-performance querying of rich, real-world domains.

In this article we’ll work through some retail recommendation problems with Neo4j and see how we can use graphs to store and query complex interconnected data for fun and profit.

What’s Neo4j?

Aside from being strangely named software, Neo4j is an enterprise-grade, open source, ACID transactional, graph database. First deployed in 2003, Neo4j is by far the leader in the graph database world with a vibrant community, active development, and huge numbers of successful commercial and open source deployments.

As a graph database, Neo4j has a data model that is radically different from that of relational databases, and although it is considered to be part of the NoSQL space, its data model is also radically different from denormalised aggregate stores. In Neo4j, data is stored in nodes that hold key-value pairs. In turn, those nodes are linked through relationships. There can be many relationships between nodes, and each of them is named and directed (always with a start and end node). Although these are simple tools to understand, they can be used to create very expressive data models where the semantic glue between entities (nodes) is a first-class citizen like the entities themselves. This model is known as a property graph and a simple example is shown in Figure 1.

Figure 1 The Property Graph Data Model

The property graph shown in Figure 1 depicts a small but typical subgraph where nodes (representing things) are connected by relationships that set the semantic context for those things. In this case we can see that Alice loves Bob, and that Bob loves Alice as well as cookies. Equally, property graphs could model unrequited love, where Alice loves Bob but there is no reciprocal relationship (how sad!). At runtime, however, relationships can be traversed in either direction, so we can ask “Who does Alice love?” just as easily as “Who loves Alice?”, with no computational penalty.
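
To make the model concrete outside the database, here is a minimal plain-Python sketch of the small graph described for Figure 1. It is only an illustration of the idea, not how Neo4j stores or traverses data: the Relationship tuple, the node ids, the name properties and the outgoing/incoming helpers are all names assumed for this example, and the sketch scans a list where a graph database would follow stored references.

```python
from collections import namedtuple

# A relationship is named and directed: it always has a start and an end node.
Relationship = namedtuple("Relationship", ["start", "type", "end"])

# Nodes hold key-value pairs (properties); the ids used as keys are hypothetical.
nodes = {
    "alice": {"name": "Alice"},
    "bob": {"name": "Bob"},
    "cookies": {"name": "Cookies"},
}

relationships = [
    Relationship("alice", "LOVES", "bob"),
    Relationship("bob", "LOVES", "alice"),
    Relationship("bob", "LOVES", "cookies"),
]


def outgoing(node_id, rel_type):
    """Follow relationships in their stored direction: who does node_id love?"""
    return [nodes[r.end]["name"] for r in relationships
            if r.start == node_id and r.type == rel_type]


def incoming(node_id, rel_type):
    """Follow the same relationships against their direction: who loves node_id?"""
    return [nodes[r.start]["name"] for r in relationships
            if r.end == node_id and r.type == rel_type]


print(outgoing("alice", "LOVES"))  # who does Alice love?
print(incoming("alice", "LOVES"))  # who loves Alice?
```

Both questions are answered from the same three stored relationships; the direction is a property of the data, not a restriction on how it can be queried.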

However, our modelling tools don’t stop here. Although the property graph model has been stable for over a decade, the Neo4j team continues to innovate in the graph space. In particular, Neo4j 2.0 has added a new twist to the classic property graph model in the form of labels.

Labels allow the modeller to “tag” nodes with names, allowing us to identify similar kinds of nodes like “customer” or “product” to the database. In turn, queries can take advantage of that extra data to become clearer (and faster too). Labels also allow us to specify structural constraints and indexing (for performance) on the graph in a way that is sympathetic rather than external to the domain.

An example with labelled nodes is shown in Figure 2 where we can clearly see that Alice and Bob are the only people who love each other, and that Bob’s other love – cookies – isn’t an interloper but just a product for a tasty teatime snack.

Figure 2 The Labelled Property Graph Data Model supports taxonomies of nodes
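
As a rough illustration of what labels add, the plain-Python sketch below tags each node with a set of labels and selects nodes by label. The label names Person and Product, the with_label helper and the dictionary layout are assumptions made for this example; it says nothing about how Neo4j represents labels internally.

```python
# Each node carries a set of labels alongside its properties.
nodes = [
    {"labels": {"Person"}, "name": "Alice"},
    {"labels": {"Person"}, "name": "Bob"},
    {"labels": {"Product"}, "name": "Cookies"},
]


def with_label(label):
    """Return the names of all nodes tagged with the given label."""
    return [node["name"] for node in nodes if label in node["labels"]]


people = with_label("Person")
products = with_label("Product")
print(people)    # the nodes that represent people
print(products)  # the nodes that represent products
```

Because labels are held as a set, a node could carry more than one of them, which is what lets labels describe overlapping kinds of things.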

From even these simple examples, it’s clear that the property graph model is whiteboard friendly; in fact, typically what you draw is what you store. However, the model also supports efficient queries without having to reduce the expressivity of the domain (e.g. through denormalisation), even when the domain itself is messy like the real world. To demonstrate that, let’s start to work directly with Neo4j.

Downloading Neo4j 2.0

The first thing to do is download and install the database; the latest stable version is the newly minted Neo4j 2.0. A comprehensive guide for downloading, installing and running Neo4j on Linux, Mac, and Windows is available at http://www.neo4j.org/download and the only prerequisite is Java 7.

Once we have the database installed, it’s a simple command line or double-click to get it running and then we’re all set to pour data into it and get insight through graph queries. So let’s start by introducing the domain.

Using Neo4j for Retail Recommendations

While property graphs are a very general-purpose data model, to kick-start our learning with Neo4j 2.0 we’ll use retail as our example domain since it’s simple to understand and a very popular use-case (especially when it comes to product recommendations).

Firstly, let’s get the basics out of the way: how do we model an individual customer and the product purchases they make over time? This scenario is very illustrative of graph modelling since it covers three discrete dimensions (time, customers, and products) which can be seamlessly integrated when modelled as a graph; later, any or all of the dimensions can be intertwined when querying.

Figure 3 Purchase history for a customer

The diagram in Figure 3 shows how we can use a simple linked list of shopping baskets connected by NEXT relationships to create a purchase history for the customer. In that graph we see that the customer has visited three times, saved their first purchase for later (the SAVED relationship between customer and basket nodes) and ultimately bought one basket (indicated by the BOUGHT relationship between customer and basket node) and is currently assembling a basket, shown by the CURRENT relationship that points to an active basket at the head of the linked list. It’s important to understand this isn’t a schema or ER-diagram but represents actual data for a single customer. A real graph of many such customers will be huge (far too big to draw in this article) but exhibit the same kind of structure.

In graph form, it’s easy to figure out the customer’s behaviour: they became a potential new customer but failed to commit to buying toothpaste, then came back one day later and bought toothpaste, bread, and butter. Finally, the customer settled on bread and butter in their next purchase, a repeated pattern in their purchase history that we could ultimately use to serve them better.
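
The reading of the purchase history above can be sketched in a few lines of plain Python. This is a toy model rather than anything Neo4j-specific: the dictionary layout, the next field standing in for the NEXT relationship, and the history helper are assumptions for illustration, while the dates and basket contents follow the article’s example data.

```python
# Three baskets in purchase order; "next" plays the role of the NEXT relationship.
baskets = {
    "b1": {"date": 20130928, "products": ["Toothpaste"], "next": "b2"},
    "b2": {"date": 20130929, "products": ["Toothpaste", "Bread", "Butter"], "next": "b3"},
    "b3": {"date": 20131010, "products": ["Bread", "Butter"], "next": None},
}

# The customer's SAVED, BOUGHT and CURRENT relationships, keyed by target basket.
status = {"b1": "SAVED", "b2": "BOUGHT", "b3": "CURRENT"}


def history(first_basket):
    """Walk the NEXT chain from the oldest basket to the current one."""
    basket_id = first_basket
    while basket_id is not None:
        basket = baskets[basket_id]
        yield basket["date"], status[basket_id], basket["products"]
        basket_id = basket["next"]


timeline = list(history("b1"))
for date, state, products in timeline:
    print(date, state, products)

# Products present in both the bought basket and the current one.
repeated = set(baskets["b2"]["products"]) & set(baskets["b3"]["products"])
print(sorted(repeated))
```

Walking the list gives the visits in time order, and intersecting the last two baskets surfaces the repeated bread-and-butter pattern mentioned above.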

Now that we understand a simple labelled property graph model, it’s time to push it into Neo4j. The primary way of managing data in Neo4j 2.0 is the Cypher query language, which is used to query, alter, label, and index the graph. While the programmatic APIs from Neo4j 1.x remain (and are used to express arbitrary graph algorithms, or enable close-to-the-metal programming), Cypher is often the best choice and is definitely the right place for us to start learning. The first thing we need to do is load our customer graph into Neo4j via Cypher as in Listing 1.

CREATE (alice:Customer {name: 'Alice'})
CREATE (b1:Basket {id: 'a332deb', date: 20130928})
CREATE (b1)-[:NEXT]->(b2:Basket {id: 'a332deb', date: 20130929})
CREATE (b2)-[:NEXT]->(b3:Basket {id: 'ffda309', date: 20131010})
CREATE (alice)-[:SAVED]->(b1)
CREATE (alice)-[:BOUGHT]->(b2)
CREATE (alice)-[:CURRENT]->(b3)
CREATE (toothpaste:Product {name: 'Toothpaste'})
CREATE (toothpaste)-[:IN]->(b1)
CREATE (toothpaste)-[:IN]->(b2)
CREATE (bread:Product {name: 'Bread'})
CREATE (bread)-[:IN]->(b2)
CREATE (butter:Product {name: 'Butter'})
CREATE (butter)-[:IN]->(b2)
CREATE (bread)-[:IN]->(b3)
CREATE (butter)-[:IN]->(b3)

Listing 1 Building a purchase history for a customer

Usually it’s an application that generates the kind of Cypher statements we see in Listing 1, but writing it out longhand works pretty well too for learning purposes. Cypher’s philosophy is to draw ASCII-art pictures of the graph for both describing structures for storage and looking for patterns. In Cypher, nodes are represented as (node), relationships take the form of arrows like -[:LOVES]->, and properties on either nodes or relationships use a JSON-like map syntax such as {name: 'Alice', age: 33}.

To create a customer, we use the syntax CREATE (alice:Customer {name: 'Alice'}) which creates a node with the label Customer and populates that node with a property with key name and value Alice. It then introduces an identifier for that node, alice, into the scope of the current query. The identifier alice is then used to build out connections to other nodes created in the same way. For instance, CREATE (alice)-[:BOUGHT]->(b2) creates an outgoing BOUGHT relationship from the node representing Alice to the basket node identified as b2. Though each statement is itself rather simple, in the aggregate we can create very sophisticated and large graph structures with these primitives.

Now that we have a graph of customers and the products they’ve bought in the past, we can think about recommendations to influence their future buying behaviour. By far the simplest recommendation we can make is to show popular products across the store. This is trivial in Cypher, as we can see in Listing 2.

MATCH (customer:Customer)-[:BOUGHT]->(:Basket)<-[:IN]-(product:Product)
RETURN product, count(product) ORDER BY count(product) DESC LIMIT 5

Listing 2 Query for the 5 most popular products

The Cypher query in Listing 2 showcases much about Cypher. Firstly, the MATCH clause shows how ASCII-art is once again used to declare the graph structure that we’re looking for. In this case it can be read as “customers who bought a basket that had a product in it”, except that, since baskets aren’t particularly important for this query, we haven’t given them an identifier and instead use the anonymous node (:Basket). Then we RETURN the data that matched the pattern and operate on it with some (familiar-looking) aggregate functions. That is, we return the node representing each product and the count of how many times that product node matched, then order by that count in descending fashion and limit to the top 5. This gives us the most popular products in the purchasing data, as shown in Listing 3.

+--------------------------------------------------------------+
| product                                     | count(product) |
+--------------------------------------------------------------+
| Node[69]{sku:11122398,desc:"whole milk"}    | 5              |
| Node[68]{sku:75620009,desc:"unsliced loaf"} | 4              |
| Node[70]{sku:32986438,desc:"tagliatelle"}   | 3              |
| Node[66]{sku:95645311,desc:"butter"}        | 3              |
| Node[71]{sku:32405923,desc:"tea bags"}      | 2              |
+--------------------------------------------------------------+

Listing 3 Results of the query for the 5 most popular products
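
For readers who want to see the counting logic spelled out, here is a small plain-Python sketch of the same idea: follow every bought basket to the products in it, count one hit per product per basket, and keep the top few. The purchase data, the popular_products helper and the alphabetical tie-break are assumptions of this sketch (the query above does not specify how ties are ordered), and none of it reflects how Neo4j executes the query.

```python
from collections import Counter

# Hypothetical data: which baskets each customer BOUGHT,
# and which products are IN each basket.
bought = {"Alice": ["b2"], "Bob": ["b4", "b5"]}
contents = {
    "b2": ["Toothpaste", "Bread", "Butter"],
    "b4": ["Bread", "Milk"],
    "b5": ["Bread", "Butter"],
}


def popular_products(customers, limit=5):
    """Count one hit per (bought basket, product) pair and return the top products."""
    counts = Counter()
    for customer in customers:
        for basket in bought[customer]:
            counts.update(contents[basket])
    # Highest count first; ties broken alphabetically so the output is stable.
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:limit]


top = popular_products(bought)
for name, count in top:
    print(name, count)
```

A product that appears in two bought baskets is counted twice, which matches the way the aggregate counts one row per matched path.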

There are a couple of downsides to the query in Listing 2 though. Firstly, it’s not really contextualised by the customer, but by all customers, and so isn’t very accurate for any given individual, though it might be very useful for supply chain management. Secondly, since it matches the purchasing behaviour of all customers, it’s not likely to be a sub-millisecond query, the kind of thing that we’d put between a Web request and response.

We can do better though, and without much additional work. The next simplest recommendation we can make is simply to show historically popular purchases that the customer has made themselves as we see in Listing 4.

MATCH (customer:Customer {name: 'Alice'})-[:BOUGHT]->(:Basket)<-[:IN]-(product:Product)
RETURN product, count(product) ORDER BY count(product) DESC LIMIT 5

Listing 4 A query to retrieve a customer’s favourite products

The only difference in Listing 4 compared to Listing 2 is the inclusion of a constraint on the customer node: it must have a property with key name and value Alice. This is a far better query from the customer’s point of view, and it’s also faster because the database has to do less work: it rules out most of the graph (all the parts that Alice isn’t involved in).

Of course, in an age of social selling, it’d be even better to show the customer popular products in their social network rather than just their own purchases, since this strongly influences buying behaviour. As you’d expect, adding a social dimension to a Neo4j graph is easy. To that end, in Listing 5 we add the social connections (the identifiers there are assumed to be already bound to the corresponding customer nodes), query those connections in Listing 6, and get the substantially more useful results in Listing 7.

CREATE (alice)-[:FRIEND]->(bob)
CREATE (alice)-[:FRIEND]->(dora)
CREATE (bob)-[:FRIEND]->(alice)
CREATE (bob)-[:FRIEND]->(dora)
CREATE (charlie)-[:FRIEND]->(dora)
CREATE (dora)-[:FRIEND]->(alice)
CREATE (dora)-[:FRIEND]->(bob)
CREATE (dora)-[:FRIEND]->(charlie)

Listing 5 Adding friends in a social network for customers

MATCH (customer:Customer {name: 'Alice'})-[:FRIEND*1..2]->(friend:Customer)
WHERE customer <> friend
WITH DISTINCT friend
MATCH (friend)-[:BOUGHT]->(:Basket)<-[:IN]-(product:Product)
RETURN product, count(product) ORDER BY count(product) DESC LIMIT 5

Listing 6 Cypher query for popular products bought by friends and friends of friends in a social network

To retrieve the purchased products of both direct friends and friends-of-friends we use the Cypher WITH clause to divide the query in Listing 6 into two logical parts, piping results from the first part into the second.

In the first part of the query, we see the familiar syntax where we find the current customer (Alice) and traverse the graph, matching either Alice’s direct friends or their friends (her friends-of-friends).

This is a straightforward query since Neo4j supports a flexible path length notation that we see connecting a customer to friends at either depth one or depth two, like so: -[:FRIEND*1..2]->. In this case we get all friends (depth one) and friends-of-friends (depth two), but the notation can be parameterised for any maximum and minimum depth.

However, the overall structure is not sufficient to arrive at a correct result in this case. In matching we must take care not to include Alice herself in the results, which is possible because Alice is a friend of Bob who is a friend of Alice, making her a friend-of-friend to herself! It is the WHERE clause on the second line of the query that enforces this, by ensuring there is only a match when the customer and candidate friend are not the same node.

Furthermore, we don’t want to get duplicate friends-of-friends that are also direct friends (or vice-versa since the outcome is the same). This can happen because Bob is both a direct friend and a friend-of-friend through Dora. Using the DISTINCT keyword ensures that we don’t get duplicate results from equivalent pattern matches.
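
The effect of the self-exclusion and the DISTINCT step can be traced by hand with a small plain-Python sketch over the friendships from Listing 5. The path_ends helper and the dictionary layout are assumptions for illustration, not Neo4j’s matching algorithm. The sketch also ignores Cypher’s rule that a relationship is not traversed twice within one path, which cannot arise here because a two-step path could only reuse a relationship that loops from a node to itself.

```python
# Directed FRIEND relationships, as created in Listing 5.
friend = {
    "alice": ["bob", "dora"],
    "bob": ["alice", "dora"],
    "charlie": ["dora"],
    "dora": ["alice", "bob", "charlie"],
}


def path_ends(start, min_depth=1, max_depth=2):
    """End node of every FRIEND path whose length lies within the given bounds."""
    ends = []
    frontier = [start]
    for depth in range(1, max_depth + 1):
        # Extend every path found so far by one more FRIEND hop.
        frontier = [nxt for node in frontier for nxt in friend[node]]
        if depth >= min_depth:
            ends.extend(frontier)
    return ends


raw = path_ends("alice")                               # one entry per matching path
not_self = [name for name in raw if name != "alice"]   # WHERE customer <> friend
distinct = sorted(set(not_self))                       # WITH DISTINCT friend

print(raw)
print(not_self)
print(distinct)
```

The raw list contains Alice twice (via Bob and via Dora) and Bob and Dora twice each; after both filters only Bob, Charlie and Dora remain, each exactly once.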

Once we have the friends and friends-of-friends of the customer, the WITH clause allows us to pipe the results from the first part of the query into the second. In the second half of the query, we’re back in familiar territory, matching against customers (the friends and friends-of-friends) who bought products and ranking them by sales (the number of bought baskets each product appeared in). Running the query over our sample data set reveals that whole milk is the most popular product amongst this social group as we see in Listing 7. Since Alice hasn’t bought any whole milk, she should probably add some to her basket in case any of her friends visit.

+---------------------------------------------------------------+
| product                                      | count(product) |
+---------------------------------------------------------------+
| Node[155]{sku:11122398,desc:"whole milk"}    | 5              |
| Node[154]{sku:75620009,desc:"unsliced loaf"} | 4              |
| Node[156]{sku:32986438,desc:"tagliatelle"}   | 3              |
| Node[157]{sku:32405923,desc:"tea bags"}      | 2              |
| Node[152]{sku:95645311,desc:"butter"}        | 1              |
+---------------------------------------------------------------+

Listing 7 Popular products bought by friends and friends of friends

From here it’s easy to see how we can add other dimensions to the query. We can use spatial information about the customer’s address, their social demographic, the store’s supply chain and stock situation, and many other things. This seamless blending of multidimensional data is the key strength of modelling with property graphs, and the key strength of Neo4j is that it makes working with property graphs pleasant (because of Cypher), fast (because of index-free adjacency), and reliable (because of ACID transactions and HA clustering).

Declarative Indexing

If you’ve used Cypher in Neo4j 1.x, you’ll have noticed that the START clause has been missing from the queries in this article. It’s still there if you need it (e.g. when you need to access legacy indexes for integration with other data stores), but in Neo4j 2.0 finding starting nodes for a query is handled declaratively by the database rather than by imperatively calling out to an index.

This means we can declare indexes on certain labels and have Neo4j use those indexes, where appropriate, to speed up queries. Selecting a starting node is a typical example, but the Neo4j query planner is free to use indexes anywhere that a scan might otherwise be used.

To declare an index we simply use a little Cypher, as we can see in Listing 8.

CREATE INDEX ON :Customer(name)

Listing 8 Creating an index of Customers based on their names

The index creation clause is simple to understand: it needs a label like :Customer and a property associated with that label to index on, like name. Issuing this command causes Neo4j to start to build out an index of named customers in the background that will become available to the query planner once it completes. The index will also be automatically managed in future as nodes (and relationships) are added, removed, and changed. Without indexes, queries will still work (of course!), but with indexes, discrete access to indexed nodes (like starting nodes) will be substantially quicker. It is also less code to write.
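
As an analogy only, the plain-Python sketch below contrasts finding a customer by scanning every node with finding it through a prebuilt lookup keyed on name. The customer records, the find_by_scan and find_by_index helpers and the dictionary-based index are all assumptions of this sketch; it is not a description of Neo4j’s index implementation.

```python
# Hypothetical customer nodes; the ids are made up for illustration.
customers = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 3, "name": "Charlie"},
    {"id": 4, "name": "Dora"},
]


def find_by_scan(name):
    """Without an index: inspect customer nodes one by one until one matches."""
    examined = 0
    for node in customers:
        examined += 1
        if node["name"] == name:
            return node, examined
    return None, examined


# Building the index is a one-off pass over the labelled nodes.
name_index = {}
for node in customers:
    name_index.setdefault(node["name"], []).append(node)


def find_by_index(name):
    """With an index: go straight to the matching node, if there is one."""
    matches = name_index.get(name, [])
    return matches[0] if matches else None


node, examined = find_by_scan("Dora")
print(examined)                        # how many nodes the scan had to inspect
print(find_by_index("Dora") is node)   # the index finds the same node directly
```

The scan has to look at every node before it reaches Dora, while the lookup cost does not grow with the number of other customers, which is the kind of saving an index offers when choosing starting nodes.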

Ultimately this means that operations folks don’t need to be able to code to control the performance of their Neo4j instances at runtime. Instead all they need is a basic grasp of Cypher, and they are all set.

Constraints

Now that we have labels in Neo4j 2.0, we have an interesting opportunity to begin to apply pragmatic governance to the data model. While Neo4j is traditionally at the “schema-less” end of the database spectrum, the addition of constraints can move us towards the valuable centre ground of “less schema.” In essence this means that we can have light-schema where we need to, and allow organic growth of the graph where we don’t.

In Neo4j 2.0 we see the first such constraint in the form of uniqueness. For example, to create or drop a uniqueness constraint on a product’s SKU (Stock-Keeping Unit) we can use the Cypher in Listing 9.

CREATE CONSTRAINT ON (p:Product) ASSERT p.sku IS UNIQUE
DROP CONSTRAINT ON (p:Product) ASSERT p.sku IS UNIQUE

Listing 9 Creating and dropping uniqueness constraints
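
To illustrate what a uniqueness constraint means in practice, here is a toy plain-Python model that refuses to create a second product with an existing SKU. The ProductStore class, the UniquenessViolation exception and the second product description are hypothetical names chosen for this sketch; Neo4j enforces the constraint inside the database and reports violations in its own way.

```python
class UniquenessViolation(Exception):
    """Raised when a new node would duplicate a constrained property value."""


class ProductStore:
    """Toy store that allows at most one product per SKU."""

    def __init__(self):
        self._by_sku = {}

    def create(self, sku, desc):
        # The check and the write happen together, so no duplicate can slip in.
        if sku in self._by_sku:
            raise UniquenessViolation("a Product with sku %s already exists" % sku)
        self._by_sku[sku] = {"sku": sku, "desc": desc}
        return self._by_sku[sku]

    def __len__(self):
        return len(self._by_sku)


store = ProductStore()
store.create(11122398, "whole milk")

try:
    store.create(11122398, "semi-skimmed milk")  # same SKU, so this is rejected
    rejected = False
except UniquenessViolation as error:
    rejected = True
    print(error)

print(len(store))  # number of products actually stored
```

The point of declaring the rule once, rather than checking it in every application that writes products, is that the data cannot drift out of line with it.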

Future versions of Neo4j 2.x will build more useful constraints. For example, we could imagine implementing property constraints such that any node labelled Person must have a first_name and last_name property. Or we could go a step further and introduce structural constraints whereby any node labelled, say, Householder must be connected via a LIVES_AT relationship to another node labelled Address.

Importantly, this is all accessible via the Cypher query language (as well as the programmatic APIs) making it available to operators and developers.

Summary

In this article we’ve seen how Neo4j 2.0 and the new version of the Cypher query language can be used to store and query a range of retail data from product catalogue to customer purchases. We also saw how straightforward it was to quickly gain insight from that data, despite the domain being highly and intricately connected.

Neo4j makes it easy to use graphs to analyse data in near real-time for online systems and is widely deployed in large enterprises and start-ups alike. If all this has piqued your interest, Neo4j 2.0 is ready to download, and with the recent release of the O’Reilly book “Graph Databases” by Ian Robinson, me, and Emil Eifrem (a full free eBook version is available at http://graphdatabases.com) this is a great time to get acquainted with Neo4j, Cypher and graphs.

(with thanks to Mark Washeim for spotting a copy-n-paste error in the data set)

3 comments on “Getting Started with Neo4j 2.0”
  1. Maarten says:

    You say that ‘Future versions of Neo4j 2.x will build more useful constraints. For example, we could imagine …’. Can you give more details about the actual planned constraints?

  2. jim says:

    @Maarten

    I can give my view of what would be useful. There’s probably some overlap with that and what will ship with Neo4j in the 2.x series. Off the top of my head:

    1. Property exists constraints. E.g. for a label :Person, there must be first_name and last_name properties present on that node.
    2. Structural constraints. E.g. if a node has a label :Car then it must be connected via an incoming :DRIVES relationship to at least one other node with label :Person.

    Hope that’s useful.

    Jim

  3. Ramesh says:

    This use case is well explained and its good for beginner to understand development in Neo4j..!
    Thanks for posting this kind of use case.

    –Ramesh
