Neo4j: Facebook GraphSearch for the Rest of Us

The recent big announcement from Facebook was a search platform that provides answers to contextual questions. They’ve called it “Facebook Graph Search” which is a pretty big deal for those of us into graph computing since it moves the notion of graphs from an interesting niche to centre stage for many developers.

The key aspect of Facebook’s Graph Search – at least from an external perspective – is that the quality of the answers stems from querying data and relationships, enabling applications to reason about multiple overlapping facts, and in turn enable both discrete (path-centric) and probabilistic (weighted) information to be returned. Importantly, the ability to, for example, find friends from a particular city and who like a certain food or movie genre is important move beyond the large but simple graph of friends and likes and into rich semantic data. Naturally Facebook have applied this technology first on their core domain – social – and their platform now supports applications that now connect us even better.

But Facebook isn’t alone in needing better insight into its domain. Whether our domains consist, like Facebook’s, of users or whether it’s widgets, data centres, or protein interactions, technologies like Graph-Search could be hugely valuable for the rest of us too. And yet so few of us can hope to operate at the same operational and intellectual scale as Facebook. Hiring in world-class graph boffins to build and run a platform like Graph Search is a non-starter for practically everyone except the handful of global technology giants.

And yet the technology exists in the form of graph databases like Neo4j for business IT solutions and Web applications and APIs to take advantage of graph data. To demonstrate that point, let’s see how we could implement Facebook’s current social Graph-Search features atop Neo4j. It won’t quite have the same operational scale of the Facebook implementation, since Neo4j has only(!) been used to store around half the Facebook graph to-date, but it’ll show how Neo4j provides those same powerful graph features for the rest of us.

Let’s start at the beginning, by solving the kind of question that Facebook posed when it announced Graph Search: find all of the Sushi restaurants in New York that my friends like. However, since I don’t live in New York, nor am I keen on sushi, I’m going to localise it to my conditions here in London. Let’s see what happens when I try to find the curry houses in Southwark (the borough where I live) which my friends like.

Parsing the previous sentence carefully reveals the interesting intersection of several domains: social, geospatial, taxonomical, and so on. Each can be considered both as a graph in isolation and composed within a larger graph. The first domain is the familiar social graph where I’m simply looking for my friends, and (thankfully) find a few by following the relationships marked FRIEND to other people.

1

Note that here I’m using the diagrammatic shortcut of double-ended relationships to show reciprocal friendships. However in real life relationships aren’t always reciprocal, and so in Neo4j we explicitly model this with two relationships, which is expressed as (Jim)-[:FRIEND]->(Simon) and (Simon)-[:FRIEND]->(Jim) in Cypher, Neo4j’s query language.

The next aspect to consider is the curry houses themselves which can be easily represented by the following graph:

2

We can see that these restaurants are indeed curry houses since they’re connected to a node representing that food category via CUISINE relationships. This arrangement acts as a kind of simple index or tag cloud, allowing us to find a all the establishments offering a particular kind of cuisine (Neo4j’s indexes work here too, but want to stay explicitly in the graph for now). It’s noteworthy the restaurant “Indian Mischief” is easily identifiable as a vegetarian Indian restaurant by being connected to two such category nodes.

Now things get more interesting when we bring those domains together to see the curry houses my friends like, itself being easily expressed with the LIKES relationship between the people and the restaurants they’ve enjoyed. In this case we can see that both Kath and I like Tandoori Nights, while Martin and Simon both like Babur (hat tip to the real-life Martin Fowler – co-author of NoSQL Distilled - and Simon Stewart – a Facebook employee no less – for recommending Babur), while nobody explicitly likes Indian Mischief (I can’t attest to it, I’ve never eaten there but it looks nice from the outside).

3

Finally we have the geospatial aspects of the domain, something you might not immediately recognise as a graph. However geospatial data is easily represented as a tree, and in our example we can see that our target restaurants are in the desired borough (though in different neighbourhoods). Furthermore if I’d used an R-Tree (the canonical structure for spatial data, see: http://en.wikipedia.org/wiki/R-tree) to represent bounded boxes rather than using a simple  hierarchy, we’d actually see these neighbourhoods are close by.

4

Finally we can bring the whole domain together ready to query it for detailed insight.

5

To find Indian restaurants in Southwark which my friends like is a trivial Cyper query:

START jim = node:node_auto_index(name='Jim'), 
      southwark = node:node_auto_index(borough='Southwark'),
      indian = node:node_auto_index(cuisine='Indian')
MATCH jim-[:FRIEND]->friend-[:LIKES]->restaurant-[:IN]->()-[:IN]->southwark, restaurant-[:CUISINE]->indian
WHERE friend-[:FRIEND]->jim
RETURN restaurant

The START clause looks up some well-known nodes in the graph to act as starting points for our query. Under the covers it uses Neo4j indexing, but we can conveniently think of it as a naming server here, where it’s used to remember the node representing me (that is name=’Jim’), my borough (borough=’Southwark’) and the type of cuisine I’m interested in (cuisine=’Indian’). Each of these start nodes is bound to a name so they can be referred to in the rest of the query.

The MATCH clause is the heart of any Cypher query. In here we describe, using ASCII art, the patterns that we want the database to discover. In this MATCH clause, I’m firstly asking to match jim-[:FRIEND]-friend which reads “match from the bound node jim those other nodes connected by an outgoing FRIEND relationship and bind them to the identifier friend.” Following on I’m then asking to match where any friend of mine LIKES a restaurant somewhere in Southwark. That’s expressed as friend-[:LIKES]->restaurant-[:IN]->()-[:IN]->(southwark) where friend and restaurant are nodes matched in the graph, and the node southwark is bound to a particular node in the start clause. The other interesting piece of syntax in there are the empty brackets () which indicate an anonymous node that we’re not interested in naming for this query. The reason we use that syntax here is that we’re interested in any borough in Southwark, but aren’t interested in knowing the specifics, meaning we don’t bother to name nodes that will match on boroughs in the graph. Finally in the second part of the MATCH clause (after the comma) we specify that the restaurant matched must serve Indian food.

Remember that Neo4j is a schemaless database. In this example I have safely inferred that certain nodes represent boroughs, restaurants, people and so on by the way they connect to the graph. However to be totally certain, it can be wise in some situations to check the contents of nodes too, particularly where the relationships in the domain tend to be very homogeneous.

Following the match clause we have a WHERE filter that ensures we only accept recommendations from people who declare that reciprocate our friendship. Since friendship is otherwise unilateral, this seems a sensible thing to do – it might be unwise to go places where people we like, but who don’t reciprocate, recommend.

So far, so good. And adding an engaging user interface that understands natural language (like the one my colleague Max de Marzi wrote in a weekend), we could declare functional equivalence to the fundamental Facebook Graph Search functionality. And yet with graphs it’s so very easy to continue adding dimensions into our data and support increasingly sophisticated query functionality just like Graph Search.

Since graphs allow us to explore any number of dimensions in a domain separately or together they provide exceptional expressive power and insight. Once you’re hooked on that expressive power, it’s easy to see how you can go so much further with relatively little effort. We can trivially extend the Facebook examples to encompass other facts encoded in the graph. Whether that’s musical preference, language, job history or any other facet that we choose to store in the graph, we can query that structure efficiently and gain great insight.

Now, why not try it yourself? Installing Neo4j takes just a minute job and is only a click away, and the source code for the examples in this post is available at my GitHub repo.

Posted in neo4j
2 comments on “Neo4j: Facebook GraphSearch for the Rest of Us
  1. Nick says:

    Nice post. I’m curious about a large-scale graph (larger than FB). What are the limitations with Neo4j to support the big data of big data?

    • jim says:

      Hi Nick,

      In terms of raw numbers, Neo4j currently scales to 32B nodes/rels and 62B properties. But those are somewhat arbitrary limits stemming from the choice of pointer size on disk which will be removed in our 2.0 release (still finite, but effectively infinite).

      The bigger challenge is dataset size – Neo4j is a master/slave cluster at the moment meaning a full (or nearly full) replica of the data is on each participating machine. Practically I think this is operationally inconvenient much past a few tens of TB.

      In future releases, Neo4j will drop master/slave in favour of masterless (peer to peer) clusters. At this point we’ll take the opportunity to partition the data between the machines in the cluster so that the footprint of each is more modest. This will also provide horizontal scale for writes (within Amdahl’s limits) as well as reads.

      HTH.

      Jim

Leave a Reply

Your email address will not be published. Required fields are marked *

*

You may use these HTML tags and attributes: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>