Governments should not subsidise business through the working poor

I don’t often blog about politics, but the decline of the UK exchequer and the shift of blame from the corporate greedy to the working needy have seen some fundamentally unfair and downright nasty pieces of legislation being progressed by our minority government.

The latest of these policies designed to help prop up the exchequer (which we all know is carrying enormous debt caused by bailing out the greedy and reckless, and is now being bullied by the same because of its debts) is the idea to cut council tax benefits by 10% and devolve the blame from central government to local government.

For those of you who are not familiar with the UK taxation system (and I’m no expert), council tax is a local tax paid by residents according to the value of the homes they live in, and funds local amenities like refuse collection. However, some people in the UK are on sufficiently low incomes that they receive benefits to help them pay their share of this tax.

Let me re-iterate that: some people in the UK are so poor that the government steps in to help them pay their tax.

On the surface, you could assert that the government is currently doing a fine thing by helping those with few means, and that by reducing the council tax help it gives to some of its most needy citizens it would be doing them quite a disservice. That’s an easy argument to make, especially since the relatively few recipients of this benefit need all the help they can get.

But the easy answer is not necessarily the right answer. I disagree with this benefit, not because I think it is unworthy, but because I think it places the burden in the wrong place. By subsidising a citizen’s tax obligations, the government is effectively subsidising employers and continuing to allow employment for far below a living wage (including the ability to meet the tax obligations that implies).

This is an outrageous situation where the government (on behalf of the citizens) subsidises corporations through beneficial corporate tax regimes, provides them with a stable and regulated market, educates and medicates their employees, and provides a relatively competent transport network to move their goods and services around. But by paying some citizens’ council tax for them, the government is also effectively subsidising industry’s wage bill.

Governments shouldn’t be in the business of paying taxes to themselves. Governments should be in the business of provisioning services through the taxes they collect. We must undo the moral and financial decay that has allowed business to free-ride on the taxpayer, force employers to pay a living wage, and end this perversion where we subsidise the rich by using the working poor as a government-sponsored investment vehicle.

It’s time for business to pay its fair share.

Posted in Politics

Graph Processing versus Graph Databases

There’s recently been a great deal of discussion on the subject of graph processing. For those of us in the graph database space, this is an exciting development, since it reinforces the utility of graphs as both a storage and a computational model. Confusingly, however, processing graph-like data is often mistakenly conflated with graph databases because the two share the same data model, yet each tool addresses a fundamentally different problem.

For example, graph processing platforms like Google’s Pregel achieve high aggregate computational throughput by adopting the Bulk Synchronous Parallel (BSP) model from the parallel computing community. Pregel supports large-scale graph processing by partitioning a graph across many machines and allowing those machines to compute efficiently at vertices using localised data. Only during synchronisation phases is localised information exchanged (cf. the BSP model). This gives Google the ability to process huge volumes of interconnected data, albeit at relatively high latencies, to gain greater business insight than with traditional (non-graph-optimised) map-reduce approaches.
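
Pregel’s API isn’t public, so the sketch below is purely illustrative: hypothetical Java interfaces that capture the vertex-centric BSP style it popularised, in which each vertex computes using only local data and information crosses partitions only as messages delivered at the next superstep.

```java
// Hypothetical sketch of the vertex-centric BSP style popularised by Pregel.
// These types are illustrative only; none of this is Google's actual API.
import java.util.List;

interface Vertex {
    double getValue();
    void setValue(double value);
    void sendMessageToNeighbours(double message);   // delivered at the next superstep
    void voteToHalt();                              // computation ends when every vertex halts
}

interface VertexProgram {
    // Invoked once per active vertex per superstep, using only local state plus
    // the messages sent to this vertex during the previous superstep.
    void compute(Vertex vertex, List<Double> incomingMessages, int superstep);
}

// Example program: flood the maximum value in the graph to every vertex.
class MaxValueProgram implements VertexProgram {
    public void compute(Vertex vertex, List<Double> incomingMessages, int superstep) {
        double max = vertex.getValue();
        for (double message : incomingMessages) {
            max = Math.max(max, message);
        }
        if (superstep == 0 || max > vertex.getValue()) {
            vertex.setValue(max);
            vertex.sendMessageToNeighbours(max);    // exchanged only at the superstep boundary
        } else {
            vertex.voteToHalt();                    // nothing changed; go quiescent
        }
    }
}
```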

Sadly, few of us have Google-scale resources at our disposal to invent novel platforms on demand. In enterprise-scale scenarios, Hadoop (incidentally, an open-source implementation of Google’s earlier map-reduce framework) has become a popular platform for batch processing large volumes of data. Like Pregel, Hadoop is a high-latency, high-throughput processing tool that optimises computational throughput by processing large volumes of data in parallel outside the database.

Unlike Pregel, Hadoop is a general-purpose framework: it can be used for graph processing, but it isn’t optimised for that purpose, and neither of its underlying storage mechanisms, HDFS (a distributed file system) and HBase (a distributed tabular database designed for large numbers of rows and columns), is graph-oriented in nature (though, interestingly, the Ravel GoldenOrb platform claims to add a Pregel-like programming model on top of Hadoop).

What Pregel and Hadoop have in common is their tendency towards the data analytics (OLAP) end of the spectrum, rather than being focussed on transaction processing. This is in stark contrast to graph databases like Neo4j which optimise storage and querying of connected data for online transaction processing (OLTP) scenarios – much like a regular RDBMS, only with a more expressive and powerful data model. We can visualise these differing capabilities easily as in the figure below:

[Figure: positioning of Pregel, Hadoop, relational databases, and Neo4j along graph-affinity and OLTP/OLAP axes]

In this breakdown, Pregel is positioned firmly in the OLAP graph processing space, much as Hadoop is positioned in the general-purpose OLAP space (though closer to the OLTP axis because of recent advances in so-called real-time Hadoop). Relational databases are positioned as general-purpose OLTP engines that can be somewhat adapted to OLAP needs. Neo4j has strong graph affinity and is designed primarily for OLTP scenarios, though as a native graph database with strong read scalability it can also be suited to OLAP work.

However, the Hadoop community continues to foster innovation in the area of graph processing, and there are regular announcements about how Hadoop can be adapted towards solving graph problems. Recently, Daniel Abadi publicised work from his team at Yale University on solving graph problems more efficiently with Hadoop.

This work is novel empirical science and presents an important observation: by skilfully partitioning data in HBase to exploit locality, (graph) computational throughput in Hadoop can be substantially increased. And yet, for casual observers of the NOSQL community, this is easily misread as the demise of graph databases, which appear to have much more modest throughput. However, I don’t believe this is a valid comparison:

  • Hadoop is a batch processing framework, and operates at high latencies compared to graph databases (even real-time Hadoop involves latencies of seconds, compared to the millisecond scale at which Neo4j operates). The work done to improve graph processing through data locality means that batches will be executed more efficiently, and so throughput will be higher (or similar throughput will be achievable with fewer computational resources). Yet latency will remain comparatively high, so this approach is unlikely to suit the on-demand (OLTP) processing that is the mainstay of most applications, where latency is more helpfully measured in milliseconds. Instead it is likely to remain firmly in the OLAP domain for the foreseeable future.
  • For generating regular reports from a data warehouse or pre-computing results, batch processing can be a sensible strategy, especially if it can be made efficient by laying out data carefully. Making it efficient comes at a cost, namely that data has to be denormalised within HBase, widening the cognitive gap between your data and how it is represented for processing. Conversely, Neo4j works in OLAP scenarios consistently with how it works in OLTP scenarios – your OLTP database is your OLAP database (usually a read slave, with the same data model). This means Neo4j doesn’t need denormalisation or special processing infrastructure, and for large read queries like reporting jobs it scales very well, even under heavy and unpredictable online loads.
  • Batch-oriented approaches are best suited to cases where data can be read and processed outside the database rather than manipulated in place. That is, efficiently processing static graph-like data (or triples) not only requires careful placement of data in HBase, but practically rules out mutating the graph during processing. In contrast, Neo4j supports in-place graph mutation, which is a more powerful tool for real-time Web analytics than (even efficiently processed) batches.

Bringing all of these sentiments together, it’s clear that we’re looking at two different tools for two different sets of problems. The Hadoop-based solution offers batch-oriented processing at high throughput, with correspondingly high latency and substantial denormalisation. The Neo4j approach emphasises native graph OLTP with real-time OLAP: more modest throughput, but very low (millisecond) latency, and since the work happens in the database the data is always consistent.

So if you need OLTP and deep, OLAP-style insight in near real-time at enterprise scale, then Neo4j is a sensible choice. For niche problems where you can afford high latency in exchange for higher throughput, graph processing platforms like Pregel or Hadoop could be beneficial. But it’s important to understand that they are not the same.

Posted in neo4j, NOSQL

Square Pegs and Round Holes in the NOSQL World

The graph database space is a peculiar corner of the NOSQL universe. In general, the NOSQL movement has pushed towards simpler data models with more sophisticated computation infrastructure compared to traditional RDBMS. In contrast, graph databases like Neo4j actually provide a far richer data model than a traditional RDBMS, and a search-centric rather than compute-intensive method for data processing.

Strangely the expressive data model supported by graphs can be difficult to understand amid the general clamour of the simpler-is-better NOSQL movement. But what makes this doubly strange is that other NOSQL database types can support limited graph processing too.

This strange duality, where non-graph stores can be used for limited graph applications, was the subject of a thread on the Neo4j mailing list, which was the inspiration for this post. In that thread, community members discussed the value of using non-graph stores for graph data, particularly since prominent Web services are known to use this approach (like Twitter’s FlockDB). But as it happens, the use-case for those graphs tends to be relatively shallow – “friend” and “follow” relationships and suchlike. In those situations, it can be a reasonable solution to have information in your values (or document properties, columns, or even rows in a relational database) to indicate a shallow relation, as we can see in this diagram:

[Figure: documents in a non-graph store whose values hold identifiers pointing at other documents, indicating shallow relationships]

At runtime, the application using the datastore (remember: that’s code you typically have to write) follows the logical links between stored documents and creates a logical graph representation. That is, the application code needs to understand how to build a graph out of those loosely linked documents.

[Figure: application code reifying a logical graph from loosely linked documents]
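
To make “the application reifies the graph” concrete, here is a minimal sketch. Everything in it is hypothetical – the Document type and the in-memory Map standing in for the document store are illustrative, not any particular product’s API – but it shows that following relationships means repeated fetch-and-follow code living in the application layer.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical document type: relationships exist only as identifiers stored in values.
class Document {
    final String id;
    final List<String> friendIds = new ArrayList<String>();
    Document(String id) { this.id = id; }
}

class DocumentGraphWalker {
    // Stand-in for a KV/document store keyed by document id.
    private final Map<String, Document> store = new HashMap<String, Document>();

    void put(Document document) { store.put(document.id, document); }

    // Reify a two-hop neighbourhood: every hop is another round of lookups that
    // the application has to orchestrate itself - precisely the code a native
    // graph database would otherwise own.
    Set<String> friendsOfFriends(String customerId) {
        Set<String> result = new HashSet<String>();
        Document customer = store.get(customerId);
        for (String friendId : customer.friendIds) {
            Document friend = store.get(friendId);   // one fetch per link followed
            result.addAll(friend.friendIds);
        }
        result.remove(customerId);
        return result;
    }
}
```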

If the graphs are shallow, this approach can work reasonably well. Twitter’s FlockDB is an existential proof of that. But as relationships between data become more interesting and deeper, this is an approach that rapidly runs out of steam. This approach requires graphs to be structured early on in the system lifecycle (design time), meaning a specific topology is baked into the datastore and into the application layer. This implies tight coupling between the code that reifies the graphs and the mechanism through which they’re flattened in the datastore. Any structural changes to the graph now require changes to the stored data and the logic that reifies the data.

Neo4j takes a different approach: it stores graphs natively and so separates application and storage concerns. That is, where your documents have relationships between them, that’s the way they’re stored, searched, and processed in Neo4j, even if those relationships are very deep. In this case, the logical graph that we reified from the document store can be natively (and efficiently) persisted in Neo4j.

[Figure: the same logical graph persisted natively in Neo4j]

What’s often deceptive is that in some use-cases, projecting a graph from a document or KV store and using Neo4j might begin with seemingly similar levels of complexity. For example, we might create an e-commerce application with customers and the items they have bought. In the KV or document case, we might store the identifiers of products our customers had bought inside the customer entity. In Neo4j, we’d simply add relationships like PURCHASED between customer nodes and the product nodes they’d bought. Since Neo4j is schema-less, adding these relationships doesn’t require migrations, nor should it affect any existing code using the data. The next diagram shows this contrast: the graph structure is explicit in the graph database, but implicit in a document store.

[Figure: customers and purchases as explicit PURCHASED relationships in the graph database versus implicit product identifiers inside customer documents]
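
For contrast, here is what adding that relationship looks like with the Neo4j embedded Java API of the 1.x era. It is a sketch rather than production code: database construction and node lookup are elided, and the property name is illustrative.

```java
import org.neo4j.graphdb.*;

public class PurchaseRecorder {
    private static final RelationshipType PURCHASED =
            DynamicRelationshipType.withName("PURCHASED");

    // Assumes we already hold references to the customer and product nodes.
    public void recordPurchase(GraphDatabaseService db, Node customer, Node product) {
        Transaction tx = db.beginTx();
        try {
            // No schema migration needed: the new relationship simply appears
            // alongside whatever structure already exists.
            Relationship purchase = customer.createRelationshipTo(product, PURCHASED);
            purchase.setProperty("purchasedOn", System.currentTimeMillis());
            tx.success();
        } finally {
            tx.finish();
        }
    }
}
```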

Even at this stage, the graph shows its flexibility. Imagine that a number of customers bought a product that had to be recalled. In the document case we’d run a query (typically using a map/reduce framework) that grabs the document for each customer and checks whether a customer has the identifier for the defective product in their purchase history. This is a big undertaking if each customer has to be checked, though thankfully because it’s an embarrassingly parallel operation we can throw hardware at the problem. We could also design a clever indexing scheme, provided we can tolerate the write latency and space costs that indexing implies.

With Neo4j, all we need to do is locate the product (by graph traversal or index lookup) and look for incoming PURCHASED relations to determine immediately which customers need to be informed about the product recall. Easy peasy!
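
As a sketch of how direct that is with the embedded Java API (1.x era; the index lookup that finds the product node is elided):

```java
import java.util.ArrayList;
import java.util.List;
import org.neo4j.graphdb.*;

public class RecallNotifier {
    private static final RelationshipType PURCHASED =
            DynamicRelationshipType.withName("PURCHASED");

    // Every incoming PURCHASED relationship starts at a customer node, so the
    // affected customers are exactly one hop away from the recalled product.
    public List<Node> customersToNotify(Node recalledProduct) {
        List<Node> customers = new ArrayList<Node>();
        for (Relationship purchase :
                recalledProduct.getRelationships(PURCHASED, Direction.INCOMING)) {
            customers.add(purchase.getStartNode());
        }
        return customers;
    }
}
```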

As the e-commerce solution grows, we want to evolve a social aspect to shopping so that customers can receive buying recommendations based on what their social group has purchased. In the non-native graph store, we now have to encode the notion of friends, and even friends of friends, into the store and into the logic that reifies the graph. This is where things start to get tricky, since we now have a deeper traversal from a customer to customers (friends) to customers (friends of friends) and then into purchases. What initially seemed simple is now starting to look dauntingly like a fully-fledged graph store, albeit one we have to build ourselves.

[Figure: reifying friend, friend-of-friend, and purchase relationships from a document store in the application layer]

Conversely, in the Neo4j case we simply use the FRIEND relationships between customers: for recommendations we traverse the graph across all outgoing FRIEND relationships (limited to depth 1 for immediate friends, or depth 2 for friends-of-friends), then follow outgoing PURCHASED relationships to see what those friends have bought. What’s important here is that it’s Neo4j that handles the hard work of traversing the graph, not the application code, as we can see in the diagram above.
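
A sketch of that traversal with the embedded Java API (1.x era). Plain relationship iteration is used to keep the example self-contained; Neo4j’s traversal framework would express the depth limit more declaratively.

```java
import java.util.HashSet;
import java.util.Set;
import org.neo4j.graphdb.*;

public class Recommendations {
    private static final RelationshipType FRIEND =
            DynamicRelationshipType.withName("FRIEND");
    private static final RelationshipType PURCHASED =
            DynamicRelationshipType.withName("PURCHASED");

    public Set<Node> boughtByFriendsOfFriends(Node customer) {
        // Collect friends (depth 1) and friends-of-friends (depth 2).
        Set<Node> people = new HashSet<Node>();
        for (Relationship f : customer.getRelationships(FRIEND, Direction.OUTGOING)) {
            Node friend = f.getEndNode();
            people.add(friend);
            for (Relationship ff : friend.getRelationships(FRIEND, Direction.OUTGOING)) {
                people.add(ff.getEndNode());
            }
        }
        people.remove(customer);

        // Follow their outgoing PURCHASED relationships to candidate products.
        Set<Node> recommended = new HashSet<Node>();
        for (Node person : people) {
            for (Relationship p : person.getRelationships(PURCHASED, Direction.OUTGOING)) {
                recommended.add(p.getEndNode());
            }
        }
        return recommended;
    }
}
```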

But there’s much more value the e-commerce site can derive from this data. Not only can social recommendations be based on close friends, but the site can also start to look for trends and base recommendations on them. This is precisely the kind of thing that supermarket loyalty schemes do with big iron and long-running SQL queries – but we can do it on commodity hardware very rapidly using Neo4j.

For example, one set of customers that we might want to incentivise are those people who we think are young performers. These are customers that perhaps have told us something about their age, and we’ve noticed a particular buying pattern surrounding them – they buy DJ-quality headphones. Often those same customers buy DJ-quality decks too, but there’s a potentially profitable set of those customers that – shockingly – don’t yet own decks (much to the gratitude of their flatmates and neighbours I suspect).

With a document or KV store, looking for this pattern by trawling through all the customer documents and projecting a graph is laborious. But matching this pattern in a graph is quite straightforward and efficient – we simply specify a prototype to match against and then traverse the graph structure looking for matches.

[Figure: matching the “bought DJ headphones but not decks” pattern against the graph]
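
As a sketch of the code involved (embedded Java API, 1.x era; the age property and the way the two product nodes are obtained are illustrative assumptions), a hand-rolled version of the “headphones but no decks” match might look like this:

```java
import java.util.HashSet;
import java.util.Set;
import org.neo4j.graphdb.*;

public class YoungPerformerFinder {
    private static final RelationshipType PURCHASED =
            DynamicRelationshipType.withName("PURCHASED");

    // Customers who bought the headphones, are young enough, and do not yet own the decks.
    public Set<Node> candidates(Node djHeadphones, Node djDecks, int maxAge) {
        Set<Node> matches = new HashSet<Node>();
        for (Relationship bought :
                djHeadphones.getRelationships(PURCHASED, Direction.INCOMING)) {
            Node customer = bought.getStartNode();
            Integer age = (Integer) customer.getProperty("age", null);   // assumed property
            if (age != null && age <= maxAge && !hasBought(customer, djDecks)) {
                matches.add(customer);
            }
        }
        return matches;
    }

    private boolean hasBought(Node customer, Node product) {
        for (Relationship r : customer.getRelationships(PURCHASED, Direction.OUTGOING)) {
            if (r.getEndNode().equals(product)) {
                return true;
            }
        }
        return false;
    }
}
```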

This shows a wonderful emergent property of graphs – simply store all the data you like as nodes and relationships in Neo4j and later you’ll be able to extract useful business information that perhaps you can’t imagine today, without the performance penalties associated with joins on large datasets.

In these kinds of situations, choosing a non-graph store for storing graphs is a gamble. You may find that you’ve designed your graph topology far too early in the system lifecycle and lost the ability to evolve the structure and perform business intelligence on your data. That’s why Neo4j is cool – it keeps graph and application concerns separate, and allows you to defer data modelling decisions to more responsible points throughout the lifetime of your application.

So if you’re fighting with graph data imprisoned in key-value, document, or relational datastores, then try Neo4j.

Posted in neo4j, NOSQL

Neo4j 1.3GA: Free as in Beer!

Over at Neo Tech HQ, we’ve been working away for the last 3 months, and today we’re finally releasing Neo4j 1.3 GA. That in itself is usually cause enough for a celebration and a bit of a hangover, but this release marks an important turning point in Neo4j’s licensing, and I think for the NOSQL space in general.

From here on, the core database (the community edition) is licensed under the GPL! That is, you can run as many instances of the Neo4j community edition as you like (except as an OEM), forever, for free!

For more sophisticated deployments, we’ve simplified our product structure and are now offering Neo4j Advanced (with management features) and Neo4j Enterprise (with Advanced features plus HA) both under a dual AGPL and commercial license (so you can still stay open source, or simply see what you’re buying).

The release announcement has just been posted to the Neo4j blog with all the juicy details on features and licensing, so what are you waiting for? Go download and install everywhere!

Posted in neo4j, NOSQL

Strategies for Scaling Neo4j

As I’ve discussed before, graph databases like Neo4j can lack the predictability in scaling of other kinds of NOSQL stores (that’s the cost of a rich data model). But with a little thought, we’ve seen how both cache sharding and the application of domain-specific strategies can help to improve throughput and increase the dataset size that Neo4j can store and process.

Those blog posts triggered some very useful discussion on the Neo4j mailing list, with several community members adding their own thoughts and experiences. In particular, Mark Harwood suggested a simple heuristic for deciding on a scaling strategy that I thought was so useful I’d share it (with his permission) here.

  1. Dataset size: Many tens of Gigabytes
    Strategy: Fill a single machine with RAM
    Reasoning: Modern rack servers contain a lot of RAM, of the order of 128GB for a typical machine. Since Neo4j likes to use RAM to cache data, where O(dataset) ≈ O(memory) we can keep all the data cached in memory and operate on it at extremely high speed.
    Weaknesses: Still need to cluster for availability; write scalability limited by disk performance.
  2. Dataset size: Many hundreds of Gigabytes
    Strategy: Cache sharding
    Reasoning: The dataset is too big to hold all in RAM on a single server but small enough to allow replicating it on disk across machines. Using cache sharding improves the likelihood of reaching a warm cache which maintains high performance. The cost of a cache miss is not catastrophic (a local disk read), and can be mitigated by replacing spinning disks with SSDs.
    Weaknesses: Consistent routing requires a router on top of the Neo4j infrastructure; write-master in the cluster is a limiting factor in write-scalability.
  3. Dataset size: Terabytes and above
    Strategy: Domain-specific sharding
    Reasoning: At this scale, the dataset is too big for a single memory space and it’s too big to practically replicate across machines so sharding is the only viable alternative. However given there is no perfect algorithm (yet) for arbitrary sharding of graphs, we rely on domain-specific knowledge to be able to predict which nodes should be allocated to which machines.
    Weaknesses: Not all domains may be amenable to domain-specific sharding.

As an architect, I really like this heuristic. It’s easy to find where I am on the scale and plan a Neo4j deployment accordingly. It also provides quite an exciting pointer towards the future – while I think most enterprise-scale deployments are currently in the tens to hundreds of gigabytes range, there are clearly applications out there for connected data that do – or will – require more horsepower.

Neo4j 2.0 will address these challenges, and it’s going to be a fun ride. But until then, I hope you’ll find this heuristic as helpful as I did.

Posted in neo4j, NOSQL

Scaling Neo4j with Cache Sharding and Neo4j HA

In the Neo4j world, we consider large datasets to be those which are substantially larger than main memory. With such large datasets, it’s impossible for a Neo4j instance to cache the whole database in RAM and therefore provide extremely rapid traversals of the graph, since we’ll have to hit the disk eventually. In those situations we’ve previously recommended scaling vertically by opting for solid-state storage to provide constant, low seek times for data on disk (avoiding the high seek penalty incurred by spinning disks). While the SSD approach provides a substantial performance boost in most scenarios, even the fastest SSD isn’t a replacement for RAM.

I wrote previously that partitioning graphs across physical instances is a notoriously difficult way to scale graph data. Yet we still want the ability to service large workloads and host large data sets in Neo4j. Until the recent 1.2 release I think this was a weak point for the database, but with the advent of Neo4j High Availability (HA) I don’t think it is anymore. In fact, Neo4j HA gives us considerably more options for designing solutions for availability, scalability, and very large data sets.

One pattern that I’ve seen using Neo4j HA for large deployments is what we’re calling “cache sharding” to maintain high performance with a dataset that far exceeds main memory space. Cache sharding isn’t sharding in the traditional sense, since we expect a full data set to be present on each database instance. To implement cache sharding, we partition the workload undertaken by each database instance to increase the likelihood of hitting a warm cache for a given request – and warm caches in Neo4j are ridiculously high performance.

The solution architecture for this setup is shown below. We move from the hard problem of graph sharding, to the simpler problem of consistent routing, something which high volume Web farms have been doing for ages.

[Figure: a consistent router directing requests to Neo4j HA instances, each holding a full copy of the dataset and serving a partition of the workload]

The strategy we use to implement consistent routing will vary by domain. Sometimes it’s good enough to have sticky sessions; other times we’ll want to route based on the characteristics of the data set. A simple strategy is to have the instance that first serves requests for a particular user go on to serve subsequent requests for that user, giving a good chance that each request will be processed by a warm cache. Other domain-specific approaches also work: for example, in a geographical data system we can route requests about particular locations to specific database instances, which will be warm for those locations. Either way, we’re increasing the likelihood of the required nodes and relationships already being cached in RAM, and therefore quick to access and process.
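
As a sketch of how little machinery consistent routing needs (the instance URLs and the hash-modulo rule are illustrative assumptions, not a Neo4j feature – sticky sessions or geography-based rules work just as well):

```java
import java.util.Arrays;
import java.util.List;

public class CacheShardRouter {
    // Illustrative HA instance addresses; in practice these come from configuration.
    private final List<String> instances = Arrays.asList(
            "http://neo4j-1:7474", "http://neo4j-2:7474", "http://neo4j-3:7474");

    // Any stable function of the routing key will do: the same user always lands
    // on the same instance, so that instance's cache stays warm for that user's
    // neighbourhood of the graph.
    public String instanceFor(String userId) {
        int index = Math.abs(userId.hashCode() % instances.size());
        return instances.get(index);
    }
}
```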

Of course reading from the database is only half of the story. If we’re going to run a number of servers to exploit their large aggregate caching capability, we need to keep those servers in sync. This is where the Neo4j HA software becomes particularly useful. A Neo4j HA deployment effectively creates a multi-master cluster.

Writes to any instance will result in all other instances eventually receiving that write through the HA protocol. Writing to the elected master in the cluster causes the data to be persisted (strictly ACID, always), and then changes are propagated to the slaves through the HA protocol for eventual consistency.

[Figure: a write processed by the elected master is persisted and then propagated to the slaves via the HA protocol]

If a write operation is processed by a slave, it enrols the elected master in a transaction and both instances persist the results (again strictly ACID). Other slaves then catch up through the HA protocol.

[Figure: a write processed by a slave enrols the master in the transaction; the remaining slaves catch up via the HA protocol]

By using this pattern, we can treat a Neo4j HA cluster as a performant database for reads and writes, knowing that with a good routing strategy we’re going to be keeping traversals in memory and extremely fast – fast enough for very demanding applications.

Posted in neo4j, NOSQL

CFP: ECOWS 2011 – The 9th IEEE European Conference on Web Services

14-16 September 2011, Lugano, Switzerland

http://ecows2011.inf.usi.ch

Call for Research, Industry, and PhD Symposium Papers

Abstract submission: 11 April 2011
Paper submission: 15 April 2011

The IEEE European Conference on Web Services (ECOWS) is the premier conference on the advances in the state of the art and practice of Web services. The main objectives of this conference are to facilitate the exchange between researchers and practitioners and to foster future collaborations in Europe and beyond. The ECOWS 2011 conference will include invited speakers, presentations of contributed research papers and an industrial track with the participation of top researchers from industry. ECOWS 2011 will also include a PhD symposium and satellite workshops.

Background

The success encountered by the Web has shown that tightly coupled software systems are only good for niche markets, whereas loosely coupled software systems can be more flexible, more adaptive and often more appropriate in practice. Loose coupling makes it easier for a given system to interact with other systems, possibly legacy systems that share very little with it. Web services are at the crossing of distributed computing and loosely coupled systems. When applications are developed with service-oriented architectures, they can evolve more easily during their lifespan and adapt better to changing or even unpredictable environments. Services today can be implemented such that they are discovered and invoked dynamically using non-proprietary mechanisms. Such services are developed and deployed in a black-box manner. This is of particular importance from a business perspective since services are implemented in a variety of technologies. Agreement on integration technology is essential, and a consensus has emerged in today’s middleware market: customers want to use Web technologies. Despite these promises, however, service integrators, developers, and providers need to create methods, tools and techniques to support cost-effective development and use of dependable services and service-oriented applications.

Topics of interest

The ECOWS 2011 program committee seeks original, high quality papers related to all aspects of Web Services, which constitute the main technology available to date for implementing service-oriented architectures and computing.

Topics of interest to the Research and Industry Tracks include, but are not limited to, the following:

  • Business Process Management and Web Services
  • Cloud Services Management and Composition using Web Services
  • Dynamic and Adaptive Web Services
  • Economics Models and Web Services
  • Enterprise Architecture and Web Services
  • Emerging Trends of Service Composition and Mashups
  • Experience reports of novel applications of Web Services in Health, Commerce, Finance, Telecom, Scientific Computing and other domains
  • Service modeling, service-oriented analysis and design
  • Formal Methods for Web Services
  • Frameworks for Building Web Service-Based Applications
  • Identity and Access Management using Web Services
  • Mobile Web Services
  • Model-Driven Web Service Engineering
  • Next Generation Web Services Middleware and Service Repositories
  • Service quality and service interface design guidelines
  • RESTful Web Services
  • Self-Organizing Service Oriented Architectures
  • Semantic Web Services
  • Service Level Agreements for Web services
  • Service-Oriented Business Collaboration
  • SOA Governance and Web Services
  • Social Web Services
  • Web Services for Grids
  • Web Services in Service-Oriented Environments
  • Web Services Life-Cycles
  • Web Services Security and Privacy

It should be noted that papers on existing product descriptions or product marketing information are not within the scope of the ECOWS 2011 Industrial Track.

PhD Symposium

The ECOWS 2011 PhD Symposium is an international forum for PhD students working in any of the areas addressed by the ECOWS conference. The main aim of the Symposium is to give PhD students an opportunity to present their research activity and perspectives, to critically discuss them with other PhD students and with established researchers in the area, and to get fruitful feedback and advice on their research activity.

PhD students working in any area addressed by the ECOWS conference can submit a short report providing a clear statement of the problem they intend to address, motivating the interest and novelty of the underlying research challenges, demonstrating the ideas by examples, and describing the proposed research plan and expected results. Reports should not exceed 6 pages formatted according to the IEEE proceedings guidelines. The papers should be authored by the PhD student and indicate the name of her/his supervisor. Submissions must be sent by email to the symposium chair Wolf Zimmermann (wolf.zimmermann@informatik.uni-halle.de).

Submission Guidelines

Original papers, not submitted for publication elsewhere, can be submitted via EasyChair.

Submissions should be formatted according to the IEEE proceedings guidelines and the templates available at: http://www.ieee.org/conferences_events/conferences/publishing/templates.html and they should not exceed 8 pages.

A paper might be accepted as a full paper (8 pages), a short paper (4 pages), or a poster (2-page abstract in the proceedings).

The conference proceedings are expected to be published by the IEEE Computer Society Press as in previous years.

Workshop Proposals should be submitted following the instructions found at: http://ecows2011.inf.usi.ch/cfw

Important Dates

Deadlines for research papers:

  • Abstract submission: Monday, April 11, 2011
  • Papers due: Friday, April 15, 2011
  • Notifications: Wednesday, May 25, 2011
  • CR versions due: Tuesday, July 1, 2011
  • ECOWS 2011 conference: September 14-16, 2011

Deadlines for industrial papers:

  • Industrial Papers: Saturday, April 30, 2011
  • Notifications: Wednesday, June 1, 2011
  • Camera Ready version due: Tuesday, July 1, 2011
  • ECOWS 2011 conference: September 14-16, 2011

Deadlines for PhD Symposium papers:

  • Paper submission: Wednesday, June 15, 2011
  • Notification of acceptance: Friday, July 15, 2011
  • Camera Ready: Sunday, July 31, 2011
  • ECOWS 2011 conference: September 14-16, 2011

Deadlines for workshop proposals:

  • Workshop proposal submission: Friday, April 1, 2011
  • Notification of acceptance: Friday, April 15, 2011

Organization

General Chair

Program Chairs

Industry Chair

Workshop Chair

  • Flavio de Paoli, University of Milano-Bicocca, Italy

PhD Symposium Chair

  • Wolf Zimmermann, Universität Halle, Germany

ECOWS 2011 Program Committee

  • Achilleas Achilleos, University of Cyprus, Cyprus
  • Marco Aiello, University of Groningen, the Netherlands
  • Farhad Arbab, CWI, The Netherlands
  • Luciano Baresi, Politecnico di Milano, Italy
  • Sami Bhiri, National University of Ireland, Ireland
  • Walter Binder, University of Lugano, Switzerland
  • Antonio Brogi, University of Pisa, Italy
  • Christoph Bussler, Xtime, Inc., USA
  • Fabio Casati, University of Trento, Italy
  • Siobhán Clarke, Trinity College Dublin, Ireland
  • Juergen Dunkel, FH Hannover, Germany
  • Schahram Dustdar, TU Wien, Austria
  • Rik Eshuis, Eindhoven University of Technology, The Netherlands
  • David Eyers, University of Otago, New Zealand
  • Christofer Giblin, IBM Zurich Research Lab, Switzerland
  • Claude Godart, University of Lorraine, France
  • Paul Grefen, Eindhoven University of Technology, The Netherlands
  • Thomas Gschwind, IBM Zurich Research Lab, Switzerland
  • Reiko Heckel, University of Leicester, UK
  • Martin Henkel, Stockholm University, Sweden
  • Birgitta Koenig-Ries, Universität Jena, Germany
  • Ernoe Kovacs, NEC Europe Network Labs, Germany
  • Akhil Kumar, Pennsylvania State University, USA
  • Peep Küngas, University of Tartu, Estonia
  • Frank Leymann, Universität Stuttgart, Germany
  • Welf Löwe, Växjö Universitet, Sweden
  • Heiko Ludwig, IBM TJ Watson Research Center, USA
  • Radu Mateescu, INRIA, France
  • Ingo Melzer, DaimlerChrysler Research, Germany
  • Dirk Neumann, Universität Freiburg, Germany
  • Roy Oberhauser, Aalen University, Germany
  • Guadalupe Ortiz, University of Cádiz, Spain
  • Claus Pahl, Dublin City University, Ireland
  • George Angelos Papadopoulos, University of Cyprus, Cyprus
  • Ernesto Pimentel, University of Malaga, Spain
  • Manfred Reichert, University of Ulm, Germany
  • Wolfgang Reisig, Humboldt-Universität zu Berlin, Germany
  • Heiko Schuldt, University of Basel, Switzerland
  • Marten van Sinderen, University of Twente, the Netherlands
  • Rainer Unland, University of Duisburg-Essen, Germany
  • Jim Webber, Neo Technology, UK
  • Erik Wilde, UC Berkeley, USA
  • Ümit Yalçınalp, Adobe Systems, USA
  • Olaf Zimmermann, IBM Zurich Research Lab, Switzerland
  • Wolf Zimmermann, Universität Halle, Germany

ECOWS Steering Committee

  • Antonio Brogi, University of Pisa, Italy
  • Siobhán Clarke, Trinity College Dublin, Ireland
  • Rik Eshuis, Eindhoven Univ. of Technology, The Netherlands
  • Paul Grefen, Eindhoven Univ. of Technology, The Netherlands
  • Claus Pahl, Dublin City University, Ireland
  • George Angelos Papadopoulos, University of Cyprus, Cyprus
  • Liang-Jie Zhang, IBM Research, USA

Posted in Call for Papers, REST, Web Services

On Sharding Graph Databases

One of the substantial benefits of the family of NOSQL (Not only SQL) databases is their tendency to be able to scale predictably. One of the key elements to being able to scale NOSQL databases is sharding, where a large dataset can be broken up and distributed across a number of (typically replicated) shards. For simple data models like Key-Value, Column-family, and Document databases this is a sensible solution. In such stores, the key to retrieve any item stored is well known and stable, while the lookup mechanism is both predictable (e.g. consistent hashing) and rapid (e.g. using Bloom filters to statistically locate a hot replica). Given those conditions, it’s straightforward for a client looking to store or retrieve data to be directed to an appropriate shard.
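
As a sketch of why that lookup is so predictable, here is a minimal consistent-hash ring in Java (illustrative only: one hash point per shard and no replication, where a real store would use many virtual points and replica sets):

```java
import java.util.SortedMap;
import java.util.TreeMap;

public class ConsistentHashRing {
    // Shards and keys are hashed onto the same ring; a key belongs to the first
    // shard at or after its position, wrapping around at the end of the ring.
    private final SortedMap<Integer, String> ring = new TreeMap<Integer, String>();

    public void addShard(String shardName) {
        ring.put(hash(shardName), shardName);
    }

    public String shardFor(String key) {
        SortedMap<Integer, String> tail = ring.tailMap(hash(key));
        return tail.isEmpty() ? ring.get(ring.firstKey()) : tail.get(tail.firstKey());
    }

    private int hash(String value) {
        return value.hashCode() & 0x7fffffff;   // simple non-negative hash for the sketch
    }
}
```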

Unfortunately for graph databases like Neo4j, sharding isn’t straightforward at all, leaving most solutions with a bit of a scaling headache. Unlike other classes of NOSQL database, a graph doesn’t have predictable lookup since it is a highly mutable structure. Worse, there are competing requirements when sharding a graph across physical Neo4j instances. For example, we want to co-locate related nodes for traversal performance, but we don’t want to place so many connected nodes on the same database that it becomes heavily loaded.

[Figure: too many connected nodes co-located on one shard make that database instance a heavily loaded hotspot]

The opposite is also true: we don’t want widely connected nodes spread across different database instances, since traversals that cross the (relatively high-latency) network incur a substantial performance penalty at runtime.

[Figure: connected nodes spread across database instances force traversals over the network]

Given that a naive solution will lead to hotspots causing contention for local resources, or incur many network hops as the graph is traversed, substantially increasing query time, we need to look for a better approach. The holy grail of graph algorithms is to balance a graph across database instances by finding a minimal cut, placing nodes such that few relationships span shards.

[Figure: a balanced partitioning of the graph in which few relationships span shards]

The trouble with this picture is that it’s hard to achieve in practice, especially since connected graphs can mutate rapidly and unpredictably at runtime. We can help to maintain a balanced graph by applying domain-specific knowledge to place nodes on shards; we can use insert-time algorithms to help us select the most appropriate shard to place a node; and we can use re-balancing algorithms periodically (sort of like a graph defrag) to help maintain a good node-to-shard mapping at runtime. We can also keep heuristic information on which relationships are traversed most frequently and optimise around those. However, as you might guess, much of this is an ongoing research problem and our community and customers have pressing needs to scale even while diligent computer scientists and mathematicians are working towards ever better graph algorithms. So, in the meantime, the Neo4j team has observed a pattern that we’re calling “cache sharding” to help solve the problem of efficiently querying large datasets with graph databases, which I’ll cover in my next posting.

Posted in neo4j, NOSQL

Upcoming Talks

QCon London is almost upon us, where Ian Robinson and I will be giving our REST in Practice tutorial on the 7th March, where we’ll cover RESTful systems from the ground up. Previous response to the tutorial has been really positive, and we’re looking forward to delivering another belter this time around. Later in the week I’ll be hosting the REST track at QCon, which has a fabulous line up, with plenty of brand spanking new material from each presenter. You heard it here first.

In May, Ian and I hit the road again. On Monday 9th May, we’ll be repeating our REST in Practice tutorial at the first GotoCON Copenhagen conference (a spinoff of the conference formerly known as JAOO). We’ll also be giving a brand new tutorial on Neo4j the following day, where we’ll mix up a little theory and plenty of programming tasks to give attendees a rich introduction to the fabulous Neo4j graph database.

Later that same week I’ll be giving a keynote on large RESTful systems at this year’s GeeCon in Krakow, Poland and also whipping up a code-centric Neo4j talk on the dev track. This’ll be my first trip to Poland and something I’m really looking forward to.

Hope to see you somewhere out there.

[There’s a separate feed for my conference stuff, and a natty little map too]

Posted in Talks

WS-REST 2011 Call for Papers

March 28, 2011 – Hyderabad, India

http://ws-rest.org/2011/

The Second International Workshop on RESTful Design (WS-REST 2011) aims to provide a forum for discussion and dissemination of research on the emerging resource-oriented style of Web service design.

Background

Over the past few years, several discussions between advocates of the two major architectural styles for designing and implementing Web services (the RPC/ESB-oriented approach and the resource-oriented approach) have been mainly held outside of the research and academic community, within dedicated mailing lists, forums and practitioner communities. The RESTful approach to Web services has also received a significant amount of attention from industry as indicated by the numerous technical books being published on the topic.

This second edition of WS-REST, co-located with the WWW2011 conference, aims at providing an academic forum for discussing current emerging research topics centered around the application of REST, as well as advanced application scenarios for building large scale distributed systems.

In addition to presentations on novel applications of RESTful Web services technologies, the workshop program will also include discussions on the limits of the applicability of the REST architectural style, as well as recent advances in research aimed at tackling new problems that may require extending the basic REST architectural style. The organizers are seeking novel and original, high-quality paper submissions on research contributions focusing on the following topics:

  • Applications of the REST architectural style to novel domains
  • Design Patterns and Anti-Patterns for RESTful services
  • RESTful service composition
  • Inverted REST (REST for push events)
  • Integration of Pub/Sub with REST
  • Performance and QoS Evaluations of RESTful services
  • REST compliant transaction models
  • Mashups
  • Frameworks and toolkits for RESTful service implementations
  • Frameworks and toolkits for RESTful service consumption
  • Modeling RESTful services
  • Resource Design and Granularity
  • Evolution of RESTful services
  • Versioning and Extension of REST APIs
  • HTTP extensions and replacements
  • REST compliant protocols beyond HTTP
  • Multi-Protocol REST (REST architectures across protocols)

All workshop papers are peer-reviewed and accepted papers will be published as part of the ACM Digital Library. Two kinds of contributions are sought: short position papers (not to exceed 4 pages in ACM style format) describing particular challenges or experiences relevant to the scope of the workshop, and full research papers (not to exceed 8 pages in the ACM style format) describing novel solutions to relevant problems. Technology demonstrations are particularly welcome, and we encourage authors to focus on “lessons learned” rather than describing an implementation.

Papers must be submitted electronically in PDF format.

Easychair page: http://www.easychair.org/conferences/?conf=wsrest2011

Important Dates

  • Submission deadline: January 31, 2011, 23.59 local time in San Francisco, CA
  • Notification of acceptance: February 15, 2011
  • Camera-ready versions of accepted papers: February 28, 2011
  • WS-REST 2011 Workshop: March 28, 2011

Program Committee Chairs

  • Cesare Pautasso, Faculty of Informatics, USI Lugano, Switzerland
  • Erik Wilde, School of Information, UC Berkeley, USA
  • Rosa Alarcon, Computer Science Department, Pontificia Universidad Católica de Chile, Chile

Program Committee

  • Jan Algermissen, Nord Software Consulting, Germany
  • Subbu Allamaraju, Yahoo Inc., USA
  • Mike Amundsen, USA
  • Benjamin Carlyle, Australia
  • Stuart Charlton, Elastra, USA
  • Duncan Cragg, Thoughtworks, UK
  • Joe Gregorio, Google, USA
  • Michael Hausenblas, DERI, Ireland
  • Ralph Johnson, University of Illinois, USA
  • Rohit Khare, 4K Associates, USA
  • Yves Lafon, W3C, USA
  • Frank Leymann, University of Stuttgart, Germany
  • Ian Robinson, Thoughtworks, UK
  • Stefan Tilkov, innoQ, Germany
  • Steve Vinoski, Verivue, USA
  • Jim Webber, Neo Technology, UK
  • Olaf Zimmermann, IBM Zurich Research Lab, Switzerland

Contact

WS-REST Web site: http://ws-rest.org/2011/
WS-REST Twitter: http://twitter.com/wsrest2011
WS-REST Email: chairs@ws-rest.org

Posted in Call for Papers, REST, Web Services