Wednesday, September 02, 2009

Project Voldemort

There were three interesting trends exposed in the talk about Project Voldemort at the August meeting of the SDForum SAM SIG. Firstly, Voldemort is another tuple store as opposed to a relational database, which is the trend that interested me the most. The second trend is the implementation of systems described in academic papers. The final trend is to use Open Source as a support mechanism for a large software project. Let's break down each of these trends one at a time. By the way, the presentation was given by Bhupesh Bansal and Jay Kreps of LinkedIn.

Relational databases have been the reliable store for serious computing for the last 20 years, but recently tuple stores and tuple processing systems like Map-Reduce have appeared and are starting to challenge the relational database hegemony. In the simplest terms, a tuple store is just a very degenerate relational database. Relations are based on the n-tuple, that is, each row in a table contains a number of data items, whereas a plain tuple is just two data items: a key and a value.
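To make the distinction concrete, here is a minimal sketch in Python. The names MemberRow and TupleStore, and the "member:42" key format, are made up for illustration and are not taken from Voldemort or from the talk. The row has a fixed set of columns, while the tuple store only knows about a key and an opaque value.

```python
from typing import Any, Dict, NamedTuple, Optional


class MemberRow(NamedTuple):
    """A relational row: an n-tuple with a fixed set of named columns."""
    member_id: int
    name: str
    city: str


class TupleStore:
    """A toy in-memory tuple store (hypothetical, for illustration only).

    Every entry is a plain two-item tuple: a key and a value. The store
    does not interpret the value, so its structure is up to the application.
    """

    def __init__(self) -> None:
        self._data: Dict[str, Any] = {}

    def put(self, key: str, value: Any) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[Any]:
        # Returns None when the key is absent.
        return self._data.get(key)


# The same member, once as a relational row and once as a key-value pair.
row = MemberRow(member_id=42, name="Ada", city="London")

store = TupleStore()
store.put("member:42", {"name": "Ada", "city": "London"})
```

The only operations the store offers are put and get by key. Anything a relational database would do with a query over columns has to be done by the application.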

As Jay Kreps explained, to get a web service application to scale, you need to distribute it over a cluster of computer systems, and to make this work with a relational database, you need to denormalize your database. The end point of database denormalization is the plain flat tuple store. Jay Kreps also complained that relational databases are not very good at handling data structures like the graphs of connections found in social networking applications, or semi-structured data like text.
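The path from normalized tables to a flat tuple store spread over a cluster can be sketched as follows. This is only an illustration under stated assumptions: the member data is invented, the helper names denormalize, node_for_key and lookup are hypothetical, and the simple hash-modulo placement is chosen for brevity and is not a claim about how Voldemort assigns keys to nodes.

```python
import hashlib
from typing import Dict, List, Optional, Tuple

# Normalized form: one table of members, one table of connections.
# (Invented sample data, for illustration only.)
members: Dict[int, str] = {1: "Ada", 2: "Grace", 3: "Linus"}
connections: List[Tuple[int, int]] = [(1, 2), (1, 3), (2, 3)]


def denormalize(members: Dict[int, str],
                connections: List[Tuple[int, int]]) -> Dict[str, dict]:
    """Fold both tables into one self-contained value per member key."""
    records = {
        f"member:{mid}": {"name": name, "connections": []}
        for mid, name in members.items()
    }
    for a, b in connections:
        # Connections are symmetric, so record each one on both members.
        records[f"member:{a}"]["connections"].append(b)
        records[f"member:{b}"]["connections"].append(a)
    return records


def node_for_key(key: str, num_nodes: int) -> int:
    """Pick a node by hashing the key (assumed scheme: hash modulo nodes)."""
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_nodes


NUM_NODES = 3
cluster: List[Dict[str, dict]] = [dict() for _ in range(NUM_NODES)]

# Each record lives on exactly one node, chosen from its key alone.
for key, value in denormalize(members, connections).items():
    cluster[node_for_key(key, NUM_NODES)][key] = value


def lookup(key: str) -> Optional[dict]:
    """A read touches only the one node that owns the key; no joins."""
    return cluster[node_for_key(key, NUM_NODES)].get(key)
```

Because each value carries everything needed to answer a request for its key, a lookup never has to join data held on different machines, which is what makes the flat form easy to distribute.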

In my opinion, tuple stores are no better or worse than relational databases at dealing with graphs between tuples. Tuple stores are more flexible for handling semi-structured data, but again this depends on the application (for more, read my comparison of Map-Reduce with relational databases). Tuple stores are certainly simpler, easier to use, more stable under load and cheaper than a relational database. I will write more about tuple stores at another time.

The second notable trend is for groups to pick up on systems described in academic papers and just implement them. Voldemort is an implementation of the Amazon Dynamo system as described in Amazon's paper at the ACM Symposium on Operating Systems Principles. We have seen several other examples of this recently. Google released a set of papers about its data processing systems, including Map-Reduce, which has led to a number of projects that emulate their functionality. I have written about Hadoop and Hypertable, two examples, and there are others. These are systems for doing very large scale analytic data processing, while Amazon Dynamo and Voldemort are systems for supporting rapid access to large volumes of data, such as is needed to support large and complex web sites.

The final trend is Open Source as a support model. Voldemort was developed by LinkedIn, a company whose main business is providing a social and business network on the web, not writing and supporting a lot of complicated software. LinkedIn decided that they needed a tuple store like Amazon Dynamo and, as they could not buy it, they built it. However, they decided they wanted help with support, so they released the software as an Open Source project. Now Voldemort is being used by several organizations, and at least half the people working on the code are from outside LinkedIn. When Sandeep Giri started the OpenI project, I asked him why he was releasing it as an Open Source project and he gave the same reason.
