Saturday, April 26, 2008

Hypertable - A Massively Parallel Database System

Now everyone can have their own database system that scales to thousands of processors, as we heard at the April meeting of the SDForum Software Architecture and Modeling SIG. Doug Judd of Zvents, the Hypertable lead developer, spoke on "Architecting Hypertable - a massively parallel high performance database".

Hypertable is an Open Source database system designed to deal with the massive scale of data found in web applications, such as processing the data returned by web crawlers as they crawl the entire internet. It is also designed to run on the massive commodity computer farms, consisting of thousands of systems, that are employed to process such data. In particular, Hypertable is designed so that its performance scales with the number of computers used and so that it copes with the unreliability problems that inevitably ensue from using large computer arrays.

From a user perspective, the data model has a database that contains tables. Each table consists of a set of rows. Each row has a primary key value and a set of columns. Each column contains a set of key-value pairs, commonly known as a map. A timestamp is associated with each key-value pair. The number of columns in a table is limited to 256; otherwise there are no tight constraints on the size of keys or values. The only query method is a table scan. Tables are stored in primary key order, so a query can easily access a row or group of rows by constraining on the row key. The query can also specify which columns are returned and the time range for key-value pairs in each column.
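To make the model concrete, here is a minimal sketch in Python of how such a table might be represented and scanned. The layout, names, and example data are illustrative assumptions for this post, not Hypertable's actual API or storage format.

    # Illustrative sketch only: a table maps each row key to its columns,
    # and each column is a map from column key to a (timestamp, value) pair.
    table = {
        "com.example/index.html": {                                  # row key
            "content": {"": (1209168000, "<html>...")},              # column maps
            "anchor": {"com.other/a.html": (1209168000, "Example")},
        },
    }

    def scan(table, start_row, end_row, columns=None, time_range=None):
        """Scan rows in primary key order, optionally restricted to a set of
        columns and to a (start, end) timestamp range, as described above."""
        for row_key in sorted(table):          # tables are kept in row key order
            if not (start_row <= row_key < end_row):
                continue
            for column, cell_map in table[row_key].items():
                if columns is not None and column not in columns:
                    continue
                for column_key, (ts, value) in cell_map.items():
                    if time_range is None or time_range[0] <= ts < time_range[1]:
                        yield row_key, column, column_key, ts, value

    # Fetch the anchor column for rows whose key starts with "com.example/".
    for cell in scan(table, "com.example/", "com.example0", columns={"anchor"}):
        print(cell)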

The basic unit for inserting data is the key-value pair, along with its row key and column. An insert creates a new row if none exists with that row key. More likely, an insert adds a new key-value pair to an existing column map, or supersedes the existing value if the column key is already present in the map.
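Continuing the sketch above, an insert might look like the following. Again, the function and its behavior are assumptions drawn from the description here, not Hypertable's real interface.

    import time

    def insert(table, row_key, column, column_key, value, timestamp=None):
        # Create the row and its column map on first use; supersede any value
        # already stored under the same column key, as described above.
        ts = timestamp if timestamp is not None else int(time.time())
        cell_map = table.setdefault(row_key, {}).setdefault(column, {})
        cell_map[column_key] = (ts, value)

    insert(table, "com.example/index.html", "anchor", "com.third/b.html", "Example")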

As Doug explained, Hypertable is neither relational nor transactional. Its purpose is to store vast amounts of structured data and make that data easily available. For example, while Hypertable does have logging to ensure that information does not get lost, it does not support transactions, whose purpose is to make sure that multiple related changes either all happen together or none of them happen. Interestingly, many database systems switch off transactional behavior for large bulk loads anyway. There is also no mechanism for combining data from different tables, as tables are expected to be so large that there is little point in trying to join them.

The current status is that Hypertable is in alpha release. The code is there and works, as Doug showed us in a demonstration. However, it stores its data in a distributed file system such as HDFS, the Hadoop Distributed File System, and besides finishing their own development they are waiting for Hadoop to implement a consistency feature before they declare beta. Even then there are a number of places with a single point of failure, so there is plenty of work left to make it a complete and resilient system.

Hypertable is closely modeled on Google's Bigtable. Several times during the presentation, when asked about a feature, Doug explained it as something that Bigtable does. At one point he even went so far as to say, "if it is good enough for Google, then it is good enough for us".
