Saturday, April 26, 2008

Hypertable - A Massively Parallel Database System

Now everyone can have their own database system that scales to thousands of processors, as we heard at the April meeting of the SDForum Software Architecture and Modeling SIG. Doug Judd of Zvents, the Hypertable lead developer, spoke on "Architecting Hypertable: a massively parallel high performance database".

Hypertable is an Open Source database system designed to handle the massive scale of data found in web applications, such as processing the data returned by web crawlers as they crawl the entire internet. It is also designed to run on the massive commodity computer farms, which can consist of thousands of systems, that are employed to process such data. In particular, Hypertable is designed so that its performance scales with the number of computers used and so that it handles the unreliability problems that inevitably arise in large computer arrays.

From a user perspective, the data model has a database that contains tables. Each table consists of a set of rows. Each row has a primary key value and a set of columns. Each column contains a set of key-value pairs, commonly known as a map. A timestamp is associated with each key-value pair. The number of columns in a table is limited to 256; otherwise there are no tight constraints on the size of keys or values. The only query method is a table scan. Tables are stored in primary key order, so a query easily accesses a row or group of rows by constraining on the row key. The query can specify which columns are returned and the time range for key-value pairs in each column.
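To make the data model concrete, here is a minimal Python sketch of my own (an illustration of the model as Doug described it, not Hypertable's actual API): rows kept in primary-key order, each row holding per-column maps of timestamped key-value pairs, with a scan as the only query method.

```python
import bisect

# A toy model of one table: rows are kept in primary-key order, and each
# row maps a column name to a "column map" of column key -> list of
# (timestamp, value) pairs.
table = {}      # row_key -> {column -> {col_key -> [(ts, value), ...]}}
row_index = []  # sorted row keys, so range scans are cheap

def scan(start_row, end_row, columns=None, time_range=None):
    """The only query method: scan rows in [start_row, end_row), optionally
    restricting which columns are returned and the timestamp range of the
    key-value pairs in each column."""
    lo = bisect.bisect_left(row_index, start_row)
    hi = bisect.bisect_left(row_index, end_row)
    for row_key in row_index[lo:hi]:
        for column, colmap in table[row_key].items():
            if columns is not None and column not in columns:
                continue
            for col_key, versions in colmap.items():
                for ts, value in versions:
                    if time_range is None or time_range[0] <= ts < time_range[1]:
                        yield (row_key, column, col_key, ts, value)
```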

The basic unit for inserting data is the key-value pair, along with its row key and column. An insert creates a new row if none exists with that row key. More likely, an insert adds a new key-value pair to an existing column map or, if the column key already exists in that map, supersedes the existing value.
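Continuing the sketch above, the insert path might look like the following. Again, this is an assumed illustration of the semantics Doug described, not Hypertable's real client API.

```python
import bisect
import time

table, row_index = {}, []  # same structures as in the sketch above

def insert(row_key, column, col_key, value, ts=None):
    """Insert one key-value pair: create the row if none exists with this
    row key; otherwise add the pair to the column map, superseding any
    existing value for the same column key (older versions remain behind
    their older timestamps)."""
    if ts is None:
        ts = time.time()
    if row_key not in table:
        table[row_key] = {}
        bisect.insort(row_index, row_key)  # keep rows in primary-key order
    colmap = table[row_key].setdefault(column, {})
    colmap.setdefault(col_key, []).insert(0, (ts, value))  # newest first

# Example: store the anchor text of a link found by a web crawler.
insert("com.example.www", "anchor", "com.referrer.www", "Example Inc.")
```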

As Doug explained, Hypertable is neither relational nor transactional. Its purpose is to store vast amounts of structured data and make that data easily available. For example, while Hypertable does have logging to ensure that information does not get lost, it does not support transactions, whose purpose is to make sure that multiple related changes either all happen together or not at all. Interestingly, many database systems switch off transactional behavior for large bulk loads. There is no mechanism for combining data from different tables, as tables are expected to be so large that there is little point in trying to combine them.

The current status is that Hypertable is in alpha release. The code is there and works, as Doug showed us in a demonstration. However, it stores its data in a distributed file system such as Hadoop's HDFS, and besides their own remaining development work, they are waiting for Hadoop to implement a consistency feature before they declare beta. Even then, there are a number of places with a single point of failure, so there is plenty of work left to make it a complete and resilient system.

Hypertable is closely modeled on Google's Bigtable. Several times during the presentation, when asked about a feature, Doug explained it as something that Bigtable does. At one point he even went so far as to say, "if it is good enough for Google, then it is good enough for us".

Monday, April 21, 2008

SaaS, Cloud, Web 2.0... it’s time for Business Intelligence to evolve!

The most surprising phrase in Roman Bukary's presentation to the April meeting of the SDForum Business Intelligence SIG was "right time, not real time", and it was said more than once. Roman is Vice President of Marketing and Business Development at Truviso and his presentation entitled "SaaS, Cloud, Web 2.0... it’s time for Business Intelligence to Evolve!" brought a large audience to our new location at the SAP Labs on Hillview Avenue in Palo Alto.

Truviso provides software to continuously analyze huge volumes of data, enabling instant visibility, immediate action and more profitable decision making. In other words, their product is a streaming database system.

Over the years, the Business Intelligence SIG has heard about several streaming database systems. Truviso distinguishes itself in a number of ways. First, it builds on the open source PostgreSQL database system, so it is a real database system with real SQL. Other desirable characteristics include handling large volumes of data and large numbers of queries, and the ability to change queries on the fly. They also have a graphics front end that can draw good-looking charts. Roman showed us several Truviso applications, including stock and currency trading applications, which are both high volume and rapidly changing environments.
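To make the idea of a streaming query concrete, here is a minimal Python sketch of my own (not Truviso's SQL or API): a continuous query that keeps a windowed aggregate up to date as each event arrives, instead of re-running a query against stored data.

```python
from collections import deque

class SlidingWindowAvg:
    """A toy continuous query: the average price over the last
    window_seconds of a stream, updated incrementally per event."""

    def __init__(self, window_seconds):
        self.window = window_seconds
        self.events = deque()  # (timestamp, price), oldest first
        self.total = 0.0

    def on_event(self, ts, price):
        self.events.append((ts, price))
        self.total += price
        # Evict events that have fallen out of the time window.
        while self.events[0][0] <= ts - self.window:
            _, old_price = self.events.popleft()
            self.total -= old_price
        return self.total / len(self.events)  # always-current aggregate

# Example: a trickle of trade prices; every event yields a fresh answer.
avg = SlidingWindowAvg(window_seconds=60)
for ts, price in [(0, 10.0), (30, 12.0), (90, 11.0)]:
    print(ts, avg.on_event(ts, price))
```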

Then we come to the "right time, not real time" phrase. In the past I have associated this phrase with business intelligence systems that could not present data in a timely manner. Obviously, that is not a problem for streaming database systems, which process and aggregate data on the fly and always have the most up-to-date information.

I think that Roman was trying to go in the other direction. He was suggesting that Truviso is not only useful for high-pressure real-time applications like stock trading; it also has a place in other applications where time is less pressing but the volume of data is high and there is still a need for a real-time view of the current state. Such applications could include RFID, logistics, and inventory management.

Tuesday, April 08, 2008

Open Source 10 Years Later

April 7, 2008 marks 10 years since the landmark Freeware Summit that signaled the opening of the Open Source movement. By coincidence, I recently read the manifesto of the Open Source movement, "The Cathedral and The Bazaar" by Eric S. Raymond. The book, published in 1999 and revised in 2001, contains the namesake essay and several others, including "Revenge of the Hackers", which describes the events leading up to and following the Freeware Summit from an insider's point of view. The essay is valuable as a history of Open Source; however, its veracity is slightly marred by dating the summit meeting to March 7.

One thing that "Revenge of the Hackers" does not shy away from is explaining why Richard Stallman and the Free Software Foundation were not present at the Freeware Summit. In the past I have written on the distinction between Open Source and Free Software. Raymond is tactful but firm in explaining why creating a separation between these two ideas was essential to getting Open Source accepted by the mainstream.

On the other hand, the end of the essay, which looks into the future of Open Source, does suffer in hindsight. Open Source has advanced by leaps and bounds in the last 10 years. However, it is still not in the position of ruling the world, as "Revenge of the Hackers" suggests it might be. Let's give it at least another 10 years.

Thursday, April 03, 2008

An Evening with The Difference Engine

One day I write about Doron Swade and "The Cogwheel Brain". Three days later I get an invitation from the Computer History Museum. Doron Swade is coming to Silicon Valley with a Difference Engine!

The occasion is that another Difference Engine has been commissioned by Nathan Myhrvold, former CTO of Microsoft. It is being exhibited at the Computer History Museum in Mountain View, and to celebrate its arrival, there is an "Evening with Nathan Myhrvold and Doron Swade" at the museum. We have signed up for the event. Have you?