Tuesday, June 02, 2009

Databases in the Cloud

Last week was a busy week, with Databases in the Cloud on Tuesday followed by Hadoop and MapReduce with Cascading on Wednesday. Both were must-attend SDForum SIG meetings for anyone who wants to keep up with new approaches to database and analytics systems. The two meetings had very different characters. MapReduce with Cascading was a technical presentation that required concentration to follow but contained some real nuggets of information. The Cloud Services SIG meeting on Tuesday, "Demo Night: Databases in the Cloud", was more accessible. This post is about Databases in the Cloud.

Roger Magoulas of O'Reilly Research started the meeting by discussing big data and O'Reilly's experience with it. A useful definition of "Big Data" is that when the size of the data becomes a problem, you have Big Data. O'Reilly has about 6 TBytes of data in their Job database, which is more than a billion rows. The data comes from the web and it is messy. They use Greenplum, a scalable MPP database system that is suitable for cloud computing and has built-in MapReduce. Like many people doing analytics, they are not really sure what they are going to do with the data, so they want to keep things as flexible as possible with flexible schemas. Roger and the O'Reilly team believe that 'making sense of "Big Data" is a core competency of the Information Age'. On the technology side, Big Data needs MPP (massively parallel processing) and compression. Map-Reduce handles big data with flexible schemas and is resilient by design.
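To make the point about Map-Reduce and flexible schemas concrete, here is a minimal single-process sketch in Python. It is not Greenplum's or Hadoop's API. The function names and the sample job postings are made up for illustration, and a real framework would spread the map and reduce phases across many machines.

```python
from collections import defaultdict


def map_phase(records, mapper):
    """Run the mapper over every record, yielding (key, value) pairs."""
    for record in records:
        yield from mapper(record)


def shuffle(pairs):
    """Group values by key, as a framework does between map and reduce."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def map_reduce(records, mapper, reducer):
    """Chain map, shuffle and reduce into one batch job."""
    groups = shuffle(map_phase(records, mapper))
    return {key: reducer(key, values) for key, values in groups.items()}


# Hypothetical messy job postings: the fields vary from record to record.
postings = [
    {"title": "DBA", "skills": ["sql", "oracle"]},
    {"title": "Analyst", "skills": ["sql"], "city": "Boston"},
    {"title": "Engineer"},           # no skills field at all
    {"skills": ["hadoop", "sql"]},   # no title field
]


def skill_mapper(record):
    # The mapper decides how to read each record, so no fixed schema is needed.
    for skill in record.get("skills", []):
        yield skill, 1


def count_reducer(key, values):
    return sum(values)


skill_counts = map_reduce(postings, skill_mapper, count_reducer)
```

Because the mapper interprets each record as it reads it, records with missing or extra fields do not need to be cleaned up or forced into a table definition before the job runs.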

After Roger came three demos. Ryan Barrett from Google showed us a Google App Engine application that uses the App Engine Datastore. Google App Engine is a service for building and testing cloud web applications that is free for small applications and paid when the application scales. The Datastore is built on BigTable, a sharded stateless tuple store for big data (see my previous posts on the Google Database System and Hypertable, a clone of BigTable). Like every other usable system, it has its own high-level language, called GQL (Google Query Language), whose statements start with the verb SELECT. To show that they are serious about supporting cloud applications, Google also provides bulk upload and download.
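As a rough illustration of what a sharded tuple store means, here is a toy in-memory sketch in Python. It assumes range partitioning on the row key, which is how the published BigTable design splits a table into tablets. The class name, the split keys and the sample rows are invented for the example, and none of this is Google's API.

```python
import bisect


class RangeShardedStore:
    """Toy key-to-row store that assigns rows to shards by key range."""

    def __init__(self, split_keys):
        # N sorted boundary keys give N + 1 shards.
        self.split_keys = sorted(split_keys)
        self.shards = [{} for _ in range(len(self.split_keys) + 1)]

    def shard_index(self, key):
        # Keys below the first boundary go to shard 0, and so on.
        return bisect.bisect_right(self.split_keys, key)

    def put(self, key, row):
        self.shards[self.shard_index(key)][key] = row

    def get(self, key):
        return self.shards[self.shard_index(key)].get(key)


store = RangeShardedStore(["h", "p"])
store.put("apple", {"color": "red"})
store.put("kiwi", {"color": "green", "origin": "NZ"})  # rows need not share fields
store.put("zucchini", {})
```

Each request names a key, so any front end can work out which shard holds the row without keeping per-client state. In a real system the shards would live on different servers and would be split again as they grow.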

Cloudera is a software startup that provides training and support for the Open Source Hadoop MapReduce project. Christophe Bisciglia from Cloudera gave an interesting analogy. First he compared the performance of a Ferrari and a freight train. A Ferrari has fast acceleration and a higher top speed but can only carry a light load. A freight train accelerates slowly and has a lower top speed, but it can carry a huge load. Then he told us that a database system is like the Ferrari, while Map-Reduce is like the freight train. Map-Reduce does batch processing and can handle huge amounts of data, but it is certainly not fast and agile like a database system, which can give answers in real time.

Finally, George Kong showed us the Aster Data Systems MPP database system with a Map-Reduce engine. They divide their database servers into three groups: the Queen, which manages everything; Worker hosts, which handle queries; and Loader hosts, which handle loading. This is a standard database system that works with standard tools such as Informatica, Business Objects, MicroStrategy and Pentaho. It is also capable of running in the elastic cloud. For example, one of their customers is ShareThis, which keeps a 10 TByte Aster Data Systems database in the cloud. This database uses MicroStrategy and Pentaho for reporting.

1 comment:

Aster said...

Good summary!

Anyone interested in how Aster integrates MapReduce in the SQL-database can check out example code and applications here: http://www.asterdata.com/mapreduce