Saturday, October 25, 2008

The Google Database System

Google, Yahoo and others are taking the traditional database system and breaking it into pieces. Google has their own set of propriety database system components. Yahoo is working with the Hadoop Open Source project to make their system available to everyone. I came to this conclusion while doing research for my talk on "Models and Patterns for Concurrency" for the SDForum SAM SIG.

In this post I will talk about what is happening, why it is happening and at the end try to draw some conclusions about future directions for database systems. Note, I am using Google as an example here because their system is described in a set of widely accessible academic papers. Yahoo and many other large scale web sites have adopted a similar approach through use of Hadoop and other Open Source projects.

A traditional database system is a server. It takes care of persistent data storage, metadata management for the stored data, transaction management and querying the data, which includes aggregation. Google has developed its own set of applications which support these same functions, except instead of wrapping them into a single entity as a database server, they have been developed as a set of application programs that build on one another.

The are several reasons for Google developing their own database system. Firstly, they are dealing with managing and processing huge amounts of data. Conventional database systems struggle when the data gets really large. In particular the transaction model that underlies database operation starts to break down. This topic is worth a separate post of its own.

Secondly they their computing system is a distributed system built from thousands of commodity computer systems. Conventional database systems are not designed or tuned to run on this type of hardware. One issue is that at this scale the hardware cannot be assumed to be reliable and the database system has to be designed to work around the unreliable hardware. A final issue is that the cost of software licenses for running a conventional database system on the Google hardware would be prohibitive.

The Google internal application look like this. A the bottom is Chubby, a lock service that also provides reliable storage for small amounts of data. The Google File System provides file data storage in a distributed system of thousands of computers with local disks. It uses Chubby to ensure that there is a single master in the face of system failures. Bigtable is a system for storing and accessing structured data. While it is not exactly relational, it is comparable to storing data in a single large relational database table. Bigtable stores its data in the Google File System and uses Chubby for several purposes including metadata storage and to ensure atomicity of certain operations.

Finally Map Reduce is a generalized aggregation engine (and here I mean aggregation in the technical database sense). It uses the Google File System. Map Reduce is surprisingly closely related to database aggregation as found in the SQL language although it is not usually described in that way. I will discuss this in another post. In the mean time, it is interesting to note that Map Reduce has been subject to a rather intemperate attack by database luminaries David DeWitt and Michael Stonebreaker.

In total, these four applications: Chubby, Google File System, Bigtable and Map Reduce provide the capabilities of a database system. In practice there are some differences. The user writes program in a language like C++ that integrated the capabilities of these components as they need them. They do not need to use all the components. For example, Google can calculate the page rank for each page on the web as a series of 6 Map Reduce steps none of which necessarily uses Bigtable.

The concept of a Database System was invented in the late 60's by the CODASYL committee, shortly after their achievement of inventing the COBOL programming language. The Relational model and Transactions came later, however the concept of a server system that owns and manages data and much of the terminology originated with CODASYL. Since then, the world has changed.

Nowadays databases are often hidden behind frameworks such as Hibernate or Ruby on Rails that try to paper over the impedance mismatch between the database model on one side and an object oriented world looking for persistence on the other. These are mostly low end systems. At the other end of scale are the huge data management problems of Google, Yahoo and other web sites. New companies with new visions of database systems to meet these challenges are emerging. It is an exciting time.

No comments: