Saturday, October 25, 2008

The Google Database System

Google, Yahoo and others are taking the traditional database system and breaking it into pieces. Google has their own set of proprietary database system components. Yahoo is working with the Hadoop Open Source project to make their system available to everyone. I came to this conclusion while doing research for my talk on "Models and Patterns for Concurrency" for the SDForum SAM SIG.

In this post I will talk about what is happening, why it is happening and at the end try to draw some conclusions about future directions for database systems. Note, I am using Google as an example here because their system is described in a set of widely accessible academic papers. Yahoo and many other large scale web sites have adopted a similar approach through use of Hadoop and other Open Source projects.

A traditional database system is a server. It takes care of persistent data storage, metadata management for the stored data, transaction management and querying the data, which includes aggregation. Google has developed its own set of applications that support these same functions, except that instead of being wrapped into a single entity, a database server, they have been developed as a set of application programs that build on one another.

There are several reasons for Google developing their own database system. Firstly, they are dealing with managing and processing huge amounts of data. Conventional database systems struggle when the data gets really large. In particular, the transaction model that underlies database operation starts to break down. This topic is worth a separate post of its own.

Secondly, their computing system is a distributed system built from thousands of commodity computer systems. Conventional database systems are not designed or tuned to run on this type of hardware. One issue is that at this scale the hardware cannot be assumed to be reliable, so the database system has to be designed to work around the unreliable hardware. A final issue is that the cost of software licenses for running a conventional database system on the Google hardware would be prohibitive.

The Google internal applications look like this. At the bottom is Chubby, a lock service that also provides reliable storage for small amounts of data. The Google File System provides file data storage on a distributed system of thousands of computers with local disks. It uses Chubby to ensure that there is a single master in the face of system failures. Bigtable is a system for storing and accessing structured data. While it is not exactly relational, it is comparable to storing data in a single large relational database table. Bigtable stores its data in the Google File System and uses Chubby for several purposes, including metadata storage and ensuring the atomicity of certain operations.
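As a rough sketch (this is an illustration of the data model described in the Bigtable paper, not Google's actual API), Bigtable can be pictured as a sparse, sorted map from a (row key, column key, timestamp) triple to an uninterpreted value:

```python
# Toy sketch of Bigtable's data model: a sparse map from
# (row, column, timestamp) to a value. Illustrative only.

table = {}

def put(row, column, value, timestamp):
    table[(row, column, timestamp)] = value

def get(row, column):
    """Return the most recent value stored for a (row, column) cell."""
    versions = [(ts, v) for (r, c, ts), v in table.items()
                if r == row and c == column]
    return max(versions)[1] if versions else None

# Rows are keyed by reversed domain name in the paper's examples.
put("com.cnn.www", "contents:", "<html>v1</html>", 1)
put("com.cnn.www", "contents:", "<html>v2</html>", 2)
print(get("com.cnn.www", "contents:"))  # the latest timestamped version wins
```

The real system adds column families, access control and distribution across tablet servers, but the single-big-table flavor of the data model comes through even in this toy form.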

Finally, Map Reduce is a generalized aggregation engine (and here I mean aggregation in the technical database sense). It uses the Google File System. Map Reduce is surprisingly closely related to database aggregation as found in the SQL language, although it is not usually described in that way. I will discuss this in another post. In the meantime, it is interesting to note that Map Reduce has been subject to a rather intemperate attack by database luminaries David DeWitt and Michael Stonebraker.
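To see the analogy, here is a minimal word count in map/reduce style, which computes the same answer as "SELECT word, COUNT(*) FROM words GROUP BY word". This is a single-machine sketch of the programming model, not the distributed system itself:

```python
# Word count as map, shuffle and reduce. The sort stands in for
# the "shuffle" that groups intermediate pairs by key.
from itertools import groupby
from operator import itemgetter

def map_phase(documents):
    for doc in documents:
        for word in doc.split():
            yield (word, 1)

def reduce_phase(pairs):
    pairs = sorted(pairs, key=itemgetter(0))  # shuffle: group by key
    for word, group in groupby(pairs, key=itemgetter(0)):
        yield (word, sum(count for _, count in group))

docs = ["the quick fox", "the lazy dog", "the fox"]
print(dict(reduce_phase(map_phase(docs))))
# {'dog': 1, 'fox': 2, 'lazy': 1, 'quick': 1, 'the': 3}
```

The map function plays the role of the SELECT list and WHERE clause, the shuffle is the GROUP BY, and the reduce function is the aggregate.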

In total, these four applications: Chubby, Google File System, Bigtable and Map Reduce provide the capabilities of a database system. In practice there are some differences. The user writes programs in a language like C++ that integrate the capabilities of these components as needed. They do not need to use all the components. For example, Google can calculate the page rank for each page on the web as a series of six Map Reduce steps, none of which necessarily uses Bigtable.
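The details of Google's page rank pipeline are not public, but one iteration of the underlying calculation fits the map/reduce shape naturally. In this hypothetical sketch, the map step has each page distribute its rank over its outgoing links, and the reduce step sums the contributions arriving at each page:

```python
# Toy sketch of one PageRank iteration in map/reduce style.
# The link graph and damping factor here are illustrative only.
DAMPING = 0.85

def pagerank_step(links, ranks):
    # Map: each page emits a share of its rank to each outgoing link.
    contributions = {}
    for page, outlinks in links.items():
        share = ranks[page] / len(outlinks)
        for target in outlinks:
            contributions.setdefault(target, []).append(share)
    # Reduce: sum the shares arriving at each page, apply damping.
    return {page: (1 - DAMPING) + DAMPING * sum(contributions.get(page, []))
            for page in links}

links = {"a": ["b", "c"], "b": ["c"], "c": ["a"]}
ranks = {page: 1.0 for page in links}
for _ in range(6):
    ranks = pagerank_step(links, ranks)
```

A real run chains such steps over the whole web graph, reading and writing files in the Google File System between steps, with no Bigtable required.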

The concept of a Database System was invented in the late 1960s by the CODASYL committee, shortly after their achievement of inventing the COBOL programming language. The Relational model and Transactions came later; however, the concept of a server system that owns and manages data, and much of the terminology, originated with CODASYL. Since then, the world has changed.

Nowadays databases are often hidden behind frameworks such as Hibernate or Ruby on Rails that try to paper over the impedance mismatch between the database model on one side and an object oriented world looking for persistence on the other. These are mostly low end systems. At the other end of the scale are the huge data management problems of Google, Yahoo and other web sites. New companies with new visions of database systems to meet these challenges are emerging. It is an exciting time.

Saturday, October 11, 2008

It's Called Risk, Have You Heard Of It?

Senator Phil Gramm famously called us a "nation of whiners" and he may be right. (Note, while I try to keep this blog about technology, the financial system seems to be so badly broken it is worthy of a comment or two.) I recently ran across a blog post on a financial site called "Our Timid Government Is Killing Us" by Michael Kao, CEO of Akanthos Capital Management. In it he complains about four things that the Government has not done to help resolve the financial crisis. I want to concentrate on one of them here. "Problem No. 2: Lehman's bankruptcy has severely eroded confidence between counterparties."

The problem is this. Over the last few weeks, financial institutions have become unwilling to trust one another, and with good cause. The issue is Credit Default Swaps. This is a 60 Trillion dollar market (that is Trillion with a capital T) where financial institutions like banks and hedge funds (the parties) buy and sell insurance policies on bonds. The "This American Life" radio program and Podcast has a very good and understandable explanation of the market and how it came to be.

There are two important things to understand about the credit default swap market. Firstly, it is completely unregulated. Senator Phil Gramm tacked a clause to keep the market unregulated onto an appropriations bill in 2000 that was approved by 95 votes to 0 in the Senate. Secondly, the market is not transparent; that is, the various parties to the market do not know what any other player's position is. Note that these two features are the way the parties in the market wanted it. There has been a great outcry for reducing financial regulations in the last few years.

Lack of transparency was not a problem until Lehman Brothers went bankrupt. They were a big player in the credit default swap market. Now all their credit default swap insurance policies are frozen by their bankruptcy. Anyone who sold a credit default swap policy and then laid off the bet by buying an equivalent credit default swap from Lehman Brothers is now on the hook to pay off the insurance policy on the bond without the compensation of being able to get Lehman Brothers to make good on their policy.

The lack of transparency means that nobody knows for sure about anyone else in the market. That is, anyone could go bankrupt tomorrow because they have bought credit default swaps from a bankrupt company like Lehman Brothers and so cannot make good on the credit default swaps that they have sold. Already AIG has needed a huge injection of government money to stay afloat, and others may be suffering as well. But no one knows what positions anyone else holds. So everyone is conserving their cash, not lending it out to anyone else, so that they will not lose it if the other party goes bankrupt. Thus are the credit markets constipated.

A final problem is that, because the market is unregulated, there are no capital requirements to back up a bet. I can sell a credit default swap insurance policy based on my good name. I immediately get a large sum of money, which I can register as a profit. It is only later that I have to worry about the bond that I have insured defaulting (what are the chances of that?). This is how the market got to be 60 Trillion dollars in size.
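A back-of-the-envelope calculation shows why this looks like free money. All the numbers here are hypothetical, chosen only to illustrate the shape of the trade:

```python
# Hypothetical economics of selling one credit default swap
# with no capital requirement behind it. Numbers are made up.
notional = 10_000_000        # face value of the bond being insured
annual_premium_rate = 0.02   # premium: 2% of notional per year
years = 5                    # term of the swap

premium_income = notional * annual_premium_rate * years
required_capital = 0         # unregulated: no reserves demanded

# Booked as profit today; the full notional remains a contingent
# liability if the insured bond defaults before the term is up.
uncovered_exposure = notional - premium_income
print(premium_income)        # income booked up front
print(uncovered_exposure)    # what the seller owes on a default
```

With profit up front and no reserves required, the only brake on writing more policies is the seller's own appetite for risk, which is how the notional value ballooned to 60 Trillion dollars.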

The underlying issue is this. There is a large risk in trading in unregulated markets. The risk is made larger if the market is not transparent, because if one of the parties to the market goes bust nobody knows what their position is worth. These risks were not recognized in the credit default swap market, and policies were sold at far too low a price to cover them. If the market were regulated, like other markets are, these risks would not be there and the market could deal with usual events like the bankruptcy of a player.

Finally, there is a risk to the nation in allowing an unregulated market to balloon to the size that the credit default swap market has. Bankruptcies happen all the time. The fact that the bankruptcy of one player has caused the entire financial marketplace to go into a swoon is bad for the nation. The players in the credit default swap market asked for an unregulated market and they got what they asked for. Now the risk of having an unregulated market has shown itself, and as Senator Gramm tells them, they should deal with it and stop whining.

I am not an apologist; I am a technologist who is interested in how things work.

Thursday, October 02, 2008

Articulate UML Modeling

Last week, Leon Starr spoke to the SDForum SAM SIG on "Articulate UML Modeling". Leon is an avid modeler and has been using UML for modeling software systems since it was first defined. He believes in building executable models and I applaud him for that. The very act of making something executable ensures that it is in some sense complete and free from many definitional errors. Executing the model allows it to be tested.

There are several advantages to building models rather than programs. A big part of many projects is extracting requirements. Unlike a program, a model can describe requirements in a way that a non-technical user can understand and appreciate, so the user can provide feedback. Another advantage of a model is that it does not arbitrarily constrain the order in which things are done. Essentially, a model is asynchronous and captures opportunities for concurrency in its implementation. This struck a chord with me as I am going to speak to the SAM SIG in October on "Models and Patterns for Concurrency".

The other part of the talk that interested me was Leon's attack on building models as controllers. For example, he described a laser cutting machine. A common way of modeling this is as a laser cutter controller that interprets patterns. He prefers to see it modeled as patterns that describe how they are cut by the laser cutter. Leon's experience is with modeling software that manages physical systems, like the air traffic control example that he used to illustrate his talk. His approach is certainly useful for the understanding and analysis of physical systems; however, I have seen the problem argued both ways. It is worth a separate post on the issue.