Saturday, April 25, 2009

Mahout on Hadoop

No, this is not a tale of an elephant and his faithful driver; I am talking about Mahout, an Open Source project that is building a set of serious machine learning and analytic algorithms to run on Hadoop, the Open Source Map-Reduce platform. We learned about this at the April meeting of the SDForum Business Intelligence SIG, where Jeff Eastman spoke on "BI Over Petabytes: Meet Apache Mahout".

As Jeff explained, the Mahout project is a distributed group of about 10 committers who are working on implementing different types of analytics and machine learning algorithms. Jeff's interest is in clustering algorithms, which are used for various purposes in analytics. One use is to generate the "customers who bought X also bought Y" come-on that you see at an online retailer. Another use of clustering is to group customers into a small number of large clusters of similar behavior, to understand patterns and trends in customer purchasing.

Jeff showed us all the Mahout clustering algorithms, explaining what you need to provide to set up each algorithm and giving graphical examples of how they behaved on an example data set. He then went on to show how one algorithm was implemented on Hadoop. This implementation shows how flexible the Map-Reduce paradigm is. I showed a very simple example of Map-Reduce when I wrote about it last year, so that I could compare it to the same function implemented in SQL. Clustering using Map-Reduce is at the other end of the scale: a complicated big data algorithm that can also make effective use of the Map-Reduce platform.

Most clustering algorithms are iterative. From an initial guess at the clusters, each iteration moves data points between clusters to improve the fit. Jeff suggested that a typical application may use 10 iterations or so to converge to a reasonable result. In Mahout, each iteration is a Map-Reduce step. He showed us the top-level code for one clustering algorithm. Building on the Map-Reduce framework and the Mahout common libraries for data representation and manipulation, the clustering code itself is pretty straightforward.
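To make that concrete, here is a minimal single-machine sketch of one k-means-style iteration in Java. It is my own illustration of the shape of the loop, not Mahout's actual code: the assign step is what becomes the map phase, and the recompute step is what becomes the reduce phase.

```java
import java.util.*;

/** A minimal sketch of one k-means-style iteration over 1-D points.
 * In the Hadoop version, the assign step becomes the map phase and
 * the recompute step becomes the reduce phase. (My own illustration,
 * not Mahout's actual API.) */
public class KMeansSketch {
    public static void main(String[] args) {
        double[] points = {1.0, 1.5, 8.0, 9.0, 2.0, 8.5};
        double[] centers = {0.0, 10.0};          // initial guess

        for (int iter = 0; iter < 10; iter++) {  // ~10 iterations to converge
            // "Map": assign each point to its nearest center.
            double[] sum = new double[centers.length];
            int[] count = new int[centers.length];
            for (double p : points) {
                int nearest = 0;
                for (int c = 1; c < centers.length; c++) {
                    if (Math.abs(p - centers[c]) < Math.abs(p - centers[nearest])) {
                        nearest = c;
                    }
                }
                sum[nearest] += p;    // in Hadoop: emit (clusterId, point)
                count[nearest]++;
            }
            // "Reduce": recompute each center as the mean of its points.
            for (int c = 0; c < centers.length; c++) {
                if (count[c] > 0) centers[c] = sum[c] / count[c];
            }
        }
        System.out.println(Arrays.toString(centers)); // converges to ~[1.5, 8.5]
    }
}
```

Run ten of these iterations back to back and you have the structure Jeff described, with each pass launched as a Map-Reduce job over the full dataset.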

Up to now, it has not been practical to do sophisticated analytics like clustering on datasets that exceed a few megabytes, so the normal approach is to draw a small representative sample from the dataset and do the analytics on that sample. Mahout enables analytics on the whole dataset, provided that you have the computer cluster to do it.

Given that most analysts are used to working with samples, is there any need for Mahout-scale analytics? Jeff was asked this question when he gave the presentation at Yahoo, and he did not have a good answer then. Someone in the audience suggested that analytics on the long tail requires the whole dataset. After thinking about it, I believe processing the complete dataset is also needed for collaborative filtering like the "customers who bought X also bought Y" example given above.
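For the record, here is a toy Java sketch of the co-occurrence counting that sits behind "customers who bought X also bought Y"; the item names and structure are made up for illustration. The point is that a rare long-tail pairing may show up only a handful of times in the whole dataset, so a sample small enough for a conventional tool would probably not contain it at all.

```java
import java.util.*;

/** Toy sketch: count how often pairs of items are bought together.
 * Rare pairs may appear only a few times across billions of orders,
 * which is why sampling loses the long tail. (Illustrative only,
 * not Mahout code.) */
public class CoOccurrence {
    public static void main(String[] args) {
        List<List<String>> orders = List.of(
            List.of("X", "Y"), List.of("X", "Z"), List.of("X", "Y"));

        Map<String, Integer> pairCounts = new HashMap<>();
        for (List<String> order : orders) {
            // Count every pair of items that appears in the same order.
            for (int i = 0; i < order.size(); i++) {
                for (int j = i + 1; j < order.size(); j++) {
                    String key = order.get(i) + "+" + order.get(j);
                    pairCounts.merge(key, 1, Integer::sum);
                }
            }
        }
        System.out.println(pairCounts); // counts: X+Y=2, X+Z=1 (map order varies)
    }
}
```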

Note that at the BI SIG meeting, Suzanne Hoffman of Star Analytics also gave a short presentation on the Gartner BI Summit. I will write about that in another post.

Wednesday, April 08, 2009

Ruby versus Scala

An interesting spat has recently emerged in the long-running story of the programming language wars. The Twitter team, who had long been exclusively a Ruby on Rails house, came out with the "shock" revelation that they were converting part of their back-end code to use the Scala programming language. Several Ruby zealots immediately jumped in, saying that the Twitter crew obviously did not know what they were doing because they had decided to turn their backs on Ruby.

I must confess to being amused by the frothing at the mouth from the Ruby defenders, but rather than laughing, let's take a calm look at the arguments. The Twitter developers are still using Ruby on Rails for its intended purpose of running a web site. However, they are also developing back-end server software, and they have chosen the Scala programming language for that effort.

The Twitter crew offer a number of reasons for choosing a strongly typed language. Firstly, dynamic languages are not very good for implementing the kind of long-running processes that you find in a server. I have experience with writing servers in C, C++ and Java. In all these languages there are problems with memory leaks that cause the memory footprint to grow over the days, weeks or months that the server is running. Getting rid of memory leaks is tedious and painful, but absolutely necessary. Even the smallest memory leak will be a problem under heavy usage, and if you have a memory leak, the only cure is stopping and restarting the server. Note that garbage collection does not do away with memory leaks; it just changes the nature of the problem. Dynamic languages are designed for rapid implementation and hide the boring details. One detail that goes missing is control over memory usage, and memory usage left to itself tends to leak.
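To illustrate what I mean, here is a tiny Java example of my own devising; it is not Twitter's code. The garbage collector can never reclaim these cache entries because the static map keeps them reachable, so the heap grows for as long as the server runs.

```java
import java.util.*;

/** Illustration: a "leak" in a garbage-collected language. The GC
 * cannot reclaim these entries because the static map keeps them
 * reachable forever, so a long-running server doing this grows
 * without bound. (My own minimal example, not Twitter's code.) */
public class CacheLeak {
    private static final Map<String, byte[]> cache = new HashMap<>();

    static void handleRequest(String requestId) {
        // Cache per-request data but never evict it: a leak by design.
        cache.put(requestId, new byte[1024]);
    }

    public static void main(String[] args) {
        for (int i = 0; i < 100_000; i++) {
            handleRequest("req-" + i);  // heap use climbs with every call
        }
        System.out.println("entries never freed: " + cache.size());
    }
}
```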

Another issue is concurrency. Server software needs to exploit concurrency, particularly now in the era of multi-core hardware. Dynamic languages have problems with concurrency. There are a bunch of issues, too many to discuss here. Suffice it to say that in the past Guido van Rossum has prominently argued against putting threads into Python, another dynamic language, and both Python and Ruby implementations suffer from poor thread performance.

A third issue is type safety. As the Twitter crew say, they found themselves building their own type manager into their server code. In a statically typed language, type management is done at compile time, making the code more efficient and automatically eliminating a large class of potential bugs.

Related to this, many people commented on the revelation that the Twitter Ruby server code was full of calls to the Ruby kind_of? method. It is normally considered bad form to have to use kind_of? or its equivalents in other languages, like the Java instanceof operator. After a few moments' thought, I understood what the kind_of? code is for. If you look at the code of any real server, such as a database server, it is full of assert statements. The idea is that if you are going to fail, you should fail as fast as you can and let the error management and recovery system get you out of trouble. Failing fast reduces the likelihood that the error will propagate and cause real damage, like corrupting persistent data. Also, with a fast fail it is easier to figure out why the error occurred. In a language with dynamic typing, checking parameters with a kind_of? call is the first type of assert to put in any server code.
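Translated into Java terms, the pattern looks something like the following sketch; the method names are made up, but the fail-fast shape is the point. A dynamically typed server does the same thing with kind_of? because there is no compiler to do the check for it.

```java
/** Sketch of the fail-fast style described above, in Java terms:
 * check an untyped parameter at the boundary and fail immediately
 * rather than letting a bad value propagate. (Illustrative example;
 * the names are made up.) */
public class FailFast {
    static void enqueue(Object message) {
        // Fail as fast as possible: reject bad input up front.
        if (!(message instanceof String)) {
            throw new IllegalArgumentException(
                "expected String, got " +
                (message == null ? "null" : message.getClass().getName()));
        }
        String s = (String) message;
        // ... safe to process s from here on ...
        System.out.println("queued: " + s);
    }

    public static void main(String[] args) {
        enqueue("hello");   // fine
        enqueue(42);        // fails immediately with a clear message
    }
}
```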

So the Twitter developers have opted to use Ruby on Rails for their web server and Scala for their back-end server code. In the old days we would have said "horses for courses" and everyone would have nodded their heads in understanding. Nowadays, nobody goes racing, so nobody knows what the phrase means. Can anyone suggest a more up-to-date expression?

Sunday, April 05, 2009

Cloud to Ground

Everybody is talking about Cloud Computing, the idea that your computing needs can be met by utility computing resources out there on the internet. One sometimes overlooked issue with Cloud Computing is how you get your data out of the cloud, summed up in the phrase "Cloud-to-Ground".

The issue is that you have your data in the cloud, but you need it down here in your local computing systems so that, for example, you can prepare a presentation for the board, or generate a quarter-end report, or confirm that a new customer can get the telephone support they just paid for. While it is not hugely different from other data integration problems, it is one more thing to put on your checklist when you think about how you are going to use Cloud Computing.

I first heard the phrase last year when Mike Pittaro of SnapLogic spoke to the SDForum Business Intelligence SIG on SaaS Data Integration. It was only later that I discovered that the phrase originally describes a type of lightning.