Saturday, April 25, 2009

Mahout on Hadoop

No, this is not a tale of an elephant and his faithful driver. I am talking about Mahout, an Open Source project that is building a set of serious machine learning and analytic algorithms to run on the Hadoop Open Source Map-Reduce platform. We learned about this at the April meeting of the SDForum Business Intelligence SIG, where Jeff Eastman spoke on "BI Over Petabytes: Meet Apache Mahout".

As Jeff explained, the Mahout project is a distributed group of about 10 committers who are implementing different types of analytics and machine learning algorithms. Jeff's interest is in clustering algorithms, which are used for various purposes in analytics. One use is to generate the "customers who bought X also bought Y" come-on that you see at an online retailer. Another is to group customers into a small number of large clusters of similar behavior, to reveal patterns and trends in their purchasing.

Jeff showed us all the Mahout clustering algorithms, explaining what you need to provide to set up each one and giving graphical examples of how they behaved on an example data set. He then went on to show how one algorithm was implemented on Hadoop. This implementation shows how flexible the Map-Reduce paradigm is. I showed a very simple example of Map-Reduce when I wrote about it last year, so that I could compare it to the same function implemented in SQL. Clustering with Map-Reduce is at the other end of the scale: a complicated big-data algorithm that can also make effective use of the Map-Reduce platform.
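To make that contrast concrete, here is a toy grouped count, the kind of thing my earlier post did in SQL as SELECT key, COUNT(*) FROM sales GROUP BY key, written map-reduce style in plain Java. This is my own illustration, not anything from the talk: it runs in memory with no Hadoop, and the class and method names are made up for the sketch.

    // A sketch of the simple Map-Reduce example: counting records per key.
    import java.util.*;

    public class GroupedCount {
        // The "map" phase: emit a (key, 1) pair for every input record.
        static List<Map.Entry<String, Integer>> map(List<String> records) {
            List<Map.Entry<String, Integer>> pairs = new ArrayList<>();
            for (String key : records) {
                pairs.add(Map.entry(key, 1));
            }
            return pairs;
        }

        // The "reduce" phase: sum the values emitted for each key.
        static Map<String, Integer> reduce(List<Map.Entry<String, Integer>> pairs) {
            Map<String, Integer> counts = new TreeMap<>();
            for (Map.Entry<String, Integer> pair : pairs) {
                counts.merge(pair.getKey(), pair.getValue(), Integer::sum);
            }
            return counts;
        }

        public static void main(String[] args) {
            List<String> records = List.of("books", "music", "books", "toys", "books");
            System.out.println(reduce(map(records))); // {books=3, music=1, toys=1}
        }
    }

On Hadoop the map and reduce phases run in parallel across the cluster, with the framework grouping the pairs by key between the two phases; the clustering algorithms below use exactly the same skeleton on a much grander scale.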

Most clustering algorithms are iterative. Starting from an initial guess at the clusters, each iteration moves data points from one cluster to another to make better clusters. Jeff suggested that a typical application may need 10 iterations or so to converge to a reasonable result. In Mahout, each iteration is a Map-Reduce step. He showed us the top-level code for one clustering algorithm. Building on the Map-Reduce framework and the Mahout common libraries for data representation and manipulation, the clustering code itself is pretty straightforward.
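To give a flavor of how an iteration decomposes, here is a toy k-means-style iteration in plain Java, with the assignment step playing the mapper and the averaging step playing the reducer. This is my own sketch on one-dimensional data, not Mahout's actual code, which works on general vectors and runs each iteration as a Hadoop job.

    // One k-means iteration, map-reduce style, on 1-D points.
    import java.util.*;

    public class KMeansIteration {
        // Map: assign each point to its nearest centroid, emitting
        // (centroid index, point) pairs grouped by centroid.
        static Map<Integer, List<Double>> map(double[] points, double[] centroids) {
            Map<Integer, List<Double>> assignments = new HashMap<>();
            for (double p : points) {
                int nearest = 0;
                for (int c = 1; c < centroids.length; c++) {
                    if (Math.abs(p - centroids[c]) < Math.abs(p - centroids[nearest])) {
                        nearest = c;
                    }
                }
                assignments.computeIfAbsent(nearest, k -> new ArrayList<>()).add(p);
            }
            return assignments;
        }

        // Reduce: each centroid's new position is the mean of its points.
        static double[] reduce(Map<Integer, List<Double>> assignments, double[] old) {
            double[] next = old.clone(); // leave empty clusters where they were
            for (Map.Entry<Integer, List<Double>> e : assignments.entrySet()) {
                next[e.getKey()] = e.getValue().stream()
                        .mapToDouble(Double::doubleValue).average().orElse(old[e.getKey()]);
            }
            return next;
        }

        public static void main(String[] args) {
            double[] points = {1.0, 1.2, 0.8, 9.0, 9.5, 10.1};
            double[] centroids = {0.0, 5.0}; // initial guess
            for (int i = 0; i < 10; i++) { // ~10 iterations, one "job" each
                centroids = reduce(map(points, centroids), centroids);
            }
            System.out.println(Arrays.toString(centroids)); // ≈ [1.0, 9.53]
        }
    }

The driver loop is the part that runs on a single machine: it submits one Map-Reduce job per iteration and feeds the new centroids into the next round, which is essentially the structure of the top-level code Jeff showed us.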

Up to now, it has not been practical to do sophisticated analytics like clustering on datasets that exceed a few megabytes, so the normal approach has been to draw a small representative sample from the dataset and do the analytics on that sample. Mahout enables analytics on the whole dataset, provided that you have the computer cluster to do it.

Given that most analysts are used to working with samples, is there any need for Mahout-scale analytics? Jeff was asked this question when he gave the presentation at Yahoo, and he did not have a good answer then. Someone in the audience suggested that analytics on the long tail requires the whole dataset. On reflection, the complete dataset is also needed for collaborative filtering, as in the "customers who bought X also bought Y" example given above.

Note that at the BI SIG meeting, Suzanne Hoffman of Star Analytics also gave a short presentation on the Gartner BI Summit. I will write about that in another post.
