Tuesday, November 11, 2008

Understanding Map-Reduce

Map-Reduce is the hoopy new data management function. Google produced the seminal implementation. Start-ups are jumping on the gravy train. The old guard decry it. What is it? In my opinion it is just the old database concept of aggregation rewritten for extremely large scale data as I will explain in another post. But firstly we need to understand what Map-Reduce does, and I have yet to find a good clear explanation, so here goes mine.

Map Reduce is an application for performing analysis on very large data sets. I will give a brief explanation of what Map Reduce does conceptually and then give an example. The Map Reduce application takes three inputs. The first input is a map (note lower case). A map is a data structure, sometimes called a dictionary. A tuple is a pair of values and a map is a set of tuples. The first value in a tuple is called the key and the second is called the value. Each key in a map is unique. The second input to Map-Reduce is a Map function (note upper case) . The Map function takes as input a tuple, (k1, v1) and produces a list of tuples (list(k2, v2)) from data in its input. Note that the list may be empty or contain only one value. The third input is a Reduce function. The Reduce function takes a tuple where the value is a list of values and returns a tuple. In practice it reduces the list of values to a single value.

The Map Reduce application takes the input map and applies the Map function to each tuple in that map. We can think of it creating an intermediate result that is a single large list from the lists produced by each application of the Map function:
{ Map(k1, v1) } -> { list(k2, v2) }
Then for each unique key in the intermediate result list it groups all the corresponding values into a list associated with the key value:
{ list(k2, v2) } -> { (k2, list(v2)) }
Finally it goes through this structure and applies the Reduce function to the value list in each element:
{ Reduce(k2, list(v2)) } -> { (k2, v3) }
The output of Map Reduce is a map.

Now for an example. In this application we are going to take the World Wide Web and for each web page count the number of other domains that reference that page. A domain is the part of a URL between the first two sets of slashes. For example, the domain of this web page is "www.bandb.blogspot.com". A web page is uniquely identified by its URL, so a URL is a good key for for a map. The data input is a map of the entire web. The key for each map element is the URL of the page, and the value is the corresponding web page. Now I know that this is a data structure on a scale that is difficult to imagine, however this is the kind of data that Google has to process to organize the worlds information.

The Map function takes the URL, web page pair and adds an element to its output list for every URL that it finds on the web page. The key in the output list is the URL found on the web page and the value is the domain from the key value in the input URL. So for example, on this page, our Map function finds the link to the Google paper on Map Reduce and adds to its list of outputs the tuple ("research.google.com/archive/mapreduce.html", "www.bandb.blogspot.com"). Map-Reduce reorganizes its intermediate data so that for each URL it collects all the domains that reference that page and stores them as a list. The Reduce function goes through the list of domains and counts the number of different domain values that it finds. The result of Map-Reduce is a map where the key is a URL and the value is a number, the number of other domains on the web that reference that page.

While this example is invented, Google reports that they use a set of 5 to 10 such Map-Reduce steps to generate their web index. The point of Map Reduce is that a user can write a couple of simple functions and have them applied to data on a vast scale.

1 comment:

alex_woods said...

I agree. I think there is this strange warping of thinking just because there is, currently, no such thing as a petabyte database. I don't agree that Google has the best solution.