Sunday, May 01, 2011

Understanding MapReduce Performance: Part 2

Getting good performance out of MapReduce is a matter of understanding two concepts. I discussed the first one, that MapReduce is designed to run on large clusters, in a post last week. Here is the second concept, and it is something that everyone who uses MapReduce needs to grapple with. MapReduce works by breaking the processing into a huge number of little pieces so that the work can be distributed over the cluster and done in parallel. Each Map task and each Reduce task is a separate unit of work that can be scheduled to run in parallel with other tasks. For both Map and Reduce, the number of tasks needs to be much larger than the number of nodes in the cluster.

The archetypal example of MapReduce is counting word frequency in a large number of documents. A Map task reads a document and outputs a tuple for each word with the count of occurrences of that word in the document. A Reduce task takes a word and accumulates a total count for it from the per-document counts produced by each Map task. In this example there are a large number of documents as input to the Map tasks and presumably a large number of distinct words, so the reduction can be spread over a large number of Reduce tasks. Another illustration of this principle is found in the Sort Benchmark disclosure that I discussed in the previous post. For the Gray sort, the 100 TB of data is broken into 190,000 separate Maps and there are 10,000 Reduces for a cluster of 3400 nodes, which works out to roughly 56 Map tasks and about 3 Reduce tasks per node.
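To make the word count example concrete, here is a minimal sketch of the two functions written against the Hadoop MapReduce API. It assumes an input format that presents one whole document as the value of each map() call, and the class names are mine for illustration rather than taken from any particular distribution.

import java.io.IOException;
import java.util.HashMap;
import java.util.Map;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;
import org.apache.hadoop.mapreduce.Reducer;

public class WordCount {

  // Map: count occurrences of each word within one document and emit
  // (word, count-in-document). Assumes the input format delivers one whole
  // document as the value of each map() call.
  public static class WordCountMapper
      extends Mapper<LongWritable, Text, Text, IntWritable> {
    @Override
    protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
      Map<String, Integer> counts = new HashMap<>();
      StringTokenizer tokens = new StringTokenizer(value.toString());
      while (tokens.hasMoreTokens()) {
        counts.merge(tokens.nextToken().toLowerCase(), 1, Integer::sum);
      }
      for (Map.Entry<String, Integer> e : counts.entrySet()) {
        context.write(new Text(e.getKey()), new IntWritable(e.getValue()));
      }
    }
  }

  // Reduce: sum the per-document counts for one word into a total count.
  public static class SumReducer
      extends Reducer<Text, IntWritable, Text, IntWritable> {
    @Override
    protected void reduce(Text word, Iterable<IntWritable> counts, Context context)
        throws IOException, InterruptedException {
      int total = 0;
      for (IntWritable c : counts) {
        total += c.get();
      }
      context.write(word, new IntWritable(total));
    }
  }
}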

While most users of MapReduce get the idea that the input data needs to be broken into lots of little pieces so that there are many Map tasks, they forget that the same requirement applies to Reduce tasks. Searching the internet, it is easy to find examples of MapReduce with a small number of Reduce tasks. One is a tutorial from the University of Wisconsin in which there is ultimately only one Reduce task. It is particularly galling that this example comes from Wisconsin, which has a large and prestigious program in parallel database systems research. In their defense, the tutorial does show how to do intermediate reduction of the data, but that does not stop it from being a bad example in general.
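Here, for contrast, is a sketch of a job driver for the word count above that requests many Reduce tasks and reuses the Reduce class as a combiner, which is the usual way to get intermediate reduction without giving up parallelism in the Reduce phase. The figure of 1,000 Reduce tasks is only an assumed value for a hypothetical few-hundred-node cluster.

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

public class WordCountDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "word count");
    job.setJarByClass(WordCountDriver.class);

    job.setMapperClass(WordCount.WordCountMapper.class);
    job.setReducerClass(WordCount.SumReducer.class);

    // Intermediate reduction: reuse the Reduce class as a combiner so that
    // per-node partial sums shrink the data shuffled to the Reduce tasks.
    job.setCombinerClass(WordCount.SumReducer.class);

    // Ask for many more Reduce tasks than nodes (1,000 here is an assumed
    // figure for a hypothetical few-hundred-node cluster), not the default of 1.
    job.setNumReduceTasks(1000);

    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(IntWritable.class);

    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}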

Sometimes the problem is too small. What do you do if the problem you are working on just involves the computation of a single result? The answer is to enlarge the problem. On a large cluster it is better to compute more results, even though they may not be of immediate use to you. Let's look at an example. Say you want to analyze a set of documents for the frequency of the word 'the'. The natural thing to do is to process all the documents, filter for the word 'the' in the Map function, and count the results in the Reduce function. This is how you are taught to use "valuable" computing resources. In practice, with MapReduce it is better to count the frequency of all the words in the documents and save the results. It is not much more work for the MapReduce engine to count every word, and if you then want to know how many uses there are of 'a' or any other word, the answer is there for you immediately.
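To see how small the saving is, here is a sketch of a Map function narrowed to the single word 'the', written against the same Hadoop classes as the sketches above. It still has to read and tokenize every document; only the filter changes, which is why counting every word costs little extra.

import java.io.IOException;
import java.util.StringTokenizer;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// A Map function narrowed to the single word 'the'. The job still reads and
// tokenizes every document; the only difference from the general word count
// mapper is the filter, so the only saving is a smaller intermediate output.
public class TheOnlyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
  private static final Text THE = new Text("the");

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    int count = 0;
    StringTokenizer tokens = new StringTokenizer(value.toString());
    while (tokens.hasMoreTokens()) {
      if (tokens.nextToken().toLowerCase().equals("the")) {
        count++;
      }
    }
    if (count > 0) {
      context.write(THE, new IntWritable(count));
    }
  }
}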

A common analogy is that MapReduce is a freight train while a relational database is a racing car. The freight train carries a huge load but is slow to start and stop. A racing car is very fast and nimble but carries only one person. Relational database systems rely on you to use the where clause to reduce the data they have to analyze, and in return they give you the answer in a short time. MapReduce does not give you an answer as quickly, but it is capable of effectively processing a lot more data. With MapReduce you should process all the data and save the results, then use them as you need them. We can sum up this way of thinking about MapReduce with the slogan "no where clauses".
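As a rough illustration of saving the results and using them as you need them, the short program below looks up one word in the saved output of the word count job. It assumes the default tab-separated word and count lines that Hadoop's text output writes, read from whichever part-r-* file the job produced.

import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;

// Look up a single word in saved word-count output (tab-separated
// "word<TAB>count" lines). Once the full results are stored, any individual
// frequency is a quick lookup instead of another pass over the raw documents.
public class LookupWord {
  public static void main(String[] args) throws IOException {
    String outputFile = args[0]; // e.g. a part-r-00000 file copied out of HDFS
    String target = args[1];     // e.g. "the"
    try (BufferedReader in = new BufferedReader(new FileReader(outputFile))) {
      String line;
      while ((line = in.readLine()) != null) {
        String[] fields = line.split("\t");
        if (fields.length == 2 && fields[0].equals(target)) {
          System.out.println(target + ": " + fields[1]);
          return;
        }
      }
    }
    System.out.println(target + ": 0");
  }
}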
