Tuesday, December 23, 2008

Map Reduce Sort Benchmark

In November Google posted on their blog that they had beaten the Terabyte Sort Benchmark with a time of 68 seconds. I waited to comment on the result until it was confirmed on the Sort Benchmark Home Page and the technical details were published, but neither has happened, so here are some preliminary thoughts.

The Sort Benchmark Home Page has results for several different races. The big division is between Daytona and Indy, where Daytona is for general purpose software like Map Reduce while Indy is for code specially written for the task. The other dimension is how much is sorted: there are competitions for how many records can be sorted for a penny, and for how many can be sorted in a minute. Then there is the big Kahuna prize: sorting a Terabyte, that is, 10 billion 100-byte records, in the shortest time.

Map Reduce can be used to sort data. (See this previous post for a simple explanation of Map Reduce.) Most of the sorting work is done by the partitioning function that sits between the map part of the process and the reduce part. Normally the partitioning function uses hashing to get reasonably even partition sizes in the face of skewed data. Map Reduce allows you to supply a custom partitioning function, and for sorting the default hash partitioner is replaced by a range partitioning function, so that each reduce process gets all the records in one key range. Each reduce process then sorts and outputs its group of records, and those outputs, taken in range order, form the globally sorted result.
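
To make the range idea concrete, here is a minimal sketch in plain Java of the two partitioning strategies. The method names and the byte-splitting scheme are my own illustration, not Hadoop's Partitioner API or Google's code. A hash partitioner only needs even bucket sizes; a range partitioner must also guarantee that every key sent to reduce process i sorts before every key sent to reduce process i+1.

    import java.util.Arrays;
    import java.util.Random;

    // Illustrative partitioners for a sort-benchmark style workload where
    // keys are 10-byte binary strings, as in the Terabyte Sort.
    public class PartitionSketch {

        // Hash partitioning: even bucket sizes, but no ordering between buckets.
        static int hashPartition(byte[] key, int numReducers) {
            return (Arrays.hashCode(key) & Integer.MAX_VALUE) % numReducers;
        }

        // Range partitioning: every key in bucket i sorts before every key in
        // bucket i+1, so the reducers' sorted outputs concatenate into one
        // globally sorted file. Splitting evenly on the first byte only
        // balances the load if the keys are uniformly distributed.
        static int rangePartition(byte[] key, int numReducers) {
            int firstByte = key[0] & 0xFF;   // treat the byte as unsigned 0..255
            return firstByte * numReducers / 256;
        }

        public static void main(String[] args) {
            Random rnd = new Random(42);
            int numReducers = 4;
            int[] hashCounts = new int[numReducers];
            int[] rangeCounts = new int[numReducers];
            for (int i = 0; i < 1000000; i++) {
                byte[] key = new byte[10];
                rnd.nextBytes(key);          // uniform synthetic keys, like the benchmark data
                hashCounts[hashPartition(key, numReducers)]++;
                rangeCounts[rangePartition(key, numReducers)]++;
            }
            System.out.println("hash  buckets: " + Arrays.toString(hashCounts));
            System.out.println("range buckets: " + Arrays.toString(rangeCounts));
        }
    }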

Here we come to a little issue. The Sort Benchmark uses artificial data with a randomly generated key that is guaranteed to be evenly distributed throughout the range. Range partitioning is fine for this synthetic data, but it will not work so well with skewed real-world data, where a naive split of the key space gives some reduce processes far more records than others. Thus, while the results are impressive, they should be taken with a pinch of salt.
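
The usual way around this, which the Google post does not go into, is to sample the input first and take the range boundaries from the sample instead of splitting the key space evenly. The sketch below is again my own plain-Java illustration rather than anything from Hadoop or Google: it picks split points from the quantiles of a small random sample, so even heavily skewed keys land in roughly equal buckets, and it prints the imbalance from a naive even split for comparison.

    import java.util.Arrays;
    import java.util.Random;

    // Sketch: choose range-partition split points by sampling the keys, so a
    // skewed key distribution still produces evenly sized buckets.
    public class SampledSplitPoints {

        // Sort a random sample of keys and take its quantiles as split points.
        static long[] splitPoints(long[] sample, int numReducers) {
            long[] sorted = sample.clone();
            Arrays.sort(sorted);
            long[] splits = new long[numReducers - 1];
            for (int i = 1; i < numReducers; i++) {
                splits[i - 1] = sorted[i * sorted.length / numReducers];
            }
            return splits;
        }

        // A key's bucket is the number of split points it exceeds.
        static int partition(long key, long[] splits) {
            int bucket = 0;
            while (bucket < splits.length && key > splits[bucket]) bucket++;
            return bucket;
        }

        public static void main(String[] args) {
            Random rnd = new Random(1);
            int numReducers = 4;

            // Heavily skewed keys: squaring a uniform value pushes most keys low.
            long[] keys = new long[1000000];
            for (int i = 0; i < keys.length; i++) {
                double u = rnd.nextDouble();
                keys[i] = (long) (u * u * 1000000000L);
            }

            // Naive even split of the key space: the lowest bucket gets about half the keys.
            long[] evenSplits = {250000000L, 500000000L, 750000000L};
            int[] naiveCounts = new int[numReducers];
            for (long key : keys) naiveCounts[partition(key, evenSplits)]++;
            System.out.println("even splits:    " + Arrays.toString(naiveCounts));

            // Sample a small fraction of the keys and split at the sample quantiles instead.
            long[] sample = new long[1000];
            for (int i = 0; i < sample.length; i++) {
                sample[i] = keys[rnd.nextInt(keys.length)];
            }
            long[] sampledSplits = splitPoints(sample, numReducers);
            int[] sampledCounts = new int[numReducers];
            for (long key : keys) sampledCounts[partition(key, sampledSplits)]++;
            System.out.println("sampled splits: " + Arrays.toString(sampledCounts));
        }
    }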

The Google result seems particularly impressive because last summer Yahoo used Hadoop, their open source implementation of Map Reduce, to officially win the Terabyte Sort Benchmark with a time of 209 seconds. There has been plenty of speculation about why the Google result is so much faster than the Hadoop result. Here are some thoughts:
  • Bigger iron. Google has not disclosed the details of their sort (from what I can find), but their post suggests that they used 12 disks per system, as opposed to Yahoo's 4 disks per system. The total time is so short that the extra IO capacity alone could make a big difference. The Google systems may have had more memory and other resources per node as well.
  • Misconfigured Hadoop. The Hadoop sort benchmark disclosure says "I configured the job with 1800 maps", on a 910 node system where each node has 4 dual core processors! The Hadoop documentation says "The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very CPU-light map tasks." The map part of sorting with Map Reduce is a very CPU light task, so 1800 maps across 910 nodes looks low (see the configuration sketch after this list).
  • Yahoo did not try very hard. They handily beat the previous record of 268 seconds. The benchmark disclosure says "Although I had the 910 nodes mostly to myself, the network core was shared with another active 2000 node cluster, so the times varied a lot depending on the other activity."
  • SSL Communication. Hadoop uses SSL to communicate between nodes. SSL provides good network security, but it adds connection setup time on each node, which matters in a communication intensive task like this. It is not clear what Google uses for communication between nodes.
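
On the configuration point, here is roughly where those knobs live in the classic (pre-0.20) Hadoop API. This is a hedged sketch, not Yahoo's actual job driver: the class name, the paths, and the numbers are mine, with 9100 simply being ten maps per node on a 910 node cluster. Note that setNumMapTasks is only a hint to the framework, since the real map count follows from the input splits, while setNumReduceTasks directly fixes the number of output partitions.

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    // Sketch of the parallelism knobs on the classic Hadoop API. The mapper,
    // reducer and partitioner are left at their defaults; only the settings
    // relevant to the "1800 maps" question are shown.
    public class SortJobSketch {
        public static void main(String[] args) throws Exception {
            JobConf conf = new JobConf(SortJobSketch.class);
            conf.setJobName("sort-sketch");

            // Only a hint to the framework: the actual number of map tasks
            // is driven by how the input is split.
            conf.setNumMapTasks(9100);
            // This one is binding: it fixes the number of output partitions.
            conf.setNumReduceTasks(1800);

            FileInputFormat.setInputPaths(conf, new Path(args[0]));
            FileOutputFormat.setOutputPath(conf, new Path(args[1]));

            JobClient.runJob(conf);
        }
    }
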
Here are a couple of final comments. Firstly, Hadoop is still the official world record holder for the Terabyte Sort. Secondly, a Terabyte is a small amount of data these days. The real point of the Google post was to say that they had sorted a Petabyte in 6.8 hours. Now that is a real sort benchmark.
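
To put that Petabyte number in perspective: 10^15 bytes in 6.8 hours, which is roughly 24,500 seconds, works out to about 40 Gigabytes per second just to pass the data through the system once, and a sort has to read the data, shuffle it across the network, and write it back out.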

1 comment:

awootton said...

One has to ask, given your previous comments, whether a proper distributed rdbs could beat the benchmark. I have some experience with Hadoop and, however it is scaled, there is still disk io involved. What would happen if the data was in RAM during the sort? I can rent 1 terabyte of RAM from Amazon for $54/hour (67 machines). I can make them act like a distributed database. So, dear friend, what do you think the winning algorithm is? Bear in mind that a distributed rdbs has an unlimited number of 'entry points' and many select-updates would happen at once.