Currently MapReduce is riding high on the hype cycle. The other day I saw a presentation that was nothing but breathless exhortation for MapReduce as the next big thing and that we had better all jump on the bandwagon as soon as possible. However, there are rumblings of performance problems. At the recent Big Data Camp, Greenplum reported that their MapReduce was 100 times slower than their database system. Searching the web finds many people complaining about MapReduce performance, particularly with NoSQL systems like MongoDB. That is a problem because MapReduce is the data analysis tool for processing NoSQL data. For MongoDB, anything more than the most trivial reporting will require the use of MapReduce.
At the same time there is plenty of evidence that MapReduce is no performance slouch. The Sort Benchmark is a prime measure of computer system performance and currently the Hadoop MapReduce system holds two out of 6 titles for which it is eligible. One title is the Gray test for sorting 100 Terabytes (TB) of data in 173 minutes. The other title is the Minute test for sorting 500 Gigabytes (GB) of data in under a minute. These results are as of May 2010 and the Sort Benchmark is run every year, so we can expect better performance in the future.
Understanding MapReduce performance is a matter of understanding two simple concepts. The first concept is that the design center for MapReduce systems like Hadoop is to run large jobs on a large distributed cluster. To get a feel of what this means, look at the Hadoop disclosure document for the Sort Benchmark. The run for sorting 100 TB was made on a cluster of about 3400 nodes. Each node had 8 cores, 4 disks, 16 GB of RAM and 1GB ethernet. For the Minute sort, a smaller cluster was used with 1400 node systems with the same configuration except 8GB of RAM on each node. That is not to say that MapReduce will only work on thousand node systems. Most systems are much smaller than this, however Hadoop is particularly designed so that it will scale to run on a huge cluster.
One problem with a large cluster is that nodes break down. Hadoop has several features that transparently work around the problem of broken nodes and continue processing in the presence of failure. From the Sort Benchmark disclosure, for the Gray sort run, every processing task is replicated. That is, for every processing task, two nodes are assigned to do it so that should a node break down, the sort can still continue with the data from the other node. This was not used for the Minute test because the likelihood of a node breaking down in the minute while the test is running is low enough to be ignored.
Another large cluster feature that has an important effect on performance is that all intermediate results are written to disk. The results of all the Mappers are written to disk and the sorted data for Reducers is written to disk. This is done so that if a node fails only a small piece of work needs to be redone. By contrast, relational database systems go to great length to ensure that after data has been read from disk, it does not touch the disk again before being delivered to the user. If a node fails in a relational database system, the whole system goes into an error state and then does a recovery which can take some time. This is extremely disruptive to performance when a node fails and much better for performance when there is no failure. Relational database systems were not designed to run on thousands of nodes so they treat the problem of a node failure as a very rare event whereas Hadoop is designed as if it a commonplace. The consequence is that Hadoop performance can look slow when compared to a relational database on a small cluster.
Unfortunately, there is not a lot that a user can do about this, except look for a bigger cluster to run their analysis on, or look for bigger data to analyze. That is the subject for the second part of this post where I will talk about the other simple concept for understanding MapReduce performance.