Thursday, April 28, 2011

Understanding MapReduce Performance: Part 1

Currently MapReduce is riding high on the hype cycle. The other day I saw a presentation that was nothing but breathless exhortation: MapReduce is the next big thing and we had better all jump on the bandwagon as soon as possible. However, there are rumblings of performance problems. At the recent Big Data Camp, Greenplum reported that their MapReduce implementation was 100 times slower than their database system. Searching the web turns up many people complaining about MapReduce performance, particularly with NoSQL systems like MongoDB. That is a problem, because MapReduce is the data analysis tool for processing NoSQL data. For MongoDB, anything more than the most trivial reporting requires the use of MapReduce.
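To give a flavor of what that looks like, here is a minimal sketch of a trivial MongoDB report written against the 2011-era MongoDB Java driver; the database, collection and field names are invented for illustration. Note that MongoDB ships the map and reduce functions to the server as JavaScript strings.

    import com.mongodb.DB;
    import com.mongodb.DBCollection;
    import com.mongodb.MapReduceOutput;
    import com.mongodb.Mongo;

    public class PageViewReport {
        public static void main(String[] args) throws Exception {
            // "analytics" and "pageviews" are invented names for this sketch.
            DB db = new Mongo("localhost").getDB("analytics");
            DBCollection views = db.getCollection("pageviews");

            // Map emits (page, 1) for each document; reduce sums the counts.
            String map = "function() { emit(this.page, 1); }";
            String reduce = "function(key, values) { return Array.sum(values); }";

            // Results are written to the "views_by_page" collection.
            MapReduceOutput out = views.mapReduce(map, reduce, "views_by_page", null);
            System.out.println(out.getOutputCollection().find().toArray());
        }
    }

Even a per-page count of views, the kind of query a SQL database answers with a one-line GROUP BY, goes through the MapReduce machinery here.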

At the same time there is plenty of evidence that MapReduce is no performance slouch. The Sort Benchmark is a prime measure of computer system performance, and the Hadoop MapReduce system currently holds two of the six titles for which it is eligible. One is the Gray sort title, for sorting 100 Terabytes (TB) of data, which Hadoop did in 173 minutes. The other is the Minute sort title, for sorting 500 Gigabytes (GB) of data in under a minute. These results are as of May 2010, and the Sort Benchmark is run every year, so we can expect better performance in the future.

Understanding MapReduce performance is a matter of understanding two simple concepts. The first concept is that the design center for MapReduce systems like Hadoop is running large jobs on a large distributed cluster. To get a feel for what this means, look at the Hadoop disclosure document for the Sort Benchmark. The run that sorted 100 TB was made on a cluster of about 3400 nodes. Each node had 8 cores, 4 disks, 16 GB of RAM and 1 Gigabit Ethernet. For the Minute sort, a smaller cluster of about 1400 nodes was used, with the same configuration except 8 GB of RAM per node. Note that 100 TB spread over 3400 nodes works out to only about 30 GB per node; the headline numbers come from scale, not from blinding per-node speed. That is not to say that MapReduce will only work on thousand-node systems. Most clusters are much smaller than this; however, Hadoop is specifically designed so that it will scale to run on a huge cluster.

One problem with a large cluster is that nodes break down. Hadoop has several features that transparently work around broken nodes and continue processing in the presence of failure. According to the Sort Benchmark disclosure, for the Gray sort run every processing task was replicated: two nodes were assigned to each task, so that if one node broke down, the sort could still continue with the results from the other. This was not done for the Minute test, because the likelihood of a node breaking down during the minute the test runs is low enough to be ignored.
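Standard Hadoop offers a related feature, speculative execution, which launches backup copies of slow-running tasks so that one sick node does not hold up the whole job. Here is a minimal sketch, assuming the 0.20-era JobConf API, of how a job turns the feature on or off per phase; the class name is a placeholder:

    import org.apache.hadoop.mapred.JobConf;

    public class SpeculationConfig {
        public static JobConf configure() {
            // SpeculationConfig is just a placeholder class for this sketch.
            JobConf conf = new JobConf(SpeculationConfig.class);
            conf.setMapSpeculativeExecution(true);     // allow backup map tasks
            conf.setReduceSpeculativeExecution(false); // reducers run only once
            return conf;
        }
    }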

Another large-cluster feature with an important effect on performance is that all intermediate results are written to disk. The output of every Mapper is written to disk, and the sorted data for the Reducers is written to disk. This is done so that if a node fails, only a small piece of work needs to be redone. By contrast, relational database systems go to great lengths to ensure that after data has been read from disk, it does not touch the disk again before being delivered to the user. If a node fails in a relational database system, the whole system goes into an error state and then does a recovery, which can take some time. That approach is extremely disruptive when a node fails, but much better for performance when there is no failure. Relational database systems were not designed to run on thousands of nodes, so they treat a node failure as a very rare event, whereas Hadoop is designed as if it were commonplace. The consequence is that Hadoop can look slow when compared to a relational database on a small cluster.
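To make that data flow concrete, here is a minimal sketch of the classic word count job in Hadoop's Java API (the class names are mine, not from the benchmark); the comments mark where intermediate data lands on disk:

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;

    public class WordCount {

        // Map phase: emit (word, 1) for every word in the input split.
        // Hadoop buffers these pairs in memory, then spills them, sorted,
        // to the node's local disk -- the first disk write described above.
        public static class WordMapper
                extends Mapper<LongWritable, Text, Text, IntWritable> {
            private static final IntWritable ONE = new IntWritable(1);
            private final Text word = new Text();

            @Override
            protected void map(LongWritable offset, Text line, Context context)
                    throws IOException, InterruptedException {
                for (String token : line.toString().split("\\s+")) {
                    if (token.length() > 0) {
                        word.set(token);
                        context.write(word, ONE);
                    }
                }
            }
        }

        // Reduce phase: the framework fetches each Mapper's spill files over
        // the network and merges them on the Reducer's local disk -- the
        // second disk write -- before reduce() is ever called.
        public static class SumReducer
                extends Reducer<Text, IntWritable, Text, IntWritable> {
            @Override
            protected void reduce(Text word, Iterable<IntWritable> counts,
                    Context context) throws IOException, InterruptedException {
                int sum = 0;
                for (IntWritable count : counts) {
                    sum += count.get();
                }
                context.write(word, new IntWritable(sum));
            }
        }
    }

A relational database would keep the equivalent of those intermediate (word, 1) pairs in memory wherever possible; Hadoop deliberately pays the disk I/O so that a failed node costs only the re-execution of its own tasks.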

Unfortunately, there is not a lot that a user can do about this, except look for a bigger cluster to run their analysis on, or look for bigger data to analyze. That is the subject of the second part of this post, where I will talk about the other simple concept for understanding MapReduce performance.

Sunday, April 24, 2011

The Truth about Smartphone Location Tracking

There is a wave of outrage across the internet about the revelation that the iPhone keeps a file of tracking information recording all the places it has been. How dare Apple track the users of its products! I am afraid that this is an extremely naive attitude. The fact is that everybody is tracking you, and not only on the iPhone but on all smartphones, and on many less-than-smart phones as well. Let me count the ways, starting with the benign and moving to the egregious.

Firstly, the carriers and handset makers collect data from phones to help improve their service. Last week we had a joint meeting of the SDForum Business Intelligence and Mobile SIGs on "Mobile Analytics". At that meeting, Andrew Coward of CarrierIQ described how they embed their software in phones, usually at the carrier's direction, to collect information that can be used to improve service. For example, he told us that it is quite normal for them to report to a carrier that its dropped call rate is 6%, while the carrier's own engineers are telling management that the dropped call rate is 1%. They collect location data so that the carrier knows where users are when they use their phones, and can improve service in those areas.

In Europe, data retention laws require phone carriers to keep a Call Data Record (CDR) for every call for a period of one or two years. The police can and do request information on all the calls made to or from a number to help with their enquiries into crime. While a CDR does not usually contain specific location information, it does identify the cell tower, and thus the approximate location, of the caller. Police have successfully used location-based CDR data in their investigations for well over a decade.

With the user's permission, Google collects information from Android phones about their location. Google is the ultimate data collection company, and I am always amazed at the creative ways they find to use that data. One Google service is the Traffic overlay on Google Maps, which is derived from observing the changing locations of Android phones. While Google says that they do not collect personally identifying information, they do need to distinguish between phones to make this application work, so they are tracking the movements of individuals, if only to provide the rest of us generic information on traffic flows. Google has plenty of other uses for this data. For example, they keep a database that locates every Wi-Fi hotspot so that they can identify your location from the Wi-Fi hotspot you are using, and data from Android phones can validate and update that database.

Mobile analytics and Apps are where the use of location-based information starts to get interesting. Last year Flurry presented to the Business Intelligence SIG, and we heard about their run-in with Steve Jobs. You can read their press release to get the full story of what they did. In short, Flurry has a free toolkit that developers install in their mobile Apps to collect information and send it back to Flurry. The developer can then access analytics reports about their App on the Flurry web site. However, Flurry retains the data that has been collected from the App, including location-based data.

In January 2010, a couple of days before the iPad was announced, Flurry issued a press release saying that they saw a new Apple device that was only being used at the Apple headquarters in Cupertino, and gave some statistics on the number of different Apps being tested on this device. At this, Steve Jobs blew his top and tried to get Flurry completely banned from iPhone Apps. Eventually Flurry and Apple settled their differences. The conclusion, in the words of the iPhone developer agreement, was that "The use of third party software in Your Application to collect and send Device Data to a third party for processing or analysis is expressly prohibited."

So let's parse this. Flurry is a company that has no direct relationship with the carriers, the handset makers or the users of Apps, yet it collects data from all the Apps that include it. The data is available for use by the App developer and by Flurry. At the time of the iPad release, they could tell that the device was different from all other devices and identify its location to within one set of buildings. Now, I am not trying to pick on Flurry specifically; there are several companies in this area. At the Business Intelligence SIG last week we heard from Apsalar, a recent start-up in the same space. However, Flurry is the largest company providing mobile analytics; they estimate that they are included in up to 1 in 5 mobile Apps for the iPhone and Android. Because they are in so many Apps, they can provide aggregate data on all App usage.

The point of this is that we want location-aware Apps, but we also want to preserve our privacy. As Apps stand today, these two goals are incompatible. To be location aware, an App has to know your location, and if the App knows your location, it can transmit that information back to the App developer, or to an analytics aggregator acting for the App developer. Thus they know where you are whether you want them to or not. Android has a permissions profile, set when the App is installed, that determines which information the App can access. If the App is allowed to access location information on installation, it can continue to do so until it is uninstalled.
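To see how little stands between the permission and the tracking, here is a minimal sketch in Java of what any code inside an App, including an embedded analytics library, can do once the user grants the location permission at install time. The sendToAnalytics method is hypothetical; I am not describing any particular vendor's toolkit.

    import android.content.Context;
    import android.location.Location;
    import android.location.LocationListener;
    import android.location.LocationManager;
    import android.os.Bundle;

    // Runs with whatever permissions the App was granted at install time.
    public class LocationReporter implements LocationListener {

        public void start(Context context) {
            LocationManager lm =
                (LocationManager) context.getSystemService(Context.LOCATION_SERVICE);
            // Ask for a location update every 60 seconds or 100 meters.
            lm.requestLocationUpdates(LocationManager.NETWORK_PROVIDER, 60000, 100, this);
        }

        @Override
        public void onLocationChanged(Location loc) {
            // sendToAnalytics is hypothetical: this is the point at which an
            // embedded library could phone home with the user's position.
            sendToAnalytics(loc.getLatitude(), loc.getLongitude());
        }

        private void sendToAnalytics(double latitude, double longitude) {
            // Transmit to a remote server; omitted in this sketch.
        }

        @Override public void onProviderDisabled(String provider) {}
        @Override public void onProviderEnabled(String provider) {}
        @Override public void onStatusChanged(String provider, int status, Bundle extras) {}
    }

Nothing in this code asks the user anything at the moment the location is read; the consent happened once, at install time, and covers every later use.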

Compared to what Apps know about what you are doing while you use them, the location database that the iPhone collects seems a small matter. In fact, it seems like a good reason to limit the number of Apps that can run at any one time. At least if only one App is running, then only one App knows where you are at any particular time.

Tuesday, April 12, 2011

The Business of Open Source Suites

I have often wondered how a commercial company builds an Open Source suite out of a collection of open source projects. At the last BI SIG meeting, Ian Fyfe, Chief Technology Evangelist at Pentaho, told us how they do it and gave some interesting insights into how Open Source really works. Pentaho offers an Open Source Business Intelligence suite that includes the Kettle data integration project, the Mondrian OLAP project and the Weka data mining project, amongst others.

As Ian explained, Pentaho controls these Open Source projects because it employs the project leader and major contributors of each project. In some cases Pentaho also owns the copyright on the code. In other cases ownership is in doubt, because there have been too many contributors, or the record of what they contributed has not been kept well enough to say who owns the code. Mondrian is an example of an Open Source project with enough contributors that it is not possible to take control of the whole source code and exert any real rights over it.

The real control that Pentaho exerts over the Open Source components of its suite is that it gets to say what their roadmap is and how they will evolve in the future. As I noted, Pentaho is driving the various projects towards a common metadata layer so that they can become integrated as a single suite of products.

Saturday, April 09, 2011

The Fable of the Good King and the Bad King

A long time ago there were two countries. Each country had a King. One King was a good King and the other King was a bad King, as we will find out. Now, as you all know, a King's main job is to go out and make war on his enemies. It is the reason that Kings exist. If a King is not out making war on his enemies, he will go out hunting and make war on the animals of the forest. A good war will enlarge the kingdom, enhance the King's fame and give him more subjects to rule over. But before a King can make war, he should make sure that his subjects are provided for. For while the subjects of a King owe everything that they have to their King, the King is also responsible for the welfare and well-being of his subjects.

There are many parts to taking care of subjects: making good laws, passing down sound judgements, but the most important is making sure that the granaries are filled in times of plenty. For as surely as fat times follow lean times, lean times follow fat times. In times of plenty, the excess harvest should be saved so that in times of need the subjects do not starve. Subjects who are starving are weak, and cannot praise their King nor defend his kingdom.

Now in our two countries these were years of plenty, and the Kings knew that they would go to war. The good King also knew that it was his duty to make sure the granaries were filled, and so he did. However, the bad King wanted to win the battle so badly that he sold off all the grain in his granaries to buy expensive war machines. A little incident happened, it was blown up into a huge crisis, and the two countries went to war. Each King assembled his army and led it to the battleground at the border of their countries, as had happened so many times before. The armies were evenly matched and they fought all day. At the end of the day the army of the bad King held its ground, and he was declared the victor. The expensive war machines had helped, but less than hoped for. However, both armies were so weakened and exhausted by the fight that they turned around and went home, as they had so many times before.

The years after this battle were years of want. The harvest had failed and both kingdoms suffered. However, the kingdom of the bad King suffered much more than the kingdom of the good King, for there was no grain in its granaries. When the next little incident happened and was blown up into a huge crisis, both Kings assembled their armies and marched to the battleground on the border. This time the good King won the battle, because his men were stronger.

The good King advanced his army into the country of the bad King. He might not be able to take the whole country, but the good King had to let his men do a little rape and pillage as a reward for winning the battle. The bad King, realizing his precarious position, came out to parley with the good King. The bad King had nothing to offer but some used war machines and the hand of his daughter in marriage. The good King accepted that the daughter of the bad King should marry his son, and that when the two Kings had passed on to the greater battleground in the sky, the son of the good King would rule both countries. Thus the two kingdoms would become one united country, a country large and strong enough to make war on the countries on the far side of the mountains.

The moral of this story is that in times of plenty, make sure that the granaries are filled, for as surely as fat times follow lean times, lean times follow fat times, and the best protection against lean times is full granaries. On this matter, a King must beware of false counsel. When times are good, the false counsellors will say, "What could possibly go wrong? The times are fat and everyone is happy. Make the populace happier by selling off the grain in the granary and rewarding the citizens each according to what they have contributed." Even worse, when times are lean the false counsellors will say, "Times are awful and getting worse; we must take the grain out of the people's mouths and put it in the granaries, for the harvest next year could be even worse than this year." The point of a granary, or of any store of wealth, is to save the excess of the fat years so that it can be used during the lean years.