Monday, July 11, 2011

Will Microsoft Wipe Out the Google Android Advantage?

Microsoft seems to be working many different angles to wipe out any advantage that Google should get as the primary developer of the Android OS for smartphones. The sneaky angle is to persuade the smartphone makers to replace Google as the default search engine with Microsoft's own Bing search engine. For example, the LG Revolution runs the Android OS, yet as the PC Mag review says, it has been thoroughly Binged: Bing is the default search engine and Bing Maps is the default mapping application. My son, who recently got an LG Revolution, told me that he had to download Google Maps to get the mapping application he prefers.

Microsoft has also been going after the various Android smartphone makers, getting them to pay royalties on Microsoft patents. Reputedly, HTC is paying Microsoft $15 per smartphone to license patents. It would not surprise me if Microsoft is making more money from licensing patents to Android handset makers than it makes from selling its own Windows Phone 7 operating system to handset makers. It is certainly making more profit from Android than from Windows Phone 7.

Microsoft could apply pressure to other smartphone makers to set the default search engine to Bing in exchange for reduced patent royalties. If this becomes widespread, Google loses the advantage that it gets from having developed Android, and in the long run it even threatens the existence of Android. If Google gets no advantage from developing Android, why should it continue? This is only a tiny slice of what is going on with mobile patents, but the ball seems to be in Google's court and we are waiting to see what they do with it.

Friday, July 01, 2011


It happened yesterday. They just sent me an email. How is that a good way to be fired? Anyway, from today onwards I can tell you that I am no longer an Associate. I cannot fault Amazon's reasons for firing me. Yesterday Governor "Moonbeam" Brown signed a law saying that because Amazon has associates in California, it has to collect sales taxes from California residents, and Amazon, clinging to its "the internet is tax free" mantra, fired all of its associates who are California residents.

Also, I cannot really fault Governor Brown. He has a big budget gap to close, an intransigent legislature and the determination that he is going to fix the budget problem properly, or at least better than his predecessors did. So Brown signed legislation that would provide California with more revenue, or at least try to plug a gap in state revenue, or at least be a stage along the way to plugging the internet sales tax hole. This legislation on its own is not going to generate any extra revenue, but if enough other states with a sales tax pass similar legislation, maybe enough companies will throw in the towel and start complying with state sales tax collection.

What am I going to do? Well, I am not going to leave California. The weather is good, there are plenty of good jobs that pay well enough, and the other amenities, while not cheap, are well worth it. Also, I have to confess that I have not been a very good Amazon Associate, making practically nothing from all those seductive links sprinkled throughout this blog. In truth, I did not become an Amazon Associate to make money. The real reason was that I was concerned that if I copied a link from Amazon's web site, and you clicked on that link, you might see a message at the top of your screen that said something like "Hello Richard Taylor. Have we got some recommendations for you." As an Associate, I could get a good clean link for a product without having to worry about the link carrying other unwanted baggage.

So adios Amazon. I will just have to go back to guessing how to fix up links to your site so that they do the right thing.

Tuesday, June 28, 2011

Bitcoin as an Economic Entity

Bitcoin is the new peer-to-peer virtual currency that I wrote about previously. This post evaluates Bitcoin as money from an economic point of view. I will write a separate post on the technical and security aspects. Economists look at money as three things: a measure of value, a medium of exchange and a commodity, the last more commonly and politely called a store of value. Here is how Bitcoin measures up to these three functions.

One function of money is as a measure of value. When we use money to measure value, we do not mean that the money exists, rather that the asset, good or service is worth or cost the sum of money. Thus when we say that someone is a millionaire, this means that the sum of all they own minus their debts is more than a million dollars. It does not mean that they have a million dollars in bills stuffed into a mattress.

The men with the green eyeshades often talk about this purpose of money as a "unit of account", but thinking about it as a measure of value gets to the essence more quickly. So, when I am in a computer store trying to decide whether I should buy the $500 laptop or the $1000 laptop, I use money as a comparative measure of value, by asking whether the $1000 laptop is really worth twice the $500 laptop, and as an absolute measure of value, by asking whether I can afford the $1000 laptop that I really want or whether I should make do with the $500 laptop and save the difference for other needs.

For a measure of value, the best currency is the currency we are familiar with, that we are paid in and that we use every day. Anyone who has been abroad knows the difficulty of commerce with an unfamiliar currency. At first, after every transaction the thought lingers in the back of your mind, did we just get a deal, or were we robbed? However, with repeated use you pick up a new currency. By the end of a vacation you are starting to be able to predict what goods and services will cost in the new currency. When I played World of Warcraft (WOW), I quickly learned the value of WOW Gold through having to work with it all the time.

Bitcoin has another problem as a measure of value: its volatile exchange rate with other currencies. Since its introduction, it has appreciated against all other currencies by about 200,000%. Recently, heavy selling on a Bitcoin exchange caused its value to fluctuate between $0.01 and $17.00 over the period of a day. This volatility makes it difficult to use as a measure of value because its value is uncertain. Most currencies are managed by a central bank, and one of the purposes of a central bank is to keep the currency stable with respect to other currencies so that it can safely serve all three functions of money. The essence of Bitcoin, on the other hand, is that it is completely distributed with no central authority. As it is unmanaged, we can expect its exchange rate to be somewhat more volatile than that of other currencies.

Another function of money is as a medium of exchange. Before money existed, trading was difficult. If I led a cow to market with the intent of trading it for grain, I might come to an agreement with another farmer that my cow is worth 8 sacks of grain, except that I only want one sack of grain, the other farmer has only 5 sacks of grain to trade, and he does not want a cow anyway. With money, I can sell the cow for money to someone who wants a cow, buy just as much grain as I need and save any leftover money for other transactions in the future. Money as a medium of exchange greases the wheels of commerce by acting as an intermediary and thus removing barriers to trade.

Bitcoin scores high as a medium of exchange. It can be securely and anonymously traded on the internet for other goods and services. Also, it is almost infinitely divisible, so it serves for small exchanges. There are two caveats. Firstly, a Bitcoin transaction takes about 10 minutes to confirm, so sellers may be unwilling to accept it for immediate transactions where there is no recourse. That is, Bitcoin is good for selling Alpaca socks over the internet, but not for selling hot-dogs at Coney Island. As Bitcoin is an internet currency, this is only a concern for someone who sells virtual goods over the internet without recourse. The Bitcoin FAQ addresses this issue, saying:
"Do you have to wait 10 minutes in order to buy or sell things with BitCoin?
"No, it's reasonable to sell things without waiting for a confirmation as long as the transaction is not of high value.
"When people ask this question they are usually thinking about applications like supermarkets or snack machines, as discussed in this thread from July 2010. Zero confirmation transactions still show up in the GUI, but you cannot spend them. You can, however, reason about the risk involved in assuming you will be able to spend them in future. In general, selling things that are fairly cheap (like snacks, digital downloads etc) for zero confirmations will not pose a problem if you are running a well connected node."
The second caveat is that we typically maintain a reserve of any currency that we regularly use, as a float to smooth out transactions. Anyone concerned with the volatility of the value of Bitcoin may be unwilling to maintain a float in Bitcoin, and therefore will not have a convenient reserve of Bitcoin for doing transactions. If Bitcoin continues to have a volatile exchange rate with other currencies and users do not keep a reserve of Bitcoin for doing transactions, it becomes more cumbersome to use and therefore less useful as a medium of exchange. The end result would be that Bitcoin is only used when there is no alternative method of payment. The conclusion is that the usefulness of Bitcoin, or any other currency, as a medium of exchange depends on it having a reasonably stable value.

The final function of money is as a commodity like Gold, Oil or Frozen Concentrated Orange Juice (FCOJ). Currencies are commodities that are traded like other commodities for good legitimate reasons. For example, a company that contracts to buy a good that is priced in another currency may want to buy insurance against a change in the exchange rate that would cause the good to become more expensive than when they made the original commitment. Financial companies create and sell instruments that provide this insurance and then trade currencies as commodities to protect their position.

First, some words about commodities in general. Owning a commodity does not produce any value. Stocks and bonds may pay a dividend or interest, while a commodity pays nothing, so the only reason for owning a commodity as an investor is the hope that its value will increase so that it can be sold at a profit. In practice, owning a commodity is an even worse proposition, because the money tied up in the commodity could otherwise be earning interest, so owning a commodity is a losing proposition unless it increases in value. Then there is a cost for every trade, which further saps profits. Thus people who are not Bitcoin speculators will not want to hold more Bitcoin than they need for their day-to-day needs.

Commodity trading creates a market for the commodity that sets its price. The first test of a commodity is that there is a market where it can be traded efficiently. Bitcoin passes this test, as there are several markets for Bitcoin, although a recent attack against MtGox, the largest Bitcoin exchange, may reduce confidence. As an example of the efficiency of trading Bitcoin, MtGox charges a 0.65% fee on every trade.

When evaluating a commodity, we consider how it is used to understand the supply and demand that determine its fundamental price. Bitcoin is a fiat currency which has value because people find it useful as a medium of exchange, like the other fiat currencies: the Dollar, Pound, Euro or Yen. The key to understanding the value of Bitcoin, like any other currency, is money supply: the sum of all the money that people keep in their bank accounts and wallets to smooth out their transactions and grease the wheels of commerce, as discussed previously. However, there is one difference. With other currencies there is a central bank that manages the money supply to keep the value of the currency stable. With Bitcoin there is no central bank; instead, the amount of Bitcoin in circulation grows at a fixed rate set by the protocol. Thus the base value of Bitcoin depends on demand for its money supply.

The base demand for Bitcoin is for use as a medium of exchange. If more people regularly do Bitcoin transactions and keep it in their wallets to smooth out their transactions, or they tend to keep more Bitcoin in their wallets because they expect to use it for more transactions, there is more demand for the predictable supply of Bitcoin and therefore its price rises. Conversely, if fewer people keep Bitcoin in their wallets, or people keep less money in their wallets, the price falls. On top of this base demand, there is demand from speculators who expect the price of Bitcoin to rise and therefore hold it in investment-level quantities. The base demand for Bitcoin will tend to keep the price stable, while the speculative demand is likely to make the price more volatile.

Another consideration is whether there are any risks associated with owning the commodity. Bitcoin is a virtual currency, and a problem with other virtual currencies has been hyperinflation, caused by someone discovering a software bug that allows them to generate huge amounts of the currency without much effort. This has happened in several Massively Multiplayer Online Games (MMOG), but in each case the game has had a central server that hands out money and a game mechanism designed with a specific rate of exchange in mind. Bitcoin is different in that it has no central authority and it is traded in a free and open market that sets its value. An attack on Bitcoin could reduce its value, however this could be self-defeating, as it immediately reduces the value of the attack's proceeds. I will write a separate post on the security considerations; however, it is safe to say that as there is a vibrant market for Bitcoin, it is reasonably safe.

In summary, Bitcoin's purpose is to be used as a medium of exchange for transactions over the internet. Its base value comes from small amounts of it being held in a large number of users' wallets because they regularly use it as a medium of exchange. If Bitcoin is heavily used as a medium of exchange, this will tend to stabilize its exchange rate against other currencies and make it more useful as a currency when measured against all the functions of money.

eReaders Arrive

After writing about the Kindle for some time, I can let you know that I am now a proud owner of one. I can also tell you that it is a wonderful device, even more wonderful than I imagined, when used for reading the right kind of book, that is, the page-turner kind of book where you start on page 1 and keep turning pages until you get to the end. The other kind of book, the kind of book where you start with the index or table of contents and then jump around has been subsumed by the web with search and hyperlinks to the point where it is redundant anyway. Thus the Kindle is the perfect device for reading the only kind of book that is left, the kind of book that you read straight through.

I am not the only person who has recently bought an eReader. Today a Pew Internet research study showed that eReader ownership has doubled in the last 6 months. It is now up to 12% in the US and is currently growing faster than Tablet ownership. eReaders have been around for longer than the current incarnation of Tablets, and seem to be arriving at the period of mass adoption. Also, given the current price there is little reason not to own one.
An objection to the Kindle has always been that it is not a converged device. It is good for reading one kind of book and little else. Many commentators wanted it to be good at everything, and argued that otherwise it is just another device that you need to carry around. I particularly like that it is not a converged device. When I am reading a book on my Kindle, it will not interrupt my train of thought to announce that an email or tweet has arrived, or tempt me to play a silly game or fiddle with a Facebook page. With the Kindle I can walk away from the computer and read a book without all those interruptions and distractions that make life a disconnected stream of consciousness.

Sunday, June 19, 2011

Bitcoin, a Peer to Peer Virtual Currency

Bitcoin is a peer-to-peer virtual currency that seems to pop up in the conversation everywhere I look. A virtual currency is a currency that is created on computers and traded on the internet. A couple of examples of virtual currencies are Linden Dollars in the online world Second Life and Gold in the massive multiplayer online game World of Warcraft (WOW). People in third world countries play WOW to collect WOW Gold and sell it for real money to players in the first world so that they can buy more powerful armor, weapons and spells to use in the game. Bitcoin is different in that its purpose is to be a currency like dollars, euros or pounds, whereas Linden Dollars and WOW Gold are an element of their games and have no real purpose or value outside of the game.

The other aspect of Bitcoin is that it is a peer-to-peer currency. Bitcoin is created by mining for it against a cryptographic algorithm. Once Bitcoins are created, they are traded on a peer-to-peer network. When a transaction takes place, it is broadcast to the peers on the network, which confirm that the transaction is valid. The peer computers then add the transaction to the shared history so that it becomes permanent. There is no central authority that creates or manages Bitcoin; it manages itself through its network of peer computers all running the same software.
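
The mining and history-keeping described above can be illustrated with a toy proof-of-work loop. This is a simplified sketch, not the real Bitcoin algorithm: real mining uses double SHA-256 over a binary block header against a numeric difficulty target, while this example just searches for a hash with a few leading hex zeros.

```python
import hashlib

def mine(block_data: str, difficulty: int) -> int:
    """Search for a nonce such that the SHA-256 hash of
    block_data + nonce starts with `difficulty` hex zeros.
    A toy stand-in for Bitcoin's real proof of work."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce
        nonce += 1

# Each block commits to the hash of the previous block, so rewriting
# an old transaction would mean redoing all the work that came after it.
prev_hash = "0" * 64
for transactions in ["alice pays bob 5", "bob pays carol 2"]:
    block = f"{prev_hash}|{transactions}"
    nonce = mine(block, difficulty=4)
    prev_hash = hashlib.sha256(f"{block}{nonce}".encode()).hexdigest()
    print(prev_hash[:16], "found with nonce", nonce)
```

The expensive search in `mine` is what makes the transaction history hard to forge: an attacker who wants to alter an old block must redo its proof of work and that of every later block, faster than the honest peers extend the chain.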

One feature of Bitcoin that has excited interest is that it promises secure anonymous transactions, like cash, but over the internet. While this may seem like a good thing, it is also a problem as it means that Bitcoin is an extremely useful currency for people who want to get around the law. Bitcoin has the problem that it needs to establish itself as useful currency with a legitimate reason to be. If the major use of Bitcoin turns out to be to abet criminal activity it may find itself under attack from governments that want to suppress it.

I am going to do a couple of posts on Bitcoin, one examining the economic aspects and the other looking at the technical and security aspects. In the meantime, here are a number of links on related issues. My interest in a virtual currency comes from several directions. In the past I have written in this blog about both Virtual Goods and Virtual Economies.

A big question at the moment is the whole issue of what Money is. Some politicians, concerned about monetary policy, have called for a return to the Gold standard, which has prompted others to ask this question. This American Life did a podcast on the subject and came to the conclusion that Money is much more ephemeral than we may think. Planet Money did a related story where they looked at the small Pacific island of Yap, where giant round stones are used as money. When a stone changes hands because of a payment, as the stone is large and heavy, it remains where it is and everyone on the island just knows that it belongs to someone different. If you think that is strange, it is not that different from the way we manage gold. The gold bars sit in a bank vault and their ownership is digital bits recorded on a disk revolving at 7200 RPM. When the gold changes hands, a new record of ownership is written to the disk; the gold remains exactly where it is. I will have to write more about virtual goods in real economies another time.

Wednesday, May 25, 2011

On Copyright and Open Source

Copyright is a key part of an Open Source or Free Software project. It may sound like copyright is antithetical to Free and Open software, but if Richard Stallman, President of the Free Software Foundation (FSF), thinks that ownership of copyright is an important part of Free Software, then we should believe him. A couple of things have led me to these conclusions. Firstly, at the February meeting of the Business Intelligence SIG, Ian Fyfe discussed the business of Open Source suites and how Pentaho is able to offer a suite of Open Source projects as a commercial product by controlling the Open Source projects, and in particular the copyright to the code.

The other clue to the importance of copyright came by accident as I was looking at the difference between the emacs editor and the XEmacs editor. Emacs was an open software project that forked in the early 1990s, before the terms Free Software and Open Source had even been invented. One of the criticisms that Stallman, speaking for the emacs project, levels against the XEmacs project is that they have been sloppy about the ownership of the code and have not always got the "legal papers" that assign ownership of a contribution to the project. On this web page about XEmacs versus emacs, Stallman says:
"XEmacs is GNU software because it's a modified version of a GNU program. And it is GNU software because the FSF is the copyright holder for most of it, and therefore the legal responsibility for protecting its free status falls on us whether we want it or not. This is why the term "GNU XEmacs" is legitimate.

"But in another sense it is not GNU software, because we can't use XEmacs in the GNU system: using it would mean paying a price in terms of our ability to enforce the GPL. Some of the people who have worked on XEmacs have not provided, and have not asked other contributors to provide, the legal papers to help us enforce the GPL. I have managed to get legal papers for some parts myself, but most of the XEmacs developers have not helped me get them."
Note that GNU is the FSF "brand" for its software. The legal papers that Stallman references assign ownership and copyright of a code contribution to the FSF. Because the FSF owns the code, it can enforce its rights as owner against anyone who breaks its license. It can also change the terms of the license, and license the code to another party under any other license that it sees fit. The FSF has in fact changed the license terms of the code that it owns: as new versions of the GNU General Public License (GPL) have emerged, the FSF has upgraded the license to the latest version.

Copyright and Open Source is a study in contradictions. On the one hand, Richard Stallman has been "campaigning against both software patents and dangerous extension of copyright laws". On the other hand, he uses ownership of copyright to push his agenda through the GNU General Public License, which has a viral component: the source code of any software that is linked with GPL-licensed software must be published as open source software. I will write more about this issue.

A good Open Source project needs to make sure that everyone who contributes code to the project signs a document that assigns copyright of their contribution to the project. Unless care is taken to make all the code belong to a single entity, each person who has contributed to the code owns their contribution. If the project wants to be able to do anything with the code other than passively allow its distribution under its existing license, the code must be owned by a single entity. As Stallman says, the project may not be able to defend its own rights unless the code has clear ownership.

Wednesday, May 18, 2011

The Facebook PR Fiasco

Last week came the revelation that Facebook had secretly hired a prestigious Public Relations (PR) firm to plant negative stories about Google and its privacy practices. This is a completely ridiculous thing to have done and wrong in so many ways that it is difficult to know where to begin. Here are some of the top reasons as to why it was a bad idea.
  • Firstly, the idea that Facebook should be accusing anyone of playing fast and loose with people's privacy is severely hypocritical. Just last year, Mark Zuckerberg told us that "the age of privacy is over". Now he is trying to say that Google is worse for privacy than Facebook! And by the way, this revelation comes at the same time as Symantec has discovered a serious and longstanding security hole in the Facebook App API that allows a user's private data to leak. The only cure is to change your Facebook password, so if you are a Facebook user, go and change your password now!
  • Secondly, we come to the oxymoronic idea of a secret PR campaign. Anyone who thinks that a PR campaign can be secret does not understand PR.
  • Thirdly, a competent, let alone "prestigious", PR firm should have understood that the ruse was bound to be discovered and that the fallout would be much worse publicity than anything negative they could promulgate. Thus anyone who claims to understand PR should have guided their client to do something less radical and refused to get involved in the campaign. As it is, the PR firm Burson-Marsteller has lost a lot of credibility by being involved in the fiasco, and in PR credibility is everything.
  • Fourthly, the whole idea of a secret PR campaign against another company seems sophomoric, as if Facebook is run by a bunch of undergraduates who have little real world experience, and think that they will be able to get away with a jape like this. No wait …
  • Finally, if Facebook does want to launch a PR campaign on privacy, it should do so openly, by generating positive press that compares its supposedly good privacy policies with others' less good policies and behavior. As Machiavelli said, "A prince also wins prestige for being a true friend or a true enemy, that is, for revealing himself without any reservation in favor of one side against another", and he goes on to explain why openness and taking sides lead to better outcomes than pretended neutrality. As Facebook did their PR campaign in secret, we conclude that they could not have done it in public, and therefore that their privacy practices are no better than those of Google or anyone else.
Note: I was going to call this post "Pot hires PR firm to secretly call kettle black" until I read this article from the Atlantic about Search Engine Optimization (SEO) and the fact that as search engines do not have a sense of humor, humorous headlines do not work in the online world.

Saturday, May 07, 2011

Living In the Stream

It used to be that "stream of consciousness" was a pejorative. It was a phrase you used to put down the type of person who talked endlessly with little connection between what they said and what anyone else said, or even between what they had just said. Nowadays, the way we live our lives is in a stream of consciousness.

Text messages demand to be answered. If you do not answer a text within ten or fifteen minutes, the sender complains that you are ignoring them. Emails keep arriving, and a popup in the corner of the screen heralds their arrival. The popup contains an excerpt of the message designed to make you read the whole thing immediately, even though you know that it is junk or something that you should handle later. Instant message boxes pop up whenever you are online and cannot be ignored. Sometimes people call you on the phone, although good form these days is to IM someone first to see if you can call them. Finally there are the two great streams of consciousness that captivate our attention: Facebook and Twitter. Random stuff arrives in a random order, and as you have subscribed to the feeds, you keep looking at them to see if anything interesting has happened. In practice it is most likely to be a video of a cute animal doing something stupid.

How people survive and get anything done with these constant streams of distraction is a mystery to me. I do software, and sometimes I need to concentrate on a problem for a good period of time without interruption. It is not that I am necessarily thinking hard all the time, just that it can take time to investigate a problem or think through all the ramifications of a solution and any distraction just breaks the groove, meaning I have to start over. When this happens endlessly in a day my rate of getting stuff done drops towards zero.

So how do we fight back against constant disruption? The answer is to take control and not let others dictate the agenda. Firstly, establish that there are periods when you are off-line. I do not take my phone to the bathroom, or when I work out, or when I go to bed. Also, I do not answer the phone when driving alone, and have my passenger answer when I am not alone. All our means of communication apart from voice have a buffer, so they do not need to be answered immediately; for voice there is a thing called voicemail. On the other hand, voicemail introduces us to the game of telephone tag, which is fun for those who like playing it and exceedingly annoying for the rest of us.

Secondly, you do need to "return your calls", as they used to say. Which brings us to the crux of the matter. If you want to be part of the conversation, you need to take part in it. Unfortunately, these days what you have to do is "return your calls", respond to your texts, answer your emails, react to IMs, post to Facebook and Twitter to show that you are a conscious sentient being, and finally do something to make a living. So it comes down to picking conversations, and thinking hard about which conversations you want to join. Do this right and we become Islands in the Stream, which is the most we can hope to be these days.

Sunday, May 01, 2011

Understanding MapReduce Performance: Part 2

Getting good performance out of MapReduce is a matter of understanding two concepts. I discussed the first one, that MapReduce is designed to run on large clusters, in a post last week. Here is the second concept, and it is something that everyone who uses MapReduce needs to grapple with. MapReduce works by breaking the processing task into a huge number of little pieces so that the work can be distributed over the cluster and done in parallel. Each Map task and each Reduce task is a separate task that can be scheduled to run in parallel with other tasks. For both Map and Reduce, the number of tasks needs to be much larger than the number of nodes in the cluster.

The archetypal example of MapReduce is to count word frequency in a large number of documents. A Map task reads a document and outputs a tuple for each word with the count of occurrences of the word in the document. A Reduce task takes a word and accumulates a total count for the word from the per-document counts produced by each Map task. In this example, there are a large number of documents as input to the Map tasks and presumably a large number of words, so that there are a large number of Reduce tasks. Another illustration of this principle is found in the Sort Benchmark disclosure that I discussed in the previous post. For the Gray sort, the 100 TB of data is broken into 190,000 separate Maps and there are 10,000 Reduces for a cluster of 3400 nodes.
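
The word-count example can be sketched in a few lines of Python. This shows only the map and reduce logic in miniature, with an in-memory shuffle standing in for the framework; a real Hadoop job would distribute thousands of these tasks across the cluster.

```python
from collections import defaultdict

def map_task(doc: str) -> list[tuple[str, int]]:
    """Emit a (word, count) tuple for each distinct word in one document."""
    counts = defaultdict(int)
    for word in doc.lower().split():
        counts[word] += 1
    return list(counts.items())

def reduce_task(word: str, counts: list[int]) -> tuple[str, int]:
    """Accumulate the per-document counts for one word."""
    return (word, sum(counts))

documents = ["the cat sat on the mat", "the dog"]

# Shuffle phase: group the intermediate tuples by key (word),
# so each Reduce task sees all the counts for one word.
grouped = defaultdict(list)
for doc in documents:
    for word, count in map_task(doc):
        grouped[word].append(count)

totals = dict(reduce_task(w, c) for w, c in grouped.items())
print(totals["the"])  # → 3
```

Note that the number of Reduce tasks here equals the number of distinct words, which is why a large vocabulary keeps the Reduce phase as parallel as the Map phase.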

While most users of MapReduce get the idea that MapReduce needs its input data broken into lots of little pieces so that there are many Map tasks, they forget about the same requirement for Reduce tasks. Searching the internet, it is easy to find examples of MapReduce with a small number of Reduce tasks. One is a tutorial from the University of Wisconsin where there is ultimately only one Reduce task. It is particularly galling that this example comes from the University of Wisconsin, which has a large and prestigious program of parallel database system research. In their defense, the tutorial does show how to do intermediate reduction of the data, but that does not prevent it from being a bad example in general.

Sometimes the problem is too small. What do you do if the problem you are working on just involves the computation of a single result? The answer is to enlarge the problem. In a large cluster it is better to compute more results, even though they may not be of immediate use to you. Let's look at an example. Say you want to analyze a set of documents for the frequency of the word 'the'. The natural thing to do is process all the documents, filter for the word 'the' in the Map function and count the results in the Reduce function. This is how you are taught to use "valuable" computing resources. In practice, with MapReduce it is better to count the frequency of all the words in the documents and save the results. It is not a lot more effort for the MapReduce engine to count every word, and if you then want to know how many uses there are of 'a' or any other word, the results are there for you immediately.
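
To see why the filtered job parallelizes badly, compare the number of distinct Reduce keys each approach produces. This sketch is plain Python standing in for a real MapReduce job, with invented sample documents; only the key counts matter.

```python
documents = [
    "the quick brown fox jumps over the lazy dog",
    "a stitch in time saves nine",
    "the early bird catches the worm",
]

# Approach 1: filter for 'the' in the Map phase. Every intermediate
# tuple carries the same key, so all the work funnels into a single
# Reduce task and the Reduce phase has no parallelism at all.
filtered = [w for doc in documents for w in doc.split() if w == "the"]
print(len(set(filtered)))  # → 1

# Approach 2: count every word and save the results. Each distinct
# word is its own Reduce key, so the Reduce phase spreads across the
# cluster, and the answer for 'the' (or 'a', or any other word) is
# just a lookup in the saved output afterwards.
all_words = [w for doc in documents for w in doc.split()]
print(len(set(all_words)))  # one Reduce key per distinct word
```

With real document collections the second approach yields millions of Reduce keys, which is exactly what MapReduce needs to keep a large cluster busy.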

A common analogy casts MapReduce as a freight train and a relational database as a racing car. The freight train carries a huge load but is slow to start and stop. The racing car is fast and nimble but carries only one person. Relational database systems rely on you to use the where clause to reduce the data they have to analyze, and in return give you the answer in a short time. MapReduce does not give you an answer as quickly, but it is capable of effectively processing a lot more data. With MapReduce you should process all the data and save the results, then use them as you need them. We can sum up this way of thinking about MapReduce with the slogan "no where clauses".

Thursday, April 28, 2011

Understanding MapReduce Performance: Part 1

Currently MapReduce is riding high on the hype cycle. The other day I saw a presentation that was nothing but breathless exhortation: MapReduce is the next big thing, and we had better all jump on the bandwagon as soon as possible. However, there are rumblings of performance problems. At the recent Big Data Camp, Greenplum reported that their MapReduce was 100 times slower than their database system. Searching the web finds many people complaining about MapReduce performance, particularly with NoSQL systems like MongoDB. That is a problem because MapReduce is the data analysis tool for processing NoSQL data; for MongoDB, anything more than the most trivial reporting requires the use of MapReduce.

At the same time there is plenty of evidence that MapReduce is no performance slouch. The Sort Benchmark is a prime measure of computer system performance, and the Hadoop MapReduce system currently holds two of the six titles for which it is eligible. One is the Gray test, for sorting 100 Terabytes (TB) of data, which it did in 173 minutes. The other is the Minute test, for sorting 500 Gigabytes (GB) of data in under a minute. These results are as of May 2010, and the Sort Benchmark is run every year, so we can expect better performance in the future.

Understanding MapReduce performance is a matter of understanding two simple concepts. The first concept is that the design center for MapReduce systems like Hadoop is running large jobs on a large distributed cluster. To get a feel for what this means, look at the Hadoop disclosure document for the Sort Benchmark. The run for sorting 100 TB was made on a cluster of about 3400 nodes. Each node had 8 cores, 4 disks, 16 GB of RAM, and 1 Gb Ethernet. For the Minute sort, a smaller cluster of about 1400 nodes was used, with the same configuration except 8 GB of RAM per node. That is not to say that MapReduce will only work on thousand-node clusters. Most systems are much smaller than this; however, Hadoop is specifically designed so that it will scale to run on a huge cluster.

One problem with a large cluster is that nodes break down. Hadoop has several features that transparently work around broken nodes and continue processing in the presence of failure. From the Sort Benchmark disclosure, for the Gray sort run every processing task is replicated: two nodes are assigned to each task, so that should one node break down, the sort can still continue with the data from the other node. This was not done for the Minute test because the likelihood of a node breaking down in the minute the test takes to run is low enough to be ignored.

Another large-cluster feature that has an important effect on performance is that all intermediate results are written to disk. The results of all the Mappers are written to disk, and the sorted data for the Reducers is written to disk. This is done so that if a node fails, only a small piece of work needs to be redone. By contrast, relational database systems go to great lengths to ensure that after data has been read from disk, it does not touch the disk again before being delivered to the user. If a node fails in a relational database system, the whole system goes into an error state and then does a recovery, which can take some time. That approach is extremely disruptive when a node fails but much better for performance when there is no failure. Relational database systems were not designed to run on thousands of nodes, so they treat a node failure as a very rare event, whereas Hadoop is designed as if it were commonplace. The consequence is that Hadoop performance can look slow when compared to a relational database on a small cluster.

Unfortunately, there is not a lot that a user can do about this, except look for a bigger cluster to run their analysis on, or look for bigger data to analyze. That is the subject of the second part of this post, where I will talk about the other simple concept for understanding MapReduce performance.