Monday, July 11, 2011

Will Microsoft Wipe Out the Google Android Advantage?

Microsoft seems to be working many different angles to wipe out any advantage that Google should get as the primary developer of the Android OS for smartphones. The sneaky angle is to persuade smartphone makers to replace Google with Microsoft's own Bing as the default search engine. For example, the LG Revolution runs the Android OS, yet as the PC Mag review says, it has been thoroughly Binged: Bing is the default search engine and Bing Maps is the default mapping application. My son, who recently got an LG Revolution, told me that he had to download Google Maps to get the mapping application he prefers.

Microsoft has also been going after the various Android smartphone makers, getting them to pay royalties on Microsoft patents. Reputedly, HTC is paying Microsoft $15 per smartphone to license patents. It would not surprise me if Microsoft is making more money from licensing patents to Android handset makers than it makes from selling its own Windows Phone 7 operating system to handset makers. It is certainly making more profit from Android than from Windows Phone 7.

Microsoft could apply pressure to other smartphone makers to set the default search engine to Bing in exchange for reduced royalty payments. If this becomes widespread, Google loses the advantage that it gets from having developed Android, and in the long run it even threatens the existence of Android. If Google gets no advantage from developing Android, why should it continue? This is only a tiny slice of what is going on with mobile patents, but the ball seems to be in Google's court and we are waiting to see what they do with it.

Friday, July 01, 2011

Fired as an Amazon Associate

It happened yesterday. They just sent me an email. How is that a good way to be fired? Anyway, from today onwards I can tell you that I am no longer an Associate. I cannot fault Amazon's reasons for firing me. Yesterday Governor "Moonbeam" Brown signed a law saying that because Amazon has associates in California, it must collect sales taxes from California residents. Amazon, clinging to its "the internet is tax free" mantra, fired all of its associates who are California residents.

Also, I cannot really fault Governor Brown. He has a big budget gap to close, an intransigent legislature and the determination to fix the budget problem properly, or at least better than his predecessors did. So Brown signed legislation that would provide California with more revenue, or at least try to plug a gap in state revenue, or at least be a stage along the way to plugging the internet sales tax hole. This legislation on its own is not going to generate any extra revenue; however, if enough other states with a sales tax pass similar legislation, maybe enough companies will throw in the towel and start complying with state sales tax collection.

What am I going to do? Well I am not going to leave California. The weather is good, there are plenty of good jobs that pay well enough and the other amenities, while not cheap, are well worth it. Also, I have to confess that I have not been a very good Amazon Associate, making practically nothing from all those seductive links sprinkled throughout this blog. In truth, I did not become an Amazon Associate to make money. The real reason was that I was concerned that if I copied a link from the web site, and if you clicked on that link, you might see a message at the top of your screen that said something like "Hello Richard Taylor. Have we got some recommendations for you." As an Associate, I could get a good clean link for a product without having to worry about the fact that the link could have other unwanted baggage attached.

So adios Amazon. I will just have to go back to guessing how to fix up links to your site so that they do the right thing.

Tuesday, June 28, 2011

Bitcoin as an Economic Entity

Bitcoin is the new peer-to-peer virtual currency that I wrote about previously. This post evaluates Bitcoin as money from an economic point of view; I will write a separate post on technical and security aspects. Economists look at money as three things: a measure of value, a medium of exchange and a commodity, more commonly and politely called a store of value. Here is how Bitcoin measures up to these three functions.

One function of money is as a measure of value. When we use money to measure value, we do not mean that the money exists, rather that the asset, good or service is worth or cost the sum of money. Thus when we say that someone is a millionaire, this means that the sum of all they own minus their debts is more than a million dollars. It does not mean that they have a million dollars in bills stuffed into a mattress.

The men with the green eyeshades often talk about this purpose of money as a "unit of account", but thinking about it as a measure of value gets to the essence more quickly. So, when I am in a computer store trying to decide whether I should buy the $500 laptop or the $1000 laptop, I use money as a comparative measure of value, by asking whether the $1000 laptop is really worth twice as much as the $500 laptop, and as an absolute measure of value, by asking whether I can afford the $1000 laptop that I really want or whether I should make do with the $500 laptop and save the difference for other needs.

For a measure of value, the best currency is the currency we are familiar with, that we are paid in and that we use every day. Anyone who has been abroad knows the difficulty of commerce with an unfamiliar currency. At first, after every transaction the thought lingers in the back of your mind, did we just get a deal, or were we robbed? However, with repeated use you pick up a new currency. By the end of a vacation you are starting to be able to predict what goods and services will cost in the new currency. When I played World of Warcraft (WOW), I quickly learned the value of WOW Gold through having to work with it all the time.

Bitcoin has another problem as a measure of value: its volatile exchange rate with other currencies. Since its introduction, it has appreciated against other currencies by about 200,000%. Recently, heavy selling on a Bitcoin exchange caused its value to fluctuate between $0.01 and $17.00 over the course of a day. This volatility makes Bitcoin difficult to use as a measure of value because its own value is so uncertain. Most currencies are managed by a central bank, and one of the purposes of a central bank is to keep the currency stable with respect to other currencies so that it can safely serve all three functions of money. The essence of Bitcoin, on the other hand, is that it is completely distributed with no central authority. As it is unmanaged, we can expect its exchange rate to be somewhat more volatile than that of other currencies.

Another function of money is as a medium of exchange. Before money existed, trading was difficult. If I led a cow to market with the intent of trading it for grain, I might come to an agreement with another farmer that my cow is worth 8 sacks of grain, except that I only want one sack of grain, the other farmer only has 5 sacks of grain to trade, and he does not want a cow anyway. With money, I can sell the cow to someone who wants a cow, buy just as much grain as I need and save any leftover money for other transactions in the future. Money as a medium of exchange greases the wheels of commerce by acting as an intermediary and thus removing barriers to trade.

Bitcoin scores high as a medium of exchange. It can be securely and anonymously traded on the internet for other goods and services. Also, it is almost infinitely divisible, so it serves for small exchanges. There are two caveats. Firstly, a Bitcoin transaction takes about 10 minutes to confirm, so sellers may be unwilling to accept it for immediate transactions where there is no recourse. That is, Bitcoin is good for selling Alpaca socks over the internet, but not for selling hot dogs at Coney Island. As Bitcoin is an internet currency, this is only of concern to someone who sells virtual goods over the internet without recourse. The Bitcoin FAQ addresses this issue, saying:
"Do you have to wait 10 minutes in order to buy or sell things with BitCoin?

"No, it's reasonable to sell things without waiting for a confirmation as long as the transaction is not of high value.

"When people ask this question they are usually thinking about applications like supermarkets or snack machines, as discussed in this thread from July 2010. Zero confirmation transactions still show up in the GUI, but you cannot spend them. You can however reason about the risk involved in assuming you will be able to spend them in future. In general, selling things that are fairly cheap (like snacks, digital downloads etc) for zero confirmations will not pose a problem if you are running a well connected node."
The second caveat is that we typically maintain a reserve of any currency that we regularly use as a float to smooth out transactions. Anyone concerned about the volatility of the value of Bitcoin may be unwilling to maintain a float in Bitcoin and therefore will not have a convenient reserve of Bitcoin for doing transactions. If Bitcoin continues to have a volatile exchange rate with other currencies and users do not keep a reserve of Bitcoin for doing transactions, it becomes more cumbersome to use and therefore less useful as a medium of exchange. The end result would be that Bitcoin is only used when there is no alternative method of payment. The conclusion is that the usefulness of Bitcoin, or any other currency, as a medium of exchange depends on it having a reasonably stable value.

The final function of money is as a commodity like Gold, Oil or Frozen Concentrated Orange Juice (FCOJ). Currencies are commodities that are traded like other commodities for good legitimate reasons. For example, a company that contracts to buy a good that is priced in another currency may want to buy insurance against a change in the exchange rate that would cause the good to become more expensive than when they made the original commitment. Financial companies create and sell instruments that provide this insurance and then trade currencies as commodities to protect their position.

First, some words about commodities in general. Owning a commodity does not produce any value. Stocks and bonds may pay a dividend or interest, while a commodity pays nothing, so the only reason for an investor to own a commodity is the hope that its value will increase so that it can be sold at a profit. In practice, owning a commodity is an even worse proposition, because money is tied up in the commodity that could otherwise be earning interest, so owning a commodity is a losing proposition unless it increases in value. Then there is a cost for every trade, which further saps profits. Thus people who are not Bitcoin speculators will not want to hold more Bitcoin than they need for their day-to-day needs.

Commodity trading creates a market for the commodity that sets its price. The first test of a commodity is that there is a market where it can be traded efficiently. Bitcoin passes this test, as there are several markets for Bitcoin, although a recent attack against MtGox, the largest Bitcoin exchange, may reduce confidence. As an example of the efficiency of trading Bitcoin, MtGox charges a fee of 0.65% on every trade.

When evaluating a commodity, we consider how it is used to understand the supply and demand that determines its fundamental price. Bitcoin is a fiat currency which has value because people find it useful as a medium of exchange, like the other fiat currencies: the Dollar, Pound, Euro or Yen. The key to understanding the value of Bitcoin, like any other currency, is money supply, the sum of all the money that people keep in their bank accounts and wallets to smooth out their transactions and grease the wheels of commerce, as discussed previously. However, there is one difference. With other currencies there is a central bank that manages the money supply to keep the value of the currency stable. With Bitcoin, there is no central bank; instead, the amount of Bitcoin in circulation grows at a slow, predetermined rate. Thus the base value of Bitcoin depends on demand for its money supply.

The base demand for Bitcoin is its use as a medium of exchange. If more people regularly do Bitcoin transactions and keep Bitcoin in their wallets to smooth out those transactions, or they tend to keep more Bitcoin in their wallets because they expect to use it for more transactions, there is more demand for the limited supply of Bitcoin and therefore its price rises. Conversely, if fewer people keep Bitcoin in their wallets, or people keep less in their wallets, the price falls. On top of this base demand, there is demand from speculators who expect the price of Bitcoin to rise and therefore hold it in investment-level quantities. The base demand for Bitcoin will tend to keep the price stable, while the speculative demand is likely to make the price more volatile.

Another consideration is whether there are any risks associated with owning the commodity. Bitcoin is a virtual currency, and a problem with other virtual currencies has been hyperinflation, caused by someone discovering a software bug that allows them to generate huge amounts of the currency without much effort. This has happened in several Massively Multiplayer Online Games (MMOG), but in each case the game had a central server that hands out money and a game mechanism designed with a specific rate of exchange in mind. Bitcoin is different in that it does not have a central authority, and it is traded in a free and open market that sets its value. An attack on Bitcoin could reduce its value, however this could be self-defeating as it immediately reduces the value of the attack. I will write a separate post on the security considerations; however, it is safe to say that as there is a vibrant market for Bitcoin, it is reasonably safe.

In summary, Bitcoin's purpose is to be used as a medium of exchange for transactions over the internet. Its base value comes from small amounts of it being held in a large number of users' wallets because they regularly use it as a medium of exchange. If Bitcoin is heavily used as a medium of exchange, this will tend to stabilize its exchange rate against other currencies and make it more useful as a currency when measured against all the functions of money.

eReaders Arrive

After writing about the Kindle for some time, I can let you know that I am now a proud owner of one. I can also tell you that it is a wonderful device, even more wonderful than I imagined, when used for reading the right kind of book, that is, the page-turner kind of book where you start on page 1 and keep turning pages until you get to the end. The other kind of book, the kind where you start with the index or table of contents and then jump around, has been subsumed by the web, with its search and hyperlinks, to the point where it is redundant anyway. Thus the Kindle is the perfect device for reading the only kind of book that is left, the kind that you read straight through.

I am not the only person who has recently bought an eReader. Today a Pew Internet research study showed that eReader ownership has doubled in the last 6 months. It is now up to 12% in the US and is currently growing faster than Tablet ownership. eReaders have been around for longer than the current incarnation of Tablets, and seem to be arriving at the period of mass adoption. Also, given the current price there is little reason not to own one.

An objection to the Kindle has always been that it is not a converged device. It is good for reading one kind of book and little else. Many commentators wanted it to be good at everything, and argued that otherwise it is just another device that you need to carry around. I particularly like that it is not a converged device. When I am reading a book on my Kindle, it will not interrupt my train of thought to announce that an email or a tweet has arrived, or tempt me to play a silly game or fiddle with a Facebook page. With the Kindle I can walk away from the computer and read a book without all those interruptions and distractions that make life a disconnected stream of consciousness.

Sunday, June 19, 2011

Bitcoin, a Peer to Peer Virtual Currency

Bitcoin is a peer-to-peer virtual currency that seems to pop up in the conversation everywhere I look. A virtual currency is a currency that is created on computers and traded on the internet. A couple of examples of virtual currencies are Linden Dollars in the online world Second Life and Gold in the massive multiplayer online game World of Warcraft (WOW). People in third world countries play WOW to collect WOW Gold and sell it for real money to players in the first world, who use it to buy more powerful armor, weapons and spells in the game. Bitcoin is different in that its purpose is to be a currency like dollars, euros or pounds, whereas Linden Dollars and WOW Gold are elements of their games and have no real purpose or value outside of the game.

The other aspect of Bitcoin is that it is a peer-to-peer currency. Bitcoin is created by mining: computers race to solve a cryptographic puzzle, and the winner is rewarded with new Bitcoins. Once Bitcoins are created, they are traded on a peer-to-peer network. When a transaction has taken place, it is broadcast to the peers on the network, which confirm that the transaction is valid and has taken place. The peer computers then add the transaction to the history so that it becomes permanent. There is no central authority that creates or manages Bitcoin; it manages itself through its network of peer computers all running the same software.
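The mining race can be illustrated with a toy proof-of-work loop. To be clear, this is only a sketch of the general idea, not the real Bitcoin protocol (which uses double SHA-256 over a binary block header against a numeric difficulty target); the function name and the leading-zeros difficulty scheme here are my own invention for illustration.

```python
import hashlib

def mine(block_data: str, difficulty: int) -> int:
    """Toy proof-of-work: find a nonce such that the SHA-256 hash of
    "block_data:nonce" starts with `difficulty` zero hex digits.
    Illustrative only, not the actual Bitcoin algorithm."""
    target = "0" * difficulty
    nonce = 0
    while True:
        digest = hashlib.sha256(f"{block_data}:{nonce}".encode()).hexdigest()
        if digest.startswith(target):
            return nonce  # hard to find, but cheap for any peer to verify
        nonce += 1

# Each extra zero digit makes the search roughly 16 times harder, which is
# how difficulty can be tuned as more computing power joins the network.
nonce = mine("alice pays bob 5 BTC", difficulty=4)
```

Verifying the proof takes a single hash, which is why every peer on the network can cheaply check work that took a miner a huge number of attempts to produce.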

One feature of Bitcoin that has excited interest is that it promises secure anonymous transactions, like cash, but over the internet. While this may seem like a good thing, it is also a problem, as it means that Bitcoin is an extremely useful currency for people who want to get around the law. Bitcoin needs to establish itself as a useful currency with a legitimate reason to be. If the major use of Bitcoin turns out to be abetting criminal activity, it may find itself under attack from governments that want to suppress it.

I am going to do a couple of posts on Bitcoin, one examining the economic aspects, and the other looking at the technical and security aspects. In the meantime, here are a number of links on related issues. My interest in virtual currencies comes from several directions. In the past I have written in this blog about both Virtual Goods and Virtual Economies.

A big question at the moment is the whole issue of what Money is. Some politicians, concerned about monetary policy, have called for a return to the Gold standard, which has resulted in others asking this question. This American Life did a podcast on that subject and came to the conclusion that Money is much more ephemeral than we may think. Planet Money did a related story where they looked at the small Pacific island of Yap, where giant round stones are used as money. When a stone changes hands because of a payment, as the stone is large and heavy, it remains where it is and everyone on the island just knows it belongs to someone different. If you think that is strange, it is not that different from the way we manage gold. The gold bars sit in a bank vault and their ownership is digital bits recorded on a disk revolving at 7200 RPM. When the gold changes hands, a new record of ownership is written to the disk, but the gold remains exactly where it is. I will have to write more about virtual goods in real economies another time.

Wednesday, May 25, 2011

On Copyright and Open Source

Copyright is a key part of an Open Source or Free Software project. It may sound like copyright is antithetical to Free and Open software, but if Richard Stallman, President of the Free Software Foundation (FSF), thinks that ownership of copyright is an important part of Free Software, then we should believe him. A couple of things have led me to this conclusion. Firstly, at the February meeting of the Business Intelligence SIG, Ian Fyfe discussed the business of Open Source suites and how Pentaho is able to offer a suite of Open Source projects as a commercial product by controlling the Open Source projects, and in particular the copyright to the code.

The other clue to the importance of copyright came by accident as I was looking at the difference between the emacs editor and the XEmacs editor. Emacs was an open software project that forked in the early 1990s, before the terms Free Software and Open Source had even been invented. One of the criticisms that Stallman, speaking for the emacs project, levels against the XEmacs project is that they have been sloppy about the ownership of the code and have not always got the "legal papers" that assign ownership of a contribution to the project. On this web page about XEmacs versus emacs, Stallman says:
"XEmacs is GNU software because it's a modified version of a GNU program. And it is GNU software because the FSF is the copyright holder for most of it, and therefore the legal responsibility for protecting its free status falls on us whether we want it or not. This is why the term "GNU XEmacs" is legitimate.

"But in another sense it is not GNU software, because we can't use XEmacs in the GNU system: using it would mean paying a price in terms of our ability to enforce the GPL. Some of the people who have worked on XEmacs have not provided, and have not asked other contributors to provide, the legal papers to help us enforce the GPL. I have managed to get legal papers for some parts myself, but most of the XEmacs developers have not helped me get them."
Note that GNU is the FSF "brand" for its software. The legal papers that Stallman references assign ownership and copyright of a code contribution to the FSF. Because the FSF owns the code, it can enforce its rights as owner against anyone who breaks its license. It can also change the terms of the license, and license the code to another party under any other license that it sees fit. The FSF has in fact changed the license terms of the code that it owns: as new versions of the GNU Public License (GPL) have emerged, the FSF has upgraded the license to the latest version.

Copyright and Open Source is a study in contradictions. On the one hand, Richard Stallman has campaigned against both software patents and dangerous extensions of copyright laws. On the other hand, he uses ownership of copyright to push his agenda through the GNU Public License, which has a viral component: the source code of any software that is linked with GPL-licensed software must be published as open source software. I will write more about this issue.

A good Open Source project needs to make sure that everyone who contributes code to the project signs a document that assigns copyright of their contribution to the project. Unless care is taken to make all the code belong to a single entity, each person who has contributed to the code owns their contribution. If the project wants to be able to do anything with the code other than passively allow its distribution under its existing license, the code must be owned by a single entity. As Stallman says, the project may not be able to defend its own rights unless the code has clear ownership.

Wednesday, May 18, 2011

The Facebook PR Fiasco

Last week came the revelation that Facebook had secretly hired a prestigious Public Relations (PR) firm to plant negative stories about Google and its privacy practices. This is a completely ridiculous thing to have done, and wrong in so many ways that it is difficult to know where to begin. Here are some of the top reasons why it was a bad idea.
  • Firstly, the idea that Facebook should be accusing anyone of playing fast and loose with people's privacy is severely hypocritical. Just last year, Mark Zuckerberg told us that "the age of privacy is over". Now he is trying to say that Google is worse for privacy than Facebook! And by the way, this revelation comes at the same time as Symantec has discovered a serious and longstanding security hole in the Facebook App API that allows a user's private data to leak. The only cure is to change your Facebook password, so if you are a Facebook user, go and change your password now!
  • Secondly, we come to the oxymoronic idea of a secret PR campaign. Anyone who thinks that a PR campaign can be secret does not understand PR.
  • Thirdly, a competent, let alone "prestigious", PR firm should have understood that the ruse was bound to be discovered and that the fallout would be much worse publicity than anything negative they could promulgate. Thus anyone who claims to understand PR should have guided their client to do something less radical and refused to get involved in the campaign. As it is, the PR firm of Burson-Marsteller has lost a lot of its credibility by being involved in the fiasco, and in PR credibility is everything.
  • Fourthly, the whole idea of a secret PR campaign against another company seems sophomoric, as if Facebook is run by a bunch of undergraduates who have little real world experience, and think that they will be able to get away with a jape like this. No wait …
  • Finally, if Facebook does want to launch a PR campaign on privacy, they should do so openly, by generating positive press that compares their supposedly good privacy policies with others' less good privacy policies and behavior. As Machiavelli said, "A prince also wins prestige for being a true friend or a true enemy, that is, for revealing himself without any reservation in favor of one side against another", and he goes on to explain why openness and taking sides lead to better outcomes than pretended neutrality. As Facebook did their PR campaign in secret, we conclude that they could not have done it in public and therefore their privacy practices are no better than those of Google or anyone else.
Note: I was going to call this post "Pot hires PR firm to secretly call kettle black" until I read this article from the Atlantic about Search Engine Optimization (SEO) and the fact that as search engines do not have a sense of humor, humorous headlines do not work in the online world.

Saturday, May 07, 2011

Living In the Stream

It used to be that "stream of consciousness" was a pejorative. It was a phrase you used to put down the type of person who talked endlessly with little connection between what they said and what anyone else said, or even between what they had just said. Nowadays, the way we live our lives is in a stream of consciousness.

Text messages demand to be answered. If you do not answer a text within ten or fifteen minutes, the sender complains that you are ignoring them. Emails keep arriving, and a popup in the corner of the screen heralds their arrival. The popup contains an excerpt of the message designed to make you read the whole thing immediately, even though you know that it is junk or something that you should handle later. Instant message boxes pop up whenever you are online and cannot be ignored. Sometimes people call you on the phone, although good form these days is to IM someone first to see if you can call them. Finally there are the two great streams of consciousness that captivate our attention: Facebook and Twitter. Random stuff arrives in a random order, and as you have subscribed to the feeds you keep looking at them to see if anything interesting has happened. In practice it is most likely to be a video of a cute animal doing something stupid.

How people survive and get anything done with these constant streams of distraction is a mystery to me. I do software, and sometimes I need to concentrate on a problem for a good period of time without interruption. It is not that I am necessarily thinking hard all the time, just that it can take time to investigate a problem or think through all the ramifications of a solution and any distraction just breaks the groove, meaning I have to start over. When this happens endlessly in a day my rate of getting stuff done drops towards zero.

So how do we fight back against constant disruption? The answer is to take control and not let others dictate the agenda. Firstly, establish that there are periods when you are off-line. I do not take my phone to the bathroom, or when I work out or when I go to bed. Also, I do not answer the phone when driving alone, and have my passenger answer when I am not alone. All our means of communication apart from voice have a buffer so that they do not need to be answered immediately, and for voice there is a thing called voicemail. On the other hand, voicemail introduces us to the game of telephone tag, which is fun for those who like playing it and exceedingly annoying for the rest of us.

Secondly, you do need to "return your calls", as they used to say. Which brings us to the crux of the matter. If you want to be part of the conversation, you need to take part in it. Unfortunately, these days what you have to do is "return your calls", respond to your texts, answer your emails, react to IMs, post to Facebook and Twitter to show that you are a conscious sentient being, and finally do something to make a living. So it comes down to picking conversations, and thinking hard about which conversations you want to join. Do this right and we become Islands in the Stream, which is the most we can hope to be these days.

Sunday, May 01, 2011

Understanding MapReduce Performance: Part 2

Getting good performance out of MapReduce is a matter of understanding two concepts. I discussed the first, that MapReduce is designed to run on large clusters, in a post last week. Here is the second concept, and it is something that everyone who uses MapReduce needs to grapple with. MapReduce works by breaking the processing task into a huge number of little pieces so that the work can be distributed over the cluster and done in parallel. Each Map task and each Reduce task is a separate task that can be scheduled to run in parallel with other tasks. For both Map and Reduce, the number of tasks needs to be much larger than the number of nodes in the cluster.

The archetypal example of MapReduce is counting word frequency in a large number of documents. A Map task reads a document and outputs a tuple for each word with the count of occurrences of the word in that document. A Reduce task takes a word and accumulates a total count for the word from the per-document counts produced by each Map task. In this example, there are a large number of documents as input to the Map tasks and presumably a large number of words, so that there are a large number of Reduce tasks. Another illustration of this principle is found in the Sort Benchmark disclosure that I discussed in the previous post. For the Gray sort, the 100 TB of data is broken into 190,000 separate Maps and there are 10,000 Reduces for a cluster of 3400 nodes.
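The word-count flow can be sketched in a few lines of Python. This is a single-process simulation of what a framework like Hadoop distributes across a cluster; the function names and the toy documents are mine, not any real MapReduce API.

```python
from collections import defaultdict

def map_task(document: str):
    """Map: emit a (word, count) tuple for each word in one document."""
    counts = defaultdict(int)
    for word in document.lower().split():
        counts[word] += 1
    return list(counts.items())

def reduce_task(word: str, per_doc_counts):
    """Reduce: total the per-document counts for one word."""
    return word, sum(per_doc_counts)

# A real framework shuffles the map output so that each reduce task
# receives every count for its word; here that is simulated in one
# process with a dictionary keyed by word.
documents = ["the cat sat on the mat", "the dog ate the cat"]
shuffled = defaultdict(list)
for doc in documents:
    for word, count in map_task(doc):
        shuffled[word].append(count)

totals = dict(reduce_task(w, c) for w, c in shuffled.items())
# totals["the"] == 4 across both documents
```

Notice that the parallelism falls out of the structure: every document is an independent map task and every distinct word is an independent reduce task, which is exactly why both numbers need to be much larger than the number of nodes.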

While most users of MapReduce get the idea that MapReduce needs its input data broken into lots of little pieces so that there are many Map tasks, they forget about the same requirements for Reduce tasks. Searching the internet it is easy to find examples of MapReduce with a small number of Reduce tasks. One is a tutorial from the University of Wisconsin where there is ultimately only one Reduce task. It is particularly galling that this example comes from the University of Wisconsin where they have a large and prestigious program on parallel database system research. In their defense, the tutorial does show how to do intermediate reduction of the data, but that does not prevent it from being a bad example in general.

Sometimes the problem is too small. What do you do if the problem you are working on just involves the computation of a single result? The answer is to enlarge the problem. In a large cluster it is better to compute more results, even though they may not be of immediate use to you. Let's look at an example. Say you want to analyze a set of documents for the frequency of the word 'the'. The natural thing to do is process all the documents, filter for the word 'the' in the Map function and count the results in the Reduce function. This is how you are taught to use "valuable" computing resources. In practice, with MapReduce it is better to count the frequency of all the words in the documents and save the results. It is not a lot more effort for the MapReduce engine to count every word, and if you then want to know how many uses there are of 'a' or any other word, the answer is there for you immediately.
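The contrast between the narrow job and the enlarged job can be shown in a small sketch, with plain Python standing in for the two MapReduce jobs and made-up documents as the data:

```python
from collections import Counter

documents = ["the cat sat on the mat", "the dog ate the cat"]

# Narrow job: filter for one word and sum the counts. The whole
# cluster would run just to answer a single question.
the_count = sum(doc.lower().split().count("the") for doc in documents)

# Enlarged job: count every word once and keep the whole table.
# Any later question becomes a cheap lookup instead of another run.
all_counts = Counter()
for doc in documents:
    all_counts.update(doc.lower().split())

assert all_counts["the"] == the_count  # same answer for 'the'
# ...and the count for 'cat', 'a' or any other word is already there
```

The enlarged job does slightly more work, but it amortizes the cost of reading all the data, which is the expensive part, over every question you might later ask.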

A common analogy is that MapReduce is a freight train while a relational database is a racing car. The freight train carries a huge load but is slow to start and stop. A racing car is fast and nimble but carries only one person. A relational database system relies on you to use the where clause to reduce the data that it has to analyze, and in return gives you the answer in a short time. MapReduce does not give you an answer as quickly, but it is capable of effectively processing a lot more data. With MapReduce you should process all the data and save the results, then use them as you need them. We can sum up this way of thinking about MapReduce with the slogan "no where clauses".

Thursday, April 28, 2011

Understanding MapReduce Performance: Part 1

Currently MapReduce is riding high on the hype cycle. The other day I saw a presentation that was nothing but breathless exhortation for MapReduce as the next big thing, urging us all to jump on the bandwagon as soon as possible. However, there are rumblings of performance problems. At the recent Big Data Camp, Greenplum reported that their MapReduce was 100 times slower than their database system. Searching the web finds many people complaining about MapReduce performance, particularly with NoSQL systems like MongoDB. That is a problem because MapReduce is the data analysis tool for processing NoSQL data. For MongoDB, anything more than the most trivial reporting requires the use of MapReduce.

At the same time there is plenty of evidence that MapReduce is no performance slouch. The Sort Benchmark is a prime measure of computer system performance, and currently the Hadoop MapReduce system holds two of the six titles for which it is eligible. One title is the Gray test, for sorting 100 Terabytes (TB) of data in 173 minutes. The other is the Minute test, for sorting 500 Gigabytes (GB) of data in under a minute. These results are as of May 2010, and the Sort Benchmark is run every year, so we can expect better performance in the future.

Understanding MapReduce performance is a matter of understanding two simple concepts. The first concept is that the design center for MapReduce systems like Hadoop is to run large jobs on a large distributed cluster. To get a feel for what this means, look at the Hadoop disclosure document for the Sort Benchmark. The run for sorting 100 TB was made on a cluster of about 3400 nodes. Each node had 8 cores, 4 disks, 16 GB of RAM and gigabit Ethernet. For the Minute sort, a smaller cluster of 1400 nodes was used, with the same configuration except 8 GB of RAM on each node. That is not to say that MapReduce will only work on thousand-node systems. Most systems are much smaller than this; however, Hadoop is specifically designed so that it will scale to run on a huge cluster.

One problem with a large cluster is that nodes break down. Hadoop has several features that transparently work around broken nodes and continue processing in the presence of failure. From the Sort Benchmark disclosure, for the Gray sort run every processing task is replicated. That is, two nodes are assigned to each processing task so that, should a node break down, the sort can still continue with the data from the other node. This was not done for the Minute test because the likelihood of a node breaking down in the minute while the test is running is low enough to be ignored.

Another large cluster feature that has an important effect on performance is that all intermediate results are written to disk. The results of all the Mappers are written to disk, and the sorted data for the Reducers is written to disk. This is done so that if a node fails, only a small piece of work needs to be redone. By contrast, relational database systems go to great lengths to ensure that after data has been read from disk, it does not touch the disk again before being delivered to the user. If a node fails in a relational database system, the whole system goes into an error state and then does a recovery, which can take some time. This is extremely disruptive when a node fails, and much better for performance when there is no failure. Relational database systems were not designed to run on thousands of nodes, so they treat a node failure as a very rare event, whereas Hadoop is designed as if it were commonplace. The consequence is that Hadoop performance can look slow when compared to a relational database on a small cluster.

Unfortunately, there is not a lot that a user can do about this, except look for a bigger cluster to run their analysis on, or look for bigger data to analyze. That is the subject of the second part of this post, where I will talk about the other simple concept for understanding MapReduce performance.

Sunday, April 24, 2011

The Truth about Smartphone Location Tracking

There is a wave of outrage across the internet about the revelation that the iPhone keeps a file of tracking information recording all the places it has been. How dare Apple track users of their products! I am afraid that this is an extremely naive attitude. The fact is that everybody is tracking you, and not only on an iPhone but on all smartphones, and on many less-than-smart phones as well. Let me count the ways, starting with the benign and moving to the egregious.

Firstly, the carriers and handset makers collect data from phones to help improve their service. Last week we had a joint meeting of the SDForum Business Intelligence and Mobile SIGs on "Mobile Analytics". At that meeting, Andrew Coward of CarrierIQ described how they embed their software in phones, usually at the carrier's direction, to collect information that can be used to improve service. For example, he told us that it is quite normal for them to report to a carrier that its dropped call rate is 6% while the carrier's own engineers are telling management that the dropped call rate is 1%. They collect data on location so that the carrier knows where its users are using their phones and can improve service to those areas.

In Europe, data retention laws require phone carriers to retain the Call Data Record (CDR) for every call for a period of one or two years. The police can and do request information on all the calls made to or from a number to help with their enquiries into crime. While a CDR does not usually contain specific location information, it does identify the cell tower and thus the approximate location of the caller. Police have successfully used location-based CDR data in their investigations for well over a decade.

With the user's permission, Google collects information from Android phones about their location. Google is the ultimate data collection company and I am always amazed at the creative ways they find to use that data. One Google service is the Traffic overlay on their Maps, which is derived from observing the change in location of Android phones. While Google says that they do not collect personally identifying information, they do need to distinguish between phones to make this application work, so they are tracking the movements of individuals, if only to provide the rest of us generic information on traffic flows. Google has plenty of other uses for this data. For example, they keep a database that locates every Wi-Fi hotspot so that they can identify your location from the Wi-Fi hotspot you are using. Google can use data from Android phones to validate and update that database.

Mobile analytics in Apps is where the use of location-based information starts to get interesting. Last year Flurry presented to the Business Intelligence SIG and we heard about their run-in with Steve Jobs. You can read their press release to get the full story of what they did. In short, Flurry has a free toolkit that developers install into their mobile Apps; it collects information and sends the data back to Flurry. The developer can then access analytics reports about their App at the Flurry web site. However, Flurry retains the data that has been collected from the App, including location-based data.

In January 2010, a couple of days before the iPad was announced, Flurry issued a press release saying that they saw a new Apple device that was only being used in the Apple headquarters in Cupertino, and gave some statistics on the number of different Apps being tested on this device. At this, Steve Jobs blew his top and tried to get Flurry completely banned from iPhone Apps. Eventually Flurry and Apple settled their differences. The conclusion, in the words of the iPhone developer agreement, was that "The use of third party software in Your Application to collect and send Device Data to a third party for processing or analysis is expressly prohibited."

So let's parse this. Flurry is a company that has no direct relationship with the carriers, handset makers or the users of Apps, yet it is collecting data from all the Apps it is included in. The data is available for use by the App developer and by Flurry. At the time of the iPad release they could identify that the device was different from all other devices and pinpoint its location to within one set of buildings. Now, I am not trying to pick on Flurry specifically; there are several companies in this area. At the Business Intelligence SIG last week we heard from Apsalar, a recent startup in the same space; however, Flurry is the largest company that provides mobile analytics. Flurry estimates that they are included in up to 1 in 5 mobile Apps for the iPhone and Android. Because they are in so many Apps, they can provide aggregate data on all App usage.

The point of this is that we want location-aware Apps, but we also want to preserve our privacy. As Apps stand, these two goals are incompatible. To be location aware, an App has to know your location, and if the App knows your location, it can transmit that information back to the App developer or to an analytics aggregator working for the App developer. Thus they know where you are whether you want them to or not. Android has a permissions profile, set when an App is installed, that determines which information the App can access. If it is allowed to access location information on installation, it can continue to do so until it is uninstalled.

Compared to what Apps know about what you are doing while you use them, the location database that the iPhone is collecting seems a small matter. In fact it seems a good reason to limit the number of Apps that can be running at any one time. At least if only one App is running, then only one App knows where you are at any particular time.

Tuesday, April 12, 2011

The Business of Open Source Suites

I have often wondered how a commercial company builds an Open Source suite out of a collection of open source projects. At the last BI SIG meeting, Ian Fyfe, Chief Technology Evangelist at Pentaho, told us how they do it and gave some interesting insights into how Open Source really works. Pentaho offers an Open Source Business Intelligence suite that includes the Kettle data integration project, the Mondrian OLAP project and the Weka data mining project, amongst others.

As Ian explained, Pentaho controls these Open Source projects because it employs the project leader and major contributors to each of them. In some cases Pentaho also owns the copyright on the code. In other cases, ownership is in doubt, because there have been too many contributors, and/or what they have contributed has not been managed well enough to be able to say who owns the code. Mondrian is an example of an Open Source project with enough contributors that it is not possible to take control of the whole source code and exert any real rights over it.

The real control that Pentaho exerts over the Open Source components of its suite is that it gets to say what their roadmap is and how they will evolve in the future. As I noted, Pentaho is driving the various projects to a common metadata layer so that they can become integrated as a single suite of products.

Saturday, April 09, 2011

The Fable of the Good King and the Bad King

A long time ago there were two countries. Each country had a King. One King was a good King and the other was a bad King, as we will find out. Now, as you all know, a King's main job is to go out and make war on his enemies. It is the reason that Kings exist. If a King is not out making war against his enemies, he will go out hunting and make war on the animals of the forest. A good war will enlarge the kingdom, enhance the King's fame and give him more subjects to rule over. But before a King can make war, he should make sure that his subjects are provided for. For while the subjects of a King owe everything that they have to their King, the King is also responsible for the welfare and well-being of his subjects.

There are many parts to taking care of subjects: making good laws, passing down sound judgements; but the most important is making sure that the granaries are filled in times of plenty. For as surely as fat times follow lean times, lean times follow fat times. In times of plenty, the excess harvest should be saved so that in times of need the subjects do not starve. Subjects who are starving are weak, and cannot praise their King nor defend his kingdom.

Now in our two countries, these were years of plenty, and the Kings knew that they would go to war. The good King also knew that it was his duty to make sure the granaries were filled, and so he did. However, the bad King wanted to win the battle so badly that he sold off all the grain in his granaries to buy expensive war machines. A little incident happened, it was blown up into a huge crisis, and the two countries went to war. Each King assembled his army and led it to the battleground at the border of their countries, as had happened so many times before. The armies were evenly matched and they fought all day. At the end of the day the army of the bad King held its ground and he was declared the victor. The expensive war machines had helped, but less than hoped. However, both armies were so weakened and exhausted by the fight that they turned around and went home, as they had so many times before.

The years after this battle were years of want. The harvest had failed and both kingdoms suffered. However, the kingdom of the bad King suffered much more than the kingdom of the good King, for there was no grain in its granaries. When the little incident happened that blew up into a huge crisis, both Kings assembled their armies and marched to the battleground on the border. This time the good King won the battle because his men were stronger.

The good King advanced his army into the country of the bad King. He might not be able to take the whole country, but the good King had to let his men do a little rape and pillage as a reward for winning the battle. The bad King, realizing his precarious position, came out to parley with the good King. The bad King had nothing to offer but some used war machines and the hand of his daughter in marriage. The good King accepted that the daughter of the bad King should marry his son, and that when the two Kings had passed on to the greater battleground in the sky, the son of the good King would rule both countries. Thus the two kingdoms would become one united country, a country large and strong enough to make war on the countries on the far side of the mountains.

The moral of this story is that in times of plenty, make sure that the granaries are filled, for as surely as fat times follow lean times, lean times follow fat times, and the best protection against lean times is full granaries. On this matter, a King must beware of false counsel. When times are good, the false counsel will say "What could possibly go wrong? The times are fat and everyone is happy. Make the populace happier by selling off the grain in the granary and rewarding the citizens each according to what they have contributed." Even worse, when times are lean the false counsel will say "Times are awful and getting worse; we must take the grain out of the people's mouths and put it in the granaries, for the harvest next year could be even worse than this year." The point of a granary, or any store of wealth, is to save the excess of the fat years so that it can be used during the lean years.

Wednesday, March 30, 2011

Cloud Security

Security is not only the number one concern for adopting cloud computing, it is also a serious barrier to the adoption of cloud computing. Also, security considerations are causing the Virtual Machine (VM) operating system to evolve. All this came out at the SDForum Cloud SIG night on Cloud Security (the presentations are on the SIG page). There were three speakers and a lot was said; I am just going to highlight a few things that struck me as important.

Firstly, Dr Chenxi Wang from Forrester Research spoke on cloud security issues and trends. She highlighted the issue of compliance with various regulations and how it clashes with what the cloud providers have to offer. One concern is where data is stored, as countries have different regulations for data privacy and record keeping on individuals. If data from one country happened to be stored in another country, that could create a problem with complex legal ramifications that would be expensive to resolve. On the other side of the equation are the cloud system vendors, who want to provide a generic service with as few constraints as possible. Having to guarantee where data is stored would make their service offering more complicated and expensive to provide.

Another more specific example of the clash between compliance and what cloud vendors provide is with the PCI security standard in the credit card industry. One PCI requirement is that all computer systems used for PCI applications are scanned for vulnerabilities at least every three months. Most cloud vendors are unwilling to have their systems scanned for vulnerabilities for a variety of reasons, one of which I will discuss shortly. The solution may be specialized cloud services that are aimed at specific industries. IBM is experimenting with a cloud service that they claim is PCI compliant. These specific services will be more expensive, and we will have to wait and see whether they succeed.

Chris Richter from Savvis, a cloud provider, spoke next. He mentioned standards as a way to resolve the issues described above. The International Organization for Standardization is creating the ISO 27000 suite of standards for information security. So far, ISO 27001 "Information security management systems — Requirements" and ISO 27002 "Code of practice for information security management" are the most mature and relevant standards. As with other ISO standards, like the ISO 9000 quality standard, there is a certification process which will allow cloud providers to make standards-based security claims about the service that they provide.

Finally, Dave Asprey from Trend Micro discussed the evolving nature of the VM technology that underlies cloud computing offerings. The original VMware vision was that a virtual machine would be used to develop software for a real physical machine, so they spent a lot of time and effort on faithfully replicating every aspect of a physical machine in their virtual machine. Now the use case has shifted to making more efficient use of resources. However, a problem is that common operations can bring a set of virtual machines to a standstill if they all decide to do the same operation at the same time.

Again, vulnerability scanning shows the problem. If the company default is that the anti-virus scan is scheduled for lunchtime Wednesday, the whole virtual machine infrastructure can be brought to its knees when everyone's VM starts its scan at the same time. Furthermore, because many of the files being scanned may be shared by all the virtual machines, having each VM scan them is a huge waste of resources. Anti-virus software companies are working with the VM software vendors to provide a vulnerability scan that is VM aware and that uses new VM APIs to perform its function in an efficient and non-disruptive way. While this is necessary, it seems to run counter to the original notion that each VM is an entirely separate entity, completely unaware that other VMs exist.

Sunday, March 13, 2011

Database System Startups Capitulate

In the last decade, there have been many database system startups, most of them aimed at the analytics market. In the last year, several of the most prominent ones have sold out to large companies. Here are my notes on what has happened.

Netezza to IBM
Netezza is a database appliance that uses hardware assistance to do search. Recently it has been quite successful, with revenues getting into the $200M range. Netezza was founded in 2000 and sold out to IBM for $1.7B; the deal closed in November 2010. The Netezza hardware assistance is a gizmo near the disk head that decides which data to read. Many people, myself included, think that special purpose hardware in this application is of marginal value at best. You can get better price performance and much more flexibility with commodity hardware and clever software. IBM seems to be keeping Netezza at arm's length as a separate company and brand, which is unusual, as IBM normally integrates the companies it buys into its existing product lines.

Greenplum to EMC
Greenplum is a massively parallel database system. For example, Brian Dolan told the BI SIG last year how Fox Interactive Media (MySpace) used a 40-host Greenplum database system to do their data analytics. The company was founded in 2003. The sale to EMC closed in July 2010; the price is rumoured to be somewhere at the top of the $300M to $400M range. EMC is a storage system vendor that has been growing very fast, partly by acquiring successful companies. EMC owns VMware (virtualization), RSA (security) and many other businesses. The Greenplum acquisition adds big data to big storage.

Vertica to HP
Vertica is a columnar database system for analytics. The privately held company started in 2005 with respected database guru Mike Stonebraker as a founder. The sale was announced in February 2011; the sale price has not been announced. I have heard a rumour of $180M, which seems low, although the company received only $30M in VC funding. Initially Vertica seemed to be doing well; however, in the last year it seems to have lost momentum.

The other interesting part of this equation is HP, which used to be a big partner with Oracle for database software. When Oracle bought HP hardware rival Sun Microsystems in 2009, HP was left in a dangerous position, as they did not have a database system to call their own. I was surprised that nobody commented on this at the time. In the analytics area, HP tried to fill in with the NeoView database system, which proved to be such a disaster that they recently cancelled it and bought Vertica instead. NeoView was based on the Tandem transaction processing database system. Firstly, it is difficult to get a database system that is optimized for large numbers of small transactions to do large analytic queries well, and the Tandem system is highly optimized for transaction processing. Secondly, the Tandem database system only ran on the most expensive hardware that HP had to offer, so it was very expensive to deploy.

Aster Data Systems to Teradata
Aster Data is a massively parallel database system which, in theory, is a little more flexible about using a cluster of hosts than Greenplum. The company was founded in 2006 and sold out to Teradata for about $300M in March 2011. Teradata, founded in 1979, acquired by NCR in 1991 and spun out of NCR in 2007, has since been successfully growing in the data warehouse space. It is not clear how Aster Data and Teradata will integrate their product lines. One thing is that Aster Data gives Teradata a scalable offering in the cloud computing space. Teradata has been angling to get into this space for some time, as we heard last summer when Daniel Graham spoke to the BI SIG.

Recently there have been a lot of database system startups, and several of them are still independent. On the other side, there are not a lot of companies that might want to buy a database system vendor. Furthermore, there is a strong movement to NoSQL databases, which are easier to develop and where there are several strong contenders. The buyout prices are good, but apart from Netezza the prices are no blowout. The VCs behind these sales probably decided that they did not want to be left standing when the music stops, and so sold out for a good but not great profit.

Saturday, February 26, 2011

The App Store Margin

Recently there has been a lot of discussion about the Apple announcement that they are taking a 30% margin for selling subscriptions through their App store, and that Apple will also take a 30% margin for Apps that sell virtual products and subscriptions through the App. Unfortunately, most of the discussion has been heat without light; that is, there have been no facts to back up the arguments on either side. I had been curious about the margin in selling goods anyway, so as I had the data, I computed the gross margin for publicly traded US companies in the various different retail categories.

As you can see, the margin varies between 20% and 40%. The overall average is about 25%, dominated by the Grocery and Department & Discount categories. Retailers working in the real world, after paying for their goods, also have to pay for their properties, staff and marketing, so their net margin is considerably less. On the other hand, Apple is just processing payments and delivering virtual goods over the internet. On this basis, a 30% margin seems to be on the high side, although not completely out of line.

Galen Gruman at Infoworld points out that a higher margin tends to favor small app and content providers because they would have high distribution costs anyway. On the other hand, a large content provider resents having to hand over 30% of their revenue to Apple for not doing a lot of work. For this reason, I expect that large content providers will campaign for a bulk discount on the cost of distributing their content. Thus a good, and hopefully likely, outcome is a sliding scale: for example, a 30% margin on the first $20,000 per month, 20% on the next $20,000, 10% on the next $20,000, and so on (I have no insight into the business, so these numbers are invented as an illustration rather than a suggestion as to what the numbers should be).
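The sliding scale works like tax brackets, so the effective rate falls as revenue grows. Here is a sketch of the arithmetic in Python, using the invented brackets above and simplifying the "and so on" to a flat 10% beyond $40,000 (again, illustrative numbers, not anything Apple has proposed):

```python
def app_store_fee(monthly_revenue):
    """Fee under the post's illustrative sliding scale: 30% on the first
    $20,000 per month, 20% on the next $20,000, 10% on everything above.
    The brackets are invented for illustration, not real App Store terms."""
    brackets = [(20_000, 0.30), (20_000, 0.20), (float("inf"), 0.10)]
    fee, remaining = 0.0, monthly_revenue
    for size, rate in brackets:
        portion = min(remaining, size)   # revenue falling in this bracket
        fee += portion * rate
        remaining -= portion
        if remaining <= 0:
            break
    return fee
```

For a provider grossing $50,000 a month this gives a fee of $11,000, an effective rate of 22% rather than the flat 30% that large content providers resent.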

Part of the resentment with Apple is that they have a captive market and their behavior in dictating terms appears dictatorial. They would have been much better off following the standard politically correct procedure: put out a discussion document and then, after some to and fro, impose the terms they always intended. It has the same end result while creating good will through a patina of choice and consultation.

Saturday, February 19, 2011

Agile BI at Pentaho

Ian Fyfe, Chief Technology Evangelist at Pentaho showed us what the Pentaho Open Source Business Intelligence Suite is and where they are going with it when he spoke to the February meeting of the SDForum Business Intelligence SIG on "Agile BI". Here are my notes on Pentaho from the meeting.

Ian started off with positioning. Pentaho is an open source Business Intelligence suite with a full set of data integration, reporting, analysis and data mining tools. Their perspective is that 80% of the work in a BI project is acquiring the data and getting it into a suitable form, and the other 20% is reporting and analysis of the data. Thus the centerpiece of their suite is the Kettle data integration tool. They have the strong Mondrian OLAP analysis tool and the Weka data mining tools. Their reporting tool is perhaps not quite as strong as that of other Open Source BI suites that started from a reporting tool. All the code is written in Java. It is fully embeddable in other applications and can be branded for that application.

Ian showed us a simple example of loading data from a spreadsheet, building a data model from the data and then generating reports from it. All of these things could be done from within the data integration tool, although they can also be done with standalone tools. Pentaho is working in the direction of a fully integrated set of tools with common metadata between them all. Currently some of the tools are thick clients and some are web based; they are moving to have all their client tools be web based.

We had come to hear a presentation on agile BI, and Ian gave us the Pentaho view. In an enterprise, the task of generating useful business intelligence is usually done by the IT department in consultation with the end users who want the product. The IT people are involved because they supposedly know the data sources and they own the expensive BI tools. Also, the tools are complicated, and using them is usually too difficult for the end user. However, IT works to its own schedule, through its own processes, and takes its time to produce the product. Often, by the time IT has produced a report, the need for it has moved on.

Pentaho provides a tightly integrated set of tools with a common metadata layer, so there is no need to export the metadata from one tool and import it into the next. The idea is that the end-to-end task of generating business intelligence from source data can be done within a single tool or a tightly integrated suite of tools. This simplifies and speeds up the process of building BI products to the point that they can be delivered while still useful. In some cases, the task is simplified to such an extent that it may be done by a power user rather than being thrown over the wall to IT.

The audience was somewhat sceptical of the idea that a sprinkling of common metadata can make for agile BI. All the current BI suites, commercial and open source, have been pulled together from a set of disparate products and they all have rough edges in the way the components work together. I can see that deep and seamless integration between the tools in a suite will make the work of producing Business Intelligence faster and easier. Whether it will be fast enough to call agile we will have to learn from experience.

Sunday, February 06, 2011

Revolution: The First 2000 Years of Computing

For years, the Computer History Museum (CHM) had an open storage area where they put their collection of old computers, without any interpretation except for docent-led tours. I had no problem wandering through this treasure trove because I knew a lot about what they had on show, from slide rules and abacuses to the Control Data 6600 and the Cray machines. Even then, a docent could help by pointing out features that I would miss, such as the ash tray on each workstation of the SAGE early warning computer system.

Now the CHM has opened its "Revolution: The First 2000 Years of Computing" exhibition, and I recommend a visit. They still have all the interesting computer hardware that was in the visible storage area, but it is placed in a larger space and there is all kind of interpretive help, from explanations of the exhibits to video clips that you can browse. On my visit, I saw a lot of new things and learned much.

For example, Napier's Bones are an old-time calculation aid that turns long multiplication into addition. The Napier's Bones exhibit explains how they work and lets you do calculations using a set. The exhibit on computers and rocketry has the guidance computer for a large missile arrayed in a circle around the inside of the missile skin, leaving an ominously empty space in the middle for the payload. In the semiconductor area there are examples of silicon wafers ranging from the size of a small coin in the early days to a current wafer the size of a large dinner plate. There is also an interesting video discussion of the marketing of the early microprocessors, like the 8086, the Z8000 and the M68000, and the absolute importance of landing the design win for the IBM PC that led to the current era where Intel is the biggest and most profitable chip maker. These are just a sample of the many fascinating exhibits there.

I spent over two hours in the exhibition and only managed to get through half of it. I am a long-time member of the museum and can go back any time, so this is a warning to non-members to allow enough time for their visit.

Saturday, February 05, 2011

Greenplum at Big Data Camp

I was at the Big Data Camp at the Strata Big Data conference the other day, and one of the breakout sessions was with Greenplum. They had several interesting things to say about map-reduce and about performance, both in general and in the cloud. Greenplum is a parallel database system that runs on a distributed set of servers. To the user, Greenplum looks like a conventional database server, except that it should be faster and able to handle large data because it farms out the data and the workload over all the hosts in the system. Greenplum also has a map-reduce engine in the server and a distributed Hadoop file system. Thus the user can use Greenplum both as a big data relational database and as a big data NoSQL database.

Map-reduce is good for taking semi-structured data and reducing it to more structured data. The example of map-reduce that I gave some time ago does exactly that. Thus a good use of map-reduce is to do the Transform part of ETL (Extract-Transform-Load), which is the way data gets into a data warehouse. The Greenplum people confirmed that this is a common use pattern for map-reduce in their system.
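
As a toy illustration of map-reduce doing the Transform step, here is a sketch in plain Python: the map phase parses semi-structured log lines into keyed records, and the reduce phase aggregates them into structured rows ready to load into a warehouse. The log format and function names are made up for the example; a real framework would handle the shuffle step and the distribution across hosts.

```python
from collections import defaultdict

# Toy semi-structured input: raw web-log lines (hypothetical format).
log_lines = [
    '2011-02-05 10:01 GET /index.html 200',
    '2011-02-05 10:02 GET /about.html 404',
    '2011-02-05 10:03 GET /index.html 200',
]

def map_phase(line):
    """Map: parse one semi-structured line into (key, value) pairs."""
    _date, _time, _method, path, status = line.split()
    yield (path, status), 1

def reduce_phase(key, values):
    """Reduce: aggregate all the values seen for one key."""
    return key, sum(values)

# The framework normally shuffles pairs to reducers; done by hand here.
shuffled = defaultdict(list)
for line in log_lines:
    for key, value in map_phase(line):
        shuffled[key].append(value)

rows = [reduce_phase(k, v) for k, v in sorted(shuffled.items())]
for (path, status), count in rows:
    print(path, status, count)  # structured rows, ready to load
```

The same pattern scales out because each line can be mapped on any host, and each key can be reduced on any host.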

Next was a discussion of performance. Greenplum has compared performance and asserted that their relational database is 100 times faster than their map-reduce engine for doing the same query. I was somewhat surprised by the magnitude of this number, however I know that at the low end of system and data size a relational database can be much faster than map-reduce, and at the high end there are places you can go with map-reduce that conventional database servers will not go, so it is never a level comparison. I will write more on this in another post.

Finally we got to performance on Virtual Machines (VMs) and in the cloud. Again Greenplum had measured their performance and offered the following. Where conditions are well controlled, as in a private cloud, they expect to see a 30% performance reduction from running on VMs. In a public cloud like Amazon EC2, they see a 4x performance reduction. The problem in a public cloud is inconsistent speed for data access and networks. They see both an overall speed reduction and inconsistent speeds when the same query is run over and over again.

It is worth remembering that Greenplum and other distributed database systems are designed to run on a set of servers with the same performance. In practice this means that the whole database system tends to run at the speed of the slowest instance. On the other hand, map-reduce is designed to run on distributed systems with inconsistent performance. The workload is dynamically balanced as the map-reduce job progresses, so map-reduce will work relatively better in a public cloud than a distributed database server.

Monday, January 31, 2011

Security in the Cloud

Although I am not an expert, I have been asked more than once about security in the cloud. Now I can help, because last week I got an education on security best practices in the cloud at the SDForum Cloud SIG meeting. Dave Asprey, VP of Cloud Security at Trend Micro, gave us 16 best practices for ensuring that data is safe in a public cloud like Amazon's cloud services. I will not list all of them, but here is the gist.

Foremost is to encrypt all data. The cloud is a very dynamic place, with instances being created and destroyed all over the place, and your instances or data storage may be moved about to optimize performance. When this happens, a residual copy of your data can be left behind for the next occupier of that space to see. Although this would only happen by accident, you do not want to expose confidential data for others to see. The only cure is to encrypt all data so that whatever may be left behind is not recognizable. Thus you should only use encrypted file systems, encrypt data in shared memory and encrypt all data on the network.
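
The principle can be shown in a few lines of Python. This sketch uses a one-time XOR pad, purely for illustration because it fits in stdlib; real deployments would use a vetted cipher such as AES, not this.

```python
import secrets

def encrypt(data: bytes, key: bytes) -> bytes:
    """XOR the data with a one-time key of equal length.
    Illustrative only -- use a vetted cipher (e.g. AES) in practice."""
    assert len(key) == len(data)
    return bytes(d ^ k for d, k in zip(data, key))

decrypt = encrypt  # XOR is its own inverse

record = b'confidential customer record'
key = secrets.token_bytes(len(record))  # key is kept outside the cloud

stored = encrypt(record, key)           # this is what lands on cloud disk
assert stored != record                 # any residual copy is unreadable
assert decrypt(stored, key) == record   # only the key holder can recover it
```

Whatever residual copy of `stored` is left behind when an instance moves is just noise to the next occupier of that disk space.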

Management of encryption keys is important. For example, you should only allow the decryption key to enter the cloud when it is needed, and make sure that it is wiped from memory after it has been used. Passwords are a thing of the past. Instead of a password, being able to provide the key to decrypt your data is sufficient to identify you to your cloud system. There should be no password-based authentication, and access to root privileges should not be mediated by a password but should be enabled as needed by mechanisms like encryption keys.
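
The "enter only when needed, wipe after use" rule can be sketched as a small helper. This is my own illustration, not anything Asprey showed, and it is best-effort only: Python may keep other copies of the bytes internally, which is why lower-level languages are often used for this.

```python
def with_ephemeral_key(fetch_key, use_key):
    """Fetch the decryption key, hand it to use_key, then overwrite it.
    Best-effort sketch: the runtime may still hold internal copies."""
    key = bytearray(fetch_key())      # key enters cloud memory only here
    try:
        return use_key(key)
    finally:
        for i in range(len(key)):     # zero the buffer before releasing it
            key[i] = 0

# Usage: the key exists in memory only for the duration of the call.
n = with_ephemeral_key(lambda: b'0123456789abcdef',
                       lambda k: len(k))
print(n)  # → 16
```

The same discipline applies to root access: a capability that is fetched, used and destroyed, rather than a standing password.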

Passive measures are not the end of cloud security. There are system hardening tools and security testing services. Also use an active intrusion detection system, for example OSSEC. Finally, and most importantly, the best advice is to "Write Better Applications!"

Monday, January 17, 2011

The Steve Jobs Media Playbook

Information wants to be free. Steve Jobs is not usually associated with setting information free, however he set music free and may well be on the way to set more media free. Here is the playbook that he used to set music free, and an examination of whether he can set other media free.

Back at the turn of the millennium, digital music was starting to make waves, and Apple introduced its first iPod in 2001. At the beginning, it was not a great seller. The next year, the second-generation iPod, which worked with Microsoft Windows, came out and sales started to take off. The next problem in promoting sales of the iPod was to let people buy music directly. In those days, to buy music you had to buy a CD, rip it onto a computer and then sync the music onto the iPod.

The record companies did not like digital music. It was in the process of destroying their business model of selling physical goods, that is CDs, which had been plenty profitable until the internet and file sharing came along. Thus the record companies knew that if they were going to allow anyone to sell digital music, the music content had to be protected by a strong Digital Rights Management (DRM) system. Basically, DRM encrypts digital content so that it can only be accessed by a legitimate user on an accredited device.

Now there is one important thing about any encryption: it depends upon a secret key to unlock the content. If too many people know a secret, it is no longer a secret. So it made perfect sense for Apple to have its own DRM system and be responsible for keeping the secret safe. The only problem was that Apple effectively controlled the music distribution channel because of the DRM system and its secret. By providing exactly what the music business had asked for, Apple managed to wrest control of the distribution channel from them.

In the past I have joked about the music business controlling the industry by controlling the means of production. In fact, they controlled the business by controlling the distribution channel between the artists and the record stores that sold the music. When the iTunes store became the prime music distribution channel, it was game over for the recording industry. They had to climb down and offer their music without DRM to escape from its deadly embrace. DRM-free music has not stopped iTunes, but it does open up other sales channels.

The remaining question is what will happen with other media. Apple will not dominate the tablet market as it has the music player market, so it will not be able to exert the same influence. On the other hand, other media is not as collectible as music. We collect music because we want to listen to it over and over again. With most other media, we are happy to consume it once and then move on. Thus we do not feel the need to own the media in the same way. I have some more thoughts that will have to wait for another time.