Wednesday, May 25, 2011

On Copyright and Open Source

Copyright is a key part of an Open Source or Free Software project. It may sound like copyright is antithetical to Free and Open software, but if Richard Stallman, President of the Free Software Foundation (FSF), thinks that ownership of copyright is an important part of Free Software, then we should believe him. A couple of things have led me to this conclusion. Firstly, at the February meeting of the Business Intelligence SIG, Ian Fyfe discussed the business of Open Source suites and how Pentaho is able to offer a suite of Open Source projects as a commercial product by controlling the Open Source projects, and in particular the copyright to the code.

The other clue to the importance of copyright came by accident as I was looking at the difference between the emacs editor and the XEmacs editor. Emacs was an open software project that forked in the early 1990s, before the terms Free Software and Open Source had even been invented. One of the criticisms that Stallman, speaking for the emacs project, levels against the XEmacs project is that it has been sloppy about the ownership of its code and has not always obtained the "legal papers" that assign ownership of a contribution to the project. On this web page about XEmacs versus emacs, Stallman says:
"XEmacs is GNU software because it's a modified version of a GNU program. And it is GNU software because the FSF is the copyright holder for most of it, and therefore the legal responsibility for protecting its free status falls on us whether we want it or not. This is why the term "GNU XEmacs" is legitimate.

"But in another sense it is not GNU software, because we can't use XEmacs in the GNU system: using it would mean paying a price in terms of our ability to enforce the GPL. Some of the people who have worked on XEmacs have not provided, and have not asked other contributors to provide, the legal papers to help us enforce the GPL. I have managed to get legal papers for some parts myself, but most of the XEmacs developers have not helped me get them."
Note that GNU is the FSF's "brand" for its software. The legal papers that Stallman references assign ownership and copyright of a code contribution to the FSF. Because the FSF owns the code, it can enforce its rights as owner against anyone who breaks its license. It can also change the terms of the license, or license the code to another party under any other license that it sees fit. The FSF has in fact changed the license terms of the code that it owns: as new versions of the GNU General Public License (GPL) have emerged, the FSF has upgraded the license to the latest version.

Copyright and Open Source is a study in contradictions. On the one hand, Richard Stallman has been "campaigning against both software patents and dangerous extension of copyright laws". On the other hand, he uses ownership of copyright to push his agenda through the GNU General Public License, which has a viral component: the source code of any software that is linked with GPL-licensed software must be published as open source software. I will write more about this issue.

A good Open Source project needs to make sure that everyone who contributes code signs a document that assigns copyright of their contribution to the project. Unless care is taken to make all the code belong to a single entity, each contributor owns their own contribution. If the project wants to be able to do anything with the code other than passively allow its distribution under the existing license, the code must be owned by a single entity. As Stallman says, the project may not even be able to defend its own rights unless the code has clear ownership.

Wednesday, May 18, 2011

The Facebook PR Fiasco

Last week came the revelation that Facebook had secretly hired a prestigious Public Relations (PR) firm to plant negative stories about Google and its privacy practices. This is a completely ridiculous thing to have done, and wrong in so many ways that it is difficult to know where to begin. Here are some of the top reasons why it was a bad idea.
  • Firstly, the idea that Facebook should be accusing anyone of playing fast and loose with people's privacy is severely hypocritical. Just last year, Mark Zuckerberg told us that "the age of privacy is over". Now he is trying to say that Google is worse for privacy than Facebook! And by the way, this revelation comes at the same time as Symantec has discovered a serious and longstanding security hole in the Facebook App API that allows a user's private data to leak. The only cure is to change your Facebook password, so if you are a Facebook user, go and change your password now!
  • Secondly, we come to the oxymoronic idea of a secret PR campaign. Anyone who thinks that a PR campaign can be secret does not understand PR.
  • Thirdly, a competent, let alone "prestigious", PR firm should have understood that the ruse was bound to be discovered and that the fallout would be much worse publicity than anything negative it could promulgate. Thus anyone who claims to understand PR should have guided their client towards something less radical and refused to get involved in the campaign. As it is, the PR firm of Burson-Marsteller has lost a lot of its credibility by being involved in the fiasco, and in PR credibility is everything.
  • Fourthly, the whole idea of a secret PR campaign against another company seems sophomoric, as if Facebook is run by a bunch of undergraduates who have little real-world experience and think that they will be able to get away with a jape like this. No wait …
  • Finally, if Facebook does want to launch a PR campaign on privacy, they should do so openly, by generating positive press that compares their supposedly good privacy policies with others' less good privacy policies and behavior. As Machiavelli said, "A prince also wins prestige for being a true friend or a true enemy, that is, for revealing himself without any reservation in favor of one side against another", and he goes on to explain why openness and taking sides lead to better outcomes than pretended neutrality. As Facebook ran their PR campaign in secret, we can conclude that they could not have run it in public, and therefore that their privacy practices are no better than those of Google or anyone else.
Note: I was going to call this post "Pot hires PR firm to secretly call kettle black" until I read this article from the Atlantic about Search Engine Optimization (SEO) and the fact that as search engines do not have a sense of humor, humorous headlines do not work in the online world.

Saturday, May 07, 2011

Living In the Stream

It used to be that "stream of consciousness" was a pejorative. It was a phrase you used to put down the type of person who talked endlessly, with little connection between what they said and what anyone else said, or even between one thing they said and the next. Nowadays, we live our lives in a stream of consciousness.

Text messages demand to be answered. If you do not answer a text within ten or fifteen minutes, the sender complains that you are ignoring them. Emails keep arriving, and a popup in the corner of the screen heralds their arrival. The popup contains an excerpt of the message designed to make you read the whole thing immediately, even though you know that it is junk or something that you should handle later. Instant message boxes pop up whenever you are online and cannot be ignored. Sometimes people call you on the phone, although good form these days is to IM someone first to see if you can call them. Finally there are the two great streams of consciousness that captivate our attention: Facebook and Twitter. Random stuff arrives in a random order, and because you have subscribed to the feeds you keep looking at them to see if anything interesting has happened. In practice it is most likely to be a video of a cute animal doing something stupid.

How people survive and get anything done with these constant streams of distraction is a mystery to me. I do software, and sometimes I need to concentrate on a problem for a good period of time without interruption. It is not that I am necessarily thinking hard all the time, just that it can take time to investigate a problem or think through all the ramifications of a solution, and any distraction breaks the groove, meaning I have to start over. When this happens endlessly in a day, my rate of getting stuff done drops towards zero.

So how do we fight back against constant disruption? The answer is to take control and not let others dictate the agenda. Firstly, establish that there are periods when you are off-line. I do not take my phone to the bathroom, or when I work out, or when I go to bed. Also, I do not answer the phone when driving alone, and I have my passenger answer when I am not alone. All our means of communication apart from voice have a buffer, so they do not need to be answered immediately; for voice there is a thing called voicemail. On the other hand, voicemail introduces us to the game of telephone tag, which is fun for those who like playing it and exceedingly annoying for the rest of us.

Secondly, you do need to "return your calls", as they used to say. Which brings us to the crux of the matter. If you want to be part of the conversation, you need to take part in it. Unfortunately, these days what you have to do is "return your calls", respond to your texts, answer your emails, react to IMs, post to Facebook and Twitter to show that you are a conscious sentient being, and finally do something to make a living. So it comes down to picking conversations, and thinking hard about which conversations you want to join. Do this right and we become Islands in the Stream, which is the most we can hope to be these days.

Sunday, May 01, 2011

Understanding MapReduce Performance: Part 2

Getting good performance out of MapReduce is a matter of understanding two concepts. I discussed the first one, that MapReduce is designed to run on large clusters, in a post last week. Here is the second concept, and it is something that everyone who uses MapReduce needs to grapple with. MapReduce works by breaking the processing task into a huge number of little pieces so that the work can be distributed over the cluster and done in parallel. Each Map task and each Reduce task is a separate task that can be scheduled to run in parallel with other tasks. For both Map and Reduce, the number of tasks needs to be much larger than the number of nodes in the cluster.

The archetypal example of MapReduce is counting word frequency in a large number of documents. A Map task reads a document and outputs a tuple for each word with the count of occurrences of the word in that document. A Reduce task takes a word and accumulates a total count for the word from the per-document counts produced by the Map tasks. In this example, there are a large number of documents as input to the Map tasks, and presumably a large number of distinct words, so that there are a large number of Reduce tasks. Another illustration of this principle is found in the Sort Benchmark disclosure that I discussed in the previous post. For the Gray sort, the 100 TB of data is broken into 190,000 separate Maps and there are 10,000 Reduces for a cluster of 3400 nodes.
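
For concreteness, here is a minimal sketch of the word-count job written against the Hadoop MapReduce Java API. It follows the standard tutorial pattern rather than any particular production code, and the class names are illustrative:

    // A minimal word-count job, assuming the Hadoop MapReduce Java API
    // (org.apache.hadoop.mapreduce). Class names are illustrative.
    import java.io.IOException;
    import java.util.StringTokenizer;

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.Reducer;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class WordCount {

      // Map: for each word in the input, emit the pair (word, 1).
      public static class TokenizerMapper
          extends Mapper<Object, Text, Text, IntWritable> {
        private static final IntWritable ONE = new IntWritable(1);
        private final Text word = new Text();

        @Override
        public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
          StringTokenizer itr = new StringTokenizer(value.toString());
          while (itr.hasMoreTokens()) {
            word.set(itr.nextToken());
            context.write(word, ONE);
          }
        }
      }

      // Reduce: sum the per-document counts to a total for each word.
      public static class IntSumReducer
          extends Reducer<Text, IntWritable, Text, IntWritable> {
        private final IntWritable result = new IntWritable();

        @Override
        public void reduce(Text key, Iterable<IntWritable> values,
            Context context) throws IOException, InterruptedException {
          int sum = 0;
          for (IntWritable val : values) {
            sum += val.get();
          }
          result.set(sum);
          context.write(key, result);
        }
      }

      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "word count");
        job.setJarByClass(WordCount.class);
        job.setMapperClass(TokenizerMapper.class);
        job.setCombinerClass(IntSumReducer.class); // local pre-aggregation
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Note the combiner: it performs intermediate reduction of the Map output on each node before the data is shuffled to the reducers, which cuts down the network traffic.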

While most users of MapReduce get the idea that the input data needs to be broken into lots of little pieces so that there are many Map tasks, they forget that the same requirement applies to Reduce tasks. Searching the internet, it is easy to find examples of MapReduce with a small number of Reduce tasks. One is a tutorial from the University of Wisconsin in which there is ultimately only one Reduce task. It is particularly galling that this example comes from a university with a large and prestigious program of parallel database system research. In their defense, the tutorial does show how to do intermediate reduction of the data, but that does not prevent it from being a bad example in general.
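
In Hadoop, the number of Map tasks falls out of how the input is split, but the number of Reduce tasks is set explicitly on the job, so this is the knob to check. A one-line sketch for a driver like the one above; the numbers are purely illustrative, not a recommendation:

    // Hypothetical sizing: ask for a small multiple of the cluster size
    // so that every node runs several Reduce tasks, not the other way
    // around. Both values here are illustrative.
    int clusterNodes = 100;                   // assumed cluster size
    job.setNumReduceTasks(4 * clusterNodes);  // many more Reduces than nodes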

Sometimes the problem is too small. What do you do if the problem you are working on just involves the computation of a single result? The answer is to enlarge the problem. On a large cluster it is better to compute more results, even though they may not be of immediate use to you. Let's look at an example. Say you want to analyze a set of documents for the frequency of the word 'the'. The natural thing to do is to process all the documents, filter for the word 'the' in the Map function, and count the results in the Reduce function. This is how you are taught to use "valuable" computing resources. In practice, with MapReduce it is better to count the frequency of all the words in the documents and save the results. It is not a lot more effort for the MapReduce engine, and if you then want to know how many uses there are of 'a' or any other word, the counts are there for you immediately.
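
To see why the narrow version of the job is a poor fit, consider what its Map function would look like (a sketch; the class name is mine). Every output pair carries the same key, so however many nodes the cluster has, a single Reduce task ends up doing all of the aggregation:

    // Sketch of the "too small" job: filtering for one word in the Map
    // means every output pair has the key "the", so all the counting
    // funnels into a single Reduce task. Class name is illustrative.
    public static class TheOnlyMapper
        extends Mapper<Object, Text, Text, IntWritable> {
      private static final IntWritable ONE = new IntWritable(1);
      private static final Text THE = new Text("the");

      @Override
      public void map(Object key, Text value, Context context)
          throws IOException, InterruptedException {
        for (String token : value.toString().split("\\s+")) {
          if (token.equalsIgnoreCase("the")) {
            context.write(THE, ONE); // one key, so one busy reducer
          }
        }
      }
    }

Swap this for the full word-count Map above and the same job spreads its Reduce work across as many keys as there are distinct words.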

A common analogy casts MapReduce as a freight train and a relational database as a racing car. The freight train carries a huge load but is slow to start and stop. A racing car is very fast and nimble, but it carries only one person. Relational database systems rely on you to use the where clause to reduce the data they have to analyze, and in return give you the answer in a short time. MapReduce does not give you an answer as quickly, but it is capable of effectively processing a lot more data. With MapReduce you should process all the data and save the results, then use them as you need them. We can sum up this way of thinking about how to use MapReduce with the slogan "no where clauses".