Tuesday, December 30, 2008

Notes from the Dismal Science

Over the Christmas break, I have been reading "The Return of Depression Economics" by Paul Krugman. There are good reasons why they call Economics the Dismal Science. The worse the financial situation becomes, the more there is to analyze, discuss and comment on. These are exciting times to be an Economist. I will report more on the book later; in the meantime, here are some thoughts on the Dismal Science.

To my delight, the current Wikipedia entry on the Dismal Science calls it a derogatory alternative name for Economics and tries to contrast it with "The Gay Science", the title of a book by the philosopher Nietzsche. The "Gay Science" of the book's title is apparently the technique of poetry writing. The entry also claims that the first reference to the Dismal Science is in a pamphlet published by Thomas Carlyle. The problem with all this is that the dates do not match up. The best-known dismal economist is Malthus, who published the first version of his pamphlet on the impending doom of population explosion in 1798. The Carlyle pamphlet was published in 1849, and Nietzsche's The Gay Science was published in 1882.

One interesting thing about Malthus is his use of mathematical models to explain his thesis. His argument was that population growth is exponential while the growth in food supply is linear, leading future generations to have more and more people fighting for proportionally less food. Economists still use mathematical models to explain their positions, although they have now graduated to sometimes using differential equations to make their point. In my time, I have known several applied mathematicians. Their models are always second-order differential equations, and they always oscillate, just like our financial fortunes.
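
To put the argument in symbols (my notation, not Malthus's own), suppose the population grows at a fixed rate r > 0 while the food supply grows by a fixed amount k each year:

    \[
      P(t) = P_0 (1 + r)^t, \qquad
      F(t) = F_0 + k\,t, \qquad
      \frac{F(t)}{P(t)} \to 0 \ \text{as } t \to \infty,
    \]

so the food available per person eventually shrinks towards nothing, however small the growth rate, which is the dismal conclusion.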

Another branch of mathematics with relevance is Game Theory. Derivative trading is a zero-sum game; that is, one person's gain is another person's loss. I believe that bond trading is a zero-sum game as well. In a properly open market with good information, trading in bonds and derivatives should be a straightforward and relatively low-profit enterprise. For at least 20 years, as depicted first in Bonfire of the Vanities and Liar's Poker, it has been exactly the opposite. The market players have conspired to hide information and keep the market inefficient so that they can reward themselves with enormous profits from trading.

Note that derivative and bond trading are only zero-sum games when the underlying bonds do not default. Adding defaults makes bond trading a manly game where the best can win. Thus, are defaults necessary to justify the profits and bonuses that Wall Street firms have been paying? Is it a matter of: "It is not enough to succeed, others must fail"? Could it be that these ridiculous collateralized debt obligation bonds were deliberately created so that some of them would fail?

Monday, December 29, 2008

More DRM Nonsense

I have just run into another piece of DRM nonsense. As I mentioned before, we made the transition to HDTV far too early and are now stuck with all the annoying outrages that any early adopter has to deal with. Some years ago we bought a new TV/Monitor with a 16:9 aspect ratio and component video input at 1080i. So, for years every TV actress that crossed our screen has had an unnaturally broad bottom, and every late night TV host, an unnaturally broad smile.

This Christmas I was thinking of upgrading our DVD player to one that would up-convert the signal sent to the TV to 1080i. Then I started reading the specs. Apparently, although all recent DVD players tout the ability to up-convert their signal, the movie Nabobs have decreed that DVD players cannot up-convert or otherwise enhance encrypted content if it is sent out in the clear as component video. As all commercial DVDs are encrypted, an "up-converting" DVD player would not up-convert any of the DVD content that we watch.

We live in perilous times and I know that it is my duty as a good citizen to go out and consume. But if I cannot buy a new DVD player that does something more than the one I already have, what is the point of buying it? This Christmas my wallet has remained firmly closed for A/V upgrades.

Friday, December 26, 2008

Functional Programming and Concurrency

With the arrival of more and more cores on the typical processor chip these days, concurrency is becoming a screamingly important topic in programming. Unfortunately, many people seem to think that the solution to concurrent programming is a Functional Programming language. For example, the January issue of Dr. Dobb's Journal has an article, "It's Time to Get Good at Functional Programming", which has received a lot of comment around the web.

I have been doing concurrent programming for a long time, but I have never understood why there is any connection between functional programming and concurrency. Rather than waste a lot of time with specious arguments, let me just give a counterexample. Occam was the first programming language designed specifically for programming concurrent computer systems. It is based on the Communicating Sequential Processes (CSP) notation of Tony Hoare. While Occam has functions, it is by no means a language for functional programming, having eschewed recursion in the first few versions of the language.
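
As a small illustration of the counterexample's spirit, here is a sketch of CSP-style concurrency in plain imperative Python (my own toy example, not Occam syntax): two sequential processes share no state and communicate only over a channel, with no functional programming in sight.

    # Two sequential "processes" (threads) communicating over a channel
    # (a queue), in the CSP style. Illustration only.
    import threading
    import queue

    channel = queue.Queue()

    def producer():
        # A sequential process that sends values down the channel.
        for i in range(5):
            channel.put(i * i)
        channel.put(None)          # sentinel: no more messages

    def consumer():
        # A sequential process that receives values until the sentinel.
        while True:
            value = channel.get()
            if value is None:
                break
            print("received", value)

    p = threading.Thread(target=producer)
    c = threading.Thread(target=consumer)
    p.start(); c.start()
    p.join(); c.join()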

Tuesday, December 23, 2008

Map Reduce Sort Benchmark

In November, Google posted on their blog that they had beaten the Terabyte Sort Benchmark with a time of 68 seconds. I waited to comment on the result until it had been confirmed on the Sort Benchmark Home Page and the technical details published, but this has not happened, so here are some preliminary thoughts.

The Sort Benchmark Home Page has several results for different races. The big division is between Daytona and Indy, where Daytona is for general-purpose software like Map Reduce while Indy is for code specially written for the task. The other dimension is how much is sorted. There are competitions for how many records can be sorted for a penny, and for how many can be sorted in a minute. Then there is the big Kahuna prize: sorting a Terabyte, that is, 10 billion 100-byte records, in the shortest time.

Map Reduce can be used to sort data. (See this previous post for a simple explanation of Map Reduce.) Most of the sorting work is done by the partitioning function that sits between the map part of the process and the reduce part. Normally the partitioning function uses hashing to get reasonably even partition sizes in the face of skewed data. Map Reduce allows you to supply a custom partitioning function, and for sorting, the default partitioning function is replaced by a range partitioning function so that each reduce process gets a set of results in the same range. The reduce process then sorts and outputs its group of results.
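
To make the partitioning step concrete, here is a minimal sketch in Python of the two strategies (my own illustration, not Google's or Hadoop's code; the number of reduce processes and the cut points are made up for the example):

    NUM_REDUCERS = 4

    def hash_partition(key):
        # Default strategy: hash the key so each reducer gets a roughly
        # even share of records, even when the key distribution is skewed.
        return hash(key) % NUM_REDUCERS

    def range_partition(key, boundaries):
        # Sorting strategy: send each key to the reducer that owns its
        # range; boundaries is a sorted list of NUM_REDUCERS - 1 cut points.
        for i, boundary in enumerate(boundaries):
            if key < boundary:
                return i
        return NUM_REDUCERS - 1

    # With range partitioning, each reducer sorts its own partition and the
    # concatenation of the reducers' outputs is globally sorted.
    for key in ["quick", "brown", "fox", "zebra", "apple"]:
        print(key, "-> reducer", range_partition(key, ["d", "m", "t"]))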

Here we come to a little issue. The Sort Benchmark uses artificial data with a randomly generated key that is guaranteed to be evenly distributed throughout the range. Range partitioning is fine for partitioning this synthetic data; however, it will not work so well with real-world skewed data. Thus, while the results are impressive, they should be taken with a pinch of salt.
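
Continuing the toy example (made-up keys again, not benchmark data), a quick count shows what skew does to fixed range boundaries: one reduce process ends up with almost all the work while its neighbours sit idle.

    # With skewed keys, fixed range boundaries overload one reducer.
    from bisect import bisect_left
    from collections import Counter

    boundaries = ["d", "m", "t"]                  # cut points for 4 reducers
    skewed_keys = ["smith"] * 90 + ["jones"] * 10

    # bisect_left picks the range each key falls into.
    sizes = Counter(bisect_left(boundaries, key) for key in skewed_keys)
    print(dict(sizes))                            # {2: 90, 1: 10}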

The Google result seems particularly impressive because last summer Yahoo used Hadoop, their Open Source implementation of Map Reduce, to officially win the Terabyte Sort Benchmark with a time of 209 seconds. There has been plenty of speculation about why the Google result is so much faster than the Hadoop result. Here are some thoughts:
  • Bigger iron. Google have not disclosed the details of their sort (from what I can find), but their post suggests that they used 12 disks per system, as opposed to Yahoo with 4 disks per system. The total time is short, so the difference in I/O capacity could make a big difference. The Google systems may have had more memory and other resources per node as well.
  • Misconfigured Hadoop. The Hadoop sort benchmark disclosure says "I configured the job with 1800 maps", on a 910-node system where each node has 4 dual-core processors! The Hadoop pages say "The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very CPU-light map tasks." The map part of sorting with Map Reduce is a very CPU-light task.
  • Yahoo did not try very hard. They handily beat the previous record of 268 seconds. The benchmark disclosure says "Although I had the 910 nodes mostly to myself, the network core was shared with another active 2000 node cluster, so the times varied a lot depending on the other activity."
  • SSL Communication. Hadoop uses SSL to communicate between nodes. SSL provides good network security; however, it adds some setup time for each node in a communication-intensive task. It is not clear what Google uses for communication between nodes.
Here are a couple of final comments. Firstly, Hadoop is still the official world record holder for the Terabyte Sort. Secondly, a Terabyte is a small amount of data these days. The real point of the Google post was to say that they had sorted a Petabyte in 6.8 hours. Now that is a real sort benchmark.