Friday, November 28, 2008

Spotfire

Christian Marcazzo of Spotfire spoke at the November meeting of the SDForum BI SIG. Sandeep Giri posted an excellent description of the meeting. Here I want to try to understand where Spotfire fits into the arc of Business Intelligence tools.

Spotfire has its origins in academic Computer Human Interface (CHI) research. Christian made fun of typical stupid dashboard tricks like thermometer and speedometer indicators that take up a lot of space and tell us little. In this sense Spotfire is like Tableau, which is also based on academic research. However, they attack different markets. While Tableau is primarily an end user tool, Spotfire is an enterprise solution, with several different types of server to support more or less sophisticated use of the client.

Christian works with pharmaceutical customers, an important customer base for Spotfire. The examples he showed were all straightforward uses of Spotfire in sales and marketing; however, he told us that in pharmaceuticals their largest group of users is in research, the next largest in clinical trials, and only then sales and marketing.

Spotfire supports sophisticated data analysis. In the meeting, I asked Christian how they compare to SAS or SPSS. He did not answer the question directly; instead he told us that Spotfire has recently integrated with the S+ programming language. In his view the future of sophisticated analytics is the R programming language, the Open Source version of S.

Sunday, November 23, 2008

Fear and Loathing in my 401K

Every so often you need to sit back and take a more detached look at what is going on, particularly when there seems to be a new event every day. The current issue is the all-engulfing financial crisis. Taking the long view allows you to look past the current pain in your 401K. Like many others, I know that I am not going to be retiring any time soon. I have commented previously on the risk of unregulated markets; here are some more thoughts.

One big question is who is responsible. One group of people is working hard to establish that it is not the responsibility of the current hapless President, but something that was foisted on him by his wily predecessor and the Democratic Congress from way back when. Well, if you believe that the role of government is to stand aside and let events unfold, as the administration has on several occasions, then it is clearly not the responsibility of George W Bush. On the other hand, if you believe that the role of government is to at the very least steady the tiller, then the current administration has been asleep at the wheel.

Another thought is that nobody is responsible; it is just a natural consequence of a complex financial system. Several people have commented that the complex derivatives made the system less volatile, but that they also increased the probability of a huge collapse. I was reminded of a blog post from some time ago that referenced an IEEE Spectrum article arguing that electrical blackouts are inevitable. This was after the big blackout of 2003. If the conclusion is that the highly regulated and controlled electrical industry will have a big blackout every 35 years or so, then the loosely regulated financial industry is also bound to have big blackouts every so often. Chaos theory rules.

Here are some of the people I have been following:
  • Paul Krugman, recent Nobel prize winner, called the problem in the housing market in 2005. He is a careful person who takes care that what he says is totally defensible, something that I am sure infuriates his many detractors.
  • Andrew Leonard on How the World Works pulls together a lot of interesting ideas. He muses on everything from the demise of petro-empires to the demise of his bank: Washington Mutual.
  • Igor Greenwald's Taking Stock blog on Smart Money. The latest word from someone close to the trading floor on Wall Street.
  • Michael Lewis left the money business and wrote Liar's Poker because he wanted to write and did not believe that the decade of greed could continue. Well, the financial world took another 20 years before it blew itself up, and Michael Lewis has just written a great retrospective article for Portfolio.

Sunday, November 16, 2008

Map-Reduce versus Relational Database

In the previous post I said that Map-Reduce is just the old database concept of aggregation rewritten for extremely large scale data. To understand this, let's look at the example in that post and see how it would be implemented in a relational database. The problem is to take the World Wide Web and, for each web page, count the number of different domains that link to that page from other web pages.

As a starting point, let us consider using the same data structure, a two column table where the first column contains the URL of each web page and the second column contains the contents of that web page. However, this immediately presents a problem. Data in a relational database should be normalized, and the first rule of normalization is that each data item should be atomic. While there is some argument as to exactly what atomic means, everyone would agree that the contents of a web page with multiple links to other web pages is not an atomic data item, particularly if we are interested in those links.

The obvious relational data structure for this data is a join table with two columns. One column, called PAGE_URL, contains the URL of the page. The other column, called LINK_URL, contains the URLs of links on the corresponding page. There is one row in this table (called WWW_LINKS) for every link in the World Wide Web. Given this structure we can write the following SQL query to solve the problem in the example (presuming a function called getdomain that returns the domain part of a URL):

SELECT LINK_URL, count(distinct getdomain(PAGE_URL))
FROM WWW_LINKS
GROUP BY LINK_URL

The point of this example is to show that Map-Reduce and SQL aggregate functions both address the same kind of data manipulation. My belief is that most Map-Reduce problems can be similarly expressed by database aggregation. However, there are differences. Map-Reduce is obviously more flexible and places fewer constraints on how the data is represented.
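
To make the comparison concrete, here is a rough sketch of the same aggregation in Map-Reduce style, written against ordinary Scala collections rather than a real Map-Reduce framework. The names wwwLinks, pageUrl and linkUrl are my own for illustration, and the inline domain extraction stands in for the getdomain function assumed above; a real implementation would run the same grouping and counting partitioned across many machines.

// wwwLinks plays the role of the WWW_LINKS table: one (pageUrl, linkUrl) pair per link.
def countReferringDomains(wwwLinks: Seq[(String, String)]): Map[String, Int] =
  wwwLinks
    .groupBy { case (_, linkUrl) => linkUrl }            // GROUP BY LINK_URL
    .map { case (linkUrl, rows) =>
      val domains = rows.map { case (pageUrl, _) =>
        pageUrl.stripPrefix("http://").stripPrefix("https://").takeWhile(_ != '/')
      }
      linkUrl -> domains.distinct.size                   // count(distinct getdomain(PAGE_URL))
    }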

I strongly believe that every programmer should understand the principles of data normalization and why it is useful, but I am willing to be flexible when it comes to practicalities. In this example, if the WWW_LINKS table is a useful structure that is used in a number of different queries, then it is worth building. However, if the only reason for building the table is to do one aggregation on it, the Map-Reduce solution is better.

Tuesday, November 11, 2008

Understanding Map-Reduce

Map-Reduce is the hoopy new data management function. Google produced the seminal implementation. Start-ups are jumping on the gravy train. The old guard decry it. What is it? In my opinion it is just the old database concept of aggregation rewritten for extremely large scale data, as I will explain in another post. But first we need to understand what Map-Reduce does, and I have yet to find a good clear explanation, so here goes mine.

Map-Reduce is an application for performing analysis on very large data sets. I will give a brief explanation of what Map-Reduce does conceptually and then give an example. The Map-Reduce application takes three inputs. The first input is a map (note lower case). A map is a data structure, sometimes called a dictionary. A tuple is a pair of values and a map is a set of tuples. The first value in a tuple is called the key and the second is called the value. Each key in a map is unique. The second input to Map-Reduce is a Map function (note upper case). The Map function takes as input a tuple (k1, v1) and produces a list of tuples (list(k2, v2)) from data in its input. Note that the list may be empty or contain only one value. The third input is a Reduce function. The Reduce function takes a tuple where the value is a list of values and returns a tuple. In practice it reduces the list of values to a single value.
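
As a sketch of the shapes involved, the two user-supplied functions can be written as Scala signatures. The trait and type parameter names below (K1, V1, K2, V2, V3) are placeholders of my own invention for this description, not part of any real Map-Reduce framework:

// Sketch only: the shapes of the two functions supplied to Map-Reduce.
trait MapReduceJob[K1, V1, K2, V2, V3] {
  // Map: one input tuple in, a (possibly empty) list of intermediate tuples out.
  def mapFn(key: K1, value: V1): List[(K2, V2)]
  // Reduce: a key with all the values collected for it, reduced to a single output tuple.
  def reduceFn(key: K2, values: List[V2]): (K2, V3)
}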

The Map-Reduce application takes the input map and applies the Map function to each tuple in that map. We can think of it creating an intermediate result that is a single large list built from the lists produced by each application of the Map function:
{ Map(k1, v1) } -> { list(k2, v2) }
Then for each unique key in the intermediate result list it groups all the corresponding values into a list associated with the key value:
{ list(k2, v2) } -> { (k2, list(v2)) }
Finally it goes through this structure and applies the Reduce function to the value list in each element:
{ Reduce(k2, list(v2)) } -> { (k2, v3) }
The output of Map-Reduce is a map.
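
In-memory, the whole pipeline can be sketched in a few lines of Scala, reusing the MapReduceJob trait from the sketch above. This only illustrates the data flow; a real Map-Reduce implementation performs the same three phases partitioned and sorted across many machines:

// Run the three conceptual phases over an in-memory input map.
def runMapReduce[K1, V1, K2, V2, V3](
    input: Map[K1, V1],
    job: MapReduceJob[K1, V1, K2, V2, V3]): Map[K2, V3] = {
  // Phase 1: apply the Map function to every tuple and concatenate the resulting lists.
  val intermediate: List[(K2, V2)] =
    input.toList.flatMap { case (k1, v1) => job.mapFn(k1, v1) }
  // Phase 2: group the intermediate values by key.
  val grouped: Map[K2, List[V2]] =
    intermediate.groupBy(_._1).map { case (k2, pairs) => (k2, pairs.map(_._2)) }
  // Phase 3: apply the Reduce function to each key and its list of values.
  grouped.map { case (k2, values) => job.reduceFn(k2, values) }
}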

Now for an example. In this application we are going to take the World Wide Web and for each web page count the number of other domains that reference that page. A domain is the part of a URL between the first two sets of slashes. For example, the domain of this web page is "www.bandb.blogspot.com". A web page is uniquely identified by its URL, so a URL is a good key for a map. The data input is a map of the entire web. The key for each map element is the URL of the page, and the value is the corresponding web page. Now I know that this is a data structure on a scale that is difficult to imagine, however this is the kind of data that Google has to process to organize the world's information.

The Map function takes the URL and web page pair and adds an element to its output list for every URL that it finds on the web page. The key in the output list element is the URL found on the web page and the value is the domain of the input URL. So for example, on this page, our Map function finds the link to the Google paper on Map-Reduce and adds to its list of outputs the tuple ("research.google.com/archive/mapreduce.html", "www.bandb.blogspot.com"). Map-Reduce reorganizes its intermediate data so that for each URL it collects all the domains that reference that page and stores them as a list. The Reduce function goes through the list of domains and counts the number of different domain values that it finds. The result of Map-Reduce is a map where the key is a URL and the value is a number, the number of other domains on the web that reference that page.
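
As a sketch, the two functions for this example might look as follows. Here findLinks stands in for whatever HTML parsing pulls the link URLs out of a page, and getDomain is a crude version of the domain extraction described above; both names are my own, not anything from the Google paper.

// Stand-in for real HTML parsing: extract the link URLs found on a page.
def findLinks(pageContents: String): List[String] = Nil  // stub only

// Crude domain extraction: the part of the URL before the first single slash.
def getDomain(url: String): String =
  url.stripPrefix("http://").stripPrefix("https://").takeWhile(_ != '/')

// Map: for each link found on the page, emit (link URL, domain of the referring page).
def mapFn(pageUrl: String, pageContents: String): List[(String, String)] =
  findLinks(pageContents).map(linkUrl => (linkUrl, getDomain(pageUrl)))

// Reduce: count the distinct referring domains collected for one URL.
def reduceFn(linkUrl: String, referringDomains: List[String]): (String, Int) =
  (linkUrl, referringDomains.distinct.size)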

While this example is invented, Google reports that they use a sequence of 5 to 10 such Map-Reduce steps to generate their web index. The point of Map-Reduce is that a user can write a couple of simple functions and have them applied to data on a vast scale.

Saturday, November 08, 2008

Leonardo at The Tech

We visited the Leonardo exhibition at The Tech this afternoon. It is a huge exhibition. They suggest that you allow 2 hours for the tour. We were there for two hours and we rushed through the second half to such an extent that I will go back and do it again. The exhibition starts with Brunelleschi's Dome for the Duomo in Florence. Leonardo was an apprentice in Florence towards the end of its construction and it started his interest in mechanics.

After wandering through many halls of mechanical inventions, we came to the anatomy room where Leonardo takes his knowledge of mechanics and applies it to understanding how the human body works. It was after this that we had to pick up the pace just as the exhibits started to get really interesting. The exhibition then goes into his more cultural side which includes his painting and sculpture.

One thing that I got from the painting displays is that Leonardo's knowledge of both mechanics and anatomy informed his paintings. For example, there is an interesting display on the dynamics of the characters in The Last Supper. There is another display on his studies into understanding faces, expressions and the muscles that are used to form facial expressions. So, the inscrutable expression on Mona Lisa's face is no accident (this is my surmise; I did not see a reference to the Mona Lisa in the exhibition).

I highly recommend that you see the Leonardo exhibition if you can, and suggest that you allow several hours to see it all properly. Also, do not spend too much time on the mechanics. It is a necessary introduction to understanding how Leonardo viewed the world but it is also important to see how he applied all this knowledge.

Sunday, November 02, 2008

Financial Data Integration

Suzanne Hoffman of Star Analytics spoke to the October meeting of the SDForum Business Intelligence SIG on "Financial Data Integration". There were two aspects of her talk that particularly interested me. The first aspect was that Suzanne has been doing what we now call Enterprise Performance Management ever since her first job, 30 years ago, and she peppered her talk with a lot of interesting historical perspectives and anecdotes.

The most important anecdote relates to Ted Codd, inventor of the relational model for databases and author of the 12 rules that defined what a relational database is. Later Codd coined the term OLAP for analytic processing and published 12 rules that defined OLAP. Unfortunately, the 12 rules for OLAP were not well regarded, as they were not as crisp as the 12 rules for a relational database, and people found out that Codd had been paid a large sum of money by an OLAP software vendor for writing them. Suzanne confirmed that the software vendor was Arbor Software and the money was $25,000.

The second interesting aspect of Suzanne's talk was the idea that data can get trapped in OLAP systems. An OLAP system holds data in a multi-dimensional cube for analysis, so it is close to an end user presentation tool. OLAP is heavily used for financial analysis and modelling. The Hyperion, now Oracle, EssBase server is the king of the hill in dealing with large data cubes. Suzanne reported that the largest cube she knew of was at Ford. It had 50 dimensions, with the largest dimension having a million members.

We have systems to get data into OLAP cubes so that financial analysts can do their work, but when the work is done, there is no way to get the data out again so that it can be used in other parts of a business. In my opinion, a Business Intelligence system can and should be constructed so that the data in OLAP cubes is sourced from a data warehouse and is not just lost in the OLAP server. However, this approach may limit the size of the OLAP cubes that can be built. Anyway, many large companies have already bought high end OLAP servers and their data is trapped in there. The purpose of the Star Analytics integration server is to get that data out.

Saturday, November 01, 2008

The Scala Programming Language

These days feel like the 1980's as far as programming languages are concerned, with new programming languages springing up all over the place. Then, the prolific Niklaus Wirth invented a new programming language every other year. Now, the center of language design in Switzerland has moved to Lausanne, where Martin Odersky at EPFL has conceived Scala. Bill Venners of Artima introduced the Scala Programming Language at the October meeting of the SDForum Java SIG.

Scala is a functional language, in the sense that every "statement" produces a value. Also, Scala is a statically typed language, although programs look like they were written in a dynamic language. The trick is that variables are declared by a 'var' declaration, and the type of the variable is the type of the initial value assigned to it. Contrast this with a dynamic language, where the data type is associated with the data value and every operation on data has to look at the data types of the operands to decide what to do.
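
A minimal sketch of what that looks like (the variables here are invented for illustration):

var count = 0            // inferred to be an Int from the initial value
var greeting = "hello"   // inferred to be a String
count = count + 1        // fine: count is still an Int
// greeting = 42         // rejected at compile time: greeting's type was fixed at declaration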

Getting the data type from the value assigned reduces the need to over-specify types, as is typical of statically typed languages like Java. Bill recalled the discussion of Duh typing that a group of us had after the last time he spoke to an SDForum SIG. The other thing that Scala makes easy is declaring immutable variables. They are like 'var' variables except they are introduced by the keyword 'val'. Contrast this with Java, where you put final before the declaration, or C++, where you put const before the declaration. Thus a constant in Java is declared something like this:
final static String HELLO_WORLD = "Hello World";
while in Scala the declaration looks like this:
val HELLO_WORLD = "Hello World"
This leads to a more declarative style of programming, which is a good thing. Bill reported that while in Java 95% of declarations are variables and the other 5% are constants, in Scala 95% of declarations are constants and only 5% are variables. I have used a similar style of programming in C++ when using APIs that make heavy use of const, so that you have to declare the variables that you are going to pass to the API as const. The only time that this is a problem is where you have to create const objects that can throw in their constructor. Then you can end up with heavily indented try blocks as you create each const object safely so that you can pass it to the API.
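
Here is a small invented example of what that mostly-constants style looks like: every intermediate result gets a name with 'val' and nothing is ever reassigned.

val words = List("the", "quick", "brown", "fox")
val lengths = words.map(_.length)
val longest = lengths.max
val summary = "the longest of " + words.size + " words has " + longest + " letters"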

Finally, Scala, like many other languages these days, compiles to the Java Virtual Machine. That way, it is broadly portable, and developers have access to the vast Java libraries.
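
For example, a plain Java collection class can be used directly from Scala code (a trivial sketch, with invented names):

import java.util.ArrayList

val names = new ArrayList[String]()   // a standard Java class, used from Scala
names.add("Martin")
names.add("Bill")
println(names.size())                 // prints 2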