Tuesday, December 30, 2008

Notes from the Dismal Science

Over the Christmas break, I have been reading "The Return of Depression Economics" by Paul Krugman. There are good reasons why they call Economics the Dismal Science. The worse the financial situation becomes, the more there is to analyze, discuss and comment on. These are exciting times to be an Economist. I will report more on the book later; in the meantime, here are some thoughts on the Dismal Science.

To my delight, the current Wikipedia entry on the Dismal Science calls it a derogatory alternative name for Economics and tries to contrast it with "The Gay Science", the title of a book by the philosopher Nietzsche. The "Gay Science" of the book's title is apparently the technique of poetry writing. It also claims that the first reference to the Dismal Science is in a pamphlet published by Thomas Carlyle. The problem with all this is that the dates do not match up. The best known dismal economist is Malthus, who published the first version of his pamphlet on the impending doom of population explosion in 1798. The Carlyle pamphlet was published in 1849 and Nietzsche's The Gay Science was published in 1882.

One interesting thing about Malthus is his use of mathematical models to explain his thesis. His argument was that population growth is exponential while the growth in food supply is linear, leading future generations to have more and more people fighting for proportionally less food. Nowadays economists still use mathematical models to explain their positions, although they have now graduated to sometimes using differential equations to make their point. In my time, I have known several applied mathematicians. Their models are always second order differential equations and they always oscillate, just like our financial fortunes.
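To make Malthus's arithmetic concrete, here is a minimal sketch in Java. The numbers are invented purely for illustration (Malthus gave no such precise rates): the population doubles each generation while the food supply grows by a fixed increment, so food per head inevitably collapses.

public class Malthus {
    public static void main(String[] args) {
        double population = 100.0;  // arbitrary starting units of population
        double food = 100.0;        // enough food for 100 units of population
        for (int generation = 0; generation <= 8; generation++) {
            System.out.printf("generation %d: population %.0f, food for %.0f, food per head %.2f%n",
                    generation, population, food, food / population);
            population *= 2.0;      // geometric (exponential) growth
            food += 100.0;          // arithmetic (linear) growth
        }
    }
}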

Another branch of mathematics with relevance is Game Theory. Derivative trading is a zero sum game. That is, one person's gain is another person's loss. I believe that bond trading is a zero sum game as well. In a properly open market with good information, trading in bonds and derivatives should be a straightforward and relatively low profit enterprise. For at least 20 years, as depicted first in Bonfire of the Vanities and Liar's Poker, it has been exactly the opposite. The market players have conspired to hide information and keep the market inefficient so that they can reward themselves with enormous profits from trading.

Note that derivative and bond trading is only a zero sum game when the underlying bonds do not default. Adding defaults makes bond trading a manly game where the best can win. Thus, are defaults necessary to justify the profits and bonuses that Wall Street firms have been paying? Is it a matter of: "It is not enough to succeed, others must fail"? Could it be that these ridiculous collateralized debt obligation bonds were deliberately created so that some of them would fail?

Monday, December 29, 2008

More DRM Nonsense

I have just run into another piece of DRM nonsense. As I mentioned before, we made the transition to HDTV far too early and are now stuck with all the annoying outrages that any early adopter has to deal with. Some years ago we bought a new TV/Monitor with a 16:9 aspect ratio and component video input at 1080i. So, for years every TV actress that crossed our screen has had an unnaturally broad bottom, and every late night TV host, an unnaturally broad smile.

This Christmas I was thinking of upgrading our DVD player to one that would up-convert the signal sent to the TV to 1080i. Then I started reading the specs. Apparently, although all recent DVD players tout the ability to up-convert their signal, the movie Nabobs have decreed that DVD players cannot up-convert or otherwise enhance encrypted content if it is sent out in the clear as component video. As all commercial DVDs are encrypted, an "up-converting" DVD player would not up-convert any of the DVD content that we watch.

We live in perilous times and I know that it is my duty as a good citizen to go out and consume. But if I cannot buy a new DVD player that does something more than the one I already have, what is the point of buying it? This Christmas my wallet has remained firmly closed for A/V upgrades.

Friday, December 26, 2008

Functional Programming and Concurrency

With the arrival of more and more cores on the typical processor chip these days, concurrency is becoming a screamingly important topic in programming. Unfortunately, many people seem to think that the solution to concurrent programming is a Functional Programming language. For example, in the January issue of Dr. Dobb's Journal there is an article titled "It's Time to Get Good at Functional Programming" which has received a lot of comment around the web.

I have been doing concurrent programming for a long time, but I have never understood why there is any connection between functional programming and concurrency. Rather than waste a lot of time with specious arguments, let me just give a counterexample. Occam was the first programming language designed specifically for programming concurrent computer systems. It is based on the Communicating Sequential Processes (CSP) notation of Tony Hoare. While Occam has functions, it is by no means a language for functional programming, having eschewed recursion in the first few versions of the language.
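To show what the CSP style looks like without any functional programming, here is a minimal sketch in Java (my own illustration, nothing to do with Occam syntax). Two sequential processes share no mutable state; they communicate only over a synchronous channel, which is the CSP rendezvous idea. A SynchronousQueue makes the sender block until the receiver takes the value.

import java.util.concurrent.SynchronousQueue;

public class CspStyle {
    public static void main(String[] args) throws InterruptedException {
        // An unbuffered channel: put() blocks until take() happens, and vice versa.
        final SynchronousQueue<Integer> channel = new SynchronousQueue<Integer>();

        Thread producer = new Thread(new Runnable() {
            public void run() {
                try {
                    for (int i = 1; i <= 5; i++) {
                        channel.put(i);   // rendezvous with the consumer
                    }
                    channel.put(-1);      // sentinel marking the end of the stream
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        Thread consumer = new Thread(new Runnable() {
            public void run() {
                try {
                    int value;
                    while ((value = channel.take()) != -1) {
                        System.out.println("received " + value);
                    }
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt();
                }
            }
        });

        producer.start();
        consumer.start();
        producer.join();
        consumer.join();
    }
}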

Tuesday, December 23, 2008

Map Reduce Sort Benchmark

In November Google posted in a blog that they had beaten the Terabyte Sort Benchmark with a time of 68 seconds. I waited to comment on the result until it was confirmed on the Sort Benchmark Home Page and the technical details were published, but neither has happened, so here are some preliminary thoughts.

The Sort Benchmark Home Page has several results for different races. The big division is between Daytona and Indy, where Daytona is for general purpose software like Map Reduce while Indy is for code specially written for the task. The other dimension is how much is sorted. There are competitions for how many records can be sorted for a penny, and in a minute. Then there is the big Kahuna prize: sorting a Terabyte, that is, 10 billion 100-byte records, in the shortest time.

Map Reduce can be used to sort data. (See this previous post for a simple explanation of Map Reduce.) Most of the sorting work is done by the partitioning function that sits between the map part of the process and the reduce part. Normally the partitioning function uses hashing to get reasonably even partition sizes in the face of skewed data. Map Reduce allows you to supply a custom partitioning function, and for sorting, the default partitioning function is replaced by a range partitioning function so that each reduce process gets a set of results in the same range. The reduce process then sorts and outputs its group of results.
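To make the difference concrete, here is a rough sketch in Java of the two partitioning strategies (my own illustration, not Google's code and not the actual Hadoop Partitioner API). The hash version spreads keys evenly regardless of their values; the range version looks up a table of pre-sampled split points so that reducer i receives only keys in its contiguous range, and concatenating the sorted reducer outputs in partition order then gives one globally sorted file.

import java.util.Arrays;

public class PartitionSketch {

    // Default-style partitioning: hash the key so records spread evenly.
    static int hashPartition(byte[] key, int numReducers) {
        return (Arrays.hashCode(key) & Integer.MAX_VALUE) % numReducers;
    }

    // Sort-style partitioning: binary search a sorted table of split points
    // (numReducers - 1 of them) so each reducer gets a contiguous key range.
    static int rangePartition(byte[] key, byte[][] splitPoints) {
        int lo = 0;
        int hi = splitPoints.length;
        while (lo < hi) {
            int mid = (lo + hi) >>> 1;
            if (compareUnsigned(key, splitPoints[mid]) < 0) {
                hi = mid;
            } else {
                lo = mid + 1;
            }
        }
        return lo;  // partition index in 0 .. numReducers - 1
    }

    // Lexicographic comparison of keys as unsigned bytes.
    static int compareUnsigned(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            int diff = (a[i] & 0xff) - (b[i] & 0xff);
            if (diff != 0) {
                return diff;
            }
        }
        return a.length - b.length;
    }
}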

Here we come to a little issue. The Sort Benchmark uses artificial data with a randomly generated key that is guaranteed to be evenly distributed throughout the range. Range partitioning is fine for partitioning this synthetic data; however, it will not work so well with real world skewed data. Thus, while the results are impressive they should be taken with a pinch of salt.

The Google result seems particularly impressive, because last summer Yahoo had used Hadoop, the Open Source implementation of Map Reduce, to officially win the Terabyte Sort Benchmark with a time of 209 seconds. There has been plenty of speculation about why the Google result is so much faster than the Hadoop result. Here are some thoughts:
  • Bigger iron. Google have not disclosed the details of their sort (from what I can find), but their post suggests that they used 12 disks per system, as opposed to Yahoo with 4 disks per system. The total time is short so the difference in IO capacity could make a big difference. The Google systems may have had more memory and other resources per node as well.
  • Misconfigured Hadoop. The Hadoop Sort benchmark disclosure says "I configured the job with 1800 maps", on a 910 node system where each node has 4 dual core processors! The Hadoop pages say "The right level of parallelism for maps seems to be around 10-100 maps per-node, although it has been set up to 300 maps for very CPU-light map tasks." The map part of sorting with Map-Reduce is a very CPU-light task.
  • Yahoo did not try very hard. They handily beat the previous record of 268 seconds. The benchmark disclosure says "Although I had the 910 nodes mostly to myself, the network core was shared with another active 2000 node cluster, so the times varied a lot depending on the other activity."
  • SSL Communication. Hadoop uses SSL to communicate between nodes. SSL provides good network security; however, it has some setup time for each node in a communication intensive task. It is not clear what Google uses for communication between nodes.
Here are a couple of final comments. Firstly, Hadoop is still the official world record holder for the Terabyte Sort. Secondly, a Terabyte is a small amount of data these days. The real point of the Google post was to say that they had sorted a Petabyte in 6.8 hours. Now that is a real sort benchmark.

Friday, November 28, 2008

SpotFire

Christian Marcazzo of Spotfire spoke to the SDForum BI SIG November meeting. Sandeep Giri posted an excellent description of the meeting. Here I want to try and understand where Spotfire fits into the arc of Business Intelligence tools.

The origins of Spotfire are in academic Computer-Human Interaction (CHI) research. Christian made fun of typical stupid dashboard tricks like thermometer and speedometer indicators that take up a lot of space and tell us little. In this sense Spotfire is like Tableau, which is also based on academic research. However they attack different markets. While Tableau is primarily an end user tool, Spotfire is an enterprise solution, with various different types of server to support more or less sophisticated use of the client.

Christian works with Pharmaceutical customers, an important customer base for Spotfire. The examples he showed were all straightforward uses of Spotfire in sales and marketing; however, he told us that in pharmaceuticals their largest group of users is in research, the next largest in clinical trials, and only then in sales and marketing.

Spotfire supports sophisticated data analysis. In the meeting, I asked Christian how they compare to SAS or SPSS. He did not answer the question directly; instead he told us that Spotfire has recently integrated with the S+ programming language. In his view the future of sophisticated analytics is the R programming language, the Open Source version of S.

Sunday, November 23, 2008

Fear and Loathing in my 401K

Every so often you need to sit back and take a more detached look at what is going on, particularly when there seems to be a new event every day. The current issue is the all engulfing financial crisis. Taking the long view allows you to look past the current pain in your 401K. Like many others, I know that I am not going to be retiring any time soon. I have commented previously on the risk of unregulated markets; here are some more thoughts.

One big question is who is responsible. One group of people are working hard to establish that it is not the responsibility of the current hapless President, but is something that was foisted on him by his wily predecessor and the Democratic Congress from way back when. Well, if you believe that the role of government is to stand aside and let events unfold, as the administration has on several occasions, then it is clearly not the responsibility of George W Bush. On the other hand, if you believe that the role of government is to at the very least steady the tiller, then the current administration has been asleep at the wheel.

Another thought is that nobody is responsible; it is just a natural consequence of a complex financial system. Several people have commented that the complex derivatives made the system less volatile; however, they also increased the probability of a huge collapse. I was reminded of a blog post from some time ago that referenced an IEEE Spectrum article arguing that electrical blackouts are inevitable. This was after the big blackout of 2003. If the conclusion is that the highly regulated and controlled electrical industry will have a big blackout every 35 years or so, then the loosely regulated financial system is also bound to have big blackouts every so often. Chaos theory rules.

Here are some of people who I have been following:
  • Paul Krugman, recent Nobel prize winner, called the problem in the housing market in 2005. He is a careful person who makes sure that what he says is totally defensible, something that I am sure infuriates his many detractors.
  • Andrew Leonard on How the World Works pulls together a lot of interesting ideas. He muses on everything from the demise of petro-empires to the demise of his bank: Washington Mutual.
  • Igor Greenwald's Taking Stock blog on Smart Money. The latest word from someone close to the trading floor on Wall Street.
  • Michael Lewis left the money business and wrote Liar's Poker because he wanted to write and did not believe that the decade of greed could continue. Well, the financial world took another 20 years before it blew itself up, and Michael Lewis has just written a great retrospective article for Portfolio.

Sunday, November 16, 2008

Map-Reduce versus Relational Database

In the previous post I said that Map-Reduce is just the old database concept of aggregation rewritten for extremely large scale data. To understand this, let's look at the example in that post, and see how it would be implemented in a relational database. The problem is to take the World Wide Web and for each web page count the number of different domains whose pages contain links to that page.

As a starting point, let us consider using the same data structure, a two column table where the first column contains the URL of each web page and the second column contains the contents of that web page. However, this immediately presents a problem. Data in a relational database should be normalized, and the first rule of normalization is that each data item should be atomic. While there is some argument as to exactly what atomic means, everyone would agree that the contents of a web page with multiple links to other web pages is not an atomic data item, particularly if we are interested in those links.

The obvious relational data structure for this data is a join table with two columns. One column, called PAGE_URL, contains the URL of the page. The other column, called LINK_URL, contains the URLs of links on the corresponding page. There is one row in this table (called WWW_LINKS) for every link in the World Wide Web. Given this structure we can write the following SQL query to solve the problem in the example (presuming a function called getdomain that returns the domain from a URL):

SELECT LINK_URL, count(distinct getdomain(PAGE_URL))
FROM WWW_LINKS
GROUP BY LINK_URL

The point of this example is to show that Map-Reduce and SQL aggregate functions both address the same kind of data manipulation. My belief is that most Map-Reduce problems can be similarly expressed by database aggregation. However there are differences. Map-Reduce is obviously more flexible and places fewer constraints on how the data is represented.

I strongly believe that every programmer should understand the principles of data normalization and why it is useful, but I am willing to be flexible when it comes to practicalities. In this example, if the WWW_LINKS table is a useful structure that is used in a number of different queries, then it is worth building. However, if the only reason for building the table is to do one aggregation on it, the Map-Reduce solution is better.

Tuesday, November 11, 2008

Understanding Map-Reduce

Map-Reduce is the hoopy new data management function. Google produced the seminal implementation. Start-ups are jumping on the gravy train. The old guard decry it. What is it? In my opinion it is just the old database concept of aggregation rewritten for extremely large scale data as I will explain in another post. But firstly we need to understand what Map-Reduce does, and I have yet to find a good clear explanation, so here goes mine.

Map Reduce is an application for performing analysis on very large data sets. I will give a brief explanation of what Map Reduce does conceptually and then give an example. The Map Reduce application takes three inputs. The first input is a map (note the lower case). A map is a data structure, sometimes called a dictionary. A tuple is a pair of values and a map is a set of tuples. The first value in a tuple is called the key and the second is called the value. Each key in a map is unique. The second input to Map-Reduce is a Map function (note the upper case). The Map function takes as input a tuple (k1, v1) and produces a list of tuples (list(k2, v2)) from data in its input. Note that the list may be empty or contain only one value. The third input is a Reduce function. The Reduce function takes a tuple where the value is a list of values and returns a tuple. In practice it reduces the list of values to a single value.

The Map Reduce application takes the input map and applies the Map function to each tuple in that map. We can think of it as creating an intermediate result that is a single large list built from the lists produced by each application of the Map function:
{ Map(k1, v1) } -> { list(k2, v2) }
Then for each unique key in the intermediate result list it groups all the corresponding values into a list associated with the key value:
{ list(k2, v2) } -> { (k2, list(v2)) }
Finally it goes through this structure and applies the Reduce function to the value list in each element:
{ Reduce(k2, list(v2)) } -> { (k2, v3) }
The output of Map Reduce is a map.

Now for an example. In this application we are going to take the World Wide Web and for each web page count the number of other domains that reference that page. A domain is the part of a URL between the first two sets of slashes. For example, the domain of this web page is "www.bandb.blogspot.com". A web page is uniquely identified by its URL, so a URL is a good key for a map. The data input is a map of the entire web. The key for each map element is the URL of the page, and the value is the corresponding web page. Now I know that this is a data structure on a scale that is difficult to imagine, however this is the kind of data that Google has to process to organize the world's information.

The Map function takes the URL, web page pair and adds an element to its output list for every URL that it finds on the web page. The key in the output list is the URL found on the web page and the value is the domain taken from the input key, that is, the domain of the page being scanned. So for example, on this page, our Map function finds the link to the Google paper on Map Reduce and adds to its list of outputs the tuple ("research.google.com/archive/mapreduce.html", "www.bandb.blogspot.com"). Map-Reduce reorganizes its intermediate data so that for each URL it collects all the domains that reference that page and stores them as a list. The Reduce function goes through the list of domains and counts the number of different domain values that it finds. The result of Map-Reduce is a map where the key is a URL and the value is a number, the number of other domains on the web that reference that page.
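To pin the example down, here is a sketch of the two functions in Java. This is my own illustration of the idea, not Google's code and not a real Map-Reduce API; the Pair class and the getDomain and extractLinks helpers are assumptions made only so that the sketch is self-contained.

import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class DomainCount {

    // Map: input key = URL of a page, input value = contents of that page.
    // Output: one (linkedUrl, referringDomain) tuple per link found on the page.
    static List<Pair<String, String>> map(String pageUrl, String pageContents) {
        List<Pair<String, String>> output = new ArrayList<Pair<String, String>>();
        String referringDomain = getDomain(pageUrl);
        for (String linkedUrl : extractLinks(pageContents)) {
            output.add(new Pair<String, String>(linkedUrl, referringDomain));
        }
        return output;
    }

    // Reduce: input key = a URL, input value = list of domains that link to it.
    // Output: the number of distinct referring domains.
    static Pair<String, Integer> reduce(String linkedUrl, List<String> referringDomains) {
        Set<String> distinct = new HashSet<String>(referringDomains);
        return new Pair<String, Integer>(linkedUrl, distinct.size());
    }

    // --- assumed helpers, shown only to make the sketch self-contained ---

    // Extract the domain: the part of the URL between the first "//" and the next "/".
    static String getDomain(String url) {
        int start = url.indexOf("//");
        start = (start < 0) ? 0 : start + 2;
        int end = url.indexOf('/', start);
        return (end < 0) ? url.substring(start) : url.substring(start, end);
    }

    static List<String> extractLinks(String pageContents) {
        return new ArrayList<String>();  // real HTML link extraction omitted
    }

    static class Pair<A, B> {
        final A first;
        final B second;
        Pair(A first, B second) { this.first = first; this.second = second; }
    }
}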

While this example is invented, Google reports that they use a set of 5 to 10 such Map-Reduce steps to generate their web index. The point of Map Reduce is that a user can write a couple of simple functions and have them applied to data on a vast scale.

Saturday, November 08, 2008

Leonardo at The Tech

We visited the Leonardo exhibition at The Tech this afternoon. It is a huge exhibition. They suggest that you allow 2 hours for the tour. We were there for two hours and we rushed through the second half to such an extent that I will go back and do it again. The exhibition starts with Brunelleschi's Dome for the Duomo in Florence. Leonardo was an apprentice in Florence towards the end of its construction and it started his interest in mechanics.

After wandering through many halls of mechanical inventions, we came to the anatomy room where Leonardo takes his knowledge of mechanics and applies it to understanding how the human body works. It was after this that we had to pick up the pace just as the exhibits started to get really interesting. The exhibition then goes into his more cultural side which includes his painting and sculpture.

One thing that I got from the painting displays is that Leonardo's knowledge of both mechanics and anatomy informed his paintings. For example, there is an interesting display on the dynamics of the characters in The Last Supper. There is another display on his studies into understanding faces, expressions and the muscles that are used to form facial expressions. So, the inscrutable expression on Mona Lisa's face is no accident (this is my surmise; I did not see a reference to Mona Lisa in the exhibition).

I highly recommend that you see the Leonardo exhibition if you can, and suggest that you allow several hours to see it all properly. Also, do not spend too much time on the mechanics. It is a necessary introduction to understanding how Leonardo viewed the world but it is also important to see how he applied all this knowledge.

Sunday, November 02, 2008

Financial Data Integration

Suzanne Hoffman of Star Analytics spoke to the October meeting of the SDForum Business Intelligence SIG on "Financial Data Integration". There were two aspects of her talk that particularly interested me. The first aspect was that Suzanne has been doing what we now call Enterprise Performance Management ever since her first job, 30 years ago, and she peppered her talk with a lot of interesting historical perspectives and anecdotes.

The most important anecdote relates to Ted Codd, inventor of the relational model for databases and author of the 12 rules that defined what a relational database is. Later Codd coined the term OLAP for analytic processing and published 12 rules that defined OLAP. Unfortunately the 12 rules for OLAP were not well regarded, as they were not as crisp as the 12 rules for a relational database, and people found out that Codd had been paid a large sum of money by an OLAP software vendor for writing them. Suzanne confirmed that the software vendor was Arbor Software and the money was $25,000.

The second interesting aspect to Suzanne's talk was the idea that data can get trapped in OLAP systems. OLAP holds data in a multi-dimensional cube for analysis, so it is close to an end user presentation tool. OLAP is heavily used for financial analysis and modelling. The Hyperion, now Oracle, Essbase server is the king of the hill in dealing with large data cubes. Suzanne reported that the largest cube she knew of was at Ford. It had 50 dimensions with the largest dimension having a million members.

We have systems to get data into OLAP cubes so that financial analysts can do their work, but when the work is done, there is no way to get the data out again so that it can be used in other parts of a business. In my opinion, a Business Intelligence system can and should be constructed so that the data in OLAP cubes is sourced from a data warehouse and is not just lost in the OLAP server. However this approach may limit the size of the OLAP cubes that can be built. Anyway, many large companies have already bought high end OLAP servers and their data is trapped in there. The purpose of the Star Analytics integration server is to get that data out.

Saturday, November 01, 2008

The Scala Programming Language

These days feel like the 1980's as far as programming languages are concerned, with new programming languages springing up all over the place. Then, the prolific Niklaus Wirth invented a new programming language every other year. Now, the center of language design in Switzerland has moved to Lausanne, where Martin Odersky at EPFL has conceived Scala. Bill Venners of Artima introduced the Scala Programming Language at the October meeting of the SDForum Java SIG.

Scala is a functional language, in the sense that every "statement" produces a value. Also, Scala is a statically typed language, although programs look like those written in a dynamic language. The trick is that variables are declared by a 'var' declaration, and the type of the variable is inferred from the type of the initial value assigned to it. Contrast this with a dynamic language where the data type is associated with the data value and every operation on data has to look at the data types of the operands to decide what to do.

Getting the data type from the value assigned reduces the need to over-specify types, as is typical of statically typed languages like Java. Bill recalled the discussion of Duh typing that a group of us had after the last time he spoke to an SDForum SIG. The other thing that Scala makes easy is declaring immutable values. They are like 'var' variables except they are introduced by the keyword 'val'. Contrast this with Java where you put final before the declaration or C++ where you put const before the declaration. Thus a constant in Java is declared something like this:
final static String HELLO_WORLD = "Hello World";
while in Scala the declaration looks like this:
val HELLO_WORLD = "Hello World"
This leads to a more declarative style of programming, which is a good thing. Bill reported that while in Java 95% of declarations are variables and the other 5% are constants, in Scala 95% of declarations are constants and only 5% are variables. I have used a similar style of programming in C++ when using APIs that make heavy use of const, so you have to declare the variables that you are going to pass to the API as const. The only time that this is a problem is where you have to create const objects that can throw in their constructor. Then you can end up with heavily indented try blocks as you create each const object safely so that you can pass it to the API.

Finally, Scala, like many other languages these days, compiles to the Java Virtual Machine. That way, it is broadly portable, and developers have access to the vast Java libraries.

Saturday, October 25, 2008

The Google Database System

Google, Yahoo and others are taking the traditional database system and breaking it into pieces. Google has their own set of proprietary database system components. Yahoo is working with the Hadoop Open Source project to make their system available to everyone. I came to this conclusion while doing research for my talk on "Models and Patterns for Concurrency" for the SDForum SAM SIG.

In this post I will talk about what is happening, why it is happening and at the end try to draw some conclusions about future directions for database systems. Note, I am using Google as an example here because their system is described in a set of widely accessible academic papers. Yahoo and many other large scale web sites have adopted a similar approach through use of Hadoop and other Open Source projects.

A traditional database system is a server. It takes care of persistent data storage, metadata management for the stored data, transaction management and querying the data, which includes aggregation. Google has developed its own set of applications which support these same functions, except that instead of wrapping them into a single entity, a database server, Google has developed them as a set of application programs that build on one another.

There are several reasons for Google developing their own database system. Firstly, they are dealing with managing and processing huge amounts of data. Conventional database systems struggle when the data gets really large. In particular, the transaction model that underlies database operation starts to break down. This topic is worth a separate post of its own.

Secondly, their computing system is a distributed system built from thousands of commodity computer systems. Conventional database systems are not designed or tuned to run on this type of hardware. One issue is that at this scale the hardware cannot be assumed to be reliable and the database system has to be designed to work around the unreliable hardware. A final issue is that the cost of software licenses for running a conventional database system on the Google hardware would be prohibitive.

The Google internal applications look like this. At the bottom is Chubby, a lock service that also provides reliable storage for small amounts of data. The Google File System provides file data storage in a distributed system of thousands of computers with local disks. It uses Chubby to ensure that there is a single master in the face of system failures. Bigtable is a system for storing and accessing structured data. While it is not exactly relational, it is comparable to storing data in a single large relational database table. Bigtable stores its data in the Google File System and uses Chubby for several purposes including metadata storage and to ensure atomicity of certain operations.

Finally, Map Reduce is a generalized aggregation engine (and here I mean aggregation in the technical database sense). It uses the Google File System. Map Reduce is surprisingly closely related to database aggregation as found in the SQL language, although it is not usually described in that way. I will discuss this in another post. In the meantime, it is interesting to note that Map Reduce has been subject to a rather intemperate attack by database luminaries David DeWitt and Michael Stonebraker.

In total, these four applications: Chubby, the Google File System, Bigtable and Map Reduce, provide the capabilities of a database system. In practice there are some differences. Users write programs in a language like C++ that integrate the capabilities of these components as they need them. They do not need to use all the components. For example, Google can calculate the page rank for each page on the web as a series of 6 Map Reduce steps, none of which necessarily uses Bigtable.

The concept of a Database System was invented in the late 60's by the CODASYL committee, shortly after their achievement of inventing the COBOL programming language. The Relational model and Transactions came later, however the concept of a server system that owns and manages data and much of the terminology originated with CODASYL. Since then, the world has changed.

Nowadays databases are often hidden behind frameworks such as Hibernate or Ruby on Rails that try to paper over the impedance mismatch between the database model on one side and an object oriented world looking for persistence on the other. These are mostly low end systems. At the other end of the scale are the huge data management problems of Google, Yahoo and other web sites. New companies with new visions of database systems to meet these challenges are emerging. It is an exciting time.

Saturday, October 11, 2008

Its Called Risk, Have You Heard Of It?

Senator Phil Gramm famously called us a "nation of whiners" and he may be right. (Note, while I try to keep this blog about technology, the financial system seems to be so badly broken it is worthy of a comment or two.) I recently ran across a blog post on a financial site called "Our Timid Government Is Killing Us" by Michael Kao, CEO of Akanthos Capital Management. In it he complains about 4 things that the Government has not done to help resolve the financial crisis. I want to concentrate on one of them here: "Problem No. 2: Lehman's bankruptcy has severely eroded confidence between counterparties."

The problem is this. Over the last few weeks, financial institutions have become unwilling to trust one another, and with good cause. The issue is Credit Default Swaps. This is a 60 Trillion dollar market (that is Trillion with a capital T) where financial institutions like banks and hedge funds (the parties) buy and sell insurance policies on bonds. The "This American Life" radio program and Podcast has a very good and understandable explanation of the market and how it came to be.

There are two important things to understand about the credit default swap market. Firstly, it is completely unregulated. Senator Phil Gramm tacked a clause to keep the market unregulated onto an appropriations bill in 2000 that was approved by 95 votes to 0 in the Senate. Secondly, the market is not transparent; that is, the various parties to the market do not know what any other player's position is. Note that these two features are the way the parties in the market wanted it. There has been a great clamor for reducing financial regulation in the last few years.

Lack of transparency was not a problem until Lehman Brothers went bankrupt. They were a big player in the credit default swap market. Now all their credit default swap insurance policies are frozen by their bankruptcy. Anyone who sold a credit default swap policy and then laid off the bet by buying an equivalent credit default swap from Lehman Brothers is now on the hook to pay off the insurance policy on the bond without the compensation of being able to get Lehman Brothers to make good on their policy.

The lack of transparency means that nobody knows for sure about anyone else in the market. That is, anyone could go bankrupt tomorrow because they bought credit default swaps from a bankrupt company like Lehman Brothers and so cannot make good on the credit default swaps that they themselves have sold. Already AIG has needed a huge injection of government money to stay afloat, and others may be suffering as well. But no one knows what positions anyone else holds. So everyone is conserving their cash, not lending it out to anyone else, so that they will not lose it if the other party goes bankrupt. Thus are the credit markets constipated.

A final problem is that, because the market is unregulated, there are no capital requirements to back up a bet. I can sell a credit default swap insurance policy based on my good name. I immediately get a large sum of money which I can register as a profit. It is only later that I have to worry about the bond that I have insured defaulting (what are the chances of that?). This is how the market got to be 60 Trillion dollars in size.

The underlying issue is this. There is a large risk in trading in unregulated markets. The risk is made larger if the market is not transparent, because if one of the parties to the market goes bust nobody knows what their position is worth. These risks were not recognized in the credit default swap market, and policies were sold at far too low a price to reflect them. If the market were regulated, like other markets are, these risks would not be there and the market could deal with usual events like the bankruptcy of a player.

Finally, there is a risk to the nation in allowing an unregulated market to balloon to the size that the credit default swap market has. Bankruptcies happen all the time. The fact that the bankruptcy of one player has caused the entire financial marketplace to go into a swoon is bad for the nation. The players in the credit default swap market asked for an unregulated market and they got what they asked for. Now the risk of having an unregulated market has shown itself and, as Senator Gramm tells them, they should deal with it and stop whining.

I am not an apologist, I am a technologist who is interested in how things work.

Thursday, October 02, 2008

Articulate UML Modeling

Last week, Leon Starr spoke to the SDForum SAM SIG on "Articulate UML Modeling". Leon is an avid modeler and has been using UML for modeling software systems since it was first defined. He believes in building executable models and I applaud him for that. The very act of making something executable ensures that it is in some sense complete and free from many definitional errors. Executing the model allows it to be tested.

There are several advantages to building models rather than programs. A big part of many projects is extracting requirements. Unlike a program, a model can describe requirements in a way that a non-technical user can understand and appreciate, so the user can provide feedback. Another advantage of a model is that it does not arbitrarily constrain the order in which things are done. Essentially, a model is asynchronous and captures opportunities for concurrency in its implementation. This struck a chord with me as I am going to speak to the SAM SIG in October on "Models and Patterns for Concurrency".

The other part of the talk that interested me was Leon's attack on building models as controllers. He gave the example of a laser cutting machine. A common way of modeling this is as a laser cutter controller that interprets patterns. He prefers to see it modeled as patterns that describe how they are cut by the laser cutter. Leon's experience is with modeling software to manage physical systems, like the air traffic controller example that he used to illustrate his talk. His approach is certainly useful for the understanding and analysis of physical systems; however, I have seen the problem argued both ways. It is worth a separate post on the issue.

Monday, September 29, 2008

42 Revisited

Last week TechCrunch had a post on the State of The Blogosphere: The More You Post, The Higher You Rank. One statistic is that the top 100 bloggers post on average 310 times a month, which sounds quite exhausting. As you know, I post 42 times a year. I am going to promise to my faithful reader that I will stick to my pace. You will not get an unreadable avalanche of overlapping verbiage from this blog.

If I have not posted much recently, it is because I have spent a lot of time reading blog posts on the financial crisis. It is very entertaining to see these extraordinary events unfold around us. Who would have thought that George W. Bush would be known to future generations as the President who nationalized the American financial services industry?

Wednesday, September 17, 2008

SaaS Data Integration

Data integration is the problem of gathering data, perhaps from many different applications, for the purpose of doing some analysis of the data as a whole. Mike Pittaro, Co-Founder of SnapLogic, spoke to the SDForum Business Intelligence SIG September meeting on "Enhancing SaaS Applications Through Data Integration with SnapLogic".

The big players in data integration are Informatica and Ascential (now IBM Information Integration), who sell large, expensive and complex products. Because of the cost, these products are often not used, particularly for one-off projects, which are common. Mike helped found SnapLogic in 2005 to bring a new perspective to data integration. SnapLogic is an open source framework and therefore both affordable and extensible by its users.

He showed us the complexity of data integration. It involves dealing with many different access protocols, multiple ways of getting at the data, and a different metadata format describing each type of data. This he contrasted with the World Wide Web, where huge amounts of data are pulled back and forth every day without interoperability problems. There are almost 200 million web sites and billions of users, yet the World Wide Web is completely decentralized, with a heterogeneous model that allows for different operating systems, servers, client software applications and frameworks, and yet they are all compatible and interoperable.

The World Wide Web is based on open standards and protocols and an architectural principle called REST, which stands for REpresentational State Transfer. REST deals with data resources in standardized representations, with each resource identified by a unique identifier such as a URL.

SnapLogic builds on this by turning data sources into standard web resources. With SnapLogic you configure a server to extract data from a data source like a file or database and transform the data into the form you want. The server presents the data source as a standard web resource with a URL. These servers are the building blocks for a data integration application.
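The payoff is that the prepared data can be read by any ordinary HTTP client, with no special driver or protocol. As a minimal sketch (the URL and the line-per-record output are invented for illustration; this is not SnapLogic's actual interface), consuming one of these resources is just an HTTP GET:

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;

public class FetchResource {
    public static void main(String[] args) throws Exception {
        // Hypothetical data-integration resource, exposed as a plain web URL.
        URL resource = new URL("http://integration.example.com/feeds/customers.csv");
        HttpURLConnection conn = (HttpURLConnection) resource.openConnection();
        conn.setRequestMethod("GET");

        BufferedReader in = new BufferedReader(new InputStreamReader(conn.getInputStream()));
        String line;
        while ((line = in.readLine()) != null) {
            System.out.println(line);   // each line is one prepared record
        }
        in.close();
        conn.disconnect();
    }
}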

Thursday, September 04, 2008

Chrome

On Tuesday, Google announced their new browser Chrome. Although it has generated huge discussions in various forums and an astonishing adoption rate, I am not going to rush to use it. In fact, I think I will wait until it is out of beta before considering whether to adopt it. That should give me many years before I have to even think about making a change!

Wednesday, September 03, 2008

A Tale of Two Search Engines

At the SDForum Software Architecture and Modeling (SAM) SIG last week, John D. Mitchel, Mad Scientist at MarkMail and previously Chief Architect of Krugle, talked about the architectures of the search engines that he has built for these two companies.

Krugle is a search engine for code and all related programming artifacts. The public engine indexes all the open source software repositories on the web. This system was built a couple of years ago with cheap off-the-shelf commodity hardware, open source software and Network Attached Storage (NAS). In total it has about 150 computer systems in its clusters. The major software components are Lucene (search engine), Nutch (web crawling, etc.) and Hadoop (distributed file system and map-reduce). These are all Open Source projects written in Java and sponsored by the Apache Software Foundation. Krugle sells an enterprise edition that can index and make available all source code in an enterprise.

MarkMail is a search engine for email. It indexes all public mailing lists and is a technology demonstrator for the MarkLogic XML Content Server. MarkMail is built with newer hardware that is more capable. It uses a Storage Area Network (SAN) for storage, which offers higher performance at a greater cost than NAS. The MarkMail search system is built on about 60 computer systems in its clusters.

Saturday, August 16, 2008

Windows Woes

For years it seemed like a good idea: Microsoft produced the software and many vendors sold compatible hardware. Competition kept the hardware innovation flowing and prices low. Then Microsoft turned into a big bloated monopoly that could not create a decent product if it tried. Moreover, Microsoft is not really in control; it is itself hostage to other interests. The result is a horrible user experience. Here are a couple of my recent experiences.

A few months ago I bought a new video card so that I could use the digital input to the monitor. Installing the card was a breeze and the digital input makes the monitor noticeably sharper. The only problem was that the sound had stopped working. After a couple of hours scratching my head and vigorous Googling, the problem turned out to have been caused by Hollywood.

The connection between a computer and its display uses HDMI, a digital interconnect standard that can transmit both video and audio. This allows a PC to connect to a digital television as well as a simple display. It also allows the video and audio content to be encrypted so that you cannot steal it from your own computer. This was mandated by Hollywood and Microsoft meekly acquiesced to it so that they could provide media center software that would display Hollywood movies in high definition.

So after the video card installation, Windows assumed that I was going to use the digital audio output on the video card and ignored all other audio output devices. This even though my display does not have any speakers. I had to go into the BIOS and change some low level settings for sound so that Windows would allow me to select the sound settings that I had been using before installing the video card. Any time you have to go into the BIOS to change settings, the user experience loses.

More recently my brother and family came to visit during a tour of California. He wanted to unload all the pictures on his camera's flash card and write them to a CD as the flash card was full. I suggested the easy way out, visit Fry's Electronics and buy another flash card, but that was deemed more trouble. In practice it would have been much easier.

We downloaded the flash card to my PC. The first difficulty is that you are presented with a list of 6 competing programs that want to download your pictures. Which one should I use? I know that in practice they are all going to put the pictures in some ridiculous place where you can never find them again (that is the subject of another tirade). I chose the first in the list, which happened to be compatible with the brand of the digital camera.

The next problem came when we went into Windows Explorer so that we could drag the pictures to the CD ROM folder. Every time we went into the folder where the pictures were, Explorer exited saying that it had an unexpected fault. I knew exactly what the problem was because I had seen it before. There were some movie files taken with the digital camera, and Windows has a problem with these movie (.avi) files. For some reason, Explorer tries to open every file in a folder when it enters the folder, even though I set it to just list the files and not display thumbnails.

The fix was to open a DOS window, navigate to the folder with the files and rename them so that Windows would not think they were media files. I added the extension .tmp to each .avi file by laborious typing. Then it was possible to do the intuitive drag and drop with Explorer to make a CD ROM. Any time you have to resort to using a DOS window to do a straightforward function in Windows, usability has gone out the window.

I could go on (as I have in the past), as there have been more problems; however, with each problem the Apple alternative looks better. Apple is by no means perfect, but the Apple OS is built on a better foundation and the changes that it makes when it comes out with a new version are both useful and innovative.

Thursday, July 17, 2008

A Gentle Introduction to R

We were given a gentle introduction to the R statistical programming language and its application in Business Intelligence at the July meeting of the SDForum Business Intelligence SIG. The speakers were Jim Porzac (Senior Director of Analytics at Responsys) and Michael Driscoll (Principal at Dataspora). Jim has posted the presentation here.

R is an Open Source project that uses the GNU license. It has a growing user base with a strong support community and a user group (called the UseR Group - try Googling that). There are now almost 1500 packages for the language that support various statistical techniques and specialized application areas. Packages include: Bayesian, Econometrics, Genetics, Machine Learning, Natural Language Processing, Pharmacokinetics, Psychometrics, which gives some idea of the range of subjects and techniques that R covers.

Jim did most of the talking, introducing the language and showing us some examples of its use. One example is his data quality package that he uses on each new dataset that he receives for analysis at Responsys. Another example showed off reporting capabilities, while a third showed sophisticated graphs and plots used for customer segmentation analysis. Michael showed us how he used R to do some interesting and very practical analyses of baseball statistics.

The audience probed R's strengths and weaknesses. R has the connectivity to get data for analysis from databases and other sources. R also has excellent graphing and reporting capabilities. Currently R works by reading data into memory where it is manipulated, which limits the maximum size of data set that can be analyzed to the many Gigabyte range.

One person asked for a comparison with SAS. R has the advantages of being free, with an enthusiastic user base that keeps it on the cutting edge. Also, R is a more coherent language than SAS, which is a collection of libraries, each of which may be very good, but together they do not necessarily make a whole.

Jim and Michael are starting a Bay Area chapter of the UseR Group. If you are interested, contact Jim Porzac at Responsys.

Wednesday, July 09, 2008

Social Search

The SDForum Search SIG pulled together an A-List panel for their July meeting on Social Search. Moderator Safa Rashtchy hosted Bret Taylor of FriendFeed, Ari Steinberg of Facebook, Jason Calacanis of Mahalo and Jeremie Miller of Wikia Search. Of the panelists, Jason Calacanis had the most to say, was arguably the most interesting and was definitely the most opinionated. He also recorded the event with the camera in his MacBook Air. Valleywag has a better and more concise video excerpt of Jason in action, shaded by their desire to capture controversy.

Facebook and FriendFeed are working on automated search within their social networks, while Mahalo and Wikia Search are working on improving general search by using people to curate the results. Mahalo is paying people, while Wikia Search is trying to use the Wikipedia model of free community involvement.

Most of the audience questions to the panel were about their business models and monetization. I tried to get into technicalities by asking a question about Search Quality, there was a question on privacy, and one audience member argued that none of the panelists' companies were doing social search as he defined it.

Saturday, June 21, 2008

Master Data Management - What, Why, How, Who?

I got two interesting things out of Ramon Chen's talk on Master Data Management (MDM) to the SDForum Business Intelligence SIG June meeting. Ramon is VP Product Marketing at Siperian. The first is the important idea of Data Governance and, as part of governance, the emerging role of the Data Steward. The second is that the big enterprise software vendors are circling.

Large organizations, companies and governments collect vast amounts of information, and Data Governance is the process of looking after that data. First is the problem of cataloging all the data that the organization has. Next, there may be different versions of the same data that need to be reconciled, and the quality of the data needs to be ensured. Finally there is the question of deciding who has access to different parts of the data and ensuring that it is correctly secured. A Data Steward is a person who is responsible for some part of the data.

Ramon had some specific examples of problems with data. One is in the Medical field where gifts to doctors are highly regulated. The problem is in identifying a specific doctor particularly where a father and son with similar names may share a practice, which is not uncommon. Another problem is security. Siperian has implemented security down to the cell level to ensure that each user can only see data that they are allowed to look at.

Ramon also described how MDM software vendors are being consolidated by data providers and the big enterprise software vendors. For example, Purisma, who presented to the BI SIG a couple of years ago, was bought by Dun & Bradstreet last year. IBM has been particularly active in buying small MDM related software vendors; however, SAP, Microsoft and Oracle have also bought companies in this area recently.

Thursday, June 12, 2008

Flex, ActionScript, MXML?

It has been over a week since James Ward and Chet Haase of Adobe gave a talk on Flex to the SDForum Java SIG, and I am still trying to get my mind around what it all means. Adobe has a bunch of technologies and products in the Rich Internet Application (RIA) area, but it is difficult to work out what they are, how they fit together and which one I should use for any particular application. Here is the story as I understand it.

Let's start with the programming language ActionScript. ActionScript is ECMAScript, which is JavaScript. There are differences in implementation between ActionScript and other forms of JavaScript; however, most of the difference is in the document model, which could be called the API but is more like the object environment in which the program executes. JavaScript programs execute in a web page defined by the Document Object Model (DOM). ActionScript started as the language of the Flash player, so it is more oriented to constructing an environment, and this leads to some differences in the objects that it can use.

MXML is an XML based declarative language that compiles into ActionScript. Basically it is a shorthand for defining the static parts of an ActionScript environment. By the way, when I entered MXML into the Adobe site search engine, the first thing that came back was the question "Do you mean MSXML?", where MSXML is a Microsoft technology.

Next we come to the runtime environments. The Flash player is a lightweight client that executes compiled ActionScript and is most commonly deployed as a browser plug-in (as opposed to, for example, a browser, which contains a JavaScript interpreter). Adobe AIR is a larger and more capable stand-alone client for executing compiled ActionScript as well as HTML, JavaScript, etc. (as opposed to, for example, a browser, which is a client for interpreting HTML, JavaScript, et al).

Flex is the framework, which means that it is an overarching name for the whole pile of technology. The one piece of technology actually called Flex is Flex Builder, the Eclipse based development environment for ActionScript and MXML. As they have done with other products, Adobe has open sourced a lot of technology surrounding Flex to bring more developers to the platform.

Overall, I am not sure which is more impressive, the melange of technology in Adobe Flex or the marketing effort that tries to make the whole melange of technology seem like one coherent whole.

Monday, June 09, 2008

New iPhone - New Business Model

Steve Jobs announced the widely anticipated new iPhone at the WWDC today. I have seen a lot of comments on features and price, but nothing interesting on the new business model. Here is my take.

In the old business model, Apple and ATT sold the iPhone at full price and, in a highly unusual arrangement, ATT shared its ongoing revenue with Apple. Now Apple and ATT sell the iPhone at a discount. ATT presumably pays Apple for each phone they sell; however, there is no ongoing revenue sharing. We will have to see exactly how this plays out when the iPhone goes on sale. It may well be that you have to sign up with ATT to unlock the phone when you register it.

Apple still has a couple of revenue streams which are unusual concessions from a mobile-phone company, especially in the USA. First, Apple gets to sell all the media and games on the phone through its iTunes store. Songs are still $1, movies and TV shows range from $2 to $5, and games and applications range from free to $10. This is a useful revenue stream even though it has a margin of only 20% to 30%.

More interesting is the MobileMe storage and syncing service that costs $100 a year. Verizon charges me $10 to move my phone list from an old phone to a new phone when I have to buy a new one. Nobody there or at any other phone company thought of charging $100 for making this service continuously available. At the same time it is a great idea that many have picked up on as a good reason to get the iPhone.

The only problem with MobileMe is the ridiculously small storage capacity of 20 GB. The phone has 8GB or 16GB. What is the point of having a backing store that is about the same size as my phone? Particularly as storage is not that expensive these days. Google Apps offers 25GB for $50 per year; Apple ought to offer something equivalent.

Apart from that, the new business model keeps Apple ahead of the game which is exactly where it needs to be.

Monday, June 02, 2008

Still HDTV - Not

It is old news to me, but HDTV has yet to turn viewers on. While I was working out at the gym the other day listening to Harry Shearer's Le Show, he mentioned recent research conducted by the Scripps Network showing that large numbers of viewers who receive an HDTV feed continue to watch the same content in standard definition.

I posted about this problem a couple of years ago. Since then I have successfully trained my family to watch HDTV when it is available, but it took some work to make them HD aware. Still, we do not get a lot of channels in HD and the channels are still in the same obscure 700 range of the "dial".

There is some good programming in HD. Last night we watched 2001: A Space Odyssey on the Universal HD channel and it was stunning. Fortunately there were only a few commercial breaks, because the commercials came in so much louder than the film that we had to mute the entire commercial break to remain sane.

Saturday, May 31, 2008

Software Tools Used by Criminals

I am always thoroughly paranoid by the time I leave a meeting of the SDForum Security SIG, and the May meeting was no exception. After the meeting, I stopped at a gas station. Immediately I was suspicious, as their price was several cents less than at the gas station on the other side of the intersection. Next, the pump did not ask me for my ZIP code, and finally the pump let me get more gas than my credit card limit. What kind of scam was this?

The source of my paranoia was the fascinating presentation on "Software Tools Used by Criminals" by Markus Jakobsson, a Principal Scientist at PARC. Markus led us through the history of software and internet scams, from the first computer virus and internet worm to the present day, where sophisticated criminals are making targeted attacks on individuals and businesses.

Markus also led us through the crime cycle, starting with 'data mining' of public data sources to get the information needed to make an attack, through the ways in which criminals can get money from a scam without identifying themselves. He also described the results of several experiments that he and others had done to measure how easy some of these data mining exercises were, and experiments to measure how gullible people are when they are set up in the right psychological way for a scam.

Finally, Markus came to his recent work on a password reset system. Password reset is usually done with a security question, the answer to which can often be guessed or found out. For example, data mining of public data from Texas had discovered the mother's maiden name of about half the population of Texas. Markus and his team propose a new technique based on preferences, which is both easy to remember and unlikely to be guessed by outsiders.
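
To give a flavor of how preference-based authentication can work, here is my own minimal sketch of the general idea; it is not Markus's actual scheme, and the topics, thresholds and scoring are made up for illustration.

    # A minimal sketch of preference-based password reset (not Markus's
    # exact scheme). At enrollment the user marks topics they like and
    # dislike; at reset time they classify a mixed list, and we accept them
    # if enough answers match and few contradict.

    ENROLLED = {            # hypothetical enrollment data for one user
        "likes":    {"gardening", "jazz", "hiking"},
        "dislikes": {"opera", "golf", "reality TV"},
    }

    def verify(answers, min_correct=5, max_wrong=1):
        """answers: dict mapping topic -> 'like' or 'dislike'."""
        correct = wrong = 0
        for topic, choice in answers.items():
            if topic in ENROLLED["likes"]:
                correct += choice == "like"
                wrong += choice == "dislike"
            elif topic in ENROLLED["dislikes"]:
                correct += choice == "dislike"
                wrong += choice == "like"
        return correct >= min_correct and wrong <= max_wrong

    # A genuine user remembers most of their own preferences; a guesser will not.
    print(verify({"gardening": "like", "jazz": "like", "hiking": "like",
                  "opera": "dislike", "golf": "dislike", "reality TV": "dislike"}))

The attraction is that preferences are stable and easy to remember, yet hard for an outsider to mine from public records in the way a mother's maiden name can be.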

Monday, May 26, 2008

Pull Dressed as Push

We have been watching with some degree of schadenfreude the problems that Twitter, the incredibly popular microblogging service, has with scaling, or even with providing a reliable service. Yesterday Steve Gillmor suggested in his TechCrunch post that the problem has been caused by FriendFeed. FriendFeed is a new social service aggregator that either enhances or engulfs Twitter, depending on your point of view. Here is my take on what the problem is.

First some background. Publish and Subscribe (pub/sub) is the underlying goal of all these services. I, as a client, subscribe to something, and when the publisher has something that matches my subscription, they Push it to me. This is efficient because stuff is only sent to me when it exists. The problem is that the publisher may not know where I am when they want to do the Push. So many pub/sub systems work on the Pull by Polling model. That is, every so often I ask the publisher if they have anything new for me. The Polling part is that I repeatedly ask for new stuff, and the Pull part is that when new stuff exists, I Pull it from the publisher. This works reasonably well as long as I do not poll the publisher too often.
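
To make Pull by Polling concrete, here is a minimal sketch of a polling client in Python. The fetch_since callback and the stub publisher are hypothetical stand-ins, not any real Twitter or FriendFeed API.

    import time

    def poll(fetch_since, interval_seconds=60, rounds=3):
        """Pull by Polling: keep asking the publisher for anything newer than
        what we have already seen. fetch_since is a hypothetical callback."""
        last_seen = None
        for _ in range(rounds):
            for item in fetch_since(last_seen):     # one request per interval,
                print("got:", item["id"])           # whether or not anything is new
                last_seen = item["id"]
            time.sleep(interval_seconds)            # shrink this and the publisher suffers

    # A stub publisher so the sketch runs: most polls return nothing new.
    stub = lambda last: [{"id": 1}] if last is None else []
    poll(stub, interval_seconds=1)

Notice that the publisher does work on every poll even when there is nothing to deliver, which is exactly why the poll interval matters so much.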

For example, RSS works this way (as I discussed some time ago). To prevent the original publisher from being overwhelmed by requests for new information, part of the RSS protocol describes how often a subscriber may poll so as not to swamp the publisher with requests.

Now back to Twitter and FriendFeed. Twitter provides an API so that other services can be built upon it. FriendFeed is an aggregator of social networking services that uses the Twitter API to aggregate information for its users. The Twitter API is based on XMPP, a high performance instant messaging protocol that supports, for example, instant messaging between large service providers such as AIM and Yahoo Instant Messaging. However, XMPP also has a low performance option based on HTTP polling of XMPP servers. This turns XMPP into Pull dressed as Push, which strains the servers when the poll rate is too high.

It turns out that the Twitter XMPP API is based on the low performance HTTP option. Thus FriendFeed is polling Twitter on behalf of each of its users, and polling frequently to give the appearance of instant response, which may be the reason that the Twitter servers are overloaded. Twitter has a feature in their API to throttle polling to no more than once a minute; however, this could also be a problem if it is badly implemented.

By way of disclosure, I do not use Twitter or many of these other toys. I get quite enough information overload from RSS.

Sunday, May 18, 2008

Blog Ennui

It is Sunday morning and I notice that TechCrunch has a couple of new posts. One is a stream-of-consciousness piece from Steve Gillmor called "Bill's Gold Watch". This one was better than the stream-of-consciousness piece Steve wrote last week called "The Blood Brain Barrier", mainly because it was shorter. Steve can write conventional blog posts; for example, on Saturday morning he had an excellent piece called "Facebook's Glass Jaw" which comments on the Facebook - Friends Connect fracas.

So what is Steve trying to do with "Bill's Gold Watch"? Is he trying to create a new Journalism? To me, it reads more like poetry. Even if it does not make complete sense, it sparks off thoughts and associations, and that appears to be the intention. Another commenter suggested it reads like rap. If it were printed as blank verse we would see what was going on, and Steve could concentrate on getting the rhythms right as well as avoiding some of the more tortuous and interconnected thoughts.

In other blog thoughts, I have completely given up on Valleywag. Shortly after I wrote about Valleywag a year and a half ago, the then-editor Nick Douglas departed and it has been downhill ever since. Now it is just a load of social claptrap of the sort that fills the gossip column of a tabloid newspaper.

Saturday, April 26, 2008

Hypertable - A Massively Parallel Database System

Now everyone can have their own database system that scales to thousands of processors, as we heard at the April meeting of the SDForum Software Architecture and Modeling SIG. Doug Judd from zEvents, the Hypertable lead developer, spoke on "Architecting Hypertable: a massively parallel high performance database".

Hypertable is an Open Source database system designed to deal with the massive scale of data that is found in web applications such as processing the data returned by web crawlers as they crawl the entire internet. It is also designed to run on the massive commodity computer farms, which can consist of thousands of systems, that are employed to process such data. In particular Hypertable is designed so that its performance will scale with the number of computers used and to handle the unreliability problems that inevitably ensue from using large computer arrays.

From a user perspective, the data model has a database that contains tables. Each table consists of a set of rows. Each row has a primary key value and a set of columns. Each column contains a set of key-value pairs, commonly known as a map, and a timestamp is associated with each key-value pair. The number of columns in a table is limited to 256; otherwise there are no tight constraints on the size of keys or values. The only query method is a table scan. Tables are stored in primary key order, so a query easily accesses a row or group of rows by constraining on the row key. The query can specify which columns are returned and the time range for key-value pairs in each column.

The basic unit for inserting data is the key-value pair, along with its row key and column. An insert will create a new row if none exists with that row key. More likely, an insert will add a new key-value pair to an existing column map, or supersede the existing value if the column key already exists in the map.
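
To make the data model concrete, here is a toy in-memory version in Python. It is only an illustration of rows, column maps and timestamped key-value pairs as described above; it is not the real Hypertable client API.

    import time

    class ToyTable:
        """A toy model of the data model: rows in key order, each column a
        map of qualifiers to timestamped values. Not the real Hypertable API."""

        def __init__(self):
            self.rows = {}   # row key -> {column -> {qualifier -> [(ts, value), ...]}}

        def insert(self, row_key, column, qualifier, value, ts=None):
            ts = ts or time.time()
            row = self.rows.setdefault(row_key, {})          # creates the row if needed
            cell = row.setdefault(column, {}).setdefault(qualifier, [])
            cell.append((ts, value))                         # newer values supersede older ones

        def scan(self, start_key, end_key, columns=None):
            """The only query method: a scan over a range of row keys."""
            for row_key in sorted(self.rows):
                if start_key <= row_key <= end_key:
                    row = self.rows[row_key]
                    wanted = columns or row.keys()
                    yield row_key, {c: row[c] for c in wanted if c in row}

    t = ToyTable()
    t.insert("com.example.www", "content", "html", "<html>...</html>")
    for key, cols in t.scan("com.example", "com.exampley"):
        print(key, list(cols))

The point of the sketch is the shape of the data and of the queries: there are no joins, no secondary indexes, just ordered rows, wide column maps and range scans.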

As Doug explained, Hypertable is neither relational nor transactional. Its purpose is to store vast amounts of structured data and make that data easily available. For example, while Hypertable does have logging to ensure that information does not get lost, it does not support transactions, whose purpose is to make sure that multiple related changes either all happen together or not at all. Interestingly, many database systems switch off transactional behavior for large bulk loads. There is no mechanism for combining data from different tables, as tables are expected to be so large that there is little point in trying to combine them.

The current status is that Hypertable is in alpha release. The code is there and works, as Doug showed us in a demonstration. However, it uses a distributed file system like Hadoop's to store its data, and while they continue developing they are also waiting for Hadoop to implement a consistency feature before they declare beta. Even then there are a number of places with a single point of failure, so there is plenty of work left to make it a complete and resilient system.

Hypertable is closely modeled on Google Bigtable. At several times in the presentation when asked about a feature, Doug explained it as something that Bigtable does. At one point he even went so far as to say "if it is good enough for Google, then it is good enough for us".

Monday, April 21, 2008

SaaS, Cloud, Web 2.0... it’s time for Business Intelligence to evolve!

The most surprising phrase in Roman Bukary's presentation to the April meeting of the SDForum Business Intelligence SIG was "right time, not real time", and it was said more than once. Roman is Vice President of Marketing and Business Development at Truviso and his presentation entitled "SaaS, Cloud, Web 2.0... it’s time for Business Intelligence to Evolve!" brought a large audience to our new location at the SAP Labs on Hillview Avenue in Palo Alto.

Truviso provides software to continuously analyze huge volumes of data, enabling instant visibility, immediate action and more profitable decision making. In other words, their product is a streaming database system.

Over the years, the Business Intelligence SIG has heard about several streaming database systems. Truviso distinguishes itself in a number of ways. Firstly, it leverages the open source Postgres database system, so it is a real database system with real SQL. Other desirable characteristics are handling large volumes of data and large numbers of queries, and the ability to change queries on the fly. They also have a graphics front end that can draw good looking charts. Roman showed us several Truviso applications, including stock and currency trading applications that are both high volume and rapidly changing.
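
To illustrate what a streaming system does differently, here is a conceptual sketch of a one-minute sliding average over a stream of trades. Truviso itself is SQL on top of Postgres; this Python is only my illustration of the idea of aggregating on the fly, with made-up numbers.

    from collections import deque

    class SlidingAverage:
        """Maintain an aggregate continuously as events arrive, instead of
        loading the data first and querying it later."""

        def __init__(self, window_seconds=60):
            self.window = window_seconds
            self.events = deque()            # (timestamp, price)
            self.total = 0.0

        def add(self, ts, price):
            self.events.append((ts, price))
            self.total += price
            # drop events that have fallen out of the window
            while self.events and self.events[0][0] < ts - self.window:
                _, old_price = self.events.popleft()
                self.total -= old_price

        def average(self):
            return self.total / len(self.events) if self.events else None

    avg = SlidingAverage()
    for ts, price in [(0, 10.0), (20, 10.4), (45, 10.2), (70, 10.6)]:
        avg.add(ts, price)
        print(ts, round(avg.average(), 2))

Because the aggregate is updated as each event arrives, the answer is always current; there is no separate extract-and-load step to fall behind.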

Then we come to the "right time, not real time" phrase. In the past I have associated this phrase with business intelligence systems that could not present the data in a timely manner. Obviously, that is not a problem with streaming database systems that process and aggregate data on the fly and always have the most up to date information.

I think that Roman was trying to go in the other direction. He was suggesting that Truviso is not only useful for high pressure real time applications like stock trading, it also has a place in other applications where time is less pressing but the volume of data is high and there is still a need for a real time view of the current state. Such applications could include RFID, logistics and inventory management.

Tuesday, April 08, 2008

Open Source 10 Years Later

April 7, 2008 marks 10 years since the landmark Freeware Summit that signaled the opening of the Open Source movement. By coincidence I recently read the manifesto of the Open Source movement, "The Cathedral and The Bazaar" by Eric S. Raymond. The book, published in 1999 and revised in 2001, contains the namesake essay and several others, including "Revenge of the Hackers", which describes the events leading up to and following the Freeware Summit from an insider's point of view. The essay is valuable as a history of Open Source, although its veracity is slightly marred because it dates the summit meeting to March 7.

One thing that "Revenge of the Hackers" does not shy away from is explaining why Richard Stallman and the Free Software Foundation were not present at the Freeware Summit. In the past I have written on the distinction between Open Source and Free Software. Raymond is tactful but firm in explaining why creating a separation between these two ideas was essential to getting Open Source accepted by the mainstream.

On the other hand, the end of the essay that looks into the future of Open Source does suffer in hindsight. Open Source has advanced by leaps and bounds in the last 10 years. However, it is still not in the position of ruling the world, as "Revenge of the Hackers" suggests it might be. Let's give it at least another 10 years.

Thursday, April 03, 2008

An Evening with The Difference Engine

One day I write about Doron Swade and "The Cogwheel Brain". Three days later I get an invitation from the Computer History Museum. Doron Swade is coming to Silicon Valley with a Difference Engine!

The occasion is that another Difference Engine has been commissioned by Nathan Myhrvold, ex-CTO of Microsoft. It is being exhibited at the Computer History Museum in Mountain View, and to celebrate its arrival, there is an "Evening with Nathan Myhrvold and Doron Swade" at the museum. We have signed up for the event, have you?

Sunday, March 30, 2008

Building Better Products Through Experimentation

Experimentation is the theme of the SDForum Business Intelligence SIG so far this year. The March meeting featured Deepak Nadig, a Principal Architect at eBay, talking about "Building Better Products Through Experimentation". Experimentation is an important technique for Business Intelligence, although its first uses were in medicine. In 1747, James Lind, a British naval surgeon, performed a controlled experiment to find a cure for scurvy. In his book "Supercrunchers", Ian Ayres describes how the Food and Drug Administration has used experimentation since the 1940s to determine whether a medical treatment is efficacious.

While eBay has always used experimentation to test and fine-tune its web pages, in recent years the process has been formalized. While anyone can propose an experiment, product managers are the group most likely to do so. Deepak took us through the eBay process and discussed issues with using experimentation. Because they have the infrastructure, simple experiments can be set up within a matter of days. eBay usually runs an experiment for at least a week so that it is exposed to a full cycle of user behavior. Simple experiments to test a small feature typically run for a week or so, larger experiments may run for a month or two, and some critical tests run continuously.

For example, eBay is interested in whether it is a good idea to place advertising on their pages. On the one hand it brings in extra revenue in the short term, on the other hand, it might cannibalize revenue in the long term. Experimentation has shown that advertising is a good thing in some situations, however its use is being monitored by some long term experiments to ensure that it remains beneficial.

Deepak took us through some of the issues that arise with experimentation. One issue is concurrency: how many experiments can be carried out at the same time. As eBay has a high traffic web site, they can get good results with experiments on a small proportion of the users, at most a few percent. Because each experiment uses a small percentage of the users, several experiments can be run in parallel. Another issue is establishing a signal-to-noise ratio for experiments to ensure that the experiments are working and giving valid results. eBay has done some A/B experiments where A and B are exactly the same to establish whether their experimental technique has any biases.
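
To show how several small experiments can run in parallel without interfering, here is a generic sketch of hash-based assignment. This is a standard technique, not a description of eBay's actual system, and the experiment names and percentages are made up.

    import hashlib

    def bucket(user_id, experiment, traffic_fraction=0.02):
        """Deterministically assign a user to 'treatment', 'control' or None.
        Each experiment uses its own salt, so assignments in one experiment
        are statistically independent of assignments in another."""
        h = int(hashlib.md5(f"{experiment}:{user_id}".encode()).hexdigest(), 16)
        slot = (h % 10000) / 10000.0
        if slot < traffic_fraction / 2:
            return "treatment"
        if slot < traffic_fraction:
            return "control"
        return None            # most users see the unchanged site

    print(bucket("user-12345", "ads-on-search"))
    print(bucket("user-12345", "new-checkout-button"))

Running this same assignment with treatment and control serving identical pages is exactly the check described above: any difference that gets measured then estimates the noise and bias in the technique itself.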

Wednesday, March 26, 2008

The Cogwheel Brain

The Cogwheel Brain by Doron Swade is the story of Charles Babbage and his quest to build the first computer. The book also details how Doron Swade built a Babbage Difference Engine for the 200th anniversary of Babbage's birth in 1991.

Charles Babbage designed three machines. He started with the Difference Engine, which would use the method of finite differences to generate tables such as logarithms and navigation tables. The computing section of this first design was built, although it did not have a printer. Next he conceived and designed the Analytic Engine, a fully functioning computer that was programmed by the same kind of punched cards that were used to run a Jacquard weaving loom. In the course of designing the Analytic Engine he realized that he could improve the design of the Difference Engine to make it faster and use fewer parts. This resulted in the design of Difference Engine 2. Only small demonstration parts of the Analytic Engine were built, and Difference Engine 2 existed only as a set of plans.
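
The method of finite differences is what makes a Difference Engine possible: once the initial differences of a polynomial are set up, every further table value needs only additions, which cogwheels can do. Here is a minimal sketch of the arithmetic, using x squared as the example.

    def tabulate(initial_differences, count):
        """initial_differences: [f(0), first difference, second difference, ...]
        for a polynomial. Each further table value is produced by additions only."""
        diffs = list(initial_differences)
        values = []
        for _ in range(count):
            values.append(diffs[0])
            # each difference absorbs the one below it -- addition only
            for i in range(len(diffs) - 1):
                diffs[i] += diffs[i + 1]
        return values

    # For f(x) = x^2: f(0) = 0, first difference 1, second difference 2 (constant).
    print(tabulate([0, 1, 2], 6))   # [0, 1, 4, 9, 16, 25]

In the engine, each column of wheels holds one of those differences, and turning the crank performs the cascade of additions mechanically.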

I expected the story to be similar to several other computing projects that I have seen and worked with. You know the projects, the ones where the architect keeps jumping to a new idea while the overall project goals get lost and the project overruns for years before it is abandoned. Building the Difference Engine was a lot more disciplined. The core of the first Difference Engine was built and worked, even though it used orders of magnitude more machined parts than any other machine built up to that time. While it did take a long time, that is understandable given the engineering practices of the day: all the parts had to be made by a single craftsman in a single workshop.

One thing from the book that surprised me is that during the 19th century other difference engines were built by other engineers. Although these machines were completed, they were never successfully used for any purpose. I think this goes to show that the 19th century was not ready for mechanical computing. The book is easy to read and highly recommended.

Thursday, March 13, 2008

Customer Relationship Intelligence

There is a curious thing about the organization of a typical company. While there is one Vice President in charge of Finance and one Vice President in charge of Operations there can be up to three Vice Presidents facing the customer: a Marketing Vice President, a Sales Vice President, and a Service Vice President. On the one hand, the multiplicity of Vice Presidents and their attendant organizations is a testament to the importance of the customer. On the other hand, multiple organizations mean that no one is in charge of the customer relationship and thus no one takes responsibility for it.

We see this in the metrics that are normally used to measure and reward customer-facing employees. Marketing measures itself on how well it finds leads, regardless of whether sales uses them. Sales measures itself on the efficiency of its sales people in making sales, regardless of whether the customer is satisfied. Service, left to pick up the pieces of an overpromised sale, measures itself on how quickly it answers the phone. Everyone is measuring their own actions and no one is measuring the customer.

Linda Sharp addresses this conundrum head on in her new book Customer Relationship Intelligence. As Linda explains, a customer relationship is built upon a series of interactions between a business and its customer. For example, the interactions start with acquiring a lead, perhaps through a response to an email or mass mailing or a clickthrough on a web site. Next, more interactions qualify the lead as a potential customer. Making the sale requires further interactions leading up to the closing. After the sale there are yet more interactions to deliver and install the product and to service it to keep it working. Linda's thesis is that each interaction builds the relationship, and that by recording all the interactions and giving each a value and a cost, the business builds a quantified measure of the value of its customer relationships and how much it has spent to build them.
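
The bookkeeping this implies is straightforward. Here is a minimal sketch of summing interaction values and costs per customer; the names and numbers are purely illustrative and are not Linda's actual model.

    # Record every interaction with a value and a cost, then sum them per
    # customer. Illustrative data only.
    interactions = [
        {"customer": "Acme", "type": "email response",  "value": 5,   "cost": 1},
        {"customer": "Acme", "type": "qualifying call", "value": 25,  "cost": 40},
        {"customer": "Acme", "type": "closing meeting", "value": 500, "cost": 200},
        {"customer": "Acme", "type": "support call",    "value": 50,  "cost": 30},
    ]

    def relationship_summary(records, customer):
        value = sum(r["value"] for r in records if r["customer"] == customer)
        cost = sum(r["cost"] for r in records if r["customer"] == customer)
        return {"relationship value": value, "cost to build": cost}

    print(relationship_summary(interactions, "Acme"))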

Having a value for a customer relationship completely changes the perspective of that relationship. It gives marketing, sales and service an incentive to work together to build the value in the relationship rather than working at cross purposes to build their own empires. Moreover, knowing the cost of having built the relationship suggests the value in continuing the relationship after the sale is made. In the book, Linda takes the whole of the second chapter to discuss customer retention and why that is where the real profit is.

The rest of the book is logically laid out. Chapter Three “A Comprehensive, Consistent Framework” creates a unified model of a customer relationship throughout its entire lifecycle from the first contact by marketing through sales and service to partnership. This lays a firm bedrock for Chapter Four, “The Missing Metric: Relationship Value” which explains the customer relationship metric, the idea that by measuring the interactions that make the relationship we can give a value to the relationship.

The next two chapters discuss how the metric can be used to drive customer relationship strategy and tactics. The discussion of tactics lays the foundation for Chapter Seven, which shows how the metric is used in the execution of customer relationships. Chapters Six and Seven contain enough concrete examples of how the data can be collected and used to give us a feeling for the metric's practicality. Chapter Eight compares the customer relationship metric with other metrics and explores the many ways in which it can be used. Finally, Chapter Nine summarizes the value of the Customer Relationship Intelligence approach.

Linda backs up her argument with some wonderful metaphors. One example is the contrast between data mining and the data farming approach that she proposes with her Relationship Value metric. For data mining, we gather a large pile of data and then use advanced mathematical algorithms to determine which parts of the pile may contain some useful nuggets of information. This is like the hunter-gatherer stage of information management. When we advance into the data farming stage, we know what customer relationship metric is important and collect that data directly.

As the metaphor suggests, we are still in the early days of understanding and developing customer relationship metrics. Until now, these metrics have concentrated on measuring our own performance to see how well we are doing. Linda Sharp’s Relationship Value metric turns this on its head with a new metric that measures our whole relationship with customers. Read the book to discover a new and unified way of thinking about and measuring your customers.

Tuesday, March 04, 2008

Developing on a Cloud

The cloud computer is here and you can have your corner of it for as little as 10 cents an hour. This was the message that author and consultant Chris Richardson offered to the SDForum SAM SIG when he spoke on "Developing on a Cloud: Amazon's revolutionary EC2" at the SIG's February meeting.

As Chris tells it, you go to the Amazon site, sign up with your credit card, go to another screen where you describe how many cloud servers you need, and a couple of minutes later you can SSH to the set of systems and start using them. In practice it is slightly more complicated than this. Firstly, you need to create an operating system configuration with all the software packages that you need installed. Amazon provides standard Linux setups, and you can extend them with your requirements and store the whole thing in the associated Amazon S3 storage service. There goes another 10 cents a month.

Next you need to consider how your cloud servers are going to be used. For example, you could configure a classic 3-tier redundant web server system with 2 cloud servers running web servers, another 2 cloud servers running Tomcat application servers, and another cloud server running the database with yet another cloud server on database standby. Chris has created a framework for defining such a network called EC2Deploy (geddit?). He has also implemented a Maven plug-in that sits on top of EC2Deploy to create a configuration and start applications on each server. Needless to say, the configuration is defined declaratively through the Maven pom.xml files.
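
The attraction of defining the topology declaratively is that the whole network becomes data that a tool can act on. Here is a hedged sketch of that idea in Python; the names, image ids and launch function are hypothetical, and this is not EC2Deploy's or the Maven plug-in's actual syntax.

    from itertools import count

    # Declare the 3-tier topology as data rather than as a script of steps.
    TOPOLOGY = {
        "web":      {"count": 2, "image": "ami-webserver", "size": "m1.small"},
        "app":      {"count": 2, "image": "ami-tomcat",    "size": "m1.small"},
        "database": {"count": 2, "image": "ami-mysql",     "size": "m1.large"},  # primary + standby
    }

    def deploy(topology, launch):
        """launch(role, image, size) -> host name; supplied by whatever
        actually talks to EC2. Returns the hosts started for each role."""
        hosts = {}
        for role, spec in topology.items():
            hosts[role] = [launch(role, spec["image"], spec["size"])
                           for _ in range(spec["count"])]
        return hosts

    # A stub launcher so the sketch runs without an AWS account.
    counter = count(1)
    fake = lambda role, image, size: f"{role}-{next(counter)}.example.com"
    print(deploy(TOPOLOGY, fake))

With the topology expressed as data, tearing the whole thing down, or standing up a second copy for a system test, is just another pass over the same description.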

So what would you want to use EC2 for? Chris suggested a couple of applications that are particularly interesting for impoverished startups. Firstly, EC2 can be used to do big system tests before a new version of the software is deployed. The startup does not need to buy all the hardware to replicate its production systems for a full scale system test; big system tests are done on EC2, saving considerable resources. Another use is to have a backup solution for scaling should the startup take off in an unexpected manner. Given the unreliability of ISPs these days, having a quickly deployable backup system sounds like a good idea, and the best thing is that it does not cost you anything when you are not using it.

Thursday, February 28, 2008

Develop Smarter Products

When someone asks me about Business Intelligence, I will usually say that it is about analyzing the data that a business already has, and the truth is that businesses collect huge amounts of useful data. However, there are many interesting applications where we go out and collect specific data for analysis. We heard about one such application at the February meeting of the SDForum Business Intelligence SIG where Cameron Turner, CEO of ClickStream Technologies, spoke on "Software Instrumentation: How to Develop Smarter Products with Built-in Customer Intelligence".

ClickStream Technologies has a data collector that collects user interactions with GUI based user interfaces. That data is loaded into a data warehouse for analysis of the user experience. Contrast the ClickStream method with other techniques for analyzing program usage. The most common method is to analyze the logs generated by a program, but logs are typically recorded some distance from the user interface and tend to capture the end result of what the user did rather than how the user did it. For example, if there are several ways in which a function can be invoked, program logs normally record that the function was invoked, but not how it was invoked. Also, collecting more information involves modifying the program to increase the log data that it produces, which is not always practical. The ClickStream data collector runs as a stand alone program and collects its data with minimum intrusion into the running of the program being instrumented. Another technique for gathering user experience data is to have someone stand behind the user and record what they do, but this is labor intensive and does not allow for large scale studies.
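
The difference between logging what was done and how it was done is easy to show. In this sketch, which is my own illustration and not ClickStream's collector, the same action can be reached from a menu, a toolbar or a shortcut; an ordinary program log records only that it ran, while the instrumentation layer also records the path the user took.

    import logging
    logging.basicConfig(level=logging.INFO, format="%(message)s")

    USAGE_LOG = []   # what an instrumentation-style collector might accumulate

    def instrumented(action_name):
        """Wrap an action so every invocation records both the action and
        the UI path ('via') that triggered it."""
        def wrap(fn):
            def run(*, via):
                logging.info("action invoked: %s", action_name)          # typical program log
                USAGE_LOG.append({"action": action_name, "via": via})    # how the user did it
                return fn()
            return run
        return wrap

    @instrumented("save-document")
    def save_document():
        pass

    save_document(via="menu: File > Save")
    save_document(via="shortcut: Ctrl+S")
    print(USAGE_LOG)

It is the second kind of record, collected at the user interface rather than deep inside the program, that tells a designer which affordances people actually find.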

There are many reasons for evaluating the user experience with a program. Cameron lists them in his presentation, which you can get a copy of by visiting the files area of the Business Intelligence SIG Yahoo Group. The one that is closest to my heart is providing feedback to the designers of a program that their design leaves a lot to be desired. There are many times when I have become frustrated with a program because I cannot find out how to do the simplest and most obvious thing. This is when I wish that someone was recording my problems and feeding them back to the development team.

ClickStream Technologies started off as a consulting company. They are moving their offering towards something more standardized, with the idea that eventually customers will be able to use it on a self service basis. Currently each engagement requires configuring the data collector and writing reports for the analysis. Also, for each engagement they recruit a panel of testers who download the data collector. As such, it is more suitable for medium to large sized companies that want to do large scale studies.

Sunday, January 27, 2008

The OpenSocial API

OpenSocial is a standard API for applications in social networking platforms. It is sponsored by Google. The API exists to make applications portable between different social networks. On January 22, Patrick Chanezon, OpenSocial Evangelist at Google, spoke to the SDForum Web Services SIG on the topic "OpenSocial Update: On the Slope of Enlightenment".

Social Networks have been a big part of Web 2.0, and thousands of them have sprung up. In the future, Social Networks could become like wikis, where businesses and organizations set up social networks to allow their employees and members to communicate with one another, so there is the potential for millions of social networks. A standard API makes a social network that supports the API more valuable, because applications can be easily ported to it.

When OpenSocial was first announced, there were great expectations for it. Unfortunately, many people assumed that it was either an API for communicating between different social networks or an API for porting member data between social networks. OpenSocial is neither of these things. Social networks regard their member data as their crown jewels, so allowing for data portability or interaction between networks is something that would not be welcomed easily. As Patrick explained, to get the API out quickly, it had to be something uncontroversial, and as all social networks want applications, it is easy to draw them together around a common API.

Because of the great expectations, OpenSocial went through the hype cycle quickly. In a few weeks it hit the Peak of Inflated Expectations and then just as quickly descended into the Trough of Disillusionment. Now Patrick claims that they are on the Slope of Enlightenment and firmly headed towards the Plateau of Productivity. All this for an API that has only reached version 0.6.

APIs are difficult to judge, but this one seems kinda nebulous. There are three parts to the OpenSocial API. The first part is configuration, where the application can find out about its environment; the main issue here seems to be coming to agreement on the names for common things in social networks. The second part of the API is a container for persisting the application's own data. Finally, the API has features for handling the event streams that seem to be a common feature of social networks. Ho-hum.

Some other interesting tidbits came out of the talk. Security is a big issue with JavaScript and browsers. As I wrote previously, the Facebook approach is to have their APIs use their own language, which is easy to sanitize. The response from the rest of the world seems to be an Open Source project that filters JavaScript programs to effectively sandbox them. Unfortunately, I was not quick enough to record the name of the project.