We visited the Leonardo exhibition at The Tech this afternoon. It is a huge exhibition; they suggest that you allow two hours for the tour. We were there for two hours and rushed through the second half to such an extent that I will go back and do it again. The exhibition starts with Brunelleschi's Dome for the Duomo in Florence. Leonardo was an apprentice in Florence towards the end of its construction, and it sparked his interest in mechanics.
After wandering through many halls of mechanical inventions, we came to the anatomy room where Leonardo takes his knowledge of mechanics and applies it to understanding how the human body works. It was after this that we had to pick up the pace just as the exhibits started to get really interesting. The exhibition then goes into his more cultural side which includes his painting and sculpture.
One thing that I got from the painting displays is that Leonardo's knowledge of both mechanics and anatomy informed his paintings. For example, there is an interesting display on the dynamics of the characters in The Last Supper. There is another display on his studies into understanding faces, expressions and the muscles that are used to form facial expressions. So the inscrutable expression on the Mona Lisa's face is no accident (this is my surmise; I did not see a reference to the Mona Lisa in the exhibition).
I highly recommend that you see the Leonardo exhibition if you can, and suggest that you allow several hours to see it all properly. Also, do not spend too much time on the mechanics. It is a necessary introduction to understanding how Leonardo viewed the world but it is also important to see how he applied all this knowledge.
Saturday, November 08, 2008
Sunday, November 02, 2008
Financial Data Integration
Suzanne Hoffman of Star Analytics spoke to the October meeting of the SDForum Business Intelligence SIG on "Financial Data Integration". There were two aspects of her talk that particularly interested me. The first aspect was that Suzanne has been doing what we now call Enterprise Performance Management ever since her first job, 30 years ago, and she peppered her talk with a lot of interesting historical perspectives and anecdotes.
The most important anecdote relates to Ted Codd, inventor of the relational model for databases and author of the 12 rules that defined what a relational database is. Later Codd coined the term OLAP for analytic processing and published 12 rules that defined OLAP. Unfortunately, the 12 rules for OLAP were not well regarded, as they were not as crisp as the 12 rules for a relational database, and people found out that Codd had been paid a large sum of money by an OLAP software vendor for writing them. Suzanne confirmed that the software vendor was Arbor Software and that the sum was $25,000.
The second interesting aspect of Suzanne's talk was the idea that data can get trapped in OLAP systems. An OLAP system holds data in a multi-dimensional cube for analysis, so it is close to an end-user presentation tool. OLAP is heavily used for financial analysis and modeling. The Hyperion (now Oracle) Essbase server is the king of the hill in dealing with large data cubes. Suzanne reported that the largest cube she knew of was at Ford. It had 50 dimensions, with the largest dimension having a million members.
We have systems to get data into OLAP cubes so that financial analysts can do their work, but when the work is done, there is no way to get the data out again so that it can be used in other parts of the business. In my opinion, a Business Intelligence system can and should be constructed so that the data in OLAP cubes is sourced from a data warehouse and is not just lost in the OLAP server, although this approach may limit the size of the OLAP cubes that can be built. Anyway, many large companies have already bought high-end OLAP servers, and their data is trapped in there. The purpose of the Star Analytics integration server is to get that data out.
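To make the extraction point concrete, here is a toy sketch of my own (the dimension names and numbers are invented, and this has nothing to do with Star Analytics' actual product): a cube cell is addressed by one member from each dimension, and getting the data out amounts to flattening those cells into rows that a data warehouse can load.

    // A toy three-dimensional cube: (account, period, region) -> amount.
    val cube = Map(
      ("Revenue", "2008-Q3", "EMEA") -> 1200000.0,
      ("Revenue", "2008-Q3", "AMER") -> 1900000.0,
      ("Expense", "2008-Q3", "EMEA") ->  800000.0)

    // Getting the data out amounts to flattening every cell into a
    // relational-style row that a data warehouse can load.
    val rows = cube.map { case ((account, period, region), amount) =>
      account + "," + period + "," + region + "," + amount }
    rows.foreach(println)

A real cube is of course addressed through the server's own API rather than a Scala Map, but the shape of the problem is the same: cells in, rows out.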
Saturday, November 01, 2008
The Scala Programming Language
These days feel like the 1980s as far as programming languages are concerned, with new programming languages springing up all over the place. Then, the prolific Niklaus Wirth invented a new programming language every other year. Now the center of language design in Switzerland has moved to Lausanne, where Martin Odersky at EPFL has conceived Scala. Bill Venners of Artima introduced the Scala Programming Language at the October meeting of the SDForum Java SIG.
Scala is a functional language, in the sense that every "statement" produces a value. Also, Scala is a statically typed language, although programs look as if they were written in a dynamic language. The trick is that variables are declared by a 'var' declaration, and the type of the variable is inferred from the initial value assigned to it. Contrast this with a dynamic language, where the data type is associated with the data value and every operation on data has to look at the data types of the operands to decide what to do.
Getting the data type from the value assigned reduces the need to over-specify types, as is typical of statically typed languages like Java. Bill recalled the discussion of Duh typing that a group of us had after the last time he spoke to an SDForum SIG. The other thing that Scala makes easy is declaring invariant variables. They are like 'var' variables except they are introduced by the keyword 'val'. Contrast this with Java, where you put final before the declaration, or C++, where you put const before the declaration. Thus a constant in Java is declared something like this:
    final static String HELLO_WORLD = "Hello World";

while in Scala the declaration looks like this:
    val HELLO_WORLD = "Hello World"

This leads to a more declarative style of programming, which is a good thing. Bill reported that while in Java 95% of declarations are variables and the other 5% are constants, in Scala 95% of declarations are constants and only 5% are variables. I have used a similar style of programming in C++ when using APIs that make heavy use of const, so you have to declare variables that you are going to pass to the API as const. The only time that this is a problem is when you have to create const objects that can throw in their constructor. Then you can end up with heavily indented try blocks as you create each const object safely so that you can pass it to the API.
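To show how lightweight the declarations feel, here is a small example of my own (not one from Bill's talk), with the type of each declaration inferred from its initial value and an if/else used as an expression:

    val greeting = "Hello World"   // inferred as String, cannot be reassigned
    var counter = 0                // inferred as Int, can be reassigned

    // Every construct is an expression with a value, so if/else can be
    // used directly on the right-hand side of a declaration.
    val parity = if (counter % 2 == 0) "even" else "odd"

    counter += 1
    println(greeting + ": counter started out " + parity + " and is now " + counter)

The compiler still catches a type error, such as assigning a String to counter, which is the static typing that Bill emphasized.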
Finally, Scala, like many other languages these days, compiles to the Java Virtual Machine. That way, it is broadly portable, and developers have access to the vast Java libraries.
Labels: programming, programming languages, SDForum
Saturday, October 25, 2008
The Google Database System
Google, Yahoo and others are taking the traditional database system and breaking it into pieces. Google has its own set of proprietary database system components. Yahoo is working with the Hadoop Open Source project to make its system available to everyone. I came to this conclusion while doing research for my talk on "Models and Patterns for Concurrency" for the SDForum SAM SIG.
In this post I will talk about what is happening, why it is happening and at the end try to draw some conclusions about future directions for database systems. Note, I am using Google as an example here because their system is described in a set of widely accessible academic papers. Yahoo and many other large scale web sites have adopted a similar approach through use of Hadoop and other Open Source projects.
A traditional database system is a server. It takes care of persistent data storage, metadata management for the stored data, transaction management and querying the data, which includes aggregation. Google has developed its own set of applications which support these same functions, except instead of wrapping them into a single entity as a database server, they have been developed as a set of application programs that build on one another.
There are several reasons for Google developing its own database system. Firstly, they are dealing with managing and processing huge amounts of data. Conventional database systems struggle when the data gets really large. In particular, the transaction model that underlies database operation starts to break down. This topic is worth a separate post of its own.
Secondly, their computing system is a distributed system built from thousands of commodity computer systems. Conventional database systems are not designed or tuned to run on this type of hardware. One issue is that at this scale the hardware cannot be assumed to be reliable, and the database system has to be designed to work around the unreliable hardware. A final issue is that the cost of software licenses for running a conventional database system on the Google hardware would be prohibitive.
The Google internal applications look like this. At the bottom is Chubby, a lock service that also provides reliable storage for small amounts of data. The Google File System provides file data storage in a distributed system of thousands of computers with local disks. It uses Chubby to ensure that there is a single master in the face of system failures. Bigtable is a system for storing and accessing structured data. While it is not exactly relational, it is comparable to storing data in a single large relational database table. Bigtable stores its data in the Google File System and uses Chubby for several purposes, including metadata storage and to ensure the atomicity of certain operations.
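For readers who have not seen the Bigtable paper, its data model is roughly a sparse, sorted map from (row key, column key, timestamp) to an uninterpreted value. A few lines of Scala can sketch the idea; this is only my own illustration of the model, not Google's API, and the row and column names are invented:

    import scala.collection.immutable.TreeMap

    // Bigtable is roughly a sparse, sorted map:
    // (row key, column key, timestamp) -> uninterpreted value.
    val table = TreeMap(
      ("com.example.www", "contents:", 1L)          -> "<html>...</html>",
      ("com.example.www", "anchor:org.example", 1L) -> "Example link text",
      ("org.example",     "contents:", 1L)          -> "<html>...</html>")

    // Keys are sorted, so all the cells of one row sit together and
    // reading a row is a contiguous scan rather than a lookup per cell.
    val row = table.filter { case ((rowKey, _, _), _) => rowKey == "com.example.www" }
    row.foreach { case ((rowKey, column, ts), value) =>
      println(rowKey + " / " + column + " @ " + ts + " = " + value) }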
Finally, Map Reduce is a generalized aggregation engine (and here I mean aggregation in the technical database sense). It uses the Google File System. Map Reduce is surprisingly closely related to database aggregation as found in the SQL language, although it is not usually described in that way. I will discuss this in another post. In the meantime, it is interesting to note that Map Reduce has been subject to a rather intemperate attack by database luminaries David DeWitt and Michael Stonebraker.
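To see the correspondence, consider the classic word count example. Here is a sketch of my own in Scala (Google's interface is C++, so this only illustrates the shape of the computation): the map step emits a (key, value) pair per word, and the reduce step folds together all the values that share a key, which is exactly what GROUP BY with an aggregate function does.

    // The map step: emit a (key, value) pair for every word.
    val documents = List("the quick brown fox", "the lazy dog")
    val mapped = documents.flatMap(_.split(" ")).map(word => (word, 1))

    // The reduce step: bring together all the values that share a key and
    // fold them into one result, here a count per word. In SQL this is
    // SELECT word, COUNT(*) FROM words GROUP BY word.
    val reduced = mapped.groupBy(_._1).map { case (word, ones) => (word, ones.map(_._2).sum) }

    reduced.foreach { case (word, count) => println(word + "\t" + count) }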
In total, these four applications (Chubby, the Google File System, Bigtable and Map Reduce) provide the capabilities of a database system. In practice there are some differences. Users write programs in a language like C++ that integrate the capabilities of these components as they need them. They do not need to use all the components. For example, Google can calculate the page rank for each page on the web as a series of 6 Map Reduce steps, none of which necessarily uses Bigtable.
The concept of a database system was invented in the late 1960s by the CODASYL committee, shortly after their achievement of inventing the COBOL programming language. The relational model and transactions came later; however, the concept of a server system that owns and manages data, and much of the terminology, originated with CODASYL. Since then, the world has changed.
Nowadays databases are often hidden behind frameworks such as Hibernate or Ruby on Rails that try to paper over the impedance mismatch between the database model on one side and an object-oriented world looking for persistence on the other. These are mostly low-end systems. At the other end of the scale are the huge data management problems of Google, Yahoo and other web sites. New companies with new visions of database systems to meet these challenges are emerging. It is an exciting time.
Saturday, October 11, 2008
It's Called Risk, Have You Heard Of It?
Senator Phil Gramm famously called us a "nation of whiners" and he may be right. (Note: while I try to keep this blog about technology, the financial system seems to be so badly broken that it is worthy of a comment or two.) I recently ran across a blog post on a financial site called "Our Timid Government Is Killing Us" by Michael Kao, CEO of Akanthos Capital Management. In it he complains about four things that the government has not done to help resolve the financial crisis. I want to concentrate on one of them here: "Problem No. 2: Lehman's bankruptcy has severely eroded confidence between counterparties."
The problem is this. Over the last few weeks, financial institutions have become unwilling to trust one another, and with good cause. The issue is Credit Default Swaps. This is a 60 Trillion dollar market (that is Trillion with a capital T) where financial institutions like banks and hedge funds (the parties) buy and sell insurance policies on bonds. The "This American Life" radio program and podcast has a very good and understandable explanation of the market and how it came to be.
There are two important things to understand about the credit default swap market. Firstly, it is completely unregulated. Senator Phil Gramm tacked a clause to keep the market unregulated onto an appropriations bill in 2000 that was approved by 95 votes to 0 in the Senate. Secondly, the market is not transparent; that is, the various parties to the market do not know what any other player's position is. Note that these two features are the way the parties in the market wanted it. There has been a great outcry for reducing financial regulations over the last few years.
Lack of transparency was not a problem until Lehman Brothers went bankrupt. They were a big player in the credit default swap market. Now all their credit default swap insurance policies are frozen by their bankruptcy. Anyone who sold a credit default swap policy and then laid off the bet by buying an equivalent credit default swap from Lehman Brothers is now on the hook to pay off the insurance policy on the bond without the compensation of being able to get Lehman Brothers to make good on their policy.
The lack of transparency means that nobody knows for sure about anyone else in the market. That is, anyone could go bankrupt tomorrow because they have bought credit default swaps from a bankrupt company like Lehman Brothers and so cannot make good on the credit default swaps that they have sold. Already AIG has needed a huge injection of government money to stay afloat, and others may be suffering as well. But no one knows what positions anyone else holds. So everyone is conserving their cash, not lending it out to anyone else, so that they will not lose it if the other party goes bankrupt. Thus are the credit markets constipated.
A final problem is that, because the market is unregulated, there are no capital requirements to back up a bet. I can sell a credit default swap insurance policy based on my good name. I immediately get a large sum of money which I can register as a profit. It is only later that I have to worry about a problem with the bond that I have insured defaulting (what are the chances of that?) This is how the market got to be 60 Trillion dollars in size.
The underlying issue is this. There is a large risk in trading in unregulated markets. The risk is made larger if the market is not transparent, because if one of the parties to the market goes bust nobody knows what their position is worth. These risks were not recognized in the credit default swap market, and policies were sold at far too low a price to account for them. If the market were regulated, like other markets are, these risks would not be there and the market could deal with usual events like the bankruptcy of a player.
Finally, there is a risk to the nation in allowing an unregulated market to balloon to the size that the credit default swap market has. Bankruptcies happen all the time. The fact that the bankruptcy of one player has caused the entire financial marketplace to go into a swoon is bad for the nation. The players in the credit default swap market asked for an unregulated market and they got what they asked for. Now the risk of having an unregulated market has shown itself, and, as Senator Gramm tells them, they should deal with it and stop whining.
I am not an apologist; I am a technologist who is interested in how things work.
Thursday, October 02, 2008
Articulate UML Modeling
Last week, Leon Starr spoke to the SDForum SAM SIG on "Articulate UML Modeling". Leon is an avid modeler and has been using UML for modeling software systems since it was first defined. He believes in building executable models and I applaud him for that. The very act of making something executable ensures that it is in some sense complete and free from many definitional errors. Executing the model allows it to be tested.
There are several advantages to building models rather than programs. A big part of many projects is extracting requirements. Unlike a program, a model can describe requirements in a way that a non-technical user can understand and appreciate, so the user can provide feedback. Another advantage of a model is that it does not arbitrarily constrain the order in which things are done. Essentially, a model is asynchronous and captures opportunities for concurrency in its implementation. This struck a chord with me as I am going to speak to the SAM SIG in October on "Models and Patterns for Concurrency".
The other part of the talk that interested me was Leon's attack on building models as controllers. He gave the example of a laser cutting machine. A common way of modeling this is as a laser cutter controller that interprets patterns. He prefers to see it modeled as patterns that describe how they are cut by the laser cutter. Leon's experience is with modeling software to manage physical systems, like the air traffic control example that he used to illustrate his talk. His approach is certainly useful for the understanding and analysis of physical systems; however, I have seen the problem argued both ways. It is worth a separate post on the issue.
Monday, September 29, 2008
42 Revisited
Last week TechCrunch had a post on the State of The Blogosphere: The More You Post, The Higher You Rank. One statistic is that the top 100 bloggers post on average 310 times a month, which sounds quite exhausting. As you know, I post 42 times a year. I am going to promise to my faithful reader that I will stick to my pace. You will not get an unreadable avalanche of overlapping verbiage from this blog.
If I have not posted much recently, it is because I have spent a lot of time reading blog posts on the financial crisis. It is very entertaining to see these extraordinary events unfold around us. Who would have thought that George W. Bush would be known to future generations as the President who nationalized the American financial services industry?
Wednesday, September 17, 2008
SaaS Data Integration
Data integration is the problem of gathering data, perhaps from many different applications, for the purpose of doing some analysis of the data as a whole. Mike Pittaro, Co-Founder of SnapLogic, spoke to the SDForum Business Intelligence SIG September meeting on "Enhancing SaaS Applications Through Data Integration with SnapLogic".
The big players in data integration are Informatica and Ascential (now IBM Information Integration), who sell large, expensive and complex products. Because of the cost, these products are often not used, particularly for one-off projects, which are common. Mike helped found SnapLogic in 2005 to bring a new perspective to data integration. SnapLogic is an open source framework and is therefore both affordable and extensible by its users.
He showed us the complexity of data integration. It involves dealing with many different access protocols and multiple ways of getting the data, and each type of data has its own metadata format to describe it. This he contrasted with the World Wide Web, where huge amounts of data are pulled back and forth every day without interoperability problems. There are almost 200 million web sites and billions of users, yet the World Wide Web is completely decentralized, with a heterogeneous model that allows for different operating systems, servers, client software applications and frameworks, and yet they are all compatible and interoperable.
The World Wide Web is based on open standards and protocols and an architectural principle called REST, which stands for REpresentational State Transfer. REST deals with data resources in standardized representations, with each resource identified by a unique identifier such as a URL.
SnapLogic builds on this by turning data sources into standard web resources. With SnapLogic you configure a server to extract data from a data source, like a file or database, and transform the data into the form you want. The server presents the data source as a standard web resource with a URL. These servers are the building blocks of a data integration application.
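As a rough illustration of the idea (the URL and the CSV layout here are invented for this sketch, and this is not SnapLogic's actual interface), consuming such a resource is just an HTTP GET on a URL that returns the data in a standard representation:

    import scala.io.Source

    // Hypothetical SnapLogic-style resource: a data source exposed at a URL
    // that returns its rows in a standard representation (CSV in this sketch).
    val resourceUrl = "http://integration.example.com/feeds/customers.csv"

    // Consuming the resource is just an HTTP GET, like fetching any web page.
    val lines  = Source.fromURL(resourceUrl).getLines().toList
    val header = lines.head.split(",")
    val rows   = lines.tail.map(_.split(","))

    println(header.mkString(" | "))
    rows.foreach(row => println(row.mkString(" | ")))

The point is that any client that can fetch a URL can consume the data, which is what makes these servers composable building blocks.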
Labels: Business Intelligence, Open Source, SDForum, Web Analytics
Thursday, September 04, 2008
Chrome
On Tuesday, Google announced their new browser Chrome. Although it has generated huge discussions in various forums and an astonishing adoption rate, I am not going to rush to use it. In fact, I think I will wait until it is out of beta before considering whether to adopt it. That should give me many years before I have to even think about making a change!
Wednesday, September 03, 2008
A Tale of Two Search Engines
At the SDForum Software Architecture and Modeling (SAM) SIG last week, John D. Mitchell, Mad Scientist at MarkMail and previously Chief Architect of Krugle, talked about the architectures of the search engines that he has built for these two companies.
Krugle is a search engine for code and all related programming artifacts. The public engine indexes all the open source software repositories on the web. This system was built a couple of years ago with cheap off-the-shelf commodity hardware, open source software and Network Attached Storage (NAS). In total it has about 150 computer systems in its clusters. The major software components are Lucene (search engine), Nutch (web crawling, etc.) and Hadoop (distributed file system and map-reduce). These are all Open Source projects written in Java and sponsored by the Apache Software Foundation. Krugle sells an enterprise edition that can index and make available all the source code in an enterprise.
MarkMail is a search engine for email. It indexes all public mailing lists and is a technology demonstrator for the MarkLogic XML Content Server. MarkMail is built with newer hardware that is more capable. It uses a Storage Area Network (SAN) for storage, which offers higher performance at a greater cost than NAS. The MarkMail search system is built on about 60 computer systems in its clusters.
