Sunday, October 29, 2006

30 Years of Public Key

The Computer History Museum held a wonderful event celebrating 30 years of Public Key Cryptography last week. In a sense it did not matter that, as the meeting unfolded, we heard that the NSA may well have implemented the idea after it was first invented at GCHQ in the UK during the 1970s. The event celebrated the public invention by Diffie and Hellman in 1976 of the ideas that underlie secure use of the internet.

John Markoff introduced the event with Dan Boneh from Stanford and Voltage Security. Dan gave us a short overview that explained what Public Key Cryptography is in terms that even a layman could follow and impressed on us the importance of its discovery.

This led to the panel run by Steven Levy, who basically took us through his book Crypto, starting with the invention of Public Key Cryptography by panelists Whitfield Diffie and Martin Hellman. The next part of the story was the commercialization of the ideas by RSA Security, represented by Jim Bidzos, and its first serious use in Lotus Notes by panelist Ray Ozzie. Ray took us to the moment where he needed to get an export license for Notes and ran into the roadblock of the government, represented by Brian Snow, who recently retired from the NSA.

The central discussion revolved around the conflict between the defense establishment, which wanted to prevent what it saw as a weapon falling into the hands of the enemy, and the business interests that needed the technology to make the internet secure for commerce. My big takeaway is the counter-culture character of Diffie and Hellman, particularly Whit Diffie. I take comfort from the fact that people like Diffie are working on my side to ensure that this technology is available for all of us and not just the guys in blue suits and dark glasses.

Wednesday, October 25, 2006

State Management

I was expecting good things from the Ted Neward presentation to the SDForum Software Architecture and Modeling SIG, and I was not disappointed. The presentation was titled "State Management: Shape and Storage: The insidious and slippery problem of storing objects to disk".

However I have to say that the presentation did not go in the direction I expected. Ted started out by drawing the distinction between Transient state and Durable state, where Durable state is the set of objects that reside in a database. He made a passing reference to transient objects becoming durable, and I was expecting him to talk about whether durable objects should be distinct from transient objects or whether an object can morph from transient to durable and back again, but he did not go there. He did spend some time on the problem of object management on a cluster. For most of the audience this was preaching to the choir.

Next Ted talked about the shape of durable data: Relational, Object or Hierarchical. We all know the object and relational shapes, and the difficulty of mapping between them. XML is hierarchical data. Ted spent some time convincing us that hierarchical data is not the same shape as object data; the two are different and there is no easy mapping from one to the other. This was the piece of the talk that I found most interesting, and he certainly convinced me. After the meeting I went back and looked at my copy of Date (2nd edition, 1977), which has it all. Date describes the three shapes of data: Relational (relational), Hierarchical (XML) and Network (object). Note that Date took the Hierarchical and Network material out of later editions.
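One way to see why there is no easy mapping between the object and hierarchical shapes is shared references. The sketch below (my own illustration, not from the talk) builds a small object graph in which two orders share one customer object, then serializes it as XML; the hierarchy is forced to duplicate the shared object, losing its identity:

```python
import xml.etree.ElementTree as ET

# An object graph in which two orders share a single customer object.
class Customer:
    def __init__(self, name):
        self.name = name

class Order:
    def __init__(self, order_id, customer):
        self.order_id = order_id
        self.customer = customer

alice = Customer("Alice")
orders = [Order(1, alice), Order(2, alice)]

# Serializing the graph as a hierarchy forces us to embed the shared
# customer under each order, duplicating it.
root = ET.Element("orders")
for o in orders:
    e = ET.SubElement(root, "order", id=str(o.order_id))
    ET.SubElement(e, "customer").text = o.customer.name

print(ET.tostring(root).decode())
# Each <order> element now carries its own copy of the customer:
# one shared object has become two separate subtrees.
```

Hierarchical formats work around this with ID/IDREF links or XPointer, but at that point you are rebuilding a graph on top of a tree, which is exactly the kind of impedance mismatch Ted was describing.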

Finally Ted referred to his controversial blog post on "Object/Relational Mapping is the Vietnam of Computer Science". His argument is that early success with Object-Relational mapping has put us on a slippery slope into a morass from which there will be no easy escape. Go read the original, which is better than any summary that I could write here. I agree with him, and I will put more thoughts related to this topic in future posts.

Sunday, October 22, 2006

Internet Search Recovered

Last year, I complained that internet search was being polluted by a lot of worthless affiliate marketing sites. My suggestion then was to start by naming the beast. My good friend Roger Shepherd just pointed out a site that actually does something to deal with the problem.

Give me back my Google is very simple. It functions like a spam blacklist: it just adds to your search string a list of sites that you do not want results from. You can register more sites to blacklist at the GmbmG web site.
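The mechanism is easy to sketch. Google supports a "-site:" operator that excludes a domain from results, so appending one exclusion per blacklisted site turns an ordinary query into a filtered one. The domains below are illustrative, not the actual GmbmG blacklist:

```python
# Hypothetical blacklist of worthless affiliate marketing domains.
blacklist = ["example-affiliate.com", "example-scraper.net"]

def filtered_query(query, sites):
    """Append a Google -site: exclusion for each blacklisted domain."""
    return query + " " + " ".join("-site:" + s for s in sites)

print(filtered_query("hd radio review", blacklist))
# hd radio review -site:example-affiliate.com -site:example-scraper.net
```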

I have even used the site in anger to research HD radio, and I can thoroughly recommend it. (On HD radio, no one has a product that I want and the products that they do have are not yet affordable.)

Wednesday, October 18, 2006

Data Mining and Business Intelligence

After writing about the Netflix prize, I got to thinking about data mining the Netflix data set. On reflection the problem seemed intractable, that is until I attended the SDForum Business Intelligence SIG to hear Paul O'Rorke talk on "Data Mining and Business Intelligence".

Paul covered several topics in his talk, including the standard CRISP-DM data mining process, and a couple of data mining problem areas and their algorithms. One problem area was frequent item-set mining. This is used for applications like market basket analysis which looks for items that are frequently bought together.

In the meeting we spent some time discussing what market basket analysis is for. Of course, 'beer and diapers' came up. The main suggested use was store design. If someone who buys milk is likely to buy cookies, then the best way to design the store is with milk and cookies at opposite ends, so that the customer has to walk past all the other shelves of tempting wares on their simple milk-and-cookies shopping run. I am sure that there are other more sophisticated uses of market basket analysis; I know of at least one company that has made a good business out of it.
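To make frequent item-set mining concrete, here is a brute-force sketch (my own toy example, not from the talk): count every pair of items bought together and keep the pairs that meet a minimum support threshold. Apriori and FP-growth exist precisely to get the same answer without enumerating every combination:

```python
from collections import Counter
from itertools import combinations

# Toy market-basket data; each transaction is one customer's basket.
baskets = [
    {"milk", "cookies", "bread"},
    {"milk", "cookies"},
    {"beer", "diapers"},
    {"milk", "bread"},
    {"beer", "diapers", "milk"},
]

# Count every pair of items that appear together in a basket.
pair_counts = Counter()
for basket in baskets:
    for pair in combinations(sorted(basket), 2):
        pair_counts[pair] += 1

# Keep only pairs that reach the minimum support threshold.
min_support = 2
frequent = {p: c for p, c in pair_counts.items() if c >= min_support}
print(frequent)
```

On this data the surviving pairs are milk/cookies, milk/bread and, of course, beer/diapers, each appearing in two baskets.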

To get back to Netflix, there are similarities between an online movie store and a grocery store. Both have a large number of products, both have a large number of customers, and any particular customer will only ever touch a small fraction of the products. For the supermarket we are interested in understanding what products are bought together in a basket, while for Netflix we are interested in the slightly more complex issue of predicting how a customer will rate a movie.

Paul showed us the FP-Tree data structure and walked us through some of the FP-growth algorithm that uses it. The FP-Tree as described only represents the fact that Netflix users have rated movies; as it stands, it cannot also represent the users' ratings. However it is a good starting point, and there are several implementations available. Also, Netflix could easily use the FP-growth algorithm to recommend movies ("Other people who watched your movie selections also watched ...").
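The core idea of the FP-Tree is easy to sketch (a simplified reconstruction, not Paul's slides): order the items in each transaction by global frequency, then insert the transactions into a tree so that transactions sharing a prefix share nodes. The shared prefixes are what make the structure compact enough for FP-growth to mine:

```python
from collections import Counter

class Node:
    """One FP-Tree node: an item, its count, and child nodes."""
    def __init__(self, item):
        self.item = item
        self.count = 0
        self.children = {}

def build_fp_tree(transactions, min_support=1):
    # Global item frequencies determine the insertion order.
    freq = Counter(item for t in transactions for item in t)
    root = Node(None)
    for t in transactions:
        # Most frequent items first, ties broken alphabetically.
        items = sorted((i for i in t if freq[i] >= min_support),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            node = node.children.setdefault(item, Node(item))
            node.count += 1
    return root

tree = build_fp_tree([{"milk", "bread"}, {"milk", "cookies"},
                      {"milk", "bread", "cookies"}])
# "milk" is the most frequent item, so all three transactions pass
# through one shared node for it at the top of the tree.
print(tree.children["milk"].count)
```

A real implementation also keeps a header table linking all nodes for each item, which is what FP-growth traverses to extract the frequent item-sets.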

Saturday, October 14, 2006

The Netflix Prize

Anyone with a handle on Business Intelligence should be looking at the Netflix Prize. Netflix is offering $1 million to the first team that can make a 10% improvement in their Cinematch movie recommendation system. Moreover, just by entering, you get access to a large dataset of movies and customer ratings, something that should get any numbers jockey salivating.

Having looked at the competition rules carefully, I see that as usual the competition is somewhat tangential to the goal. The competition is to predict how people will rate movies based on their existing movie ratings, while the goal is to improve the recommendation system.

I am a long time Netflix user and I know that their recommendation system has some problems. The first improvement that I would suggest is to not recommend a movie to someone who has already rented the movie from Netflix. I am sure that more than 10% of all the recommendations I see are movies that I have already rented.
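That first improvement amounts to nothing more than a set-difference filter over the recommendation list. A trivial sketch, with made-up movie titles:

```python
# Movies this customer has already rented (illustrative data).
already_rented = {"Memento", "Heat", "Ran"}

# Raw output of a recommendation system, best matches first.
recommendations = ["Ran", "Ikiru", "Heat", "Yojimbo"]

# Drop anything the customer has already rented, preserving rank order.
fresh = [m for m in recommendations if m not in already_rented]
print(fresh)  # ['Ikiru', 'Yojimbo']
```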

I should say by way of disclosure that I never rate movies. My Netflix queue is more than 2 years long, I have plenty of other sources of recommendations and I do not feel the need to do anything more to make my queue grow longer. However I do glance at the recommendations made after adding a movie to the queue and sometimes add movies from them.

There is a bigger discussion here. The Cinematch system judges people's interest in a movie by what they say they think of movies. It would be much more effective if it also judged people by what they have done: whether they have already seen the movie, what types of movie they add to their queue and what kinds of movies they promote to the top of their queue. We know that what people say and what they do are often not the same, and that actions speak louder than words, so let's take more heed of what they do and less of what they say.

Thursday, October 12, 2006

Spamhaus Case Threat

We have this vision of the Net as the great level playing field where all the world can come on an equal basis. For some time, people outside the US have been concerned that the US has too much ownership and influence in the Net. Now the concerns are coming home to roost with the Spamhaus case.

The case is quite simple. Spamhaus.org is a volunteer-run organization in London that issues lists of email spammers to ISPs and other organizations around the world. A US business, e360Insight LLC, sued Spamhaus in a Chicago court to have its name removed from the Spamhaus list of email spammers. Spamhaus has no money and considers that the Chicago court did not have jurisdiction over it, so it did not appear to defend its position. Having mounted no defense, it lost the case by default.

First the judge issued an $11M judgment against Spamhaus, which Spamhaus has ignored. Now the judge has proposed an order that ICANN, a US-based organization, take away the Spamhaus.org domain name. Without their domain name, people who use Spamhaus would not know where to go to get its email spammer lists.

The problem is that ICANN is a US organization and therefore subject to the vagaries of US justice, which is somewhat variable in the lower courts. There has been good coverage of the Spamhaus case, but only today have people started to realize its long-term consequences. I think we would all feel that the net would be much more neutral if its governing bodies were based in a neutral country like Switzerland.

Monday, October 09, 2006

Understanding Metadata

Some time ago I made the remark that "metadata is the cure for all problems digital". Well, it is true, metadata IS the cure, although this statement is a bit like the old Computer Science saw that any problem in Computer Science can be solved by adding an extra level of indirection. The problem with metadata is that it is one of those slippery concepts that seems so straightforward, but turns out to be difficult to pin down when you go to look into the details.

Ask a number of people what metadata is and you get a variety of answers. There are experts who will tell you that metadata is data about data and then go starry-eyed as their minds get lost in an infinite regression, while you wait in vain to hear something more useful and concrete. On the other hand, there are people who will tell you that metadata is the ID3 tags found in MP3 files. Of course metadata is data about data, but that definition does not capture the essence of it; and while ID3 tags are metadata, there is a lot more useful metadata in an MP3 file than just the ID3 tags, let alone all the metadata in all the other data stores, data sources and file types that are available.

To get an understanding of metadata, a good starting point is to look at a diverse set of examples and then look at some of the important attributes and characteristics that are common to these examples. So, to kick this off, here is a description of metadata in a database, an XML file and an MP3 file. We will look at attributes and characteristics in later posts.

In a SQL database, the metadata is called the Catalog, and it is presented as a set of tables like any other database tables. The catalog contains tables that define the tables, columns, views, permissions and so on. In practice the catalog is an external representation of the information that the database system needs to access its data. Internally a catalog can be stored as a set of database tables or just as some data structures; I have seen it done both ways. Either way, the catalog is always presented as a set of tables so that the user can query it just like any other data. For example, I once fixed a bug by writing a mind-bendingly complicated query on a catalog, rather than updating the catalog table definitions so that the required information could be had easily.
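Querying a catalog with ordinary SQL is easy to demonstrate. SQLite calls its catalog sqlite_master; other databases expose the same idea under names like information_schema, with different table and column layouts. A minimal sketch:

```python
import sqlite3

# Build a throwaway in-memory database with one table and one view.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE movies (id INTEGER PRIMARY KEY, title TEXT)")
conn.execute("CREATE VIEW recent AS SELECT title FROM movies")

# The catalog is queried with plain SQL, just like any other table.
rows = conn.execute(
    "SELECT type, name FROM sqlite_master ORDER BY name"
).fetchall()
print(rows)
# The catalog lists both the table and the view we just defined.
```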

In an XML document, the tags are the metadata. Well, except for the fact that tags can contain attributes, and the value part of an attribute is data. Next we have to settle the argument about whether processing instructions are metadata by saying that some types of processing instruction are metadata and other types are not. Then there are DTDs and Schemas, which are metadata and also metadata about the metadata (which is allowed by the definition that metadata is data about data). Some of the metadata can be in other XML documents that are referenced by URLs.

An MP3 file consists of a sequence of frames where each frame contains a header and data representing the sound. The header is metadata, containing useful information like the bit rate, frequency and whether the data is stereo. An ID3 tag is a frame with a header that indicates that the data is not an MP3 sound frame. The ID3 tag contains information about the artist, recording and album. There are several different versions of ID3 tags that are not upwardly compatible with one another.
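As a sketch of how much metadata sits in those four header bytes, here is a minimal parser for an MPEG-1 Layer III frame header. The tables are deliberately abbreviated to that one version and layer; a real parser handles the other MPEG versions and layers too:

```python
# Bitrate (kbps) and sample rate tables for MPEG-1 Layer III only.
BITRATES = [0, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320]
SAMPLE_RATES = [44100, 48000, 32000]

def parse_header(header_bytes):
    """Extract metadata fields from a 4-byte MPEG-1 Layer III header."""
    h = int.from_bytes(header_bytes, "big")
    # The top 11 bits are the frame sync, all set to 1.
    assert h >> 21 == 0x7FF, "not a valid frame sync"
    return {
        "bitrate_kbps": BITRATES[(h >> 12) & 0xF],  # 4-bit bitrate index
        "sample_rate": SAMPLE_RATES[(h >> 10) & 0x3],  # 2-bit rate index
        "stereo": ((h >> 6) & 0x3) != 0x3,  # channel mode 11 = mono
    }

# 0xFFFB9000 is a common header: 128 kbps, 44.1 kHz, stereo.
print(parse_header(bytes([0xFF, 0xFB, 0x90, 0x00])))
```

Every frame in the file carries this header, which is why a player can start decoding in the middle of a stream: the metadata travels with the data.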

Sunday, October 08, 2006

Glassbox

Being a Business Intelligence type, if I were given the job of devising a tool for analyzing Java applications, I would build a component to collect performance data and then present the user with a bunch of fancy reports, charts and data cubes where they could drill down to work out what their problem is. Glassbox takes a different approach as we heard at the SDForum Java SIG last Tuesday.

Glassbox collects data from Java apps running in an application server, analyses it for a set of common problems and produces a report that tells you exactly what the problem is. No complicated analysis, no slicing and dicing, just a simple report on the facts. Of course it can only tell you about problems that it has been programmed to analyze, however the common problems are well known: things like too many database queries to answer a web request, slow database response, or slow complicated Java that seems to call the same function too many times. It is the kind of solution that gets 90% of the problems with 10% of the effort of a full-blown performance analysis tool. Another advantage of this approach is that the tool can be used by anyone, without a lot of training or time spent getting experienced in its use.

While Glassbox has not wasted time building fancy displays, they have taken the trouble to collect their data in a straightforward and unobtrusive way. As we were shown, you just add a .war file to your application server's directory, update the application server's configuration, restart it and you are on your way. Supposedly data collection only adds 1% or so to program execution times.

All in all, Glassbox looks like a good place to start identifying problems with web apps. As it is Open Source, and easy to use, the cost of trying it out is low.

Wednesday, October 04, 2006

Definitive Source

When George Smoot was called at 3 am by someone with a Swedish accent telling him that he had won the Nobel Prize for Physics, he thought it might be a prank. So what did he do? He went to the Nobel Prize web site to check whether it was true. We have come to the point where a Nobel laureate trusts the World Wide Web as the definitive source of information rather than a phone call.