Monday, November 27, 2006

Understanding Metadata Revisited

Last month I wrote down some ideas on metadata. Looking back, they are not very useful. More recently, I got back to reading Ralph Kimball's latest book, "The Data Warehouse ETL Toolkit". As Kimball says, metadata is an important component of ETL; however, he has no more success than I did in producing a definition or a useful explanation, and he slips back, as I did, into the less satisfactory approach of definition by example.

This got me thinking. Maybe we could improve the definition of metadata by tightening up the "data about data" definition. Here is my version: "Metadata is structured data that describes structured data". One important attribute of metadata is that it can be used by programs as well as by people, and for this reason metadata must have a known structure. Also, since metadata describes data, the data it describes has structure as well, if only because the metadata describes it. Compared to other definitions, this one finds a middle ground: specific enough to have some use while not being confined to a specific application.

Properly understanding metadata takes more than absorbing an eight-word definition. I have already mentioned one important attribute: that metadata can be used by programs as well as people. Another important attribute is the distinction between System and User metadata. System metadata is generated by the system that originates or manages the data. For example, the system metadata in a database is the descriptions of the tables, columns, constraints and almost everything else in the catalog. User metadata is created by people and is mostly for use by other people. In a database catalog, User metadata is the contents of comments on the database objects and any semantics that may be associated with the names of the database objects.
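
To make the distinction concrete, here is a minimal sketch using SQLite from Python. The catalog query is real; the column_comments table is my own invention, standing in for the comment facilities that bigger databases provide, since SQLite has no COMMENT ON statement.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE track (id INTEGER PRIMARY KEY, title TEXT NOT NULL, bitrate INTEGER)")

    # System metadata: produced and consumed by the database engine itself.
    for cid, name, col_type, notnull, default, pk in conn.execute("PRAGMA table_info(track)"):
        print(name, col_type, "NOT NULL" if notnull else "", "PK" if pk else "")

    # User metadata: descriptions written by people, for people (hypothetical side table).
    conn.execute("CREATE TABLE column_comments (table_name TEXT, column_name TEXT, comment TEXT)")
    conn.execute("INSERT INTO column_comments VALUES ('track', 'bitrate', 'Encoded bit rate, in kbit/s')")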

A better example of the distinction is found in an MP3 file. System metadata in an MP3 file includes the bit rate, the sampling frequency and whether the track is stereo or mono. User metadata is the contents of the ID3 tag. The distinction between system and user metadata is important because metadata, like any other data, can have data quality issues. System metadata is almost always correct: if it were faulty, the system that generated or used the data would break. On the other hand, User metadata is always suspect. Just ask any music fan about ID3 tags.
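
As a sketch of reading both kinds of metadata from a file, assuming the third-party mutagen library and a local file called song.mp3 (both my assumptions, not anything from the original post):

    from mutagen.mp3 import MP3
    from mutagen.easyid3 import EasyID3

    audio = MP3("song.mp3")
    # System metadata: decoded from the MPEG frame headers, so it is trustworthy.
    print(audio.info.bitrate, audio.info.sample_rate)

    # User metadata: whatever a person (or a careless ripper) typed into the ID3 tag.
    tags = EasyID3("song.mp3")
    print(tags.get("artist"), tags.get("title"))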

I am sure that there is a lot more to say about metadata, however this feels like a good starting point.

Friday, November 24, 2006

Yahoo Directions

The latest edition of Wired has an article by Bob Garfield on YouTube and its acquisition by Google. It contains the following great quote: "success is 1 percent inspiration, 99 percent monetization", and this brought me back to a comment on Yahoo by my colleague Dave McClure. Yahoo has been under siege recently for poor performance, and even the inmates are revolting.

Yahoo sites make up the number one web portal, with by far the most page views of any web media company. Dave diagnoses Yahoo's primary problem as an inability to monetize all those eyeballs, and I wholeheartedly agree with him. As I have reported before, Yahoo's business model is quite simple: get all the world to sign up for compelling Yahoo services and then monetize those users by selling targeted advertising based on knowing them.

In practice, Google, which seems to be reaching for a similar business model, is much more effective at monetizing a smaller audience that it knows less about. When Google gets its full panoply of services out of beta, it could be unstoppable.

Monday, November 06, 2006

Flight Simulator Earth

Today Microsoft unveiled its answer to Google Earth: Microsoft Live Search with 3D images of cities that you can navigate through. It launches with 15 cities and is expanding to 100 cities by next summer.

I took one look at the pictures and immediately recognized what they have done. Microsoft has taken its venerable Flight Simulator program and repurposed it as a web navigation tool. While it may have some initial glitz, in practice it is not going to be nearly as interesting, useful or awesome as Google Maps.

The funny thing is that I read about this in an earnest report on TechCrunch. Neither the reporter nor any of the commenters on the entry noticed the connection between Microsoft Live Search and Flight Simulator. In fact, as might be expected, only about half of the commenters could make it work. For myself, I am not going to take the risk of using Internet Explorer to navigate the web, so I will have to forgo the pleasure of experiencing it first hand.

Disclosure: I have never used Microsoft Flight Simulator. The closest I have come is glancing at a review on a gaming web site some years ago.

Sunday, November 05, 2006

Metcalfe's Law

The July issue of IEEE Spectrum contained an article called "Metcalfe's Law is Wrong". Bob Metcalfe wrote a response to the article as a blog entry in August. The editorial in the November issue of IEEE Spectrum says that the article raised its own little storm of controversy and solicits further comments. Here are my thoughts.

Metcalfe's Law states that the 'value' of a network grows as the square of the number of connection points to the network. This is usually stated in terms of the number of users, where each user is assumed to have their own connection point. The law was popularized in 1993 by George Gilder, author of Telecosm and the Gilder Technology Report and chief cheerleader of the telecom/internet revolution/bubble. Briscoe, Odlyzko and Tilly argue in the IEEE Spectrum article that the actual growth in value is n*log(n) and that the original formulation was bad because it directly led to the speculative excess of the telecom/internet bubble.

Put baldly, Metcalfe's Law says that if I have a network and you have a network, and we connect our networks together, they are worth much more than either network on its own, or even the sum of the two networks. The more networks we connect, the more valuable the whole thing becomes. So the point of Metcalfe's Law is that there is a huge incentive for all networks to join together into one completely connected internetwork. This has come to pass, first for telephones and then for computers. Thus my position is that Metcalfe has been proven correct and that it is academic to argue whether the 'value' (whatever that means) of the network grows as n squared or as n*log(n).
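
A quick back-of-envelope calculation shows why the exact exponent is academic: under either formula, two connected networks are worth more than the two kept apart. The numbers below are illustrative units of my own choosing, not figures from the article.

    from math import log2

    a, b = 100, 100   # users on each of two separate networks

    separate_sq, combined_sq = a**2 + b**2, (a + b)**2
    separate_nlogn = a*log2(a) + b*log2(b)
    combined_nlogn = (a + b)*log2(a + b)

    print(combined_sq / separate_sq)        # 2.0   -- Metcalfe: interconnection doubles the value
    print(combined_nlogn / separate_nlogn)  # ~1.15 -- n*log(n): a smaller gain, but still a gain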

We need to understand the context when looking at Metcalfe's and Gilder's arguments. As Bob Metcalfe says in his blog entry, in 1980 when he devised Metcalfe's Law he was just trying to sell the value of networks and create business for his company 3Com. This was at a time when an Ethernet card cost $5000 and flinty-eyed accountants would argue for reducing the size of their network purchase while he would argue that they should increase it.

George Gilder is the person who foresaw a single interconnected Internet at a time when there were CompuServe, Prodigy, AOL and thousands of local bulletin board systems. All of these were swept away by the internet revolution except for AOL, which managed to ride the wave by co-opting it. So Gilder was correct as well, although he was eventually carried away by the force of his own argument, like many who listened to him.

Wednesday, November 01, 2006

Valleywag

The thing that I miss most from Web 1.0 is Suck. As far as I know, Web 2.0 does not have anything that matches its authority, wit, eclectic pithiness and complete mastery of the cultural reference. In those days, if I took a few minutes after lunch to explore the Internet, I would always turn to Suck.

These days the best I can do is Valleywag. Compared to Suck, it is a sprawling, parochial mess with an unhealthy obsession with TechCrunch blogger Michael Arrington. If Valleywag has a center of gravity, it is closer to Castro Street in Mountain View than to San Francisco. However, it is a good way of keeping up with what is really going on. For example, today there is:

Sunday, October 29, 2006

30 Years of Public Key

The Computer History Museum held a wonderful event celebrating 30 years of Public Key Cryptography last week. In a sense it did not matter that, as the meeting unfolded, we heard that the NSA may well have implemented the idea that was first invented at GCHQ in the UK during the 1970s. The event celebrated the public invention by Diffie and Hellman in 1976 of the ideas that underlie secure use of the internet.

John Markoff introduced the event with Dan Boneh from Stanford and Voltage Security. Dan gave us a short overview that explained what Public Key Cryptography is in terms that even a layman could follow and impressed on us the importance of its discovery.
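
The flavor of what Dan explained can be captured with the classic Diffie-Hellman key agreement in toy form. This is only a sketch with deliberately tiny, insecure numbers of my own choosing; it is not how any of the panelists' products work.

    import random

    # A toy Diffie-Hellman exchange: both sides end up with the same secret,
    # yet only the public values A and B ever cross the wire.
    p, g = 23, 5                     # a small prime and generator, far too small for real use

    a = random.randrange(2, p - 1)   # Alice's private key, never sent
    b = random.randrange(2, p - 1)   # Bob's private key, never sent

    A = pow(g, a, p)                 # Alice sends A in the clear
    B = pow(g, b, p)                 # Bob sends B in the clear

    assert pow(B, a, p) == pow(A, b, p)   # both compute the same shared secret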

This led to the panel run by Steven Levy, who basically took us through his book Crypto, starting with the invention of Public Key Cryptography by panelists Whitfield Diffie and Martin Hellman. The next part of the story was the commercialization of the ideas by RSA Security, represented by Jim Bidzos, and their first serious use in Lotus Notes by panelist Ray Ozzie. Ray took us to the moment where he needed to get an export license for Notes and ran into the roadblock of the government, represented on the panel by Brian Snow, who recently retired from the NSA.

The central discussion revolved around the tension between the defense establishment, which wanted to prevent what it saw as a weapon from falling into the hands of the enemy, and the business interests that needed the technology to make the internet secure for commerce. My big takeaway is the counter-culture character of Diffie and Hellman, particularly Whit Diffie. I take comfort from the fact that people like Diffie are working on my side to ensure that this technology is available for all of us and not just the guys in blue suits and dark glasses.

Wednesday, October 25, 2006

State Management

I was expecting good things from the Ted Neward presentation to the SDForum Software Architecture and Modeling SIG, and I was not disappointed. The presentation was titled "State Management: Shape and Storage: The insidious and slippery problem of storing objects to disk".

However, I have to say that the presentation did not go in the direction I expected. Ted started out by drawing the distinction between Transient state and Durable state, where Durable state is the set of objects that reside in a database. He made a passing reference to transient objects becoming durable, and I was expecting him to talk about whether durable objects should be distinct from transient objects or whether an object can morph from transient to durable and back again, but he did not go there. He did spend some time on the problem of object management on a cluster. For most of the audience this was preaching to the choir.

Next Ted talked about the shape of durable data: Relational, Object or Hierarchical. We all know the object and relational shapes, and the difficulty of mapping between them. XML is hierarchical data. Ted spent some time convincing us that hierarchical data is not the same shape as object data; the two are different and there is no easy mapping from one to the other. This was the piece of the talk that I found most interesting, and he certainly convinced me. After the meeting I went back and looked at my copy of Date (2nd edition, 1977), which has it all. Date describes the three shapes of data: Relational (relational), Hierarchical (XML) and Network (object). Note that Date took the Hierarchical and Network material out of later editions.
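
The mismatch is easy to see when the same data is held in each of the three shapes. Here is a minimal sketch with a made-up order record.

    from dataclasses import dataclass, field
    import xml.etree.ElementTree as ET

    # Object shape: identity plus references that a program navigates.
    @dataclass
    class Order:
        order_id: int
        lines: list = field(default_factory=list)

    order = Order(1, lines=[("widget", 2), ("gadget", 1)])

    # Relational shape: flat rows, with relationships carried by key values.
    orders = [(1,)]
    order_lines = [(1, "widget", 2), (1, "gadget", 1)]

    # Hierarchical shape (XML): containment, where order matters and nothing is shared.
    root = ET.Element("order", id="1")
    for sku, qty in order.lines:
        ET.SubElement(root, "line", sku=sku, qty=str(qty))
    print(ET.tostring(root).decode())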

Finally, Ted referred to his controversial blog post on "Object/Relational Mapping is the Vietnam of Computer Science". His argument is that early success with object-relational mapping has put us on a slippery slope into a morass from which there will be no easy escape. Go read the original, which is better than any summary that I could write here. I agree with him, and I will put more thoughts related to this topic in future posts.

Sunday, October 22, 2006

Internet Search Recovered

Last year, I complained that internet search was being polluted by a lot of worthless affiliate marketing sites. My suggestion then was to start by naming the beast. My good friend Roger Shepherd just pointed out a site that actually does something to deal with the problem.

Give me back my Google is very simple. It functions like a spam blacklist: it just adds to your search string a list of sites that you do not want results from. You can register more sites to blacklist at the GmbmG web site.
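
The mechanism is no more than appending exclusion operators to the query. Here is a sketch of the idea, assuming Google's standard "-site:" operator and a couple of made-up blacklist entries; the real list lives on the GmbmG site.

    blacklist = ["example-affiliates.com", "another-shopbot.net"]   # hypothetical entries

    def build_query(terms, blocked):
        # Append one "-site:" exclusion per blacklisted domain.
        exclusions = " ".join(f"-site:{site}" for site in blocked)
        return f"{terms} {exclusions}"

    print(build_query("hd radio review", blacklist))
    # hd radio review -site:example-affiliates.com -site:another-shopbot.net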

I have even used the site in anger to research HD radio, and I can thoroughly recommend it. (On HD radio, no one has a product that I want and the products that they do have are not yet affordable.)

Wednesday, October 18, 2006

Data Mining and Business Intelligence

After writing about the Netflix prize, I got to thinking about data mining the Netflix data set. On reflection, the problem seemed intractable; that is, until I attended the SDForum Business Intelligence SIG to hear Paul O'Rorke talk on "Data Mining and Business Intelligence".

Paul covered several topics in his talk, including the standard CRISP-DM data mining process and a couple of data mining problem areas and their algorithms. One problem area was frequent item-set mining. This is used for applications like market basket analysis, which looks for items that are frequently bought together.
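
In its simplest form, frequent item-set mining is just counting which combinations clear a support threshold. A minimal sketch with made-up baskets, counting pairs by brute force (real miners use cleverer algorithms such as the FP-growth approach discussed below):

    from itertools import combinations
    from collections import Counter

    baskets = [
        {"milk", "cookies", "beer"},
        {"milk", "cookies"},
        {"beer", "diapers"},
        {"milk", "beer", "diapers"},
    ]

    # Count how often each pair of items appears together in a basket.
    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # Pairs whose support clears the threshold are the frequent item-sets.
    min_support = 2
    print([(pair, n) for pair, n in pair_counts.items() if n >= min_support])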

In the meeting we spent some time discussing what market basket analysis is for. Of course, 'beer and diapers' came up. The main suggested use was store design. If someone who buys milk is likely to buy cookies, then the best way to design the store is with milk and cookies at opposite ends, so that the customer has to walk past all the other shelves with tempting wares while on their simple milk-and-cookies shopping run. I am sure that there are other, more sophisticated uses of market basket analysis. I know of at least one company that has made a good business out of it.

To get back to Netflix, there are similarities between an online movie store and a grocery store. Both have a large number of products, both have a large number of customers, and any particular customer will only ever touch a small fraction of the products. For the supermarket we are interested in understanding which products are bought together in a basket, while for Netflix we are interested in the slightly more complex issue of predicting how a customer will rate a movie.

Paul showed us the FP-Tree data structure and walked us through some of the FP-growth algorithm that uses it. The FP-Tree will only represent the fact that Netflix users have rated movies. As it stands, it cannot also represent the users' ratings; however, it is a good starting point, and there are several implementations available. Also, Netflix could easily use the FP-growth algorithm to recommend movies ("Other people who watched your movie selections also watched ...").
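
For a feel of the data structure, here is a minimal sketch of FP-Tree construction only (the mining step of FP-growth is left out). It follows the standard recipe: count items, discard the infrequent ones, then insert each transaction in descending frequency order so that common prefixes share a path.

    from collections import Counter, defaultdict

    class Node:
        def __init__(self, item, parent):
            self.item, self.parent, self.count = item, parent, 1
            self.children = {}

    def build_fp_tree(transactions, min_support=2):
        # First pass: count items and keep only the frequent ones.
        counts = Counter(item for t in transactions for item in t)
        frequent = {i for i, c in counts.items() if c >= min_support}

        root = Node(None, None)
        header = defaultdict(list)   # item -> its nodes, used later by the mining step

        # Second pass: insert each transaction with items ordered by descending count.
        for t in transactions:
            items = sorted((i for i in t if i in frequent), key=lambda i: (-counts[i], i))
            node = root
            for item in items:
                if item not in node.children:
                    node.children[item] = Node(item, node)
                    header[item].append(node.children[item])
                else:
                    node.children[item].count += 1
                node = node.children[item]
        return root, header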

Saturday, October 14, 2006

The Netflix Prize

Anyone with a handle on Business Intelligence should be looking at the Netflix Prize. Netflix is offering $1 million to the first team that can make a 10% improvement on their Cinematch movie recommendation system. Moreover, just by entering, you get access to a large dataset of movies and specific customer ratings, something that should get any numbers jockey salivating.

Having looked at the competition rules carefully, I see that as usual the competition is somewhat tangential to the goal. The competition is to predict how people will rate movies based on their existing movie ratings, while the goal is to improve the recommendation system.
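
The contest's yardstick is root-mean-square error over a held-out set of ratings, with the prize going to whoever beats Cinematch's error by 10%. A sketch of the arithmetic, with made-up numbers standing in for the real predictions:

    from math import sqrt

    def rmse(predicted, actual):
        return sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

    actual = [3, 5, 2]                            # the member's real ratings (made up)
    baseline = rmse([3.2, 4.1, 2.8], actual)      # stand-in for Cinematch's predictions
    challenger = rmse([3.0, 4.6, 2.3], actual)    # stand-in for a contestant's predictions

    print(challenger <= 0.9 * baseline)           # True here: a 10% lower RMSE wins the prize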

I am a long time Netflix user and I know that their recommendation system has some problems. The first improvement that I would suggest is to not recommend a movie to someone who has already rented the movie from Netflix. I am sure that more than 10% of all the recommendations I see are movies that I have already rented.
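
That fix is nothing more than a set difference over the rental history. A trivial sketch, with made-up titles:

    already_rented = {"The Matrix", "Amelie", "Spirited Away"}
    recommendations = ["Amelie", "Ran", "The Matrix", "City of God"]

    # Drop anything the member has already rented before showing the list.
    fresh = [title for title in recommendations if title not in already_rented]
    print(fresh)   # ['Ran', 'City of God']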

I should say by way of disclosure that I never rate movies. My Netflix queue is more than two years long, I have plenty of other sources of recommendations, and I do not feel the need to do anything more to make my queue grow longer. However, I do glance at the recommendations made after adding a movie to the queue and sometimes add movies from them.

There is a bigger discussion here. The Cinematch system judges people's interest in a movie by what they say they think of other movies. It would be much more effective if it also judged people by what they have done: whether they have already seen the movie, what types of movies they add to their queue and what kinds of movies they promote to the top of their queue. We know that what people say and what they do are often not the same, and that actions speak louder than words, so let's take more heed of what they do and less of what they say.