Last month I wrote down some ideas on metadata. Looking back, they are not very useful. More recently, I got back to reading Ralph Kimball's latest book "The Data Warehouse ETL Toolkit". As Kimball says, metadata is an important component of ETL; however, he has no more success than I did in producing a definition or a useful explanation, and slips back, as I did, into the less satisfactory approach of definition by example.
This got me thinking. Maybe we could improve the definition of metadata by tightening up the "data about data" definition. Here is my version: "Metadata is structured data that describes structured data". One important attribute of metadata is that it can be used by programs as well as by people, and for this reason metadata must have a known structure. Also, since metadata describes data, the data it describes has structure as well, if only because the metadata describes it. Compared to other definitions, this one finds a middle ground, specific enough to have some use while not being confined to a specific application.
Properly understanding metadata takes more than absorbing an eight-word definition. I have already mentioned one important attribute, that metadata can be used by programs as well as people. Another important attribute is the distinction between System and User metadata. System metadata is generated by the system that originates or manages the data. For example, the system metadata in a database is the description of the tables, columns, constraints and almost everything else in the catalog. User metadata is created by people, mostly for use by other people. In a database catalog, User metadata is the contents of comments on the database objects and any semantics that may be associated with the names of the database objects.
A better example of the distinction is found in an MP3 file. System metadata in an MP3 file is the bit rate, frequency and stereo or mono mode. User metadata is the contents of the ID3 tag. The distinction between system and user metadata is important because metadata, like any other data, can have data quality issues. System metadata is almost always correct: if the system metadata were faulty, the system that generated or used the data would break. On the other hand, User metadata is always suspect. Just ask any music fan about ID3 tags.
I am sure that there is a lot more to say about metadata, however this feels like a good starting point.
Monday, November 27, 2006
Friday, November 24, 2006
Yahoo Directions
The latest edition of Wired has an article by Bob Garfield on YouTube and its acquisition by Google. It contains the following great quote: "success is 1 percent inspiration, 99 percent monetization", and this brought me back to a comment on Yahoo by my colleague Dave McClure. Yahoo has been under siege recently for poor performance and even the inmates are revolting.
Yahoo sites are the number one web portal, with by far the most page views of any web media company. Dave diagnoses Yahoo's primary problem as an inability to monetize all those eyeballs, and I wholeheartedly agree with him. As I have reported before, Yahoo's business model is quite simple: get all the world to sign up for compelling Yahoo services and then monetize them by selling targeted advertising based on knowing the user.
In practice, Google, which seems to be reaching for a similar business model, is much more effective at monetizing a smaller audience that it knows less about. When Google gets its full panoply of services out of beta, it could be unstoppable.
Monday, November 06, 2006
Flight Simulator Earth
Today Microsoft unveiled its answer to Google Earth: Microsoft Live Search with 3D images of cities that you can navigate through. It currently covers 15 cities and is expanding to 100 cities by next summer.
I took one look at the pictures and immediately recognized what they have done. Microsoft has taken its venerable Flight Simulator program and repurposed it as a web navigation tool. While it may have some initial glitz, in practice it is not going to be nearly as interesting, useful or awesome as Google Maps.
The funny thing is that I read about this in an earnest report on TechCrunch. Neither the reporter nor any of the comments on the entry noticed the connection between Microsoft Live Search and Flight Simulator. In fact, as might be expected, only about half of the commenters could make it work. For myself, I am not going to take the risk of using Internet Explorer to navigate the web, so I will have to forgo the pleasure of experiencing it first hand.
Disclosure: I have never used Microsoft Flight Simulator. The closest I have come is glancing at a review on a gaming web site some years ago.
Sunday, November 05, 2006
Metcalfe's Law
The July issue of IEEE Spectrum contained an article called "Metcalfe's Law is Wrong". Bob Metcalfe wrote a response to the article as a blog entry in August. The editorial in the November issue of IEEE Spectrum says that the article raised its own little storm of controversy and solicits further comments. Here are my thoughts.
Metcalfe's Law states that the 'value' of a network grows as the square of the number of connection points to the network. This is usually stated as the number of users, where each user is assumed to have their own connection point. The law was popularized in 1993 by George Gilder, writer of Telecosm and the Gilder Technology Report and chief cheerleader of the telecom/internet revolution/bubble. Briscoe, Odlyzko and Tilly argue in the IEEE Spectrum article that the actual growth in value is n*log(n) and that the original formulation was bad because it directly led to the speculative excess of the telecom/internet bubble.
Put baldly, Metcalfe's Law says that if I have a network and you have a network, and we connect our networks together, they are worth much more than either network on its own, or even the sum of the two networks. The more networks we connect, the more valuable the whole thing becomes. So the point of Metcalfe's Law is that there is a huge incentive for all networks to join together into one completely connected internetwork. This has come to pass, first for telephones and then for computers. Thus my position is that Metcalfe has been proven correct, and that it is academic to argue whether the 'value' (whatever that means) of the network grows as n squared or merely as n*log(n).
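To see how much is actually at stake in that argument, here is a small sketch of my own (not from the article or from Metcalfe's response) comparing the two formulas for networks of various sizes:

    import math

    def metcalfe_value(n):
        """Metcalfe's Law: value grows as the square of the number of connection points."""
        return n * n

    def odlyzko_value(n):
        """The Briscoe/Odlyzko/Tilly alternative: value grows as n * log(n)."""
        return n * math.log(n)

    # Compare the two formulas as the network grows.
    for n in (10, 1_000, 1_000_000, 1_000_000_000):
        ratio = metcalfe_value(n) / odlyzko_value(n)
        print(f"n={n:>13,}  n^2={metcalfe_value(n):.2e}  n*log(n)={odlyzko_value(n):.2e}  ratio={ratio:.1f}")

At a billion users the two formulas differ by a factor in the tens of millions, yet both keep growing without bound, which is why the incentive to interconnect holds either way.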
We need to understand the context when looking at Metcalfe's and Gilder's arguments. As Bob Metcalfe says in his blog entry, in 1980 when he devised Metcalfe's Law he was just trying to sell the value of networks and create business for his company 3Com. This was at a time when an Ethernet card cost $5000, and flinty-eyed accountants would argue to reduce the size of their network buy while he argued that they should increase it.
George Gilder is the person who foresaw a single interconnected Internet at a time when there was CompuServe, Prodigy, AOL and thousands of local bulletin board systems. All of these were swept away by the internet revolution except for AOL who managed to ride the wave by co-opting it. So Gilder was correct as well, although he was eventually carried away by the force of his own argument like many who listened to him.
Wednesday, November 01, 2006
Valleywag
The thing that I miss most from Web 1.0 is Suck. As far as I know, Web 2.0 does not have anything that matches its authority, wit, eclectic pithiness and complete mastery of the cultural reference. In those days, if I took a few minutes after lunch to explore the Internet, I would always turn to Suck.
These days the best I can do is Valleywag. Compared to Suck it is a sprawling parochial mess with an unhealthy obsession with TechCrunch blogger Michael Arrington. If Valleywag has a center of gravity, it is closer to Castro Street in Mountain View than to San Francisco. However it is a good way of keeping up with what is really going on. For example, today there is:
- A crook who is trying to destroy the internet.
- A Microsoft person explaining why their user experience sucks.
- Obligatory reference to "somebody set us up the bomb".
- Supr.c.ilio.us jargon: NSFCS (not safe for coffee shops).
Sunday, October 29, 2006
30 Years of Public Key
The Computer History Museum held a wonderful event celebrating 30 years of Public Key Cryptography last week. In a sense it did not matter that, as the meeting unfolded, we heard that the NSA may well have implemented the idea that was first invented at GCHQ in the UK during the 70's. The event celebrated the public invention by Diffie and Hellman in 1976 of the ideas that underlie secure use of the internet.
John Markoff introduced the event with Dan Boneh from Stanford and Voltage Security. Dan gave us a short overview that explained what Public Key Cryptography is in terms that even a layman could follow and impressed on us the importance of its discovery.
This led to the panel run by Steven Levy, who basically took us through his book Crypto, starting with the invention of Public Key Cryptography by panelists Whitfield Diffie and Martin Hellman. The next part of the story was the commercialization of the ideas by RSA Security, represented by Jim Bidzos, and its first serious use in Lotus Notes by panelist Ray Ozzie. Ray took us to the moment where he needed to get an export license for Notes and ran into the roadblock of the government, represented by Brian Snow, who recently retired from the NSA.
The central discussion revolved around the position of the defense establishment who wanted to prevent what they saw as a weapon falling into the hands of the enemy and the business interests who needed the technology to make the internet secure for commerce. My big takeaway is the counter-culture character of Diffie and Hellman, particularly Whit Diffie. I get comfort from the fact that people like Diffie are working on my side to ensure that this technology is available for all of us and not just the guys in blue suits and dark glasses.
Wednesday, October 25, 2006
State Management
I was expecting good things from the Ted Neward presentation to the SDForum Software Architecture and Modeling SIG, and I was not disappointed. The presentation was titled "State Management: Shape and Storage: The insidious and slippery problem of storing objects to disk".
However I have to say that the presentation did not go in the direction I expected. Ted started out by drawing the distinction between Transient state and Durable state, where Durable state is the set of objects that reside in a database. He made a passing reference to transient objects becoming durable, and I was expecting him to talk about whether durable objects should be distinct from transient objects or whether an object can morph from transient to durable and back again, but he did not go there. He did spend some time on the problem of object management on a cluster. For most of the audience this was preaching to the choir.
Next Ted talked about the shape of durable data: Relational, Object or Hierarchical. We all know the object and relational shapes, and the difficulty of mapping between them. XML is hierarchical data. Ted spent some time convincing us that hierarchical data is not the same shape as object data; the two are different and there is no easy mapping from one to the other. This was the piece of the talk that I found most interesting and he certainly convinced me. After the meeting I went back and looked at my copy of Date (2nd edition, 1977), which has it all. Date describes the three shapes of data: Relational (relational), Hierarchical (XML) and Network (object). Note that Date took the Hierarchical and Network stuff out of later editions.
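As a rough illustration of the three shapes (my own example, not Ted's), here is the same tiny order expressed as flat relational rows, as one nested hierarchy, and as a network of objects that reference each other:

    # The same order, in three shapes.

    # 1. Relational: flat rows linked by keys (customer_id and order_id).
    customers = [(1, "Alice")]
    orders = [(100, 1, "2006-10-25")]           # (order_id, customer_id, date)
    order_lines = [(100, "widget", 3)]          # (order_id, product, quantity)

    # 2. Hierarchical: one tree with children nested inside their parent (the shape of XML).
    order_tree = {
        "order": {
            "id": 100,
            "date": "2006-10-25",
            "customer": {"id": 1, "name": "Alice"},
            "lines": [{"product": "widget", "quantity": 3}],
        }
    }

    # 3. Network/object: objects holding direct references, possibly in both directions.
    class Customer:
        def __init__(self, name):
            self.name = name
            self.orders = []          # a customer points at many orders...

    class Order:
        def __init__(self, customer, date):
            self.customer = customer  # ...and each order points back at its customer
            self.date = date
            customer.orders.append(self)

    alice = Customer("Alice")
    order = Order(alice, "2006-10-25")
    print(order.customer.name, len(alice.orders))

The cycle between Customer and Order is exactly the kind of thing that has no direct equivalent in a strict hierarchy, which is why the object-to-XML mapping is not as easy as it first looks.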
Finally Ted referred to his controversial blog post on "Object/Relational Mapping is the Vietnam of Computer Science". His argument is that early success with Object Relational mapping has got us on a slippery slope to a morass from which there will be no easy escape. Go read the original, which is better than any summary that I could write here. I agree with him and I will put more thoughts related to this topic in future posts.
Sunday, October 22, 2006
Internet Search Recovered
Last year, I complained that internet search was being polluted by a lot of worthless affiliate marketing sites. My suggestion then was to start by naming the beast. My good friend Roger Shepherd just pointed out a site that actually does something to deal with the problem.
Give me back my Google is very simple. It functions like a Spam blacklist in that it just adds to your search string a list of sites that you do not want results from. You can register more sites to blacklist at the GmbmG web site.
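Mechanically, the trick amounts to appending exclusion operators to the query. Here is a minimal sketch of the idea (the blacklist entries below are made-up placeholders, not GmbmG's actual list):

    # Build a Google-style query that excludes results from blacklisted sites.
    blacklist = ["affiliate-spam.example.com", "price-scraper.example.net"]  # hypothetical entries

    def filtered_query(search_terms, blacklist):
        exclusions = " ".join(f"-site:{site}" for site in blacklist)
        return f"{search_terms} {exclusions}"

    print(filtered_query("HD radio tuner review", blacklist))
    # -> HD radio tuner review -site:affiliate-spam.example.com -site:price-scraper.example.net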
I have even used the site in anger to research HD radio, and I can thoroughly recommend it. (On HD radio, no one has a product that I want and the products that they do have are not yet affordable.)
Wednesday, October 18, 2006
Data Mining and Business Intelligence
After writing about the Netflix prize, I got to thinking about data mining the Netflix data set. On reflection the problem seemed intractable, that is until I attended the SDForum Business Intelligence SIG to hear Paul O'Rorke talk on "Data Mining and Business Intelligence".
Paul covered several topics in his talk, including the standard CRISP-DM data mining process, and a couple of data mining problem areas and their algorithms. One problem area was frequent item-set mining. This is used for applications like market basket analysis which looks for items that are frequently bought together.
In the meeting we spent some time discussing what market basket analysis is for. Of course, 'beer and diapers' came up. The main suggested use was store design. If someone who buys milk is likely to buy cookies, then the best way to design the store is with milk and cookies at opposite ends of the store, so that the customer has to walk past all the other shelves with tempting wares while on their simple milk and cookies shopping run. I am sure that there are other more sophisticated uses of market basket analysis. I know of at least one company that has made a good business out of it.
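At its simplest, frequent item-set mining is just counting. Here is a small sketch of my own that counts how often pairs of items appear in the same basket, which is the kernel that algorithms like FP-growth compute efficiently at scale:

    from itertools import combinations
    from collections import Counter

    # Each basket is the set of items in one checkout.
    baskets = [
        {"milk", "cookies", "bread"},
        {"milk", "cookies"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "diapers", "cookies"},
    ]

    pair_counts = Counter()
    for basket in baskets:
        for pair in combinations(sorted(basket), 2):
            pair_counts[pair] += 1

    # Pairs that appear in at least two baskets are "frequent" at this support threshold.
    min_support = 2
    for pair, count in pair_counts.most_common():
        if count >= min_support:
            print(pair, count)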
To get back to Netflix, there are similarities between an online movie store and a grocery store. Both have a large number of products, both have a large number of customers and any particular customer will only get a small number of the products. For the supermarket we are interested in understanding what products are bought together in a basket, while for Netflix we are interested in the slightly more complex issue of predicting how a customer will rate a movie.
Paul showed us the FP-Tree data structure and some of the FP-growth algorithm for using it. The FP-Tree will only represent the fact that Netflix users have rated movies; as it stands, it cannot also represent the users' ratings. However it is a good starting point, and there are several implementations available. Also, Netflix could easily use the FP-growth algorithm to recommend movies ("Other people who watched your movie selections also watched ...").
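To give a feel for the data structure, here is a minimal FP-Tree construction sketch of my own, treating each user as a 'basket' of the movies they have rated. As noted above, it records only the presence of a rating, not its value:

    from collections import Counter

    class FPNode:
        def __init__(self, item, parent):
            self.item = item
            self.count = 0
            self.parent = parent
            self.children = {}        # item -> FPNode

    def build_fp_tree(baskets, min_support=2):
        # Pass 1: count item frequencies and drop infrequent items.
        freq = Counter(item for basket in baskets for item in basket)
        keep = {item for item, c in freq.items() if c >= min_support}

        # Pass 2: insert each basket with items ordered by descending frequency,
        # so common prefixes share a path and the tree stays compact.
        root = FPNode(None, None)
        for basket in baskets:
            items = sorted((i for i in basket if i in keep), key=lambda i: (-freq[i], i))
            node = root
            for item in items:
                node = node.children.setdefault(item, FPNode(item, node))
                node.count += 1
        return root

    users = [
        {"Alien", "Blade Runner", "Brazil"},
        {"Alien", "Blade Runner"},
        {"Brazil", "Amelie"},
        {"Alien", "Blade Runner", "Amelie"},
    ]
    tree = build_fp_tree(users)
    print({item: child.count for item, child in tree.children.items()})

The FP-growth algorithm then mines frequent item-sets by walking this compressed tree instead of rescanning the raw data.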
Saturday, October 14, 2006
The Netflix Prize
Anyone with a handle on Business Intelligence should be looking at the Netflix Prize. Netflix is offering $1 million to the first team that can make a 10% improvement in their Cinematch movie recommendation system. Moreover, just by entering, you get access to a large dataset of movies and specific customer recommendations, something that should get any numbers jockey salivating.
Having looked at the competition rules carefully, I see that as usual the competition is somewhat tangential to the goal. The competition is to predict how people will rate movies based on their existing movie ratings, while the goal is to improve the recommendation system.
I am a long time Netflix user and I know that their recommendation system has some problems. The first improvement that I would suggest is to not recommend a movie to someone who has already rented the movie from Netflix. I am sure that more than 10% of all the recommendations I see are movies that I have already rented.
I should say by way of disclosure that I never rate movies. My Netflix queue is more than 2 years long, I have plenty of other sources of recommendations and I do not feel the need to do anything more to make my queue grow longer. However I do glance at the recommendations made after adding a movie to the queue and sometimes add movies from them.
There is a bigger discussion here. The Cinematch system judges people's interest in a movie by what they say they think of movies. It would be much more effective if it also judged people by what they have done: whether they have already seen the movie, what types of movie they add to their queue and what kinds of movies they promote to the top of their queue. We know that what people say and what they do are often not the same, and that actions speak louder than words, so let's take more heed of what they do and less of what they say.
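As a sketch of what that could look like (entirely my own illustration, not how Cinematch works; all field names and weights are made up): filter out anything already rented, then score candidates by what the user did as well as by the predicted rating:

    # Hypothetical per-user signals; the names and weights are invented for illustration.
    already_rented = {"The Matrix", "Amelie"}
    queue_adds = {"Brazil", "Alien"}            # movies the user added to their queue
    queue_promotions = {"Alien"}                # movies moved to the top of the queue

    def score(movie, predicted_rating):
        """Blend the rating prediction with what the user actually did."""
        s = predicted_rating                    # what they say (or what we predict they would say)
        if movie in queue_adds:
            s += 1.0                            # what they did: showed interest
        if movie in queue_promotions:
            s += 2.0                            # what they did: showed strong interest
        return s

    candidates = {"Brazil": 3.9, "Alien": 3.5, "The Matrix": 4.8, "Blade Runner": 4.1}

    recommendations = sorted(
        (m for m in candidates if m not in already_rented),   # never recommend a past rental
        key=lambda m: score(m, candidates[m]),
        reverse=True,
    )
    print(recommendations)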
Thursday, October 12, 2006
Spamhaus Case Threat
We have this vision of the Net as the great level playing field where all the world can come on an equal basis. For some time, people outside the US have been concerned that the US has too much ownership and influence in the Net. Now the concerns are coming home to roost with the Spamhaus case.
The case is quite simple. Spamhaus.org is a volunteer-run organization in London that issues lists of email spammers to ISPs and other organizations around the world. A US business, e360Insight LLC, sued Spamhaus in a Chicago court to have its name removed from the Spamhaus list of email spammers. Spamhaus has no money and considers that the Chicago court does not have jurisdiction over it, so it did not appear to defend its position. As it did not defend its position, it lost the case.
First the judge issued a judgment of $11M against Spamhaus, which has been ignored. Now the judge has proposed an order that ICANN, a US-based organization, should take away the Spamhaus.org domain name. Without their domain name, people who use Spamhaus would not know where to go to get their email spammer lists.
The problem is that ICANN is a US organization and therefore subject to the vagaries of US justice, which is somewhat variable in the lower courts. There has been good coverage of the Spamhaus case, but only today have people started to realize its long term consequences. I think we would all feel that the net would be much more neutral if its governing bodies were based in a neutral country like Switzerland.
Monday, October 09, 2006
Understanding Metadata
Some time ago I made the remark that "metadata is the cure for all problems digital". Well it is true, metadata IS the cure, although this statement is a bit like that old Computer Science saw that any problem in Computer Science can be solved by adding an extra level of indirection. The problem with metadata is that it is one of those slippery concepts that seems so straightforward, but when you go to look into the details, is difficult to pin down.
Ask a number of people what metadata is and you get a variety of answers. There are experts who will tell you that metadata is data about data and then go starry eyed as their mind gets lost in an infinite regression, while you wait unrequited wanting to hear something more useful and concrete. On the other hand there are people who will tell you that metadata is the ID3 tags found in MP3 files. Of course metadata is data about data, but that definition does not capture the essence of it, and while ID3 tags are metadata, there is a lot more useful metadata in an MP3 file than just the ID3 tags, let alone all the metadata in all the other data stores, data sources and file types that are available.
To get an understanding of metadata, a good starting point is to look at a diverse set of examples and then look at some of the important attributes and characteristics that are common to these examples. So, to kick this off, here is a description of metadata in a database, an XML file and an MP3 file. We will look at attributes and characteristics in later posts.
In a SQL database, the metadata is called the Catalog, and it is presented as a set of tables like any other database tables. The catalog contains tables that define the tables, columns, views, permissions and so on. In practice the catalog is an external representation of the information that the database system needs to access its data. Internally a catalog can be stored as a set of database tables or just as some data structures; I have seen it done both ways. The catalog is always presented as a set of tables so that the user can query the catalog just like any other data. For example, I have fixed a bug by writing a mind-bendingly complicated query on a catalog rather than updating the definition of the catalog table to get the required information easily.
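Because the catalog is presented as tables, you can query it like anything else. Here is a minimal sketch using SQLite from Python; SQLite keeps its catalog in the sqlite_master table and exposes column details through the table_info pragma, while other databases offer the same idea through INFORMATION_SCHEMA or their own system views:

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE customer (id INTEGER PRIMARY KEY, name TEXT NOT NULL, city TEXT)")

    # Query the catalog like any other table: which tables exist?
    for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type = 'table'"):
        print(name)
        # Ask the catalog for the column metadata of each table.
        for cid, col, coltype, notnull, default, pk in conn.execute(f"PRAGMA table_info({name})"):
            print(f"  {col:10} {coltype:8} notnull={notnull} pk={pk}")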
In an XML document, the tags are the metadata. Well, except for the fact that tags can contain attributes, and the value part of an attribute is data. Next we have to settle the argument about whether processing instructions are metadata by saying that some types of processing instruction are metadata and other types are not. Then there are DTDs and Schemas that are metadata, and also metadata about the metadata (which is allowed by the definition that metadata is data about data). Some of the metadata can be in other XML documents that are referenced by URLs.
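To make the tag/attribute split concrete, here is a tiny made-up document parsed with ElementTree; the element names and attribute names are the metadata, while the text content and attribute values are the data being described:

    import xml.etree.ElementTree as ET

    doc = """<album genre="jazz">
      <title>Kind of Blue</title>
      <artist>Miles Davis</artist>
    </album>"""

    root = ET.fromstring(doc)

    # Tag names and attribute names are metadata; text and attribute values are data.
    print("metadata:", root.tag, list(root.attrib.keys()), [child.tag for child in root])
    print("data:    ", list(root.attrib.values()), [child.text for child in root])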
An MP3 file consists of a sequence of frames where each frame contains a header and data representing the sound. The header is metadata, containing useful information like the bit rate, frequency and whether the data is stereo. An ID3 tag is a frame with a header that indicates that the data is not an MP3 sound frame. The ID3 tag contains information about the artist, recording and album. There are several different versions of ID3 tags that are not upwardly compatible with one another.
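For the curious, here is a rough sketch of pulling that system metadata out of a single MPEG-1 Layer III frame header. The bit layout and lookup tables are from my own reading of the format, so treat this as an illustration rather than a reference implementation:

    # Decode a 4-byte MPEG-1 Layer III frame header (a sketch; no error handling).
    BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None]
    SAMPLE_RATES_HZ = [44100, 48000, 32000, None]
    CHANNEL_MODES = ["stereo", "joint stereo", "dual channel", "mono"]

    def parse_frame_header(header: bytes):
        value = int.from_bytes(header, "big")
        assert value >> 21 == 0x7FF, "not at a frame sync"
        return {
            "bitrate_kbps": BITRATES_KBPS[(value >> 12) & 0xF],
            "sample_rate_hz": SAMPLE_RATES_HZ[(value >> 10) & 0x3],
            "channel_mode": CHANNEL_MODES[(value >> 6) & 0x3],
        }

    # 0xFFFB9064: a typical 128 kbps, 44.1 kHz, joint stereo frame header.
    print(parse_frame_header(bytes([0xFF, 0xFB, 0x90, 0x64])))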
Sunday, October 08, 2006
Glassbox
Being a Business Intelligence type, if I were given the job of devising a tool for analyzing Java applications, I would build a component to collect performance data and then present the user with a bunch of fancy reports, charts and data cubes where they could drill down to work out what their problem is. Glassbox takes a different approach as we heard at the SDForum Java SIG last Tuesday.
Glassbox collects the data from Java apps running in an application server, analyzes it for a set of common problems and produces a report that tells you exactly what the problem is. No complicated analysis, no slicing and dicing, just a simple report on the facts. Of course it can only tell you about problems that it has been programmed to analyze, however the common problems are well known: things like too many database queries to answer a web request, slow database response, or slow, complicated Java that seems to call the same function too many times. It is the kind of solution that gets 90% of the problems with 10% of the effort of a full-blown performance analysis tool. Another advantage of this approach is that the tool can be used by anyone without a lot of training or time spent getting experienced in its use.
While Glassbox has not wasted time building fancy displays, they have taken the trouble to collect their data in a straightforward and unobtrusive way. As we were shown, you just add a .war file to your application server's directory, update the application server's configuration, restart it and you are on your way. Supposedly data collection only adds 1% or so to program execution times.
All in all, Glassbox looks like a good place to start identifying problems with web apps. As it is Open Source, and easy to use, the cost of trying it out is low.
Wednesday, October 04, 2006
Definitive Source
When George Smoot was called at 3 am by someone with a Swedish accent telling him that he had won a Nobel Prize for Physics, he thought this might be a prank. So what did he do? He went to the Nobel Prize web site to check whether it was true. We have come to the point where a Nobel laureate trusts the World Wide Web as the definitive source of information rather than a phone call.
Thursday, September 28, 2006
Dashboard Design
We had a lot of fun at the SDForum Business Intelligence SIG September meeting where Stephen Few spoke on "Why Most Dashboards Don't Work". Here we are talking about Information Dashboards that let an executive pilot their enterprise to new levels of performance. Stephen is an expert on the visual presentation of information; he has just published a book on dashboard design and he has spoken previously to the BI SIG.
The fun came from looking at examples of dashboards that had been culled from the web and picking holes in what they presented. In practice it was surprisingly easy for audience members to find problems in the dashboards shown. From these examples and some critical thinking, Stephen pulled out a list of 13 things to avoid in dashboard design and a shorter list of things to do to get a dashboard design right.
However the thing that I found most compelling about the presentation came right at the end. Stephen had judged a dashboard design competition for DMReview and he showed us some of the entries. Then he showed us a dashboard that he would have entered had he not been a judge. Of all the dashboards presented, this was the one that showed us a great deal of information in a small space and discreetly guided us to the information that most required our attention.
If you want to judge, you can download a version of the presentation from the Business Intelligence SIG's Yahoo web site. We had to cut the size of the file down to make it fit. You can also buy Stephen's book on dashboard design. I highly recommend it.
Monday, September 11, 2006
The Latest HP Mess
There is a lot of talk in the valley about the latest HP boardroom brouhaha. It seems like not a year goes by without some new HP management upset. These upsets seem all the worse for the high regard in which the company was held. If you are upset by what seems to have become of such a great company, let me set the record straight.
Firstly, remember that the great company founded by Bill Hewlett and Dave Packard is now called Agilent. While Agilent seems to have lost some of the Hewlett-Packard way, it is not as bad as what has happened to HP, the fat child spun out of the original company several years ago. HP, the computer company, started out life as a couple of divisions out of 20 that lost their way.
Part of the Hewlett-Packard way is that divisions grow organically and then split when they reach a certain size. This way no division dominates, they operate as a set of peers. The Computer and Printer divisions eschewed this tradition by just growing until they were big enough to swallow the rest of the company. Worse, the Computer division gave up on organic growth and for much of the last 20 years has been growing by acquisition.
All these acquisitions, particularly the large ones, have diluted the blood to the point where we can no longer see a trace of the founding principals (pun intended). So do not feel sorry for HP; it is not the company you thought it was, it is just another big dinosaur well on its way to extinction.
Thursday, August 31, 2006
Free as in Peer
Lawrence Lessig writes an interesting column in Wired magazine. His latest entry is titled "Free as in Beer". The column starts off talking about free, or more accurately Open Source, beer. It is a good read; however, towards the end there is the following discontinuous comment:
Although peer production is profitable for business, writes Benkler, "we are in the midst of a quite basic transformation in how we perceive the world around us and how we act, alone and in concert with others." What he calls nonmarket peer production is a critical part of this transformation. (sic)
Beer peer? We all know that to appreciate beer you need to open the source, and that after appreciating beer you become a pee'r, but it is not the kind of thing that needs to be talked about.
Thursday, August 17, 2006
The BIRT Strategy
Software is fascinating stuff. Compared to any other engineered product it is completely ephemeral, yet at the same time it is becoming the thing that makes almost every engineered product work. Software also has a meme-like quality where certain software systems become the standard that everyone gravitates to use, however good or bad they eventually turn out to be. It seems that the trick to creating successful software is to make it really, really successful.
I got to thinking about this after listening to Paul Clenahan, VP of Product Management at Actuate Corporation and member of the Eclipse BIRT Project Management Committee, talk on "Eclipse BIRT: The Open Source Reporting Framework" at the SDForum Business Intelligence SIG. BIRT is a component of the Open Source Eclipse project that provides a Business Intelligence Reporting Tool (hence BIRT).
BIRT consists of an Eclipse plug-in that allows you to design sophisticated reports, a standards-based XML definition of the report and delivery mechanisms that allow you to deliver reports as either HTML or PDF documents. As Paul mentioned several times, it is also very extensible, so if it does not have the capabilities that you need, you can easily add them. BIRT is Open Source software that is available under the relatively unencumbered Eclipse public license, which allows commercial exploitation of the code.
From the presentation and demo, BIRT seems to be a well designed, easy to use and fully capable reporting system that is free. In fact, as the presentation wore on, the one question in my mind was why Actuate has devoted 8 developers to developing this wonderful new Open Source reporting system. What is in it for Actuate? I think that it has to do with broadening the marketplace.
While reporting tools are widely used, many more developers roll their own reports rather than use a reporting tool. Paul mentioned in his presentation that he asked a large group of developers at a conference whether they used reporting tools and the vast majority did not. Providing an easy to use Open Source tool that fits into the popular Eclipse development environment brings developers into the reporting tool fold.
Reporting tools are not rocket science. Low cost reporting tools have been around for a long time. While Actuate has excellent reporting tools, their core differentiating competence is a scalable platform for delivering reports, something that other reporting tools do not have. So broadening the market for reporting tools also widens the market for report delivery tools. If they are successful and BIRT catches on in that meme-like way, Actuate will have a much larger market to sell their products into. Open Source and an Eclipse plug-in dramatically lowers the barriers to using these tools.
Wednesday, August 16, 2006
Blogger Upgrade
Blogger, home of this blog, is going to get an upgrade. I have used Blogger for the last couple of years as it suits my mostly-text blogging style. So far my only complaint has been about Google's segregated indexing (and a spell checker that seems to work against both Blogger and Google). Many others have been less patient.
The bad news is that the new Blogger will be integrated with Google Accounts. Recently, I wrote about how I had been forced to give up my identity to Yahoo. Now, it looks like Larry and Sergey are going to get a bit of my identity as well. Barry Diller has a piece of me and Rupert has all my kids nailed down in their own little spaces. Whatever happened to freedom?
Tuesday, August 08, 2006
A Short Post
Everyone including the media is talking about the idea of the long tail. To me it seems like last year's idea. On the other hand, this is the silly season, so maybe that is all they have to write about.
Saturday, July 29, 2006
Moore's Law Logic
We all know Moore's law, but very few seem to understand the inevitable logic that it implies. Moore's law states that the number of transistors on a silicon chip doubles every 12 to 18 months. In practice chips are all pretty much the same cost (within an order of magnitude or so), thus we get double the capability every 18 months. The long term consequence of this is that everything becomes digital and every digital device is eventually a single chip.
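The arithmetic behind that claim is worth spelling out, because doubling every 18 months compounds faster than intuition suggests:

    # Transistor count relative to today, assuming a doubling every 18 months.
    for years in (3, 6, 10, 20):
        print(years, "years ->", round(2 ** (years / 1.5), 1), "times as many transistors")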
I will not waste your time with a comprehensive history. It is sufficient to highlight a couple of trends. The first trend is to digital media, starting with music in the CD and then MP3, then video and books, and now we are on the verge of digital broadcast TV and radio. A second trend is towards the single chip implementation of all electronic devices, starting with watches and calculators in the 70's, then consumer electronics like CD players, stereos, radios and TVs. Currently we have just achieved the single chip cell-phone. The trend to digital media helps with the movement to single chip implementations because it is much easier to do an all-digital device than one that has both analog and digital circuits.
One device that has so far avoided becoming a single chip implementation is the personal computer. A typical motherboard has about 6 to 8 processing chips, a bunch of memory chips and some driver chips that do nothing more than pass on a strengthened signal from a processing chip. One day all these chips except for the driver chips will coalesce into a single chip, because there will be nothing else better to do with all the available transistors.
So, last week when AMD announced that they were buying ATI, I knew what it was about. AMD has the single chip personal computer on their long term road map, and they need the display drivers and other peripherals that ATI has to complete their vision. AMD has already moved the memory controller onto the processor chip. Next I expect them to announce a low end chip with all the rest of the peripherals integrated. Over time the single chip processor implementation will move up to the mid range and high end. Sometime thereafter, the single chip computer with integrated memory will become first possible and then inevitable.
Saturday, July 22, 2006
The Yahoo Business Model
In the back of my mind I had always understood the Yahoo! Business Model. Yahoo! gets you to sign up for the compelling online services that they provide and in return they sell advertising targeted at you. This was confirmed at the July meeting of the SDForum Business Intelligence SIG where Madhu Vudali, Director of Pricing & Yield Management at Yahoo!, spoke on "Pricing & Yield Management at Yahoo!".
For years, I instinctively resisted signing up for any online services and deliberately avoided using any services like driving directions that would reveal part of my identity, while still making liberal use of anything that did not reveal anything about me except for perhaps a few stocks that I was anonymously interested in. Recently I have been required to sign up for a couple of Yahoo! services so they have my identity, as they do for the half billion other people who are also signed up. As I said before their services are compelling.
Madhu explained the other side of the coin where Yahoo! sells advertising. This is not the auctioned search advertising that Google has become known for, although Yahoo! also does this. This advertising is the banner ads that you see when you use all those compelling services. The ads can be very specifically targeted such as a movie ad that is shown for a few hours on a Thursday and Friday afternoon to a specific demographic.
As Madhu explained, Yahoo! uses the yield management techniques that were pioneered by the airline industry in the 70's. The airlines use yield management to get the most revenue out of every available airplane seat while keeping the airplane full. Yahoo! has a similar problem but on a much larger scale. They have a huge inventory of page views and several dimensions such as age, location and interests on which to segment the viewers and interest the advertisers.
The problem is to extract the maximum revenue out of this mix. Compared to aircraft yield management the inventory is much larger and more squishy and the number of potential products numbers in the multi-millions rather than in the tens of thousands that an airline has.
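To make the optimization problem concrete, here is a deliberately tiny sketch in Java, assuming a single audience segment and a greedy highest-CPM-first rule; the campaign names and numbers are invented, and a real yield management system optimizes across millions of overlapping segments under contractual guarantees.

    import java.util.ArrayList;
    import java.util.List;

    // A grossly simplified, hypothetical sketch of the yield management idea:
    // with a limited inventory of page views in one audience segment, serve the
    // campaigns that pay the most per thousand impressions (CPM) first.
    public class GreedyYield {
        record Campaign(String name, double cpm, int impressionsWanted) {}

        public static void main(String[] args) {
            int inventory = 1_000_000; // page views available in this segment
            List<Campaign> campaigns = new ArrayList<>(List.of(
                new Campaign("Movie opening, Thu/Fri afternoon", 12.0, 400_000),
                new Campaign("Run-of-network filler", 1.5, 2_000_000),
                new Campaign("Auto launch, 18-34", 8.0, 700_000)));
            // Highest CPM first.
            campaigns.sort((a, b) -> Double.compare(b.cpm(), a.cpm()));

            double revenue = 0;
            for (Campaign c : campaigns) {
                int served = Math.min(inventory, c.impressionsWanted());
                revenue += served / 1000.0 * c.cpm();
                inventory -= served;
            }
            System.out.printf("Revenue: $%.2f, unsold impressions: %d%n",
                    revenue, inventory);
        }
    }

Even this toy version shows the tension Madhu described: every impression handed to the filler campaign is revenue the movie advertiser would have paid eight times as much for.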
All in all, it was a very interesting presentation. Unfortunately for anyone who was not there, Madhu's presentation is not available, so this is the best that you are going to do. Sign up for the Business Intelligence SIG mailing list (sdforum_bisig-subscribe@yahoogroups.com) and do not miss another meeting.
Wednesday, May 31, 2006
Divergence
The conventional wisdom (always suspect) is that portable devices will converge on a single platform. In the future we will each carry a single device that is a phone, clock, radio, camera, media player, media recorder and messenger, and provides all the functions of a PDA including calendaring, reminding and note taking. This is called convergence.
Well, as the song goes, "it ain't necessarily so". I have written about this before, however I got to thinking about it again when I wanted another track on my iPod. I use the iPod Shuffle to listen to podcasts. Sometimes I would like to have a second music track for when I want to concentrate and need background music to drown out distracting sounds. Time to trade up to a more sophisticated iPod? Well, the alternative is to just buy another iPod Shuffle; after all, they are cheap enough, and I can have one Shuffle with podcasts and one Shuffle with a music mix.
The point is, the devices are so cheap that there is no need to have one converged device. In fact you may be better off with several best-of-breed devices rather than one mediocre do-it-all device. Hence divergence. In fact convergence may be the new disintermediation.
Thursday, May 18, 2006
Taming Concurrency
Two items this week. AMD again talked about their next generation of 4 core processors. That is, each chip will have 4 processors. On further research, both AMD and Intel have been talking about 4 core processors for some time. This week I also received the May edition of IEEE Computer, which had as its cover feature "The Problem with Threads".
On the one hand we are moving into a world where applications have to be concurrent to exploit the available hardware. On the other hand there is a growing realization that the current tools are not adequate. All this in the face of hardware that is being designed to make life more difficult for the casual threads user.
The fundamental problem is quite simple. Programming languages offer a set of features to help the programmer and prevent them from making simple programming mistakes. For example, typechecking and sophisticated type extension mechanisms like objects allow the programmer to easily express their ideas in a safe way. Automatic garbage collection in Java and other languages eliminates a huge class of bugs, disastrous crashes and storage management code. On top of this is a set of design patterns that help and guide the programmer.
None of these things exist in the threads environment. A threading library is an add-on to a programming language. It provides no protection to the user, in fact it encourages unsafe programming practices. Also, threading libraries provide very little guidance as to how they should be used. On top of this, there is no rich set of design patterns for parallel and concurrent programming. I had difficulty recognizing the set of design patterns that are available and could not find patterns that I regularly use.
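To illustrate the kind of elementary mistake that nothing in the language or the library will catch, here is a minimal Java sketch of a lost-update race; it is my own example, not one from the paper.

    // Two threads increment a shared counter with no synchronization. The
    // compiler accepts it happily, and increments are silently lost.
    public class LostUpdate {
        static int counter = 0; // shared, unsynchronized state

        public static void main(String[] args) throws InterruptedException {
            Runnable work = () -> {
                for (int i = 0; i < 1_000_000; i++) {
                    counter++; // read-modify-write, not atomic
                }
            };
            Thread a = new Thread(work);
            Thread b = new Thread(work);
            a.start(); b.start();
            a.join(); b.join();
            // Almost always prints far less than 2000000.
            System.out.println("Expected 2000000, got " + counter);
        }
    }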
All this has the makings of a perfect storm. "The Problem with Threads" paper takes the position that we have to integrate threads into existing programming languages because people are not willing to move to a new programming paradigm. I take the opposite view. We will not be able to write safe concurrent programs until we develop, and commit to using, programming languages that prevent us from making elementary concurrency mistakes, just as we have programming languages that prevent us from making elementary mistakes with types and memory allocation.
Tuesday, May 16, 2006
CDI vs MDM
Anyone who missed the SDForum Business Intelligence SIG meeting tonight missed a great meeting. Paul Friedman, CTO and founder of Purisma gave a talk on "CDI vs MDM: A Basic Primer". CDI is Customer Data Integration and MDM is Master Data Management, two of the most potent new terms in the IT world.
The problem is the same one that IT has been battling for the last 20 years: multiple database-driven IT systems that all contain different versions of the same data, and the inevitable problem of reconciling that data. For many reasons, reconciling customer data is more difficult than reconciling other types of data.
Paul presented a spectrum of possible solutions. At one end of the spectrum is a registry hub for Customer Data Integration (CDI). This takes data from operational systems and uses it to create a master registry of customers. As it only reads data from other systems, it is a relatively inexpensive and lightweight solution. On the other hand, the operational systems that source the data do not benefit from it.
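As an illustration only, and certainly not Purisma's algorithm, the following Java sketch captures the registry hub idea: read customer records from several operational systems, build a match key, and keep one registry entry per customer that points back at the source records. The lower-cased name-plus-email matching rule is deliberately naive.

    import java.util.ArrayList;
    import java.util.LinkedHashMap;
    import java.util.List;
    import java.util.Locale;
    import java.util.Map;

    // A toy registry hub: it only reads from the source systems and never
    // writes back, which is what makes the approach lightweight.
    public class CustomerRegistry {
        record SourceRecord(String system, String id, String name, String email) {}

        public static void main(String[] args) {
            List<SourceRecord> sources = List.of(
                new SourceRecord("CRM",     "c-17", "Jane Doe", "jane@example.com"),
                new SourceRecord("Billing", "b-42", "JANE DOE", "jane@example.com"),
                new SourceRecord("Support", "s-08", "John Roe", "john@example.com"));

            Map<String, List<SourceRecord>> registry = new LinkedHashMap<>();
            for (SourceRecord r : sources) {
                String key = (r.name() + "|" + r.email()).toLowerCase(Locale.ROOT);
                registry.computeIfAbsent(key, k -> new ArrayList<>()).add(r);
            }
            registry.forEach((key, recs) ->
                System.out.println(key + " -> " + recs.size() + " source record(s)"));
        }
    }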
At the other end of the spectrum is Centralized Author Control or Master Data Management (MDM). This has a single authority that receives and authorizes all changes to customer data in all operational systems. As you may expect, this is much more difficult and expensive to implement as it involves changing the operational systems.
Paul's and Purisma's approach is to start by building a registry hub and then perhaps slowly move it to becoming more proactive with managing the data in the operational systems. Big bang MDM is as likely as other big bang IT projects to blow up in your face.
Thursday, May 11, 2006
Uncommon SaaS Wisdom
Ken Rudin gave an interesting talk to the newly revived SDForum Software as a Service (SaaS) SIG on "Not-Yet-Common Wisdom in SaaS". I have heard the main points of his talk at previous presentations at the SaaS SIG. The interest was in the little details and real world experience that he brings to the topic.
Ken has been doing SaaS for a long time. He was an early employee at SalesForce.com, running their engineering team. He was on the original board of NetSuite, and created the Siebel CRM OnDemand division at Siebel. Now he has a new startup called LucidEra that is bringing SaaS to Business Intelligence.
Ken spoke for some time on the engineering challenges of SaaS. He explained why Enterprise software suffers from "feature" bloat (something I have complained about before) and SaaS does not. He also discussed the challenge of making software adapt to customer requirements. Enterprise software is usually given some programmatic customizability. With SaaS, allowing the end-user to program the product is a death knell, so the trick with SaaS is to provide the right simple options so that the customer can configure the software themselves by for example clicking check-boxes.
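As a hypothetical sketch of configuration over programmability (the option names are invented and belong to no real product), per-customer behavior comes from a handful of named switches rather than customer-written code:

    import java.util.Map;

    // Every tenant runs exactly the same code; the check-box settings decide
    // which behavior each tenant gets.
    public class TenantOptions {
        private final Map<String, Boolean> options;

        public TenantOptions(Map<String, Boolean> options) {
            this.options = options;
        }

        boolean enabled(String name) {
            return options.getOrDefault(name, false);
        }

        public static void main(String[] args) {
            TenantOptions acme = new TenantOptions(Map.of(
                "requireApprovalOnDiscounts", true,
                "showForecastTab", false));
            if (acme.enabled("requireApprovalOnDiscounts")) {
                System.out.println("Routing discount to manager approval");
            }
        }
    }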
Related to this, Ken talked about the relative efficiency of silicon versus carbon. By carbon he means people. Silicon is cheap and scalable while carbon is neither. While it is an amusing metaphor, it risks being confused with the other carbon problem.
Ken was also illuminating on the business problem of Enterprise software companies like SAP and Siebel that try to provide a SaaS version of their product. He termed this SoSaaS (Same old Software as a Service). Basically the business drivers are such that the two models cannot co-exist as peers. The most common outcome is that the SaaS offering is sidelined as the little brother product and does not thrive.
Tuesday, May 09, 2006
Collaboration SIG Podcasts
I have been listening to podcasts from the SDForum Collaboration SIG, and I have to say that they are very good. It is not like being at the meeting, the visuals and the interactivity are missing, however it keeps the mind alive while cleaning out the pool filter or pounding away on the cross trainer at the gym.
If anything is wrong with these podcasts it is that they show that an iPod Shuffle is not really designed around playing 2 hour tracks. I have not figured out how the Shuffle remembers its stopping point in each track, particularly when I switch it off and on again. What I experience is that it always seems to remember the stopping point before the last one so I find myself having to fast forward through many tens of minutes of the podcast. To say this is unwieldy is an understatement.
The Collaboration SIG goes to great pains to make the podcasts work. Most of the audience questions are audible, and the speakers come across very clearly. The only suggestions that I would make are to add lead-in and outro announcement tracks, and perhaps edit out the longer pauses and the inevitable bit where the speaker grapples with getting the projector to work.
The other issue with these podcasts is finding them. The Collaboration SigBlog wiki (???) has 6 feeds from Atom to RSS 0.91, but I could not find a feed for the podcasts. Then I did a search on the iTunes Music Store for "SDForum" and found it immediately. Subscribe now.
Saturday, April 29, 2006
Language Influences Architecture?
Peter Seibel started to give an interesting talk on "How Implementation Language Influences Architecture" at the recent SDForum Software Architecture and Modeling SIG. Peter started with three theories. The Sapir-Whorf theory is a long standing theory of natural language. It claims that language determines thought. The current common consensus is on the weak form of this thesis. (I will give you my opinion of Sapir-Whorf on another day.)
The second theory is the Blub paradox. This theory starts from the assumption that everybody has a favorite programming language that controls how they think about programming, which is where it loses me. The theory itself is too complicated and tenuous to sum up in an elevator pitch, so it may as well be forgotten.
The final theory is Turing equivalence. This is the idea that all programming languages are capable of expressing any computation and are therefore equivalent. The Turing tarpit is the idea that while all computationally complete languages are in theory equivalent, in practice some programming languages are just so much better and easier to use that the notion of equivalence is moot.
Unfortunately, after this strong start, the talk degenerated into a less satisfactory discussion of design patterns. The descent continued as Peter showed some particular problems with Java and claimed that they are better solved in Common Lisp. My experience with Lisp is confined to maintaining my emacs configuration file, something I have to do far too often. However, I think that the problems Peter highlighted were the usual trade-offs between static and dynamic checking. Java has some restrictions because it tries to detect and eliminate problems at compile time. Lisp has little compile time checking, so problems are only discovered at run time. As it is better to discover problems sooner rather than later, compile time checking beats run time checking any day.
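A toy example of the difference, mine rather than Peter's: the commented-out line below is rejected by javac before the program can ever run, whereas a dynamically checked language would only complain when that line actually executed.

    public class CompileTimeCheck {
        public static void main(String[] args) {
            int length = "hello".length(); // accepted: checked at compile time
            // int bad = "hello" - 1;      // rejected by javac: bad operand types
            System.out.println(length);    // prints 5
        }
    }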
My view is that programming languages are tools. Just as I have four different hammers in my garage for different types of hammering tasks, so I use several different programming languages for different types of programming tasks. The issue is not that the language influences the architecture, rather that the software architecture influences the choice of implementation language.
Saturday, April 22, 2006
The Future of Music Confirmed
Last year I wrote about the future of music in a world where reproduction (copying) of music costs virtually nothing. The old model was that the publisher controlled the means of reproduction and artists made their living by promoting the sales of their recorded product for the publisher. Now that copying is free, the publisher has lost control. Artists are moving to a new model where they give their music away to promote themselves and make their living by selling tickets to live performances.
Now a Princeton University economist has come out saying the same thing. Of course, being an economist, he has numbers to back up his argument. But remember you read it here first.
Tuesday, April 18, 2006
RSS Reconstructed
The other day I posted some sarcastic comments on RSS. So to put the record straight, here is the real dope. RSS creates the appearance of push on the internet. Push is the idea that new things are pushed out to you automatically. Well not everything is pushed out to you. You subscribe to a "feed" and whenever anything new comes along on that feed, you automatically get it.
For example, I subscribe to a number of podcasts in iTunes. Every time I start iTunes, it goes out and looks for new versions of these podcasts and downloads them. Then I have to fiddle faddle around to get only the latest podcasts, the ones that I have not yet heard, loaded onto my iPod. Most RSS readers are less intrusive than iTunes and appear to be more automatic. Then again, most RSS readers only deal with web pages, which are quick to download compared to a podcast.
The reason you 'need' an RSS reader is that the internet is pull, not push. Your browser, RSS reader or whatever goes out and pulls back pages for you to read. An RSS reader works by periodically reading a feed page to see if anything has changed. Thus it creates the appearance of push on top of pull technology. At the other end, the most important thing about an RSS feed page is the convention about how it is updated. Unfortunately you can search far and wide and not find anything useful about this convention on how to run a feed.
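Here is a minimal Java sketch of that appearance of push built on pull; the feed URL is a made-up example, and a real reader would parse the items and be politer about polling intervals and conditional GETs.

    import java.io.BufferedReader;
    import java.io.InputStreamReader;
    import java.net.URL;

    // Fetch the feed page on a timer and compare it with the copy fetched
    // last time; that is all the "push" there is.
    public class FeedPoller {
        public static void main(String[] args) throws Exception {
            URL feed = new URL("http://example.com/feed.xml"); // hypothetical feed
            String previous = "";
            while (true) {
                StringBuilder current = new StringBuilder();
                try (BufferedReader in = new BufferedReader(
                        new InputStreamReader(feed.openStream()))) {
                    String line;
                    while ((line = in.readLine()) != null) {
                        current.append(line).append('\n');
                    }
                }
                if (!current.toString().equals(previous)) {
                    System.out.println("Feed changed, re-render the items");
                    previous = current.toString();
                }
                Thread.sleep(30 * 60 * 1000L); // poll again in 30 minutes
            }
        }
    }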
Thursday, March 30, 2006
Emacs???@#$%!!!!
I had to smile when I found this. It contains some things that I absolutely agree with, other things that I completely disagree with and some things that are so totally wacky that I laugh out loud. It is an essay on using the emacs text editor. I use emacs every working day to do my job. I do not like emacs, but I dislike the alternative more, so I have had to come to terms with it.
So what is wrong with emacs? My first complaint, one that has me cursing at least once every day, is the modal user interface. Emacs is supposed to be non modal, but it has annoying modalities. The most obvious one, and the function that is most difficult to avoid because it is otherwise so useful, is the search function. Emacs is not alone in this area. The find functions in Microsoft Word and Notepad have even more annoying behavior.
My second complaint about emacs is that it has far and away the worst out-of-the-box experience of any software known to person. The default emacs configuration is unusable, for example backspace brings up help rather than deleting characters as you would expect. Instead of cut and paste there is yank and kill! What yanking and killing have to do with moving text around I do not know and I do not care. Like everyone else who uses emacs, I set up my own key bindings, which I think of as cut and paste, and to this day I do not know which of yank and kill is cut and which is paste.
Of course emacs is highly configurable, but on its own terms. Those terms are the lisp programming language, which is extraordinarily ugly to look at. The simple act of assigning a value to a variable is done by the 'setq' function and multiple parens. A more serious criticism is that functional programming gets its power from a lack of side effects, which should make it easy to produce correct programs. Emacs completely undoes this by having a huge global state: everything is a side effect, and programming it is tedious, difficult and error prone.
There are many other complaints. Emacs keeps changing in what seem to be random and incompatible ways from version to version. Also not all versions work on all systems. I move among different versions of Unix and Linux and I am constantly fiddling with my emacs customization file to keep it working.
Another problem is that as the first free software, emacs drove out all competition. A lack of competition means that it can go on being quirky and unusable and still find an audience. In practice most developers I know use vi because they are not willing to put up with the hassle of using emacs. Vi is an editor designed in the 70's around a design center of people using typewriters, so it is designed to correct like a typewriter does. I find emacs too modal, so I am not going to use vi, and I am stuck with cursing emacs.
Monday, March 27, 2006
Hurray
The big story of the moment is that the next version of the Windows operating system has been delayed by another 3 to 6 months. The news has been greeted with much wailing, gnashing of teeth and wide distribution of blame. Me, I am glad. In fact I am happy.
I do not want a new version of Windows. The current version is bad enough, and the new one will be even worse. The problem is that the new version will be full of "features", awful awful "features". The current version, Windows XP, is full of "features". These are simple little things that are meant to be helpful or even useful, but end up being bloody awful. I have written in the past about some of the annoying features in XP: the ever changing menus, the cluttered incomprehensible Start menu, the fact that when my son, at the urging of the operating system, cleans up his desktop, all the carefully placed icons on my desktop disappear.
Even after years of using XP, I still uncover annoying "features". For example, I recently installed TurboTax, and happened to notice that every time any member of my family uses the computer, they get a little pop up box that says "New program installed". It's as if Windows is telling my kids to go in and start playing with daddy's tax forms! Maybe they can add a deduction or two. I can just see myself sitting in front of a flinty eyed tax auditor saying "the kids must have done it".
The real problem is that Microsoft has to add all these new "features" to Windows. These "features" are what make people perceive that the next version is newer and better than the current version. In practice, all they are going to do is make it different and most likely worse.
The truth is that there are only so many things that an operating system has to do. Windows XP does them all, in some cases well and in other cases badly. Microsoft would serve us much better by fixing the problems rather than trying to give us something new and different. In the mean time I am going to hope that the problems with Windows Vista turn out to be even more intractable than they currently appear and that we do not get the new version for a very long time.
Wednesday, March 22, 2006
Spreadsheets Rule
On the one hand we keep hearing that the most used Business Intelligence application in the world today is Microsoft Excel. On the other hand most Business Intelligence experts and vendors put down spreadsheets as the enemy of good Business Intelligence. Spreadsheets are silos of information that contain wrong data, unanticipated data, contradictory data and broken formulas.
At the March meeting of the SDForum Business Intelligence SIG we heard a different story. Craig Thomas, CTO of Steelwedge Software, spoke on "Why Plans are Always Wrong". Steelwedge software does Enterprise Planning and Performance Management. The gist of their software is that they build a data warehouse for enterprise planning and then deliver the data to the planners in the form that they are most familiar with: Excel spreadsheets.
Steelwedge keeps a tight control on the planning process. Spreadsheets come from a template repository. A spreadsheet is populated with data from the data warehouse, and then checked out and delivered to a planner. Update is disabled on many fields so that the planner can only change plans in controlled ways. After the plan has been updated, it is checked back in and the changed fields integrated into the data warehouse. Workflow keeps the planning process on track.
Finally when the plan has been executed, the plan and execution can be compared to see how good the planning process is and where it needs to be improved. In fact, many Steelwedge customers have implemented Steelwedge because they felt that their planning process was out of control.
Join the Business Intelligence SIG Yahoo group, and you will be able to download the presentation from the "Files" area.
Sunday, March 19, 2006
Deconstructing RSS
If RSS stands for "Really Simple Syndication", how come the explanations of it on the web are so complicated and so useless? There are people out there pulling their hair out trying to understand RSS, and they don't get it because they don't get how really simple it is.
An RSS 'feed' is nothing more than a web page. The only difference between RSS 'feeds' and other web pages is that web pages are coded in HTML and RSS 'feeds' are coded in another dialect of XML. The content of an RSS 'feed' page is a set of items where each item contains text and links just like any other web page.
A problem is that the promoters of RSS engage in obfuscation, trying to make out that RSS is something more than it really is. They talk about the feed as if it were a push technology. The Wikipedia page on RSS even has a link to push technology. However, like everything else on the web, RSS 'feeds' are fetched by a 'feed' aggregator using HTTP GET. Thus RSS 'feeds' are in reality pull technology, just like any other type of web browsing.
This brings us to the feed aggregator. The normal way to render XML is to provide a style sheet that turns it into XHTML. Unfortunately, there seem to be problems with RSS which prevent this, so you need to have this special thing called an aggregator before you can read a 'feed'. From what I can tell, the first problem is that there are many different dialects of RSS, which vary just enough that a single style sheet does not work nicely for them all. A second problem is that one RSS dialect wraps the 'feed' in a Resource Description Framework (RDF) tag. The RDF tag indicates that this is metadata, and in general metadata is not rendered, so you need the aggregator to strip off the RDF tags before its content can be rendered.
Another thing that an aggregator can do is display several RSS feeds on the same page. When you leave the Bizarro world of RSS pull 'feeds', this feature is known as a portal and each display on the page is known as a portlet. Portals are usually done on the server side, while aggregators more often work client side, but apart from this small difference they are pretty much the same thing.
The final thing that an aggregator can do is keep track of which items you have seen in a 'feed' and only show you new and updated items. Exactly how this should work is not stated, and when someone asks, the question is ignored. Doing it properly would require a discipline in generating the feed that is not required by the spec and thus cannot be relied on by the aggregator. In practice there is a convention (unstated as far as I can tell) that items are ordered in the feed page from most recent first. The aggregator renders the page normally and you see as many of the recent items as your portlet for the feed has room to show.
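As a sketch of how an aggregator can lean on that unstated newest-first convention, it only needs to remember the item links it has already shown; the Item class and the items list here are hypothetical stand-ins for the output of a real feed parser.

    import java.util.LinkedHashSet;
    import java.util.List;
    import java.util.Set;

    // Show an item only the first time its link is seen; items are assumed to
    // arrive newest first, per the unstated convention.
    public class SeenTracker {
        private final Set<String> seenLinks = new LinkedHashSet<>();

        void render(List<Item> items) {
            for (Item item : items) {
                if (seenLinks.add(item.link)) { // add() returns false if already seen
                    System.out.println("NEW: " + item.title + " -> " + item.link);
                }
            }
        }

        static class Item {
            final String title;
            final String link;
            Item(String title, String link) {
                this.title = title;
                this.link = link;
            }
        }
    }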
Summarizing, an RSS 'feed' is a web page in a slightly different dialect of XML and an aggregator is a client side portal. Is there anything that I have left out?
Monday, March 13, 2006
More on UIMA
If you try to look for an explanation of UIMA, you are very likely to run across the following: "UIMA is an open, industrial-strength, scalable and extensible platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components, ..."
They almost had me going there, nodding my head in agreement until I reached the word solution. Whenever I come across the word solution, it is guaranteed to either raise my hackles or to make my mind wander. This time I was in benign mood, so my mind started to wander.
Obviously the solution needs a large vat to contain it, and the unstructured information is probably on soggy pieces of paper floating around in the solution. The industrial-strength platform is made of steel, and stands above the vat so that people can stand up there and stir the solution. Of course the platform is modular so that it can be scaled to meet the needs of the stirrers.
Open? Does that refer to the open grid on the platform or the fact that the vat is open to allow the stirring rods to be put in. The search components are probably people scurrying around looking for more unstructured information to throw in the vat. The only thing that has me scratching my head is the semantic analysis. How can semantic analysis fit into an industrial scene like this?
Got any ideas?
Thursday, March 09, 2006
UIMA
UIMA stands for Unstructured Information Management Architecture as we heard at the SDForum Emerging Technology SIG's March meeting. IBM has just open-sourced a central part of UIMA so that you can download and play with it yourself. So what is UIMA? Well it seems that like so many other things these days, the presenters did not want to be too specific about what UIMA is, because that would constrict our thinking and prevent us from seeing all sorts of wonderful new applications for it. On the other hand you have to have some kind of grasp of what it is or you cannot do anything with it.
Lead presenter Daniel Gruhl gave the following roundabout explanation. In 1998, Tim Berners-Lee introduced the Semantic Web. The idea is that you tag your web pages with metadata in the RDF format and even robots will be able to discover what you really mean. Unfortunately, since then nobody has actually put RDF tags in their web pages, and web page metadata has become somewhat discredited as its principal use is to spam search engines.
So what if you could read pages and tag them with your own metadata? Well that is what UIMA is about. It is a framework where you can take a set of documents and generate your own metadata for each document. The set of documents could be the whole web, a subset of the web, or a set of documents in your own content repository. The documents can be XML, HTML, media files or anything else, as all information is now digital.
The next question is what do we do with this metadata? You cannot go and update other people's web pages, although you could use the metadata to update your own documents and content. In practice, the principal use for the metadata is in building a search index. Although as I write this I can see that there can be plenty of other uses for UIMA in scanning and adding metadata to an existing media or document repository. So maybe the presenters were correct when they said that they did not want to constrain our thinking by being too specific about what UIMA is for.
The final question is why would you want to build your own document analyzer or search engine? Current search engines are very general. If you have specific knowledge about a subject area you can catalog a set of documents much more accurately and usefully than a general purpose search engine. One successful application of UIMA is an annotator that knows petrochemical terms and can create an index of documents useful to a petroleum engineer.
As UIMA is open source, people can build annotators on the work of others. The example shown as a demo was an annotator that discovered business titles. This used an existing annotator that identified people's names and an annotator that identified business names, and would look for business titles between them, so it could easily find the CEO in "Sam Palmisano, CEO of IBM".
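This is not the real UIMA API, just a toy Java sketch of the composition idea: a regex stands in for the person name annotator, a tiny dictionary stands in for the business name annotator, and the title annotation is whatever text lies between the two.

    import java.util.List;
    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // Compose two stand-in annotators to produce a third annotation: the
    // business title between a person name and a company name.
    public class TitleAnnotator {
        private static final List<String> COMPANIES = List.of("IBM", "Yahoo", "Google");

        public static void main(String[] args) {
            String text = "Sam Palmisano, CEO of IBM, spoke yesterday.";
            Matcher person =
                Pattern.compile("[A-Z][a-z]+ [A-Z][a-z]+").matcher(text);
            if (!person.find()) return;
            for (String company : COMPANIES) {
                int at = text.indexOf(company, person.end());
                if (at >= 0) {
                    String title = text.substring(person.end(), at)
                                       .replaceAll("[,.]", " ")
                                       .replaceAll("\\bof\\b", " ")
                                       .trim();
                    System.out.println(person.group() + " | " + title + " | " + company);
                    return;
                }
            }
        }
    }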
Monday, March 06, 2006
Web Site Measurement Hacks
O'Reilly published "Web Site Measurement Hacks" by Eric T Peterson, in August 2005. Here is a review.
Before going into this book in depth, it is worth saying something about the O'Reilly "Hacks" series. The concept is that one person, acting as much as an editor as an author, puts together 100 separate topics or "hacks" on the subject, with the help of a panel of expert contributors. This creates an authoritative book quickly, useful in a rapidly evolving field where the normal book production process takes so long that a book can be out of date before it is published. Also, the book sums up the knowledge of an array of experts, giving the book balance and letting it represent the best practices in its field.
The Hacks books are organized as 100 topics that are copiously cross-referenced and are mostly designed to be read independently. While this means that you can skim and dip in at any point of interest, the books can be tediously repetitive if read linearly. So part of the job of a review of a "Hacks" book is to tell the reader how to read the book.
Eric T. Peterson, author of "Web Site Measurement Hacks: Tips and Tools to Help Optimize Your Online Business", is a senior analyst with JupiterResearch and has also authored "Web Analytics Demystified". He has enlisted a panel of 17 highly qualified experts to cover all aspects of web site measurement and analysis.
So why measure web sites? Well, a 19th century department store magnate said, "Half the money I spend on advertising is wasted; the trouble is I just don't know which half." Today, with the collection and analysis of web site data it is possible to calculate the cost and benefit of a marketing campaign down to the last penny, and that is just one of the measurement activities discussed in the book. Properly used web site measurements can help you optimize every aspect of your web site.
The book is divided into 7 chapters. The first chapter introduces the basic concepts. Unfortunately, these basic concepts topics are intermingled with other topics on such diverse subjects as how to set up a web analytics team, selecting measurement software and when to use packet sniffing. Everyone needs to read the introductory hacks 1, 2, 5, 7 and 9.
Chapter 2 along with some of the later hacks in Chapter 1 goes into the details of implementing a web site measurement system. Most readers should come back to this chapter after they have read the later chapters and decided what they want their web site measurement system to do. There is a lot of good material in this chapter, however it is gritty implementation detail. The approach to measurement espoused by the book is to decide on the business issues that you want the web measurement to address and then figure out how to do it. The reader should stick to the program.
The third chapter covers online marketing measurement. Everyone should read hacks 37 through 39, which cover general visitor measurement terminology leading up to the all important concept of conversion. The rest of the chapter is divided into topics on measuring specific marketing activities such as banner campaigns, email campaigns and paid search campaigns that are of interest to sales and marketing types. The big picture takeaway from these topics is that it is possible to measure the effectiveness of these campaigns in excruciating detail.
Chapter 4 is about measuring web site usability. It should be read by the kind of web designer who is actually interested in making their web site user friendly, and the marketing type who is interested in increasing their conversion rate by improving the site design. Chapter 5 discusses technographics and demographics. Technographics is the measurement of the technical capabilities of users, in answer to questions like what browsers should I test my web site with and do my pages download fast enough? Demographics is the realm of marketing.
Chapter 6 goes through measurement for online retail in greater depth, covering some more sophisticated topics like cross-sell and estimating the lifetime value of a customer. This is a deeper discussion of online marketing in retail, and it leads on to the final chapter on Reporting Strategies and Key Performance Indicators. This chapter is the realm of the business analyst. It starts off with some sage advice on how to distribute measurement results to the rest of the organization. The next few topics explain Key Performance Indicators (KPIs), and the final topics list best practice KPIs for different types of web sites.
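For a feel of what a lifetime value estimate looks like, here is a back-of-the-envelope sketch. The inputs are invented and the model is deliberately crude, assuming constant order value, frequency and retention; the book's own treatment is more nuanced.

```typescript
// A deliberately crude lifetime value estimate: assumes constant order value,
// purchase frequency, margin and retention. Inputs are invented for illustration.
function estimateLifetimeValue(
  averageOrderValue: number,    // dollars per order
  ordersPerYear: number,        // purchase frequency
  grossMargin: number,          // fraction of revenue kept, e.g. 0.3
  expectedYearsRetained: number // how long the customer keeps buying
): number {
  return averageOrderValue * ordersPerYear * grossMargin * expectedYearsRetained;
}

// A hypothetical segment: $80 orders, 3 orders a year, 30% margin, retained for 4 years.
console.log(estimateLifetimeValue(80, 3, 0.3, 4)); // 288
```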
Overall this is a comprehensive collection of good material on web site measurement. It contains quite enough material to satisfy a non-technical reader, as well as a full JavaScript and Perl implementation of a web measurement system that you can implement yourself.
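To give a flavor of what such a system does, here is a stripped-down page tag sketch in TypeScript. This is not the book's code; the collection endpoint and parameter names are made up, and a real implementation also needs the server side that the book supplies in Perl.

```typescript
// A stripped-down page tag: report the current page view to a collection server
// by requesting a 1x1 image. The endpoint URL and parameter names are invented;
// a real system (like the one in the book) also needs server-side log processing.
function trackPageView(collectorUrl: string): void {
  const params = new URLSearchParams({
    page: window.location.pathname,
    referrer: document.referrer,
    resolution: `${screen.width}x${screen.height}`,
    t: Date.now().toString(), // cache-buster so every view hits the server
  });
  const beacon = new Image(1, 1);
  beacon.src = `${collectorUrl}?${params.toString()}`;
}

trackPageView("https://example.com/collect.gif");
```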
I do have a few criticisms. Firstly, several of the screen shot figures are too small to read. Secondly, I cringed at a few places where the text confuses revenue with income. Finally, I was disappointed with the hack on split path testing. This is a valuable technique for objectively measuring web site design, however it is not easy. The subject is big enough that it really needs a chapter of its own, however all we get is one hack that starts with a large chunk of VBScript, followed by a lame explanation. For all aspects of web site measurement apart from split path testing, the book is highly recommended.
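For what it is worth, the core idea of split path testing fits in a few lines. Here is a sketch in TypeScript rather than the book's VBScript, with an invented cookie name and page URLs: assign each visitor to a variant once, remember the assignment, and compare the conversion rates of the two paths afterwards.

```typescript
// Sketch of the heart of a split path (A/B) test, in TypeScript rather than the
// book's VBScript. Cookie name and page URLs are invented for illustration.
function assignVariant(): "A" | "B" {
  // Reuse an earlier assignment if this visitor already has one.
  const match = document.cookie.match(/(?:^|; )splitTest=(A|B)/);
  if (match) {
    return match[1] as "A" | "B";
  }
  // Otherwise assign at random, 50/50, and remember it for 30 days.
  const variant: "A" | "B" = Math.random() < 0.5 ? "A" : "B";
  document.cookie = `splitTest=${variant}; path=/; max-age=${60 * 60 * 24 * 30}`;
  return variant;
}

// Send the B group down the alternative path; conversions for each variant are
// then measured separately and compared.
if (assignVariant() === "B" && window.location.pathname === "/checkout") {
  window.location.replace("/checkout-b");
}
```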
Thursday, February 02, 2006
The Coming DRM Debacle
This week's Engadget podcast (a good way to keep up with gadget tech if you have a spare hour) reminded me of a subject that I have mentioned before, the coming DRM debacle. Windows Vista is supposed to ship before the end of the year. The question is, will it be so wrapped up in DRM security that it will be unusable?
Windows has succeeded in the past because it has been an open platform that has accommodated a myriad of components and software. According to the Engadget crew, you will not be able to use a Cable Card 2 (the one you want) with a PC unless the whole system, hardware and software, has been certified. This means that you will not be able to build your own system, and you will not be able to upgrade your certified system if you want to use a cable card.
So what use is a Media PC if it cannot be upgraded and connected to a cable system or a Blu-Ray media player? I am fearful to suggest it, but it sounds like a Media PC may be more useful to the hackers in Russia than it will be to its owner. The worst thing, as the Engadget guys say, is that Microsoft is spinelessly falling in with the media interests rather than showing any sign of standing up to them. Seems like we need them to think different.
Saturday, January 28, 2006
Mobile Device Convergence
Watching the development and evolution of portable digital devices is the most interesting tech story at the moment. In theory, as I have said before, all media is now digital, so we could have one portable device that serves as media player, portable game console, media capture device, and two-way communicator of voice, text and anything else digital.
It is obvious that all the players are working towards this convergence from their own angle, and the phone people have pushed it the furthest. Nowadays it is difficult to buy a phone that does not also have a camera, many phones have simple games, and phones are quickly developing their media player capabilities. So why am I skeptical? For example, I have just bought a cell phone without a camera, and an iPod. I expect to buy a new digital camera before the summer.
One problem is form factor. In reality there are different sizes of portable devices, and something with a usable keyboard or screen may be too big and clumsy to be taken everywhere. For example, I bought the iPod Shuffle to listen to podcasts, mainly at the gym. The Shuffle is the perfect size and weight for listening to audio while working out, but it is too small for almost anything else.
There is also some utility to keeping functions separate. For example, I do not want to bring my phone with me when I work out, so I have a separate iPod for playing media. At other times, I have my phone with me and still play media on my iPod. Another example is that I may have a PDA for work and prefer to have a separate cell phone so that I do not have to bring the PDA everywhere.
However the most important problem is ownership. I do not own my phone, "The Man" owns my phone. In this case the man is the phone company and they are not going to let go. A specific example of this is the experience we had with my son's camera phone, which has a removable media card. We copied pictures he had taken for a school project to the card and used a USB adaptor to load them on to the computer for editing and printing. There we discovered that the pictures were in a proprietary format that our photo editing suite could not handle. The only way to access the pictures is through the phone company's data service and its lame web based picture editor.
I was not in the slightest bit surprised by this. Long ago I had concluded that there is no point in buying media for a phone as I am sure that the media will turn out to be incompatible with the next phone that I will have to get in a couple of years time. I do not like being owned, particularly by the phone company, which is why I will not buy a camera phone or a media player phone, and I will be very leery of using a phone based device for business purposes. End of convergence.
Saturday, January 21, 2006
HDTV - Not
While there is plenty of chatter about HDTV, there is remarkably little action. Pundits say the problem is consumers who have not upgraded to a HDTV set. However there is another more important problem, and that is that there is absolutely no reason to go out and buy a HDTV set because there is no content.
Four years ago, we bought a 16x9 TV set expecting the HDTV revolution to arrive real soon now. So, for four years we watched even the slimmest TV starlet come across as unnaturally broad. Recently, we upgraded the cable box to HDTV with a DVR. I can report that the DVR is a great hit with my family and that the HDTV component is not used.
The first problem is that there is very little HDTV content in the first place. We get 11 HDTV channels, and most of them only broadcast HD content for part of the day, or only broadcast for part of the day. Today, one HDTV channel from a local station came up with black bars all around. It was a HDTV show that was being broadcast as a normal TV show with black bars top and bottom, and then it was sent out on a HDTV channel with black bars on either side.
However, the serious problem is that the HDTV channels are in the obscure 7-mumble-mumble range on the cable system. So I frequently find my family watching a show on the regular channel when it is also available in HDTV. Either they do not know, or they have not looked to see if it is available in HDTV. I cannot get my family to change their channel selection habits, and in truth it is inconvenient to go and look for a show that may not be there in an obscure part of the "dial".
There are more problems. For example, I want to upgrade our second TV to one of these new LCD models, however I am not going to pay the rapacious cable company for a second cable box. (Don't get me started, I can rant about the horror, inconvenience and annoyance of cable boxes for hours.) So I am going to hold out for the elusive cable card, whenever that arrives.
The end result is that HDTV is just not happening and everyone is waiting around for someone else to put the pieces together.
Saturday, January 14, 2006
More on Microformats
Don't let the tone of my last post fool you, the Emerging Tech SIG meeting on Microformats was not a waste of time. My problem is that I know what Microformats are, or at least I know what I want them to be, and I am frustrated that they are not being presented in a way that is clear and comprehensible to everybody. Microformats are a good thing, and a good clear story will help their broad adoption more than anything else.
Apart from that I took away a couple of ideas of note. Because a Microformat is both human and machine readable, there is only one copy of the information. As a good database person, I know that duplicated data is dangerous. Previous attempts to achieve the same goals as Microformats had the information in a human readable format and the same information repeated in metadata on the same page. This immediately leads to a data quality problem, as the machine readable form cannot be easily proofread and quickly diverges from the human readable copy as the page is edited.
In this context, the acronym DRY (Don't Repeat Yourself) was used. I keep hearing this acronym, particularly at the Emerging Tech SIG. Perhaps it is the battle cry of the Noughties.
Tuesday, January 10, 2006
Microformats
Tonight's Emerging Technology SIG meeting on microformats was a mixed bag. On the one hand there were a lot of clever people in the room, in the audience as well as on the panel, who knew a lot about microformats. During the discussion there was some interesting fencing between certain audience members and the panel as they maneuvered to capture the high ground.
On the other hand most of the talks went over the head of the general audience who came along to find out what microformats are. Fortunately there was a person in the front row who, after the initial talk had baffled most of us, was old and wise enough to ask the question "What are microformats and can you give us three simple examples of how they are used?"
Part of my problem is that I went into the meeting with some concept of what I want microformats to be. I want little pieces of embedded HTML/XML that can go in web pages, emails and so on, and that both render as normal text and contain structured data that can be interpreted by software.
For example, if I am looking at a web page or an email that contains a meeting announcement in a microformat, I would like both to read the meeting announcement and to right click it and be given a context menu that contains "Add to Calendar ..." amongst other actions. Selecting "Add to Calendar ..." would bring up the calendar application, which could then add the event without further intervention.
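To make that concrete, here is a little sketch of what I have in mind: an hCalendar-style event that reads as plain text, and a few lines of TypeScript that pull the structured data back out of the same markup. The event details are invented and the class names simply follow the hCalendar convention; the right-click menu itself would be the browser or email client's job.

```typescript
// An hCalendar-style event that renders as ordinary text, plus a few lines that
// recover the structured data from the same markup. Event details are invented;
// the class names follow the published hCalendar convention.
const announcement = `
  <div class="vevent">
    <span class="summary">Emerging Tech SIG: Microformats</span> on
    <abbr class="dtstart" title="2006-01-10T19:00">January 10th, 7pm</abbr> at
    <span class="location">Palo Alto, CA</span>
  </div>`;

const doc = new DOMParser().parseFromString(announcement, "text/html");
const event = {
  summary: doc.querySelector(".vevent .summary")?.textContent ?? "",
  start: doc.querySelector(".vevent .dtstart")?.getAttribute("title") ?? "",
  location: doc.querySelector(".vevent .location")?.textContent ?? "",
};
console.log(event); // the same information a person reads, now usable by a calendar application
```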
To make this happen the browser or email client would have to know that I was right clicking a microformat, and know a list of applications that would be able to deal with that microformat. For example, I may want to add the calendar entry to my calendar or to my blog. Also, the application receiving the microformat needs to know how to deal with it.
From the meeting, I gather that this is close to what microformats are, although they also seem to be something that is elusively more than this. Unfortunately the microformats.org web site is particularly unwilling to take a position on what they are, preferring a completely abstract definition while at the same time giving concrete examples of particular microformats.