Wednesday, March 22, 2006
On the one hand, we keep hearing that the most used Business Intelligence application in the world today is Microsoft Excel. On the other hand, most Business Intelligence experts and vendors put down spreadsheets as the enemy of good Business Intelligence. Spreadsheets are silos of information that contain wrong data, unanticipated data, contradictory data and broken formulas.
At the March meeting of the SDForum Business Intelligence SIG we heard a different story. Craig Thomas, CTO of Steelwedge Software, spoke on "Why Plans are Always Wrong". Steelwedge Software does Enterprise Planning and Performance Management. The gist of their software is that they build a data warehouse for enterprise planning and then deliver the data to the planners in the form that they are most familiar with: Excel spreadsheets.
Steelwedge keeps tight control of the planning process. Spreadsheets come from a template repository. A spreadsheet is populated with data from the data warehouse, then checked out and delivered to a planner. Update is disabled on many fields so that the planner can only change the plan in controlled ways. After the plan has been updated, it is checked back in and the changed fields are integrated into the data warehouse. Workflow keeps the planning process on track.
Finally, when the plan has been executed, plan and execution can be compared to see how good the planning process is and where it needs to be improved. In fact, many Steelwedge customers have implemented Steelwedge because they felt that their planning process was out of control.
Join the Business Intelligence SIG Yahoo group, and you will be able to download the presentation from the "Files" area.
Sunday, March 19, 2006
Deconstructing RSS
If RSS stands for "Really Simple Syndication", how come the explanations of it on the web are so complicated and so useless? There are people out there pulling their hair out trying to understand RSS, and they don't get it because they don't get how really simple it is.
An RSS 'feed' is nothing more than a web page. The only difference between RSS 'feeds' and other web pages is that web pages are coded in HTML while RSS 'feeds' are coded in another dialect of XML. The content of an RSS 'feed' page is a set of items, where each item contains text and links just like any other web page.
A problem is that the promoters of RSS engage in obfuscation, trying to make out that RSS is something more than it really is. They talk about the feed as if it were a push technology. The Wikipedia page on RSS even has a link to push technology. However, like everything else on the web, RSS 'feeds' are fetched by a 'feed' aggregator using HTTP GET. Thus RSS 'feeds' are in reality pull technology, just like any other type of web browsing.
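To see how unmagical this is, here is a minimal Python sketch (the feed URL is made up for illustration) of what 'reading a feed' actually amounts to: an ordinary HTTP GET followed by parsing the XML for its items.

```python
# A feed is just XML fetched over HTTP, like any other page.
import urllib.request
import xml.etree.ElementTree as ET

FEED_URL = "http://example.com/blog/rss.xml"  # hypothetical feed address

with urllib.request.urlopen(FEED_URL) as response:  # plain pull, nothing is pushed
    feed_xml = response.read()

root = ET.fromstring(feed_xml)
# In RSS 2.0 the items live under <rss><channel><item>...</item></channel></rss>.
for item in root.findall("./channel/item"):
    title = item.findtext("title", default="(no title)")
    link = item.findtext("link", default="")
    print(title, "->", link)
```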
This brings us to the feed aggregator. The normal way to render XML is to provide a style sheet that turns it into XHTML. Unfortunately, there seem to be problems with RSS which prevent this, so you need to have this special thing called an aggregator before you can read a 'feed'. From what I can tell, the first problem is that there are many different dialects of RSS, which vary just enough that a single style sheet does not work nicely for them all. A second problem is that one RSS dialect wraps the 'feed' in a Resource Description Framework (RDF) tag. The RDF tag indicates that this is metadata, and in general metadata is not rendered, so you need the aggregator to strip off the RDF tags before the content can be rendered.
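Here is a rough sketch of the kind of normalization an aggregator ends up doing, assuming the usual RSS 1.0 and RDF namespace URIs; it is only meant to show why one style sheet does not fit all of the dialects.

```python
# RSS 2.0 puts <item> elements under <channel>, while the RDF-based RSS 1.0
# dialect wraps everything in <rdf:RDF> and namespaces its items, so the
# aggregator has to recognize both shapes and normalize them.
import xml.etree.ElementTree as ET

RSS1_NS = "{http://purl.org/rss/1.0/}"                       # RSS 1.0 element namespace
RDF_TAG = "{http://www.w3.org/1999/02/22-rdf-syntax-ns#}RDF"

def extract_items(feed_xml):
    """Return (title, link) pairs from either an RSS 2.0 or an RSS 1.0 feed."""
    root = ET.fromstring(feed_xml)
    if root.tag == RDF_TAG:
        # RSS 1.0: items are namespaced children of the RDF wrapper.
        return [(i.findtext(RSS1_NS + "title"), i.findtext(RSS1_NS + "link"))
                for i in root.findall(RSS1_NS + "item")]
    # RSS 2.0 (and the older 0.9x dialects): plain <item> under <channel>.
    return [(i.findtext("title"), i.findtext("link"))
            for i in root.findall("./channel/item")]
```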
Another thing that an aggregator can do is display several RSS feeds on the same page. When you leave the Bizarro world of RSS pull 'feeds', this feature is known as a portal and each display on the page is known as a portlet. Portals are usually done on the server side, while aggregators more often work client side, but apart from this small difference they are pretty much the same thing.
The final thing that an aggregator can do is keep track of which items you have seen in a 'feed' and only show you new and updated items. Exactly how this should work is not stated, and when someone asks, the question is ignored. Doing it properly would require a discipline in generating the feed that is not required by the spec and thus cannot be relied on by the aggregator. In practice there is a convention (unstated as far as I can tell) that items are ordered in the feed page most recent first. The aggregator renders the page normally and you see as many of the recent items as your portlet for the feed has room to show.
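A hedged sketch of that bookkeeping, assuming each item carries a guid or at least a link that can serve as its identity, which is exactly the discipline the spec does not enforce:

```python
# Remember which item identities have been shown and yield only new ones.
seen_ids = set()  # a real aggregator would persist this between fetches

def new_items(items):
    """Yield feed items (ElementTree elements) not shown before."""
    for item in items:
        item_id = item.findtext("guid") or item.findtext("link")
        if item_id and item_id not in seen_ids:
            seen_ids.add(item_id)
            yield item
```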
Summarizing, an RSS 'feed' is a web page in a slightly different dialect of XML, and an aggregator is a client-side portal. Is there anything that I have left out?
Monday, March 13, 2006
More on UIMA
If you try to look for an explanation of UIMA, you are very likely to run across the following: "UIMA is an open, industrial-strength, scalable and extensible platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components, ..."
They almost had me going there, nodding my head in agreement until I reached the word solution. Whenever I come across the word solution, it is guaranteed either to raise my hackles or to make my mind wander. This time I was in a benign mood, so my mind started to wander.
Obviously the solution needs a large vat to contain it, and the unstructured information is probably on soggy pieces of paper floating around in the solution. The industrial-strength platform is made of steel, and stands above the vat so that people can stand up there and stir the solution. Of course the platform is modular so that it can be scaled to meet the needs of the stirrers.
Open? Does that refer to the open grid on the platform, or the fact that the vat is open to allow the stirring rods to be put in? The search components are probably people scurrying around looking for more unstructured information to throw in the vat. The only thing that has me scratching my head is the semantic analysis. How can semantic analysis fit into an industrial scene like this?
Got any ideas?
Thursday, March 09, 2006
UIMA
UIMA stands for Unstructured Information Management Architecture, as we heard at the SDForum Emerging Technology SIG's March meeting. IBM has just open-sourced a central part of UIMA so that you can download and play with it yourself. So what is UIMA? Well, it seems that, like so many other things these days, the presenters did not want to be too specific about what UIMA is, because that would constrict our thinking and prevent us from seeing all sorts of wonderful new applications for it. On the other hand, you have to have some kind of grasp of what it is or you cannot do anything with it.
Lead presenter Daniel Gruhl gave the following roundabout explanation. In 1998, Tim Berners-Lee introduced the Semantic Web. The idea is that you tag your web pages with metadata in the RDF format so that even robots will be able to discover what you really mean. Unfortunately, since then nobody has actually put RDF tags in their web pages, and web page metadata has become somewhat discredited as its principal use is to spam search engines.
So what if you could read pages and tag them with your own metadata? Well, that is what UIMA is about. It is a framework where you can take a set of documents and generate your own metadata for each document. The set of documents could be the whole web, a subset of the web, or a set of documents in your own content repository. The documents can be XML, HTML, media files or anything else, as all information is now digital.
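To make the idea concrete, here is a toy Python sketch of "run annotators over documents and collect the metadata they produce". It is not the real UIMA API (which is Java), just an illustration of the shape of the thing.

```python
# Each annotator takes document text and returns (label, start, end) spans;
# the framework applies every annotator to every document.
import re

def year_annotator(text):
    """A trivial annotator: mark anything that looks like a year."""
    return [("year", m.start(), m.end())
            for m in re.finditer(r"\b(?:19|20)\d\d\b", text)]

def analyze(documents, annotators):
    """Return, per document, the metadata produced by all annotators."""
    return {doc_id: [span for annotate in annotators for span in annotate(text)]
            for doc_id, text in documents.items()}

docs = {"doc1": "In 1998, Tim Berners-Lee introduced the Semantic Web."}
print(analyze(docs, [year_annotator]))  # {'doc1': [('year', 3, 7)]}
```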
The next question is what do we do with this metadata? You cannot go and update other people's web pages, although you could use the metadata to update your own documents and content. In practice, the principal use for the metadata is in building a search index, although as I write this I can see that there can be plenty of other uses for UIMA in scanning and adding metadata to an existing media or document repository. So maybe the presenters were correct when they said that they did not want to constrain our thinking by being too specific about what UIMA is for.
The final question is why would you want to build your own document analyzer or search engine? Current search engines are very general. If you have specific knowledge about a subject area, you can catalog a set of documents much more accurately and usefully than a general purpose search engine can. One successful application of UIMA is an annotator that knows petrochemical terms and can create an index of documents useful to a petroleum engineer.
As UIMA is open source, people can build annotators on the work of others. The example shown as a demo was an annotator that discovered business titles. This used an existing annotator that identified people's names and an annotator that identified business names, and it would look for business titles between them, so it could easily find the CEO in "Sam Palmisano, CEO of IBM".
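A toy version of that composition might look like the following sketch; the person-name and company-name annotators are assumed to exist, so only their output spans appear here.

```python
# Given spans for people's names and company names (from existing annotators),
# look in the text between a person and a company for a business title.
import re

TITLES = re.compile(r"\b(CEO|CFO|CTO|President|Chairman)\b")

def title_annotator(text, person_spans, company_spans):
    """Return (title, start, end) for titles found between a person and a company."""
    found = []
    for p_start, p_end in person_spans:
        for c_start, c_end in company_spans:
            if p_end < c_start:                       # person mentioned before company
                match = TITLES.search(text, p_end, c_start)
                if match:
                    found.append((match.group(1), match.start(), match.end()))
    return found

text = "Sam Palmisano, CEO of IBM, spoke at the conference."
print(title_annotator(text, person_spans=[(0, 13)], company_spans=[(22, 25)]))
# [('CEO', 15, 18)]
```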
Monday, March 06, 2006
Web Site Measurement Hacks
O'Reilly published "Web Site Measurement Hacks" by Eric T. Peterson in August 2005. Here is a review.
Before going into this book in depth, it is worth saying something about the O'Reilly "Hacks" series. The concept is that one person, acting as much as editor as author, puts together 100 separate topics or "hacks" on the subject, with the help of a panel of expert contributors. This creates an authoritative book quickly, which is useful in a rapidly evolving field where the normal book production process takes so long that a book can be out of date before it is published. Also, the book sums the knowledge of an array of experts, giving it balance and letting it represent the best practices in its field.
The Hacks books are organized as 100 topics that are copiously cross-referenced and are mostly designed to be read independently. While this means that you can skim and dip in at any point of interest, the books can be tediously repetitive if read linearly. So part of the job of a review of a "Hacks" book is to tell the reader how to read the book.
Eric T. Peterson, author of "Web Site Measurement Hacks: Tips and Tools to Help Optimize Your Online Business", is a senior analyst with JupiterResearch and has also authored "Web Analytics Demystified". He has enlisted a panel of 17 highly qualified experts to cover all aspects of web site measurement and analysis.
So why measure web sites? Well, a 19th century department store magnate said, "Half the money I spend on advertising is wasted; the trouble is I just don't know which half." Today, with the collection and analysis of web site data it is possible to calculate the cost and benefit of a marketing campaign down to the last penny, and that is just one of the measurement activities discussed in the book. Properly used web site measurements can help you optimize every aspect of your web site.
The book is divided into seven chapters. The first chapter introduces the basic concepts. Unfortunately, these basic-concept topics are intermingled with other topics on such diverse subjects as how to set up a web analytics team, selecting measurement software, and when to use packet sniffing. Everyone needs to read the introductory hacks 1, 2, 5, 7 and 9.
Chapter 2, along with some of the later hacks in Chapter 1, goes into the details of implementing a web site measurement system. Most readers should come back to this chapter after they have read the later chapters and decided what they want their web site measurement system to do. There is a lot of good material in this chapter; however, it is gritty implementation detail. The approach to measurement espoused by the book is to decide on the business issues that you want web measurement to address and then figure out how to do it. The reader should stick to that program.
The third chapter covers online marketing measurement. Everyone should read hacks 37 through 39, which cover general visitor measurement terminology leading up to the all-important concept of conversion. The rest of the chapter is divided into topics on measuring specific marketing activities such as banner campaigns, email campaigns and paid search campaigns, which are of interest to sales and marketing types. The big-picture takeaway from these topics is that it is possible to measure the effectiveness of these campaigns in excruciating detail.
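As a flavour of the arithmetic involved, here is a small illustration with made-up numbers for a paid search campaign; the book goes into this kind of calculation in far more detail.

```python
# Conversion rate and cost per conversion for a hypothetical campaign.
clicks = 12_500          # visitors delivered by the campaign
orders = 310             # visitors who converted (placed an order)
spend = 8_750.00         # campaign cost in dollars

conversion_rate = orders / clicks        # about 2.5%
cost_per_conversion = spend / orders     # about $28.23 per order

print(f"Conversion rate: {conversion_rate:.1%}")
print(f"Cost per conversion: ${cost_per_conversion:.2f}")
```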
Chapter 4 is about measuring web site usability. It should be read by the kind of web designer who is actually interested in making their web site user friendly, and by the marketing type who is interested in increasing their conversion rate by improving the site design. Chapter 5 discusses technographics and demographics. Technographics is the measurement of the technical capabilities of users, answering questions like: what browsers should I test my web site with, and do my pages download fast enough? Demographics is the realm of marketing.
Chapter 6 goes through measurement for online retail in greater depth, covering some more sophisticated topics like cross-sell and estimating the lifetime value of a customer. This is a deeper discussion of online marketing in retail, and it leads on to the final chapter on Reporting Strategies and Key Performance Indicators. This chapter is the realm of the business analyst. It starts off with some sage advice on how to distribute measurement results to the rest of the organization. The next few topics explain Key Performance Indicators (KPIs), and the final topics list best-practice KPIs for different types of web sites.
Overall this is a comprehensive collection of good material on web site measurement. It contains quite enough material to satisfy a non-technical reader, as well as a full JavaScript and Perl implementation of a web measurement system that you can implement yourself.
I do have a few criticisms. Firstly, several of the screen-shot figures are too small to read. Secondly, I cringed in a few places that confused revenue with income. Finally, I was disappointed with the hack on split path testing. This is a valuable technique for objectively measuring web site design; however, it is not easy. The subject is big enough that it really needs a chapter of its own, yet all we get is one hack that starts with a large chunk of VBScript, followed by a lame explanation. For all aspects of web site measurement apart from split path testing, the book is highly recommended.
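For what it is worth, the core mechanism behind split path testing is easy enough to sketch. The following is not the book's VBScript, just a hedged illustration of assigning each visitor deterministically to a page variant so that conversion rates for the variants can be compared.

```python
# Hash the visitor id so the same visitor always sees the same path.
import hashlib

VARIANTS = ["original_checkout", "one_page_checkout"]  # hypothetical page designs

def assign_variant(visitor_id):
    digest = hashlib.md5(visitor_id.encode("utf-8")).hexdigest()
    return VARIANTS[int(digest, 16) % len(VARIANTS)]

print(assign_variant("cookie-1234"))  # stable across visits for this visitor
```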
Thursday, February 02, 2006
The Coming DRM Debacle
This week's Engadget podcast (a good way to keep up with gadget tech if you have a spare hour) reminded me of a subject that I have mentioned before: the coming DRM debacle. Windows Vista is supposed to ship before the end of the year. The question is, will it be so wrapped up in DRM security that it will be unusable?
Windows has succeeded in the past because it has been an open platform that has accommodated a myriad of components and software. According to the Engadget crew, you will not be able to use a Cable Card 2 (the one you want) with a PC unless the whole system, hardware and software, has been certified. This means that you will not be able to build your own system, and you will not be able to upgrade your certified system, if you want to use a cable card.
So what use is a Media PC if it cannot be upgraded and connected to a cable system or a Blu-Ray media player? I am fearful to suggest it, but it sounds like a Media PC may be more useful to the hackers in Russia than it will be to its owner. The worst thing, as the Engadget guys say, is that Microsoft is spinelessly falling in with the media interests rather than showing any sign of standing up to them. Seems like we need them to think different.
Saturday, January 28, 2006
Mobile Device Convergence
Watching the development and evolution of portable digital devices is the most interesting tech story at the moment. In theory, as I have said before, all media is now digital, so we could have one portable device that acts as media player, portable game console, media capture device, and two-way communicator of voice, text and anything else digital.
It is obvious that all the players are working towards this convergence from their own angle, and the phone people have pushed it the furthest. Nowadays it is difficult to buy a phone that does not also have a camera, many phones have simple games, and phones are quickly developing their media player capabilities. So why am I skeptical? For example, I have just bought a cell phone without a camera, and an iPod, and I expect to buy a new digital camera before the summer.
One problem is form factor. In reality there are different sizes of portable devices, and something with a usable keyboard or screen may be too big and clumsy to be taken everywhere. For example, I bought the iPod Shuffle to listen to podcasts, mainly at the gym. The Shuffle is the perfect size and weight for listening to audio while working out, but it is too small for most anything else.
There is also some utility to keeping functions separate. For example, I do not want to bring my phone with me when I work out, so I have a separate iPod for playing media. At other times, I have my phone with me and play media on my iPod. Another example is that I may have a PDA for work and prefer to have a separate cell phone so that I do not have to bring the PDA everywhere.
However, the most important problem is ownership. I do not own my phone; "The Man" owns my phone. In this case the man is the phone company, and they are not going to let go. A specific example of this is the experience we had with my son's camera phone with a removable media card. We copied pictures he had taken for a school project to the card and used a USB adaptor to load them onto the computer for editing and printing. There we discovered that the pictures were in a proprietary format that a photo editing suite cannot handle. The only way to access the pictures is through the phone company's data service and its lame web-based picture editor.
I was not in the slightest bit surprised by this. Long ago I had concluded that there is no point in buying media for a phone as I am sure that the media will turn out to be incompatible with the next phone that I will have to get in a couple of years time. I do not like being owned, particularly by the phone company, which is why I will not buy a camera phone or a media player phone, and I will be very leery of using a phone based device for business purposes. End of convergence.
Saturday, January 21, 2006
HDTV - Not
While there is plenty of chatter about HDTV, there is remarkably little action. Pundits say the problem is consumers who have not upgraded to an HDTV set. However, there is another, more important problem: there is absolutely no reason to go out and buy an HDTV set, because there is no content.
Four years ago, we bought a 16x9 TV set expecting the HDTV revolution to arrive real soon now. So, for four years we watched even the slimmest TV starlet come across as unnaturally broad. Recently, we upgraded the cable box to HDTV with a DVR. I can report that the DVR is a great hit with my family and that the HDTV component is not used.
The first problem is that there is very little HDTV content in the first place. We get 11 HDTV channels, and most of them only broadcast HD content for part of the day, or only broadcast for part of the day. Today, one HDTV channel from a local station came up with black bars all around: it was an HDTV show that was being broadcast as a normal TV show with black bars top and bottom, and then sent out on an HDTV channel with black bars on either side.
However, the serious problem is that the HDTV channels are in the obscure 7-mumble-mumble range on the cable system. So I frequently find my family watching a show on the regular channel when it is also available in HDTV. Either they do not know, or they have not looked to see if it is available in HDTV. I cannot get my family to change their channel selection habits, and in truth it is inconvenient to go and look for a show that may not be there in an obscure part of the "dial".
There are more problems. For example, I want to upgrade our second TV to one of these new LCD models, however I am not going to pay the rapacious cable company for a second cable box. (Don't get me started, I can rant about the horror, inconvenience and annoyance of cable boxes for hours.) So I am going to hold out for the elusive cable card, whenever that arrives.
The end result is that HDTV is just not happening and everyone is waiting around for someone else to put the pieces together.
Saturday, January 14, 2006
More on Microformats
Don't let the tone of my last post fool you: the Emerging Tech SIG meeting on Microformats was not a waste of time. My problem is that I know what Microformats are, or at least I know what I want them to be, and I am frustrated that they are not being presented in a way that is clear and comprehensible to everybody. Microformats are a good thing, and a good clear story will help their broad adoption more than anything else.
Apart from that, I took away a couple of ideas of note. Because a Microformat is both human and machine readable, there is only one copy of the information. As a good database person, I know that duplicated data is dangerous. Previous attempts to achieve the same goals as Microformats had the information in a human-readable format and the same information repeated in metadata on the same page. This immediately leads to a data quality problem, as the software-readable form cannot easily be proofread and quickly diverges from the human-readable copy as the page is edited.
In this context, the acronym DRY (Don't Repeat Yourself) was used. I keep hearing this acronym, particularly at the Emerging Tech SIG. Perhaps it is the battlecry of the Noughties.
Tuesday, January 10, 2006
Microformats
Tonight's Emerging Technology SIG meeting on microformats was a mixed bag. On the one hand, there were a lot of clever people in the room, in the audience as well as on the panel, who knew a lot about microformats. During the discussion there was some interesting fencing between certain audience members and the panel as they maneuvered to capture the high ground.
On the other hand, most of the talks went over the head of the general audience who came along to find out what microformats are. Fortunately there was a person in the front row who, after the initial talk had baffled most of us, was old and wise enough to ask the question: "What are microformats, and can you give us three simple examples of how they are used?"
Part of my problem is that I went into the meeting with some concept of what I want microformats to be. I want little pieces of embedded HTML/XML that can go in web pages, emails, etc., and that both render as normal text and at the same time contain structured data that can be interpreted by software.
For example, if I am looking at a web page or an email that contains a meeting announcement in a microformat, I would like both to read the meeting announcement and to be able to right-click it and get a context menu that contains "Add to Calendar ..." amongst other actions. Selecting "Add to Calendar ..." would bring up the calendar application, which could then add the event without further intervention.
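The hCalendar microformat already works roughly this way. The event details in the sketch below are made up, but the class names (vevent, summary, dtstart, location) are the hCalendar ones: the markup reads as ordinary text in a browser, while a few lines of code can pull the structured fields back out, which is enough for an "Add to Calendar ..." action to work from.

```python
# A crude extraction of hCalendar-style fields from a made-up announcement.
# A real aggregator would use an HTML parser rather than a regex.
import re

announcement = """
<div class="vevent">
  <span class="summary">Emerging Technology SIG: Microformats</span> on
  <abbr class="dtstart" title="2006-01-10T19:00">January 10th, 7pm</abbr> at
  <span class="location">Cubberley Community Center, Palo Alto</span>.
</div>
"""

def field(cls, html):
    """Return the title attribute (machine-readable form) or the text of the element with class cls."""
    m = re.search(r'class="%s"(?: title="([^"]*)")?[^>]*>([^<]*)<' % cls, html)
    if not m:
        return None
    return m.group(1) or m.group(2).strip()

event = {name: field(name, announcement) for name in ("summary", "dtstart", "location")}
print(event)
```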
To make that happen, the browser or email client would have to know that I was right-clicking a microformat, and would have to know the list of applications able to deal with that microformat. For example, I may want to add the calendar entry to my calendar or to my blog. Also, the application receiving the microformat needs to know how to deal with it.
From the meeting, I gather that this is close to what microformats are, although they also seem to be something that is elusively more than this. Unfortunately, the microformats.org web site is particularly unwilling to take a position on what they are, preferring a completely abstract definition while at the same time giving concrete examples of particular microformats.