Thursday, March 30, 2006

Emacs???@#$%!!!!

I had to smile when I found this. It contains some things that I absolutely agree with, other things that I completely disagree with and some things that are so totally wacky that I laugh out loud. It is an essay on using the emacs text editor. I use emacs every working day to do my job. I do not like emacs, but I dislike the alternative more, so I have had to come to terms with it.

So what is wrong with emacs? My first complaint, one that has me cursing at least once a day, is the modal user interface. Emacs is supposed to be non-modal, but it has annoying modalities. The most obvious one, and the function that is most difficult to avoid because it is otherwise so useful, is the search function. Emacs is not alone in this area. The find functions in Microsoft Word and Notepad have even more annoying behavior.

My second complaint about emacs is that it has far and away the worst out-of-the-box experience of any software known to person. The default emacs configuration is unusable; for example, backspace brings up help rather than deleting characters as you would expect. Instead of cut and paste there is yank and kill! What yanking and killing have to do with moving text around I do not know and I do not care. Like everyone else who uses emacs, I set up my own key bindings, which I think of as cut and paste, and to this day I do not know which of yank and kill is cut and which is paste.

Of course emacs is highly configurable, but on its own terms. Those terms are the lisp programming language, which is extraordinarily ugly to look at. The simple act of assigning a value to a variable is done with the 'setq' function and multiple parens. A more serious criticism is that functional programming gets its power from a lack of side effects, which should make it easy to produce correct programs. Emacs completely undoes this by having a huge global state; everything is a side effect, and programming it is tedious, difficult and error prone.

There are many other complaints. Emacs keeps changing in what seem to be random and incompatible ways from version to version. Also not all versions work on all systems. I move among different versions of Unix and Linux and I am constantly fiddling with my emacs customization file to keep it working.

Another problem is that, as the first free software, emacs drove out all competition. A lack of competition means that it can go on being quirky and unusable and still find an audience. In practice most developers I know use vi because they are not willing to put up with the hassle of using emacs. Vi is an editor designed in the 1970s around a design center of people using typewriters, so it corrects text the way a typewriter does. I find emacs too modal, so I am not going to use vi, and I am stuck with cursing emacs.

Monday, March 27, 2006

Hurray

The big story of the moment is that the next version of the Windows operating system has been delayed by another 3 to 6 months. The news has been greeted with much wailing, gnashing of teeth and wide distribution of blame. Me, I am glad. In fact I am happy.

I do not want a new version of Windows. The current version is bad enough, and the new one will be even worse. The problem is that the new version will be full of "features", awful awful "features". The current version, Windows XP, is full of "features". These are simple little things that are meant to be helpful or even useful, but end up being bloody awful. I have written in the past about some of the annoying features in XP: the ever changing menus, the cluttered incomprehensible Start menu, the fact that when my son, at the urging of the operating system, cleans up his desktop, all the carefully placed icons on my desktop disappear.

Even after years of using XP, I still uncover annoying "features". For example, I recently installed TurboTax, and happened to notice that every time any member of my family uses the computer, they get a little pop up box that says "New program installed". It's as if Windows is telling my kids to go in and start playing with daddy's tax forms! Maybe they can add a deduction or two. I can just see myself sitting in front of a flinty eyed tax auditor saying "the kids must have done it".

The real problem is that Microsoft has to add all these new "features" to Windows. These "features" are what make people perceive that the next version is newer and better than the current version. In practice, all they are going to do is make it different and most likely worse.

The truth is that there are only so many things that an operating system has to do. Windows XP does them all, in some cases well and in other cases badly. Microsoft would serve us much better by fixing the problems rather than trying to give us something new and different. In the meantime I am going to hope that the problems with Windows Vista turn out to be even more intractable than they currently appear and that we do not get the new version for a very long time.

Wednesday, March 22, 2006

Spreadsheets Rule

On the one hand we keep hearing that the most used Business Intelligence application in the world today is Microsoft Excel. On the other hand most Business Intelligence experts and vendors put down spreadsheets as the enemy of good Business Intelligence. Spreadsheets are silos of information that contain wrong data, unanticipated data, contradictory data and broken formulas.

At the March meeting of the SDForum Business Intelligence SIG we heard a different story. Craig Thomas, CTO of Steelwedge Software, spoke on "Why Plans are Always Wrong". Steelwedge Software does Enterprise Planning and Performance Management. The gist of their software is that they build a data warehouse for enterprise planning and then deliver the data to the planners in the form that they are most familiar with, Excel spreadsheets.

Steelwedge keeps tight control on the planning process. Spreadsheets come from a template repository. A spreadsheet is populated with data from the data warehouse, and then checked out and delivered to a planner. Update is disabled on many fields so that the planner can only change the plan in controlled ways. After the plan has been updated, it is checked back in and the changed fields are integrated into the data warehouse. Workflow keeps the planning process on track.
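I have no idea how Steelwedge actually implements this, but the "update is disabled on many fields" part is easy to picture. Here is a small sketch using the openpyxl Python library, with made-up cells: unlock only the fields the planner may edit, then turn on sheet protection so everything else is read-only.

    # A generic illustration of a locked-down planning template -- not
    # Steelwedge's actual implementation. Only the cells a planner may edit
    # are unlocked; sheet protection makes every other cell read-only.
    from openpyxl import Workbook
    from openpyxl.styles import Protection

    wb = Workbook()
    ws = wb.active
    ws["A1"] = "Forecast units (from data warehouse)"
    ws["B1"] = 1200                       # populated figure, stays locked
    ws["A2"] = "Planner adjustment"
    ws["B2"] = 0                          # the one field the planner may change

    ws["B2"].protection = Protection(locked=False)  # cells are locked by default
    ws.protection.sheet = True                      # enforce the locking
    wb.save("plan_template.xlsx")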

Finally when the plan has been executed, the plan and execution can be compared to see how good the planning process is and where it needs to be improved. In fact, many Steelwedge customers have implemented Steelwedge because they felt that their planning process was out of control.

Join the Business Intelligence SIG Yahoo group, and you will be able to download the presentation from the "Files" area.

Sunday, March 19, 2006

Deconstructing RSS

If RSS stands for "Really Simple Syndication", how come the explanations of it on the web are so complicated and so useless? There are people out there pulling their hair out trying to understand RSS, and they don't get it because they don't get how really simple it is.

An RSS 'feed' is nothing more than a web page. The only difference between RSS 'feeds' and other web pages is that web pages are coded in HTML and RSS 'feeds' are coded in another dialect of XML. The content of an RSS 'feed' page is a set of items, where each item contains text and links just like any other web page.

A problem is that the promoters of RSS engage in obfuscation, trying to make out that RSS is something more than it really is. They talk about the feed as if it were a push technology. The Wikipedia page on RSS even has a link to push technology. However, like everything else on the web, RSS 'feeds' are fetched by a 'feed' aggregator using HTTP GET. Thus RSS 'feeds' are in reality pull technology, just like any other type of web browsing.
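To see just how ordinary this is, here is a minimal sketch in Python, using only the standard library, that pulls a 'feed' with a plain HTTP GET and lists its items. The feed URL is a placeholder, not a real feed.

    # A minimal sketch of 'feed' pulling: an RSS 'feed' is fetched with an
    # ordinary HTTP GET and parsed as XML, just like any other web resource.
    from urllib.request import urlopen
    from xml.etree import ElementTree

    FEED_URL = "http://example.com/rss.xml"  # placeholder feed address

    with urlopen(FEED_URL) as response:      # plain HTTP GET, i.e. pull
        tree = ElementTree.parse(response)

    # In RSS 2.0 each entry is an <item> with a title, a link and a
    # description -- text and links, like any other web page.
    for item in tree.getroot().iter("item"):
        title = item.findtext("title", default="(no title)")
        link = item.findtext("link", default="")
        print(title, link)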

This brings us to the feed aggregator. The normal way to render XML is to provide a style sheet that turns it into XHTML. Unfortunately, there seem to be problems with RSS which prevent this, so you need to have this special thing called an aggregator before you can read a 'feed'. From what I can tell, the first problem is that there are many different dialects of RSS, which vary just enough so that a single style sheet does not work nicely for them all. A second problem is that one RSS dialect wraps the 'feed' in a Resource Description Framework (RDF) tag. The RDF tag indicates that this is metadata, and in general metadata is not rendered, so you need the aggregator to strip off the RDF tags before the content can be rendered.

Another thing that an aggregator can do is display several RSS feeds on the same page. When you leave the Bizarro world of RSS pull 'feeds', this feature is known as a portal and each display on the page is known as a portlet. Portals are usually done on the server side, while aggregators more often work on the client side, but apart from this small difference they are pretty much the same thing.

The final thing that an aggregator can do is keep track of which items you have seen in a 'feed' and only show you new and updated items. Exactly how this should work is not stated, and when someone asks, the question is ignored. Doing it properly would require a discipline in generating the feed that is not required by the spec and thus cannot be relied on by the aggregator. In practice there is a convention (unstated as far as I can tell) that items are ordered in the feed page with the most recent first. The aggregator renders the page normally and you see as many of the recent items as your portlet for the feed has room to show.
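As an illustration of what "only show new items" might mean in practice, here is a small Python sketch, with made-up item data, that remembers the links it has already shown and displays only the ones it has not seen before. It assumes the newest-first convention described above.

    # A sketch of the 'show only new items' behavior an aggregator provides.
    # It remembers the links it has already displayed and assumes the feed
    # lists its items with the most recent first. All names are illustrative.
    seen_links = set()   # a real aggregator would save this between runs

    def new_items(items, room=5):
        """Return the unseen items, newest first, up to the portlet's room."""
        fresh = []
        for item in items:               # items assumed newest-first
            link = item["link"]
            if link not in seen_links:
                seen_links.add(link)
                fresh.append(item)
            if len(fresh) == room:
                break
        return fresh

    # Example run with two fetches of the same (made-up) feed.
    first_fetch = [{"link": "/post-2", "title": "B"}, {"link": "/post-1", "title": "A"}]
    print([i["title"] for i in new_items(first_fetch)])   # ['B', 'A']

    second_fetch = [{"link": "/post-3", "title": "C"}] + first_fetch
    print([i["title"] for i in new_items(second_fetch)])  # ['C']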

Summarizing, an RSS 'feed' is a web page in a slightly different dialect of XML and an aggregator is a client side portal. Is there anything that I have left out?

Monday, March 13, 2006

More on UIMA

If you try to look for an explanation of UIMA, you are very likely to run across the following: "UIMA is an open, industrial-strength, scalable and extensible platform for creating, integrating and deploying unstructured information management solutions from combinations of semantic analysis and search components, ..."

They almost had me going there, nodding my head in agreement until I reached the word solution. Whenever I come across the word solution, it is guaranteed either to raise my hackles or to make my mind wander. This time I was in a benign mood, so my mind started to wander.

Obviously the solution needs a large vat to contain it, and the unstructured information is probably on soggy pieces of paper floating around in the solution. The industrial-strength platform is made of steel, and stands above the vat so that people can stand up there and stir the solution. Of course the platform is modular so that it can be scaled to meet the needs of the stirrers.

Open? Does that refer to the open grid on the platform, or to the fact that the vat is open to allow the stirring rods to be put in? The search components are probably people scurrying around looking for more unstructured information to throw in the vat. The only thing that has me scratching my head is the semantic analysis. How can semantic analysis fit into an industrial scene like this?

Got any ideas?

Thursday, March 09, 2006

UIMA

UIMA stands for Unstructured Information Management Architecture, as we heard at the SDForum Emerging Technology SIG's March meeting. IBM has just open-sourced a central part of UIMA so that you can download and play with it yourself. So what is UIMA? Well, it seems that, like so many other things these days, the presenters did not want to be too specific about what UIMA is, because that would constrict our thinking and prevent us from seeing all sorts of wonderful new applications for it. On the other hand, you have to have some kind of grasp of what it is or you cannot do anything with it.

Lead presenter Daniel Gruhl gave the following roundabout explanation. In 1998, Tim Berners-Lee introduced the Semantic Web. The idea is that you tag your web pages with metadata in the RDF format, and then even robots will be able to discover what you really mean. Unfortunately, since then nobody has actually put RDF tags in their web pages, and web page metadata has become somewhat discredited as its principal use is to spam search engines.

So what if you could read pages and tag them with your own metadata? Well, that is what UIMA is about. It is a framework where you can take a set of documents and generate your own metadata for each document. The set of documents could be the whole web, a subset of the web or a set of documents in your own content repository. The documents can be XML, HTML, media files or anything else, as all information is now digital.
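UIMA itself is a Java framework, so the following is not UIMA code; it is just a Python sketch of the basic idea: annotators that each read a document and attach stand-off metadata about spans of text, without touching the document itself. All the names here are my own.

    # Not UIMA itself (which is a Java framework) -- just a sketch of the idea:
    # annotators read a document and attach metadata about spans of text,
    # without modifying the document. All names here are illustrative.
    import re

    def year_annotator(text):
        """Tag four-digit years, a toy example of a domain annotator."""
        return [{"type": "Year", "start": m.start(), "end": m.end(),
                 "text": m.group()} for m in re.finditer(r"\b(19|20)\d\d\b", text)]

    def analyze(documents, annotators):
        """Run every annotator over every document and collect the metadata."""
        return {doc_id: [a for annotator in annotators for a in annotator(text)]
                for doc_id, text in documents.items()}

    docs = {"doc1": "IBM open-sourced part of UIMA in 2006."}
    print(analyze(docs, [year_annotator]))
    # {'doc1': [{'type': 'Year', 'start': 33, 'end': 37, 'text': '2006'}]}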

The next question is what do we do with this metadata? You cannot go and update other people's web pages, although you could use the metadata to update your own documents and content. In practice, the principal use for the metadata is in building a search index. Although as I write this, I can see that there are plenty of other uses for UIMA in scanning and adding metadata to an existing media or document repository. So maybe the presenters were correct when they said that they did not want to constrain our thinking by being too specific about what UIMA is for.

The final question is why would you want to build your own document analyzer or search engine? Current search engines are very general. If you have specific knowledge about a subject area you can catalog a set of documents much more accurately and usefully than a general purpose search engine. One successful application of UIMA is an annotator that knows petrochemical terms and can create an index of documents useful to a petroleum engineer.

As UIMA is open source, people can build annotators on the work of others. The example shown as a demo was an annotator that discovered business titles. It used an existing annotator that identified people's names and another annotator that identified business names, and it looked for business titles between them, so it could easily find the CEO in "Sam Palmisano, CEO of IBM".
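To make the composition idea concrete, here is a rough Python sketch in the same spirit, not the demo code itself: it pretends the person and company annotators already exist (reduced to simple lookups here) and looks for a title sitting between a person and a company.

    # A rough illustration of composing annotators, not the actual demo code.
    # Pretend the person and company annotators already exist; here they are
    # simple dictionary lookups so the example stays self-contained.
    import re

    PEOPLE = {"Sam Palmisano"}
    COMPANIES = {"IBM"}
    TITLES = {"CEO", "CFO", "CTO", "chairman"}

    def title_annotator(text):
        """Find '<person>, <title> of <company>' and record the title."""
        results = []
        for m in re.finditer(r"(\w+ \w+), (\w+) of (\w+)", text):
            person, title, company = m.groups()
            if person in PEOPLE and company in COMPANIES and title in TITLES:
                results.append({"person": person, "title": title, "company": company})
        return results

    print(title_annotator("Sam Palmisano, CEO of IBM, spoke first."))
    # [{'person': 'Sam Palmisano', 'title': 'CEO', 'company': 'IBM'}]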

Monday, March 06, 2006

Web Site Measurement Hacks

O'Reilly published "Web Site Measurement Hacks" by Eric T. Peterson in August 2005. Here is a review.

Before going into this book in depth, it is worth saying something about the O'Reilly "Hacks" series. The concept is that one person, acting as much as editor as author, puts together 100 separate topics or "hacks" on the subject, with the help of a panel of expert contributors. This creates an authoritative book quickly, which is useful in a rapidly evolving field where the normal book production process takes so long that a book can be out of date before it is published. Also, the book sums the knowledge of an array of experts, giving it balance and letting it represent the best practices in its field.

The Hacks books are organized as 100 topics that are copiously cross-referenced and are mostly designed to be read independently. While this means that you can skim and dip in at any point of interest, the books can be tediously repetitive if read linearly. So part of the job of a review of a "Hacks" book is to tell the reader how to read the book.

Eric T. Peterson, author of "Web Site Measurement Hacks: Tips and Tools to Help Optimize Your Online Business", is a senior analyst with JupiterResearch and has also authored "Web Analytics Demystified". He has enlisted a panel of 17 highly qualified experts to cover all aspects of web site measurement and analysis.

So why measure web sites? Well, a 19th century department store magnate said, "Half the money I spend on advertising is wasted; the trouble is I just don't know which half." Today, with the collection and analysis of web site data, it is possible to calculate the cost and benefit of a marketing campaign down to the last penny, and that is just one of the measurement activities discussed in the book. Properly used, web site measurements can help you optimize every aspect of your web site.

The book is divided into 7 chapters. The first chapter introduces the basic concepts. Unfortunately, these basic concepts are intermingled with other topics on such diverse subjects as how to set up a web analytics team, selecting measurement software and when to use packet sniffing. Everyone needs to read the introductory hacks 1, 2, 5, 7 and 9.

Chapter 2, along with some of the later hacks in Chapter 1, goes into the details of implementing a web site measurement system. Most readers should come back to this chapter after they have read the later chapters and decided what they want their web site measurement system to do. There is a lot of good material in this chapter; however, it is gritty implementation detail. The approach to measurement espoused by the book is to decide on the business issues that you want the web measurement to address and then figure out how to do it. The reader should stick to the program.

The third chapter covers online marketing measurement. Everyone should read hacks 37 through 39, which cover general visitor measurement terminology leading up to the all-important concept of conversion. The rest of the chapter is divided into topics on measuring specific marketing activities, such as banner campaigns, email campaigns and paid search campaigns, that are of interest to sales and marketing types. The big picture takeaway from these topics is that it is possible to measure the effectiveness of these campaigns in excruciating detail.
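As a back-of-the-envelope illustration of why conversion is the number everything else hangs off (my numbers, not the book's), the arithmetic is simple:

    # Back-of-the-envelope campaign math; the figures are made up, not from the book.
    visitors = 10_000        # visitors delivered by a paid search campaign
    orders = 250             # visitors who converted into an order
    campaign_cost = 5_000.00 # what the campaign cost, in dollars
    revenue_per_order = 40.00

    conversion_rate = orders / visitors                  # 0.025, i.e. 2.5%
    cost_per_conversion = campaign_cost / orders         # $20.00
    campaign_revenue = orders * revenue_per_order        # $10,000.00

    print(f"conversion rate:     {conversion_rate:.1%}")
    print(f"cost per conversion: ${cost_per_conversion:.2f}")
    print(f"campaign revenue:    ${campaign_revenue:,.2f}")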

Chapter 4 is about measuring web site usability. It should be read by the kind of web designer who is actually interested in making their web site user friendly, and the marketing type who is interested in increasing their conversion rate by improving the site design. Chapter 5 discusses technographics and demographics. Technographics is the measurement of the technical capabilities of users, answering questions such as which browsers to test your web site with and whether your pages download fast enough. Demographics is the realm of marketing.

Chapter 6 goes through measurement for online retail in greater depth, covering some more sophisticated topics like cross-sell and estimating the lifetime value of a customer. This is a deeper discussion of online marketing in retail, and it leads on to the final chapter on Reporting Strategies and Key Performance Indicators. This chapter is the realm of the business analyst. It starts off with some sage advice on how to distribute measurement results to the rest of the organization. The next few topics explain Key Performance Indicators (KPIs), and the final topics list best practice KPIs for different types of web sites.

Overall this is a comprehensive collection of good material on web site measurement. It contains quite enough material to satisfy a non-technical reader, as well as a full JavaScript and Perl implementation of a web measurement system that you can deploy yourself.

I do have a few criticisms. Firstly, several of the screen shot figures are too small to read. Secondly, I cringed in a few places where the book confuses revenue with income. Finally, I was disappointed with the hack on split path testing. This is a valuable technique for objectively measuring web site design, but it is not easy. The subject is big enough that it really needs a chapter of its own; instead all we get is one hack that starts with a large chunk of VBScript, followed by a lame explanation. For all aspects of web site measurement apart from split path testing, the book is highly recommended.