Friday, December 31, 2010

The Year in Posts

Looking back at posts in this blog over the last year I see a couple of themes emerge. Firstly there were many posts on technology and media, in particular several on the iPad which has had an extraordinary effect as the first device specifically designed for consuming media. Other issues of concern included television, 3D, aspect ratios and the problem of registration at web sites. We are going through huge changes in the media world as digitization and the internet delivery system change everything. I have written many posts on this in the past and I will continue to do so.

The SDForum Business Intelligence SIG that I chair had a banner year, with so many memorable meetings that it is difficult to pick out the best one. A fantastic talk from Google Analytics Evangelist Avinash Kaushik on "Web Analytics 2.0" drew by far the biggest crowd. We had two great big data talks: "Winning with Big Data" from Michael Driscoll of Dataspora and "Mad Skills for Big Data" from Brian Dolan, both very impressive. Donovan Schneider from SalesForce.com spoke on "Real Time Analytics" and Dan Graham from Teradata spoke on "Data Management in the Cloud". Finally Peter Farago and Sean Byrnes of Flurry talked about the extraordinary information about smartphone usage that they collect from their Mobile App analytics platform. Co-chair Paul O'Rorke, who organized several of these meetings, has stepped down and we will miss him greatly.

Finally, Blogger started collecting statistics in May of this year. Looking at the page views on this blog, my last post on "Windows File Type Fail" has generated a lot of interest in the few days since it was posted. The most viewed post is a 2009 post on "Ruby versus Scala" followed closely by the Windows post. In my view, the post last year about the Windows Autorun feature is a better rant than the current one. You can feel the veins bulging in that rant whereas this year's rant is very laid back in comparison. Do not worry, there are many more misfeatures of Microsoft Windows to rant about so I am not going to run out of material for a long time.

Tuesday, December 28, 2010

Windows File Type Fail

It is that time of year when I rant about an awful, awful, awful feature of the Microsoft Windows operating system. This year the subject of my diatribe is file types. You see, Windows thinks that every file has a type and the type connects the file to a program that can handle it. Like many "features" in Windows, file types are intended to make your life easier while in practice doing the opposite. Note that some time ago, I wrote about file systems and Content Management as opposed to a file type manager. I still think there are some good ideas in there that need to be explored.

If you do not know what a file type is, here is a primer. Every file has a name. The file type is an extension to the name, usually three letters long. So for example, the program for Windows Explorer is called "explorer.exe": the dot is a separator and exe is the file type. The type exe means a program that Windows can run. To look at all the file types on an XP system, bring up the control panel, select Folder Options and then click the File Types tab. On Vista and 7, the path through the control panel is slightly different. The dialog shows a huge list of registered file types and the programs that will handle them. Note that the first few entries in the list are not representative, so go down to the middle or bottom of the list to see what it is really all about.
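To make the primer concrete, here is a minimal Python sketch of how a file type can be pulled out of a file name. The function name file_type is made up for illustration, and this only looks at the name; it says nothing about how Windows itself stores or looks up its registered types.

```python
import os.path

def file_type(filename):
    """Return the part of the name after the last dot, lower-cased.

    Returns an empty string when the name has no extension.
    """
    _, extension = os.path.splitext(filename)  # "explorer.exe" -> ("explorer", ".exe")
    return extension[1:].lower()               # drop the separator dot

print(file_type("explorer.exe"))  # exe
```

The name is split at the last dot only, so a name with several dots still yields a single type.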

Windows goes to great lengths to hide file types from you. By default they are not shown anywhere and you can go for a long time without even knowing that files have types. One way to run into file types is to double click on a file with a type that Windows does not know about. Windows shows a dialog asking you what program you want to use with it. You can either look up the file type on the web or select a program from a list. The most annoying aspect is that when you select a program from a list, there is a little check box that says "Always use the selected program to open this type of file." If you test a program that does not work without unchecking the box, the mistake is remembered and thereafter every time you open a file of that type, the wrong program is chosen. If you uncheck the box, a mistake is not remembered, however neither is a success. Either way, you can lose. Moreover, to recover from a mistake, you have to find the entry for the file extension in the File Types window discussed above and delete it, which is not a trivial task, given the number of file types.

Another little problem with file types is that they can be wrong, confused or direct Windows to do the wrong thing. I wrote about a problem with .avi files from a Canon camera breaking Windows Explorer. There are security issues where Windows is penetrated because it trusts the file type information and then does the wrong thing with a broken file.

However, the real problem with file types appears when you install a new program. Programs are greedy. They want to control as much of your experience as possible so they will try to register as many different file types as they can. If you have one program that deals with a type of file and you install another program that deals with similar files, the new program should pop up a dialog asking you which types of files it should handle. Then you have to make all sorts of complicated decisions about which file types the new program should handle.

Programs for handling media are the worst in this respect because there are lots of different media types and it is common to have several media players installed to handle different special cases. For example, on my home computer I have Windows Media Player and a DVD player because they came with the system. Then there is iTunes for my iPod, the QuickTime video player that comes with iTunes, a RealPlayer for the BBC iPlayer and finally a program for ripping and burning CDs and DVDs. There may well be other media players amongst the shovelware preinstalled on the box. There are also programs for editing specific media types like at least two picture editors and a video editor or several.

A typical scenario is that you are installing a new media player program because you want to use it to view a particular type of media. Unfortunately, the program installer knows about all the media types that it can handle and asks you to choose which media types it should handle. Thus you have to disengage your thoughts from the one media type that is the object of your attention and instead start to think about all those other media types that you are not interested in. Then there is the worry that if you give in to the new media player and let it handle certain types of media, other things will stop working. Maybe you will not be able to watch videos, or maybe videos will stop syncing with your portable media player because you changed the program associations with a particular file type. Given the complexity of these systems, who knows what may go wrong.

I said that the media player installer should ask you which file types you want associated with the program. A few years ago, Real managed to destroy much of their franchise by not playing nice and fair with file types. The RealPlayer installer switched all file types that it could handle to use the RealPlayer without bothering to ask or notify. Worse, if you went in and installed another program that changed the file type associations or even tried to use the File Types dialog screen to change file type associations, it would just change them back to the RealPlayer, again without a notification. When this came to light, many people, myself included, uninstalled RealPlayer and swore never to install any software from Real again. Recently I caved on this resolution so that I could listen to old BBC radio shows like "The Goon Show" with the BBC iPlayer, which turns out to be just a rebadged RealPlayer.

Since the RealPlayer imbroglio, installer programs have been a lot more careful about asking users about file types, but that just throws the problem back to the user. As the whole point of file types is to hide system complexity from the user, this is no solution at all. A better path is to do without file types. Why are they necessary? Do they really serve a purpose? Other operating systems get along fine without file types, so why does Windows need them? Let's just throw them out and make life easier.

Monday, December 20, 2010

Is That Annoying Modal Caps Lock Key Going Away?

So Google came out with their new Chrome Operating System, loaded it onto a laptop and gave the whole caboodle to people to play with and comment on. While Chrome OS has generated a lot of comments, the largest and most active discussion has been about the Caps Lock key. You see, Google has changed the behavior of the key that used to be Caps Lock to instead call up a search page. I am sure this change was made to pander to keyboard weenies who want to Google without having to lift their hands from the keyboard. Anyway, the change has backfired. Instead of talking about Chrome OS, everyone is engaged in a furious discussion of why the Caps Lock is either essential or should have been disposed of a long time ago.

I have two problems with the Caps Lock, no make that three. The first problem is that it sits right between two important keys. Below it is the Shift key whose importance needs no explanation. Above it is the Tab key, used for next field, command completion, automatic indent and plenty of other useful purposes. In the middle sits Caps Lock just waiting to be hit by accident. This brings me to the next problem: Caps Lock is modal. Hit the Caps Lock key by accident and you do not make just one typing mistake, rather the whole keyboard is shifted into a new mode and the error compounds. By the time I look at the screen, I have typed half a sentence in the wrong case.

I am a member of the tribe that hates modal user interfaces with a passion. Some of my compatriots physically remove the Caps Lock key or reprogram their keyboard to reduce typing errors. I have only gone as far as to disable that other annoying modal key. The Insert key is used by many editors to switch between insert mode and overtype mode. If you hit Caps Lock by accident, the result is obvious, if you hit Insert by accident you can go on for some time before you realize that you are seriously damaging the document that you are trying to fix up. Of course, the Insert key is slightly off the main keyboard, right above the really useful Delete key and just waiting to be hit by accident.

My final problem with the Caps Lock key is that if you are in Caps Lock mode and you press Shift, it reverts back to entering lower case. This means that when I hit cAPS lOCK by accident every key I type is in the wrong case, not just some of them. I happen to have an old typewriter from the 1930s so I know what shift really means. The Shift key causes the whole paper carriage and platen to move so that when the typebar comes down a different type piece strikes the ink ribbon and paper. Shifting the platen is why it is called the Shift key and it is a heavy key to hold, so there is a Shift Lock key that is a mechanical lock to hold the platen in the shifted position. With the platen locked in the shifted position, hitting the Shift key does nothing, so why has someone gone to the trouble of programming bogus behavior into our modern and supposedly more convenient keyboards?
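The difference between the two behaviors can be sketched in a few lines of Python. This is only an illustration: the function names are made up, it handles letters only, and it assumes the typist presses Shift for every letter that is meant to be a capital.

```python
def computer_keyboard(text, caps_lock):
    """Modern keyboard: Shift reverses Caps Lock (an exclusive or)."""
    typed = []
    for ch in text:
        shift = ch.isupper()        # typist holds Shift for intended capitals
        upper = caps_lock != shift  # the two keys cancel each other out
        typed.append(ch.upper() if upper else ch.lower())
    return "".join(typed)

def typewriter(text, shift_lock):
    """Typewriter: with the platen locked up, pressing Shift changes nothing."""
    typed = []
    for ch in text:
        shift = ch.isupper()
        upper = shift_lock or shift
        typed.append(ch.upper() if upper else ch.lower())
    return "".join(typed)

print(computer_keyboard("Caps Lock", caps_lock=True))  # cAPS lOCK
print(typewriter("Caps Lock", shift_lock=True))        # CAPS LOCK
```

With the exclusive or, every single letter comes out in the wrong case; with the typewriter rule, only the letters that were meant to be lower case are wrong.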

Now, I know that there are people who love the Caps Lock key and who use it all the time. For my part, given the choice between a key that causes a small typing mistake every time I hit it by accident and a key that brings up a new web page by accident, I will choose the Caps Lock function every time. Caps Lock is annoying but I have lived with it for a long time and it is a much smaller surprise than a new page that I do not want.

Saturday, December 18, 2010

The Gawker Password Fiasco

Last month I wrote about password security, just a little too soon. This month the popular blog site owner Gawker admitted to a huge security breach where hackers had broken into their web servers and stolen their entire database of user account names with email addresses and passwords. The attack has brought password security to everyone's attention, with people reporting that their email and other accounts have been compromised. There are a lot of discussions of protocols for password security with good information, and unfortunately there is also a lot of misinformation. Here is my take.

The Forbes magazine web-site has a clear description of the attack on Gawker (although their discussion of the password encryption is not correct). The short story is that the break-in was done by a hacker group called Gnosis who were annoyed by Gawker. Frankly, given Gawker's arrogant style, who has not been annoyed by them at some time? Gnosis first broke in to Gawker in July and got the passwords to accounts for Nick Denton and 16 other staffers there. In November, Denton noticed some possible tampering in a web account, and finally in December Gnosis announced their break-in and released data they had gathered.

Although Gawker had used encryption to hide the users' passwords, they are susceptible to a brute force attack and many passwords have been broken. Gawker lost over 1 million accounts and more than 100,000 passwords have been cracked and published so far. The Wall Street Journal has a nice analysis of the most popular passwords including a frequency graph.

There is a lot of misunderstanding about how passwords are stored on a web site and how a brute force attack takes place. For example, the Forbes article I mentioned earlier obviously does not have a clue. I do not know for certain how Gawker protects their passwords, however the best practice is to use a salted hash. With this technique, the web-site chooses a salt, which is just a random string of characters. When a user sets a password, the salt is appended to the password and the whole string is hashed with a cryptographic hash function like SHA-1. The resulting hash value is a seemingly random string of bits, and this is stored as the user's encrypted password. When the user wants to log in, the salt is added to the supplied password, the resulting string hashed, and the hash value compared to the saved hash. If they are the same, the user must have provided the correct password and is allowed to log in. By using a salted hash, the web-site does not save the user's password, they just save a cryptographic hash that is used to confirm that the user knows their password. To make things more secure, the web-site can save a different salt for each user or just add the user name to a common salt so that even if two users have the same password, the salted hashes of their passwords are not the same.
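Here is a minimal Python sketch of the salted hash scheme just described, using a different salt for each user. It is an illustration only, not Gawker's actual code: the function names are made up, and SHA-1 is used simply because it is the hash function named above.

```python
import hashlib
import hmac
import os

def new_salt(num_bytes=16):
    """Generate a random salt as a string of hex characters."""
    return os.urandom(num_bytes).hex()

def salted_hash(password, salt):
    """Append the salt to the password and hash the whole string."""
    return hashlib.sha1((password + salt).encode("utf-8")).hexdigest()

def check_password(supplied, salt, saved_hash):
    """Hash the supplied password with the saved salt and compare."""
    return hmac.compare_digest(salted_hash(supplied, salt), saved_hash)

# When the user sets a password, only the salt and the hash are saved.
salt = new_salt()
saved = salted_hash("correct horse", salt)

print(check_password("correct horse", salt, saved))  # True
print(check_password("wrong horse", salt, saved))    # False
```

Because each user gets their own salt, two users who pick the same password end up with different saved hashes. A site building this today would more likely reach for a deliberately slow function such as hashlib.pbkdf2_hmac rather than a single fast hash, which makes every guess in a brute force attack more expensive.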

In a brute force attack the attacker knows the algorithm used to generate the salted hash and has the salt and the salted hash of the password. The attacker generates a list of potential passwords and applies the password checking algorithm to each one. If the result matches the stolen hash, they have guessed the user's password. If the attacker can try 20 passwords a second, they can test well over a million passwords a day on a single computer.
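The attack loop is short enough to sketch in Python. As before this is only an illustration under stated assumptions: the function names are made up, the hash is assumed to be SHA-1 over the password with the salt appended, and the salt and candidate list below are invented test values.

```python
import hashlib

def salted_hash(password, salt):
    """The password checking algorithm the attacker is assumed to know."""
    return hashlib.sha1((password + salt).encode("utf-8")).hexdigest()

def brute_force(stolen_hash, salt, candidates):
    """Try each candidate password; return the one that matches, or None."""
    for guess in candidates:
        if salted_hash(guess, salt) == stolen_hash:
            return guess
    return None

# The arithmetic behind "well over a million passwords a day".
GUESSES_PER_SECOND = 20
guesses_per_day = GUESSES_PER_SECOND * 60 * 60 * 24
print(guesses_per_day)  # 1728000

# A made-up stolen record and a very short candidate list.
stolen = salted_hash("letmein", "x7Qp")
print(brute_force(stolen, "x7Qp", ["password", "123456", "letmein"]))  # letmein
```

The salt does not stop the attack, it only forces the attacker to repeat the work for each account instead of cracking everyone with the same password at once.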

It is very easy to generate a list of potential passwords. One good starting point is a list of broken passwords, such as the one published by Gnosis from the attack on Gawker. The next step is a dictionary of common words and proper names. Many applications have a spelling dictionary that can be used as a starting point. Then try some simple variations like adding a number to the beginning or end of words, capitalizing letters in the word and making common substitutions for letters such as 1 for the letter 'i' and 5 or $ for 's'.
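To show how little effort those variations take, here is a rough Python sketch that expands one dictionary word into a list of candidates. The function name and the exact set of variations are my own choices for illustration; a real cracking tool would apply many more rules.

```python
def variations(word):
    """Yield simple variations of one dictionary word, without repeats."""
    digits_for_letters = str.maketrans({"i": "1", "s": "5"})
    bases = [
        word,
        word.capitalize(),                   # capitalize the first letter
        word.upper(),                        # capitalize every letter
        word.translate(digits_for_letters),  # 1 for 'i' and 5 for 's'
        word.replace("s", "$"),              # $ for 's'
    ]
    seen = set()
    for base in bases:
        with_numbers = [base]
        with_numbers += [str(d) + base for d in range(10)]  # number at the beginning
        with_numbers += [base + str(d) for d in range(10)]  # number at the end
        for candidate in with_numbers:
            if candidate not in seen:
                seen.add(candidate)
                yield candidate

candidates = list(variations("miss"))
print(len(candidates))  # 105
```

One four letter word turns into about a hundred guesses, so even a large spelling dictionary expanded this way is only a few days of work at 20 guesses a second.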

So now that you know how it is done, think about your passwords and how easily they can be attacked by brute force, and excuse me while I go and change some of mine.

Saturday, December 11, 2010

Now You See It: the Book

If you are of a data analytics bent or know someone who is and are looking for a book to put on the Christmas list, consider Now You See It: Simple Visualization Techniques for Quantitative Analysis by Stephen Few. This is a beautiful book that would not look out of place on a coffee table, yet at the same time, is full of practical information about how to do analytics with charts, graphs and other visual tools.

The book is divided into three sections. The first section covers visual perception and general visualization techniques for looking at data. Then the second section goes into more detail with chapters on specific techniques for different types of analysis including time-series analysis, ranking analysis, deviation analysis and multivariate analysis amongst others. Each chapter in this section ends with a summary of the techniques and best practices for that type of analysis. Finally the book ends with a shorter section that looks at promising new trends in visualization.

There are copious examples of graphs and charts drawn by different software tools. While some of these graphs come from high end tools like Tableau and Spotfire, others are drawn by Microsoft Excel. In fact there are several specific procedures for using features of Excel to do sophisticated analytics. That is not to say that the book suggests that you can do everything with a spreadsheet. The first part shows you what to look for in visual analytics software and it is essential reading before going out and choosing which tool to use.

So, if you are looking for a quality and practical gift for an analytician, choose "Now You See It".