Friday, December 31, 2010

The Year in Posts

Looking back at posts in this blog over the last year, I see a couple of themes emerge. First, there were many posts on technology and media, in particular several on the iPad, which has had an extraordinary effect as the first device specifically designed for consuming media. Other issues of concern included television, 3D, aspect ratios and the problem of registration at web sites. We are going through huge changes in the media world as digitalization and internet delivery change everything. I have written many posts on this in the past and I will continue to do so.

The SDForum Business Intelligence SIG that I chair had a banner year with so many memorable meetings that it is difficult to pick out the best one. A fantastic talk from Google Analytics Evangelist Avinash Kaushik on "Web Analytics 2.0" drew by far the biggest crowd. We had two great big data talks: "Winning with Big Data" from Michael Driscoll of Dataspora and "Mad Skills for Big Data" from Brian Dolan, both very impressive. Donovan Schneider from SalesForce.com spoke on "Real Time Analytics" and Dan Graham from Teradata spoke on "Data Management in the Cloud". Finally, Peter Farago and Sean Byrnes of Flurry talked about the extraordinary information about smartphone usage that they collect from their mobile app analytics platform. Co-chair Paul O'Rorke, who organized several of these meetings, has stepped down and we will miss him greatly.

Finally, Blogger started collecting statistics in May of this year. Looking at the page views on this blog, my last post, "Windows File Type Fail", has generated a lot of interest in the few days since it was posted. The most viewed post is a 2009 post on "Ruby versus Scala", followed closely by the Windows post. In my view, the post last year about the Windows Autorun feature is a better rant than the current one. You can feel the veins bulging in that rant, whereas this year's rant is very laid back in comparison. Do not worry, there are many more misfeatures of Microsoft Windows to rant about, so I am not going to run out of material for a long time.

Tuesday, December 28, 2010

Windows File Type Fail

It is that time of year when I rant about an awful, awful, awful feature of the Microsoft Windows operating system. This year the subject of my diatribe is file types. You see, Windows thinks that every file has a type and the type connects the file to a program that can handle it. Like many "features" in Windows, file types are intended to make your life easier while in practice doing the opposite. Note that some time ago, I wrote about file systems and Content Management as opposed to a file type manager. I still think there are some good ideas in there that need to be explored.

If you do not know what a file type is, here is a primer. Every file has a name. The file type is an extension to the name, usually three letters long. So, for example, the program for Windows Explorer is called "explorer.exe": the dot is a separator and exe is the file type. The type exe means a program that Windows can run. To look at all the file types on an XP system, bring up the control panel, select Folder Options and then click the File Types tab. On Vista and 7, the path through the control panel is slightly different. The dialog shows a huge list of registered file types and the programs that will handle them. Note that the first few entries in the list are not representative; go down to the middle or bottom of the list to see what it is really all about.
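For the technically curious, these registered associations live in the Windows registry, and you can look them up programmatically. Below is a minimal Python sketch (Windows only); it only consults the system-wide HKEY_CLASSES_ROOT keys, so treat it as an illustration of how an extension maps to a file type and then to a program, not a complete association lookup.

    import winreg

    def handler_for(extension):
        # The extension key (e.g. ".txt") names a file type (a ProgID),
        # and the ProgID's shell\open\command value names the program
        # that Windows will run when you double click a file of that type.
        progid = winreg.QueryValue(winreg.HKEY_CLASSES_ROOT, extension)
        command = winreg.QueryValue(winreg.HKEY_CLASSES_ROOT,
                                    progid + r"\shell\open\command")
        return progid, command

    print(handler_for(".txt"))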

Windows goes to great lengths to hide file types from you. By default they are not shown anywhere and you can go for a long time without even knowing that files have types. One way to run into file types is to double click on a file with a type that Windows does not know about. Windows shows a dialog asking you what program you want to use with it. You can either look up the file type on the web or select a program from a list. The most annoying aspect is that when you select a program from the list, there is a little check box that says "Always use the selected program to open this type of file." If you test a program that does not work without unchecking the box, the mistake is remembered and thereafter every time you open a file of that type, the wrong program is chosen. If you uncheck the box, a mistake is not remembered, but neither is a success. Either way, you can lose. Moreover, to recover from a mistake, you have to find the entry for the file extension in the File Types window discussed above and delete it, which is not a trivial task given the number of file types.

Another little problem with file types is that they can be wrong, confused or direct Windows to do the wrong thing. I wrote about a problem with .avi files from a Canon camera breaking Windows Explorer. There are security issues where Windows is penetrated because it trusts the file type information and then does the wrong thing with a broken file.

However, the real problem with file types appears when you install a new program. Programs are greedy. They want to control as much of your experience as possible so they will try to register as many different file types as they can. If you have one program that deals with a type of file and you install another program that deals with similar files, the new program should pop up a dialog asking you which types of files it should handle. Then you have to make all sorts of complicated decisions about which file types the new program should handle.

Programs for handling media are the worst in this respect because there are lots of different media types and it is common to have several media players installed to handle different special cases. For example, on my home computer I have Windows Media Player and a DVD player because they came with the system. Then there is iTunes for my iPod, the QuickTime video player that comes with iTunes, RealPlayer for the BBC iPlayer and finally a program for ripping and burning CDs and DVDs. There may well be other media players amongst the shovelware preinstalled on the box. There are also programs for editing specific media types, including at least two picture editors and one or more video editors.

A typical scenario is that you are installing a new media player program because you want to use it to view a particular type of media. Unfortunately, the program installer knows about all the media types that it can handle and asks you to choose which media types it should handle. Thus you have to disengage your thoughts from the one media type that is the object of your attention and instead start to think about all those other media types that you are not interested in. Unfortunately, there is the worry that if you give in to the new media player and let it handle certain types of media, other things will stop working. Maybe you will not be able to watch videos, or maybe videos will stop syncing with your portable media player because you changed the program associations for a particular file type. Given the complexity of these systems, who knows what may go wrong.

I said that the media player installer should ask you which file types you want associated with the program. A few years ago, Real managed to destroy much of their franchise by not playing nice and fair with file types. The RealPlayer installer switched all the file types that it could handle to use RealPlayer without bothering to ask or notify. Worse, if you installed another program that changed the file type associations, or even tried to use the File Types dialog to change them yourself, it would just change them back to RealPlayer, again without notification. When this came to light, many people, myself included, uninstalled RealPlayer and swore never to install any software from Real again. Recently I caved on this resolution so that I could listen to old BBC radio shows like "The Goon Show" with the BBC iPlayer, which turns out to be just a rebadged RealPlayer.

Since the RealPlayer imbroglio, installer programs have been a lot more careful about asking users about file types, but that just throws the problem back to the user. As the whole point of file types is to hide system complexity from the user, this is no solution at all. A better path is to do without file types. Why are they necessary? Do they really serve a purpose? Other operating systems get along fine without file types, so why does Windows need them? Let's just throw them out and make life easier.

Monday, December 20, 2010

Is That Annoying Modal Caps Lock Key Going Away?

So Google came out with their new Chrome Operating System, loaded it onto a laptop and gave the whole caboodle to people to play with and comment on. While Chrome OS has generated a lot of comments, the largest and most active discussion has been about the Caps Lock key. You see, Google has changed the behavior of the key that used to be Caps Lock to instead call up a search page. I am sure this change was made to pander to keyboard weenies who want to Google without having to lift their hands from the keyboard. Anyway, the change has backfired. Instead of talking about Chrome OS, everyone is engaged in a furious discussion of why the Caps Lock key is either essential or should have been disposed of a long time ago.

I have two problems with the Caps Lock key, no make that three. The first problem is that it sits right between two important keys. Below is the Shift key, whose importance needs no explanation. Above it is the Tab key, used for next field, command completion, automatic indent and plenty of other useful purposes. In the middle sits Caps Lock, just waiting to be hit by accident. This brings me to the next problem: Caps Lock is modal. Hit the Caps Lock key by accident and you do not make just one typing mistake; rather the whole keyboard is shifted into a new mode and the error compounds. By the time I look at the screen, I have typed half a sentence in the wrong case.

I am a member of the tribe that hates modal user interfaces with a passion. Some of my compatriots physically remove the Caps Lock key or reprogram their keyboard to reduce typing errors. I have only gone as far as disabling that other annoying modal key, the Insert key, which is used by many editors to switch between insert mode and overtype mode. If you hit Caps Lock by accident, the result is obvious; if you hit Insert by accident, you can go on for some time before you realize that you are seriously damaging the document that you are trying to fix up. Of course, the Insert key is slightly off the main keyboard, right above the really useful Delete key and just waiting to be hit by accident.
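For anyone tempted to go the reprogramming route, Windows lets you remap keys through the registry. The sketch below is one way to do it from Python, under the assumption that you have administrator rights and are willing to reboot for the change to take effect; the scan code map shown simply disables Caps Lock rather than turning it into something useful.

    import winreg

    # "Scancode Map" remaps keys at the keyboard driver level.
    # This map sends Caps Lock (scan code 0x3A) to nothing, disabling it.
    scancode_map = bytes([
        0, 0, 0, 0,      # header: version (always zero)
        0, 0, 0, 0,      # header: flags (always zero)
        2, 0, 0, 0,      # number of entries: one mapping plus the terminator
        0, 0, 0x3A, 0,   # map Caps Lock (0x3A) to 0x0000 (no key)
        0, 0, 0, 0,      # terminator
    ])

    key = winreg.CreateKey(winreg.HKEY_LOCAL_MACHINE,
                           r"SYSTEM\CurrentControlSet\Control\Keyboard Layout")
    winreg.SetValueEx(key, "Scancode Map", 0, winreg.REG_BINARY, scancode_map)
    winreg.CloseKey(key)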

My final problem with the Caps Lock key is that if you are in Caps Lock mode and you press Shift, it reverts to entering lower case. This means that when I hit cAPS lOCK by accident, every key I type is in the wrong case, not just some of them. I happen to have an old typewriter from the 1930's, so I know what shift really means. The Shift key causes the whole paper carriage and platen to move so that when the typebar comes down, a different type piece strikes the ink ribbon and paper. Shifting the platen is why it is called the Shift key, and it is a heavy key to hold, so there is a Shift Lock key that mechanically locks the platen in the shifted position. With the platen locked in the shift position, hitting the Shift key does nothing, so why has someone gone to the trouble of programming bogus behavior into our modern and supposedly more convenient keyboards?

Now, I know that there are people who love the Caps Lock key and who use it all the time. For my part, given the choice between a key that causes a small typing mistake every time I hit it by accident and a key that brings up a new web page by accident, I will choose the Caps Lock function every time. Caps Lock is annoying, but I have lived with it for a long time and it is a much smaller surprise than a new page that I do not want.

Saturday, December 18, 2010

The Gawker Password Fiasco

Last month I wrote about password security, just a little too soon. This month the popular blog site owner Gawker admitted to a huge security breach in which hackers broke into their web servers and stole their entire database of user account names with email addresses and passwords. The attack has brought password security to everyone's attention, with people reporting that their email and other accounts have been compromised. There are a lot of discussions of protocols for password security with good information, and unfortunately there is also a lot of misinformation. Here is my take.

The Forbes magazine web-site has a clear description of the attack on Gawker (although their discussion of the password encryption is not correct). The short story is that the break-in was done by a hacker group called Gnosis who were annoyed by Gawker. Frankly, given Gawker's arrogant style, who has not been annoyed by them at some time? Gnosis first broke into Gawker in July and got the passwords to accounts for Nick Denton and 16 other staffers there. In November, Denton noticed some possible tampering in a web account, and finally in December Gnosis announced their break-in and released the data they had gathered.

Although Gawker had used encryption to hide the users' passwords, the passwords are susceptible to a brute force attack and many of them have been broken. Gawker lost over 1 million accounts and more than 100,000 passwords have been cracked and published so far. The Wall Street Journal has a nice analysis of the most popular passwords, including a frequency graph.

There is a lot of misunderstanding about how passwords are stored on a web site and how a brute force attack takes place. For example, the Forbes article I mentioned earlier obviously does not have a clue. I do not know for certain how Gawker protects their passwords, however the best practice is to use a salted hash. With this technique, the web-site chooses a salt, which is just a random string of characters. When a user sets a password, the salt is appended to the password and the whole string is hashed with a cryptographic hash function like SHA-1. The resulting hash value is a seemingly random string of bits, and this is stored as the user's encrypted password. When the user wants to log in, the salt is added to the supplied password, the resulting string is hashed, and the hash value compared to the saved hash. If they are the same, the user must have provided the correct password and is allowed to log in. By using a salted hash, the web-site does not save the user's password, it just saves a cryptographic hash that is used to confirm that the user knows their password. To make things more secure, the web-site can save a different salt for each user, or just add the user name to a common salt, so that even if two users have the same password, the salted hashes of their passwords are not the same.
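To make that concrete, here is a small Python sketch of the store-and-verify scheme just described. It uses SHA-1 to match the description above; in practice a deliberately slow password hash would be a better choice, but the shape of the scheme is the same.

    import hashlib, os

    def hash_password(password, salt=None):
        # A per-user random salt means two users with the same password
        # still end up with different stored hashes.
        if salt is None:
            salt = os.urandom(16).hex()
        digest = hashlib.sha1((salt + password).encode()).hexdigest()
        return salt, digest          # the web site stores both of these

    def check_password(password, salt, stored_digest):
        # Re-hash the supplied password with the stored salt and compare.
        return hashlib.sha1((salt + password).encode()).hexdigest() == stored_digest

    salt, digest = hash_password("correct horse")
    print(check_password("correct horse", salt, digest))  # True
    print(check_password("wrong guess", salt, digest))    # False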

In a brute force attack the attacker knows the algorithm used to generate the salted hash and has the salted hash of the password. The attacker generates a list of potential passwords and applies the password checking algorithm to each one; if the results are the same, they have guessed the user's password. If the attacker can try 20 passwords a second, they can test well over a million passwords a day on a single computer.

It is very easy to generate a list of potential passwords. One good starting point is a list of broken passwords, such as the one published by Gnosis from the attack on Gawker. The next step is a dictionary of common words and proper names. Many applications have a spelling dictionary that can be used as a starting point. Then try some simple variations like adding a number to the beginning or end of words, capitalizing letters in the word and making common substitutions for letters, such as 1 for the letter 'i' and 5 or $ for 's'.
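Here is a rough Python sketch of how such a guessing run fits together, using a made-up salt, a tiny word list and a few of the variations just described. A real attacker would use a full dictionary, the published password lists and far more transformations.

    import hashlib

    def salted_sha1(salt, password):
        return hashlib.sha1((salt + password).encode()).hexdigest()

    def candidates(words):
        # Simple variations: the word itself, capitalized, with common
        # letter substitutions, and with a digit appended.
        for w in words:
            for base in (w, w.capitalize()):
                yield base
                yield base.replace("i", "1").replace("s", "5")
                for digit in "0123456789":
                    yield base + digit

    # A hypothetical leaked salt and hash to attack.
    salt = "xyz"
    target = salted_sha1(salt, "monkey1")
    wordlist = ["password", "dragon", "monkey", "letmein"]

    for guess in candidates(wordlist):
        if salted_sha1(salt, guess) == target:
            print("cracked:", guess)
            break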

So now that you know how it is done, think about your passwords and how easily they can be attacked by brute force, and excuse me while I go and change some of mine.

Saturday, December 11, 2010

Now You See It: the Book

If you are of a data analytics bent or know someone who is and are looking for a book to put on the Christmas list, consider Now You See It: Simple Visualization Techniques for Quantitative Analysis by Stephen Few. This is a beautiful book that would not look out of place on a coffee table, yet at the same time, is full of practical information about how to do analytics with charts, graphs and other visual tools.

The book is divided into three sections. The first section covers visual perception and general visualization techniques for looking at data. Then the second section goes into more detail with chapters on specific techniques for different types of analysis including time-series analysis, ranking analysis, deviation analysis and multivariate analysis amongst others. Each chapter in this section ends with a summary of the techniques and best practices for that type of analysis. Finally the book ends with a shorter section that looks at promising new trends in visualization.

There are copious examples of graphs and charts drawn by different software tools. While some of these graphs come from high end tools like Tableau and Spotfire, others are drawn by Microsoft Excel. In fact there are several specific procedures for using features of Excel to do sophisticated analytics. That is not to say that the book suggests that you can do everything with a spreadsheet. The first part shows you what to look for in visual analytics software and is essential reading before going out and choosing which tool to use.

So, if you are looking for a quality and practical gift for an analytician, choose "Now You See It".

Tuesday, November 30, 2010

The Registration Dilemma

To register or not to register, that is the question:
Whether 'tis better to create a new online account,
or just make do with the existing ones,
and so lead a slightly less ennobled life.

Online account registration is a barrier, something that we are all thinking about as this is the season for buying stuff. As I said previously, I have about 70 online accounts where I actively maintain a user identity, and I have created many, many more. Thus every time I am presented with the choice of registering for a new site, I stop and think: do I really want to create another account? In the past couple of weeks I have decided to forgo creating 3 new online accounts and just stick to my well traveled paths.

Registration is not always thought of as a bad thing. For example, Dave McClure, Master of 500 Hats, micro venture capitalist and relentless promoter of analytics to improve web based businesses, has Activation as the second step of his 5 step program for web enterprise success. Now Activation does not necessarily imply Registration, however Registration is the most common and strongest form of Activation. Dave's perspective is that to succeed on the net, your product needs to be strong enough to overcome any barriers to Activation.

There have been many initiatives to vault over the registration hurdle. The most promising one is OpenId, an open system that allows you to use your account at one web site to log onto other web sites. A couple of years ago I thought that this was a good solution to the Single Sign-on problem and worth promoting. Now OpenId seems to be moribund and it is not widely used. I am not sure what happened, but I did hear rumors of an argument and a split which diminished the organization.

One of the problems with OpenId, or any other such system, is that it tends to favor and strengthen the big players like Yahoo and Google. Another idea that people often trot out is some form of micro-payments system that would obviate the need for registration at many sites. There are a couple of problems. Firstly, any payment is its own barrier, and creating many little barriers instead of one is not a path that is likely to lead to success. For a broader discussion of this issue I recommend the book Free by Chris Anderson.

The second problem is that a successful micro-payment system will favor and strengthen the big players that operate it. It has to be a big player, as no one is going to trust their payments to some small and unknown start-up. In practice, the only really successful micro-payment site is iTunes, and it shows up all these problems. In the beginning we all cheered as Steve Jobs took on the record companies. Now that iTunes is the leading purveyor of music, many people have taken to railing against the power of Apple.

The Registration Dilemma is this. We can either continue with the current system that has a chaos of millions of sites, each with their own registration that we need to manage, or we can give in to consolidation and just deal with a few giants. Every time I think about it, I end up siding with chaos.

Tuesday, November 16, 2010

Yeah, Yeah, Yeah

This morning I woke up to the local newspaper headline "Do you want to know a Secret?", and knew that something was going on. Later they changed their tune to something more like The Wall Street Journal, which starts their piece "Steve Jobs is nearing the end of his long and winding pursuit of the Beatles catalog." Other newspapers had headlines like "All you need is iTunes", "Let it be Available" and "Apple and The Beatles finally come together on iTunes". All in all, it seems like a bunch of stupid headline tricks from the old media, a sure sign that they are getting past it.

Meanwhile the new media is a lot more standoffish. Wired News is like "Yawn". TechCrunch is all business with "All 17 Beatles Albums Are In The Top 100 On iTunes". Of course Fake Steve Jobs had a field day, providing by far the best commentary on the whole event.

Monday, November 15, 2010

Open Source Coopetition

Coopetition is the driving force behind many of the best Open Source projects. In the past, I have written about several different reasons that Open Source projects exist. There are business models like the low cost sales channel. Open Source can act as a home for old software that is still useful, but not commercially essential to a business. There have been attempts to use Open Source as a weapon, to suck the air out of a competitor's lungs by devaluing their intellectual property, although many of these attempts have been less successful than their originators hoped.

A presentation on Hadoop got me thinking about Coopetition and Open Source. Hadoop is a big Open Source project to implement all the components of what I have called the Google Database System and a lot more. The major contributors to Hadoop are Yahoo!, Facebook and Powerset - now a part of Microsoft. While these companies are related in that Microsoft owns a stake in Facebook, has tried to buy Yahoo! and now Yahoo! uses Microsoft's Bing search engine, they are also competitors, fighting each other for the attention of web users.

So is it strange that these three companies should cooperate to build Hadoop, an incredibly useful and widely used Open Source project? Firstly, the genius of Open Source is that they are not cooperating directly with each other; they are all contributing code to a third party, the non-profit Apache foundation that oversees the Hadoop project. Secondly, by spreading the cost of the software over many contributors, they all gain much more than they contribute. Finally, many eyes and the public nature of the code tend to make it better than code that is bottled up in secret and protected from prying eyes. Because the Open Source model allows for the kind of coopetition that brings us software like Hadoop, we all benefit.

Thursday, November 11, 2010

Write Down Your Password

If Bruce Schneier says that you should write down your password, then write down your password. What he means is that given the choice between a weak password that is so easy to remember that you do not need to write it down and a strong password that you do need to write down to remember, it is better to go for the strong password. However, the problem of online identity management is much more complicated. Note that even the terminology is broken. We need to distinguish "online reputation management", which is about managing your personal brand online, from "online identity management", which is about managing how you authenticate yourself to websites. Often, the term online identity management is used for online reputation management.

The problems of online identity management start long before you need to provide a password. First you have to provide a user name. Each site has its own rules about what your user name should be. About half of web sites use an email address as a user identifier, while the other half insist that you play the game of user name roulette, where you have to keep guessing a user name until you find one that has not been taken. I have enough different user names that I have to write down my user name for each site, before even thinking about writing down a password.

The next problem is the large number of sites where you have an account. I have about 70 sites where I actively maintain a user identity, and there are many more sites where I have registered an identity and then abandoned it. Of those 70 sites, about 15 are sites like banking sites that are important to protect with a strong password.

One account that is particularly important to protect is your email account. Use a strong password with your email account and do not use that password on any other account. If your email account is compromised, you are in a lot of trouble. For example, many sites allow you to reset your password by mailing you a new one. Remember, an attacker who gains access to your email account is able to read your email, including emails from other sites where you are registered. Many sites store your email address and password, so if one of them is compromised and you use the same password everywhere, the attacker has both your email address and the password to your email account.

Another serious problem is any account that gives you access after answering security questions. The security questions are effectively another password, and they encourage answers that are easy to guess. You are better off giving nonsense answers to security questions, except for the fact that you then need to write down the answers to those questions as well. All in all, online identity management is a pain.

Sunday, October 31, 2010

The New OLAP

Just as there are new approaches to database management with the NoSQL movement, so is there a move to a new OLAP, although this movement is just emerging and has not taken a name yet. This month at the SDForum Business Intelligence SIG meeting, Flurry talked about how they put their data on mobile app usage in a giant data-cube. More recently, Chris Riccomini of LinkedIn spoke to the SDForum SAM SIG about the scalable data cubing system that they have developed. Here is what I learned about Avatara, the LinkedIn OLAP server. DJ Cline has also written a report of the event.

If you do not know what OLAP is, I had hoped to just point to an online explanation, but could not find any that made sense. The Wikipedia entries are pretty deplorable, so here is a short description. Conceptually, OLAP stores data in a multi-dimensional data cube, and this allows users to look at the data from different perspectives in real time. For example, take a simple cube of sales data with three dimensions: a date dimension, a sales person dimension, and a product dimension. In reality, OLAP cubes have more dimensions than this. Each dimension contains a hierarchy, so the sales person dimension groups sales people by state, then sales region, then country. At the base level the cube contains a data point, called a measure, for each sale of each product made by each sales person on the date when the sale was made. OLAP allows the user to look at the data in aggregate, and then drill down on the dimensions. In the example cube, a user could start by looking at the sales of all products grouped by quarter. Then they could drill down to look at the sales in the most recent quarter divided by sales region. Next they could drill down again to look at sales in the most recent quarter by sales person, comparing say the Northern region to the Western region, and so on.
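To make the roll-up and drill-down idea concrete, here is a toy Python sketch of the sales cube just described, with invented numbers. Each record is one point in the cube and the last column is the measure; a real OLAP server does the same aggregation, but pre-computed and at vastly larger scale.

    from collections import defaultdict

    # (quarter, region, sales person, product, amount)
    sales = [
        ("2010-Q3", "Northern", "Alice", "Widget", 120),
        ("2010-Q3", "Western",  "Bob",   "Widget",  80),
        ("2010-Q4", "Northern", "Alice", "Gadget", 200),
        ("2010-Q4", "Northern", "Carol", "Widget",  90),
        ("2010-Q4", "Western",  "Bob",   "Gadget", 150),
    ]

    def rollup(records, key):
        # Aggregate the measure over one dimension.
        totals = defaultdict(int)
        for rec in records:
            totals[key(rec)] += rec[4]
        return dict(totals)

    # Sales of all products grouped by quarter.
    print(rollup(sales, key=lambda r: r[0]))
    # Drill down: the most recent quarter divided by sales region.
    latest = [r for r in sales if r[0] == "2010-Q4"]
    print(rollup(latest, key=lambda r: r[1]))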

The new OLAP is enabled by the same forces that are changing databases with NoSQL. Firstly, the rise of commodity hardware that runs Linux, the commodity operating system, allows the creation of cheap server farms that encourage parallel distributed processing. Secondly, the inevitable march of Moore's Law is increasing the size of main memory, so that now you can spec a commodity server with more main memory than a commodity server had in disk space 10 years ago. An OLAP data cube can be partitioned along one or more of its dimensions to be distributed over a server farm, although at this stage partitioning is more of a research topic than standard practice. Huge main memory allows large cubes to reside in main memory, giving near instantaneous response to queries. For another perspective on in-memory OLAP, read the free commentary by Nigel Pendse at the BI-Verdict (it used to be called the OLAP Report) on "What in-memory BI ‘revolution’?"

LinkedIn is a fast growing business oriented social networking site. They have developed Avatara to support their business needs and currently run several cubes on it. The plan is to open source the code later this year. Avatara is an in-memory OLAP server that uses partitioning to provide scalability beyond the capabilities of a single server.

The presentation was fast paced and it has taken me some time to appreciate the full implications of what was said. Here are some thoughts. Avatara offers an API that is reminiscent of Jolap rather than the MDX language that is the standard way of programming OLAP, probably because an API is easier to implement than a programming language. Avatara does not support hierarchies in its dimensions, but the number of dimension in a typical cube seems to be higher than usual. It may be that they use more dimensions rather than hierarchies within a dimension to represent the same information. This is a trade off of roll-up within the cube for slicing of dimensions. Slicing is probably more efficient and easier to implement while a hierarchy is easier for the user to understand as it allows for drill up and down.

Chris mentioned that most dimensions are small, and that can be true; however, the real problems with OLAP implementations start when you have more than one large dimension and you have to deal with the issue of sparsity in the data cube. Chris spent some time on the problem of a dimension with more than 4 billion elements, and this seems to be a real requirement at LinkedIn. Current OLAP servers seem to be limited to 2 billion elements in a dimension, so they are going to be even more constraining than Avatara.

Sunday, October 24, 2010

Accidental Data Empires

In the new world of big data and analytics, a winning business model is to find a novel way to collect interesting big data. Once you have the data, the ways to exploit it are endless. It is a phenomenon that I have seen several times; the latest example is Flurry, a company that collects and aggregates data from mobile applications. Peter Farago, VP Marketing, and Sean Byrnes, CTO and Co-founder of Flurry, spoke to the October meeting of the SDForum Business Intelligence SIG on "Your Company’s Mobile App Blind Spot".

The Flurry proposition is simple: they offer a toolkit that an app developer combines with their mobile app. The app developer goes to the Flurry website, creates a free account and downloads the toolkit. Whenever an instance of the app with the Flurry code is activated or used, it collects information about the usage that is sent back to Flurry. The amount of information is small, usually about 1.2 kB compressed, so the burden of collection is light. At Flurry, the data is collected, cleansed and put in a gigantic data cube. At any time, an app developer can log into the Flurry website and get reports on how their application is being used. You can get a feel for their service by taking the short Analytics developer tour. Flurry have committed that their Analytics service will always be free.
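To give a feel for the mechanics, here is a generic Python sketch of the kind of usage ping such a toolkit might send home. The endpoint, field names and payload are hypothetical illustrations of the idea, not Flurry's actual API.

    import gzip, json, time, urllib.request

    # A small usage event, compressed before sending to keep the payload tiny.
    event = {
        "app_key": "EXAMPLE-APP-KEY",        # hypothetical field names
        "device_id": "hashed-device-id",
        "event": "app_start",
        "timestamp": int(time.time()),
    }
    payload = gzip.compress(json.dumps(event).encode())

    request = urllib.request.Request(
        "https://analytics.example.com/collect",   # hypothetical endpoint
        data=payload,
        headers={"Content-Type": "application/json",
                 "Content-Encoding": "gzip"},
    )
    # urllib.request.urlopen(request)  # not run here: the endpoint is made up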

While there are some issues with data collection that Flurry deals with, the quality of the data is great. Every mobile phone has a unique identifier so there is no problem with identifying individual usage patterns. As the service is free, there is very little friction to its use. Flurry estimates that they are in one in five mobile apps that are out there. In fact, for an app developer, the only reason for not using Flurry is that they have chosen to use a rival data collection service.

In the end however, the big winner is Flurry, who collect huge amounts of information about mobile app and phone usage. In the meeting Peter Farago gave us many different analyses of where the mobile smartphone market is and where it is going, including adoption rates for iPhones versus Android based phones and how the follow on market for apps on each platform is developing. You can get a mouthwatering feel for the information they presented by looking at their blog in which they publish a series of analyses from their data. As I write their latest post shows a graph on the "Revenue Shift from Advertising to Virtual Goods Sales" which shows that apps are growing their revenue from sales of virtual goods, while advertising revenue seems to be stagnant.

With data aggregators, there is always something creepy about discovering just how much data they have on you. Earlier this year there was an incident where a Flurry blog post described some details of the iPad, gleaned from apps running on the new devices in Apple's offices, a few days before the iPad was announced. Steve Jobs was so provoked by this that he called out Flurry by name and changed the iPhone app developer terms of service to prevent apps from collecting certain sorts of data. You can read more about this incident in the blog report on the meeting by my colleague Paul O'Rorke.

The title of this piece is a reference to the entertaining and still readable book Accidental Empires by Robert X. Cringely about the birth of the personal computer industry and the rivalry between Steve Jobs and Bill Gates.

Wednesday, October 13, 2010

A Critique of SQL

SQL is not a perfect solution, as I told the audience at the SDForum Business Intelligence SIG September meeting, where I spoke about "Analytics: SQL or NoSQL". The presentation discusses the difference between SQL and structured data on the one hand and the NoSQL movement and semi-structured data on the other. There is more to the presentation than I can fit in one blog post, so here is what I had to say about the SQL language itself. I will write more about the presentation at another time. You can download the presentation from the BI SIG web site.

Firstly the good. SQL has given us a model of a query language that seems so useful as to be essential. Every system that provides persistence has developed a query language. Here is a smattering of examples. The Hibernate object persistence system has the Hibernate Query Language (HQL), which has been developed into the Java Persistence Query Language (JPQL). Other Java based object oriented persistence systems either use JPQL or their own variant. Hive is a query interface built on top of the Hadoop Map-Reduce engine. Hive was initially developed by Facebook as a simplified way of accessing their Map-Reduce infrastructure when they discovered that many of the people who needed to write queries did not have the programming skills to handle a raw Map-Reduce environment. XQuery is a language for querying a set of XML documents. It has been adopted into the SQL language and is also used with stand alone XML systems. If data is important enough to persist, there is almost always a requirement to provide a simple and easy to use reporting system on that data. A query language handles the simple reporting requirements easily.

On the other hand, SQL has many problems. Here are my thoughts on the most important ones. The first problem is that SQL is not a programming language, it is a data access language. SQL is not designed for writing complete programs; it is intended to fetch data from the database, and anything more than a simply formatted report is done in another programming language. This concept of a data access language for accessing a database goes back to the original concept of a database as promulgated by the CODASYL committee in the late 1960's.

While most implementations of SQL add extra features to make it a complete programming language, they do not solve the problem, because SQL is a language unlike any of the other programming languages we have. Firstly, SQL is a relational language. Every statement in SQL starts with a table and results in a table. (Table means a table like in a document: a fixed number of columns and as many rows as are required to express the data.) This is a larger chunk of data than programmers are used to handling. The procedural languages that interface to SQL expect to deal with data at most a row at a time. Also, the rigid table of SQL does not fit well into the more flexible data structures of procedural languages.

Moreover SQL is a declarative language where you specify the desired results and the database system works out the best way to produce them. Our other programming languages are procedural where you describe to the system how it should produce the desired result. Programming SQL requires a different mindset from programming in procedural languages. Many programmers, most of whom just dabble in SQL as a sideline, have difficulty making the leap and are frustrated by SQL because it is just not like the programming languages that they are used to. The combination of a relational language and a declarative language creates a costly mismatch between SQL and our other programming systems.
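A small sketch in Python with SQLite illustrates the mismatch: the SQL statement declares a whole result table, while the host program consumes it procedurally, one row at a time.

    import sqlite3

    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE sales (person TEXT, region TEXT, amount REAL)")
    conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                     [("Alice", "North", 120), ("Bob", "West", 80),
                      ("Carol", "North", 90)])

    # Declarative: say what result table you want, not how to compute it.
    cursor = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")

    # Procedural: the application walks the result a row at a time.
    for region, total in cursor:
        print(region, total)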

Finally, SQL becomes excessively wordy, repetitive and opaque as queries become more complicated. Sub-queries start to abound, and the need for correlated sub-queries, outer joins and pivoting data for presentation causes queries to explode in length and complexity. Analytics is the province of complicated queries, so this is a particular problem for data analysts. In the past I have suggested that persistence is a ripe area for a new programming language, however although there are many new languages being proposed, none of them are concerned with persistence or analytics. The nearest thing to an analytics programming language is R, which is powerful but neither new nor easy to use.

Wednesday, October 06, 2010

Vertical Pixels are Disappearing

The quality of monitors for PCs is going backwards. A few years ago, noticing the rise of the widescreen monitor and fearful that all reasonably proportioned monitors would soon disappear, I bought a Samsung Syncmaster 204B (20.1" screen, 1600x1200 pixels). Last night it popped and stopped working. When I went online to research a replacement monitor, the full gravity of the situation became obvious.

Not only is it virtually impossible to find a monitor that is not widescreen, almost all monitors that you can buy, whatever the size of their screen, are 1920x1080 pixels. In the years since I bought the 204B, the number of pixels that we get in the vertical direction has shrunk from 1200 to 1080! Funnily enough, there is a post on Slashdot this morning titled "Why are we losing our vertical pixels" about this topic. The post has drawn many more than the usual number of comments.

For me, the vertical height of the screen is important. I use my computer for reading, writing, programming, editing media and some juggling with numbers. For each activity, having a good height to the screen helps, and width after a certain point does not add much. A television uses 1920x1080 pixels for a full 1080p display. The monitor manufacturers are just giving us monitors made from cheap LCD panels designed for televisions. When I watch TV, I have a much larger screen in another room with more comfortable chairs and more room between me and the screen. Thus, I do not need or want a computer monitor that is expressly designed for watching TV.

The real problem is that 1920x1080 monitors are so ubiquitous that it is difficult to find anything else. After a lot of searching I only found a couple of alternatives. Apple has a 27" widescreen monitor that is 2560x1440 pixels at a cost of ~$1000, and it only works well with some recent Apple systems. Dell has a 20" monitor in their small business section that is 1600x1200 and costs ~$400. However, Dell seems to vary the type of LCD panel that they use between production runs, and one type of panel is a lot better than the other. Unfortunately, you do not know which type of panel you are going to get until it arrives at your door. Neither alternative gets me really excited. One thing is certain: technology is supposed to be about progress, and I am not going backwards and accepting fewer pixels in any dimension for my next monitor.

Thursday, September 30, 2010

Tablet Aspect Ratios

One important issue with tablet computers that is getting little attention is the screen aspect ratio. Some time ago I wrote about "aspect ratio hell" while trying to decide how to crop holiday photographs. The answer seems to be that you have to crop each photograph independently for each way the photograph is going to be output or displayed. For photographs, the variety of different aspect ratios is a perplexing problem that has no good answer.

Tablet computers have the same problem, except that the responsibility lies with app developers, who need to make their apps work well with the aspect ratios of their target platforms. The aspect ratio for a tablet needs to take into account that it will be used in both portrait and landscape mode. The iPad has an aspect ratio of 4:3 (AR 1.33...), which is the same as the iPod Classic, while the iPhone and iPod touch have an aspect ratio of 3:2 (AR 1.5). Anyone trying to develop apps for Apple products needs to take this difference into account. On the other hand, Samsung has announced an Android based tablet and BlackBerry a tablet of its own, both with 7 inch screens whose aspect ratio is 128:75 (AR 1.706...), which is close to 16:9 (AR 1.77...).
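The arithmetic is easy to check. The short Python sketch below reduces a pixel resolution to an aspect ratio; the resolutions are my assumptions for the devices mentioned, not figures quoted from the manufacturers.

    from math import gcd

    def aspect_ratio(width, height):
        # Reduce a resolution to its aspect ratio and decimal value.
        d = gcd(width, height)
        return "%d:%d" % (width // d, height // d), round(width / height, 3)

    # Assumed screen resolutions for the devices discussed above.
    screens = [
        ("iPad", 1024, 768),
        ("iPhone / iPod touch", 480, 320),
        ("7 inch tablet", 1024, 600),
        ("1080p television", 1920, 1080),
    ]
    for name, w, h in screens:
        print(name, *aspect_ratio(w, h))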

When we look to media, television uses 16:9 and most cinema has a higher ratio like 2.40:1, except for IMAX (AR 1.44), which is much squarer. Books and newspapers use a 3:2 ratio (AR 1.5), while magazines tend to be broader with a lower aspect ratio. Frankly, anything with an aspect ratio of more than 3:2 tends to look unnaturally skinny when viewed in portrait mode. A cell phone can get away with a higher aspect ratio because it has to be pocketable, but a larger device meant for viewing media in both landscape and portrait mode needs to keep its aspect ratio to 3:2 or less. For example, the Kindle, which is mostly used in portrait mode, has an aspect ratio of 4:3 (AR 1.33...). From this point of view, the Samsung and BlackBerry tablets seem to be designed to be used in landscape mode and not in portrait mode. I hope that other tablet makers do not make the same mistake.

Saturday, September 04, 2010

Understanding the iPad

Some people still struggle to understand the iPad. When it was first announced, there were shrieks of outrage from techies, complaining that it was not a free and open computer system and so nobody should buy one. Then it came out and was adopted by the millions. Steve Ballmer, CEO of Microsoft, expressed dismay that the iPad is easily outselling any tablet computer that Microsoft has ever had a hand in. More recently an executive from LG told the Wall Street Journal that they would bring out a tablet that would be better than the iPad because it would be oriented towards content creation rather than content consumption.

Then there are many people who get it. For example, in an interview with Chris O'Brien of the San Jose Mercury News, Jerry Kaplan, founder of Go Computing, which made an early slate computer, understood that the iPad is oriented toward media consumption, as opposed to the more general purpose Go slate computer. My belief is that the iPad is a new category of device that addresses a new market.

Last year I wrote about Media Convergence, the idea that in the past, each type of media was different. Books were bound paper sold by booksellers, video was delivered as movies in movie theaters and broadcast as television, records were vinyl goods sold in record stores and heard over the radio, magazines were sold by booksellers or delivered by mail, and newspapers had their own content delivery network to ensure that everybody got the previous day's news by the following morning. With the digital revolution, all these different types of media are now the same. They are all just buckets of digital bits that are delivered through the Internet. Given this, the next thing we need are devices for consuming all this media. Audio just needs a device the size of your thumb and headphones, whereas video, books, magazines and so on need a screen that is big enough to see, and that is what the iPad is for.

When thinking about these things, I find it useful to draw up some requirements and use cases and then see how the offered devices match those requirements. Here is what I want from my Personal Information Appliance (PIA - remember that acronym).
  1. Light enough that I can lie in bed and read or view media with it.
  2. Instant on, long battery life, able to handle all media types.
  3. Get media without having to plug it into anything else.
  4. A screen large enough to read or view and small enough to make the device portable.
So how does the iPad match these requirements? At 1.5 pounds it is a little heavier than most "light" reading, but there are plenty of hardback books that weigh more. For the second requirement, Adobe Flash is the major missing media type; however, there is probably an app to do that. As for screen size, we are going to have to resign ourselves to having multiple devices with different screen sizes until they work out the technology to project images directly onto the retina.

The funny thing is that even though the iPad is specced as a device for consuming media, it turns out to be capable of much more. Computer games are the newest type of media and the iPad is a great games platform with a lot of future, as Steve Jobs boasted at the recent iPod announcement event. There are many instances in the business world where it will be useful, for example in sales and marketing for giving a presentation or demonstration to an individual. The other day I was astonished to find my boss using his iPad for email while waiting for his laptop to be repaired.

Tuesday, August 31, 2010

Software Update Business Models

These days software updates are a fact of life. If we do not keep our software up to date, we risk all sorts of horrendous infections and debilitating attacks. Unfortunately, the providers of our software know this and are starting to use software updates to make money, or at least to remind us that they exist. I have done several software updates recently and noticed this in action.

Adobe just wants to remind me of their presence, so they insist on putting a shortcut to the Adobe Reader on my desktop every time they update. This is relatively benign, as it is a matter of a few seconds at most to confirm that it is a shortcut and delete it. Apple is more pushy. I expect to get a new version of iTunes any day now, and I will need to carefully uncheck boxes to ensure that I do not get several applications more than I want. Most insidious is Java, now owned by Oracle. On one system they offered me the Yahoo toolbar; on another system, which already had the Yahoo toolbar, they offered me some other software, so they obviously look at what is installed to guide the offer. Judging by the fact that these offers were for third party software, I am sure that they get some sort of compensation for it.

Soon we will see advertisements and offers in the installer, and new ways to confuse us. The tactic that always gets me is to require some input that I forget to fill in; then, when I go back to fill in this information, all the boxes I so carefully unchecked have been mysteriously checked again. In a hurry, I just click "Install", not noticing that I am now getting all the extras that I had carefully tried to avoid. It is coming to a computer near you soon.

Saturday, August 28, 2010

Mad Skills for Big Data

Big Data is a big deal these days, so it was with great interest that we welcomed Brian Dolan to the SDForum Business Intelligence SIG August meeting to speak on "MAD Skills: New Analysis Practices for Big Data". MAD is an acronym for Magnetic, Agile, Deep, and as Brian explained, these skills are all important in handling big data. Brian is a mathematician who came to Fox Interactive Media as Lead Analyst. There he had to help the marketing group decide how to price and serve advertisements to users. As they had tens of millions of users that they often knew quite a lot about, and served billions of advertisements per day, this was a big data problem. They used a 40 node Greenplum parallel database system and also had access to a 105 node map reduce cluster.

The presentation started with the three skills. Magnetic means drawing the analyst in by giving them free rein over their data and access to use their own methods. At Fox, Brian grappled with a button-down DBA to establish his own private sandbox where he could access and manipulate his own data. There he could bring in his own data sets, both internal and external. Over time the analysts group established a set of mathematical operations that could be run in parallel over the data in the database system, speeding up their analyses by orders of magnitude.

Agile means analytics that adjust, react and learn from your business. Brian talked about the virtuous cycle of analytics, where the analyst first acquires new data to be analyzed, then runs analytics to improve performance, and finally the analytics cause business practices to change to suit. He talked through the issues at each step in the cycle and led us through a case study of audience forecasting at Fox which illustrated problems with sampling and scaling results.

Deep analytics is about producing more than reports. In fact Brian pointed out that even data mining can concentrate on finding a single answer to a single problem, whereas big analytics needs to solve millions of problems at the same time. For example, he suggested that statistical density methods may be better at dealing with big analytics than other more focused techniques. Another problem with deep analysis of big data is that, given the volume of data, it is possible to find data that supports almost any conclusion. Brian used the parable of the Zen Tea Cup to illustrate the issue. The analyst needs to approach their analysis without preconceived notions or they will just find exactly what they are looking for.

Of all the topics that came up during the presentation, the one that caused the most frissons with the audience was dirty data. Brian's experience has been that cleaning data can lose valuable information and that a good analyst can easily handle dirty data as a part of their analysis. When pressed by an audience member he said "well 'clean' only means that it fits your expectation". As an analyst is looking for the nuggets that do not meet obvious expectations, sanitizing data can lose those very nuggets. The recent trend to load data and then do the cleaning transformations in the database means that the original data is in the database as well as the cleaned data. If that original data is saved, the analyst can do their analysis with either data set as they please.

Mad Skills also refers to the ability to do amazing and unexpected things, especially in motocross motor bike riding. Brian's personal sensibilities were more forged in punk rock, so you could say that he showed us the "kick out the jams" approach to analytics. You can get the presentation from the BI SIG web site. The original MAD Skills paper was presented at the 2009 VLDB conference and a version of it is available online.

Monday, August 23, 2010

End of Moore's Law

The recent announcement that Intel is buying McAfee, the security software company, has the analysts and pundits talking. The ostensible reason for the deal is that Intel wants the security company to help them add security to their chips. Now, while security is important, I do not believe that is the reason Intel bought McAfee. In my opinion, this purchase signals that Intel sees the coming end of Moore's Law.

In 2005, the Computer History Museum celebrated 40 years of Moore's Law, the technology trend that every 2 years the number of transistors on a silicon chip, and thus its capabilities, doubles. On the stage Gordon Moore told us that throughout those 40 years, "they have always been able to see out about 3 generations of manufacturing technology", where each generation is about 2 years. So Intel can see its technology path for about the next 6 years. At that time Moore told us that they could still see how they were going to carry on Moore's Law for the next three generations.

Now what would happen if Intel looked 6 years into the future and saw that it was no longer there? Suppose they could see the end of Moore's Law, which would mean that they would no longer have the ability to create new and more powerful chips to keep their revenue growing. I believe that they would start looking to buy up other profitable companies in related lines of business to diversify their revenue.

McAfee is a large security software company whose main business is selling security solutions to large enterprises. If Intel had wanted to buy security technology, they could have gone out and bought a security start-up with better technology than McAfee for a few hundred million dollars. Instead they are spending an expensive 8 billion dollars on an enterprise security software company. This deal does not make sense for the reasons given; however, it does make sense if Intel wants to start buying its way into other lines of business.

Now there are many reasons that Intel might want to diversify their business. Perhaps they see the profitable sales of processor chips disappearing as chips gain so many transistors that they do not know what to do with them. However, the most likely reason is that they can see the end of Moore's Law and that it is now time to move on and add some other lines of business.

Saturday, August 14, 2010

Analytics at Work

Analytics has become a major driving force for competitive advantage in business. The new book "Analytics at Work: Smarter Decisions, Better Results" by Thomas H. Davenport, Jeanne G. Harris and Robert Morison discusses what analytics can do for a business, how to manage analytics and how to make a business more analytical.

Analytics at Work has a useful introductory chapter and then divides into two parts. The first part discusses five major aspects of analytics in a business environment. The second part looks at the lifecycle of managing analytics in a business. The organization is good and there is no overlap between the topics in each part; however, the order in which the information is presented seems designed to put the reader off.

The first part starts with a plodding chapter on what needs to be done to get the data organized and related topics, followed by a diffuse chapter called Enterprise. The interesting chapters in this part are the last two chapters. The Targets chapter discusses the important topic of picking targets for analytics. The Analysts chapter discusses how to effectively employ and organize analysts in a large enterprise. Similarly the second part of the book starts with a plodding chapter on how to Embed Analytics in Business Processes, followed by much more inspiring chapters on building an analytical culture, and the need to continually review a business comprehensively as part of an analytics push. If you find yourself stuck reading the book, try skipping to one of the interesting chapters that I have indicated.

Scattered throughout the book are many useful tools. In the introductory chapter there are the six key questions that an analyst asks. We come back to these questions from several places in the book. Running throughout the book is a five step capability maturity model for judging how analytical an organization is and showing the path to making the organization more analytical. Each chapter in the first part ends with a discussion on how to take that aspect of the organization through the five steps.

It is important to understand the target audience. The book is aimed at senior management and executives, particularly in large enterprises. While the book contains many brief case studies as inspiration and touches on all the important management issues that need to be considered, it does not go into great depth about what analytics is or about specific analytical techniques and how they can be used. This is not a book for analysts, unless they have ambitions to grow their career beyond analytics. I recommend this book to anyone in the target audience who wants to grow their organization's analytics capabilities.

Saturday, July 24, 2010

Data Management in the Cloud

Over the last couple of years, I have seen several presentations on the computing Cloud and how it is the next big thing. Even so, I realized that I still have a lot to learn from Daniel Graham's presentation "Data Management in the Cloud" at the July meeting of the Business Intelligence SIG. Dan leads Active Data Warehouse marketing programs for Teradata. If you have been living under a rock and do not know what cloud computing is, Wikipedia has a reasonable explanation. Dan distinguished between the public cloud, a rentable computing resource like Amazon's Elastic Compute Cloud, and a private cloud, which is your business's computing resources in a datacenter behind the company firewall, using virtualization software like VMWare to allow many applications to share hardware.

The big picture that Dan painted is that cloud computing is coming and that you need to get ready for it. By 2015, 20% of computing resources worldwide will be in the cloud. Start now by getting experience with the cloud to find out what works, what needs to be changed to make it work, and what does not work. Teradata has been experimenting with cloud computing and is working with hardware and software vendors like VMWare and Amazon to ensure that Teradata database systems work well in the cloud. Informatica is another example of a software vendor that is working to ensure that their data integration software works well in the cloud and between clouds. NetFlix is an example of a company that has adopted cloud computing and recently announced that they are moving all their movie hosting into the Amazon computing cloud. The US Government is the leading user of cloud services, having moved much of their computing needs into the cloud.

Cloud computing uses commodity hardware, which combined with the overhead of virtual machine software will not give you the best performance; however, it is "good enough" for most applications. Dan took the well known quote from the movie Forrest Gump and bent it to his needs: “Clouds are like a box of chocolates. You never know what you're gonna get.” There is some high end software that is not suitable for cloud computing, the main problem being high IO requirements. The size and capabilities of a cloud computing host are often optimized to run a single instance Oracle database doing OLTP, and in practice most applications are less demanding than this.

There were many other interesting tidbits in the presentation. Here are some examples. It is more expensive to get data out of a cloud than to bring it in; why is unclear, but it is something to take into consideration when using a cloud. An interesting application for cloud computing is what Dan called "Workload Isolation". The idea is that when you have partners or consultants who need access to your data, it is often preferable to put the data they need in the cloud rather than let them inside your firewall. In all the examples that Dan showed of Business Intelligence applications in the cloud, he talked about a Data Mart, with the implication that a full Enterprise Data Warehouse is too large and demanding an application for the cloud for now.

The slides from the presentation are available at the SDForum Business Intelligence SIG web site.

Thursday, July 15, 2010

Another Angle on the iPhone Woes

In all the discussion over the antenna problems with the new Apple iPhone 4, there is one thing that I have not heard, and that is how few product lines Apple has. The iPhone 4 is a prime example. It comes in two memory sizes and we are promised a second color real soon now. On the Apple site you can still buy the previous generation iPhone 3GS, but that still makes only 3 models available, with another one or two to come. Compare this with BlackBerry which has six ranges and thirteen models, and BlackBerry is restrained in the number of models it offers compared to the likes of Motorola, Samsung or Nokia.

The same is true in other areas. With the Mac, Apple has three ranges of laptops, three ranges of desktops and one rackable server. Compared to HP, Dell or Acer this is a ridiculously small number of product lines. Again, with the iPod there are 4 product ranges, each with a couple of different memory sizes and some variation in color.

There are a number of advantages in having a small number of product lines. Economy of scale will make the product cheaper to manufacture, however by the time you get to millions of devices, the additional advantage is not that great. More important are a brutally strong brand image and a lack of consumer confusion. There is no question about which version of the iPhone to get, the only question is whether you are willing to pay more for the extra memory.

However, there is one big disadvantage to having a single product line like the iPhone, and that is that you have all your eggs in one basket. If the product should prove to have a flaw, there is no other product line to take up the slack. If a consumer wants to buy an iPhone now, they either have to go ahead and take the risk that it might be a lemon or wait until the problem is fixed. They cannot go out and buy the other Apple phone because it does not exist. For a big consumer goods company, Apple has had remarkably few dud products, but their life depends on getting each one right.

Monday, July 05, 2010

The HP Tablet and the Elephant

Recently HP bought Palm and in the acquisition press release announced that they are developing "... webOS based hardware products, from a robust smartphone roadmap to future slate PCs and netbooks". In all the discussion of this event, nobody seems to be discussing the elephant in the room, or more correctly, the elephant who is no longer in the room.

Ten years ago, HP would not have dared announce that it was going to produce its own operating system (OS) in competition with the dominant Microsoft Windows OS. Then, most hardware developers had been cowed by Microsoft's aggressive and successful response to any attempt to develop a rival operating system. To give a couple of examples, in the early 90's the Go Corporation had developed its PenPoint OS for handheld computing. Then in 1992, Microsoft announced its own Windows for Pen Computing. Go Corporation faltered, was taken over by AT&T, and then the project was shuttered. Another example is the fate of Be Inc., which had developed BeOS, initially to power its own hardware. In 2002, Be Inc. sued Microsoft, claiming that Hitachi had been dissuaded from selling PCs loaded with BeOS and that Compaq had been pressured to not market an Internet appliance in partnership with Be. The case was eventually settled out of court with no admission of liability on Microsoft's part. However, by this time Be Inc. had admitted defeat and sold its intellectual property to Palm Inc.

In the late 90's Microsoft was so dominant that no Silicon Valley venture capital firm would fund a start-up that had the remotest chance of challenging Microsoft in any way. Since then Microsoft seems to have been transformed from a lithe competitor into a stumbling giant. The Vista version of the Windows OS is widely regarded as a failure, and was quickly replaced by Windows 7. While the Windows Mobile OS for smartphones has been around for a long time and gone through several versions, it has been losing market share for some time. Recently Microsoft introduced a new smartphone, the Kin, with much ballyhoo, only to give up on it six weeks later. There are plenty of other examples of Microsoft's left hand not knowing what the right hand was doing, like the PlaysForSure debacle.

We have come to the point where Microsoft is so crippled by its own self-inflicted wounds that one of its most important OEM customers is going to use its own operating system on future slate PCs and netbooks. The elephant is no longer in the room.

Saturday, June 26, 2010

Winning With Big Data

Michael Driscoll gave us the Secrets of a Successful Data Scientist at the June meeting of the SDForum Business Intelligence SIG in his talk "Winning With Big Data". Michael is the founder of the data consultancy Dataspora, where he has worked on projects ranging from analyzing baseball pitchers to helping cell phone companies understand their customer churn. You can see slides for the talk here, and follow Michael's thoughts in his excellent blog on the Dataspora site.

After Michael revved up the crowd by giving the Hal Varian quote that "... the sexy job in the next ten years will be statisticians", he went through 9 ways to win as a Data Scientist. His first suggestion is to use the right tools, and Michael uses a variety of them, including database systems, Hadoop and the R language. Large data sets take a long time to process and often we can gain insights by just working with a sample of the data; however, you have to be careful when taking a sample to ensure that it makes sense and that the results will scale. Which leads us to another way to win: know, understand and use statistics.
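To illustrate why a careless sample can mislead, here is a small Python sketch of my own (not from Michael's talk) using made-up numbers. A uniform random sample tracks the true average, while grabbing the first rows of time-ordered data does not:

import random

random.seed(42)
# Made-up "large" data set of call durations, ordered by time, where
# later calls are systematically a little longer than earlier ones.
data = [60 + 0.0001 * i + random.gauss(0, 5) for i in range(1_000_000)]
true_mean = sum(data) / len(data)

# A uniform random sample preserves the average well.
random_sample = random.sample(data, 10_000)

# The "first 10,000 rows" look like a sample but are biased, because
# early rows differ systematically from later ones.
head_sample = data[:10_000]

print(f"true mean:           {true_mean:.1f}")
print(f"random sample mean:  {sum(random_sample) / len(random_sample):.1f}")
print(f"first-rows 'sample': {sum(head_sample) / len(head_sample):.1f}")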

Statistics is a field of mathematics that is still developing and it is not easy; however, statistics is a core competence of a Data Scientist. It is not enough to do the analysis; the Data Scientist has to be able to present the results and turn them into a compelling story. Both analysis and presentation require good visualization tools and the knowledge of how to use them.

To illustrate his ways to win, Michael led us through a specific example of a successful data analysis that he had done. He had been asked by a cell phone company to investigate customer churn. Although he looked at the data in several different ways, his successful analysis went as follows. The starting point was the Call Data Record (CDR), which records each call that a customer makes. Cell phone traffic generates billions of CDRs, so Michael first cut the data set down to a more manageable size by just looking at the CDRs for a single city. He then built social graphs between customers that call one another frequently, and was able to show that if one customer dropped the service, it was a predictor that other customers in that social graph would also leave. The study ended with a clever visualization of connected customers leaving the cell phone provider.
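Here is a minimal Python sketch of my own showing the shape of that analysis, with a handful of made-up CDRs and the networkx library; it is an illustration of the approach, not Dataspora's actual code:

import networkx as nx
from collections import Counter

# Hypothetical, tiny set of CDRs for one city: (caller, callee) pairs.
cdrs = [
    ("alice", "bob"), ("alice", "bob"), ("alice", "bob"),
    ("bob", "carol"), ("bob", "carol"),
    ("carol", "alice"),
    ("dave", "erin"),
]
churned = {"alice"}  # customers known to have dropped the service

# Build a social graph, keeping only pairs that call each other frequently.
call_counts = Counter(frozenset(pair) for pair in cdrs)
graph = nx.Graph()
graph.add_edges_from(tuple(pair) for pair, count in call_counts.items() if count >= 2)

# Customers directly connected to a churner are flagged as churn risks.
at_risk = set()
for customer in churned:
    if customer in graph:
        at_risk.update(graph.neighbors(customer))

print(sorted(at_risk - churned))  # ['bob'] in this toy example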

Thursday, June 24, 2010

Which Cloud Standards Matter?

The SDForum Cloud Services SIG June meeting was a panel session with multiple speakers devoted to the question "Which Cloud Standards Matter?". The answer came through loud and clear as speaker after speaker discussed the Open Virtualization Format (OVF). No other standard got more than a mention or so.

OVF is a container format that defines the contents of a virtual machine. It is simply a set of files in a directory plus an XML descriptor file. The standard is managed by the Distributed Management Task Force (DMTF). Panel speaker Priya Ketkar of Abiquo showed OVF being used to move a virtual machine from one cloud service provider to another. Winston Bumpus, the final panel speaker, is President of the DMTF and Director of Standards Architecture for VMWare. He made a convincing case for the DMTF and its management of the OVF standard.
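To make that concrete, here is a small Python sketch that pokes around a hypothetical OVF package: it lists the files in the directory and reads the virtual system names out of the XML descriptor. The directory name and file layout are made up for illustration; the .ovf descriptor and the DMTF envelope namespace are the parts that come from the standard.

import os
import xml.etree.ElementTree as ET

package_dir = "myapp_ovf"  # hypothetical directory holding the package
OVF_NS = "{http://schemas.dmtf.org/ovf/envelope/1}"

# An OVF package is just ordinary files: disk images, a manifest, and
# one XML descriptor with the .ovf extension.
files = sorted(os.listdir(package_dir))
descriptor = next(f for f in files if f.endswith(".ovf"))
print("package contents:", files)

# The descriptor's root is an Envelope element describing the disks,
# networks and virtual hardware of the virtual machine(s) it contains.
tree = ET.parse(os.path.join(package_dir, descriptor))
for system in tree.getroot().iter(OVF_NS + "VirtualSystem"):
    print("virtual system:", system.get(OVF_NS + "id"))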

Another panel member, James Urquhart of Cisco, mentioned several standards including OVF, however he spent considerable time on XMPP, surely the most unlikely standard for cloud computing. I discussed XMPP some time ago. It is a standard for exchanging instant messages and Twitter feeds between large service providers. While it is a useful standard, I do not see its place in cloud computing. If you can explain how XMPP helps cloud computing, please enlighten me.

Sunday, June 13, 2010

Reporting from the Production Database

Salesforce.com does their analytics directly out of their production database. For me, this was the interesting story that emerged from the talk on "Real Time Analytics at Salesforce.com" at the May meeting of the SDForum Business Intelligence SIG. Note that this post is not a report on the meeting, rather it is a reflection on a topic that came up during the meeting. Both my co-chair Paul O'Rorke and SIG member James Downey have written great summaries of the meeting.

Directly reporting from a production database is an issue that comes up from time to time. Deciding whether to do it is a two step process. The first question is whether it is possible. A database can be oriented to report the current state of affairs or, alternatively, to contain a record of how we got to the current state of affairs. In practice we need both views, and it is common to have a production database that is oriented to maintaining the current status and a data warehouse that maintains the historical record. Typically an enterprise has several databases with production information, and the historical record is combined in a single reporting data warehouse.

The tension between the requirements for production and reporting databases shows up in a number of ways. Production needs fast transaction execution. One way to achieve this is to make the database small, cutting out anything that is not really needed. On the other hand, we want to keep as much information as possible for reporting, so that we can compare this time period with a year ago or maybe even two years ago. Reporting wants a simple database structure like a star schema that makes it straightforward to write ad-hoc queries that generate good answers. Production databases tend to have more interlinked structures.
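As a rough illustration of the two styles (a sketch of my own with a made-up schema, not Salesforce.com's): the production query below touches one row inside a short transaction, while the reporting query joins the star schema and aggregates across the fact table.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  -- Tiny star schema: one fact table plus a date dimension.
  CREATE TABLE dim_date  (date_key INTEGER PRIMARY KEY, year INTEGER, month INTEGER);
  CREATE TABLE fact_sales(date_key INTEGER, customer_id INTEGER, amount REAL);
  INSERT INTO dim_date   VALUES (1, 2010, 5), (2, 2010, 6);
  INSERT INTO fact_sales VALUES (1, 100, 25.0), (1, 101, 40.0), (2, 100, 10.0);
""")

# Production-style query: a short transaction that reads and writes a few rows.
with conn:
    conn.execute("UPDATE fact_sales SET amount = amount + 5 "
                 "WHERE customer_id = ? AND date_key = ?", (100, 2))

# Reporting-style query: read only, but it scans and aggregates the fact table.
report = """
  SELECT d.year, d.month, SUM(f.amount)
  FROM fact_sales f JOIN dim_date d ON f.date_key = d.date_key
  GROUP BY d.year, d.month ORDER BY d.year, d.month
"""
for year, month, total in conn.execute(report):
    print(year, month, total)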

Salesforce.com is in the business of Customer Relationship Management (CRM), where it is useful to keep the historical record of interactions with each customer. As Salesforce.com has the historical record in their production database, reporting from that database makes perfect sense. In fact, much of the impetus for real time data warehousing has come from CRM-like applications. One common example is where a business wants to drive call center applications from data in their data warehouse.

The next question is whether it is a good idea to combine reporting and production queries in the same database. Production queries are short, usually reading a few records and then updating or inserting a few records. Reporting queries are read only, but they are longer running and may touch many records to produce aggregate results. A potential issue is that a longer running reporting query may interfere with production queries and prevent them from doing their job. This is the other major reason for doing reporting from a separate database rather than from the production database.

The Oracle database used by Salesforce.com has non-blocking, multi-version reads, so that read-only queries do not lock out queries that update the database. Also, as came out in the presentation, Salesforce.com has a multi-tenant database where each customer customizes their use of data fields in different ways. Because of this, they sometimes copy the data out of the big table into a smaller temporary table to transform the data into the form that the customer's queries expect. Making a copy of the relevant data for further massaging is a common tactic in data reporting tools, so this is not unusual. It also gets the reporting data out of the way of the production data so the two do not interfere with one another.
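Here is a sketch of that tactic with a made-up multi-tenant table (my own illustration, not Salesforce.com's implementation): copy one tenant's slice of the big generic table into a small working table shaped for the report, then report against the copy.

import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
  -- Multi-tenant style storage: generic "flex" columns shared by all tenants.
  CREATE TABLE big_table (tenant_id INTEGER, flex1 TEXT, flex2 TEXT);
  INSERT INTO big_table VALUES
    (1, 'open',   '250.0'),   -- tenant 1 uses flex1 as status, flex2 as amount
    (1, 'closed', '75.5'),
    (2, 'red',    'north');   -- tenant 2 uses the same columns differently

  -- Copy tenant 1's rows into a small temporary table, casting the generic
  -- columns into the typed form that this tenant's report expects.
  CREATE TEMP TABLE report_rows AS
    SELECT flex1 AS status, CAST(flex2 AS REAL) AS amount
    FROM big_table WHERE tenant_id = 1;
""")

# The report runs against the small copy, out of the way of the big table.
for status, total in conn.execute(
        "SELECT status, SUM(amount) FROM report_rows GROUP BY status"):
    print(status, total)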

Finally, Salesforce.com is large enough that they can afford the luxury of having a performance team whose sole purpose is to look at the queries that take the longest to run or use up the most resources. Any database application requires some performance tuning, however it is especially important when doing reporting from a production database.

Thursday, June 10, 2010

Google's Got Background

Go away for a few days and when I come back, Google looks like Bing. Instead of a restful blank page they had a background picture. Arrgh! Fortunately, it lasted for less than a day, and then we went back to the blank page we knew and loved.

Actually it is very clever. Firstly, it tells people who might be attracted to Bing by the ability to customize how the page looks that they can do the same thing with Google. Secondly, and more importantly, it encourages people to create and log in to their Google account so that they can customize their Google home page. Google can give you a better search experience when it knows who you are, and it can make more money from the advertisements that are pitched at you when it knows who you are.

I thought of trying to customize the page to something less distracting when I realized that I would have to give up something of my identity in exchange. On weighing this transaction I decided that what I would give up outweighed the benefit, particularly as the backlash would probably make the background image a short-lived experiment.

Sunday, May 30, 2010

Dancing About Architecture

"Writing about music is like dancing about architecture." is one of these quotes that never seem to die. Last week I heard it again while listening to a podcast. The quote is attributed to many people including Elvis Costello in a 1983 interview, although the origin seems to be older, perhaps much older than that. More interesting, from reading the linked piece is that someone tested dancing about architecture to see if it "was really so strange".

Whoever said it, they certainly caught the truth that written words cannot adequately capture an aural sensation. A great illustration of this is the 1998 interview* of Ray Manzarek of The Doors by Terry Gross on the Fresh Air radio program. Manzarek describes, with the help of a keyboard, how the Doors worked as a group and how they wrote the song "Light My Fire". A written transcript of this interview would be unintelligible, whereas the audio interview is a revelation. Terry Gross has recorded many interviews with musicians where they play their music, and they are all worth hearing.

Although I spend plenty of time listening, I have never found it useful to read about music. That is not to say that there cannot be good writing about music. In my experience the best has been in fiction, particularly the novels of Ian McEwan. In "Saturday", there are a few pages with a magical description of a rock band performing one song, followed by a meditation on how certain passages of music, and certain performances can affect us to the core.

A good part of "Amsterdam" is about the process of composing a classical symphony. While I have never composed music, I do design and write software, and there are similarities between the processes. I will write more about this another time. Unfortunately, both the novel and the symphony are cut short by the book's annoying ending.

* Unfortunately, to listen to this piece, you have to have the RealPlayer. I have it because I installed the BBC iPlayer to listen to old comedy shows, including the Goon Show. If you do not want the RealPlayer, here is another piece from NPR Music about "Light My Fire" with more palatable download requirements.

Wednesday, May 19, 2010

The App Economy

The evolution of the App Economy is a marvelous thing to watch. In March I questioned whether apps for the iPad would develop with the same strength as apps for the iPhone, because more content is accessible through the browser. Jacob Weisberg at Slate discussed the same thing recently in more depth. He exhorts publishers to beware of getting tangled up with Apple for both monetary and censorship reasons.

On the other hand, web content is not fully available on the iPad. Steve Jobs has denigrated Flash for being slow, buggy and inefficient, and has sworn that it will never be seen on the iPad. In its place Jobs suggests HTML5. The problem is that HTML5 does not do everything that Flash does. This recent piece on Apple Insider explains the shortcomings of HTML5 and why Hulu will not be using it any time soon for their video distribution.

If Hulu cannot use Flash, then its only alternative is to develop an App, which it is reportedly doing. If Hulu has an App, it may charge a subscription as is being discussed. If Hulu charges a subscription, some of that revenue flows to Apple. By banning a rival development platform, Apple is encouraging the App Economy to its own advantage. Thus it is a pity that so many of the early publishing apps have received such bad reviews.

Sunday, May 09, 2010

Google Books Rocks

Awesome is too small a word to express what Google Books has achieved. Last year Google settled the class action lawsuit that allows them to index the out-of-print books that they have digitized. As part of the settlement they also have to sell the books, which means that Google is now a bookseller. The most important part of the settlement is the Book Rights Registry:
"The agreement will also create an independent, not-for-profit Book Rights Registry to represent authors, publishers and other rightsholders. In essence, the Registry will help locate rightsholders and ensure that they receive the money their works earn under this agreement. You can visit the settlement administration site, the Authors Guild or the AAP to learn more about this important initiative."
One of the biggest practical issues with Intellectual Property is that it is impossible to use most Intellectual Property, because you do not know who owns it, and therefore you do not know who to ask for permission to use it. Lawrence Lessig has been talking about this for a long time as part of his campaign to fix copyright laws. The establishment of a Book Rights Registry goes some way to address the problem for one type of Intellectual Property. Perhaps this will be the beginning of a trend.

I will write more about this issue another day. For now, here is how I stumbled upon the awesomeness of Google Books. My father would often quote "but tomorrow by the living god, we'll try the game again" after some setback. I knew it was from a poem, but not much more. So the other day, I typed "but tomorrow by the living god" into Google and was astonished by the progress that has been made in search over the last few years. The first entry in the search results linked to a poetry anthology in Google Books that has the full poem by John Masefield.

Masefield is best known for his poems "Sea Fever", "I must go down to the seas again, to the lonely seas and the sky, ..." and "Cargoes", "Quinquireme of Nineveh from distant Ophir, ..." For poem collectors, here is the rarely seen poem TOMORROW by John Masefield:
Oh yesterday the cutting edge drank thirstily and deep,
The upland outlaws ringed us in and herded us as sheep,
They drove us from the stricken field and bayed us into keep;
But tomorrow
By the living God, we'll try the game again!

Oh yesterday our little troop was ridden through and through,
Our swaying, tattered pennons fled a broken, beaten few,
And all a summer afternoon, they hunted us and slew;
But tomorrow
By the living God, we'll try the game again!

And here upon the turret-top the bale-fires glower red,
The wake-lights burn and drip about our hacked, disfigured dead,
And many a broken heart is here and many a broken head;
But tomorrow
By the living God, we'll try the game again!
In my original search results, there was a link to Google newspapers where a Virgin Islands Daily News edition from 1950 quotes part of the poem. This time when I did the search, that link did not come up. Instead there was a link to a 1991 zine for Vietnam War vets that quotes a verse of the poem. Who knows what you may find when you do the search.

Friday, April 30, 2010

More on The Big Short

When I wrote that Michael Lewis had written an almost uplifting account of the financial crisis in "The Big Short" by concentrating on some of the winners, I did not consider that he was keeping something back. If you want to find out what he really thinks, read this interview on Bloomberg.com. He explains many of the choices he made in the book, for instance not including John Paulson, who has been celebrated elsewhere for "The Greatest Trade Ever". He also expresses his outrage over what happened and suggests that part of the reason he left Wall Street in 1989 was that his job was basically "exploiting the idiocy of my customers". It is a long interview and well worth reading in its entirety.

One issue that Lewis touches on is the fact that shorting a market is supposed to dampen it and perhaps bring sanity to it, but in this case structured investment vehicles like synthetic CDOs had the opposite effect of amplifying the market and making the subsequent downfall much worse. The "This American Life" radio show and podcast has a recent segment where they discuss the role of the Magnetar hedge fund in creating many subprime bond deals and then making huge sums of money by shorting parts of them. Again, well worth hearing.

Finally, Lewis discusses the poisonous interface between the big Wall Street firms and their customers. If Goldman Sachs is responsible for defrauding its customers, as the recent lawsuit suggests, there is the question of why anyone would want to do business with them. In an interesting post, The Big Money blog posits that Goldman Sachs is losing its "Social License" to operate. Given their behavior, this may be a good thing.

Monday, April 26, 2010

Business Rules OK!

Performance Management Systems collect the data to make decisions, but they do not make decisions, they do not ensure that decisions get made, and they do not even track the results of the decisions so made. James Taylor (no relation) called this the "over-instrumented" enterprise when he spoke to the April meeting of the SDForum Business Intelligence SIG on "Performance Management and Agility". James is CEO of Decision Management Solutions, where he consults on using technology to make decision making more effective.

James divides the decisions that an organization makes into three levels: strategic, tactical and operational. He is interested in the operational decisions, the little decisions that are taken all the time. An example of an operational decision is what offer to make to a customer who has called a call center. Every enterprise has its own set of operational decisions, however they share the characteristic that there are a large number of them and in aggregate they represent a lot of value, so they are well worth managing.

Many operational decisions are or should be automated, and there is a set of principles that need to be recognized when decision making is automated. The first principle is that no decision logic is going to last forever, so it should not be locked up in something inflexible such as program code. It is much better to use a rules-based decision engine, which allows everybody to see the rules in a language that they can understand. Another principle is that making a decision is a business process and as such should be managed. A good business rules engine allows rules to be tested, measured and perhaps even simulated in action to understand what they are doing and how they can be optimized.
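To show what that separation looks like in the simplest possible form, here is a toy Python sketch of my own; the rule names, conditions and offers are all made up, and a real business rules engine would add versioning, testing and measurement on top of this idea:

# Each rule is data rather than program logic: a readable name, a condition
# and the offer it produces. Keeping rules in a table like this means they
# can be reviewed and changed without rewriting the surrounding application.
RULES = [
    ("high value, good payer",
     lambda c: c["lifetime_value"] > 1000 and c["late_payments"] == 0,
     "offer_loyalty_discount"),
    ("new customer",
     lambda c: c["months_as_customer"] < 3,
     "offer_welcome_bundle"),
]
DEFAULT_OFFER = "no_offer"

def decide_offer(customer):
    """Return the offer from the first rule whose condition matches."""
    for name, condition, offer in RULES:
        if condition(customer):
            return name, offer
    return "default", DEFAULT_OFFER

caller = {"lifetime_value": 2500, "late_payments": 0, "months_as_customer": 18}
print(decide_offer(caller))  # ('high value, good payer', 'offer_loyalty_discount')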

According to James, the purpose of the information gathered for a Performance Management System is to make decisions, so it should be used to make decisions. Too many enterprises are over-instrumented: they have spent all their effort on getting and presenting the data, however they have no measurable ability to turn that data into actions. You can read more about these ideas in the book Smart Enough Systems: How to Deliver Competitive Advantage by Automating Hidden Decisions by James Taylor and Neil Raden. You can also read my co-chair Paul O'Rorke's take on the meeting in his blog.