Saturday, February 19, 2011

Agile BI at Pentaho

Ian Fyfe, Chief Technology Evangelist at Pentaho, showed us what the Pentaho Open Source Business Intelligence Suite is and where they are going with it when he spoke to the February meeting of the SDForum Business Intelligence SIG on "Agile BI". Here are my notes on Pentaho from the meeting.

Ian started off with positioning. Pentaho is an open source Business Intelligence Suite with a full set of data integration, reporting, analysis and data mining tools. Their perspective is that 80% of the work in a BI project is acquiring the data and getting it into a suitable form, and the other 20% is reporting and analysis of the data. Thus the centerpiece of their suite is the Kettle data integration tool. They also have the strong Mondrian OLAP analysis tool and the Weka data mining tools. Their reporting tool is perhaps not quite as strong as those of other open source BI suites that started from a reporting tool. All the code is written in Java. It is fully embeddable in other applications and can be branded for that application.

Ian showed us a simple example of loading data from a spreadsheet, building a data model from the data and then generating reports from the data. All of these things could be done from within the data integration tool, although they can also be done with stand alone tools. Pentaho is working in the direction of a fully integrated set of tools with common metadata between them all. Currently some of the tools are thick clients and some are web based clients. They are moving to have all their client tools be web based.

We had come to hear a presentation on agile BI and Ian gave us the Pentaho view. In an enterprise, the task of generating useful business intelligence is usually done by the IT department in consultation with the end users who want the product. The IT people are involved because they supposedly know the data sources and they own the expensive BI tools. Also, the tools are complicated and using them is usually too difficult for the end user. However, IT works to its own schedule, through its own processes, and takes its time to produce the product. Often, by the time IT has produced a report, the need for it has moved on.

Pentaho provides a tightly integrated set of tools with a common metadata layer so there is no need to export the metadata from one tool and import it into the next one. The idea is that the end to end task of generating business intelligence from source data can be done within a single tool or with a tightly integrated suite of tools. This simplifies and speeds up the process of building BI products to the point that they can be delivered while they are still useful. In some cases, the task is simplified to such an extent that it may be done by a power user rather than being thrown over the wall to IT.

The audience was somewhat sceptical of the idea that a sprinkling of common metadata can make for agile BI. All the current BI suites, commercial and open source, have been pulled together from a set of disparate products and they all have rough edges in the way the components work together. I can see that deep and seamless integration between the tools in a suite will make the work of producing Business Intelligence faster and easier. Whether it will be fast enough to call agile we will have to learn from experience.

Sunday, February 06, 2011

Revolution: The First 2000 Years of Computing

For years, the Computer History Museum (CHM) had an open storage area where they put their collection of old computers, but without any interpretation except for docent led tours. I had no problem wandering through this treasure trove because I knew a lot about what they had on show, from slide rules and abacuses to the Control Data 6600 and the Cray machines. Even then, a docent could help by pointing out features that I would miss, such as the ash tray on each workstation of the Sage early warning computer system.

Now the CHM has opened their "Revolution: The First 2000 Years of Computing" exhibition, and I recommend a visit. They still have all the interesting computer hardware that they had in the visible storage area, however it is placed in a larger space and there is all kinds of interpretive help, from explanations of the exhibits to video clips that you can browse. In my visit, I saw a lot of new things and learned much.

For example, Napier's Bones are an old time calculation aid that turns long multiplication into addition. The Napier's Bones exhibit explains how they work and allows you to do calculations using a set. The exhibit on computers and rocketry has the guidance computer for a large missile arrayed in a circle around the inside of the missile skin, leaving an ominously empty space in the middle for the payload. In the semiconductor area they had examples of silicon wafers that ranged from the size of a small coin in the early days to a current wafer that is the size of a large dinner plate. There is also an interesting video discussion of the marketing of the early microprocessors like the 8086, the Z8000 and the M68000, and of the absolute importance of landing the design win for the IBM PC that led to the current era where Intel is the biggest and most profitable chip maker. These are just a sample of the many fascinating exhibits there.

I spent over 2 hours in the exhibition and only managed to get through half of it. I am a long time member of the museum and can go back any time, so this is a warning to non-members to allow enough time for their visit.

Saturday, February 05, 2011

Greenplum at Big Data Camp

I was at the Big Data Camp at the Strata Big Data conference the other day and one of the breakout sessions was with Greenplum. They had several interesting things to say on map-reduce and on performance, both in and out of the cloud. Greenplum is a parallel database system that runs on a distributed set of servers. To the user, Greenplum looks like a conventional database server except that it should be faster and able to handle large data because it farms out the data and the workload over all the hosts in the system. Greenplum also has a map-reduce engine in the server and a distributed Hadoop file system. Thus the user can use Greenplum both as a big data relational database and as a big data NoSQL database.

Map-reduce is good for taking semi-structured data and reducing it to more structured data. The example of map-reduce that I gave some time ago does exactly that. Thus a good use of Map Reduce is to do the Transformation part of ETL (Extract-Transform-Load), which is the way data gets into a data warehouse. The Greenplum people confirmed that this is a common use pattern for map-reduce in their system.
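To make the Transform idea concrete, here is a minimal sketch in Python of a map-reduce job turning semi-structured data into structured data. The log format, page names and counts are invented for the illustration; a real job would run the same two phases in parallel over many machines.

```python
from collections import defaultdict

# Semi-structured input: raw web log lines (hypothetical format).
log_lines = [
    "2011-02-05 10:01 GET /index.html 200",
    "2011-02-05 10:02 GET /about.html 404",
    "2011-02-05 10:03 GET /index.html 200",
]

def map_phase(line):
    # Map: parse a raw line and emit (page, 1) for each successful hit.
    date, time, verb, page, status = line.split()
    if status == "200":
        yield (page, 1)

def reduce_phase(pairs):
    # Reduce: sum the counts for each page, yielding a structured table.
    totals = defaultdict(int)
    for key, value in pairs:
        totals[key] += value
    return dict(totals)

pairs = [pair for line in log_lines for pair in map_phase(line)]
print(reduce_phase(pairs))  # {'/index.html': 2}
```

The structured output of the reduce phase is exactly the kind of table that the Load step of ETL would then insert into a warehouse.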

Next was a discussion of performance. Greenplum has compared performance and asserted that their relational database is 100 times faster than their map-reduce engine for doing the same query. I was somewhat surprised by the magnitude of this number, however I know that at the low end of system and data size, a relational database can be much faster than map-reduce, and at the high end there are places you can go with map-reduce that conventional database servers will not go, so it is never a level comparison. I will write more on this in another post.

Finally we got to performance on Virtual Machines (VMs) and in the cloud. Again Greenplum had measured their performance and offered the following. In a place where conditions are well controlled, like a private cloud, they expect to see a 30% performance reduction from running on VMs. In a public cloud like the Amazon EC2 server cloud, they see a 4 times performance reduction. The problem in a public cloud is inconsistent speed for data access and networks. They see both an overall speed reduction and inconsistent speeds when the same query is run over and over again.

It is worth remembering that Greenplum and other distributed database systems are designed to run on a set of servers with the same performance. In practice this means that the whole database system tends to run at the speed of the slowest instance. On the other hand, map-reduce is designed to run on distributed systems with inconsistent performance. The workload is dynamically balanced as the map-reduce job progresses, so map-reduce will work relatively better in a public cloud than a distributed database server.

Monday, January 31, 2011

Security in the Cloud

Although I am not an expert, I have been asked more than once about security in the cloud. Now I can help because last week I got an education on best security practices in the cloud at the SDForum Cloud SIG meeting. Dave Asprey, VP of Cloud Security at Trend Micro, gave us 16 best practices for ensuring that data is safe in a public cloud like the Amazon cloud services. I will not list all of them, but here is the gist.

Foremost is to encrypt all data. The cloud is a very dynamic place with instances being created and destroyed all over the place, and your instances or data storage may be moved about to optimize performance. When this happens, the residual copy of your data can be left behind for the next occupier of that space to see. Although this would happen by accident, you do not want to expose confidential data for others to see. The only cure is to encrypt all data so that whatever may be left behind is not recognizable. Thus you should only use encrypted file systems, encrypt data in shared memory and encrypt all data on the network.

Management of encryption keys is important. For example, you should only allow the decryption key to enter the cloud when it is needed, and make sure that it is wiped from memory after it has been used. Passwords are a thing of the past. Instead of a password, being able to provide the key to decrypt your data is sufficient to identify you to your cloud system. There should be no password based authentication and access to root privileges should not be mediated by a password, but should be enabled as needed by mechanisms like encryption keys.

Passive measures are not the end of cloud security. There are system hardening tools and security testing services. Also use an active intrusion detection system, for example OSSEC. Finally, and most importantly, the best advice is to "Write Better Applications!"

Monday, January 17, 2011

The Steve Jobs Media Playbook

Information wants to be free. Steve Jobs is not usually associated with setting information free, however he set music free and may well be on the way to set more media free. Here is the playbook that he used to set music free, and an examination of whether he can set other media free.

Back at the turn of the millennium digital music was starting to make waves and Apple introduced their first iPod in 2001. At the beginning, it was not a great seller. The next year the second generation iPod that worked with Microsoft Windows came out and sales started to take off. The next problem with promoting sales of the iPod was to let people buy music directly. In those days, to buy music you had to buy a CD, rip it onto a computer and then sync the music onto the iPod.

The record companies did not like digital music. It was in the process of destroying their business model of selling physical goods, that is CDs, which had been plenty profitable until the internet and file sharing had come along. Thus the record companies knew that if they were going to allow anyone to sell digital music, the music content had to be protected by a strong Digital Rights Management (DRM) system. Basically DRM encrypts digital content so that it can only be accessed by a legitimate user on an accredited device.

Now there is one important thing about any encryption: it depends upon a secret key to unlock the content. If too many people know a secret, it is no longer a secret. So it made perfect sense for Apple to have their own DRM system and be responsible for keeping their secret safe. The only problem was that Apple effectively controlled the music distribution channel because of the DRM system and its secret. By providing exactly what the music business had asked for, Apple managed to wrest control of the distribution channel from them.

In the past I have joked about the music business controlling the industry by controlling the means of production. In fact they controlled the business by controlling the distribution channel between the artists and the record stores that sold the music. When the iTunes store became the prime music distribution channel it was game over for the recording industry. They had to climb down and offer their music without DRM to escape from its deadly embrace. DRM free music has not stopped iTunes but it does open up other sales channels.

The remaining question is what will happen with other media? Apple will not dominate the tablet market as it has the music player market, so it will not be able to exert the same influence. On the other hand, other media is not as collectible as music. We collect music because we want to listen to it over and over again. With most other media, we are happy to consume it once and then move on. Thus we do not feel the need to own the media in the same way. I have some more thoughts that will have to wait for another time.

Friday, December 31, 2010

The Year in Posts

Looking back at posts in this blog over the last year I see a couple of themes emerge. Firstly there were many posts on technology and media, in particular several on the iPad which has had an extraordinary effect as the first device specifically designed for consuming media. Other issues of concern included television, 3D, aspect ratios and the problem of registration at web sites. We are going through huge changes in the media world as digitization and the internet delivery system change everything. I have written many posts on this in the past and I will continue to do so.

The SDForum Business Intelligence SIG that I chair had a banner year with so many memorable meetings, it is difficult to pick out the best one. A fantastic talk from Google Analytics Evangelist Avinash Kaushik on "Web Analytics 2.0" drew by far the biggest crowd. We had two great big data talks: "Winning with Big Data" from Michael Driscoll of Dataspora and "Mad Skills for Big Data" from Brian Dolan, both very impressive. Donovan Schneider from SalesForce.com spoke on "Real Time Analytics" and Dan Graham from Teradata spoke on "Data Management in the Cloud". Finally Peter Farago and Sean Byrnes of Flurry talked about the extraordinary information about smartphone usage that they collect from their Mobile App analytics platform. Co-chair Paul O'Rorke who organized several of these meetings has stepped down and we will miss him greatly.

Finally, Blogger started collecting statistics in May of this year. Looking at the page views on this blog, my last post on "Windows File Type Fail" has generated a lot of interest in the few days since it was posted. The most viewed post is a 2009 post on "Ruby versus Scala" followed closely by the Windows post. In my view, the post last year about the Windows Autorun feature is a better rant than the current one. You can feel the veins bulging in that rant whereas this year's rant is very laid back in comparison. Do not worry, there are many more misfeatures of Microsoft Windows to rant about so I am not going to run out of material for a long time.

Tuesday, December 28, 2010

Windows File Type Fail

It is that time of year when I rant about an awful, awful, awful feature of the Microsoft Windows operating system. This year the subject of my diatribe is file types. You see, Windows thinks that every file has a type and the type connects the file to a program that can handle it. Like many "features" in Windows, file types are intended to make your life easier while in practice doing the opposite. Note that some time ago, I wrote about file systems and Content Management as opposed to a file type manager. I still think there are some good ideas in there that need to be explored.

If you do not know what a file type is, here is a primer. Every file has a name. The file type is usually a 3 letter extension to the name. So for example, the program for Windows Explorer is called "explorer.exe": the dot is a separator and exe is the file type. The type exe means a program that Windows can run. To look at all the file types on an XP system, bring up the control panel, select Folder Options and then click the File Types tab. On Vista and 7, the path through the control panel is slightly different. The dialog shows a huge list of registered file types and the programs that will handle them. Note that the first few entries in the list are not representative, go down to the middle or bottom of the list to see what it is really all about.
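The extension rule itself is simple enough to sketch in a few lines of Python. The handler table below is a made-up stand-in for the registry's file type list, just to show how a type maps a file to a program.

```python
import os.path

def file_type(filename):
    # The text after the last dot in the name is the Windows file type.
    root, ext = os.path.splitext(filename)
    return ext.lstrip(".").lower()

# Hypothetical stand-in for the registered file types table.
handlers = {
    "exe": "run as a program",
    "txt": "open with Notepad",
}

name = "explorer.exe"
print(file_type(name))                # exe
print(handlers.get(file_type(name)))  # run as a program
```

A file whose type is missing from the table is exactly the "unknown file type" case that the next paragraph describes.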

Windows goes to great lengths to hide file types from you. By default they are not shown anywhere and you can go for a long time without even knowing that files have types. One way to run into file types is to double click on a file with a type that Windows does not know about. Windows shows a dialog asking you what program you want to use with it. You can either look up the file type on the web or select a program from a list. The most annoying aspect is that when you select a program from a list, there is a little check box that says "Always use the selected program to open this type of file." If you test a program that does not work without unchecking the box, the mistake is remembered and thereafter every time you open a file of that type, the wrong program is chosen. If you uncheck the box, a mistake is not remembered, however neither is a success. Either way, you can lose. Moreover, to recover from a mistake, you have to find the entry for the file extension in the File Types window discussed above and delete it, which is not a trivial task, given the number of file types.

Another little problem with file types is that they can be wrong, confused or direct Windows to do the wrong thing. I wrote about a problem with .avi files from a Canon camera breaking Windows Explorer. There are security issues where Windows is penetrated because it trusts the file type information and then does the wrong thing with a broken file.

However, the real problem with file types appears when you install a new program. Programs are greedy. They want to control as much of your experience as possible so they will try to register as many different file types as they can. If you have one program that deals with a type of file and you install another program that deals with similar files, the new program should pop up a dialog asking you which types of files it should handle. Then you have to make all sorts of complicated decisions about which file types the new program should handle.

Programs for handling media are the worst in this respect because there are lots of different media types and it is common to have several media players installed to handle different special cases. For example, on my home computer I have Windows Media Player and a DVD player because they came with the system. Then there is iTunes for my iPod, the QuickTime video player that comes with iTunes, a RealPlayer for the BBC iPlayer and finally a program for ripping and burning CDs and DVDs. There may well be other media players amongst the shovelware preinstalled on the box. There are also programs for editing specific media types like at least two picture editors and a video editor or several.

A typical scenario is that you are installing a new media player program because you want to use it to view a particular type of media. Unfortunately, the program installer knows about all the media types that it can handle and asks you to choose which media types it should handle. Thus you have to disengage your thoughts from the one media type that is the object of your attention and instead start to think about all those other media types that you are not interested in. Unfortunately, there is the worry that if you give in to the new media player and let it handle certain types of media, other things will stop working. Maybe you will not be able to watch videos, or maybe videos will stop syncing with your portable media player because you changed the program associations with a particular file type. Given the complexity of these systems, who knows what may go wrong.

I said that the media player installer should ask you which file types you want associated with the program. A few years ago, Real managed to destroy much of their franchise by not playing nice and fair with file types. The RealPlayer installer switched all file types that it could handle to use the RealPlayer without bothering to ask or notify. Worse, if you went in and installed another program that changed the file type associations, or even tried to use the File Types dialog screen to change file type associations, it would just change them back to the RealPlayer, again without a notification. When this came to light, many people, myself included, uninstalled RealPlayer and swore never to install any software from Real again. Recently I caved on this resolution so that I could listen to old BBC radio shows like "The Goon Show" with the BBC iPlayer, which turns out to be just a rebadged RealPlayer.

Since the RealPlayer imbroglio, installer programs have been a lot more careful about asking users about file types, but that just throws the problem back to the user. As the whole point of file types is to hide system complexity from the user, this is no solution at all. A better path is to do without file types. Why are they necessary? Do they really serve a purpose? Other operating systems get along fine without file types, so why does Windows need them? Let's just throw them out and make life easier.

Monday, December 20, 2010

Is That Annoying Modal Caps Lock Key Going Away?

So Google came out with their new Chrome Operating System, loaded it onto a laptop and gave the whole caboodle to people to play with and comment on. While Chrome OS has generated a lot of comments, the largest and most active discussion has been about the Caps Lock key. You see, Google has changed the behavior of the key that used to be Caps Lock to instead call up a search page. I am sure this change was made to pander to keyboard weenies who want to Google without having to lift their hands from the keyboard. Anyway, the change has backfired. Instead of talking about Chrome OS, everyone is engaged in a furious discussion of why the Caps Lock is either essential or should have been disposed of a long time ago.

I have two problems with the Caps Lock, no make that three. The first problem is that it sits right between two important keys. Below is the Shift key whose importance needs no explanation. Above it is the Tab key, used for next field, command completion, automatic indent and plenty of other useful purposes. In the middle sits Caps Lock just waiting to be hit by accident. This brings us to the next problem: Caps Lock is modal. Hit the Caps Lock key by accident and you do not make just one typing mistake, rather the whole keyboard is shifted into a new mode and the error compounds. By the time I look at the screen, I have typed half a sentence in the wrong case.

I am a member of the tribe that hates modal user interfaces with a passion. Some of my compatriots physically remove the Caps Lock key or reprogram their keyboard to reduce typing errors. I have only gone as far as to disable that other annoying modal key. The Insert key is used by many editors to switch between insert mode and overtype mode. If you hit Caps Lock by accident, the result is obvious, if you hit Insert by accident you can go on for some time before you realize that you are seriously damaging the document that you are trying to fix up. Of course, the Insert key is slightly off the main keyboard, right above the really useful Delete key and just waiting to be hit by accident.

My final problem with the Caps Lock key is that if you are in Caps Lock mode and you press shift, it reverts back to entering lower case. This means that when I hit cAPS lOCK by accident every key I type is in the wrong case, not just some of them. I happen to have an old typewriter from the 1930's so I know what shift really means. The Shift key causes the whole paper carriage and platen to move so that when the typebar comes down a different type piece strikes the ink ribbon and paper. Shifting the platen is why it is called the Shift key and it is a heavy key to hold, so there is a Shift Lock key that is a mechanical lock to hold the platen in the shifted position. With the platen locked in the shift position, hitting the shift key does nothing, so why has someone gone to the trouble of programming bogus behavior in our modern and supposedly more convenient keyboards?

Now, I know that there are people who love the Caps Lock key and who use it all the time. For my part, given the choice between a key that causes a small typing mistake every time I hit it by accident and a key that brings up a new web page by accident, I will choose the Caps Lock function every time. Caps Lock is annoying but I have lived with it for a long time and it is a much smaller surprise than a new page that I do not want.

Saturday, December 18, 2010

The Gawker Password Fiasco

Last month I wrote about password security, just a little too soon. This month the popular blog site owner Gawker admitted to a huge security breach where hackers had broken into their web servers and stolen their entire database of user account names with email addresses and passwords. The attack has brought password security to everyone's attention, with people reporting that their email and other accounts have been compromised. There are a lot of discussions of protocols for password security with good information, and unfortunately there is also a lot of misinformation. Here is my take.

The Forbes magazine web-site has a clear description of the attack on Gawker, (although their discussion of the password encryption is not correct). The short story is that the break-in was done by a hacker group called Gnosis who were annoyed by Gawker. Frankly, given Gawker's arrogant style, who has not been annoyed by them at some time? Gnosis first broke in to Gawker in July and got the passwords to accounts for Nick Denton and 16 other staffers there. In November, Denton noticed some possible tampering in a web account, and finally in December Gnosis announced their break in and released data they had gathered.

Although Gawker had used encryption to hide users' passwords, the encrypted passwords are susceptible to a brute force attack and many have been broken. Gawker lost over 1 million accounts and more than 100,000 passwords have been cracked and published so far. The Wall Street Journal has a nice analysis of the most popular passwords including a frequency graph.

There is a lot of misunderstanding about how passwords are stored on a web site and how a brute force attack takes place. For example, the Forbes article I mentioned earlier obviously does not have a clue. I do not know for certain how Gawker protects their passwords, however the best practice is to use a salted hash. With this technique, the web-site chooses a salt, which is just a random string of characters. When a user sets a password, the salt is appended to the password and the whole string is hashed with a cryptographic hash function like SHA-1. The resulting hash value is a seemingly random string of bits, and this is stored as the user's encrypted password. When the user wants to log in, the salt is added to the supplied password, the resulting string is hashed, and the hash value compared to the saved hash. If they are the same, the user must have provided the correct password and is allowed to log in. By using a salted hash, the web-site does not save the user's password, it just saves a cryptographic hash that is used to confirm that the user knows their password. To make things more secure, the web-site can save a different salt for each user, or just add the user name to a common salt, so that even if two users have the same password, the salted hashes of their passwords are not the same.
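The scheme described above fits in a few lines of Python. This is a minimal illustration, not production code: the salt and password are made up, and while SHA-1 is the hash named above, a real site today should use a per-user random salt and a deliberately slow hash function.

```python
import hashlib

def hash_password(password, salt):
    # Append the salt to the password and hash the whole string.
    return hashlib.sha1((password + salt).encode()).hexdigest()

def verify_password(supplied, salt, stored_hash):
    # Recompute the salted hash and compare it to the saved value.
    return hash_password(supplied, salt) == stored_hash

# The site stores only the salt and the hash, never the password itself.
salt = "x9!Qz"  # hypothetical salt
stored = hash_password("s3cret", salt)

print(verify_password("s3cret", salt, stored))  # True
print(verify_password("wrong", salt, stored))   # False
```

Note that nothing in the stored data reveals the password; the site can only check whether a supplied guess produces the same hash.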

In a brute force attack the attacker knows the algorithm used to generate the salted hash and has the salted hash of the password. The attacker generates a list of potential passwords, applies the password checking algorithm to each password, and if the results are the same, they have guessed the user's password. If the attacker can try 20 passwords a second, they can test well over a million passwords a day on a single computer.

It is very easy to generate a list of potential passwords. One good starting point is a list of broken passwords, such as published by Gnosis from the attack on Gawker. The next step is a dictionary of common words and proper names. Many applications have a spelling dictionary that can be used as a starting point. Then try some simple variations like adding a number to the beginning or end of words, capitalizing letters in the word, and making common substitutions for letters such as 1 for the letter 'i' and 5 or $ for 's'.
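The dictionary attack described above can be sketched in Python. The word list, the salt, and the victim's weak password are all invented for the illustration; a real attack would simply use a far bigger dictionary and more variations.

```python
import hashlib

def salted_hash(password, salt):
    # Same algorithm the site uses; the attacker knows it and the salt.
    return hashlib.sha1((password + salt).encode()).hexdigest()

salt = "pepper"                               # hypothetical salt
stolen_hash = salted_hash("monkey1", salt)    # the victim's weak password

dictionary = ["password", "letmein", "monkey", "dragon"]

def variations(word):
    # Simple variations: the word itself, capitalized, with a trailing
    # digit, and with common letter substitutions.
    yield word
    yield word.capitalize()
    for digit in "0123456789":
        yield word + digit
    yield word.replace("i", "1").replace("s", "5")

found = None
for word in dictionary:
    for candidate in variations(word):
        if salted_hash(candidate, salt) == stolen_hash:
            found = candidate
            break
    if found:
        break

print(found)  # monkey1
```

A handful of variations per dictionary word is enough to catch a large fraction of real passwords, which is why a common word plus a digit is barely better than the word alone.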

So now that you know how it is done, think about your passwords and how easily they can be attacked by brute force, and excuse me while I go and change some of mine.

Saturday, December 11, 2010

Now You See It: the Book

If you are of a data analytics bent or know someone who is and are looking for a book to put on the Christmas list, consider Now You See It: Simple Visualization Techniques for Quantitative Analysis by Stephen Few. This is a beautiful book that would not look out of place on a coffee table, yet at the same time, is full of practical information about how to do analytics with charts, graphs and other visual tools.

The book is divided into three sections. The first section covers visual perception and general visualization techniques for looking at data. Then the second section goes into more detail with chapters on specific techniques for different types of analysis including time-series analysis, ranking analysis, deviation analysis and multivariate analysis amongst others. Each chapter in this section ends with a summary of the techniques and best practices for that type of analysis. Finally the book ends with a shorter section that looks at promising new trends in visualization.

There are copious examples of graphs and charts drawn by different software tools. While some of these graphs come from high end tools like Tableau and Spotfire, others are drawn by Microsoft Excel. In fact there are several specific procedures for using features of Excel to do sophisticated analytics. That is not to say that the book suggests that you can do everything with a spreadsheet. The first part shows you what to look for in visual analytics software and it is essential reading before going out and choosing which tool to use.

So, if you are looking for a quality and practical gift for an analytician, choose "Now You See It".