Monday, August 22, 2005

Data Mining Insight

Data mining is a difficult subject. On the one hand, it is presented as a thing that will tell you all sorts of wonderful facts that you never knew about your data. On the other hand, when you start getting into it, it is a daunting thing that is difficult to approach, seems to require a PhD in statistics to use, and ends up telling you things that you already know, such as that when people buy bread at the market they are also likely to buy milk (or was that the other way round?).

At the August meeting of the SDForum Business Intelligence SIG, Joerg Rathenberg, VP Marketing and Communications at KXEN, gave a talk called "Shaping the Future" about predictive analytics, which is the latest way of saying data mining. KXEN is a young, privately held company that is devoted to data mining and that has succeeded where many other data mining startups have fallen by the wayside.

The kernel of KXEN's success comes from powerful, robust algorithms that do not require a specialist to tweak; high performance, so that you get results quickly; and, finally and most importantly, ease of use. As part of his presentation Joerg ran through a couple of data mining exercises, showing us how you could take a reasonably sized data set, say in the form of comma separated values (CSV), and with a few clicks and a few seconds of processing generate an interesting analysis of the data.

For me the key insight of the evening was how to use data mining. I had always thought of data mining as a tool of last resort: when the data is too large or complicated and nothing else seems to work, you resort to data mining to try to find something that you cannot see with the naked eye. Joerg, on the other hand, suggested that data mining is the first thing you do when presented with a new business question or a new data set. You use data mining for the initial analysis of the data, to find out which factors in the data really affect the outcome that you are interested in. Once these factors are identified, you can build reports or OLAP cubes using them as dimensions to explore in depth what is going on.
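KXEN's own algorithms are proprietary, so what follows is only a minimal sketch of the factor-screening idea: rank the columns of a CSV file by how strongly each one relates to the outcome you care about. The file layout (header row, all-numeric columns, outcome in the last column) is an assumption of mine, and a real tool would use far more robust measures than plain Pearson correlation.

```java
import java.nio.file.*;
import java.util.*;

// A sketch of exploratory factor screening, NOT KXEN's algorithm.
// Assumes a CSV with a header row, all-numeric columns, and the
// outcome of interest in the last column.
public class FactorScreen {
    public static void main(String[] args) throws Exception {
        List<String> lines = Files.readAllLines(Paths.get(args[0]));
        String[] names = lines.get(0).split(",");
        int cols = names.length, rows = lines.size() - 1;
        double[][] col = new double[cols][rows];
        for (int r = 0; r < rows; r++) {
            String[] f = lines.get(r + 1).split(",");
            for (int c = 0; c < cols; c++) col[c][r] = Double.parseDouble(f[c]);
        }
        // Score each factor by the strength of its linear relationship
        // with the outcome, then print the factors from best to worst.
        double[] score = new double[cols - 1];
        Integer[] order = new Integer[cols - 1];
        for (int c = 0; c < cols - 1; c++) {
            score[c] = Math.abs(pearson(col[c], col[cols - 1]));
            order[c] = c;
        }
        Arrays.sort(order, (a, b) -> Double.compare(score[b], score[a]));
        for (int c : order) System.out.printf("%-20s %.3f%n", names[c], score[c]);
    }

    // Pearson correlation coefficient of two equal-length series.
    static double pearson(double[] x, double[] y) {
        double mx = 0, my = 0;
        for (int i = 0; i < x.length; i++) { mx += x[i]; my += y[i]; }
        mx /= x.length; my /= y.length;
        double cov = 0, vx = 0, vy = 0;
        for (int i = 0; i < x.length; i++) {
            cov += (x[i] - mx) * (y[i] - my);
            vx += (x[i] - mx) * (x[i] - mx);
            vy += (y[i] - my) * (y[i] - my);
        }
        return cov / Math.sqrt(vx * vy);
    }
}
```

The top few factors in the printout are the candidates for report dimensions; everything else can be set aside until the first round of exploration says otherwise.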

Thus data mining is something that you should be doing early and often in your data exploration. Joerg called this "Exploratory Data Mining," and it certainly resonated with the audience members who do data analysis for a living. KXEN has designed its software to make exploratory data mining possible, even easy, and hopes that by this means it becomes accessible to the masses.

Wednesday, August 03, 2005

Pauseless Garbage Collection

Programs generate a surprising amount of garbage: little pieces of memory that are used and then discarded. Low-level programming languages like C require that programmers manage the storage themselves, which is a surprisingly painful, time-consuming and error-prone task. So these days application programs are written in languages like Java (or C#) where the system manages the storage and does the garbage collection. The result is much higher programmer productivity and better, far more reliable programs.

The overhead of doing automatic garbage collection has always been a concern. However, another problem with automatic garbage collection is that, up to now, it has required that the system pause, sometimes for a considerable length of time, while parts of the garbage collector run. A pause in a web application server stops customers from doing whatever they are trying to do. This ranges from absolutely unacceptable in online stock trading to just very bad for customer satisfaction in a typical e-commerce application.

At the August meeting of the SDForum Java SIG, Cliff Click spoke on pauseless garbage collection. Cliff is part of Azul Systems, a startup that has developed an attached processor to run Java applications. As Azul Systems sells to large enterprises that run Java web applications to support their business, being able to do automatic garbage collection without pausing is an important feature.

Cliff is an engaging speaker who has spoken to the Java SIG before; previously he gave an overview of what Azul is doing. At this meeting, Cliff described the pauseless garbage collection algorithm in detail and then went on to give us some indication of its performance. He had taken a part of the standard SPECjbb Enterprise Java warehouse benchmark and modified it by adding a large, slow-moving object cache and a much longer run time, which make the benchmark more realistic and garbage collection more of an issue.

When the benchmark is run on an Azul system, the longest "stop the world" pause is 25 milliseconds, whereas running the benchmark on other Java systems exhibited pauses of up to 5 seconds (yes, seconds). On any platform, almost all of the benchmark transactions run in under a millisecond. On the Azul system, no transaction took more than 26 milliseconds, which is very close to their maximum pause time, and well over 99% of the transactions ran in under 2 milliseconds. On the other Java systems, over half of the total transaction time could be taken up by transactions that took more than 2 milliseconds to complete.
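Azul's collector itself is far more than a blog post can sketch, but the way such pauses show up to an application is easy to demonstrate. Here is a little sketch of my own (not Azul's benchmark): one thread churns out garbage while another sleeps for one millisecond at a time and reports whenever it oversleeps badly, which on a conventional collector happens exactly when the JVM stops the world. The 1 ms tick and the allocation sizes are arbitrary choices of mine.

```java
import java.util.Random;

// A minimal sketch of observing stop-the-world pauses from inside a JVM.
// One thread creates garbage as fast as it can; a second thread sleeps
// for 1 ms at a time, and any gross oversleep means the whole JVM (not
// this code) was paused.
public class PauseMeter {
    static volatile Object sink; // keeps the allocations from being optimized away

    public static void main(String[] args) {
        // Allocation thread: churns out short-lived objects.
        Thread churn = new Thread(() -> {
            Random rnd = new Random();
            while (true) {
                sink = new byte[rnd.nextInt(64 * 1024)];
            }
        });
        churn.setDaemon(true);
        churn.start();

        // Measurement thread: report each new worst-case oversleep.
        long worst = 0;
        while (true) {
            long t0 = System.nanoTime();
            try { Thread.sleep(1); } catch (InterruptedException e) { return; }
            long overMs = (System.nanoTime() - t0) / 1_000_000 - 1;
            if (overMs > worst) {
                worst = overMs;
                System.out.println("new worst pause: " + worst + " ms");
            }
        }
    }
}
```

Run this with a small heap on a stock JVM and the printed worst case creeps up with every major collection; the claim behind Azul's numbers is that the equivalent figure on their hardware stays in the tens of milliseconds.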

While Cliff and Azul are proud of what they have done so far, they are not satisfied. So they are working on removing the last few vestiges of a pause from their system. We can expect even better performance in the future.

Monday, August 01, 2005

Aspect Ratio Hell

I returned from summer vacation with a large number of pictures, which I am now editing. This leads to a difficult decision: when cropping the pictures, what aspect ratio do I choose for the images? It is not a clear-cut question, and any investigation of which aspect ratio to use for cropping pictures leads to much confusion.

Last year, for example, I had a beautiful picture of us getting lei-ed as we arrived in Hawaii. I cropped the picture for a 4 x 6 print (aspect ratio 1.5) and then, deciding that it made such a good picture, printed it at 5 x 7 (AR 1.4), only to discover that the printer chopped off the tops of our heads. Looking further, I see that I would have gotten yet other results had I tried to print at 8.5 x 11 (AR 1.294...), 11 x 17 (AR 1.545...) or 13 x 19 (AR 1.461...). Fortunately the last two choices are moot, because my printer cannot handle sheets of those sizes.
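The arithmetic behind the chopped-off heads is simple: when a print fills the paper, the image is scaled until the paper is covered and whatever overhangs the edge is cropped away. Here is a small sketch (using the paper sizes above) that computes how much of an image already cropped to AR 1.5 each paper shape throws away:

```java
// How much of an image is lost when it is printed edge-to-edge on paper
// with a different aspect ratio? The printer scales until the paper is
// covered and crops whatever hangs over.
public class CropLoss {
    // Fraction of the image lost when an image with aspect ratio `image`
    // (long side / short side) fills paper with aspect ratio `paper`.
    static double lossFraction(double image, double paper) {
        return 1.0 - Math.min(image, paper) / Math.max(image, paper);
    }

    public static void main(String[] args) {
        double[][] papers = { {6, 4}, {7, 5}, {11, 8.5}, {17, 11}, {19, 13} };
        double image = 6.0 / 4.0; // a picture already cropped for 4 x 6 (AR 1.5)
        for (double[] p : papers) {
            double ar = p[0] / p[1];
            System.out.printf("%4.1f x %-4.1f  AR %.3f  crops %4.1f%% of the image%n",
                    p[1], p[0], ar, 100 * lossFraction(image, ar));
        }
    }
}
```

Running it shows the 5 x 7 quietly discarding about 7% of the picture, which is exactly where the tops of our heads went, while the 8.5 x 11 would have thrown away nearly 14%.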

There is more. I take a group of pictures and put them on a DVD that we can watch on TV. Sixty pictures at 6 seconds each, with a sound track, make a high-energy 6-minute video of our vacation. However, this creates more aspect ratio choices. Currently TVs are changing their aspect ratio from 4 x 3 (AR 1.333...) to 16 x 9 (AR 1.777...). (And why has the convention changed to putting the larger number first?) In practice TVs are even more difficult, as they naturally chop off some lines at the top and bottom of the picture, so a video of still images introduces even more uncertainty when deciding how to crop the images.

If we want to see the whole image in the video, we can look at it on a computer monitor, which does not lose scan lines from the top and bottom and will scale everything to fit. But there is a problem even in the logical world of computers. Most of the standard display settings have an aspect ratio of 1.333... (800 x 600, 1024 x 768, 1600 x 1200); however, the majority of computer displays sold today are LCD panels at 1280 x 1024 (AR 1.25).

More confusing yet, the display size of 1280 x 768 (AR 1.666...) is becoming popular, both in laptops that can be used for watching DVDs on long flights and in LCD TVs, where the aspect ratio seems to match what is shown on a 16 x 9 TV even though 1.666... is not 1.777...
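To see how close 1280 x 768 actually comes, here is a companion sketch that computes the aspect ratio of each display mode mentioned above, plus the total height of the black letterbox bars left over when a 16 x 9 video fills the screen's width (every mode listed is narrower than 16 x 9, an assumption the code relies on):

```java
// Aspect ratios of the display modes discussed above, and the size of
// the black bars when a 16 x 9 video is letterboxed onto each of them.
// Assumes every mode is narrower than 16 x 9, so the video fills the width.
public class DisplayModes {
    public static void main(String[] args) {
        int[][] modes = { {800, 600}, {1024, 768}, {1600, 1200},
                          {1280, 1024}, {1280, 768} };
        double video = 16.0 / 9.0;
        for (int[] m : modes) {
            double ar = (double) m[0] / m[1];
            // The video fills the width and occupies width / videoAR lines.
            int usedLines = (int) Math.round(m[0] / video);
            int barLines = Math.max(0, m[1] - usedLines);
            System.out.printf("%4d x %-4d  AR %.3f  letterbox bars: %d lines total%n",
                    m[0], m[1], ar, barLines);
        }
    }
}
```

The output explains the "seems to match": 1280 x 1024 wastes 304 lines on black bars, while 1280 x 768 wastes only 48, close enough that the panel passes for widescreen.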

There is a lot more to aspect ratio, and it only gets worse. For example, the discussion above assumed that pixels are square, which they need not be. I have some ideas about what can be done, but they will have to wait until another time.