Saturday, October 29, 2005

Everything Bad Is Good For You

Slashdot has just alerted me to Steven Johnson's new book "Everything Bad Is Good For You", about how today's popular culture and newfangled inventions like television, video games and the internet are actually making us smarter. Both the above links have reviews and plenty of the expected reactionary comments.

What we have to remember is that they said the same kind of thing when printing was invented. Then they said that the new printed books were not as good as the old hand-written ones and that printing took away the artistry in book production. Another line of attack was that if everyone could have books to read, there would be nobody out there tilling the fields. It took about 50 years, until the older generation had died off, for the people who grew up with printed books to fully accept them.

I have been Googling all day to find suitable links for the above paragraph, with no success. Maybe I will have to get a subscription to Lexis/Nexis.

Sunday, October 23, 2005

A Cure for Aspect Ratio Madness

Some time ago I wrote about Aspect Ratio Madness, the problem that every device for displaying pictures and video has a different aspect ratio. In that entry I promised to suggest a cure, so here it is. We now live in the digital age and as you know metadata is the cure for all problems digital. For digital images there is already a standard for the metadata that goes with the picture.

The idea is very simple. My proposal is that we include metadata with the image that describes how to crop it, and the display device, knowing its own aspect ratio, uses that metadata to decide how to display the image. The new metadata is a minimum bounding box for display of the image.

The minimum bounding box is the part of the image that must be displayed when it is presented. The maximum bounding box is the picture itself. So when we come to edit a picture, we crop the picture to the maximum that can be rescued from the image, and we also mark an interior section that is the part we really want to see. This inner crop is saved in the metadata.

When we go to display the image, whether by printing it on a printer, showing it in a computer slide show, or rendering a slide show movie, the display device decides how much of the picture to show. It always includes all of the minimum bounding box, then fills the display out as needed with the rest of the image to fill the frame. If there is no more image left to fill with, the display device uses its default, which is black bars for a screen and blank (white) for printing.

The display device also knows whether it can rotate the image for display. When an image has a minimum bounding box that is taller than it is wide, a printer can rotate the image by 90 degrees, while a computer display or movie render cannot rotate the picture.
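As a rough sketch of the display-time logic, in C (the structure and function names are my own invention, not part of any existing metadata standard): the renderer grows the minimum box toward the display's aspect ratio, never beyond the full picture, and anything still missing gets padded with the device default.

    #include <stdbool.h>

    /* Hypothetical metadata: a crop rectangle in pixel coordinates. */
    typedef struct {
        int x, y;   /* top-left corner */
        int w, h;   /* width and height */
    } Rect;

    /* Grow the minimum bounding box toward the target aspect ratio,
     * but never beyond the full image. If the picture runs out, the
     * result is still short of the target aspect and the device pads
     * the rest (black bars on screen, white on paper). */
    static Rect compute_display_crop(Rect full, Rect min_box, double target_aspect)
    {
        Rect crop = min_box;
        double aspect = (double)crop.w / crop.h;

        if (aspect < target_aspect) {
            /* Too tall for the frame: widen, clamped to the full image. */
            int want_w = (int)(crop.h * target_aspect + 0.5);
            int new_w = want_w < full.w ? want_w : full.w;
            crop.x -= (new_w - crop.w) / 2;            /* keep the min box centered */
            if (crop.x < full.x) crop.x = full.x;
            if (crop.x + new_w > full.x + full.w) crop.x = full.x + full.w - new_w;
            crop.w = new_w;
        } else if (aspect > target_aspect) {
            /* Too wide for the frame: add height, clamped to the full image. */
            int want_h = (int)(crop.w / target_aspect + 0.5);
            int new_h = want_h < full.h ? want_h : full.h;
            crop.y -= (new_h - crop.h) / 2;
            if (crop.y < full.y) crop.y = full.y;
            if (crop.y + new_h > full.y + full.h) crop.y = full.y + full.h - new_h;
            crop.h = new_h;
        }
        return crop;
    }

    /* A printer may turn a tall minimum box into a landscape frame;
     * a screen or movie render calls this with can_rotate = false. */
    static bool should_rotate(Rect min_box, double target_aspect, bool can_rotate)
    {
        return can_rotate && min_box.h > min_box.w && target_aspect > 1.0;
    }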

This works for still images because there is metadata that goes with each image. For video, there is metadata for the whole movie; however, there is no metadata for each frame or shot. If we add metadata for each shot in a movie, we can create video that can be shown on any display, whether 16x9, 4x3 or whatever, and still look correct.

Wednesday, October 19, 2005

@#$% Threads

There are many different paradigms for concurrent programming, and threads are the worst. We are working on a sophisticated threaded system. It uses supposedly standard technology, pthreads on Linux. The number and variety of problems are quite astonishing. Recently we have had some problems with thread termination.

One "little" problem is that part of the system links with a well known database management system. The database installs a thread exit handler which clean up whenever a thread exits. The problem is that not all out threads access the database system. When we cancel a thread that has not run any database code the database exit handler is called and because it has not been initialized, it SEGVs and brings down the whole system.

Another "little" problem is that I moved our code to a new system and could reliably get it to spin forever in the pthread_join procedure. As this comes just after a call to pthread_cancel, the conclusion is that the pthreads library on that system is not thread safe. The pthreads library is in the same location as it is on the system that works, however it is not exactly the same, which suggests that we have a duff pthreads library. After spending a couple of hours, I could not find out where the thread library on either system came from.

Neither of these problems is really a problem with threads; however, they are typical of what a threads programmer has to deal with day to day. I have much more to unload on real threads problems that we can look at another time. This is an important topic, as concurrent programming is the way of the future.

Thursday, October 13, 2005

Data Quality, The Accuracy Dimension

A couple of years ago Jack Olson spoke to the Business Intelligence SIG about his then recently published book "Data Quality, The Accuracy Dimension". I have just finished reading the book and feel that it is well worth a review.

Data quality is a huge problem for Information Technology. In theory, IT systems capture all sorts of useful information that can be used to analyze the business and help make better decisions. In practice when we look at the data, quality problems mean that the information is not there. Data quality is about identifying problems with data and fixing them.

For example, the same customer may appear many times in different forms, so we cannot form an integrated view of all the business interactions with that customer. And the address may be incomplete, so we cannot mail the customer an exciting new offer that fits their profile exactly.

The book has several examples of databases with curious data. There is an HR database where the oldest employee appeared to have been born before the Civil War and the youngest employee had not yet been born. Then there is a medical database where people appeared to have had operations inappropriate to their gender. There is also an auto insurance claims database with many different creative spellings for the color beige.

The book itself is divided into three sections. The first section describes the data quality problem: what data quality is and how the problem arises. The second section explains how to implement a data quality assurance program. The emphasis of this section is on the processes needed to do data quality assurance, although it also includes a chapter on the important topic of making the business case for data quality.

The final and longest section is a more technical look at implementing data quality through data profiling technology. Data profiling is a set of analytic tools for analyzing data to find quality problems. In a simple case, grouping, counting and ordering are enough to identify outlier data, like the multiple spellings of beige mentioned earlier. In other cases, sophisticated algorithms are used to identify correlations that may indicate keys or other important facts about the data. Although this section is more technical, it is certainly not difficult to read or understand.
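To make the simple case concrete, here is a toy sketch in C, with made-up data, of that first kind of profiling: sort the values in a column, group and count them, and eyeball the rare ones.

    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>

    /* Compare two strings for qsort(). */
    static int cmp_str(const void *a, const void *b)
    {
        return strcmp(*(const char *const *)a, *(const char *const *)b);
    }

    int main(void)
    {
        /* Invented claim data standing in for a real column. */
        const char *colors[] = {
            "beige", "beige", "beige", "biege", "beige",
            "bage",  "beige", "blue",  "blue",  "beige",
        };
        size_t n = sizeof colors / sizeof colors[0];

        qsort(colors, n, sizeof colors[0], cmp_str);

        /* Count each run of identical values; the tiny counts, here
         * "biege" and "bage", are the suspect spellings. */
        for (size_t i = 0; i < n; ) {
            size_t j = i;
            while (j < n && strcmp(colors[i], colors[j]) == 0)
                j++;
            printf("%-8s %zu\n", colors[i], j - i);
            i = j;
        }
        return 0;
    }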

This is an extremely valuable book. Physically the book is smallish and unimposing, and the writing style is straightforward and easy to understand. Yet the book packs a big punch. As I said before, data quality is a huge problem for IT, and this book contains everything you need to start a data quality program. As such, I think it is essential reading for any IT person in data management, or for an IT consultant looking to expand their practice.

Although the book was published in 2003, it is just as relevant and useful now. In an era where most computer technology books are out of date by the time they are a couple of years old, this is a book that will last. I would compare it to Ralph Kimball's "The Data Warehouse Toolkit", which is 10 years old but just as useful now as it was when it was first published. By the way, Kimball is a great fan of this book.

Monday, October 03, 2005

Planning Ahead

As I said in my last entry, we tend to build software systems for the hardware that we have now and not for the hardware that will exist when the system is mature. To get a glimpse of what is next for software systems, we should look at the shape of hardware to come.

According to Moore, the semiconductor people can see 3 technology generations ahead, where each generation is about 2 years. They get a doubling of density every 21 or so months. Thus it is safe to extrapolate over the next 6 years: 72 months at a doubling every 21 months is a little more than three doublings, so semiconductor density will multiply by a factor of at least 8. Six years ahead is a good target for the software systems that we are designing now.

Given that a current rack server or blade has 2 dual core processors (4 cores) and 4 Gigs of memory, the same system in 6 years' time will have 32 processors and 32 Gigs of memory. As I noted previously, processors will not get much faster, so extra performance comes from more processors. There is no doubt that this configuration is viable, as it is the spec of a typical mid to high end server that you can buy now. A high end system will have several hundred to a thousand processors and perhaps a terabyte of main memory.

What does this mean for software? Well, the obvious conclusions are 64-bit addressing for the large memory and concurrent programs to use all the processors. Slightly more subtle is the conclusion that all data can fit into main memory, except for a small number of special cases. All data in main memory turns existing database systems on their head, so a specific conclusion is that we will see a new generation of data management systems.

Saturday, October 01, 2005

Moore's Law

On Thursday, I went to the Computer History Museum celebration of the 40th anniversary of Moore's Law. The centerpiece of the event was Gordon Moore in conversation with Carver Mead. David House introduced the speakers and in his introduction read some remarkably prescient passages from the 1965 paper describing applications for the microelectronics to come.

During the conversation Moore explained why he wrote the paper. At the time, integrated circuits were expensive and mainly used in military applications. Most people believed that integrated circuits were a niche product and would remain that way. Moore wanted to show that integrated circuit technology was advancing rapidly and that integrated circuits were the best way of building any electronic product.

So in a sense the paper was marketing, selling the concept of integrated circuits to a sceptical audience with the goal of widening the market for their use, obviously to benefit the companies that were producing them. At the time the idea was controversial. Even nowadays we often forget the remarkable logic of Moore's Law: we design systems for the hardware that we have now, rather than designing them to exploit the hardware that will be there when the system is fully realized.

A remarkable thing is that the original paper extrapolated Moore's Law out to 1975. Since then we have ridden the Law for another 30 years, and it is not going to stop any time soon. Moore told us that they have always been able to see out about 3 generations of manufacturing technology, where each generation is now about 2 years. So they can see how they are going to follow Moore's Law for at least the next 6 years.