Thursday, October 13, 2005

Data Quality, The Accuracy Dimension

A couple of years ago Jack Olson spoke to the Business Intelligence SIG on his then recently published book "Data Quality, The Accuracy Dimension". I just finished reading the book and felt that it is well worth a review.

Data quality is a huge problem for Information Technology. In theory, IT systems capture all sorts of useful information that can be used to analyze the business and help make better decisions. In practice, when we look at the data, quality problems mean that the information is often not usable. Data quality is about identifying problems with data and fixing them.

For example, the same customer may appear many different times in different forms so we cannot form an integrated view of all the business interactions with the customer. And then the address may be incomplete so we cannot mail the customer an exciting new offer that fits their profile exactly.

The book has several examples of databases with curious data. There is an HR database where the oldest employee appeared to have been born before the Civil War and the youngest employee had not yet been born. Then there is a medical database where people appeared to have operations inappropriate to their gender. There is also an auto insurance claims database with many different creative spellings for the color beige.

The book itself is divided into three sections. The first section describes the data quality problem: what data quality is and how the problem arises. The second section explains how to implement a data quality assurance program. The emphasis of this section is on the processes needed to do data quality assurance; however, it also includes a chapter on the important topic of making the business case for data quality.

The final and longest section is a more technical look at implementing data quality through data profiling technology. Data profiling is a set of analytic tools for examining data to find quality problems. In a simple case, grouping, counting and ordering are enough to identify outlier data, like the multiple spellings of beige mentioned earlier. In other cases sophisticated algorithms are used to identify correlations that may indicate keys or other important facts about the data. Although this section is more technical, it is certainly not difficult to read or understand.
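To make the simple case concrete, here is a minimal sketch of the group, count and order idea in Python. It is my own illustration, not code from the book: the function names `profile_column` and `rare_values`, the sample color list, and the `max_count` threshold are all assumptions made for the example.

```python
from collections import Counter


def profile_column(values):
    """Group identical values, count them, and order by descending count.

    Values are trimmed and lower-cased first, so that differences in
    case or stray whitespace are not reported as distinct spellings.
    (That normalization is an assumption of this sketch.)
    """
    counts = Counter(v.strip().lower() for v in values)
    return counts.most_common()


def rare_values(values, max_count=1):
    """Return the values that occur at most max_count times.

    Rare values are only candidates for review; a rare value is not
    necessarily a wrong one.
    """
    return [value for value, count in profile_column(values) if count <= max_count]


# Hypothetical sample of a "color" column from a claims table.
colors = ["beige", "Beige", "biege", "beige ", "bage",
          "red", "red", "beige", "baige", "red"]

for value, count in profile_column(colors):
    print(f"{value:10} {count}")

print("Review these:", rare_values(colors))
```

The frequent values rise to the top of the listing and the one-off spellings collect at the bottom, which is where a person would then look to decide what needs fixing. In a real database the same idea would more likely be expressed as a SQL query with GROUP BY, COUNT and ORDER BY.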

This is an extremely valuable book. Physically the book is smallish and unimposing. The writing style is straightforward and easy to understand. Yet the book packs a big punch. As I said before, data quality is a huge problem for IT. This book contains everything you need to start a data quality program. As such, I think that it is essential reading for any IT person in data management, or for an IT consultant looking to expand their practice.

Although the book was published in 2003, it is just as relevant and useful now. In an era where most computer technology books are out of date by the time they are a couple of years old, this is a book that will last. I would compare it to Ralph Kimball's "The Data Warehouse Toolkit", which is 10 years old but just as useful now as it was when it was first published. By the way, Kimball is a great fan of this book.
