Monday, November 27, 2006

Understanding Metadata Revisited

Last month I wrote down some ideas on metadata. Looking back, they are not very useful. More recently, I got back to reading Ralph Kimball's latest book, "The Data Warehouse ETL Toolkit". As Kimball says, metadata is an important component of ETL. However, he does not have much more success than I did in producing a definition or a useful explanation, and he slips back, as I did, into the less satisfactory definition by example.

This got me thinking. Maybe we could improve the definition of metadata by tightening up the "data about data" definition. Here is my version: "Metadata is structured data that describes structured data". One important attribute of metadata is that it can be used by programs as well as by people, and for this reason metadata must have a known structure. Also, because metadata describes data, the data it describes has structure as well, if only because the metadata describes it. Compared to other definitions, this one finds a middle ground: it is specific enough to be useful while not being confined to a specific application.
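To make the definition concrete, here is a minimal sketch in Python. The column list, the function names and the sample data are all my own inventions for illustration. The point is only that one structured description can drive a program (parsing and type conversion) and also be rendered for a person to read.

```python
import csv
import io

# Hypothetical metadata: a structured description of the columns in a dataset.
COLUMNS = [
    {"name": "id", "type": int},
    {"name": "title", "type": str},
    {"name": "price", "type": float},
]


def parse_rows(text, columns):
    """Use the metadata to turn raw CSV text into typed records."""
    records = []
    for row in csv.reader(io.StringIO(text)):
        if len(row) != len(columns):
            raise ValueError(f"expected {len(columns)} fields, got {len(row)}")
        # Each column description says what to call the value and how to convert it.
        records.append({c["name"]: c["type"](v) for c, v in zip(columns, row)})
    return records


def describe(columns):
    """Render the same metadata for a human reader."""
    return ", ".join(f'{c["name"]} ({c["type"].__name__})' for c in columns)
```

Without the column list, the CSV text is just rows of strings. With it, the data has a structure that both the program and the reader can rely on.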

Properly understanding metadata takes more than absorbing an eight-word definition. I have already mentioned one important attribute: metadata can be used by programs as well as by people. Another important attribute is the distinction between system and user metadata. System metadata is generated by the system that originates or manages the data. For example, the system metadata in a database is the description of the tables, columns, constraints and almost everything else in the catalog. User metadata is created by people and is mostly for use by other people. In a database catalog, user metadata is the contents of comments on the database objects and any semantics that may be associated with the names of the database objects.
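As a rough illustration of catalog metadata, the sketch below uses Python's built-in sqlite3 module and SQLite's PRAGMA table_info to read back what the database knows about a table. The table and its column names are made up for the example. The name length_s is there to show semantics carried by a name: a person reads it as "length in seconds", but nothing in the catalog enforces that.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table, invented for this example.
conn.execute(
    "CREATE TABLE track ("
    " id INTEGER PRIMARY KEY,"
    " title TEXT NOT NULL,"
    " length_s REAL)"
)

# System metadata: the catalog's own description of the table.
# Each row of PRAGMA table_info is (cid, name, type, notnull, dflt_value, pk).
system_metadata = [
    {
        "name": name,
        "type": declared_type,
        "not_null": bool(notnull),
        "primary_key": bool(pk),
    }
    for _cid, name, declared_type, notnull, _default, pk in conn.execute(
        "PRAGMA table_info(track)"
    )
]

for column in system_metadata:
    print(column)
```

The database guarantees that title is never NULL because it generated and enforces that fact. The unit implied by length_s is only a convention between the people who use the table.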

A better example of the distinction is found in an MP3 file. System metadata in an MP3 file is the bit rate, sampling frequency and stereo or mono mode. User metadata is the contents of the ID3 tag. The distinction between system and user metadata is important because metadata, like any other data, can have data quality issues. System metadata is almost always correct: if it were faulty, the system that generated or used the data would break. On the other hand, user metadata is always suspect. Just ask any music fan about ID3 tags.
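Here is a sketch of where the two kinds of metadata live in an MP3 file. It makes some simplifying assumptions of my own: it handles only MPEG-1 Layer III frame headers, and it reads only the older ID3v1 tag (the fixed 128-byte block at the end of the file), not ID3v2. The function names are mine.

```python
# Lookup tables for MPEG-1 Layer III only; other versions and layers use
# different tables and are rejected by this sketch.
BIT_RATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112,
                  128, 160, 192, 224, 256, 320, None]
FREQUENCIES_HZ = [44100, 48000, 32000, None]
MODES = ["stereo", "joint stereo", "dual channel", "mono"]


def parse_frame_header(header):
    """Read system metadata from a 4-byte MPEG-1 Layer III frame header."""
    if len(header) != 4:
        raise ValueError("a frame header is exactly 4 bytes")
    b0, b1, b2, b3 = header
    if b0 != 0xFF or (b1 & 0xE0) != 0xE0:
        raise ValueError("missing frame sync")
    if ((b1 >> 3) & 0x3) != 0x3 or ((b1 >> 1) & 0x3) != 0x1:
        raise ValueError("not an MPEG-1 Layer III frame")
    bit_rate = BIT_RATES_KBPS[(b2 >> 4) & 0xF]
    frequency = FREQUENCIES_HZ[(b2 >> 2) & 0x3]
    if bit_rate is None or frequency is None:
        raise ValueError("free-format or reserved value in header")
    return {
        "bit_rate_kbps": bit_rate,
        "frequency_hz": frequency,
        "mode": MODES[(b3 >> 6) & 0x3],
    }


def parse_id3v1(data):
    """Read user metadata from an ID3v1 tag, or return None if there is none."""
    tag = data[-128:]
    if len(tag) != 128 or not tag.startswith(b"TAG"):
        return None

    def text(raw):
        # Fields are fixed width, padded with NUL bytes or spaces.
        return raw.split(b"\x00", 1)[0].decode("latin-1").strip()

    return {
        "title": text(tag[3:33]),
        "artist": text(tag[33:63]),
        "album": text(tag[63:93]),
        "year": text(tag[93:97]),
    }
```

Notice the difference in how the two functions can fail. A wrong value in the frame header means the decoder plays garbage or nothing at all, so it gets noticed. A wrong value in the tag is just a string that somebody typed, and nothing in the file can tell you whether it is right.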

I am sure that there is a lot more to say about metadata; however, this feels like a good starting point.
