Monday, October 09, 2006

Understanding Metadata

Some time ago I made the remark that "metadata is the cure for all problems digital". Well, it is true, metadata IS the cure, although this statement is a bit like that old Computer Science saw that any problem in Computer Science can be solved by adding an extra level of indirection. The problem with metadata is that it is one of those slippery concepts that seems so straightforward, but when you look into the details it is difficult to pin down.

Ask a number of people what metadata is and you get a variety of answers. There are experts who will tell you that metadata is data about data and then go starry-eyed as their minds get lost in an infinite regress, while you are left waiting to hear something more useful and concrete. On the other hand, there are people who will tell you that metadata is the ID3 tags found in MP3 files. Of course metadata is data about data, but that definition does not capture the essence of it. And while ID3 tags are metadata, there is a lot more useful metadata in an MP3 file than just the ID3 tags, let alone all the metadata in all the other data stores, data sources and file types that are available.

To get an understanding of metadata, a good starting point is to look at a diverse set of examples and then look at some of the important attributes and characteristics that are common to these examples. So, to kick this off, here is a description of metadata in a database, an XML file and an MP3 file. We will look at attributes and characteristics in later posts.

In a SQL database, the metadata is called the Catalog and it is presented as a set of tables like any other database tables. The catalog contains tables that define the tables, columns, views, permissions and so on. In practice the catalog is an external representation of the information that the database system needs to access its data. Internally a catalog can be stored as a set of database tables or just as some data structures; I have seen it done both ways. Either way, the catalog is always presented as a set of tables so that the user can query it just like any other data. For example, I have fixed a bug by writing a mind-bendingly complicated query on a catalog rather than updating the definition of the catalog table to get the required information easily.
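To make the idea of querying the catalog concrete, here is a minimal sketch using SQLite from Python. It assumes SQLite, where the catalog is exposed as the `sqlite_master` table and column details come from `PRAGMA table_info`; other database systems name their catalog tables differently. The function name `describe_catalog` and the `album` and `track` tables are made up for illustration.

```python
import sqlite3


def describe_catalog(conn):
    """Return {table_name: [column_name, ...]} by querying SQLite's catalog."""
    # sqlite_master is SQLite's catalog table; it is queried like any other table.
    tables = conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    ).fetchall()
    catalog = {}
    for (name,) in tables:
        # PRAGMA table_info yields rows of (cid, name, type, notnull, dflt_value, pk).
        columns = conn.execute(f'PRAGMA table_info("{name}")').fetchall()
        catalog[name] = [column[1] for column in columns]
    return catalog


# Hypothetical schema, created in memory purely for the example.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE album (id INTEGER PRIMARY KEY, title TEXT, artist TEXT)")
conn.execute("CREATE TABLE track (id INTEGER PRIMARY KEY, album_id INTEGER, name TEXT)")
catalog = describe_catalog(conn)
```

The point of the sketch is that no special API is needed to read the metadata: an ordinary SELECT against the catalog does the job.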

In an XML document, the tags are the metadata. Well, except for the fact that tags can contain attributes, and the value part of an attribute is data. Next we have to settle the argument about whether processing instructions are metadata by saying that some types of processing instruction are metadata and other types are not. Then there are DTDs and schemas, which are metadata and also metadata about the metadata (which is allowed by the definition that metadata is data about data). Some of the metadata can be in other XML documents that are referenced by URLs.
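As a rough illustration of that split, the sketch below walks a small XML document with Python's standard `xml.etree.ElementTree` module and sorts what it finds into two lists. It assumes the simple reading given above: tag names and attribute names count as metadata, while attribute values and element text count as data. It ignores processing instructions, DTDs, schemas and text that follows a closing tag. The function name `split_xml` and the sample document are invented for the example.

```python
import xml.etree.ElementTree as ET


def split_xml(xml_text):
    """Return (metadata, data) lists for a small XML document."""
    root = ET.fromstring(xml_text)
    metadata, data = [], []
    for elem in root.iter():  # document order, starting with the root element
        metadata.append(elem.tag)
        for attr_name, attr_value in elem.attrib.items():
            metadata.append(attr_name)  # the attribute name describes the value
            data.append(attr_value)     # the value part of an attribute is data
        if elem.text and elem.text.strip():
            data.append(elem.text.strip())
    return metadata, data


# Hypothetical document, used only to exercise the function.
doc = '<album year="2006"><title>Example Album</title><artist>Example Artist</artist></album>'
metadata, data = split_xml(doc)
```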

An MP3 file consists of a sequence of frames where each frame contains a header and data representing the sound. The header is metadata, containing useful information like the bit rate, sampling frequency and whether the data is stereo. An ID3 tag is a block of data that sits outside the sound frames and begins with a marker showing that it is not an MP3 sound frame. The ID3 tag contains information about the artist, recording and album. There are several different versions of ID3 tags that are not upwardly compatible with one another.
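The frame header can be decoded with a few bit shifts. The following is a minimal sketch, not a complete parser: it assumes the standard 4-byte MPEG audio frame header layout, handles only MPEG-1 Layer III (the common case for MP3), and treats anything starting with the ID3v2 marker "ID3" or the ID3v1 marker "TAG" as a tag and not a sound frame. The function name `parse_frame_header` and the sample bytes are made up for the example.

```python
import struct

# Lookup tables for MPEG-1 Layer III only; None marks unusable index values.
BITRATES_KBPS = [None, 32, 40, 48, 56, 64, 80, 96, 112, 128, 160, 192, 224, 256, 320, None]
SAMPLE_RATES_HZ = [44100, 48000, 32000, None]
CHANNEL_MODES = ["stereo", "joint stereo", "dual channel", "mono"]


def parse_frame_header(header_bytes):
    """Decode a 4-byte MPEG-1 Layer III frame header, or return None."""
    if len(header_bytes) < 4:
        return None
    # ID3v2 tags begin with "ID3" and ID3v1 tags with "TAG": not sound frames.
    if header_bytes[:3] in (b"ID3", b"TAG"):
        return None
    (word,) = struct.unpack(">I", header_bytes[:4])
    if word >> 21 != 0x7FF:  # the top 11 bits are the frame sync, all ones
        return None
    version = (word >> 19) & 0x3  # 0b11 means MPEG-1
    layer = (word >> 17) & 0x3    # 0b01 means Layer III
    if version != 0b11 or layer != 0b01:
        return None
    return {
        "bitrate_kbps": BITRATES_KBPS[(word >> 12) & 0xF],
        "sample_rate_hz": SAMPLE_RATES_HZ[(word >> 10) & 0x3],
        "channel_mode": CHANNEL_MODES[(word >> 6) & 0x3],
    }


# Hand-built header bytes for the example: 128 kbps, 44.1 kHz, stereo.
info = parse_frame_header(bytes([0xFF, 0xFB, 0x90, 0x00]))
tag = parse_frame_header(b"ID3\x03\x00\x00")
```

A real player would also scan forward to find the first sync pattern and use the padding bit to compute each frame's length; those steps are left out here.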
