Thursday, March 09, 2006

UIMA

UIMA stands for Unstructured Information Management Architecture, as we heard at the SDForum Emerging Technology SIG's March meeting. IBM has just open-sourced a central part of UIMA so that you can download and play with it yourself. So what is UIMA? Well, as with so many other things these days, the presenters did not want to be too specific about what UIMA is, because that would constrict our thinking and prevent us from seeing all sorts of wonderful new applications for it. On the other hand, you have to have some kind of grasp of what it is or you cannot do anything with it.

Lead presenter Daniel Gruhl gave the following roundabout explanation. In 1998, Tim Berners-Lee introduced the Semantic Web. The idea is that you tag your web pages with metadata in the RDF format and even robots will be able to discover what you really mean. Unfortunately, since then nobody has actually put RDF tags in their web pages, and web page metadata has become somewhat discredited because its principal use is to spam search engines.

So what if you could read pages and tag them with your own metadata? Well, that is what UIMA is about. It is a framework where you can take a set of documents and generate your own metadata for each document. The set of documents could be the whole web, a subset of the web, or a set of documents in your own content repository. The documents can be XML, HTML, media files or anything else, as all information is now digital.

The next question is what do we do with this metadata? You cannot go and update other people's web pages, although you could use the metadata to update your own documents and content. In practice, the principal use for the metadata is in building a search index. As I write this, though, I can see that there could be plenty of other uses for UIMA in scanning an existing media or document repository and adding metadata to it. So maybe the presenters were correct when they said that they did not want to constrain our thinking by being too specific about what UIMA is for.

The final question is why would you want to build your own document analyzer or search engine? Current search engines are very general. If you have specific knowledge about a subject area you can catalog a set of documents much more accurately and usefully than a general purpose search engine. One successful application of UIMA is an annotator that knows petrochemical terms and can create an index of documents useful to a petroleum engineer.
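To make the idea of a domain-specific annotator feeding an index a little more concrete, here is a minimal sketch in Python. It does not use the UIMA API at all. The term list, the function names and the sample documents are all made up for illustration, and a real annotator would rely on a far richer vocabulary than a handful of hard-coded phrases.

```python
import re
from collections import defaultdict

# Hypothetical domain vocabulary, invented for this sketch.
PETRO_TERMS = {"wellbore", "drilling mud", "crude oil", "reservoir"}


def annotate_terms(text, terms=PETRO_TERMS):
    """Return sorted (start, end, term) spans for each domain term in text."""
    spans = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for m in re.finditer(pattern, text, flags=re.IGNORECASE):
            spans.append((m.start(), m.end(), term))
    return sorted(spans)


def build_index(docs, terms=PETRO_TERMS):
    """Map each domain term to the sorted ids of documents that mention it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for _start, _end, term in annotate_terms(text, terms):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


# Made-up sample documents.
docs = {
    "report-1": "Drilling mud circulates through the wellbore.",
    "report-2": "The reservoir holds light crude oil.",
    "memo-3": "Lunch is at noon.",
}
index = build_index(docs)
```

The annotations are kept as character offsets alongside the text rather than written into it, which is how you can add metadata to documents you do not own. The index is then just a lookup from term to documents, so a query for "wellbore" only returns the report that actually discusses one.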

As UIMA is open source, people can build annotators on the work of others. The example shown as a demo was an annotator that discovered business titles. It used an existing annotator that identified people's names and another that identified business names, and looked for business titles between them, so it could easily find the CEO in "Sam Palmisano, CEO of IBM".
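The layering described in the demo can be sketched roughly as follows. This is plain Python rather than the UIMA API, and everything in it (the `Annotation` class, the toy name lists, the function names) is a hypothetical stand-in. The point is only to show one annotator consuming the output of two others.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Annotation:
    kind: str   # "person", "company" or "title"
    start: int  # character offset where the span begins
    end: int    # character offset just past the span
    text: str   # the covered text


# Toy lexicons, invented for this sketch. Real annotators would be smarter.
PEOPLE = ["Sam Palmisano"]
COMPANIES = ["IBM"]
TITLES = ["CEO", "CFO", "Chairman"]


def annotate_from_list(text, names, kind):
    """Lower-level annotator: mark every occurrence of a known name."""
    found = []
    for name in names:
        for m in re.finditer(r"\b" + re.escape(name) + r"\b", text):
            found.append(Annotation(kind, m.start(), m.end(), m.group()))
    return sorted(found, key=lambda a: a.start)


def annotate_titles(text, people, companies):
    """Higher-level annotator: look for a title between a person
    annotation and the next company annotation that follows it."""
    title_re = re.compile(r"\b(" + "|".join(map(re.escape, TITLES)) + r")\b")
    found = []
    for person in people:
        following = [c for c in companies if c.start >= person.end]
        if not following:
            continue
        company = min(following, key=lambda c: c.start)
        # Search only the gap between the two existing annotations.
        m = title_re.search(text, person.end, company.start)
        if m:
            found.append(Annotation("title", m.start(), m.end(), m.group()))
    return found


text = "Sam Palmisano, CEO of IBM"
people = annotate_from_list(text, PEOPLE, "person")
companies = annotate_from_list(text, COMPANIES, "company")
titles = annotate_titles(text, people, companies)
```

Note that `annotate_titles` never has to recognize a person or a company itself. It only looks at the gap between annotations that someone else has already produced, which is why building on other people's annotators is attractive.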
