Friday, October 23, 2009

Database Systems for Analytics

The question "what are the attributes of a database system for analytics?" came up during Omer Trajman's talk to the October meeting of the SDForum Business Intelligence SIG. The talk was titled "The Evolution of BI from Back Office to Business Critical Analytics". In the talk Omer gave several examples of applications that use real time analytics and explained the special attributes of each application. As he runs field engineering for Vertica, a Database Systems vendor, I am sure that these examples were based on his experience with Vertica deployments, however Omer was careful to keep his talk vendor neutral.

So what are the the attributes of a database system for analytics? Omer discussed three attributes. Firstly, an analytics database system cannot use the row level locking that is found in a traditional transaction processing database. The database system needs to provide snapshot isolation that gives a query a consistent view of the data while not preventing other operations like data loads. Having helped implement a system like this in the past, I am in total agreement with Omer.

The second attribute is the need to allow concurrency between loading and querying data. While this is related to the first attribute, it also comes with its own issues. Bulk loads are more efficient (particularly for a columnar database like Vertica), however, if you want access to the most up to minute data you need to do loads in small increments so that the data is available for query as soon as it is loaded. Managing this balance is difficult and as yet it has not been completely solved. Again, I have worked on this issue in several different systems.

The final attribute was scaleout, that is the ability to add more processing systems to handle more data and larger queries. We are building systems out of hundreds and thousands of computer systems. Scaleout is vital to effectively use these systems.

No comments: