Sunday, October 31, 2010

The New OLAP

Just as there are new approaches to database management with the NoSQL movement, so there is a move to a new OLAP, although this movement is just emerging and has not taken a name yet. This month at the SDForum Business Intelligence SIG meeting, Flurry talked about how they put their data on mobile app usage in a giant data cube. More recently, Chris Riccomini of LinkedIn spoke to the SDForum SAM SIG about the scalable data cubing system that they have developed. Here is what I learned about Avatara, the LinkedIn OLAP server. DJ Cline has also written a report of the event.

If you do not know what OLAP is, I had hoped to just point to an online explanation, but could not find any that made sense. The Wikipedia entries are pretty deplorable, so here is a short description. Conceptually, OLAP stores data in a multi-dimensional data cube, and this allows users to look at the data from different perspectives in real time. For example, take a simple cube of sales data that has three dimensions: a date dimension, a sales person dimension, and a product dimension. In reality, OLAP cubes have more dimensions than this. Each dimension contains a hierarchy, so the sales person dimension groups sales people by state, then sales region, then country. At the base level the cube contains a data point called a measure for each sale of each product made by each sales person and the date when the sale was made. OLAP allows the user to look at the data in aggregate, and then drill down on the dimensions. In the example cube, a user could start by looking at the sales of all products grouped by quarter. Then they could drill down to look at the sales in the most recent quarter divided by sales region. Next they could drill down again to look at sales in the most recent quarter by sales person, comparing say the Northern region to the Western region, and so on.
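To make the drill-down sequence concrete, here is a minimal sketch in Python. Everything in it is made up for illustration: the sales facts, the names, the region assignments, and the helper functions `quarter` and `roll_up`. A real OLAP server precomputes and indexes these aggregates rather than scanning a list, so treat this only as a picture of the idea.

```python
from collections import defaultdict

# Hypothetical base-level facts: (date, sales person, product, amount).
# The amount is the measure; the other three fields are the dimensions.
SALES = [
    ("2010-05-14", "Alice", "Widget", 100.0),
    ("2010-08-03", "Alice", "Gadget", 250.0),
    ("2010-08-21", "Bob", "Widget", 75.0),
    ("2010-09-09", "Carol", "Widget", 300.0),
    ("2010-09-30", "Bob", "Gadget", 125.0),
]

# Hypothetical one-level hierarchy on the sales person dimension.
REGION_OF = {"Alice": "Northern", "Bob": "Western", "Carol": "Western"}


def quarter(date):
    """Map an ISO date string to a label such as '2010-Q3'."""
    year, month, _ = date.split("-")
    return f"{year}-Q{(int(month) - 1) // 3 + 1}"


def roll_up(facts, key):
    """Sum the measure over all facts that share the same key."""
    totals = defaultdict(float)
    for fact in facts:
        totals[key(fact)] += fact[3]
    return dict(totals)


# Start with sales of all products grouped by quarter.
by_quarter = roll_up(SALES, lambda f: quarter(f[0]))

# Drill down: the most recent quarter, divided by sales region.
latest = max(by_quarter)
recent = [f for f in SALES if quarter(f[0]) == latest]
by_region = roll_up(recent, lambda f: REGION_OF[f[1]])

# Drill down again: the same quarter by sales person within each region.
by_person = roll_up(recent, lambda f: (REGION_OF[f[1]], f[1]))

print(by_quarter)
print(by_region)
print(by_person)
```

Each step reuses the same aggregation with a finer grouping key, which is all that drilling down amounts to conceptually.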

The new OLAP is enabled by the same forces that are changing databases with NoSQL. Firstly, the rise of commodity hardware that runs Linux, the commodity operating system, allows the creation of cheap server farms that encourage parallel distributed processing. Secondly, the inevitable march of Moore's law is increasing the size of main memory so that now you can spec a commodity server with more main memory than a commodity server had in disk space 10 years ago. An OLAP data cube can be partitioned along one or more of its dimensions to be distributed over a server farm, although at this stage partitioning is more of a research topic than standard practice. Huge main memory allows large cubes to reside in main memory, giving near instantaneous response to queries. For another perspective on in-memory OLAP, read the free commentary by Nigel Pendse at the BI-Verdict (it used to be called the OLAP Report) on "What in-memory BI ‘revolution’?"

LinkedIn is a fast-growing, business-oriented social networking site. They have developed Avatara to support their business needs and currently run several cubes on it. The plan is to open source the code later this year. Avatara is an in-memory OLAP server that uses partitioning to provide scalability beyond the capabilities of a single server.

The presentation was fast-paced and it has taken me some time to appreciate the full implications of what was said. Here are some thoughts. Avatara offers an API that is reminiscent of Jolap rather than the MDX language that is the standard way of programming OLAP, probably because an API is easier to implement than a programming language. Avatara does not support hierarchies in its dimensions, but the number of dimensions in a typical cube seems to be higher than usual. It may be that they use more dimensions rather than hierarchies within a dimension to represent the same information. This trades roll-up within the cube for slicing of dimensions. Slicing is probably more efficient and easier to implement, while a hierarchy is easier for the user to understand as it allows for drill up and down.

Chris mentioned that most dimensions are small, and that can be true. However, the real problems with OLAP implementations start when you have more than one large dimension and you have to deal with the issue of sparsity in the data cube. Chris spent some time on the problem of a dimension with more than 4 billion elements and this seems to be a real requirement at LinkedIn. Current OLAP servers seem to be limited to 2 billion elements in a dimension, so they are going to be even more constraining than Avatara.
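A little arithmetic shows why more than one large dimension is a problem. The sketch below is in Python, and the dimension sizes and fact count are invented for illustration. The 2 billion and 4 billion figures happen to sit close to the largest values of signed and unsigned 32-bit integers; that connection is my assumption, not something stated in the talk.

```python
from math import prod

# Largest values of signed and unsigned 32-bit integers, shown only for
# comparison with the 2 billion and 4 billion figures (an assumption).
SIGNED_32_MAX = 2**31 - 1
UNSIGNED_32_MAX = 2**32 - 1


def dense_cells(dimension_sizes):
    """Number of cells if every combination of members had a slot."""
    return prod(dimension_sizes)


def density(populated_cells, dimension_sizes):
    """Fraction of the dense cube that actually holds data."""
    return populated_cells / dense_cells(dimension_sizes)


# Hypothetical cubes: days x sales people x products.
small_cube = [365, 50, 20]
large_cube = [365, 1_000_000, 1_000_000]  # two large dimensions
facts = 10_000_000                        # hypothetical populated cells

print(dense_cells(small_cube))
print(dense_cells(large_cube))
print(density(facts, large_cube))
```

With two million-member dimensions the dense cube has hundreds of trillions of cells while only a tiny fraction hold data, so the server has to store just the populated cells and still answer queries quickly.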

2 comments:

Anonymous said...

partitioning is traditionally thought of as defining regions on a physical disk, so what is partitioning in memory?

Richard Taylor said...

By partitioning, I meant taking a data cube and slicing it into sub-cubes along one of the dimensions. Each sub-cube is then assigned to a different host in a distributed computer farm. It is similar to sharding, which is a technique for scaling database performance. With sharding, a database is split up into separate shards and each shard is assigned to a separate host. You can slice up a data cube more precisely than a database. However, it is not always straightforward to get consistent performance gains from slicing up a data cube.
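Here is a rough sketch in Python of what I mean. It is not how Avatara does it; the host names, the facts, the hash-based assignment, and the function names are all assumptions for illustration. The cube is sliced along the sales person dimension, each sub-cube lives on one host, and a query is answered by aggregating on each host and merging the partial results.

```python
from collections import defaultdict
from zlib import crc32

# Hypothetical hosts in the server farm.
HOSTS = ["host-a", "host-b", "host-c"]

# Hypothetical facts: (sales person, product, amount).
FACTS = [
    ("Alice", "Widget", 100.0),
    ("Alice", "Gadget", 250.0),
    ("Bob", "Widget", 75.0),
    ("Carol", "Widget", 300.0),
    ("Bob", "Gadget", 125.0),
]


def host_for(member):
    """Pick a host for a dimension member with a stable hash."""
    return HOSTS[crc32(member.encode("utf-8")) % len(HOSTS)]


def partition(facts, dim_index):
    """Slice the cube into sub-cubes along one dimension, one per host."""
    sub_cubes = defaultdict(list)
    for fact in facts:
        sub_cubes[host_for(fact[dim_index])].append(fact)
    return dict(sub_cubes)


def local_totals(sub_cube, group_index):
    """Aggregate the measure within a single sub-cube."""
    totals = defaultdict(float)
    for fact in sub_cube:
        totals[fact[group_index]] += fact[-1]
    return totals


def query(sub_cubes, group_index):
    """Ask every sub-cube for partial totals and merge the answers."""
    merged = defaultdict(float)
    for sub_cube in sub_cubes.values():
        for key, value in local_totals(sub_cube, group_index).items():
            merged[key] += value
    return dict(merged)


sub_cubes = partition(FACTS, 0)   # slice along the sales person dimension
print(query(sub_cubes, 1))        # total sales by product across all hosts
```

A query that groups by product has to visit every sub-cube, while a query about a single sales person touches only one host. That difference is one reason the performance gains from slicing a cube are not always consistent.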