Sunday, October 31, 2010

The New OLAP

Just as there are new approaches to database management with the NoSQL movement, so there is a move to a new OLAP, although this movement is just emerging and has not yet taken a name. This month at the SDForum Business Intelligence SIG meeting, Flurry talked about how they put their data on mobile app usage into a giant data cube. More recently, Chris Riccomini of LinkedIn spoke to the SDForum SAM SIG about the scalable data cubing system that they have developed. Here is what I learned about Avatara, the LinkedIn OLAP server. DJ Cline has also written a report of the event.

If you do not know what OLAP is, I had hoped to just point to an online explanation, but could not find any that made sense. The Wikipedia entries are pretty deplorable, so here is a short description. Conceptually, OLAP stores data in a multi-dimensional data cube, and this allows users to look at the data from different perspectives in real time. For example, take a simple cube of sales data with three dimensions: a date dimension, a sales person dimension, and a product dimension. In reality, OLAP cubes have more dimensions than this. Each dimension contains a hierarchy, so the sales person dimension groups sales people by state, then sales region, then country. At the base level the cube contains a data point, called a measure, for each sale of each product made by each sales person on the date when the sale was made. OLAP allows the user to look at the data in aggregate, and then drill down on the dimensions. In the example cube, a user could start by looking at the sales of all products grouped by quarter. Then they could drill down to look at the sales in the most recent quarter divided by sales region. Next they could drill down again to look at sales in the most recent quarter by sales person, comparing, say, the Northern region to the Western region, and so on.
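To make that drill-down sequence concrete, here is roughly what the three views would look like as queries against a hypothetical star schema (the table and column names are invented for illustration; a real OLAP server answers these from the cube without re-scanning the raw data):

    -- All products, total sales grouped by quarter
    SELECT d.quarter, SUM(f.sale_amount) AS total_sales
    FROM sales_fact f
    JOIN date_dim d ON f.date_id = d.date_id
    GROUP BY d.quarter;

    -- Drill down: the most recent quarter, divided by sales region
    SELECT s.region, SUM(f.sale_amount) AS total_sales
    FROM sales_fact f
    JOIN date_dim d ON f.date_id = d.date_id
    JOIN sales_person_dim s ON f.sales_person_id = s.sales_person_id
    WHERE d.quarter = '2010-Q3'
    GROUP BY s.region;

    -- Drill down again: individual sales people in two regions
    SELECT s.region, s.sales_person_name, SUM(f.sale_amount) AS total_sales
    FROM sales_fact f
    JOIN date_dim d ON f.date_id = d.date_id
    JOIN sales_person_dim s ON f.sales_person_id = s.sales_person_id
    WHERE d.quarter = '2010-Q3'
      AND s.region IN ('Northern', 'Western')
    GROUP BY s.region, s.sales_person_name;

Each drill-down is a finer GROUP BY plus a slice (the WHERE clause) on the dimensions above it.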

The new OLAP is enabled by the same forces that are changing databases with NoSQL. Firstly, the rise of commodity hardware running Linux, the commodity operating system, allows the creation of cheap server farms that encourage parallel distributed processing. Secondly, the inevitable march of Moore's law is increasing the size of main memory, so that you can now spec a commodity server with more main memory than a commodity server had in disk space 10 years ago. An OLAP data cube can be partitioned along one or more of its dimensions to be distributed over a server farm, although at this stage partitioning is more of a research topic than standard practice. Huge main memory allows large cubes to reside in main memory, giving near instantaneous response to queries. For another perspective on in-memory OLAP, read the free commentary by Nigel Pendse at the BI-Verdict (it used to be called the OLAP Report) on "What in-memory BI ‘revolution’?"
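As a rough sketch of what partitioning along a dimension means, here is the idea written in declarative-partitioning SQL (PostgreSQL-style syntax with invented names, used purely as notation; a distributed OLAP server would spread these slices across machines rather than across tables in one database):

    -- A fact table hash-partitioned on the member dimension, so each
    -- partition owns an independent slice of the cube and a query that
    -- slices on member_id touches only one partition.
    CREATE TABLE page_view_fact (
        member_id  BIGINT  NOT NULL,
        date_id    INTEGER NOT NULL,
        view_count INTEGER NOT NULL
    ) PARTITION BY HASH (member_id);

    CREATE TABLE page_view_fact_p0 PARTITION OF page_view_fact
        FOR VALUES WITH (MODULUS 4, REMAINDER 0);
    -- ...and three more partitions with remainders 1, 2 and 3.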

LinkedIn is a fast growing, business-oriented social networking site. They have developed Avatara to support their business needs and currently run several cubes on it. The plan is to open source the code later this year. Avatara is an in-memory OLAP server that uses partitioning to provide scalability beyond the capabilities of a single server.

The presentation was fast paced and it has taken me some time to appreciate the full implications of what was said. Here are some thoughts. Avatara offers an API that is reminiscent of JOLAP rather than the MDX language that is the standard way of programming OLAP, probably because an API is easier to implement than a programming language. Avatara does not support hierarchies in its dimensions, but the number of dimensions in a typical cube seems to be higher than usual. It may be that they use more dimensions, rather than hierarchies within a dimension, to represent the same information. This trades roll-up within the cube for slicing on dimensions. Slicing is probably more efficient and easier to implement, while a hierarchy is easier for the user to understand as it allows for drill up and down.
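Here is the trade-off sketched in SQL terms, with an invented schema, since I can only guess at Avatara's internals. With a hierarchy, state, region and country are levels of a single sales person dimension, and roll-up is just a coarser GROUP BY; with flat dimensions, each level becomes its own dimension that you slice on:

    -- Hierarchy style: one dimension table carries all the levels,
    -- so drilling up or down is a change of GROUP BY granularity.
    SELECT s.country, s.region, SUM(f.sale_amount) AS total_sales
    FROM sales_fact f
    JOIN sales_person_dim s ON f.sales_person_id = s.sales_person_id
    GROUP BY s.country, s.region;

    -- Flat-dimension style: region is a dimension key on the fact
    -- itself, so the "roll-up" is really a slice on that dimension.
    SELECT f.region_id, SUM(f.sale_amount) AS total_sales
    FROM sales_fact f
    WHERE f.region_id IN (7, 12)    -- slice: just two regions
    GROUP BY f.region_id;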

Chris mentioned that most dimensions are small, and that can be true; however, the real problems with OLAP implementations start when you have more than one large dimension and have to deal with the issue of sparsity in the data cube. Chris spent some time on the problem of a dimension with more than 4 billion elements, and this seems to be a real requirement at LinkedIn. Current OLAP servers seem to be limited to 2 billion elements in a dimension, presumably because they identify elements with a signed 32-bit integer, which tops out at about 2.1 billion. So on this point they are going to be even more constraining than Avatara.

Sunday, October 24, 2010

Accidental Data Empires

In the new world of big data and analytics, a winning business model is to find a novel way to collect interesting big data. Once you have the data, the ways to exploit it are endless. It is a phenomenon that I have seen several times; the latest example is Flurry, a company that collects and aggregates data from mobile applications. Peter Farago, VP Marketing, and Sean Byrnes, CTO and Co-founder of Flurry, spoke to the October meeting of the SDForum Business Intelligence SIG on "Your Company’s Mobile App Blind Spot".

The Flurry proposition is simple: they offer a toolkit that an app developer combines with their mobile app. The app developer goes to the Flurry website, creates a free account and downloads the toolkit. Whenever an instance of the app with the Flurry code is activated or used, it collects information about the usage and sends it back to Flurry. The amount of information is small, usually about 1.2 kB compressed, so the burden of collection is small. At Flurry, the data is collected, cleansed and put in a gigantic data cube. At any time, an app developer can log into the Flurry website and get reports on how their application is being used. You can get a feel for their service by taking the short Analytics developer tour. Flurry has committed that their Analytics service will always be free.

While there are some issues with data collection that Flurry deals with, the quality of the data is great. Every mobile phone has a unique identifier, so there is no problem identifying individual usage patterns. As the service is free, there is very little friction to its use. Flurry estimates that their code is in one out of every five mobile apps out there. In fact, for an app developer, the only reason for not using Flurry is that they have chosen to use a rival data collection service.

In the end, however, the big winner is Flurry, which collects huge amounts of information about mobile app and phone usage. In the meeting, Peter Farago gave us many different analyses of where the mobile smartphone market is and where it is going, including adoption rates for iPhones versus Android-based phones and how the follow-on market for apps on each platform is developing. You can get a mouthwatering feel for the information they presented by looking at their blog, in which they publish a series of analyses from their data. As I write, their latest post shows a graph on the "Revenue Shift from Advertising to Virtual Goods Sales", which shows that apps are growing their revenue from sales of virtual goods, while advertising revenue seems to be stagnant.

With data aggregators, there is always something creepy about discovering just how much data they have on you. Earlier this year there was an incident where a Flurry blog post described some details of the iPad, gleaned from apps running on the new devices in Apple's offices, a few days before it was announced. Steve Jobs was so provoked by this that he called out Flurry by name and changed the iPhone app developer terms of service to prevent apps from collecting certain sorts of data. You can read more about this incident in the blog report on the meeting by my colleague Paul O'Rorke.

The title of this piece is a reference to the entertaining and still readable book Accidental Empires by Robert X. Cringely about the birth of the personal computer industry and the rivalry between Steve Jobs and Bill Gates.

Wednesday, October 13, 2010

A Critique of SQL

SQL is not a perfect solution, as I told the audience at the SDForum Business Intelligence SIG September meeting, where I spoke about "Analytics: SQL or NoSQL". The presentation discusses the difference between SQL and structured data on the one hand, and the NoSQL movement and semi-structured data on the other. There is more to the presentation than I can fit in one blog post, so here is what I had to say about the SQL language itself. I will write more about the presentation at another time. You can download the presentation from the BI SIG web site.

Firstly the good. SQL has given us a model of a query language that seems so useful as to be essential. Every system that provides persistence has developed a query language. Here is a smattering of examples. The Hibernate object persistence system has Hibernate Query Language (HQL), which has been developed into the Java Persistence Query Language (JPQL). Other Java-based object oriented persistence systems either use JPQL or their own variant. Hive is a query interface built on top of the Hadoop Map-Reduce engine. Hive was initially developed by Facebook as a simplified way of accessing their Map-Reduce infrastructure when they discovered that many of the people who needed to write queries did not have the programming skills to handle a raw Map-Reduce environment. XQuery is a language for querying a set of XML documents. It has been adopted into the SQL language and is also used with stand-alone XML systems. If data is important enough to persist, there is almost always a requirement to provide a simple and easy to use reporting system on that data. A query language handles the simple reporting requirements easily.
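For example, here is the kind of routine reporting query that a query language makes trivial (the schema is invented for illustration). Much the same text would run as SQL or HiveQL, and with object syntax as JPQL, which is rather the point: every persistence system converges on something like this.

    -- Top ten products by revenue for the current month
    SELECT product_name, SUM(sale_amount) AS revenue
    FROM sales
    WHERE sale_date >= '2010-10-01'
    GROUP BY product_name
    ORDER BY revenue DESC
    LIMIT 10;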

On the other hand, SQL has many problems. Here are my thoughts on the most important ones. The first problem is that SQL is not a programming language, it is a data access language. SQL is not designed for writing complete programs; it is intended to fetch data from the database, and anything more than a simply formatted report is done in another programming language. This concept of a data access language for accessing a database goes back to the original concept of a database as promulgated by the CODASYL committee in the late 1960s.

While most implementations of SQL add extra features to make it a complete programming language, they do not solve the problem, because SQL is a language unlike any of the other programming languages we have. Firstly, SQL is a relational language. Every statement in SQL starts with a table and results in a table. (Table means a table as in a document: a fixed number of columns and as many rows as are required to express the data.) This is a larger chunk of data than programmers are used to handling. The procedural languages that interface to SQL expect to deal with data at most a row at a time. Also, the rigid table of SQL does not fit well into the more flexible data structures of procedural languages.

Moreover, SQL is a declarative language, where you specify the desired results and the database system works out the best way to produce them. Our other programming languages are procedural, where you describe to the system how it should produce the desired result. Programming SQL requires a different mindset from programming in procedural languages. Many programmers, most of whom just dabble in SQL as a sideline, have difficulty making the leap and are frustrated by SQL because it is just not like the programming languages that they are used to. The combination of a relational language and a declarative language creates a costly mismatch between SQL and our other programming systems.
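A small illustration of both points at once, with invented tables. The declarative, table-at-a-time statement says what should happen to an entire set of rows and leaves the how to the optimizer:

    -- One statement raises prices across a whole category. The
    -- optimizer, not the programmer, decides whether to use an index,
    -- in what order to touch rows, and so on.
    UPDATE products
    SET price = price * 1.10
    WHERE category_id = 42;

    -- The loop a procedural programmer would naturally write instead:
    --   for each row in products:
    --       if row.category_id == 42:
    --           row.price = row.price * 1.10
    --           save(row)
    -- is exactly the row-at-a-time mindset that SQL does not share.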

Finally, SQL becomes excessively wordy, repetitive and opaque as queries become more complicated. Subqueries start to abound, and the need for correlated subqueries, outer joins and pivoting data for presentation causes queries to explode in length and complexity. Analytics is the province of complicated queries, so this is a particular problem for data analysts. In the past I have suggested that persistence is a ripe area for a new programming language; however, although there are many new languages being proposed, none of them are concerned with persistence or analytics. The nearest thing to an analytics programming language is R, which is powerful but neither new nor easy to use.
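To give a taste of the problem (again with an invented schema), even a modest analytic question such as "show each sales person's latest sale, including the people who have not sold anything yet" already demands an outer join and a correlated subquery:

    SELECT s.sales_person_name, f.sale_amount, f.sale_date
    FROM sales_person_dim s
    LEFT OUTER JOIN sales_fact f
      ON  f.sales_person_id = s.sales_person_id
      AND f.sale_date = (SELECT MAX(f2.sale_date)   -- correlated subquery
                         FROM sales_fact f2
                         WHERE f2.sales_person_id = s.sales_person_id);

Each further question of this shape adds another correlated block, and the query grows linearly in text while growing much faster in opacity.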

Wednesday, October 06, 2010

Vertical Pixels are Disappearing

The quality of monitors for PCs is going backwards. A few years ago, noticing the rise of the widescreen monitor and fearful that all reasonably proportioned monitors would soon disappear, I bought a Samsung Syncmaster 204B (20.1" screen, 1600x1200 pixels). Last night it popped and stopped working. When I went online to research a replacement monitor, the full gravity of the situation became obvious.

Not only is it virtually impossible to find a monitor that is not widescreen, almost all monitors that you can buy, whatever the size of their screen, are 1920x1080 pixels. In the years since I bought the 204B, the number of pixels that we get in the vertical direction has shrunk from 1200 to 1080! Funnily enough, there is a post on Slashdot this morning titled "Why are we losing our vertical pixels" about this very topic. The post has drawn many more than the usual number of comments.

For me, the vertical height of the screen is important. I use my computer for reading, writing, programming, editing media and some juggling with numbers. For each activity, having a good height to the screen helps, and width after a certain point does not add much. A television uses 1920x1080 pixels for a full 1080p display. The monitor manufacturers are just giving us monitors made from cheap LCD panels designed for televisions. When I watch TV, I have a much larger screen in another room with more comfortable chairs and more room between me and the screen. Thus, I do not need or want a computer monitor that is expressly designed for watching TV.

The real problem is that 1920x1080 monitors are so ubiquitous that it is difficult to find anything else. After a lot of searching I only found a couple of alternatives. Apple has a 27" widescreen monitor that is 2560x1440 pixels, costs ~$1000, and only works well with some recent Apple systems. Dell has a 20" monitor in their small business section that is 1600x1200 and costs ~$400. However, Dell seems to vary the type of LCD panel that they use between production runs, and one type of panel is a lot better than the other. Unfortunately, you do not know which type of panel you are going to get until it arrives at your door. Neither alternative gets me really excited. One thing is certain: technology is supposed to be about progress, and I am not going backwards and accepting fewer pixels in any dimension for my next monitor.