Saturday, August 28, 2010

Mad Skills for Big Data

Big Data is a big deal these days, so it was with great interest that we welcomed Brian Dolan to the SDForum Business Intelligence SIG August meeting to speak on "MAD Skills: New Analysis Practices for Big Data". MAD is an acronym for Magnetic, Agile, Deep, and as Brian explained, these skills are all important in handling big data. Brian is a mathematician who came to Fox Interactive Media as Lead Analyst. There he had to help the marketing group decide how to price and serve advertisements to users. As they had tens of millions of users, about whom they often knew quite a lot, and served billions of advertisements per day, this was a big data problem. They used a 40-node Greenplum parallel database system and also had access to a 105-node MapReduce cluster.

The presentation started with the three skills. Magnetic means drawing analysts in by giving them free rein over their data and the freedom to use their own methods. At Fox, Brian grappled with a buttoned-down DBA to establish his own private sandbox where he could access and manipulate his own data. There he could bring in his own data sets, both internal and external. Over time the analysts' group established a set of mathematical operations that could be run in parallel over the data in the database system, speeding up their analyses by orders of magnitude.
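To give a feel for why some mathematical operations parallelize so well, here is a minimal sketch of my own, not anything Brian showed. It assumes the data is split into segments, as on a parallel database, and that each segment can compute a small summary independently. The names `Partial`, `summarize`, `merge` and `mean_and_variance` are hypothetical and the sample numbers are made up.

```python
from dataclasses import dataclass
from functools import reduce


@dataclass(frozen=True)
class Partial:
    """Per-segment summary: count, sum and sum of squares (hypothetical type)."""
    n: int
    total: float
    total_sq: float


def summarize(chunk):
    # Runs independently on each segment, so segments can work in parallel.
    return Partial(len(chunk), sum(chunk), sum(x * x for x in chunk))


def merge(a, b):
    # Combining two summaries is plain addition, so the merge order does not matter.
    return Partial(a.n + b.n, a.total + b.total, a.total_sq + b.total_sq)


def mean_and_variance(p):
    # Population variance from the combined sums. This textbook formula can
    # lose precision on large values; it is used here only for clarity.
    mean = p.total / p.n
    return mean, p.total_sq / p.n - mean * mean


# Made-up data, pretending each inner list lives on a different node.
segments = [[1.0, 2.0], [3.0], [4.0, 5.0, 6.0]]
combined = reduce(merge, (summarize(s) for s in segments))
mean, variance = mean_and_variance(combined)
print(mean, variance)
```

The point of the design is that only three numbers per segment have to travel to the coordinator, however many rows each segment holds.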

Agile means analytics that adjust, react, and learn from your business. Brian talked about the virtuous cycle of analytics, where the analyst first acquires new data to be analyzed, then runs analytics to improve performance, and finally business practices change to suit what the analytics show. He talked through the issues at each step in the cycle and led us through a case study of audience forecasting at Fox which illustrated problems with sampling and scaling results.

Deep analytics is about producing more than reports. In fact, Brian pointed out that even data mining can concentrate on finding a single answer to a single problem, whereas big analytics needs to solve millions of problems at the same time. For example, he suggested that statistical density methods may be better at dealing with big analytics than other, more focused techniques. Another problem with deep analysis of big data is that, given the volume of data, it is possible to find data that supports almost any conclusion. Brian used the parable of the Zen Tea Cup to illustrate the issue. The analyst needs to be able to approach their analysis without preconceived notions, or they will just find exactly what they are looking for.

Of all the topics that came up during the presentation, the one that caused the most frissons in the audience was dirty data. Brian's experience has been that cleaning data can lose valuable information and that a good analyst can easily handle dirty data as a part of their analysis. When pressed by an audience member he said "well 'clean' only means that it fits your expectation". As an analyst is looking for the nuggets that do not meet obvious expectations, sanitizing data can lose those very nuggets. The recent trend to load data first and then do the cleaning transformations in the database means that the original data is in the database as well as the cleaned data. If that original data is saved, the analyst can do their analysis with either version as they please.
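Here is a small sketch of that load-then-clean pattern, using Python's built-in sqlite3 module as a stand-in for a real parallel database. The table `raw_events`, the view `clean_events`, the sample rows and the "valid age" rule are all hypothetical; the rule is exactly the kind of expectation Brian was warning about.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Load first: keep every value as text, exactly as it arrived.
conn.execute("CREATE TABLE raw_events (user_id TEXT, age TEXT)")
conn.executemany(
    "INSERT INTO raw_events VALUES (?, ?)",
    [("u1", "34"), ("u2", "-1"), ("u3", "unknown"), ("u4", "27")],
)

# Clean second, as a view: the raw rows are never modified or deleted.
# The validity rule below is an assumption made for illustration.
conn.execute(
    """
    CREATE VIEW clean_events AS
    SELECT user_id, CAST(age AS INTEGER) AS age
    FROM raw_events
    WHERE age GLOB '[0-9]*'
      AND CAST(age AS INTEGER) BETWEEN 0 AND 120
    """
)

raw_count = conn.execute("SELECT COUNT(*) FROM raw_events").fetchone()[0]
clean_count = conn.execute("SELECT COUNT(*) FROM clean_events").fetchone()[0]

# The rows that failed the rule are still there for the analyst to inspect.
rejected = [
    row[0]
    for row in conn.execute(
        "SELECT age FROM raw_events "
        "WHERE user_id NOT IN (SELECT user_id FROM clean_events) "
        "ORDER BY user_id"
    )
]
print(raw_count, clean_count, rejected)
```

Because the cleaning lives in a view rather than in the load step, changing your mind about what "clean" means is a one-line change and loses nothing.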

Mad Skills also refers to the ability to do amazing and unexpected things, especially in motocross riding. Brian's personal sensibilities were forged more in punk rock, so you could say that he showed us the "kick out the jams" approach to analytics. You can get the presentation from the BI SIG web site. The original MAD Skills paper was presented at the 2009 VLDB conference and a version of it is available online.
