Saturday, June 26, 2010

Winning With Big Data

Michael Driscoll gave us the Secrets of a Successful Data Scientist at the June meeting of the SDForum Business Intelligence SIG in his talk "Winning With Big Data". Michael is the founder of the data consultancy Dataspora, where he has worked on projects ranging from analyzing baseball pitchers to helping cell phone companies understand their customer churn. You can see slides for the talk here, and follow Michael's thoughts in his excellent blog on the Dataspora site.

After Michael revved up the crowd with the Hal Varian quote that "... the sexy job in the next ten years will be statisticians", he went through nine ways to win as a Data Scientist. His first suggestion is to use the right tools. Michael uses a variety of tools, including database systems, Hadoop and the R language. Large data takes a long time to process, and we can often gain insights by working with just a sample of the data. However, you have to be careful when taking a sample to ensure that it is representative and that the results will scale to the full data set. This leads to another way to win, which is to know, understand and use statistics.
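To make the point about sampling concrete, here is a minimal Python sketch (not something shown in the talk) that estimates a mean from a simple random sample and reports a standard error, so you can judge whether the sample result is likely to hold for the full data set. The function name `sample_mean_with_error`, the synthetic `call_minutes` data and the sample size are all assumptions made for illustration.

```python
import random
import statistics


def sample_mean_with_error(values, sample_size, seed=0):
    """Estimate the mean of `values` from a simple random sample.

    Returns (estimate, standard_error). The standard error ignores the
    finite-population correction, which is fine when the sample is a
    small fraction of the data.
    """
    rng = random.Random(seed)
    sample = rng.sample(values, sample_size)  # sampling without replacement
    estimate = statistics.mean(sample)
    standard_error = statistics.stdev(sample) / sample_size ** 0.5
    return estimate, standard_error


# Synthetic stand-in for a large data set (hypothetical call lengths in minutes).
data_rng = random.Random(42)
call_minutes = [data_rng.expovariate(1 / 3.0) for _ in range(100_000)]

estimate, stderr = sample_mean_with_error(call_minutes, sample_size=1_000)
full_mean = statistics.mean(call_minutes)

print(f"sample estimate: {estimate:.2f} +/- {stderr:.2f}")
print(f"full-data mean:  {full_mean:.2f}")
```

A simple random sample like this only works when every record is equally likely to be picked. If the data is skewed toward particular customers or time periods, a stratified sample may be a better choice.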

Statistics is a field of mathematics that is still developing, and it is not easy, but it is a core competence of a Data Scientist. It is not enough to do the analysis; the Data Scientist also has to be able to present the results and turn them into a compelling story. Both analysis and presentation require good visualization tools and the knowledge of how to use them.

To illustrate his ways to win, Michael led us through a specific example of a successful data analysis that he had done. He had been asked by a cell phone company to investigate customer churn. Although he looked at the data in several different ways, his successful analysis went as follows. The starting point was the Call Data Record (CDR), which records each call that a customer makes. Cell phone traffic generates billions of CDRs, so Michael first cut the data set down to a more manageable size by looking only at the CDRs for a single city. He then created social graphs of customers who call one another frequently, and was able to show that one customer dropping service was a predictor that other customers in the same social graph would also leave. The study ended with a clever visualization of connected customers leaving the cell phone provider.
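The general shape of the analysis described above can be sketched in a few lines of Python. This is an illustrative guess at the approach, not Michael's actual code: the `(caller, callee)` record layout, the `min_calls` threshold, the function names and the toy data are all assumptions, and the sketch ignores the timing of who left first, which a real study would need to account for.

```python
from collections import Counter, defaultdict


def build_call_graph(cdrs, min_calls=3):
    """Link two customers if they called each other at least `min_calls`
    times, counting calls in either direction."""
    pair_counts = Counter()
    for caller, callee in cdrs:
        if caller != callee:
            pair_counts[frozenset((caller, callee))] += 1

    graph = defaultdict(set)
    for pair, count in pair_counts.items():
        if count >= min_calls:
            a, b = tuple(pair)
            graph[a].add(b)
            graph[b].add(a)
    return dict(graph)


def churn_rate_by_exposure(graph, churned):
    """Return (rate_exposed, rate_unexposed): the churn rate among customers
    who have a churned contact, and among those who do not."""
    def rate(group):
        return sum(c in churned for c in group) / len(group) if group else 0.0

    exposed = [c for c in graph if graph[c] & churned]
    unexposed = [c for c in graph if not (graph[c] & churned)]
    return rate(exposed), rate(unexposed)


# Toy CDRs for one hypothetical city: each tuple is one call.
cdrs = (
    [("a", "b")] * 3 + [("b", "c")] * 3 +   # one group of frequent callers
    [("d", "e")] * 3 + [("e", "f")] * 4 +   # a second group
    [("a", "f")]                            # a single call, below the threshold
)
churned = {"a", "b"}

graph = build_call_graph(cdrs)
exposed_rate, unexposed_rate = churn_rate_by_exposure(graph, churned)
print(f"churn rate with a churned contact:    {exposed_rate:.2f}")
print(f"churn rate without a churned contact: {unexposed_rate:.2f}")
```

A higher churn rate in the first group than in the second is the kind of signal the talk described. On real data you would also want to test whether the difference is statistically significant.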
