Saturday, June 26, 2010

Winning With Big Data

Michael Driscoll gave us the secrets of a successful Data Scientist at the June meeting of the SDForum Business Intelligence SIG in his talk "Winning With Big Data". Michael is founder of the data consultancy Dataspora, where he has worked on projects ranging from analyzing baseball pitchers to helping cell phone companies understand their customer churn. You can see slides for the talk here, and follow Michael's thoughts in his excellent blog on the Dataspora site.

After Michael revved up the crowd with the Hal Varian quote that "... the sexy job in the next ten years will be statisticians", he went through 9 ways to win as a Data Scientist. His first suggestion is to use the right tools. Michael uses a variety of tools including database systems, Hadoop and the R language. Large data takes a long time to process, and often we can gain insights by working with just a sample of the data; however, you have to be careful when taking a sample to ensure that it makes sense and that the results will scale. Which leads us to another way to win: know, understand and use statistics.
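As a small illustration of the sampling point, here is a sketch (with made-up data, not from the talk) of drawing a simple random sample and sanity-checking that it reflects the full data set before trusting any downstream analysis:

```python
import random
import statistics

# Hypothetical data: call durations in seconds for a large population.
random.seed(42)
population = [random.expovariate(1 / 180) for _ in range(1_000_000)]

# Work with a 1% simple random sample instead of the full data set.
sample = random.sample(population, k=len(population) // 100)

# A basic check that the sample "makes sense": its mean should be
# close to the population mean, otherwise results will not scale up.
pop_mean = statistics.mean(population)
sample_mean = statistics.mean(sample)
print(f"population mean: {pop_mean:.1f}s, sample mean: {sample_mean:.1f}s")
```

In practice you would also check that the sampling unit is right (for example, sampling whole customers rather than individual calls) so that per-customer statistics are not biased.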

Statistics is a field of mathematics that is still developing, and it is not easy; however, statistics is a core competence of a Data Scientist. It is not enough to do the analysis: the Data Scientist has to be able to present the results and turn them into a compelling story. Both analysis and presentation require good visualization tools and the knowledge of how to use them.

To illustrate his ways to win, Michael led us through a specific example of a successful data analysis that he had done. He had been asked by a cell phone company to investigate customer churn. Although he looked at the data in several different ways, his successful analysis went as follows. The starting point was Call Data Record (CDR) which records each call that a customer makes. Cell phone traffic generates billions of CDRs, so Michael first cut the data set down to a more manageable size by just looking at the CDRs for a single city. He then created social graphs between customers that call one another frequently, and was able to show that if one customer dropped service it was a predictor that other customers in that social graph would also leave the service. The study ended with a clever visualization of connected customers leaving the cell phone provider.
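The social-graph idea can be sketched in a few lines. This is a toy reconstruction with invented customer names, not Michael's actual analysis: build a graph between customers who call each other frequently, then flag the neighbors of a churned customer as at elevated risk.

```python
from collections import defaultdict

# Toy call records: (caller, callee) pairs derived from CDRs.
calls = [
    ("alice", "bob"), ("alice", "bob"), ("alice", "bob"),
    ("bob", "carol"), ("bob", "carol"),
    ("dave", "erin"),
]

# Count calls per unordered pair of customers.
weights = defaultdict(int)
for a, b in calls:
    weights[frozenset((a, b))] += 1

# Build an undirected social graph between customers who call one
# another at least twice (a stand-in for "frequently").
graph = defaultdict(set)
for pair, count in weights.items():
    if count >= 2:
        a, b = sorted(pair)
        graph[a].add(b)
        graph[b].add(a)

# If customers churn, their neighbors become churn-risk candidates.
def at_risk(churned):
    return set().union(*(graph[c] for c in churned)) - set(churned)

print(at_risk({"bob"}))  # alice and carol, but not dave or erin
```

A real analysis would of course weight the prediction by call frequency and recency rather than using a hard threshold.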

Thursday, June 24, 2010

Which Cloud Standards Matter?

The SDForum Cloud Services SIG June meeting was a panel session with multiple speakers devoted to the question "Which Cloud Standards Matter?". The answer came through loud and clear as speaker after speaker discussed the Open Virtualization Format (OVF). No other standard got more than a mention or so.

OVF is a container format that defines the contents of a virtual machine. It is simply a set of files in a directory together with an XML descriptor file. The standard is managed by the Distributed Management Task Force (DMTF). Panel speaker Priya Ketkar of Abiquo showed OVF being used to move a virtual machine from one cloud service provider to another. Winston Bumpus, the final panel speaker, is President of the DMTF and Director of Standards Architecture for VMware. He made a convincing case for the DMTF and its management of the OVF standard.
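To give a flavor of the XML descriptor, here is a heavily simplified sketch being read with Python's standard library. The descriptor below is illustrative only; a real .ovf file carries many more sections (DiskSection, VirtualHardwareSection, a manifest, and so on) as defined by the DMTF specification.

```python
import xml.etree.ElementTree as ET

# A heavily simplified OVF descriptor, for illustration only.
descriptor = """\
<Envelope xmlns="http://schemas.dmtf.org/ovf/envelope/1"
          xmlns:ovf="http://schemas.dmtf.org/ovf/envelope/1">
  <References>
    <File ovf:id="file1" ovf:href="vm-disk1.vmdk"/>
  </References>
  <VirtualSystem ovf:id="my-vm">
    <Info>A single virtual machine</Info>
  </VirtualSystem>
</Envelope>
"""

ns = {"ovf": "http://schemas.dmtf.org/ovf/envelope/1"}
root = ET.fromstring(descriptor)

# The References section lists the files (e.g. disk images) that sit
# alongside the descriptor in the package directory.
files = [f.get("{http://schemas.dmtf.org/ovf/envelope/1}href")
         for f in root.findall("ovf:References/ovf:File", ns)]
print(files)
```

It is this self-describing package of descriptor plus disk images that lets a virtual machine be handed from one cloud provider to another.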

Another panel member, James Urquhart of Cisco, mentioned several standards including OVF; however, he spent considerable time on XMPP, surely the most unlikely standard for cloud computing. I discussed XMPP some time ago. It is a standard for exchanging instant messages and Twitter feeds between large service providers. While it is a useful standard, I do not see its place in cloud computing. If you can explain how XMPP helps cloud computing, please enlighten me.

Sunday, June 13, 2010

Reporting from the Production Database

Doing analytics directly out of the production database: for me, this was the interesting story that emerged from the talk on "Real Time Analytics at" at the May meeting of the SDForum Business Intelligence SIG. Note that this post is not a report on the meeting; rather, it is a reflection on a topic that came up during the meeting. Both my co-chair Paul O'Rorke and SIG member James Downey have written great summaries of the meeting.

Directly reporting from a production database is an issue that comes up from time to time. Deciding whether to do it is a two-step process. The first question to ask is whether it is possible. A database can be oriented to report the current state of affairs, or alternatively to contain a record of how we got to that state. In practice we need both views, and it is common to have a production database that is oriented to maintaining the current status and a data warehouse that maintains the historical record. Typically an enterprise has several databases with production information, and the historical record is combined in a single reporting data warehouse.
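The current-state versus historical-record distinction can be made concrete with a toy example (table and column names are invented for illustration): the production table is overwritten in place, while the warehouse table only ever appends.

```python
import sqlite3

conn = sqlite3.connect(":memory:")

# Production table: one row per customer, overwritten in place.
conn.execute(
    "CREATE TABLE account_status (customer TEXT PRIMARY KEY, status TEXT)")

# Warehouse table: append-only record of how we got to the current state.
conn.execute(
    "CREATE TABLE status_history (customer TEXT, status TEXT, changed_on TEXT)")

def set_status(customer, status, when):
    # Production keeps only the current state...
    conn.execute("INSERT OR REPLACE INTO account_status VALUES (?, ?)",
                 (customer, status))
    # ...while the warehouse keeps every change.
    conn.execute("INSERT INTO status_history VALUES (?, ?, ?)",
                 (customer, status, when))

set_status("acme", "trial", "2010-01-01")
set_status("acme", "paying", "2010-03-01")

current = conn.execute(
    "SELECT status FROM account_status WHERE customer = 'acme'").fetchone()[0]
history = conn.execute(
    "SELECT COUNT(*) FROM status_history WHERE customer = 'acme'").fetchone()[0]
print(current, history)
```

The production query sees one row; the reporting query sees the whole trajectory.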

The tension between the requirements for production and reporting databases shows up in a number of ways. Production needs fast transaction execution. One way to achieve this is to make the database small, cutting out anything that is not really needed. On the other hand, we want to keep as much information as possible for reporting, so that we can compare this time period with a year ago or maybe even two years ago. Reporting wants a simple database structure, like a star schema, that makes it straightforward to write ad-hoc queries that generate good answers. Production databases tend to have more interlinked structures. The company in the talk is in the business of Customer Relationship Management (CRM), where it is useful to keep the historical record of interactions with each customer. As it keeps the historical record in its production database, reporting from that database makes perfect sense. In fact, much of the impetus for real-time data warehousing has come from CRM-like applications. One common example is where a business wants to drive call center applications from data in its data warehouse.

The next question is whether it is a good idea to combine reporting and production queries in the same database. Production queries are short, usually reading a few records and then updating or inserting a few records. Reporting queries are read-only, but they are longer running and may touch many records to produce aggregate results. A potential issue is that a long-running reporting query may interfere with production queries and prevent them from doing their job. This is the other major reason for reporting from a database separate from the production database.

The Oracle database used by the company has multi-version read consistency, so read-only queries do not lock out queries that update the database. Also, as came out in the presentation, the company has a multi-tenant database where each customer customizes their use of data fields in different ways. Because of this, they sometimes copy data out of the big table into a smaller temporary table to transform it into the form that the customer's query expects. Making a copy of the relevant data for further massaging is a common tactic in data reporting tools, so this is not unusual. It also gets the reporting data out of the way of the production data so the two do not interfere with one another.
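The copy-and-transform step might look like the following sketch. The multi-tenant layout and column names here are invented for illustration: generic "flex" columns whose meaning differs per tenant are pulled into a small temporary table with the types and names that one tenant's queries expect.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# A multi-tenant table: generic flex columns, meaning varies per tenant.
conn.executescript("""
    CREATE TABLE big_table (tenant TEXT, flex1 TEXT, flex2 TEXT);
    INSERT INTO big_table VALUES
        ('tenant_a', '42',   'open'),
        ('tenant_a', '17',   'closed'),
        ('tenant_b', 'blue', 'XL');
""")

# Copy one tenant's rows into a small temporary table, casting the
# generic columns into the shape that tenant's queries expect.
conn.executescript("""
    CREATE TEMP TABLE tenant_a_cases AS
    SELECT CAST(flex1 AS INTEGER) AS case_id, flex2 AS status
    FROM big_table WHERE tenant = 'tenant_a';
""")

open_cases = conn.execute(
    "SELECT COUNT(*) FROM tenant_a_cases WHERE status = 'open'").fetchone()[0]
print(open_cases)
```

The reporting queries then run against the small, well-typed temporary table instead of scanning the shared table repeatedly.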

Finally, the company is large enough that it can afford the luxury of a performance team whose sole purpose is to look at the queries that take the longest to run or use up the most resources. Any database application requires some performance tuning; however, it is especially important when reporting from a production database.

Thursday, June 10, 2010

Google's Got Background

Go away for a few days and when I come back, Google looks like Bing. Instead of a restful blank page, there was a background picture. Arrgh! Fortunately, it lasted less than a day, and then we went back to the blank page we knew and loved.

Actually, it is very clever. Firstly, it tells people who might be attracted to Bing by its customizable look that they can do the same thing with Google. Secondly, and more importantly, it encourages people to create and log in to a Google account so that they can customize their Google home page. Google can give you a better search experience when it knows who you are, and it can make more money from the advertisements that are pitched at you when it knows who you are.

I thought of customizing the page to something less distracting, then realized that I would have to give up something of my identity in exchange. On weighing this transaction, I decided that what I would give up outweighed the benefit, particularly since the backlash would probably make the background image a short-lived experiment.