Sunday, June 13, 2010

Reporting from the Production Database

Salesforce.com does their analytics directly out of their production database. For me, this was the interesting story that emerged from the talk on "Real Time Analytics at Salesforce.com" at the May meeting of the SDForum Business Intelligence SIG. Note that this post is not a report on the meeting, rather it is a reflection on a topic that came up during the meeting. Both my co-chair Paul O'Rorke and SIG member James Downey have written great summaries of the meeting.

Directly reporting from a production database is an issue that comes up from time to time. Deciding on whether to do it is a two step process. The first question is to ask whether it is possible. A database can be oriented to report the current state of affairs or alternatively to contain a record of how we got to the current state of affairs. In practice we need both views, and it is common to have a production database that is oriented to the maintaining the current status and a data warehouse that maintains the historical record. Typically an enterprise has several databases with production information and the historical record is combined in a single reporting data warehouse.

The tension between the requirements for production and reporting databases shows up in a number of ways. Production needs a fast transaction execution. One way to achieve this is to make the database small, cutting out anything that is not really needed. On the other hand, we want to keep as much information as possible for reporting, so that we can compare this time period with a year ago or maybe even two years ago. Reporting wants a simple database structure like a star schema that makes it straightforward to write ad-hoc queries that that generate good answers. Production databases tend to have more interlinked structures.

Salesforce.com is in the business of Customer Relationship Management (CRM), where it is useful to keep the historical record of interactions with each customer. As Salesforce.com has the historical record in their production database, reporting from that database makes perfect sense. In fact much of the impetus for real time data warehousing has come from CRM like applications. One common example is where a business wants to drive call center applications from data in their data warehouse.

The next question is whether it is a good idea to combine reporting and production queries in the same database. Production queries are short, usually reading a few records and then updating and inserting a few records. Reporting queries are read only, but they are longer running and may touch many records to produce aggregate results. A potential issue is that a longer running reporting query may interfere with production queries and prevent them from doing their job. This is the other major reason for doing reporting from a separate database than the production database.

The Oracle database used by Salesforce.com has optimistic read locking so that read only queries do not lock out queries that update the database. Also, as came out in the presentation, Salesforce.com has a multi-tenant database where each customer customizes their use of data fields in a different ways. Because of this, they sometimes copy the data out of the big table into a smaller temporary table to transform the data into the form that the customers query expects. Making a copy of the relevant data for further massaging is a common tactic in data reporting tools so this is not unusual. It also gets the reporting data out of the way of production data so they two do not interfere with one another.

Finally, Salesforce.com is large enough that they can afford a luxury of having a performance team whose sole purpose is to look at queries that take the longest to run or use up the most resources. Any database application requires some performance tuning, however it is especially important when doing reporting from a production database.

No comments: