Sunday, April 22, 2007

The 60 Hour Data Warehouse Implementation

There was a lot of interesting stuff in the presentation by Stephen Bay to the SDForum Business Intelligence SIG on "Large Scale Detection of Irregularities in Accounting Data". However, the one thing that really struck me was his team's 60 hour data warehouse implementation.

Stephen and his colleagues at the PricewaterhouseCoopers Center for Advanced Research have built a system called Sherlock for detecting fraud in accounting data by applying several analytic techniques. Sherlock works by looking at the general ledger of the business. A general ledger is typically several gigabytes of data and may be fed by sub-ledgers that can run into the hundreds of gigabytes. Before Sherlock can do its analytics, the team has to get the accounting data into a standard form in a data warehouse. Sherlock is used during an accounting audit, which typically lasts a month, so there is great pressure to get the data warehouse implemented in as short a time as possible.

So how do they do it? Firstly, the schema of the data warehouse is fixed. The PricewaterhouseCoopers team have developed a standard data warehouse design for a general ledger that is applicable to all non-financial businesses. The data warehouse design is open source and is available from IPHIX. Secondly, the general ledger data usually comes from an SAP, Oracle or PeopleSoft ERP system, so some of the connections can be prebuilt. The problem with ERP systems is that they are heavily customized for each user, so the Sherlock team have implemented a GUI tool for building a mapping between ERP content and the data warehouse. The tool is designed for business people and accountants to use, so that the data warehouse can be built by people with domain knowledge but no technical knowledge.
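To make the mapping idea concrete, here is a minimal sketch of what such a mapping might boil down to once the GUI has done its work: a table that says which source column feeds which column of the fixed schema, and how to convert the value. This is my own illustration, not Sherlock's actual format. The source column names (ACCT_NO, POST_DT, AMT), the target column names and the helper function are all hypothetical.

```python
from datetime import datetime
from decimal import Decimal

# Hypothetical mapping for one customized ERP installation:
# source column -> (column in the fixed warehouse schema, value converter).
FIELD_MAP = {
    "ACCT_NO": ("account_id", str.strip),
    "POST_DT": ("posting_date", lambda s: datetime.strptime(s, "%Y%m%d").date()),
    "AMT": ("amount", Decimal),
}


def to_standard_row(erp_row, field_map=FIELD_MAP):
    """Translate one ERP record into the fixed warehouse layout.

    Columns that are not in the mapping are dropped; a mapped column
    that is absent from the record raises an error.
    """
    missing = [src for src in field_map if src not in erp_row]
    if missing:
        raise KeyError(f"ERP row is missing mapped fields: {missing}")
    return {
        target: convert(erp_row[src])
        for src, (target, convert) in field_map.items()
    }


# Example record with made-up values, including a site-specific extra column.
row = to_standard_row(
    {"ACCT_NO": " 4000 ", "POST_DT": "20070422", "AMT": "125.50", "EXTRA": "x"}
)
print(row)
```

Because the target schema never changes, only the mapping table has to be rebuilt for each new client, which is presumably where much of the time saving comes from.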

With all this, they claim that they can build a data warehouse in 60 hours for a new implementation and 20 hours for a repeat implementation. Contrast 60 hours with a typical data warehouse project, which takes many months, while a large one can easily take a year or two to implement. So a 60 hour implementation is an astonishing achievement.
