Wednesday, October 18, 2006

Data Mining and Business Intelligence

After writing about the Netflix prize, I got to thinking about data mining the Netflix data set. On reflection, the problem seemed intractable, at least until I attended the SDForum Business Intelligence SIG to hear Paul O'Rorke talk on "Data Mining and Business Intelligence".

Paul covered several topics in his talk, including the standard CRISP-DM data mining process and a couple of data mining problem areas with their algorithms. One problem area was frequent item-set mining. This is used for applications like market basket analysis, which looks for items that are frequently bought together.
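To make the idea of "items that are frequently bought together" concrete, here is a minimal sketch in Python. It is not from Paul's talk: the function name `frequent_pairs`, the `min_support` threshold, and the baskets are all made up for illustration, and it only counts pairs by brute force rather than using a real frequent item-set algorithm.

```python
from collections import Counter
from itertools import combinations

def frequent_pairs(baskets, min_support):
    """Return {pair: count} for item pairs found in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # sorted(set(...)) drops duplicates and gives each pair one canonical order
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}

# Hypothetical baskets, for illustration only
baskets = [
    ["milk", "cookies", "bread"],
    ["milk", "cookies"],
    ["beer", "diapers"],
    ["beer", "diapers", "milk"],
]
result = frequent_pairs(baskets, min_support=2)
print(result)
```

Counting every pair like this gets expensive quickly as baskets and item-sets grow, which is the motivation for more compact approaches.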

In the meeting we spent some time discussing what market basket analysis is for. Of course, 'beer and diapers' came up. The main suggested use was store design. If someone who buys milk is likely to buy cookies, then the best way to design the store is to put milk and cookies at opposite ends, so that the customer has to walk past all the other shelves with their tempting wares on a simple milk and cookies shopping run. I am sure that there are other, more sophisticated uses of market basket analysis. I know of at least one company that has made a good business out of it.

To get back to Netflix, there are similarities between an online movie store and a grocery store. Both have a large number of products, both have a large number of customers, and any particular customer will only get a small number of the products. For the supermarket we are interested in understanding what products are bought together in a basket, while for Netflix we are interested in the slightly more complex issue of predicting how a customer will rate a movie.

Paul showed us the FP-Tree data structure and walked through part of the FP-growth algorithm that uses it. The FP-Tree will only represent the fact that Netflix users have rated movies. As it stands, it cannot also represent the users' ratings. However, it is a good starting point, and there are several implementations available. Also, Netflix could easily use the FP-growth algorithm to recommend movies ("Other people who watched your movie selections also watched ...").
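For readers who have not seen an FP-Tree, here is a rough sketch of how one might be built in Python. It is my own simplification, not the code Paul showed: the `FPNode` class, the `build_fp_tree` function, and the movie names are hypothetical, and it covers only tree construction, leaving out the header table and the FP-growth mining step. Each basket is the set of movies a user has rated, which matches the point above that the tree records that a movie was rated but not the rating itself.

```python
from collections import Counter

class FPNode:
    """One node of the prefix tree: an item plus how many baskets pass through it."""
    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}

def build_fp_tree(transactions, min_support):
    # Pass 1: count how many transactions contain each item
    freq = Counter(item for t in transactions for item in set(t))
    keep = {item for item, n in freq.items() if n >= min_support}

    root = FPNode(None)
    # Pass 2: insert each transaction with its most frequent items first,
    # so that common prefixes share nodes (ties broken by item name)
    for t in transactions:
        items = sorted((i for i in set(t) if i in keep),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root

# Hypothetical rating histories: each list is the movies one user has rated
histories = [
    ["A", "B", "C"],
    ["A", "B"],
    ["A", "C"],
    ["B", "D"],
]
tree = build_fp_tree(histories, min_support=2)
print(tree.children["A"].count)                # baskets containing A
print(tree.children["A"].children["B"].count)  # baskets containing A and B
```

Movie D is rated by only one user, so it falls below the support threshold and never enters the tree. The counts on the nodes say how many users share a prefix of movies, but nothing about what ratings they gave.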

1 comment:

Yun said...

No doubt that using FP trees can help make recommendations. There are a few questions that need to be addressed before implementation. First is what exactly constitutes a market basket? You could consider a user's entire history as a single market basket. Or you could consider a cluster of three or four movies as a market basket (three or four rented at the same time). Each has its advantages. The main problem with an FP tree is that it doesn't immediately get you to a ranking.

Yes, I know that I lack imagination, and the likelihood of winning the prize is lower as a result. But I'm having trouble seeing how FP trees are going to produce a ranking prediction. I would love to see someone prove me wrong, since it is now primarily an empirical question.