After writing about the Netflix Prize, I got to thinking about data mining the Netflix data set. On reflection the problem seemed intractable, that is, until I attended the SDForum Business Intelligence SIG to hear Paul O'Rorke talk on "Data Mining and Business Intelligence".
Paul covered several topics in his talk, including the standard CRISP-DM data mining process and a couple of data mining problem areas with their algorithms. One problem area was frequent item-set mining. This is used for applications like market basket analysis, which looks for items that are frequently bought together.
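To make "items that are frequently bought together" concrete, here is a minimal sketch of the brute-force version of the idea: count every pair of items that shares a basket and keep the pairs that reach a minimum count. The function name `frequent_pairs`, the `min_support` parameter and the sample baskets are my own illustrations, not something from Paul's talk, and real item-set miners avoid enumerating every pair like this.

```python
from collections import Counter
from itertools import combinations


def frequent_pairs(baskets, min_support):
    """Return item pairs that occur together in at least min_support baskets."""
    counts = Counter()
    for basket in baskets:
        # De-duplicate and sort so each pair is counted once per basket
        # and always appears in the same order.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return {pair: n for pair, n in counts.items() if n >= min_support}


# Hypothetical shopping baskets, purely for illustration.
baskets = [
    ["milk", "cookies", "bread"],
    ["milk", "cookies"],
    ["beer", "diapers"],
    ["beer", "diapers", "milk"],
]

result = frequent_pairs(baskets, min_support=2)
print(result)
```

With these baskets only two pairs survive the threshold: cookies with milk, and beer with diapers. The cost grows quickly with basket size, which is the motivation for smarter structures such as the FP-Tree.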
In the meeting we spent some time discussing what market basket analysis is for. Of course, 'beer and diapers' came up. The main suggested use was store design. If someone who buys milk is likely to buy cookies, then the best way to design the store is with milk and cookies at opposite ends, so that the customer has to walk past all the other shelves with their tempting wares while on a simple milk and cookies shopping run. I am sure that there are other, more sophisticated uses of market basket analysis. I know of at least one company that has made a good business out of it.
To get back to Netflix, there are similarities between an online movie store and a grocery store. Both have a large number of products, both have a large number of customers, and any particular customer will only ever get a small number of the products. For the supermarket we are interested in understanding what products are bought together in a basket, while for Netflix we are interested in the slightly more complex issue of predicting how a customer will rate a movie.
Paul showed us the FP-Tree data structure and walked us through part of the FP-Growth algorithm that uses it. As it stands, the FP-Tree can only represent the fact that Netflix users have rated movies; it cannot also represent the users' ratings. However, it is a good starting point, and there are several implementations available. Also, Netflix could easily use the FP-Growth algorithm to recommend movies ("Other people who watched your movie selections also watched ...").
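For readers who have not seen an FP-Tree, the following is a minimal sketch of how one can be built, treating the set of movies each user has rated as a "basket". It makes two passes over the data: one to count the movies, and one to insert each user's movies, most frequent first, into a prefix tree with a count on every node. The names `FPNode` and `build_fp_tree` and the sample data are hypothetical. The sketch also leaves out the header table and node links that a full FP-Tree needs for the FP-Growth mining step, and, as noted above, it records only that a movie was rated, not the rating.

```python
from collections import Counter


class FPNode:
    """One node of the prefix tree: an item plus how many baskets pass through it."""

    def __init__(self, item, parent=None):
        self.item = item
        self.count = 0
        self.parent = parent
        self.children = {}


def build_fp_tree(transactions, min_support):
    # Pass 1: count how many transactions contain each item.
    freq = Counter(item for t in transactions for item in set(t))
    keep = {item for item, n in freq.items() if n >= min_support}

    root = FPNode(None)
    # Pass 2: insert each transaction as a path, most frequent item first,
    # so that transactions with common items share a prefix.
    for t in transactions:
        items = sorted((i for i in set(t) if i in keep),
                       key=lambda i: (-freq[i], i))
        node = root
        for item in items:
            child = node.children.get(item)
            if child is None:
                child = FPNode(item, node)
                node.children[item] = child
            child.count += 1
            node = child
    return root


# Hypothetical data: the movies each of four users has rated.
rated = [
    ["Alien", "Blade Runner", "Casablanca"],
    ["Alien", "Blade Runner"],
    ["Alien", "Casablanca"],
    ["Blade Runner", "Dune"],
]

root = build_fp_tree(rated, min_support=2)
alien = root.children["Alien"]
print(alien.count, alien.children["Blade Runner"].count)
```

Here three users share the "Alien" prefix and two of them continue on to "Blade Runner", so the tree stores those paths once with counts rather than once per user. "Dune" is dropped because only one user rated it.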