Thursday, July 17, 2008

A Gentle Introduction to R

We were given a gentle introduction to the R statistical programming language and its application in Business Intelligence at the July meeting of the SDForum Business Intelligence SIG. The speakers were Jim Porzac (Senior Director of Analytics at Responsys) and Michael Driscoll (Principal at Dataspora). Jim has posted the presentation here.

R is an Open Source project released under the GNU General Public License. It has a growing user base with a strong support community and a user group (called UseR Group - try Googling that). There are now almost 1500 packages for the language that support various statistical techniques and specialized application areas. Package areas include Bayesian, Econometrics, Genetics, Machine Learning, Natural Language Processing, Pharmacokinetics and Psychometrics, which gives some idea of the range of subjects and techniques that R covers.

Jim did most of the talking, introducing the language and showing us some examples of its use. One example is the data quality package that he runs on each new dataset he receives for analysis at Responsys. Another example showed R's reporting capabilities, while a third showed sophisticated graphs and plots used for customer segmentation analysis. Michael showed us how he used R to do some interesting and very practical analyses of baseball statistics.

The audience probed R's strengths and weaknesses. R has the connectivity to get data for analysis from databases and other sources. R also has excellent graphing and reporting capabilities. Currently R works by reading data into memory, where it is manipulated, so the largest data set that can be analyzed is bounded by available memory, which in practice means the multi-gigabyte range.
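To get a feel for that in-memory limit, here is a back-of-the-envelope sketch of my own (not something shown at the talk). It is written in Python purely as arithmetic. It assumes every value is stored as an 8-byte double, which is how R holds ordinary numeric data, and it ignores the extra working copies that an analysis typically makes. The function name and the example table sizes are made up for illustration.

```python
BYTES_PER_DOUBLE = 8  # assumption: every value is an 8-byte double, as in an R numeric vector


def estimated_gib(rows: int, cols: int) -> float:
    """Rough in-memory size, in GiB, of a rows x cols table of doubles."""
    return rows * cols * BYTES_PER_DOUBLE / 2**30


if __name__ == "__main__":
    # Hypothetical table sizes, chosen only to show the scale involved.
    for rows, cols in [(1_000_000, 50), (100_000_000, 50)]:
        size = estimated_gib(rows, cols)
        print(f"{rows:>11,} rows x {cols} cols ~ {size:.1f} GiB")
```

By this estimate a million rows of 50 numeric columns fits comfortably in memory, while a hundred million rows of the same width would need tens of gigabytes before any analysis even starts.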

One person asked for a comparison with SAS. R has the advantages of being free and of having an enthusiastic user base that keeps it on the cutting edge. R is also a more coherent language than SAS, which is a collection of libraries, each of which may be very good on its own, but which do not necessarily add up to a coherent whole.

Jim and Michael are starting a Bay Area chapter of the UseR Group. If you are interested, contact Jim Porzac at Responsys.

4 comments:

miked98 said...

Hi Rich - As you mentioned, Jim and I have created a Bay Area R Users group that interested folks can join at http://groups.google.com/group/bayareaR.

Thanks for coming on Tuesday and look forward to seeing you at the next event.

Jim Porzak said...

Hey Richard,

Thanks for posting this summary!

Always great fun to present at BI Sig.

You just reminded me of the SAS question. Bob Muenchen has created a great resource for SAS (& SPSS) users interested in R - basically a Rosetta Stone for stat geeks!

Free pdf at: http://oit.utk.edu/scc/RforSASandSPSSusers.pdf

And it's being picked up by Springer: http://rforsasandspssusers.com/

For folks interested in the "use R Group", also see http://ia.meetup.com/67/

Best, Jim Porzak

Mike Hogan said...

Rich,

My apologies up front, but my company www.scaledb.com is newly funded and we're bringing shared-everything clustering to MySQL. We are looking for database internals developers like you...know anyone? Again, my apologies.

Mike Hogan said...

I'm mike (at) scaledb.com