Saturday, July 11, 2009
Graphs That Suck
Many years ago in the early days of the web, I learned about web site design by reading "Web Pages That Suck: Learn Good Design by Looking at Bad Design". It is a delightfully easy beginner-level crawl through web site design, filled with examples ranging from excellent to awful with a capital 'A'. I would recommend the book today, except that the examples that make up the bulk of the book are way out of date.
For Business Intelligence the equivalent would be a book called something like "Graphs that Suck", and Stephen Few's Perceptual Edge blog is a good place to find examples of the genre. Recently they posted a spectacularly bad example, a pie chart put out by Business Objects to promote a user conference. I will not repeat the critique; however, I will say that if this is an example of what Business Objects thinks their software should be used for, I would be leery of using it!
Friday, July 03, 2009
Musician Uses Twitter to Her Advantage, Shock Horror Probe
Technology is turning the music business upside down, like any other media business. Some people embrace the change and some people decry it. When I read a post like this one about using Twitter to make money, I always read the comments. Whether the post is at the Berklee School of Music or TechCrunch, the range of responses is wide and consistent. Some commenters accept the new world and cheer it on, while others complain bitterly. Typical complaints range from: "I cannot do that because I do not have any fans" through "people should respect copyright and give me the money I am due" to "the record company put you there so you should give it all back to them".
The most ridiculous response is the complaint that a musician who spends time developing their fan base is wasting time that could be better spent on creative activities. The point of the Amanda Palmer post is that if you are properly organized, it does not take a lot of time or effort to keep in contact with your fans, particularly when using new instant communication tools like Twitter.
Technology changes. Music is no longer distributed as sheets of paper or by stamping it on 5, 7 or 12 inch pieces of plastic. The business model must change with the times.
The moving finger [of technology change] writes; and having writ,
Moves on: nor all your piety nor wit
Shall lure it back to cancel half a line,
Nor all your tears wash out a word of it.
HT to Roger for the Berklee post.
Tuesday, June 30, 2009
Actors and Concurrent Computation
Carl Hewitt was in fine form when he spoke about the Actor model of concurrent computation at the SDForum SAM SIG meeting on "Actor Architectures for Concurrent Computation". The meeting was a panel with three speakers. Carl Hewitt is Emeritus Professor in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology; he and his students invented Actors in the 1970's. Frank Sommers of Artima Software is an active writer in the area of information technology and computing and is currently writing a book on Actors in the Scala programming language. Robey Pointer is a software engineer at Twitter and an enthusiastic user of Scala and Actors there. Bill Venners, president of Artima Software and author of a book on Scala, moderated the proceedings.
Hewitt had retired when the advent of multi-core processors and distributed parallelism renewed interest in the Actor model. He has come out of retirement and is currently visiting Stanford. Hewitt traced the genesis of the Actor model to Alan Kay's Smalltalk-72, where every object was autonomous and you acted on an object by sending it messages; later versions of Smalltalk moved in a different direction. The most important aspect of the Actor model is that it decouples the sender from the communication. In practice this allows the Scala implementation to scale to millions of Actors engaged in concurrent communication (more on this later).
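To make the decoupling concrete, here is a minimal sketch using the scala.actors library of that era (since superseded by Akka); the actor and its messages are hypothetical, but it shows how the ! operator queues a message and returns immediately, so the sender never blocks on the receiver:

```scala
import scala.actors.Actor
import scala.actors.Actor._

// A hypothetical counter actor: it owns its state and is reachable
// only through messages, never through direct method calls.
object Counter extends Actor {
  def act() {
    var count = 0
    loop {
      react {
        case "increment" => count += 1
        case "report"    => println("count = " + count)
      }
    }
  }
}

object Example {
  def main(args: Array[String]) {
    Counter.start()
    // Sending is decoupled from receiving: '!' enqueues the message
    // and returns at once, whatever the receiver is currently doing.
    Counter ! "increment"
    Counter ! "increment"
    Counter ! "report"
  }
}
```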
Hewitt spoke with great flourish on a number of other topics, including his determination that the Actor model should be properly represented on Wikipedia, and spooks and the Internet Archive. He spent some time on the unbounded non-determinism of the Actor model versus other concurrency formalisms that only support bounded non-determinism. An audience member challenged him to explain this better, citing Map-Reduce. Amusingly, Hewitt answered by comparing the parallelism in Map-Reduce to Stalin: Stalin has three deputies, and each of those deputies has three deputies; Stalin tells his deputies what to do, those deputies tell their deputies what to do, and so on. Thus the business of the state can proceed in parallel. Map-Reduce is a sequential algorithm that is sped up by parallelism. There is no non-determinism. This is parallelism as opposed to concurrency.
Next Frank Sommers spoke on how Actors are used in Scala. The good news is that Actors are implemented in Scala, and Hewitt much preferred the Scala implementation over the Erlang implementation of Actors. The bad news is that there are a number of issues with the current Scala implementation. For example, a Scala program cannot exit from a "receive" statement. Another issue is that messages are supposed to be immutable, but the current implementation may not ensure this. These and other issues are being worked on, and the next version of Actors in Scala will be much better.
Finally, Robey Pointer talked about how he is using Scala. He implements message queuing systems that deal with large numbers of long-lived connections, where each connection is mostly idle but has sporadic bursts of activity. Robey has implemented this kind of system in many different ways. A straight thread-per-connection implementation, after a lot of tuning, got up to 5000 concurrent connections, which fell well short of his goal of supporting millions of connections. A thread pool implementation with a few hundred threads worked better, but the code became unwieldy and more complicated than it should have been. Now he has an Actor-based implementation in Scala that does scale to millions of connections, and yet the code remains straightforward and small.
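For a sense of what that design looks like, here is a hedged sketch (my illustration, not Twitter's code) of an actor per connection using the scala.actors library; because react suspends an idle actor without tying up a thread, a small pool of threads can host a very large number of mostly idle connections:

```scala
import scala.actors.Actor
import scala.actors.Actor._

// Hypothetical messages for a connection actor.
case class Deliver(payload: String)
case object Close

// One lightweight actor per connection. react gives up its thread while
// the connection is idle, so many actors share a small thread pool.
class Connection(id: Int) extends Actor {
  def act() {
    loop {
      react {
        case Deliver(payload) => println("conn " + id + " -> " + payload)
        case Close            => exit()
      }
    }
  }
}

object Server {
  def main(args: Array[String]) {
    val connections = (1 to 10000).map { id =>
      val c = new Connection(id)
      c.start()
      c
    }
    connections.foreach(_ ! Deliver("ping"))
    connections.foreach(_ ! Close)
  }
}
```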
He also showed us how Actors can be mixed in with thread-based synchronization to solve problems for which even Actors are too heavyweight. I am in two minds about this. On the one hand, there are legitimate uses for this kind of low-level synchronization (as discussed in my PhD thesis). On the other hand, thread-based concurrency is awful, as I keep promising to explain in another post. Also, to do it safely you need to understand in great detail how Actors are implemented in Scala, and one reason for adopting a high-level construct like Actors is that it should hide the gory implementation details.
After the meeting I spoke with Carl Hewitt. We agreed that sending a message needs to have a low overhead; it should cost about the same as calling a procedure. Computers have specialized instructions to support procedure calls, and they need specialized instructions to support message passing. We did this for the Transputer, although that was before its time, and it is eminently possible for Actors. All we need is for a high-level concurrency formalism like Actors to get enough traction that the chip manufacturers become interested in supporting it.
Labels:
Concurrency,
programming,
programming languages,
SDForum
Monday, June 22, 2009
Now You See It
While visualization can be an effective tool to understand data, too many software vendors seem to view visualization as an opportunity to "bling your graph", according to Stephen Few, author, teacher and consultant. Few has just published a new book called "Now You See It: Simple Visualization Techniques for Quantitative Analysis". He spoke to the June meeting of the SDForum Business Intelligence SIG.
Few took us on a quick tour of visualization. We saw a short Onion News Network video that satirized graphics displays in news broadcasts, followed by examples of blinged graphs and dashboards that were both badly designed and misleading in their information display. Not all visualizations are bad. An example of good visualization is the work of Hans Rosling who is a regular speaker at the TED conference (his presentations are well worth watching, and then you can go to Gapminder.org and play with the data just as he does). Another example of visualization used effectively to tell a story is in the Al Gore documentary "An Inconvenient Truth".
Next came a discussion of visual perception, leading up to the idea that we can only keep a few items in short-term memory at one time; however, these items can be complex pieces of visual information. Given that data analysis is about comparing data, visual encoding allows us to see and compare more complex patterns than, for example, tabular data.
Any data display can only show us a small part of the picture. An analyst builds understanding of their data set by building up complex visualizations of the data, a piece at a time. We saw some examples of these visualizations. Software should support the data analyst as they build up their visualizations without getting in the way. Few told us that the best software is rooted in academic research. He recommended several packages, including Tableau and Spotfire, both of which have presented to the Business Intelligence SIG in the past.
Monday, June 15, 2009
Free: The Future of a Radical Price
For some time I have intended to write a post on the economics of media now that the cost of manufacturing it has gone to nothing. Today I discovered that Chris Anderson, editor of Wired and author of "The Long Tail", has written a book on the subject called "Free: The Future of a Radical Price", available in July. I will write a full post after reading the book; here is an outline of what I expect it to say.
Firstly, as the TechCrunch post says, "As products go digital, their marginal cost goes to zero." It is now economic to give the product away and make it up on volume. Closely related is the network effect: the more widespread a piece of media is, the more "valuable" it becomes. Barriers to media becoming widespread reduce the likelihood that it is seen or heard, and cost is definitely a barrier.
Moreover, putting a price on your media item creates the opportunity for others to price for free and undercut you. A good example is craigslist. It may not be quite what you think of as media, but craigslist is in the process of decimating the newspaper industry by destroying their market for classified advertisements. Craigslist makes their money by selling access to narrow niche markets, so it seems to fit in perfectly with Anderson's thesis.
In the past I have written about the future of music and how musicians are moving to make their money from performance rather than from record sales. As goes music, so goes all media. My sister is currently writing a book. This last week she told me that she expects to make her living from touring to lecture on the book's contents rather than from book sales.
Tuesday, June 02, 2009
Databases in the Cloud
Last week was a busy week, with Databases in the Cloud on Tuesday followed by Hadoop and MapReduce with Cascading on Wednesday. These were both must-attend SDForum SIG meetings for anyone who wants to keep up with new approaches to database and analytics systems. The two meetings had very different characteristics. MapReduce with Cascading was a technical presentation that required concentration to follow but did contain some real nuggets of information. The Cloud Services SIG meeting on Tuesday, Demo Night: Databases in the Cloud, was more accessible. This post is about Databases in the Cloud.
Roger Magoulas of O'Reilly Research started the meeting by discussing big data and their experience with it. A useful definition of "Big Data" is that when the size of the data becomes a problem, you have Big Data. O'Reilly has about 6 TBytes of data, more than a billion rows, in their job database. The data comes from the web and it is messy. They use Greenplum, a scalable MPP database system suitable for cloud computing, which also has built-in MapReduce. Like many people doing analytics, they are not really sure what they are going to do with the data, so they want to keep things as flexible as possible with flexible schemas. Roger and the O'Reilly team believe that 'making sense of "Big Data" is a core competency of the Information Age'. On the technology side, Big Data needs MPP parallel processing and compression, while Map-Reduce handles big data with flexible schemas and is resilient by design.
After Roger came three demos. Ryan Barrett from Google showed us a Google App Engine application that uses the Google Store. Google App Engine is a service for building and testing cloud web applications that is free for small applications and paid for when the application scales. The Google Store is BigTable, a sharded stateless tuple store for big data (see my previous posts on the Google Database System and Hypertable, a clone of BigTable). Like every other usable system, Google has its own high level language called GQL (Google Query Language), whose statements start with the verb SELECT. To show that they are serious about supporting cloud applications, Google also provides bulk upload and download.
Cloudera is a software start up that provides training and support for the open source Hadoop MapReduce project. Christophe Bisciglia from Cloudera gave an interesting analogy. First he compared the performance of a Ferrari and a freight train: a Ferrari has fast acceleration and a higher top speed but can only carry a light load, while a freight train accelerates slowly and has a lower top speed but can carry a huge load. Then he told us that a database system is like a Ferrari, while Map-Reduce is like the freight train. Map-Reduce does batch processing and is capable of handling huge amounts of data, but it is certainly not fast and agile like a database system, which is capable of giving answers in real time.
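To ground the analogy, here is a toy Scala sketch of the Map-Reduce style of computation (plain collections standing in for a cluster, not Hadoop's actual API): the whole input is scanned in a map phase, then grouped and folded in a reduce phase, which is why it behaves like a freight train rather than a Ferrari:

```scala
object WordCount {
  // Map phase: emit a (key, value) pair for every word in every line.
  def mapPhase(lines: Seq[String]): Seq[(String, Int)] =
    lines.flatMap(_.split("\\s+")).filter(_.nonEmpty).map(word => (word, 1))

  // Reduce phase: group the pairs by key and sum the values for each key.
  def reducePhase(pairs: Seq[(String, Int)]): Map[String, Int] =
    pairs.groupBy(_._1).map { case (word, ps) => (word, ps.map(_._2).sum) }

  def main(args: Array[String]) {
    val lines = Seq("the quick brown fox", "the lazy dog")
    // Prints something like Map(the -> 2, quick -> 1, brown -> 1, ...)
    println(reducePhase(mapPhase(lines)))
  }
}
```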
Finally George Kong showed us the Aster Data Systems MPP database system with a Map-Reduce engine. They divide their database servers into three groups: the Queen, which manages everything; Worker hosts, which handle queries; and Loader hosts, which handle loading. This is a standard database system that works with standard tools such as Informatica, Business Objects, MicroStrategy and Pentaho. It is also capable of running in the elastic cloud. For example, one of their customers is ShareThis, which keeps a 10 TByte Aster Data Systems database in the cloud and uses MicroStrategy and Pentaho for reporting.
Friday, May 29, 2009
Using BI to Manage Your Startup
We heard several different perspectives on how Start Ups use Business Intelligence at the May meeting of the SDForum Business Intelligence SIG. The meeting was a panel, moderated by Dan Scholnick of Trinity Ventures. Dan opened the meeting by introducing himself and then asking the panelists to introduce themselves.
The first panelist was Naghi Prasad, VP, Engineering & Operations at Offerpal Media, a start up that allows developers to monetize social applications and online games. Offerpal Media is a marketing company that does real time advertisement targeting and uses a variety of analytics techniques such as AB testing. Naghi told us that Business Intelligence is essential to the company's business and baked into their framework.
Next up was Lenin Gali, Director of Business Intelligence at ShareThis, a start up that allows people to share content with friends, family and their network via email, SMS and social networking sites such as Facebook, Twitter, MySpace and LinkedIn. ShareThis also uses AB testing, and as a content network has to deal with large amounts of data.
Third was Bill Lapcevic, VP of Business Development at New Relic, which provides Software as a Service (SaaS) performance management for the Ruby on Rails web development platform. New Relic has acquired 1700 customers over its first year as a start up with a single sales person. Their customers are technical and they use their platform to track the addiction or pain of each customer, and to estimate their potential budget.
The final panelist was Bill Grosso, CTO & VP of Engineering at TwoFish, a start up that offers SaaS based virtual economies for virtual worlds and Massively Multiplayer Online Games (MMOG). For the operator, a virtual economy is Sam Walton's analytics dream, as you see into every player's wallet and capture their every purchase and exchange. TwoFish uses their experience with running multiple virtual economies to tell their customers what they are doing right and wrong in developing a virtual economy.
Dan's first question was "What are some of the pitfalls of Business Intelligence?" Bill Lapcevic told us that they have a real time reporting system that can track revenue by the minute. The problem is that you can become addicted to data and spend too much time with it; sometimes you need to get away from your screen and talk to the customer. Lenin agreed with this and added that they have problems with data quality. Naghi told us that while a benefit is the surprises that they find in the data, a problem is that they are never finished with their analytics. Bill Grosso was concerned with premature generalization: you need to wait until you have enough data to support conclusions, and revisit the conclusions as more data arrives.
There was a wide variety of answers to the question of which tools each panel member used. According to Naghi Prasad, "MySQL is a killer app, it will kill your app!" Offerpal Media uses Oracle for their database. While they like some of the features of Microsoft SQL Server, they are constrained to have only one Database Administrator (DBA), and DBAs are best when they specialize in one database system. They use open source Kettle for ETL and Microsoft Excel for data presentation. Naghi extolled the virtues of giving users data in a spreadsheet they are comfortable with, and Excel pivot tables allow the user to manipulate their data at will. After surveying what was available, they implemented their own AB testing package.
ShareThis is on the leading edge of technology use. Lenin told us that they are 100% in the cloud, using the LAMP stack with MySQL and PHP. They have a 10 Terabyte Aster Data Systems database, and use both MicroStrategy and Hadoop with Cascading for data analysis and reporting. Running this system takes about 1.5 system admins.
As might be expected, the New Relic system is built on Ruby on Rails and uses sharded MySQL to achieve the required database performance. In their experience it is sometimes worth paying a little more for hardware rather than squeezing the last ounce of performance from a system. They have developed many of their own analytics tools, which they expect to sell as products to their customers.
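As an illustration of the sharding idea (my sketch, not New Relic's code, and the shard URLs are made up), the application picks which MySQL instance to use by hashing a shard key such as the account id, so data and load are spread across many small databases:

```scala
object ShardRouter {
  // Hypothetical JDBC URLs for the shards; in practice these would come
  // from configuration.
  val shards = Vector(
    "jdbc:mysql://db0/app",
    "jdbc:mysql://db1/app",
    "jdbc:mysql://db2/app",
    "jdbc:mysql://db3/app"
  )

  // Every query for a given account goes to the same shard, chosen by
  // a simple modulo on the (non-negative) account id.
  def shardFor(accountId: Long): String =
    shards((accountId % shards.size).toInt)

  def main(args: Array[String]) {
    println(shardFor(42L))    // always resolves to the same shard
    println(shardFor(1001L))
  }
}
```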
As TwoFish does accounting for virtual worlds, their servers are not in the cloud, rather they are locked in their cage in a secure data center. While Bill Grosso lusts after some features in Microsoft SQL Server, they use MySQL with Kettle for ETL. They have developed their own visualization code that sits in front of the Mondrian OLAP engine. They expect to do more with the R language for statistical analysis and data mining.
Dan asked the panel how they get the organization to use their Business Intelligence product. Bill Grosso led by saying that adoption has to come from the top: if the CEO is not interested in Business Intelligence, then nobody else will be either. He also called for simple metrics that make a point. Bill Lapcevic agreed that leadership should come from the top. The idea is to make the data addictive to users and to avoid too many metrics. Sharing data widely can help everyone understand how they can contribute to improving the numbers. Lenin thought that it was important to make decisions and avoid analysis paralysis. Naghi offered that Business Intelligence can scare non-Business-Intelligence users: you have to provide simple stuff, and make sure that you score some sure hits early on to encourage people. Finally, remember that different people need different reports, so make sure each report is specialized to the requirements of the person receiving it.
There were more questions asked, too many to describe in detail here. All in all, we had an informative discussion throughout the evening with a lot of good information shared.
Labels:
Business Intelligence,
Cloud Computing,
SDForum
Saturday, May 16, 2009
Virtual Worlds - Real Metrics
Avoid hyperinflation! This was the salient piece of advice for running a virtual economy that I got from Bill Grosso's talk to the April meeting of the SDForum Emerging Technology SIG. The talk was entitled "Virtual Worlds and Real Metrics". Bill is CTO of TwoFish, a startup that provides virtual economies to online worlds. Bill has an undergraduate degree in Economics, so he has both academic knowledge and the practical experience of seeing the insides of virtual economies in online worlds.
In case you are wondering what a virtual economy is, two leading examples are found in World of Warcraft and Second Life. Both are massive multiplayer online (MMO) worlds. World of Warcraft is a fantasy game based on combat where players receive rewards for gameplay. The economy greases interactions between players and allows them to exchange rewards. Although it is not the primary point of the game, the World of Warcraft economy is estimated to be larger than many small countries around the world. Second Life is an online world where you can establish a second life for yourself. Like all the other aspects of Second Life, the economy is intended to provide a virtual reflection of a real economy.
Although it is attractive to think that virtual economies are like real economies, Bill spent quite some time disabusing us of this notion. Unlike real money, virtual money may have different costs in different localities, sometimes you get a bulk discount on large purchases of virtual money, and in some cases old money may expire. Another issue is that money does not circulate in the same way as real money. In the real world, money, once set free, circulates from person to person and business to business. In a virtual world, money is usually created by the game, flows through a player and then back to the game. It seems that the degree to which money circulates between players is a good measure of how "real" the economy is in a virtual world.
For measuring money velocity, virtual worlds do have an advantage. In the real world, statisticians estimate the money supply, then estimate the total value of transactions in the economy, and divide one by the other to get money velocity. In a virtual world, the man behind the curtain sees into every player's wallet and every transaction, so the calculation of money velocity is exact. Moreover, by linking demographics and other knowledge to players you have a Business Intelligence analytics wonderland: precise reports on every aspect of the economy that can be used for all sorts of marketing purposes.
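As a worked illustration (the numbers are invented, not from the talk), the velocity calculation is simply the total value transacted in a period divided by the money supply:

```scala
object MoneyVelocity {
  // Velocity of money: total value transacted over the period divided by
  // the money supply during that period.
  def velocity(totalTransactionValue: Double, moneySupply: Double): Double =
    totalTransactionValue / moneySupply

  def main(args: Array[String]) {
    // Hypothetical figures: 5 million gold transacted in a month against a
    // supply of 1 million gold gives a monthly velocity of 5.0.
    println(velocity(5000000.0, 1000000.0))
  }
}
```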
Hyperinflation has been and always will be a pitfall of virtual economies. It can come from bad design of the virtual economy, but it is more likely to come from players finding and exploiting bugs in the game, and there are always bugs in a game. Eternal vigilance is the price of avoiding hyperinflation, so the analytics reports are an important part of managing this aspect of the economy as well.
Labels:
Business Intelligence,
Dismal Science,
SDForum
Sunday, May 03, 2009
Gartner BI Summit
Suzanne Hoffman, VP Sales at Star Analytics, recently attended the Gartner BI Summit and she gave the SDForum Business Intelligence SIG a short overview of the conference at the April meeting. You can get her presentation from our web site. Several threads ran through her presentation; one is the scope of Business Intelligence (BI), and another related thread is Business Intelligence and Performance Management (PM or EPM). Note that when "E" is prefixed to an acronym, it stands for Enterprise.
Although the title of the conference is the Gartner BI Summit, there seemed to be a lot of concern for placing BI within the context of Information Management (IM or EIM), where Information Management includes both search and social software for collaboration as well as BI and PM. On a side note, I have always found enterprise search to be terrible, as discussed previously. Anything that can be done to improve enterprise search is worthwhile, and both analytics and social efforts like bookmarking or tagging can and should be applied to make it better.
Gartner sees BI evolving from something that is mostly pushed by Information Technology (IT) to something that is more broadly focused on Performance Management driving Business Transformation. There has been tension between the terms Business Intelligence and Performance Management for some time; for example, I wrote a semi-serious post on the subject in 2004. At the BI SIG we have always used BI as an umbrella term that encompasses customer and supply chain analytics and management. On the one hand, maybe BI is too associated with IT and not enough with end users such as business analysts. On the other hand, Performance Management may have a narrower focus on financial analysis, which is a large and important part of analytics, but not the whole enchilada by any means.
Whichever term is chosen as the umbrella, we will continue to call our SIG the Business Intelligence SIG for as long as Gartner has BI Summits.
Saturday, May 02, 2009
The Next Revolution in Data Management
Cringely wrote a great post today called "The Sequel Dilemma". His point is that we are in the midst of a revolution in the way we do data management: the database is like a horse and buggy, soon to be run over by the next generation of data management tools like, for example, the Google database system that I wrote about last year. I particularly liked his comment:
Right now almost every web application has an Apache server fronting a database box running MySQL or its closed source equivalent like Oracle, DB2, or SQL Server. The data bottleneck in all those applications is the SQL box, which is generally doing a very simple job in a very complex manner that made total sense for minicomputers in 1975 but doesn’t make as much sense today.