Tuesday, June 30, 2009

Actors and Concurrent Computation

Carl Hewitt was in fine form when he spoke about the Actor model of concurrent computation at the SDForum SAM SIG meeting on "Actor Architectures for Concurrent Computation". The meeting was a panel with three speakers: Carl Hewitt, Emeritus Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology, who with his students invented Actors in the 1970s; Frank Sommers of Artima Software, an active writer in the area of information technology and computing who is currently writing a book on Actors in the Scala programming language; and Robey Pointer, a software engineer and enthusiastic user of Scala and Actors at Twitter. Bill Venners, president of Artima Software and author of a book on Scala, moderated the proceedings.

Hewitt had retired when the advent of multi-core processors and distributed parallelism renewed interest in the Actor model. He has come out of retirement and is currently visiting Stanford. Hewitt described the genesis of the Actor methodology in Alan Kay's Smalltalk-72, where every object was autonomous and you acted on an object by sending it messages; later versions of Smalltalk moved in a different direction. The most important aspect of the Actor model is that it decouples the sender from the communication. In practice this allows the Scala implementation to scale to millions of Actors engaged in concurrent communication (more on this later).
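To make the decoupling concrete, here is a minimal sketch using the scala.actors library of the day (the Greeter name and the messages are my own illustration, not from the talk). The sender fires a message with ! and carries on immediately; it never blocks waiting for the receiver.

```scala
import scala.actors.Actor
import scala.actors.Actor._

// A greeter actor: it owns its own state and mailbox, and
// processes messages one at a time, decoupled from its senders.
object Greeter extends Actor {
  def act() {
    loop {
      react {
        case name: String => println("Hello, " + name)
      }
    }
  }
}

object Main {
  def main(args: Array[String]) {
    Greeter.start()
    Greeter ! "SDForum"  // asynchronous send: returns at once
    println("The sender carries on without waiting")
  }
}
```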

Hewitt spoke with great flourish on a number of other topics, including his determination that the Actor model should be properly represented on Wikipedia, and spooks and the Internet Archive. He spent some time on the unbounded non-determinism of the Actor model versus other concurrency formalisms that support only bounded non-determinism. An audience member challenged him to explain this better, citing Map-Reduce. Amusingly, Hewitt answered by describing the parallelism in Map-Reduce as being like Stalin: Stalin has three deputies, and each of those deputies has three deputies; Stalin tells his deputies what to do, those deputies tell their deputies what to do, and so on. Thus the business of the state can proceed in parallel. Map-Reduce is a sequential algorithm that is sped up by parallelism. There is no non-determinism. This is parallelism as opposed to concurrency.

Next Frank Sommers spoke on how Actors are used in Scala. The good news is that Actors are implemented in Scala, and Hewitt much preferred the Scala implementation over the Erlang implementation of Actors. The bad news is that there are a number of issues with the Scala implementation. For example, a Scala program cannot exit from a "receive" statement. Another issue is that messages are supposed to be immutable, but the current implementation may not ensure this. These and other issues are being worked on, and the next version of Actors in Scala will be much better.
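Since the library cannot yet enforce immutability, the usual convention is to define messages as case classes whose fields are immutable, so the receiver can never observe the sender mutating a message after it was sent. A small sketch of that convention (the Account and Deposit names are hypothetical, not from the talk):

```scala
import scala.actors.Actor
import scala.actors.Actor._

// Immutable message types: case classes with val fields only.
case class Deposit(amount: BigDecimal)
case object Balance

class Account extends Actor {
  private var balance = BigDecimal(0)  // state private to the actor
  def act() {
    loop {
      react {
        case Deposit(amount) => balance += amount
        case Balance         => reply(balance)
      }
    }
  }
}

object AccountDemo {
  def main(args: Array[String]) {
    val account = new Account
    account.start()
    account ! Deposit(BigDecimal(10))  // fire and forget
    println(account !? Balance)        // synchronous request-reply
  }
}
```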

Finally, Robey Pointer talked about how he is using Scala. He implements message queuing systems that deal with large numbers of long-lived connections, where each connection is mostly idle but has sporadic bursts of activity. Robey has implemented this system in many different ways. For example, a straight thread-based implementation with a lot of tuning got up to 5000 concurrent connections, which fell well short of his goal of supporting millions of connections. A thread pool implementation with a few hundred threads worked better, but the code became unwieldy and more complicated than it should have been. Now he has an Actor-based implementation in Scala that does scale to millions of connections, and yet the code remains straightforward and small. A sketch of the idea appears below.
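The scaling trick is the event-based react primitive: unlike receive, react does not tie up a thread while an actor waits, so a small pool of worker threads can serve a huge population of mostly idle actors. This is a hypothetical sketch of a connection handler in that style, not Robey's actual code:

```scala
import scala.actors.Actor
import scala.actors.Actor._

case class Publish(payload: String)
case object Close

// Because react saves the rest of the computation and releases its
// thread back to the pool, none of these actors holds a thread while idle.
class Connection(id: Int) extends Actor {
  def act() {
    loop {
      react {
        case Publish(payload) => () // deliver payload to the remote client here
        case Close            => exit()
      }
    }
  }
}

object LoadSketch {
  def main(args: Array[String]) {
    val connections = for (i <- 1 to 100000) yield {
      val c = new Connection(i)
      c.start()
      c
    }
    connections.foreach(_ ! Publish("ping")) // a sporadic burst of activity
    connections.foreach(_ ! Close)
  }
}
```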

He also showed us how Actors can be mixed in with thread-based synchronization to solve problems for which even Actors are too heavyweight. I am in two minds about this. On the one hand, there are legitimate uses for this low-level synchronization (as discussed in my PhD thesis). On the other hand, thread-based concurrency is awful, as I keep promising to explain in another post. Also, to do it safely you need to understand in great detail how Actors are implemented in Scala, and one reason for adopting a high-level construct like Actors is that it should hide gory implementation details.
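For illustration only, here is one hypothetical form such mixing can take: actors updating a shared counter directly through an atomic variable, bypassing message passing on a hot path. It is safe here only because AtomicInteger does the synchronization itself, which is exactly the kind of implementation knowledge an Actor program should not normally need.

```scala
import java.util.concurrent.atomic.AtomicInteger
import scala.actors.Actor._

// Shared mutable state touched directly by many actors, instead of
// funnelling every increment through a dedicated counting actor.
object SharedStats {
  val messagesSeen = new AtomicInteger(0)
}

object MixingSketch {
  def main(args: Array[String]) {
    for (i <- 1 to 10) {
      val a = actor {
        react {
          case _ => SharedStats.messagesSeen.incrementAndGet()
        }
      }
      a ! ("event " + i)
    }
    Thread.sleep(1000) // crude: give the actors time to run
    println(SharedStats.messagesSeen.get + " messages seen")
  }
}
```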

After the meeting I spoke with Carl Hewitt. We agreed that sending a message needs to have a low overhead; it should have a similar cost to calling a procedure. Computers have specialized instructions to support procedure calls, and they need specialized instructions to support message passing. We did this for the Transputer, although that was before its time, and it is eminently possible for Actors. All we need is for a high-level concurrency formalism like Actors to get enough traction that the chip manufacturers become interested in supporting it.

Monday, June 22, 2009

Now You See It

While visualization can be an effective tool for understanding data, too many software vendors seem to view visualization as an opportunity to "bling your graph", according to Stephen Few, author, teacher and consultant. Few has just published a new book called "Now You See It: Simple Visualization Techniques for Quantitative Analysis". He spoke at the SDForum Business Intelligence SIG June meeting.

Few took us on a quick tour of visualization. We saw a short Onion News Network video that satirized graphics displays in news broadcasts, followed by examples of blinged graphs and dashboards that were both badly designed and misleading in their information display. Not all visualizations are bad. An example of good visualization is the work of Hans Rosling, who is a regular speaker at the TED conference (his presentations are well worth watching, and then you can go to Gapminder.org and play with the data just as he does). Another example of visualization used effectively to tell a story appears in the Al Gore documentary "An Inconvenient Truth".

Next came a discussion of visual perception, leading up to the idea that we can only keep a few items in our short-term memory at one time, although these items can be complex pieces of visual information. Given that data analysis is about comparing data, visual encoding allows us to see and compare more complex patterns than, for example, tabular data does.

Any data display can only show us a small part of the picture. An analyst builds understanding of their data set by building up complex visualizations of the data, a piece at a time. We saw some examples of these visualizations. Software should support data analysts as they build up their visualizations without getting in the way. Few told us that the best software is rooted in academic research. He recommended several packages, including Tableau and Spotfire, both of whom have presented to the Business Intelligence SIG in the past.

Monday, June 15, 2009

Free: The Future of a Radical Price

For some time I have intended to write a post on the economics of media now that the cost of manufacturing it has gone to nothing. Today I discovered that Chris Anderson, editor of Wired and author of "The Long Tail", has written a book on the subject called "Free: The Future of a Radical Price", available in July. I will write a post after reading the book; here is an outline of what I expect it to say.

Firstly, as the TechCrunch post says, "As products go digital, their marginal cost goes to zero." It is now economic to give the product away and make it up on volume. Closely related is the network effect: the more widespread a piece of media becomes, the more "valuable" it is. Barriers to media becoming widespread reduce the likelihood that it is seen or heard, and cost is definitely a barrier.

Moreover, putting a price on your media item creates the opportunity for others to price theirs for free and undercut you. A good example is craigslist. It may not be quite what you think of as media, but craigslist is in the process of decimating the newspaper industry by destroying its market for classified advertisements. Craigslist makes its money by selling access to narrow niche markets, so it seems to fit in perfectly with Anderson's thesis.

In the past I have written about the future of music and how musicians are moving to make their money from performance rather than from record sales. As goes music, so goes all media. My sister is currently writing a book. This last week she told me that she expects to make her living from touring to lecture on the book's contents rather than from book sales.

Tuesday, June 02, 2009

Databases in the Cloud

Last week was a busy week, with Databases in the Cloud on Tuesday followed by Hadoop and MapReduce with Cascading on Wednesday. These were both must-attend SDForum SIG meetings for anyone who wants to keep up with new approaches to database and analytics systems. The two meetings had very different characters. MapReduce with Cascading was a technical presentation that required concentration to follow, but it did contain some real nuggets of information. The Cloud Services SIG meeting on Tuesday, Demo Night: Databases in the Cloud, was more accessible. This post is about Databases in the Cloud.

Roger Magoulas of O'Reilly Research started the meeting by discussing big data and their experience with it. A useful definition of "Big Data" is that when the size of the data becomes a problem, you have Big Data. O'Reilly has about 6 TBytes of data in their Job database; that is more than a billion rows. The data comes from the web and it is messy. They use GreenPlum, a scalable MPP database system suitable for cloud computing, which also has built-in MapReduce. Like many people doing analytics, they are not really sure what they are going to do with the data, so they want to keep things as flexible as possible with flexible schemas. Roger and the O'Reilly team believe that 'making sense of "Big Data" is a core competency of the Information Age'. On the technology side, Big Data needs MPP parallel processing and compression. Map-Reduce handles big data with flexible schemas and is resilient by design; a toy illustration of its shape follows.
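To make the Map-Reduce shape concrete, here is a purely in-memory word count in Scala (my own illustration, not GreenPlum's API): map each record to key-value pairs, group by key, then reduce each group. Real systems run exactly these phases, spread across many machines.

```scala
object WordCountSketch {
  // Map phase: each input record becomes a list of (key, value) pairs.
  def mapPhase(record: String): Seq[(String, Int)] =
    record.split("\\s+").filter(_.nonEmpty).map(word => (word.toLowerCase, 1))

  // Reduce phase: combine all the values that share a key.
  def reducePhase(word: String, counts: Seq[Int]): (String, Int) =
    (word, counts.sum)

  def main(args: Array[String]) {
    val records = List("big data is a big problem", "big data needs MPP")
    val counts = records
      .flatMap(mapPhase)
      .groupBy(_._1)  // the "shuffle": gather pairs by key
      .map { case (word, pairs) => reducePhase(word, pairs.map(_._2)) }
    counts.toSeq.sortBy(-_._2).foreach(println)
  }
}
```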

After Roger came three demos. Ryan Barrett from Google showed us a Google App Engine application that uses the Google Store. Google App Engine is a service for building and testing cloud web applications that is free for small applications and paid when the application scales. The Google Store is BigTable, a sharded stateless tuple store for big data (see my previous posts on the Google Database System and Hypertable, a clone of BigTable). Like every other usable system, Google has its own high-level language called GQL (Google Query Language), whose statements start with the verb SELECT. To show that they are serious about supporting cloud applications, Google also provides bulk upload and download.

Cloudera is a software start-up that provides training and support for the open source Hadoop MapReduce project. Christophe Bisciglia from Cloudera gave an interesting analogy. First he compared the performance of a Ferrari and a freight train: a Ferrari has fast acceleration and a higher top speed but can only carry a light load, while a freight train accelerates slowly and has a lower top speed, but can carry a huge load. Then he told us that a database system is like the Ferrari, while Map-Reduce is like the freight train. Map-Reduce does batch processing and is capable of handling huge amounts of data, but it is certainly not fast and agile like a database system, which is capable of giving answers in real time.

Finally, George Kong showed us the Aster Data Systems MPP database system with a Map-Reduce engine. They divide their database servers into three groups: the Queen, which manages everything; Worker hosts, which handle queries; and Loader hosts, which handle loading. This is a standard database system that works with standard tools such as Informatica, Business Objects, MicroStrategy and Pentaho. It is also capable of running in the elastic cloud. For example, one of their customers is ShareThis, which keeps a 10 TByte Aster Data Systems database in the cloud and uses MicroStrategy and Pentaho for reporting.