Thursday, December 31, 2009

Television in Trouble

There are forces at work that are going to completely change the television business in the US. On one side are the major television networks, which believe it is their right to earn large sums of money from television just because they have in the past. On the other side are consumers, who are tired of the cost and are increasingly switching off. In the middle are the cable companies and the cable content companies.

The consumers are fed up. Television has become unwatchable as the number and length of the commercial breaks have grown. We used to get 48 to 50 minutes of content in each hour, and now we get just 42 minutes. At that rate a season of 24 has less than 17 hours of content. The only way to watch a TV show is to record it on a DVR and watch it later, skipping the commercials. Once we get in the habit of watching TV offline, it becomes much easier to cut the cable completely and just watch the web. Between Netflix, Hulu and YouTube there is quite enough content to keep us entertained.

Another source of complaint is the constantly rising cost of cable, driven by the cable companies paying more and more for content. For example, the cable companies pay ESPN $4 per month per viewer to carry the channel, and that fee is rising. Other cable content companies are jumping into the valuable content pool. Ten years ago, the AMC channel showed very old movies with no commercial breaks; now AMC puts on award-winning shows like Mad Men, full of commercials. Every cable channel seems to have its must-see TV program, from the BBC with Top Gear to the USA network with Burn Notice.

The cost of cable is about to go up sharply as the major TV networks demand substantial fees from the cable companies for their programming. This does not seem like a winning idea in recessionary times. As fees rise, more and more people will cut the cable. Either the cost of cable has to stabilize, with cuts to content, or TV risks going the way of radio. (I hear that radio still broadcasts, but I do not listen to it, and nobody that I know still listens.) I think that we will see some big changes coming to the TV business over the next year or so.

Sunday, December 27, 2009

Kindle Chronicles

Amazon announced that "On Christmas Day, for the first time ever, customers purchased more Kindle books than physical books." Well, duh! If you want a physical book for Christmas, you have to buy it before Christmas Day. On the other hand, everyone who received a Kindle as a gift used the wireless book download feature to get a book to read on Christmas Day. In the very same announcement, Amazon said that "Kindle has become the most gifted item in Amazon's history". Amazon's statement is a nice piece of spin, but not a lot more.

More interesting commentary on electronic book readers is found in the Kindle Chronicles blog. In the early days of digital music, musicians generally stood by their record companies. Book authors seem to be a much more independent lot, according to the most recent post "What We Have Here Is a Failure To Communicate". The publishers have been trying to preserve their position by keeping the prices of ebooks high, while the authors want to be read, and the books that sell best on the Kindle are the cheaper ones. Also, authors do not see why the publishers should get such a large share of the revenue when ebook inventory costs them nothing.

Saturday, December 26, 2009

Die AutoRun Die

Another year has almost passed and I have not yet ranted about an awful, unnecessary and totally annoying feature of Microsoft Windows, so today I am going to tell you why AutoRun should die.

AutoRun is the "feature" where you plug something into your computer and then stuff happens completely out of your control. The thing you plug in might be a key drive, a camera, an iPod or whatever. Last Christmas I won a 4GB SanDisk Cruzer USB key drive as a door prize. When I plugged this horrible little thing into my computer it installed the U3 driver, with useless and dangerous functions that I DO NOT WANT! To make matters worse, there is no obvious way to remove the driver or its annoying functionality. To top off the bad behavior, even though I immediately erased the entire contents of the drive, when it was plugged into another computer, it infected that computer with its unwanted drivers as well. (The U3 software lives on a separate hidden partition that masquerades as a CD-ROM drive, which is why erasing the visible drive did not remove it.) I have thrown the key drive away to prevent further damage.
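For the curious, the mechanism is mundane: when a drive is mounted, Windows looks for an autorun.inf file in the drive's root and follows its instructions. A hypothetical example of the kind of file such a drive might carry (the file and program names here are illustrative, not taken from the actual Cruzer):

```ini
[autorun]
; Run this program automatically when the drive is plugged in
open=LaunchU3.exe
icon=LaunchU3.exe,0
label=U3 System
```

On an ordinary key drive, deleting a file like this stops the automatic launch; the U3 drives are worse because the file sits on that read-only virtual CD partition where you cannot touch it.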

The combination of USB key drives and AutoRun is such a serious computer virus infection vector that key drives are being banned in critical places. However, the problem is not just with key drives. I have not disabled AutoRun because I use it two to three times a week to sync my iPod with the latest podcasts. Recently my daughter plugged her iPod into my computer just to recharge the battery. First this caused iTunes to crash; then, when I restarted it, it wanted to sync my stuff onto her iPod. My daughter does not want anything of mine on her iPod, and I had to jump through hoops to prevent the sync.

The problem is that iTunes and everyone else has totally bought in to the automagic nonsense of AutoRun behavior. A much simpler, safer and easier-to-use design is to have the user plug in a device and then launch a program to use the device. Unfortunately the designers(?) of Windows decided to emasculate their users and instead give the device the power to decide what it wants to do. The subliminal message from Microsoft is that you are too stupid to operate your own computer, so we are going to do it for you, or let anyone else who might have more of a clue do it for you. The consequence of this design is that our computers do not belong to us, but to the hackers who exploit these "features" as attack vectors to take control of them.

If you sit back and think about it, AutoRun is obviously ill conceived. The design center is a single user logged into their computer and actively using it. What does AutoRun do when nobody has logged into the computer? What does it do when two users are logged in? In the example that I gave above, my daughter plugged her iPod into my computer when two people were logged in and the screen saver had locked both accounts. Of course iTunes crashed; it did not know what to do.

The iPod and iTunes are particularly annoying because they are unusable without AutoRun. On the iTunes support web site, the top support issue is "iPod doesn't appear in iTunes" and the second issue is "iPhone does not appear in iTunes". However, there is no button in iTunes to go and look for an iPod or iPhone; instead they rely on AutoRun, with no easy fallback should that fail.

Sunday, December 20, 2009

BI Megatrends: Directions for Business Intelligence in 2010

Every year David Stodder, Research Fellow with Ventana Research and editor-at-large with Intelligent Enterprise, writes a column on Business Intelligence Megatrends for the next year. The column looks back at what has happened in the last year and forward to what he expects to happen in the next. This year David also presented his thoughts to the December meeting of the SDForum Business Intelligence SIG. David talked about many topics; here I will just cover what he said about the big players.

Two years ago there was a huge wave of consolidation in Business Intelligence when the major independent BI vendors were bought up by IBM, SAP and Oracle, who along with Microsoft are the major enterprise software vendors. In the last year SAP has integrated Business Objects with SAP software to the point that SAP is now ready to threaten Oracle.

Consolidation has not finished. In 2009, two important mergers were announced. Firstly, IBM bought SPSS to round out its analytics capabilities. This move threatens SAS, which is in the same market. However, SAS is a larger and more successful company than SPSS, and as a private company it does not necessarily need to respond to the pressure to consolidate.

The other merger is Oracle's offer to buy Sun, and the effect it has on Oracle's relationship with HP. HP and Sun are bitter rivals in enterprise hardware, and HP was the launch partner for Oracle Exadata, the high-end Oracle database. Now Oracle is pushing Sun hardware with Exadata, leaving HP in the lurch. David pointed out that there are plenty of up-and-coming companies with scalable database systems for HP to buy. That list includes Aster Data Systems, Greenplum, Infobright, ParAccel and Vertica. Expect to see something happen in this area in 2010.

Of the three major database vendors, Microsoft has the weakest offering, despite SQL Server 2008. However, Microsoft does have the advantage of the Excel spreadsheet, which remains the most used BI reporting tool. A new version of Excel is due in 2010. Also, Microsoft is making a determined push in the direction of collaboration tools with SharePoint. As we heard at the BI SIG November meeting, collaboration is an important new direction for enterprise software.

Thursday, December 17, 2009

A Systematic and Platform Independent Approach to Time and Synchronization

Managing time and synchronization in any software is complicated. Leon Starr, a leading proponent of building executable models in UML, talked about the issues of modeling time and synchronization at the December meeting of the SDForum SAM SIG. Leon has spoken to the SAM SIG previously on executable models. This time he brought along two partners to demonstrate how the modeling technique can be applied to a broad range of problems.

Leon started the meeting by talking through five rules for handling time and synchronization. The first and most important rule is that there is no global clock. This models real systems, which may consist of many independent entities, and allows for the most flexible implementation of the model on a distributed system. In practice, the other rules are consequences of this first rule.

The next rule is that the duration of a step is unknown. The rule does not imply that a step can take forever; its purpose is to say that you cannot make assumptions about how long a step may take. In particular, you cannot expect independent steps in the model to somehow interleave themselves in some magical way. The third rule is that busy objects are never interrupted. This forces the modeler to create a responsive system by building it from many small steps, so that an object is always available to handle whatever conditions it needs to handle.

The fourth rule is that signals are never lost. This is an interesting rule, as it gets to an issue at the heart of building asynchronous systems. The rule implies that there is a handshake between sender and receiver. If the receiver is not ready, the sender may be held up waiting to deliver the signal. Perhaps the signal can be queued, but then there is the risk that the queue is not big enough to hold all the queued signals. In the end you have to build a system that can naturally handle all the events thrown at it, if it is a safety-critical system, or that fails gracefully if it is not.

The fifth rule is that there is no implicit order in the system, except that if one object sends signals to another object, the signals arrive in the order that they were sent. Note that I may have interpolated some of my own experience into this discussion of the rules. If you want to explore further, watch this video on YouTube and go to Leon's web site, which leads to many interesting papers and discussions.
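Rules four and five can be sketched with a bounded mailbox: the sender blocks (the handshake) rather than dropping a signal when the queue is full, and a single FIFO queue preserves the order between one sender and one receiver. A minimal illustration in Python (my own sketch, not from the talk):

```python
import queue
import threading
import time

class Mailbox:
    """Signals are never lost: a full queue blocks the sender."""
    def __init__(self, capacity):
        self._q = queue.Queue(maxsize=capacity)

    def send(self, signal):
        # The handshake: the sender is held up until there is room.
        self._q.put(signal, block=True)

    def receive(self):
        return self._q.get(block=True)

def demo():
    box = Mailbox(capacity=2)
    received = []

    def receiver():
        for _ in range(5):
            time.sleep(0.01)          # a slow receiver
            received.append(box.receive())

    t = threading.Thread(target=receiver)
    t.start()
    for i in range(5):
        box.send(i)                   # blocks once two signals are queued
    t.join()
    return received

result = demo()
print(result)   # [0, 1, 2, 3, 4] -- nothing lost, order preserved
```

Whether blocking the sender is acceptable, or the queue must instead be sized for the worst case, is exactly the design question the rule forces you to confront.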

Next at the meeting, Leland Starr, younger brother of Leon, talked about a web application that he had led development of for his employer, TD Ameritrade. The online application is for arranging participants in online webinars. By using the UML modeling technique, he created a model that could be used both to explain to the business sponsors of the project how the system would work and to execute, checking that it worked as expected. Leland has a SourceForge project for his work.

Finally Andrew Mangogna talked about a very different class of applications. He builds software to control implanted medical devices like heart pacemakers. The two overriding concerns are that the medical device performs its function safely and that it runs for at least 5 years on a single battery charge. Compared to many of the applications that we hear about at the SAM SIG, the implantable device applications feel like a throwback to an earlier and simpler age of computing. The applications are written in the C programming language and the code typically occupies 3 to 4 kilobytes. The program data is statically allocated, and an application can use from 150 bytes to 500 bytes. Andrew also has a project on SourceForge for his work.

Friday, December 04, 2009

Bandwidth Hogging

There are several discussions going on around the web about bandwidth hogging started by a post from Benoit Felten in the fiberevolution blog. I wrote about this issue last month in my post on net neutrality. The basic problem is that when the internet becomes congested the person who has created the most connections wins. Congestion can happen anywhere from your local head end through to a backbone and the backbone interconnects. Felten claims that there is no problem, and given the data, he is willing to do the data crunching to prove it, while others disagree.

The problem is a classic Tragedy of the Commons. There is a shared resource, the internet, and some people use more of it than others. That is fine provided that they do not interfere with each other and there is enough resource to go around. As I explained, the problem is that when there are not enough resources to go around, the people who win are the people who create a large number of connections, and these tend to be the people who use the most bandwidth. The point of a torrent client creating a large number of connections is to ensure that the client gets its "share" of the net whether there is congestion or not. The only viable response is for everyone else to create large numbers of connections to do whatever they want to do, be it download a web page or make an internet phone call. This is undesirable because it can only lead to more congestion and less efficient use of the shared resource.
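The arithmetic behind this is easy to see in a toy model where a congested link is shared evenly per connection rather than per user (my own illustration, with made-up numbers):

```python
# Capacity is split evenly across connections, not across users, so a
# user who opens more connections takes a proportionally larger share.
def bandwidth_per_user(link_capacity, connections_per_user):
    total_connections = sum(connections_per_user.values())
    per_connection = link_capacity / total_connections
    return {user: n * per_connection
            for user, n in connections_per_user.items()}

# One web browser connection against a torrent client with 99.
shares = bandwidth_per_user(100.0, {"browser": 1, "torrent": 99})
print(shares)   # {'browser': 1.0, 'torrent': 99.0}
```

The browser gets 1% of the link, not the 50% that per-user fairness would give it.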

There are two parts to a solution. Firstly, the internet service providers have to keep adding more equipment to reduce congestion as internet usage grows. Everything would be fine if there were no congestion. Secondly, we need better algorithms to manage congestion. Penalizing people for using the bandwidth they were sold is not the answer, particularly when that is not the real problem. I have suggested that we should look towards limiting connections. Another thought is to kill the connections of the users with the largest numbers of connections to reduce congestion. Again, I am sure that this will have some unintended consequences.

The real problem is that unless we can all agree to be good internet citizens and get along, the forces against Net Neutrality may win. Then large companies with deeply vested interests will get to decide who has priority. The recently announced merger of Comcast, a large Internet Service Provider, and NBC, a large content provider, is exactly the sort of thing that we need to be wary of.

Monday, November 30, 2009

Consumerization of IT

A new generation is entering the workforce and they are just not going to take it any more. Brian Gentile, CEO of Jaspersoft, did not say these exact words, but it conveys the intent of the introduction to his talk on "Consumerization of IT" at the November meeting of the SDForum Business Intelligence SIG.

Brian was talking about Generation Y, the first generation to have grown up with computers and instant communication to the extent that they take them for granted. More than that, they have expectations about these tools and what they can do with them. Unfortunately, enterprise software has often produced systems that are slow, ugly and so difficult to use that they can require weeks of training. While previous generations have put up with difficult software because they knew no better, Gen Y does know that it can be better and is not going to put up with software that does not measure up.

Brian identified four characteristics that Business Intelligence, or any enterprise software, must provide to meet the next generation's expectations. They are:
  • Elegant presentation.
  • Easy access to data.
  • Extensive Customization.
  • Built In Collaboration.
To do collaboration properly, software applications must fit into a collaboration platform rather than have each application provide its own siloed collaboration mechanism.

While I have heard people argue that current Business Intelligence software does not provide a good user experience, Brian put a positive light on this trend, as if the change is for the good and the right thing to do. He is certainly positioning JasperSoft to provide these features and meet the requirements of the next generation.

Brian ended with another optimistic note. The cost of Information Technology is coming down with cheaper hardware and Open Source software. CIO's can direct the money they save to new innovative projects. A good example of this movement is Ingres talking about "The New Economics of IT" as they have been doing for some time.

Sunday, November 15, 2009

Fight Institutional Corruption

Many people think of Lawrence Lessig as a radical with an anti-IPR (Intellectual Property Rights) agenda. In practice he is no radical, in fact his mission is to find a defensible middle ground between the Intellectual Property right and the Free Culture left. One of my first blog posts discussed his talk: "The Comedy of the Commons".

Recently, I have been following his podcasts which discuss his work on copyright as well as his newer work on institutional corruption. Note, while I find these podcasts interesting, they are not for everybody. They are mostly records of lectures given to various groups. While they are accessible, they are about serious policy matters and as many of the talks are on similar subjects, there tends to be some repetition.

Recently Lessig has been working on a new project on institutional corruption called Change Congress. The issue is that large sums of money fed through lobbyists seem to have an undue influence on lawmakers in the House and the Senate. The money appears to have such a large influence that lawmakers are voting against the clear wishes of their constituents.

Change Congress works to highlight these cases of apparent institutional corruption. It is fighting for citizen funded elections so that the lawmakers are not under pressure to raise the money they need to get re-elected. Thus they will be less likely to be swayed by the lobbyists. Go to the site, see what they have to say, and help them with their mission.

Saturday, November 07, 2009

Vote for Net Neutrality Now

There is a lot of talk about Net Neutrality now, and the issues are not completely clear cut as I will discuss later. However, there is also a big threat that needs to be addressed right now.

Bills are being proposed in Washington with friendly names like "The Internet Freedom Act" whose effect would be to give more control of the internet to the big ISPs and take away power from the people who are giving us innovative services like Google, Skype and Amazon. While there is also a friendly bill, and the FCC is on the side of Net Neutrality, everyone needs to act to let their congressman know whose side they are on. Visit "Save the Internet" and take action now!

Now that you have done your bit to save the internet, we can talk about the problem. When a node on the internet gets too much traffic, the traffic control algorithm will pick connections at random and kill them. While this is good for keeping the traffic flowing in the aggregate, it tends to favor one class of user over another. The disadvantaged user is the one who is using a single connection to browse the web, download a song or make a voice call. The advantaged user is running BitTorrent, which opens a large number of connections to do a massive download. It does not matter if BitTorrent loses a connection, as it has many others to make up for it, but it does matter when a web browser or Skype conversation loses a connection.
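A small simulation makes the unfairness concrete. Suppose the congested node kills ten connections at random out of a hundred (all the numbers here are invented for illustration):

```python
import random

# Toy model of the "kill a random connection" congestion response.
# A session survives if at least one of its connections is spared.
def survival_rate(my_connections, other_connections, kills, trials=10000):
    rng = random.Random(42)               # fixed seed for repeatability
    survived = 0
    for _ in range(trials):
        total = my_connections + other_connections
        victims = set(rng.sample(range(total), kills))
        # My connections occupy slots 0 .. my_connections-1.
        if any(i not in victims for i in range(my_connections)):
            survived += 1
    return survived / trials

single = survival_rate(1, 99, kills=10)   # one Skype call vs the crowd
many = survival_rate(50, 50, kills=10)    # a torrent client's many connections
print(single, many)   # single is about 0.9; many is exactly 1.0
```

The single-connection user loses their call about one time in ten, while the many-connection user is mathematically guaranteed to survive: ten kills cannot wipe out fifty connections.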

One solution is to answer greedy software with greedy software. That is, every internet application would emulate BitTorrent and greedily create hundreds of connections in case any one of them gets stomped. While this solution puts all applications on an equal footing, it may strain resources, leading to a "Tragedy of the Commons", something that should not be in our bright digital future.

Another solution would be to limit the number of simultaneous sessions a user can have. I personally feel that this would be better than having Comcast or AT&T doing deep packet inspection of my packets. However a hard limit on the number of sessions may cause all sorts of problems with software that is not expecting it, leading to deadlock and other bad behavior. Does anyone have any other ideas?

Friday, October 23, 2009

Database Systems for Analytics

The question "what are the attributes of a database system for analytics?" came up during Omer Trajman's talk at the October meeting of the SDForum Business Intelligence SIG. The talk was titled "The Evolution of BI from Back Office to Business Critical Analytics". In the talk Omer gave several examples of applications that use real time analytics and explained the special attributes of each application. As he runs field engineering for Vertica, a database systems vendor, I am sure that these examples were based on his experience with Vertica deployments; however, Omer was careful to keep his talk vendor neutral.

So what are the attributes of a database system for analytics? Omer discussed three. Firstly, an analytics database system cannot use the row-level locking that is found in a traditional transaction processing database. The database system needs to provide snapshot isolation, which gives a query a consistent view of the data while not blocking other operations like data loads. Having helped implement a system like this in the past, I am in total agreement with Omer.
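A minimal sketch of the idea, assuming a simple versioned table (my own illustration, not Vertica's implementation): each load commits as a new version, and a query reads only rows committed at or before the version that was current when it started.

```python
class VersionedTable:
    def __init__(self):
        self.committed_version = 0
        self.rows = []                      # list of (version, row) pairs

    def load(self, new_rows):
        # A load commits atomically as the next version number.
        v = self.committed_version + 1
        self.rows.extend((v, r) for r in new_rows)
        self.committed_version = v

    def reader(self):
        # A query captures the committed version once, when it starts,
        # and sees only rows committed at or before that version.
        v = self.committed_version
        return lambda: [r for (rv, r) in self.rows if rv <= v]

table = VersionedTable()
table.load(["row1", "row2"])
query = table.reader()        # a long-running query begins at version 1
table.load(["row3"])          # a concurrent load commits version 2
print(query())                # ['row1', 'row2'] -- the snapshot is stable
print(table.reader()())       # ['row1', 'row2', 'row3'] -- new queries see it
```

No locks are taken anywhere: the load never waits for the query, and the query's view never shifts under it.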

The second attribute is the need to allow concurrency between loading and querying data. While this is related to the first attribute, it also comes with its own issues. Bulk loads are more efficient (particularly for a columnar database like Vertica); however, if you want access to the most up-to-the-minute data, you need to do loads in small increments so that the data is available for query as soon as it is loaded. Managing this balance is difficult and as yet it has not been completely solved. Again, I have worked on this issue in several different systems.

The final attribute is scaleout, that is, the ability to add more processing systems to handle more data and larger queries. We are building systems out of hundreds and thousands of computers, and scaleout is vital to using them effectively.

Wednesday, October 14, 2009

e-Readers for All

The e-Reader market is heating up, just in time for Christmas. Amazon is expanding features and bringing the Kindle down the price curve. Today came word of the Barnes and Noble e-Reader with two screens, an e-ink screen for reading and a small LCD touch screen for interactivity.

Also today I caught up with the "This Week in Tech" podcast from last weekend where they talked about the real killer features of the Kindle - wireless download and almost unlimited capacity. You can buy as many books as you want any time you want, which leads to buying many more books than you would otherwise buy. Imagine the scene, at dinner with your friends, you discuss books that you have recently read, and bam you buy the books they recommend there and then. In fact there was even a cry in the podcast "Friends don't let friends use a Kindle while drunk" (for fear that the judgmentally impaired friend may buy too many books).

When the original Kindle came out there was a tremendous outcry against it, with people complaining of gadgets destroying their book reading experience and authors expecting to have their livelihood destroyed just as the music industry has been laid waste. Hint: musicians are doing just as well as they always have; it is the music moguls, with their "by the way, which one is Pink?", who have been laid waste. The Kindle stimulates the publishing industry and makes it much easier to buy books, leading to more sales where the author gets a larger slice of the pie.

Competition is good, particularly for the consumer. The e-Reader needs another generation or so to iron out the kinks and bring the price down to the mass market levels. I am waiting for the $149 price point (iPod Nano) which should come by next Christmas if not sooner.

Saturday, October 03, 2009

Search User Experience Innovations

Innovations in the Search User Experience was the topic at the September meeting of the SDForum Search SIG. The distinguished panel from Microsoft, Google and Yahoo was chaired by Safa Rashtchy, a long time analyst and commentator on the Search scene.

First, Sean Suchter, General Manager of Microsoft's Search Technology Center Silicon Valley, told us about the latest innovations in Bing. Sean started out with some numbers, showing that the Internet is still growing at a fast pace and that search is growing faster than the Internet in general. Microsoft measures its users' experience and sees that about a quarter of searches are failures, resulting in an immediate click back. On the other hand, nearly half of the search queries are further refined, meaning that the user is engaged in a search session. Microsoft will recognize these sessions and use them to improve the user experience.

To simplify the user experience, when Bing is confident about what a user is searching for, it will show one subject on the first page with a number of related links. Sean showed us two examples. Firstly, for the search term "target", where they assume the person is looking for the Target chain of stores, they show a complete set of links to Target and shopping related pages, with a single link to get other search results that are not related to Target stores. The second example was "ups", where they only show links related to United Parcel Service and sending parcels on the first page.

Next up was Johanna Wright, Director of Web Search Product Management at Google. Johanna started off by telling us that 20% of searches have never been seen before, and that Google is dedicated to serving the long tail of web searches as well as the more popular ones. To show us how far the search experience has come in the last few years, she applied the search term "how to tie a tie" to an index that Google had saved from 2001 and compared the results with what you get today. In 2001 you got a miscellaneous collection of links to sites like "The Indus Entrepreneur", none of them about tying ties. Now you get relevant links along with image and video links, a tremendous improvement.

Johanna talked about how speed is essential to a good user experience. A couple of years ago, they added related links to popular search terms like "target" to reduce the number of steps a user needs to make to get to the page they want. Google continues to work on helping users with query formulation. She showed us the options panel that you access by clicking the "search options" link on a search results page and how it can be used to refine a search.

Finally, Dr. Larry Cornett, vice president of the Yahoo! Search Consumer Products division, spoke. He started by reassuring us that Yahoo! is still in the search business and that if and when the planned combination of Yahoo! Search with Microsoft goes through, Yahoo! will still provide its own front end and control its users' experience. Yahoo!'s goal has always been to personalize and structure the web. We saw the new layout for Yahoo! search results, in the typical Yahoo! busy style.

After the demos, the floor was thrown open to audience questions. Someone asked about natural language support for queries. Sean told the story, as he has been in the search business for a long time. In the early days of search, natural language queries were considered an important research area. Then the issue went away as providing relevant answers to queries became the dominant problem. Now that giving good answers is under control, natural language queries are making a comeback. Recently Microsoft bought Powerset to help them in this area.

There were several questions about the sizes of market segments, and growth rates, particularly in the mobile space, to which the panel would not give answers. The audience did manage to uncover the fact that while adult searches are more prevalent than mobile searches, mobile searches have been growing fast since the introduction of the iPhone and other smartphones.

Another set of questions related to real time search. All three search engines have been working on improving the speed with which they update their indexes so that they are current. There is still an open question as to whether the major search engines will embrace real time search or make it a separate option.

Monday, September 07, 2009

Ikea Culture

We live in an Ikea world. I like to find excuses to visit the nearest Ikea, in Palo Alto, to lunch in their cafeteria, eating either a smoked salmon plate or Swedish meatballs with lingonberry jam. The cafeteria has a great view over the South Bay and the East Bay hills. However, the reason for this post is to note that Ikea has been popping up in conversation all over the world.

In China, the Ikea stores have become a great success, for the people, if not for Ikea. This LA Times story reports that Chinese people are flocking to the local Ikea store, to test the bedding, hang out and eat in the cafeteria, maybe even buy some plates, just not to buy anything big.

Meanwhile in LA itself, several young aspiring producers have noticed that an Ikea store is just like a movie studio, with lots of little well-lit sets showing off bedrooms, living rooms and kitchens. Just the place to make a short episode on the cheap. The actors mike up with wireless mikes outside, rush in and take a few shots, and then rush out before any employees notice. Here is Ikea Heights, a soap opera, and here is a send-up of The Real World.

Finally, as reported in the New York Times, there has been outrage over the decision by Ikea to change the font in its latest catalog from Futura to Verdana. Futura is a well respected modern sans-serif font that suits the Ikea style. Verdana is the generic Microsoft version of a sans-serif font that comes on every computer with Windows. I am not sure why this is so important. Are these people really complaining that Ikea has lowered its standards to embrace the lowest common denominator font?

Wednesday, September 02, 2009

Project Voldemort

There were three interesting trends exposed in the talk about Project Voldemort at the August meeting of the SDForum SAM SIG. Firstly, Voldemort is another tuple store, as opposed to a relational database; this is the trend that interested me the most. The second trend is the implementation of systems described in academic papers. The final trend is the use of Open Source as a support mechanism for a large software project. Let's break down each of these trends one at a time. By the way, the presentation was given by Bhupesh Bansal and Jay Kreps of LinkedIn.

Relational databases have been the reliable store for serious computing for the last 20 years, but recently tuple stores and tuple processing like Map-Reduce have appeared and are starting to challenge the relational database hegemony. In the simplest terms, a tuple store is just a very degenerate relational database. Relations are based on the n-tuple: each row in a table contains a number of data items, whereas a plain tuple is just two data items, a key and a value.
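The degeneracy is easy to see side by side. A relational row has named, typed columns that the database understands; a tuple store holds an opaque value under a key, and the application owns the structure (a sketch with invented data):

```python
import json

# Relational-style: a structured row the database can index and join on.
member_row = {"id": 42, "name": "Ada", "connections": [7, 9]}

# Key-value style: the store sees only a blob under a key; any
# structure inside the value is the application's business.
kv_store = {}
kv_store["member:42"] = json.dumps(member_row)

value = json.loads(kv_store["member:42"])
print(value["name"])   # Ada
```

The store can get and put by key with great speed and simplicity, but it cannot query inside the value; that is the trade the tuple store makes.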

As Jay Kreps explained, to get a web service application to scale, you need to distribute it over a cluster of computer systems, and to make this work with a relational database, you need to denormalize your database. The end point of database denormalization is the plain flat tuple store. Jay Kreps also complained that relational databases are not very good at handling data structures like the graphs of connections found in social networking applications, or semi-structured data like text.
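As a rough sketch of that end point, consider a normalized pair of tables collapsed into one flat record per key (the table and field names here are invented for illustration). Each member record then answers a whole query with a single key lookup on a single partition, at the cost of duplicating the employer name into every record:

```python
# Normalized form: two "tables", joined through employer_id.
members = {1: {"name": "Alice", "employer_id": 10}}
employers = {10: {"employer": "Acme"}}

# Denormalized form: one flat record per member key. The employer
# name is copied into each member record, so a lookup never needs
# a join that crosses partitions of the cluster.
denormalized = {
    member_id: {**m, "employer": employers[m["employer_id"]]["employer"]}
    for member_id, m in members.items()
}
print(denormalized[1]["employer"])  # prints "Acme"
```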

In my opinion, tuple stores are no better or worse than relational databases at dealing with graphs between tuples. Tuple stores are more flexible for handling semi-structured data, but again this depends on the application (for more, read my comparison of Map-Reduce with relational databases). Tuple stores are certainly simpler, easier to use, more stable under load and cheaper than a relational database. I will write more about tuple stores at another time.

The second notable trend is for groups to pick up on systems described in academic papers and just implement them. Voldemort is an implementation of the Amazon Dynamo system as described in their paper at the ACM Symposium on Operating Systems Principles. We have seen several other examples of this recently. Google released a set of papers about their data processing systems, including Map-Reduce, which has spawned a number of projects to emulate their functionality. I have written about Hadoop and Hypertable, two examples, and there are others. These are systems for doing very large scale analytic data processing, while Amazon Dynamo and Voldemort are systems for supporting rapid access to large volumes of data, such as is needed to support large and complex web sites.

The final trend is Open Source as a support model. Voldemort was developed by LinkedIn, whose primary business is providing a social and business network on the web, not writing and supporting a lot of complicated software. LinkedIn decided that they needed a tuple store like Amazon Dynamo and, as they could not buy it, they built it. However, they wanted help with support, so they released the software as an Open Source project. Now Voldemort is being used by several organizations, and at least half the people working on the code are from outside LinkedIn. When Sandeep Giri started the OpenI project, I asked him why he was releasing it as Open Source and he gave the same reason.

Sunday, August 30, 2009

Augmented Reality

Earlier this month there was an explosion of posts and comments on the TechCrunch blog about Apple rejecting the Google Voice application for the iPhone. Michael Arrington wrote a post, arguing that Apple's reasons for the rejection were misleading and untrue, that got over 400 responses. At the time I did not understand the intensity of the comments and responses, particularly in a quiet news month like August. Last week I went to the SDForum Virtual Worlds SIG to hear a talk about Augmented Reality, and started to appreciate what is going on.

The Augmented Reality presentation was given by Kari Pulli and Radek Grzeszczuk, researchers at the Nokia Research Center in Palo Alto. What they mean by Augmented Reality is that you point the camera in your smart phone at something and the phone displays more information about what you are looking at. For example, you point the phone at a building and it tells you which building you are looking at with perhaps a link to a map or information about the building. Alternatively, you could point the phone at a book cover and the phone will identify the book and give you links to reviews and a web site where you can buy the book.

Someone in the audience asked the interesting question "Where do you get your data?" There are many different places to get data; for the demos, the book cover data had been scraped from the web. But when it comes to data, the elephant in the room is Google, the company that promises to organize the world's information. To organize the world's information, they first have to collect it, and then they have to have the computer systems and technology to organize it. Google has been busy doing that for many years now.

The history of mobile appears to be going like this. First came the cellular network companies. They proved themselves incapable of providing anything more than voice and data services, so they are doomed to continue providing nothing but these basic services. As time goes on these services become less differentiated and eventually mere commodities.

Next come the smart phone providers like Blackberry and Apple. They have opened up the cell phone business model to provide services that their users really want. But they still rely on others for the data to run these services. When the data providers get a little too close to core functionality, the smart phone providers back off their openness, as Apple has with the Google Voice application.

The final step is for the data providers to take over, as Google is doing with the Google Android cell phone operating system. This is a play to reduce the devices to mere commodities and put the interesting business where it really belongs, with software and data.

Thursday, August 20, 2009

Media Convergence

The digital age has brought an extraordinary convergence of media that I have not seen remarked on anywhere. In the old world, each type of media was manufactured and delivered in its own different way. Movies were printed onto film and shown in movie theaters. Newspapers were printed on newspaper printing presses and delivered through a content delivery network that ended up with the product being thrown onto driveways in the early morning hours. Books were printed on book printing presses, bound and delivered through wholesalers to bookstores around the country. Records were pressed in record plants, delivered to music wholesalers and then to record stores. Radio and TV were produced in studios, sometimes recorded and sent around the country to be broadcast on local transmitters.

That has all changed. In the new digital world, each media type has the same underlying form. Spoken words, music, written words, pictures, moving pictures are all buckets of digital bits. While we can still get each type of media in its old form we can also get them all delivered to our computer, cell phone or media player through the internet or the cellular phone network.

Even the devices we use to consume media have converged. Most of them can handle everything, so let's take an extreme example, the Amazon Kindle book reader. While the primary purpose of the Kindle is reading books, it also has text to speech and handles audio files so that you can listen to music while reading. It will also display black and white pictures in the three common formats. So when you come to list the types of media that a Kindle can handle, it is quicker to say what it cannot do, that is color and moving pictures, than to list all the things that it can do.

This change to digital media is just upon us, so it is going to take some time for all the consequences to shake out. At the moment, there is great wailing and gnashing of teeth from the newspaper industry. Newspapers rely on advertising which always does badly in a recession, but this time they also have to deal with the air being sucked out of their lungs by internet advertising and free listings. For some time, movie producers have been worried that they may be MP3ed like the music industry. More recently, book publishers have become aware that their business model is targeted and they are starting to behave like deer in the headlights as well.

These are just media industries and their travails are just the price of doing business in a time of technological change. The interesting question is how it will affect culture. If all types of media are fundamentally equivalent, will our preferences, being unfettered, change? One change is that there is a move towards shorter forms. For example, online journalism is certainly shorter and more punchy than the printed equivalent. This is just one example of one direction that change could go in. I am sure that there will be more consequential changes, so let me know what you think.

Thursday, August 13, 2009

Too Cheap to Meter

Last month Malcolm Gladwell wrote a snarky review of Chris Anderson's book Free: The Future of a Radical Price for the New Yorker. You will remember that Anderson's last book The Long Tail produced a wide range of reactions and "Free" will be no different. I did not like the Gladwell review. He picks up on a lot of little things while missing the big picture. On the other hand the book is somewhat carelessly written so that it is easy to find little things to criticize.

An example is the discussion of the phrase "too cheap to meter". In the 1950's, Lewis Strauss, then head of the Atomic Energy Commission, predicted that atomic energy would make electricity so cheap to produce that there would be no need for electricity meters. Unfortunately too many people see that phrase and take it to mean that electricity would be free, which is not what Strauss was claiming as I will explain.

The book "Free" has a chapter called "Too Cheap to Matter" that starts with Strauss's claim and goes on to Moore's Law and other laws of shrinking prices. Anderson seems to imply that electricity could be free, and in a long and rambling footnote still does not get to the point. Gladwell in his review of the book picks up on the implication and castigates Anderson for thinking that electricity could ever be free, using his own words against him.

To understand Strauss, you need to look at a utility bill, where you will see that the charges come in two parts, a fixed component for providing the service and a variable component which is your actual metered use of the utility. Strauss was claiming that for electricity there would be no need for the variable part, all that would be needed was a fixed part to cover the cost of fixed generator plant, transmission and billing. Sorry Gladwell and Anderson, Lewis Strauss was not trying to say that electricity would ever be so abundant that it would be free.
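The arithmetic of the two-part bill is easy to sketch; the rates below are invented for illustration:

```python
# A utility bill has a fixed service charge plus a metered component.
fixed_charge = 12.00   # $/month for plant, transmission and billing
rate_per_kwh = 0.15    # $ per metered kilowatt-hour
usage_kwh = 500

bill = fixed_charge + rate_per_kwh * usage_kwh
print(f"${bill:.2f}")  # prints "$87.00"

# "Too cheap to meter" means the variable part vanishes,
# not the bill: the fixed charge remains.
unmetered_bill = fixed_charge
```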

It is not unusual for utilities to be unmetered. Here are three examples of unmetered utilities from my personal experience. Firstly, I pay a fixed price for the broadband pipe of my internet service. Secondly, where I grew up, domestic water is plentiful enough that it is not metered; householders pay a fixed price for a 1/2 inch water main connection or somewhat more for a 3/4 inch connection. Thirdly, my garbage is too difficult to meter, so I just pay a fixed price for the weekly emptying of a 32 gallon garbage cart.

Apart from some writing that seems to imply more than is actually there, I found Chris Anderson's new book to be forward looking and full of familiar arguments. Highly recommended. I will write more on the subject.

Sunday, July 26, 2009

Twittering Foodies

Given these difficult economic times, the latest trend in San Francisco dining is the unrestaurant, according to San Francisco Magazine. That is a posh way of describing eating from a food cart or truck. For example: Spenser-On-The-Go serves caper braised skate cheeks or frogs' legs and curry from a converted taco truck; Boccalone serves exquisite pulled pork sandwiches from a bicycle; the Creme Brulee Cart and Magic Curry Kart are just street carts.

As the vendors come and go and many of them are not properly licensed, the only way to find out where they are going to be serving is to follow them on Twitter. At last, a purpose for Twitter, if you are a committed foodie! As I have not quite gotten to the Escargot Puffs level yet, I have not joined Twitter, although I can see a glimmer of hope. On the other hand, David Letterman is still firmly in the camp that Twitter is a colossal waste of time, as this hilarious segment with Kevin Spacey shows.

LinkedIn's Data

LinkedIn has an extraordinary data resource. They have more than 40 million members and a complete job history of each member, in some cases going back 30 or 40 years. DJ Patil, Chief Scientist and Senior Director of Product Analytics at LinkedIn showed us some examples of their data when he spoke to a packed meeting of the SDForum Business Intelligence SIG on "The Analytics Behind LinkedIn" last week. Paul O'Rorke has written an excellent account of the meeting and here I am just adding my impression to that record.

DJ believes in the growing importance of the "data analyst" as a profession. He backed up that belief with some hard data when he showed us the growing prevalence of the job title over the last 35 years. Up to the mid 90's the appearance of that job title as a percentage of all job titles was flat, but since then it has been growing at a steady pace. As an aside, DJ told us that they use the Amazon Mechanical Turk service to do data cleansing of things like job titles. This is the first time I have heard of the service being used for this purpose.

We were shown other interesting examples of LinkedIn analytics including the change in the top five job titles over the Dot Com bust and an excellent display of the volume of cross country links between LinkedIn members. The big problem with this data is that we cannot have access to it because it is private to LinkedIn and they will keep it private to protect the privacy of their members.

Saturday, July 11, 2009

Graphs That Suck

Many years ago in the early days of the web, I learned about web site design by reading "Web Pages That Suck: Learn Good Design by Looking at Bad Design". It is a delightfully easy beginner level crawl through web site design, filled with examples ranging from excellent to awful with a capital 'A'. I would recommend the book today except that the examples that make up the bulk of the book are way out of date.

For Business Intelligence the equivalent would be a book called something like "Graphs that Suck", and Stephen Few's Perceptual Edge blog is a good place to find examples of this genre. Recently they posted a spectacularly bad example, a pie chart put out by Business Objects to promote a user conference. I will not repeat the critique, however I will say that if this is an example of what Business Objects thinks their software should be used for, I would be leery of using it!

Friday, July 03, 2009

Musician Uses Twitter to Her Advantage, Shock Horror Probe

Technology is turning the music business upside down, like any other media business. Some people embrace the change and some people decry it. When I read a post like this one about using Twitter to make money, I always read the comments. Whether the post is at the Berklee School of Music or TechCrunch, the range of responses is wide and consistent. Some commenters accept the new world and cheer it on, while others complain bitterly. Typical complaints range from: "I cannot do that because I do not have any fans" through "people should respect copyright and give me the money I am due" to "the record company put you there so you should give it all back to them".

The most ridiculous response is the complaint that a musician who spends time developing their fan base is wasting time that could be better spent on creative activities. The point of the Amanda Palmer post is that if you are properly organized, it does not take a lot of time or effort to keep in contact with your fans, particularly when using new instant communication tools like Twitter.

Technology changes. Music is no longer distributed as sheets of paper or by stamping it on 5, 7 or 12 inch pieces of plastic. The business model must change with the times.
The moving finger [of technology change] writes; and having writ,
Moves on: nor all your piety nor wit
Shall lure it back to cancel half a line,
Nor all your tears wash out a word of it.
HT to Roger for the Berklee post.

Tuesday, June 30, 2009

Actors and Concurrent Computation

Carl Hewitt was in fine form when he spoke about the Actor model of concurrent computation at the SDForum SAM SIG meeting on "Actor Architectures for Concurrent Computation". The meeting was a panel with three speakers. Carl Hewitt is Emeritus Professor in Electrical Engineering and Computer Science at the Massachusetts Institute of Technology; Hewitt and his students invented Actors in the 1970's. Frank Sommers of Artima Software is an active writer in the area of information technology and computing, and is currently writing a book on Actors in the Scala programming language. Robey Pointer is a software engineer at Twitter, and an enthusiastic user of Scala and Actors there. Bill Venners, president of Artima Software and an author of a book on Scala, moderated the proceedings.

Hewitt had retired when the advent of multi-core processors and distributed parallelism renewed interest in the Actor model. He has come out of retirement and is currently visiting Stanford. Hewitt described the genesis of the Actor methodology in Alan Kay's Smalltalk-72, where every object was autonomous and you acted on an object by sending it messages. Later versions of Smalltalk moved in a different direction. The most important aspect of the Actor model is that it decouples the sender from the communications. In practice this allows the Scala implementation to scale to millions of Actors engaged in concurrent communication (more on this later).

Hewitt spoke with great flourish on a number of other topics, including his determination that the Actor model should be properly represented on Wikipedia, and spooks and the internet archive. He spent some time on the unbounded non-determinism in the Actor model versus other concurrency formalisms that only support bounded non-determinism. An audience member challenged him to explain this better, citing Map-Reduce. Amusingly, Hewitt answered by describing the parallelism in Map-Reduce as like Stalin: Stalin has three deputies, and each of those deputies has three deputies; Stalin tells his deputies what to do, those deputies tell their deputies what to do, and so on. Thus the business of the state can proceed in parallel. Map-Reduce is a sequential algorithm that is sped up by parallelism; there is no non-determinism. This is parallelism as opposed to concurrency.
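The determinism point can be illustrated with a toy word count in Python: however the input is split among the "deputies", merging their partial results always gives the same answer. This is only a sketch of the Map-Reduce idea, not of any particular framework:

```python
from collections import Counter
from functools import reduce

def map_phase(chunk):
    # Each deputy independently counts words in its own chunk.
    return Counter(chunk.split())

def reduce_phase(a, b):
    # Merging partial counts is associative and commutative, so the
    # order in which deputies finish cannot change the final answer.
    return a + b

chunks = ["the cat sat", "on the mat", "the end"]
result = reduce(reduce_phase, map(map_phase, chunks))
print(result["the"])  # prints 3
```

Because the merge step is order-independent, the parallel computation is fully deterministic, which is exactly what distinguishes it from the unbounded non-determinism of the Actor model.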

Next Frank Sommers spoke on how Actors are used in Scala. The good news is that Actors are implemented in Scala and Hewitt much preferred the Scala implementation over the Erlang implementation of Actors. The bad news is that there are a number of issues with the Scala implementation. For example, a Scala program cannot exit from a "receive" statement. Another issue is that messages are supposed to be immutable, however the current implementation may not ensure this. These and other issues are being worked on, and the next version of Actors in Scala will be much better.

Finally, Robey Pointer talked about how he is using Scala. He implements message queuing systems that deal with large numbers of long lived connections, where each connection is mostly idle but has sporadic bursts of activity. Robey has implemented this system in many different ways. For example, a straight thread implementation and a lot of tuning got up to 5000 thread-based connections working at one time, which fell well short of his goal of supporting millions of connections. A thread pool implementation with a few hundred threads worked better, but the code became unwieldy and more complicated than it should have been. Now he has an Actor based implementation in Scala that does scale to millions of connections, and yet the code remains straightforward and small.
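The shape of an actor (a mailbox drained by a single logical thread, so the actor's state needs no locks and senders never block on the receiver) can be sketched in Python. This toy uses one OS thread per actor; a real actor runtime like Scala's multiplexes many actors over a few threads, which is what lets it reach millions of connections:

```python
import queue
import threading

class Actor:
    """A toy actor: a mailbox plus one thread that handles messages
    strictly one at a time, so the actor's state needs no locks."""
    def __init__(self):
        self.mailbox = queue.Queue()
        self.count = 0
        self._thread = threading.Thread(target=self._run)
        self._thread.start()

    def send(self, msg):
        # Sending never blocks on the receiver: the message just
        # lands in the mailbox, decoupling sender from receiver.
        self.mailbox.put(msg)

    def stop(self):
        self.mailbox.put(None)  # poison pill ends the actor's loop
        self._thread.join()

    def _run(self):
        while (msg := self.mailbox.get()) is not None:
            self.count += 1    # "handle" the message

a = Actor()
for i in range(100):
    a.send(i)
a.stop()
print(a.count)  # prints 100
```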

He also showed us how Actors can be mixed in with thread based synchronization to solve problems for which even Actors are too heavyweight. I am in two minds about this. On the one hand, there are legitimate uses for this low level synchronization (as discussed in my PhD thesis). On the other hand, thread based concurrency is awful, as I keep promising to explain in another post. Also, to do it safely, you need to understand in great detail how Actors are implemented in Scala, and one reason for adopting a high level construct like Actors is that it should hide the gory implementation details.

After the meeting I spoke with Carl Hewitt. We agreed that sending a message needs to have a low overhead; it should have a similar cost to calling a procedure. Computers have specialized instructions to support procedure calls, and they need specialized instructions to support message passing. We did this for the Transputer, although that was before its time, and it is eminently possible for Actors. All we need is for a high level concurrency formalism like Actors to get enough traction that the chip manufacturers become interested in supporting it.

Monday, June 22, 2009

Now You See It

While visualization can be an effective tool to understand data, too many software vendors seem to view visualization as an opportunity to "bling your graph", according to Stephen Few, author, teacher and consultant. Few has just published a new book called "Now You See It: Simple Visualization Techniques for Quantitative Analysis". He spoke to the June meeting of the SDForum Business Intelligence SIG.

Few took us on a quick tour of visualization. We saw a short Onion News Network video that satirized graphics displays in news broadcasts, followed by examples of blinged graphs and dashboards that were both badly designed and misleading in their information display. Not all visualizations are bad. An example of good visualization is the work of Hans Rosling, who is a regular speaker at the TED conference (his presentations are well worth watching, and then you can go and play with the data just as he does). Another example of visualization used effectively to tell a story is in the Al Gore documentary "An Inconvenient Truth".

Next came a discussion of visual perception, leading up to the idea that we can only keep a few items in our short term memory at one time; however, these items can be complex pieces of visual information. Given that data analysis is about comparing data, visual encoding allows us to see and compare more complex patterns than, for example, tabular data.

Any data display can only show us a small part of the picture. An analyst builds understanding of their data set by building up complex visualizations of the data, a piece at a time. We saw some examples of these visualizations. Software should support the data analyst as they build up their visualizations without getting in the way. Few told us that the best software is rooted in academic research. He recommended several packages including Tableau and Spotfire, both of whom have presented to the Business Intelligence SIG in the past.

Monday, June 15, 2009

Free: The Future of a Radical Price

For some time I have intended to write a post on the economics of media now that the cost of manufacturing it has gone to nothing. Today I discovered that Chris Anderson, editor of Wired and author of "The Long Tail", has written a book on the subject called "Free: The Future of a Radical Price", available in July. I will write a post after reading the book; here is an outline of what I expect it to say.

Firstly, as the TechCrunch post says, "As products go digital, their marginal cost goes to zero." It is now economic to give the product away, and make it up on volume. Closely related is the network effect, the more widespread that some piece of media is, the more "valuable" that it becomes. Barriers to media becoming widespread reduce the likelihood that it is seen or heard. Cost is definitely a barrier.

Moreover, putting a price on your media item creates the opportunity for others to price for free and undercut you. A good example is craigslist. It may not be quite what you think of as media, but craigslist is in the process of decimating the newspaper industry by destroying their market for classified advertisements. Craigslist makes their money by selling access to narrow niche markets, so it seems to fit in perfectly with Anderson's thesis.

In the past I have written about the future of music and how musicians are moving to make their money from performance rather than from record sales. As goes music, so goes all media. My sister is currently writing a book. This last week she told me that she expects to make her living from touring to lecture on the book's contents rather than from book sales.

Tuesday, June 02, 2009

Databases in the Cloud

Last week was a busy week, with Databases in the Cloud on Tuesday followed by Hadoop and MapReduce with Cascading on Wednesday. These were both must-attend SDForum SIG meetings for anyone who wants to keep up with new approaches to database and analytics systems. The two meetings had very different characters. MapReduce with Cascading was a technical presentation that required concentration to follow, but did contain some real nuggets of information. The Cloud Services SIG meeting on Tuesday, Demo Night: Databases in the Cloud, was more accessible. This post is about Databases in the Cloud.

Roger Magoulas of O'Reilly Research started the meeting by discussing big data and their experience with it. A useful definition of "Big Data" is that when the size of the data becomes a problem, you have Big Data. O'Reilly has about 6 TBytes of data in their job database, more than a billion rows. The data comes from the web and it is messy. They use Greenplum, a scalable MPP database system suitable for cloud computing that also has built in Map-Reduce. Like many people doing analytics, they are not really sure what they are going to do with the data, so they want to keep things as flexible as possible with flexible schemas. Roger and the O'Reilly team believe that making sense of "Big Data" is a core competency of the Information Age. On the technology side, Big Data needs MPP parallel processing and compression. Map-Reduce handles big data with flexible schemas and is resilient by design.

After Roger came three demos. Ryan Barrett from Google showed us a Google App Engine application that uses the Google datastore. Google App Engine is a service for building web applications that is free for small applications, and paid for when the application scales. The datastore is BigTable, a sharded stateless tuple store for big data (see my previous posts on the Google Database System and Hypertable, a clone of BigTable). Like every other usable system, Google has its own high level query language called GQL (Google Query Language), whose statements start with the verb SELECT. To show that they are serious about supporting cloud applications, Google also provides bulk upload and download.

Cloudera is a software start up that provides training and support for the Open Source Hadoop MapReduce project. Christophe Bisciglia from Cloudera gave an interesting analogy. First he compared the performance of a Ferrari and a freight train: a Ferrari has fast acceleration and a higher top speed but can only carry a light load, while a freight train accelerates slowly and has a lower top speed, but can carry a huge load. Then he told us that a database system is like the Ferrari, while Map-Reduce is like the freight train. Map-Reduce does batch processing and is capable of handling huge amounts of data, but it is certainly not fast and agile like a database system, which is capable of giving answers in real time.

Finally George Kong showed us the Aster Data Systems MPP database system with a Map-Reduce engine. They divide their database servers into three groups: the Queen that manages everything, Worker hosts that handle queries, and Loader hosts that handle loading. This is a standard database system that works with standard tools such as Informatica, Business Objects, MicroStrategy and Pentaho. It is also capable of running in the elastic cloud. For example, one of their customers is ShareThis, which keeps a 10 TByte Aster Data Systems database in the cloud and uses MicroStrategy and Pentaho for reporting.

Friday, May 29, 2009

Using BI to Manage Your Startup

We heard several different perspectives on how Start Ups use Business Intelligence at the May meeting of the SDForum Business Intelligence SIG. The meeting was a panel, moderated by Dan Scholnick of Trinity Ventures. Dan opened the meeting by introducing himself and then asking the panelists to introduce themselves.

The first panelist was Naghi Prasad, VP, Engineering & Operations at Offerpal Media, a start up that allows developers to monetize social applications and online games. Offerpal Media is a marketing company that does real time advertisement targeting and uses a variety of analytics techniques such as A/B testing. Naghi told us that Business Intelligence is essential to the company's business and baked into their framework.

Next up was Lenin Gali, Director of Business Intelligence at ShareThis, a start up that allows people to share content with friends, family and their network via email, SMS and social networking sites such as Facebook, Twitter, MySpace and LinkedIn. ShareThis also uses A/B testing, and as a content network has to deal with large amounts of data.

Third was Bill Lapcevic, VP of Business Development at New Relic, which provides Software as a Service (SaaS) performance management for the Ruby on Rails web development platform. New Relic acquired 1700 customers over its first year as a start up with a single sales person. Their customers are technical, and New Relic uses its own platform to track the addiction or pain of each customer, and to estimate their potential budget.

The final panelist was Bill Grosso, CTO & VP of Engineering at Twofish, a start up that offers SaaS based virtual economies for Virtual Worlds and Massive Multiplayer Online Games (MMOG). For the operator, a virtual economy is Sam Walton's analytics dream, as you see into every player's wallet and capture their every purchase and exchange. Twofish uses their experience with running multiple virtual economies to tell their customers what they are doing right and wrong in developing a virtual economy.

Dan's first question was "What are some of the pitfalls of Business Intelligence?" Bill Lapcevic told us that they have a real time reporting system that can track revenue by the minute. The problem is that you can become addicted to data and spend too much time with it; sometimes you need to get away from your screen and talk to the customer. Lenin agreed with this and added that they have problems with data quality. Naghi told us that while a benefit is the surprises that they find in the data, a problem is that they are never finished with their analytics. Bill Grosso was concerned with premature generalization: you need to wait until you have enough data to support conclusions, and revisit the conclusions as more data arrives.

There was a wide variety of answers to the question of which tools each panel member used. According to Naghi Prasad, "MySQL is a killer app: it will kill your app!" Offerpal Media uses Oracle for their database. While they like some of the features of Microsoft SQL Server, they are constrained to have only one Database Administrator (DBA), and DBAs are best when they specialize in one database system. They use open source Kettle for ETL and Microsoft Excel for data presentation. Naghi extolled the virtues of giving users data in a spreadsheet they are comfortable with, and Excel pivot tables allow the user to manipulate their data at will. After surveying what was available, they implemented their own A/B testing package.

ShareThis is on the leading edge of technology use. Lenin told us that they are 100% in the cloud, using the LAMP stack with MySQL and PHP. They have a 10 Terabyte Aster Data Systems database, and use both MicroStrategy and Hadoop with Cascading for data analysis and reporting. Running this system takes about 1.5 system admins.

As might be expected, the New Relic system is built on Ruby on Rails and uses sharded MySQL to achieve the required database performance. In their experience it is sometimes worth paying a little more for hardware rather than squeezing the last ounce of performance out of a system. They have developed many of their own analytics tools, which they expect to sell as a product to their customers.
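Sharded MySQL setups generally route each row to a shard by a deterministic function of a key, so all of one customer's data lands on the same server. A minimal sketch of the idea, with hypothetical shard names (not New Relic's actual scheme):

```python
NUM_SHARDS = 8

# Hypothetical DSNs -- in a real deployment each shard would be a
# separate MySQL server (or a master/replica pair).
SHARD_DSNS = ["mysql://db%d.internal/metrics" % i for i in range(NUM_SHARDS)]

def shard_for(account_id):
    """Route every query for one account to the same shard, so all of
    an account's rows live together and joins stay shard-local."""
    return SHARD_DSNS[account_id % NUM_SHARDS]
```

The design choice is that queries within one account never cross shards, which is exactly the access pattern of a per-customer monitoring product; cross-account reports then require aggregating over all shards.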

As TwoFish does accounting for virtual worlds, their servers are not in the cloud, rather they are locked in their cage in a secure data center. While Bill Grosso lusts after some features in Microsoft SQL Server, they use MySQL with Kettle for ETL. They have developed their own visualization code that sits in front of the Mondrian OLAP engine. They expect to do more with the R language for statistical analysis and data mining.

Dan asked the panel how they get the organization to use their Business Intelligence product. Bill Grosso led off by saying that adoption has to come from the top: if the CEO is not interested in Business Intelligence, then nobody else will be either. He also called for simple metrics that make a point. Bill Lapcevic agreed that leadership should come from the top. The idea is to make the data addictive to users and to avoid too many metrics. Sharing data widely can help everyone understand how they can contribute to improving the numbers. Lenin thought that it was important to make decisions and avoid analysis paralysis. Naghi offered that Business Intelligence can scare users who are unfamiliar with it. You have to provide simple material, and make sure that you score some sure hits early on to encourage people. Finally, remember that different people need different reports, so make sure each report is specialized to the requirements of the person receiving it.

There were more questions asked, too many to describe in detail here. All in all, we had an informative discussion throughout the evening with a lot of good information shared.

Saturday, May 16, 2009

Virtual Worlds - Real Metrics

Avoid hyperinflation! This was the salient piece of advice for running a virtual economy that I got from Bill Grosso's talk to the April meeting of the SDForum Emerging Technology SIG. The talk was entitled "Virtual Worlds and Real Metrics". Bill is CTO of TwoFish, a startup that provides virtual economies to online worlds. Bill has an undergraduate degree in Economics, so he has both academic knowledge and the practical experience of seeing the insides of virtual economies in online worlds.

In case you are wondering what a virtual economy is, two leading examples are found in World of Warcraft and Second Life. Both are massive multiplayer online (MMO) worlds. World of Warcraft is a fantasy game based on combat where players receive rewards for gameplay. The economy greases interactions between players and allows them to exchange rewards. Although it is not the primary point of the game, the World of Warcraft economy is estimated to be larger than the economies of many small countries. Second Life is an online world where you can establish a second life for yourself. Like all the other aspects of Second Life, the economy is intended to provide a virtual reflection of a real economy.

Although it is attractive to think that virtual economies are like real economies, Bill spent quite some time disabusing us of this notion. Unlike real money, virtual money may have different costs in different localities, sometimes you get a bulk discount on large purchases of virtual money, and in some cases old money may expire. Another issue is that virtual money does not circulate in the same way as real money. In the real world, money, once set free, circulates from person to person and business to business. In a virtual world, money is usually created by the game, flows through a player, and then back to the game. It seems that the degree to which money circulates between players is a good measure of how "real" the economy is in a virtual world.

For measuring money velocity, virtual worlds have an advantage. In the real world, statisticians estimate the money supply, then estimate the total value of transactions in the economy, and divide one by the other to get money velocity. In a virtual world, the man behind the curtain sees into every player's wallet and every transaction, so the calculation of money velocity is exact. Moreover, by linking demographics and other knowledge to players you have a Business Intelligence analytics wonderland: precise reports on every aspect of the economy that can be used for all sorts of marketing purposes.
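That exact calculation can be sketched directly from the operator's transaction log. The log format here, (payer, payee, amount) records with the operator appearing as "game", is a hypothetical illustration, and the second function measures the player-to-player circulation discussed above:

```python
def money_velocity(transactions, money_supply):
    """Exact money velocity: total value transacted in the period
    divided by the money in circulation. In a virtual world both
    numbers come straight from the operator's own records, so no
    statistical estimation is needed."""
    total_volume = sum(amount for _, _, amount in transactions)
    return total_volume / money_supply

def player_to_player_share(transactions):
    """Share of volume exchanged directly between players -- a rough
    measure of how 'real' the economy is, since most virtual-world
    money flows through the game operator rather than circulating."""
    total = sum(a for _, _, a in transactions)
    p2p = sum(a for src, dst, a in transactions
              if src != "game" and dst != "game")
    return p2p / total

# Hypothetical log: (payer, payee, amount) tuples for one period.
log = [("alice", "bob", 50), ("bob", "game", 30), ("game", "carol", 120)]
velocity = money_velocity(log, money_supply=100)
```

In this toy log, 200 units change hands against a money supply of 100, for a velocity of 2.0, but only a quarter of the volume moves between players rather than to or from the game.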

Hyperinflation has been and always will be a pitfall of virtual economies. It can come from bad design of the virtual economy, but it is more likely to come from players finding and exploiting bugs in the game, and there are always bugs in a game. Eternal vigilance is the price of avoiding hyperinflation, so analytics reports are an important part of managing this aspect of the economy as well.

Sunday, May 03, 2009

Gartner BI Summit

Suzanne Hoffman, VP Sales at Star Analytics, recently attended the Gartner BI Summit and she gave the SDForum Business Intelligence SIG a short overview of the conference at the April meeting. You can get her presentation from our web site. There were several threads running through her presentation: one was the scope of Business Intelligence (BI); another, related thread was Business Intelligence versus Performance Management (PM or EPM). Note that when "E" is prefixed to an acronym, it stands for Enterprise.

Although the title of the conference is the Gartner BI Summit, there seemed to be a lot of concern for placing BI within the context of Information Management (IM or EIM), where Information Management includes both search and social software for collaboration as well as BI and PM. On a side note, I have always found enterprise search to be terrible, as discussed previously. Anything that can be done to improve enterprise search is worthwhile, and both analytics and social efforts like bookmarking or tagging can and should be applied to make it better.

Gartner sees BI evolving from something that is mostly pushed by Information Technology (IT) to something more broadly focused on Performance Management driving Business Transformation. There has been tension between the terms Business Intelligence and Performance Management for some time. For example, I wrote a semi-serious post on the subject in 2004. At the BI SIG we have always used BI as an umbrella term that encompasses customer and supply chain analytics and management. On the one hand, maybe BI is too associated with IT and not enough with end users such as business analysts. On the other hand, Performance Management may have a narrower focus on financial analysis, which is a large and important part of analytics, but not the whole enchilada by any means.

Whichever term is chosen as the umbrella, we will continue to call our SIG the Business Intelligence SIG for as long as Gartner has BI Summits.

Saturday, May 02, 2009

The Next Revolution in Data Management

Cringely wrote a great post today called "The Sequel Dilemma". His point is that we are in the midst of a revolution in the way we do data management: the relational database is like a horse and buggy, soon to be run over by the next generation of data management tools, for example the Google database system that I wrote about last year. I particularly liked his comment:
Right now almost every web application has an Apache server fronting a database box running MySQL or its closed source equivalent like Oracle, DB2, or SQL Server. The data bottleneck in all those applications is the SQL box, which is generally doing a very simple job in a very complex manner that made total sense for minicomputers in 1975 but doesn’t make as much sense today.

Saturday, April 25, 2009

Mahout on Hadoop

No, this is not a tale of an elephant and his faithful driver, I am talking about Mahout, an Open Source project that is building a set of serious machine learning and analytic algorithms to run on the Hadoop Open Source Map-Reduce platform. We learned about this at the April meeting of the SDForum Business Intelligence SIG where Jeff Eastman spoke on "BI Over Petabytes: Meet Apache Mahout".

As Jeff explained, the Mahout project is a distributed group of about 10 committers who are working on implementing different types of analytics and machine learning algorithms. Jeff's interest is in clustering algorithms, which are used for various purposes in analytics. One use is to generate the "customers who bought X also bought Y" come-on that you see at an online retailer. Another use of clustering is to create a small number of large groups of customers with similar behavior, to understand patterns and trends in customer purchasing.

Jeff showed us all the Mahout clustering algorithms, explaining what you need to provide to set up each algorithm and giving graphical examples of how they behaved on an example data set. He then went on to show how one algorithm was implemented on Hadoop. This implementation shows how flexible the Map-Reduce paradigm is. I showed a very simple example of Map-Reduce when I wrote about it last year so that I could compare it to the same function implemented in SQL. Clustering using Map-Reduce is at the other end of the scale: a complicated big data algorithm that can also effectively use the Map-Reduce platform.

Most clustering algorithms are iterative. From an initial guess at the clusters, an iteration moves data points from one cluster to another to make better clusters. Jeff suggested that a typical application may need 10 iterations or so to converge to a reasonable result. In Mahout, each iteration is a Map-Reduce step. He showed us the top level code for one clustering algorithm. Building on the Map-Reduce framework and the Mahout common libraries for data representation and manipulation, the clustering code itself is pretty straightforward.
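A single iteration of this kind can be sketched in Map-Reduce terms. This is not Mahout's actual code, just a toy k-means-style iteration on one-dimensional data: the map phase assigns each point to its nearest centroid, and the reduce phase averages each group to produce the new centroids.

```python
from collections import defaultdict

def map_phase(points, centroids):
    """Map: emit (nearest-centroid-index, point) for every data point."""
    for p in points:
        nearest = min(range(len(centroids)),
                      key=lambda i: abs(p - centroids[i]))
        yield nearest, p

def reduce_phase(pairs):
    """Reduce: average the points assigned to each centroid to get
    the new centroid positions."""
    groups = defaultdict(list)
    for idx, p in pairs:
        groups[idx].append(p)
    return {i: sum(ps) / len(ps) for i, ps in groups.items()}

# One iteration on 1-D data with two initial centroids.
points = [1.0, 2.0, 10.0, 11.0, 12.0]
new_centroids = reduce_phase(map_phase(points, [0.0, 15.0]))
```

Running this repeatedly, feeding each iteration's output centroids back in as the next iteration's input, is exactly the chain of Map-Reduce jobs described above; the framework handles distributing the points, and the algorithm itself stays small.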

Up to now, it has not been practical to do sophisticated analytics like clustering on datasets that exceed a few megabytes, so the normal approach is to sample the dataset to get a small representative sample and then do the analytics on that sample. Mahout enables the analytics on the whole data set, provided that you have the computer cluster to do it.

Given that most analysts are used to working with samples, is there any need for Mahout scale analytics? Jeff was asked this question when he gave the presentation at Yahoo, and he did not have a good answer then. Someone in the audience suggested that analytics on the long tail requires the whole dataset. After thinking about it, processing the complete dataset is also needed for collaborative filtering like the "customers who bought X also bought Y" example given above.

Note that at the BI SIG meeting Suzanne Hoffman of Star Analytics also gave a short presentation on the Gartner BI Summit. I will write about that in another post.

Wednesday, April 08, 2009

Ruby versus Scala

An interesting spat has recently emerged in the long running story of Programming Language wars. The Twitter team, who had long been exclusively a Ruby on Rails house, came out with the "shock" revelation that they were converting part of their back end code to use the Scala programming language. Several Ruby zealots immediately jumped in saying that the Twitter crew obviously did not know what they were doing because they had decided to turn their backs on Ruby.

I must confess to being amused by the frothing at the mouth from the Ruby defenders, but rather than laughing, let's take a calm look at the arguments. The Twitter developers are still using Ruby on Rails for its intended purpose of running a web site. However they are also developing back-end server software and have chosen the Scala programming language for that effort.

The Twitter crew offer a number of reasons for choosing a strongly typed language. Firstly, dynamic languages are not very good for implementing the kind of long running processes that you find in a server. I have experience with writing servers in C, C++ and Java. In all these languages there are problems with memory leaks that cause the memory footprint to grow over the days, weeks or months that the server is running. Getting rid of memory leaks is tedious and painful, but absolutely necessary. Even the smallest memory leak becomes a problem under heavy usage, and the only cure is stopping and restarting the server. Note that garbage collection does not do away with memory leaks, it just changes the nature of the problem. Dynamic languages are designed for rapid implementation and hide the boring details. One detail that is missing is control over memory usage, and memory usage left on its own tends to leak.
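To make the point concrete, here is a sketch (in Python, a dynamic language; the names are hypothetical) of how a garbage-collected server still leaks when a cache keeps references alive, together with the usual bounded-cache fix:

```python
from collections import OrderedDict

# A classic "leak" in a garbage-collected language: the objects are
# still reachable, so the collector can never reclaim them.
_session_cache = {}

def handle_request(session_id, payload):
    # Every new session adds an entry that is never evicted; over
    # weeks of uptime the server's footprint grows without bound.
    _session_cache.setdefault(session_id, []).append(payload)
    return len(_session_cache)

# A bounded cache with eviction is the usual fix:
class LRUCache:
    def __init__(self, capacity):
        self.capacity = capacity
        self._data = OrderedDict()

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)       # mark as most recently used
        if len(self._data) > self.capacity:
            self._data.popitem(last=False)  # evict the oldest entry
```

The garbage collector is doing its job in both cases; the difference is that the bounded cache deliberately drops references so old entries become collectable.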

Another issue is concurrency. Server software needs to exploit concurrency, particularly now in the era of multi-core hardware. Dynamic languages have problems with concurrency. There are a bunch of issues, too many to discuss here. Suffice it to say that in the past Guido van Rossum has prominently argued against putting threads into Python, another dynamic language, and both Python and Ruby implementations suffer from poor thread performance.

A third issue is type safety. As the Twitter crew say, they found themselves building their own type manager into their server code. In a statically typed language, the type management is done at compile time, making the code more efficient and automatically eliminating the potential for a large class of bugs.

Related to this, many people commented on the revelation that the Twitter Ruby server code was full of calls to the Ruby kind_of? method. It is normally considered bad form to use kind_of? or its equivalent in other languages, like the Java instanceof operator. After a few moments' thought, I understood what the kind_of? code is for. If you look at the code of any real server, such as a database server, it is full of assert statements. The idea is that if you are going to fail, you should fail as fast as you can and let the error management and recovery system get you out of trouble. Failing fast reduces the likelihood that the error will propagate and cause real damage like corrupting persistent data. Also, with a fast fail it is easier to figure out why the error occurred. In a language with dynamic typing, checking parameters with kind_of? is the first kind of assert to put in any server code.
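As an illustration of this fail-fast pattern (in Python rather than Ruby; the function and its checks are hypothetical, not Twitter's code), the equivalent of those kind_of? guards looks like this:

```python
def credit_account(account_id, amount):
    """Fail fast: reject bad argument types at the boundary, before
    they can propagate and corrupt persistent state. This mirrors what
    kind_of? checks do in dynamically typed Ruby server code; a
    statically typed language gets the same guarantee at compile time."""
    assert isinstance(account_id, int), "account_id must be an int"
    assert isinstance(amount, (int, float)), "amount must be numeric"
    # ... perform the credit against persistent storage here ...
    return account_id, float(amount)
```

Passing a string where an int is expected fails immediately at the function boundary, close to the cause, instead of much later inside some storage layer.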

So the Twitter developers have opted to use Ruby on Rails for their web server and Scala for their back-end server code. In the old days we would have said "horses for courses" and everyone would have nodded their heads in understanding. Nowadays, nobody goes racing, so nobody knows what the phrase means. Can anyone suggest a more up to date expression?

Sunday, April 05, 2009

Cloud to Ground

Everybody is talking about Cloud Computing, the idea that your computing needs can be met by utility computing resources out there on the internet. One sometimes overlooked issue with Cloud Computing is how you get your data out of the cloud, a problem summed up in the phrase "Cloud-to-Ground".

The issue is that you have your data in the cloud, but you need it down here in your local computing systems so that, for example, you can prepare a presentation for the board, generate a quarter-end report, or confirm that a new customer can get the telephone support they just paid for. While it is not hugely different from other data integration problems, it is one more thing to put on your checklist when you think about how you are going to use Cloud Computing.
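Cloud-to-ground integration in miniature is a fetch step and a landing step. In this sketch the endpoint URL, bearer-token auth, and field names are all assumptions, not any particular vendor's API:

```python
import csv
import json
import urllib.request

def fetch_records(url, api_key):
    """Pull JSON records out of a (hypothetical) SaaS REST endpoint."""
    req = urllib.request.Request(
        url, headers={"Authorization": "Bearer " + api_key})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def land_records(records, out_path):
    """Land the records in a local CSV so ground-side tools (a
    spreadsheet, the quarter-end report) can work with them."""
    with open(out_path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=sorted(records[0]))
        writer.writeheader()
        writer.writerows(records)
```

Real cloud-to-ground tools add the hard parts this sketch omits: incremental extracts, schema mapping, retries, and scheduling.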

I first heard the phrase last year when Mike Pittaro of SnapLogic spoke to the SDForum Business Intelligence SIG on SaaS Data Integration. It was only later that I discovered that the phrase originally describes a type of lightning.

Sunday, March 29, 2009

I was out of town and could not attend the SDForum SAM SIG meeting on the architecture, which was a shame as it seems to have been a fascinating presentation. I have been following for some time. We had them present to the SDForum Business Intelligence SIG in 2002. In 2006, Ken Rudin, an early employee, gave an interesting presentation on his experience to the SDForum SaaS SIG.

On the one hand, has built the first really successful Software as a Service (SaaS) application and continues to grow the company year after year. On the other hand, there is a certain amount of hype surrounding the company. Here is the unvarnished story of what they do. In the USA, there are tens of thousands of companies with distributed sales forces. Each company has to keep in contact with and track what its salespeople are doing, since each salesperson works out of their home or an anonymous office suite far from headquarters. provides the application to manage a distributed sales force. It is as simple as that.

This is the perfect web-based SaaS application. There is a large client base. Each client's problem is to keep contact with a distributed sales force, dictating a web-enabled application. There are many small clients who do not have the resources to implement their own sales force management application. In practice each client needs the same basic functions in their application, with some minor variations.

They started out by offering their application to the smaller clients who needed a few seats and would have the most difficulty implementing their own stand-alone software application. With experience they made their application economic for medium-sized clients with hundreds of seats. Eventually they got to the point where they could effectively support the largest clients, like Merrill Lynch with 25,000 seats.

Thursday, March 19, 2009

Google Blog Search is a Misnomer

Last week I saw one news item in passing that seemed to confirm my previous post about Twitter: the news that Twitter Search is about to overtake Google Search. At least that is what I thought it said, but it seemed so unlikely that I just dismissed it. Later, when I went back and looked again, I discovered that the breaking news is that Twitter Search is about to overtake Google Blog Search.

I have tried to use Google Blog Search a couple of times and found it to be completely useless. It seems to order search results in time rather than by relevance. Others have complained about this as well. For example, Michael Arrington of TechCrunch made the same complaint when the service debuted in 2006. It was only after a conversation last weekend that I came to realize that the point of Google Blog Search is to provide search results in time order rather than in relevance order (in practice, you can sort by relevance or by time, but the relevance order seems to be heavily influenced by recency).

Which brings me to my point. When I think of Google Search, I think of the wonderful way that it always shows me relevant results on the first page. Since the service orders results by recency, the name Google Blog Search does not explain its purpose and is thus a confusing misnomer. Ultimately the name dilutes the Google brand. I cannot offer any good suggestions, but Google Blog Search should change its name to something that clearly explains what it is for.

Saturday, March 14, 2009

Twitter is the new Black

This is the age of Twitter. In the last couple of weeks, both Doonesbury and Jon Stewart have made fun of it. Members of the technorati who have been using Twitter for some time and have built up a solid following have suddenly found their lead in the number of followers eviscerated as Twitter goes mainstream and people we have all heard of become the most popular. As might be expected, the loudest complaints have come from Dave Whiner.

I have written about Twitter in the past and how it relates to various feed technologies. However, I must confess that I do not use the service, for the simple reason that I have no use for it. We live in a very noisy world. I want to keep the noise level down, and Twitter seems to just increase the noise.

For example, when I first started using RSS to follow stuff, I subscribed to feeds from all the places that I regularly follow. Then I realized that there is no point in using RSS to follow a site that publishes several times a day. If I want to see what they have to say, I can just go to the site at any time and see their latest stuff. So I cut my RSS subscriptions back to the sites that publish infrequently. That way a quick daily look at my RSS feeds allows me to catch up with all sorts of things without being overloaded.

That is not to say that I do not see value in Twitter; it is just that I do not see a use case for using it myself right now. Twitter is most useful for people whose job is about communication. My job is to get things done, and to do that I often need to switch off the outside world to reduce the noise.

This attitude to Twitter could easily change. Many years ago, a friend suggested that I use Instant Messaging (IM). At the time I had no use for it and did not subscribe. A couple of years later, my manager asked that all of his reports be available through IM whenever they were working, so that, for example, he could ask a question of anyone from a meeting. Ever since then I have been online in IM whenever I have been at work. At a minimum, it shows my colleagues in a distributed organization that I am at work and available. Just as I found a use for IM, I could find a use for Twitter.