Wednesday, May 25, 2011

On Copyright and Open Source

Copyright is a key part of an Open Source or Free Software project. It may sound like copyright is antithetical to Free and Open software, but if Richard Stallman, President of the Free Software Foundation (FSF), thinks that ownership of copyright is an important part of Free Software, then we should believe him. A couple of things have led me to this conclusion. Firstly, at the February meeting of the Business Intelligence SIG, Ian Fyfe discussed the business of Open Source suites and how Pentaho is able to offer a suite of Open Source projects as a commercial product by controlling the Open Source projects, and in particular the copyright to the code.

The other clue to the importance of copyright came by accident as I was looking at the difference between the emacs editor and the XEmacs editor. Emacs was an open software project that forked in the early 1990s, before the terms Free Software and Open Source had even been invented. One of the criticisms that Stallman, speaking for the emacs project, levels against the XEmacs project is that it has been sloppy about the ownership of the code and has not always obtained the "legal papers" that assign ownership of a contribution to the project. On this web page about XEmacs versus emacs, Stallman says:
"XEmacs is GNU software because it's a modified version of a GNU program. And it is GNU software because the FSF is the copyright holder for most of it, and therefore the legal responsibility for protecting its free status falls on us whether we want it or not. This is why the term "GNU XEmacs" is legitimate.

"But in another sense it is not GNU software, because we can't use XEmacs in the GNU system: using it would mean paying a price in terms of our ability to enforce the GPL. Some of the people who have worked on XEmacs have not provided, and have not asked other contributors to provide, the legal papers to help us enforce the GPL. I have managed to get legal papers for some parts myself, but most of the XEmacs developers have not helped me get them."
Note that GNU is the FSF "brand" for its software. The legal papers that Stallman references assign ownership and copyright of a code contribution to the FSF. Because the FSF owns the code, it can enforce its rights as owner against anyone who breaks its license. It can also change the terms of the license, and license the code to another party under any other license that it sees fit. The FSF has changed the license terms of the code that it owns: as new versions of the GNU General Public License (GPL) have emerged, the FSF has upgraded the license to the latest version.

Copyright and Open Source is a study in contradictions. On the one hand, Richard Stallman has been "campaigning against both software patents and dangerous extension of copyright laws". On the other hand, he uses ownership of copyright to push his agenda through the GNU General Public License, which has a viral component: the source code of any software that is linked with GPL-licensed software must be published as open source software. I will write more about this issue.

A good Open Source project needs to make sure that everyone who contributes code to the project signs a document that assigns copyright of their contribution to the project. Unless care is taken to make all the code belong to a single entity, each person who has contributed to the code owns their contribution. If the project wants to be able to do anything with the code other than passively allow its distribution under its existing license, the code must be owned by a single entity. As Stallman says, the project may not be able to defend its own rights unless the code has clear ownership.

Wednesday, May 18, 2011

The Facebook PR Fiasco

Last week came the revelation that Facebook had secretly hired a prestigious Public Relations (PR) firm to plant negative stories about Google and its privacy practices. This is a completely ridiculous thing to have done and wrong in so many ways that it is difficult to know where to begin. Here are some of the top reasons why it was a bad idea.
  • Firstly, the idea that Facebook should be accusing anyone of playing fast and loose with people's privacy is severely hypocritical. Just last year, Mark Zuckerberg told us that "the age of privacy is over". Now he is trying to say that Google is worse for privacy than Facebook! And by the way, this revelation comes at the same time as Symantec has discovered a serious and longstanding security hole in the Facebook App API that allows a user's private data to leak. The only cure is to change your Facebook password, so if you are a Facebook user, go and change your password now!
  • Secondly, we come to the oxymoronic idea of a secret PR campaign. Anyone who thinks that a PR campaign can be secret does not understand PR.
  • Thirdly, a competent, let alone "prestigious", PR firm should have understood that the ruse was bound to be discovered and that the fallout would be much worse publicity than anything negative it could promulgate. Anyone who claims to understand PR should have guided their client to do something less radical and refused to get involved in the campaign. As it is, the PR firm of Burson-Marsteller has lost a lot of credibility by being involved in the fiasco, and in PR credibility is everything.
  • Fourthly, the whole idea of a secret PR campaign against another company seems sophomoric, as if Facebook is run by a bunch of undergraduates who have little real world experience, and think that they will be able to get away with a jape like this. No wait …
  • Finally, if Facebook does want to launch a PR campaign on privacy, they should do so openly, by generating positive press that compares their supposedly good privacy policies with others' less good privacy policies and behavior. As Machiavelli said, "A prince also wins prestige for being a true friend or a true enemy, that is, for revealing himself without any reservation in favor of one side against another", and he goes on to explain why openness and taking sides lead to better outcomes than pretended neutrality. As Facebook ran their PR campaign in secret, we conclude that they could not have done it in public, and therefore their privacy practices are no better than those of Google or anyone else.
Note: I was going to call this post "Pot hires PR firm to secretly call kettle black" until I read this article from the Atlantic about Search Engine Optimization (SEO) and the fact that as search engines do not have a sense of humor, humorous headlines do not work in the online world.

Saturday, May 07, 2011

Living In the Stream

It used to be that "stream of consciousness" was a pejorative. It was a phrase you used to put down the type of person who talked endlessly, with little connection between what they said and what anyone else said, or even between what they had just said. Nowadays, the way we live our lives is in a stream of consciousness.

Text messages demand to be answered. If you do not answer a text within ten or fifteen minutes, the sender complains that you are ignoring them. Emails keep arriving, and a popup in the corner of the screen heralds their arrival. The popup contains an excerpt of the message designed to make you read the whole thing immediately, even though you know that it is junk or something that you should handle later. Instant message boxes pop up whenever you are online and cannot be ignored. Sometimes people call you on the phone, although good form these days is to IM someone first to see if you can call them. Finally there are the two great streams of consciousness that captivate our attention: Facebook and Twitter. Random stuff arrives in a random order, and as you have subscribed to the feeds you keep looking at them to see if anything interesting has happened. In practice it is most likely to be a video of a cute animal doing something stupid.

How people survive and get anything done with these constant streams of distraction is a mystery to me. I do software, and sometimes I need to concentrate on a problem for a good period of time without interruption. It is not that I am necessarily thinking hard all the time, just that it can take time to investigate a problem or think through all the ramifications of a solution and any distraction just breaks the groove, meaning I have to start over. When this happens endlessly in a day my rate of getting stuff done drops towards zero.

So how do we fight back against constant disruption? The answer is to take control and not let others dictate the agenda. Firstly, establish that there are periods when you are off-line. I do not take my phone to the bathroom, or when I work out, or when I go to bed. Also, I do not answer the phone when driving alone, and have my passenger answer when I am not alone. All our means of communication apart from voice have a buffer, so they do not need to be answered immediately; for voice there is a thing called voicemail. On the other hand, voicemail introduces us to the game of telephone tag, which is fun for those who like playing it and exceedingly annoying for the rest of us.

Secondly, you do need to "return your calls", as they used to say. Which brings us to the crux of the matter. If you want to be part of the conversation, you need to take part in it. Unfortunately, these days what you have to do is "return your calls", respond to your texts, answer your emails, react to IMs, post to Facebook and Twitter to show that you are a conscious sentient being, and finally do something to make a living. So it comes down to picking conversations, and thinking hard about which conversations you want to join. Do this right and we become Islands in the Stream, which is the most we can hope to be these days.

Sunday, May 01, 2011

Understanding MapReduce Performance: Part 2

Getting good performance out of MapReduce is a matter of understanding two concepts. I discussed the first one, that MapReduce is designed to run on large clusters, in a post last week. Here is the second concept, and it is something that everyone who uses MapReduce needs to grapple with. MapReduce works by breaking the processing task into a huge number of little pieces so that the work can be distributed over the cluster and done in parallel. Each Map task and each Reduce task is a separate task that can be scheduled to run in parallel with other tasks. For both Map and Reduce, the number of tasks needs to be much larger than the number of nodes in the cluster.

The archetypal example of MapReduce is counting word frequency in a large number of documents. A Map task reads a document and outputs a tuple for each word with the count of occurrences of the word in the document. A Reduce task takes a word and accumulates a total count for the word from the per-document counts produced by each Map task. In this example, there are a large number of documents as input to the Map tasks and presumably a large number of words, so there are a large number of Reduce tasks. Another illustration of this principle is found in the Sort Benchmark disclosure that I discussed in the previous post. For the Gray sort, the 100 TB of data is broken into 190,000 separate Maps and there are 10,000 Reduces for a cluster of 3400 nodes.
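To make the shape of the computation concrete, here is a minimal sketch of the word count in plain Python, not tied to Hadoop or any other MapReduce framework; the function names are my own. The point is simply that the Map side produces independent per-document counts and the Reduce side produces independent per-word totals, so there are many small pieces of work on both sides.

```python
from collections import defaultdict

# A minimal, framework-free sketch of the word-count example.
# Each call to map_doc is one Map task; each call to reduce_word is one Reduce task.

def map_doc(doc):
    """Map task: count word occurrences within a single document."""
    counts = defaultdict(int)
    for word in doc.split():
        counts[word] += 1
    return counts.items()           # (word, count) pairs for this one document

def shuffle(map_outputs):
    """Group per-document counts by word (the framework does this between Map and Reduce)."""
    grouped = defaultdict(list)
    for pairs in map_outputs:
        for word, count in pairs:
            grouped[word].append(count)
    return grouped

def reduce_word(word, counts):
    """Reduce task: total count for one word across all documents."""
    return word, sum(counts)

if __name__ == "__main__":
    docs = ["the cat sat on the mat", "the dog ate the bone", "a cat and a dog"]
    grouped = shuffle(map_doc(d) for d in docs)
    totals = dict(reduce_word(w, c) for w, c in grouped.items())
    print(totals["the"], totals["cat"])
```

In a real MapReduce job the shuffle is handled by the framework, and each call to map_doc and reduce_word would be a separately scheduled task spread across the cluster, which is why you want many of them.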

While most users of MapReduce get the idea that MapReduce needs its input data broken into lots of little pieces so that there are many Map tasks, they forget about the same requirement for Reduce tasks. Searching the internet, it is easy to find examples of MapReduce with a small number of Reduce tasks. One is a tutorial from the University of Wisconsin where there is ultimately only one Reduce task. It is particularly galling that this example comes from the University of Wisconsin, which has a large and prestigious program of parallel database system research. In their defense, the tutorial does show how to do intermediate reduction of the data, but that does not prevent it from being a bad example in general.

Sometimes the problem is too small. What do you do if the problem you are working on just involves the computation of a single result? The answer is to enlarge the problem. In a large cluster it is better to compute more results, even though they may not be of immediate use to you. Let's look at an example. Say you want to analyze a set of documents for the frequency of the word 'the'. The natural thing to do is process all the documents, filter for the word 'the' in the Map function and count the results in the Reduce function. This is how you are taught to use "valuable" computing resources. In practice, with MapReduce it is better to count the frequency of all the words in the documents and save the results, as sketched below. It is not a lot more effort for the MapReduce engine to count the frequency of every word, and if you then want to know how many uses there are of 'a' or any other word, the answer is there for you immediately.
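Here is a small, self-contained Python sketch of the two habits side by side; the names are illustrative and this is not framework code. The "where clause" style of Map gives you exactly one number, while the full word count, once saved, answers that question and every other word-frequency question with a simple lookup.

```python
from collections import Counter

# Two ways to find out how often 'the' occurs across a set of documents.
docs = ["the cat sat on the mat", "the dog ate the bone", "a cat and a dog"]

# 1. The "where clause" habit: filter in the Map step, get exactly one number out.
def map_filter_the(doc):
    return sum(1 for word in doc.split() if word == "the")

only_the = sum(map_filter_the(d) for d in docs)

# 2. The MapReduce-friendly habit: count everything once, save it, look up anything later.
all_counts = Counter()
for d in docs:
    all_counts.update(d.split())    # in a real job this table would be written out by the Reducers

print(only_the, all_counts["the"], all_counts["a"])   # the saved table answers 'a' for free
```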

A common analogy is MapReduce as a freight train as opposed to a relational database as a racing car. The freight train carries a huge load but is slow to start and stop. A racing car is very fast and nimble but carries only one person. Relational database systems rely on you to use the where clause to reduce the data that they have to analyze, and in return give you the answer in a short time. MapReduce does not give you an answer as quickly, but it is capable of effectively processing a lot more data. With MapReduce you should process all the data and save the results, then use them as you need them. We can sum up this way of thinking about MapReduce with the slogan "no where clauses".

Thursday, April 28, 2011

Understanding MapReduce Performance: Part 1

Currently MapReduce is riding high on the hype cycle. The other day I saw a presentation that was nothing but breathless exhortation for MapReduce as the next big thing, telling us we had better all jump on the bandwagon as soon as possible. However, there are rumblings of performance problems. At the recent Big Data Camp, Greenplum reported that their MapReduce was 100 times slower than their database system. Searching the web finds many people complaining about MapReduce performance, particularly with NoSQL systems like MongoDB. That is a problem because MapReduce is the data analysis tool for processing NoSQL data. For MongoDB, anything more than the most trivial reporting will require the use of MapReduce.

At the same time there is plenty of evidence that MapReduce is no performance slouch. The Sort Benchmark is a prime measure of computer system performance, and currently the Hadoop MapReduce system holds two of the six titles for which it is eligible. One title is the Gray test, for sorting 100 Terabytes (TB) of data in 173 minutes. The other title is the Minute test, for sorting 500 Gigabytes (GB) of data in under a minute. These results are as of May 2010, and the Sort Benchmark is run every year, so we can expect better performance in the future.

Understanding MapReduce performance is a matter of understanding two simple concepts. The first concept is that the design center for MapReduce systems like Hadoop is to run large jobs on a large distributed cluster. To get a feel for what this means, look at the Hadoop disclosure document for the Sort Benchmark. The run for sorting 100 TB was made on a cluster of about 3400 nodes. Each node had 8 cores, 4 disks, 16 GB of RAM and gigabit Ethernet. For the Minute sort, a smaller cluster of about 1400 nodes was used, with the same configuration except 8 GB of RAM on each node. That is not to say that MapReduce will only work on thousand-node systems. Most clusters are much smaller than this; however, Hadoop is specifically designed to scale to run on a huge cluster.

One problem with a large cluster is that nodes break down. Hadoop has several features that transparently work around the problem of broken nodes and continue processing in the presence of failure. From the Sort Benchmark disclosure, for the Gray sort run, every processing task is replicated. That is, for every processing task, two nodes are assigned to do it so that should a node break down, the sort can still continue with the data from the other node. This was not used for the Minute test because the likelihood of a node breaking down in the minute while the test is running is low enough to be ignored.

Another large-cluster feature that has an important effect on performance is that all intermediate results are written to disk. The output of every Mapper is written to disk, and the sorted data for the Reducers is written to disk. This is done so that if a node fails, only a small piece of work needs to be redone. By contrast, relational database systems go to great lengths to ensure that after data has been read from disk, it does not touch the disk again before being delivered to the user. If a node fails in a relational database system, the whole system goes into an error state and then does a recovery, which can take some time. That approach is extremely disruptive when a node fails, and much better for performance when there is no failure. Relational database systems were not designed to run on thousands of nodes, so they treat a node failure as a very rare event, whereas Hadoop is designed as if it were commonplace. The consequence is that Hadoop performance can look slow when compared to a relational database on a small cluster.
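To illustrate the trade-off, here is a minimal sketch, in plain Python rather than Hadoop itself, of why persisting intermediate results helps: each map task writes its output to a file, so when a later step hits a failed node it can simply retry from the files on disk instead of re-running all the map work. All the names here are my own invention and the failure is simulated.

```python
import json
import os
import tempfile

# Illustrative only: map outputs are persisted to disk so that a failure later
# in the pipeline only costs a retry, not a re-run of all the map tasks.

def run_map_task(task_id, words, out_dir):
    """One map task: count words and write the result to its own file."""
    counts = {}
    for word in words:
        counts[word] = counts.get(word, 0) + 1
    path = os.path.join(out_dir, "map-%d.json" % task_id)
    with open(path, "w") as f:
        json.dump(counts, f)
    return path

def run_reduce(map_files, max_attempts=3):
    """Reduce step that retries from the saved map output after a (simulated) node failure."""
    for attempt in range(max_attempts):
        try:
            totals = {}
            for path in map_files:
                if attempt == 0:                    # simulate a node failure on the first attempt
                    raise IOError("simulated node failure")
                with open(path) as f:
                    for word, n in json.load(f).items():
                        totals[word] = totals.get(word, 0) + n
            return totals
        except IOError:
            continue                                # the map files are still on disk, just try again
    raise RuntimeError("reduce failed after %d attempts" % max_attempts)

if __name__ == "__main__":
    out_dir = tempfile.mkdtemp()
    docs = [["the", "cat", "sat"], ["the", "dog", "ran"], ["a", "cat", "ran"]]
    files = [run_map_task(i, doc, out_dir) for i, doc in enumerate(docs)]
    print(run_reduce(files))
```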

Unfortunately, there is not a lot that a user can do about this, except look for a bigger cluster to run their analysis on, or look for bigger data to analyze. That is the subject for the second part of this post where I will talk about the other simple concept for understanding MapReduce performance.

Sunday, April 24, 2011

The Truth about Smartphone Location Tracking

There is a wave of outrage across the internet about the revelation that the iPhone keeps a file of tracking information recording all the places it has been. How dare Apple track users of their products! I am afraid that this is an extremely naive attitude. The fact is that everybody is tracking you, not only on the iPhone but on all smartphones, and on many less-than-smart phones as well. Let me count the ways, starting with the benign and moving to the egregious.

Firstly, the carriers and handset makers collect data from phones to help improve their service. Last week we had a joint meeting of the SDForum Business Intelligence and Mobile SIGs on "Mobile Analytics". At that meeting, Andrew Coward of CarrierIQ described how they embed their software in phones, usually at the carrier's direction, to collect information that can be used to improve service. He told us, for example, that it is quite normal for them to report to a carrier that its dropped call rate is 6%, while the carrier's own engineers are telling management that the dropped call rate is 1%. They collect data on location so that the carrier knows where its users are using their phones and can improve service in those areas.

In Europe, data retention laws require phone carriers to keep the Call Data Record (CDR) for every call for a period of one or two years. The police can and do request information on all the calls made to or from a number to help with their enquiries into crime. While a CDR does not usually contain specific location information, it can identify the cell tower and thus the approximate location of the caller. Police have successfully used location-based CDR data to help with their investigations for well over a decade.

With the user's permission, Google collects information from Android phones about their location. Google is the ultimate data collection company, and I am always amazed at the creative ways they find to use that data. One Google service is the Traffic overlay on their Maps. This is derived from observing the changes in location of Android phones. While Google says that they do not collect personally identifying information, they do need to distinguish between phones to make this application work, so they are tracking the movements of individuals, if only to provide the rest of us with generic information on traffic flows. Google has plenty of other uses for this data. For example, they keep a database of the location of every Wi-Fi hotspot so that they can identify your location based on the Wi-Fi hotspot you are using. Google can use data from Android phones to validate and update that database.

Mobile analytics and Apps are where the use of location-based information starts to get interesting. Last year Flurry presented to the Business Intelligence SIG and we heard about their run-in with Steve Jobs. You can read their press release to get the full story of what they did. In short, Flurry has a free toolkit that developers install into their mobile Apps; it collects information and sends the data back to Flurry. The developer can then access analytics reports about their App at the Flurry web site. However, Flurry retains the data that has been collected from the App, including location-based data.

In January 2010, a couple of days before the iPad was announced, Flurry issued a press release saying that they had seen a new Apple device that was only being used in the Apple headquarters in Cupertino, and gave some statistics on the number of different Apps being tested on this device. At this, Steve Jobs blew his top and tried to get Flurry completely banned from iPhone Apps. Eventually Flurry and Apple settled their differences. The conclusion was, in the words of the iPhone developer agreement, that "The use of third party software in Your Application to collect and send Device Data to a third party for processing or analysis is expressly prohibited."

So let's parse this. Flurry is a company that has no direct relationship with the carriers, handset makers or the users of Apps, yet it is collecting data from all the Apps that it is included in. The data is available for use by the App developer and by Flurry. At the time of the iPad release they could identify that the device was different from all other devices and identify its location to within one set of buildings. Now, I am not trying to pick on Flurry specifically; there are several companies in this area. At the Business Intelligence SIG last week we heard from Apsalar, a recent start-up in the same space; however, Flurry is the largest company providing mobile analytics. Flurry estimates that they are included in up to 1 in 5 mobile Apps for the iPhone and Android. Because they are in so many Apps, they can provide aggregate data on all App usage.

The point of this is that we want location-aware Apps, but we also want to preserve our privacy. As Apps stand today, these two goals are incompatible. To be location aware, the App has to know your location, and if the App knows your location, it can transmit that information back to the App developer or to an analytics aggregator working for the App developer. Thus they know where you are whether you want them to or not. Android has a permission profile, set when the App is installed, that determines which information the App can access. If the App is allowed to access location information on installation, it can continue to do so until it is uninstalled.

Compared to what Apps know about what you are doing while you use the App, the location database that the iPhone is collecting seems to be a small matter. In fact it seems to be a good reason to limit the number of Apps that you can be running at any one time. At least if only one App is running then only one App knows where you are at any particular time.

Tuesday, April 12, 2011

The Business of Open Source Suites

I have often wondered how a commercial company builds an Open Source suite out of a collection of open source projects. At the last BI SIG meeting, Ian Fyfe, Chief Technology Evangelist at Pentaho, told us how they do it and gave some interesting insights into how Open Source really works. Pentaho offers an Open Source Business Intelligence suite that includes the Kettle data integration project, the Mondrian OLAP project and the Weka data mining project, amongst other projects.

As Ian explained, Pentaho controls these Open Source projects because it employs the project leader and major contributors to each of them. In some cases Pentaho also owns the copyright to the code. In other cases, ownership is in doubt, because there have been too many contributors, and what they have contributed has not been managed well enough to be able to say who owns the code. Mondrian is an example of an Open Source project where there have been enough contributors that it is not possible to take control of the whole source code and exert any real rights over it.

The real control that Pentaho exerts over the Open Source components of its suites is that it gets to say what their roadmap is and how they will evolve in the future. As I noted, Pentaho is driving the various projects to a common metadata layer so that they can become integrated as a single suite of products.

Saturday, April 09, 2011

The Fable of the Good King and the Bad King

A long time ago there were two countries. Each country had a King. One King was a good King and the other King was a bad King, as we will find out. Now, as you all know, a King's main job is to go out and make war on his enemies. It is the reason that Kings exist. If a King is not out making war against his enemies, he will go out hunting and make war on the animals of the forest. A good war will enlarge the kingdom, enhance the King's fame and give him more subjects to rule over. But before a King can make war, he should make sure that his subjects are provided for. For while the subjects of a King owe everything that they have to their King, the King is also responsible for the welfare and well-being of his subjects.

There are many parts to taking care of subjects, such as making good laws and passing down sound judgements, but the most important one is making sure that the granaries are filled in times of plenty. For as surely as fat times follow lean times, lean times follow fat times. In times of plenty, the excess harvest should be saved so that in times of need the subjects do not starve. Subjects who are starving are weak and cannot praise their King nor defend his kingdom.

Now in our two countries these were years of plenty, and the Kings knew that they would go to war. The good King also knew that it was his duty to make sure the granaries were filled, and so he did. However, the bad King wanted to win the battle so badly that he sold off all the grain in his granaries to buy expensive war machines. A little incident happened; it was blown up into a huge crisis, and the two countries went to war. Each King assembled his army and led it to the battleground at the border of their countries, as had happened so many times before. The armies were evenly matched and they fought all day. At the end of the day the army of the bad King held its ground and he was declared the victor. The expensive war machines had helped, but less than hoped for. However, both armies were so weakened and exhausted by the fight that they turned around and went home, as they had so many times before.

The years after this battle were years of want. The harvest had failed and both kingdoms suffered. However, the kingdom of the bad King suffered much more than the kingdom of the good King, for there was no grain in its granaries. When another little incident happened and blew up into a huge crisis, both Kings assembled their armies and marched to the battleground on the border. This time the good King won the battle because his men were stronger.

The good King advanced his army into the country of the bad King. He might not be able to take the whole country, but the good King had to let his men do a little rape and pillage as a reward for winning the battle. The bad King, realizing his precarious position, came out to parley with the good King. The bad King had nothing to offer the good King but some used war machines and the hand of his daughter in marriage. The good King accepted that the daughter of the bad King should marry his son, and that when the two Kings had passed on to the greater battleground in the sky, the son of the good King would rule both countries. Thus the two kingdoms would become one united country, a country that would be large and strong enough to make war on the countries on the far side of the mountains.

The moral of this story is that in times of plenty, make sure that the granaries are filled, for as surely as fat times follow lean times, lean times follow fat times, and the best protection against lean times is full granaries. On this matter, a King must beware of false counsel. When times are good, the false counsellor will say "What could possibly go wrong? The times are fat and everyone is happy. Make the populace happier by selling off the grain in the granary and rewarding the citizens each according to what they have contributed." Even worse, when times are lean the false counsellor will say "Times are awful and getting worse; we must take the grain out of the people's mouths and put it in the granaries, for the harvest next year could be even worse than this year." The point of a granary or any store of wealth is to save the excess of the fat years so that it can be used during the lean years.

Wednesday, March 30, 2011

Cloud Security

Security is not only the number one concern for adopting cloud computing, it is also a serious barrier to the adoption of cloud computing. Also, security considerations are causing the Virtual Machine (VM) operating system to evolve. All this came out at the SDForum Cloud SIG night on Cloud Security (the presentations are on the SIG page). There were three speakers and a lot was said. I am just going to highlight a few things that struck me as important.

Firstly, Dr Chenxi Wang from Forrester Research spoke on cloud security issues and trends. She highlighted the issue of compliance with various regulations and how it clashes with what the cloud providers have to offer. One concern is where data is stored, as countries have different regulations for data privacy and record keeping on individuals. If data from one country happened to be stored in another country, that could create a problem with complex legal ramifications that would be expensive to resolve. On the other side of the equation are the cloud vendors, who want to provide a generic service with as few constraints as possible. Having to give a guarantee about where data is stored would make their service offering more complicated and expensive to provide.

Another more specific example of the clash between compliance and what cloud vendors provide is the PCI security standard in the credit card industry. One PCI requirement is that all computer systems used for PCI applications are scanned for vulnerabilities at least every three months. Most cloud vendors are unwilling to have their systems scanned for vulnerabilities for a variety of reasons, one of which I will discuss shortly. The solution may be specialized cloud services aimed at specific industries. IBM is experimenting with a cloud service that they claim is PCI compliant. These specific services will be more expensive, and we will have to wait and see whether they succeed.

Chris Richter from Savvis, a cloud provider, spoke next. He mentioned standards as a way to resolve the issues described above. The International Standards Organization is creating the ISO 27000 suite of standards for information security. So far ISO 27001 "Information security management systems — Requirements" and ISO 27002 "Code of practice for information security management" are the most mature and relevant standards. As with other ISO standards like the ISO 9000 quality standard, there is a certification process which will allow cloud providers to make standards-based security claims about the service that they provide.

Finally, Dave Asprey from Trend Micro discussed the evolving nature of the VM technology that underlies cloud computing offerings. The original VMware vision was that a virtual machine would be used to develop software for a real physical machine, so they spent a lot of time and effort faithfully replicating every aspect of a physical machine in their virtual machine. Now the use case has shifted to making more efficient use of resources. However, a problem is that routine operations can bring a set of virtual machines to a standstill if they all decide to do the same operation at the same time.

Again, vulnerability scanning shows the problem. If the company default is that the anti-virus scan is scheduled for lunchtime Wednesday, then the whole virtual machine infrastructure can be brought to its knees when everyone's VM starts its scan at the same time. Furthermore, because many of the files being scanned may be shared by all the virtual machines, having each VM scan them is a huge waste of resources. Anti-virus software companies are working with the VM software vendors to provide a vulnerability scan that is VM-aware and uses new VM APIs to perform its function in an efficient and non-disruptive way. While this is necessary, it seems to run counter to the original notion that each VM is an entirely separate entity, completely unaware that other VMs exist.
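As a simple illustration of the scheduling half of the problem (not the VM-aware scanning APIs the vendors are building), here is a minimal Python sketch of one common mitigation: derive a per-VM offset from a stable identifier so that scheduled scans spread out over a window instead of all starting at lunchtime Wednesday. The VM names and the two-hour window are my own assumptions.

```python
import datetime
import hashlib

# Illustrative sketch: spread a company-wide scan schedule across a window so
# that every VM does not start scanning at the same moment. Each VM gets a
# deterministic offset derived from its identifier.

def scan_start_time(vm_id, base_time, window_minutes=120):
    """Return a per-VM scan start time within window_minutes of base_time."""
    digest = hashlib.sha1(vm_id.encode("utf-8")).hexdigest()
    offset_minutes = int(digest, 16) % window_minutes
    return base_time + datetime.timedelta(minutes=offset_minutes)

if __name__ == "__main__":
    wednesday_noon = datetime.datetime(2011, 3, 30, 12, 0)   # the company default
    for vm in ["web-01", "web-02", "db-01"]:
        print(vm, scan_start_time(vm, wednesday_noon))
```

This spreads out the start times but does nothing about the duplicated scanning of shared files, which is exactly why the VM-aware approach described in the talk is needed.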

Sunday, March 13, 2011

Database System Startups Capitulate

In the last decade, there have been many database system startups, most of them aimed at the analytics market. In the last year, several of the most prominent ones have sold out to large companies. Here are my notes on what has happened.

Netezza to IBM
Netezza is a database appliance that uses hardware assistance to do search. Recently it has been quite successful, with revenues getting into the $200M range. Netezza was founded in 2000 and sold out to IBM for $1.7B. The deal closed in November 2010. The Netezza hardware assistance is a gizmo near the disk head that decides which data to read. Many people, myself included, think that special purpose hardware in this application is of marginal value at best. You can get better price performance and much more flexibility with commodity hardware and clever software. IBM seems to be keeping Netezza at arm's length as a separate company and brand, which is unusual as IBM normally integrates the companies it buys into its existing product lines.

Greenplum to EMC
Greenplum is a massive multi-processor database system. For example, Brian Dolan told the BI SIG last year how Fox Interactive Media (MySpace) used a 40-host Greenplum database system to do their data analytics. The company was founded in 2003. The sale to EMC closed in July 2010. The price is rumoured to be somewhere at the top of the $300M to $400M range. EMC is a storage system vendor that has been growing very fast, partly by acquiring successful companies. EMC owns VMware (virtualization), RSA (security) and many other businesses. The Greenplum acquisition adds big data to big storage.

Vertica to HP
Vertica is a columnar database system for analytics. The privately held company started in 2005 with respected database guru Mike Stonebraker as a founder. The sale was announced in February 2011. The sale price has not been announced. I have heard a rumour of $180M, which seems low, although the company received only $30M in VC funding. Initially Vertica seemed to be doing well; however, in the last year it seems to have lost momentum.

The other interesting part of this equation is HP, which used to be a big partner of Oracle for database software. When Oracle bought HP hardware rival Sun Microsystems in 2009, HP was left in a dangerous position as it did not have a database system to call its own. I was surprised that nobody commented on this at the time. In the analytics area, HP tried to fill the gap with the NeoView database system, which proved to be such a disaster that they recently cancelled it and bought Vertica instead. NeoView was based on the Tandem transaction processing database system. Firstly, it is difficult to get a database system that is optimized for doing large numbers of small transactions to do large analytic queries well, and the Tandem system is highly optimized for transaction processing. Secondly, the Tandem database system only ran on the most expensive hardware that HP had to offer, so it was very expensive to implement.

Aster Data Systems to Teradata
Aster Data is a massive multi-processor database system, which in theory is a little more flexible about using a cluster of hosts than Greenplum. The company was founded in 2006 and sold out to Teradata for about $300M in March 2011. Teradata, founded in 1979 and acquired by NCR in 1991, was spun out of NCR in 2007 and has since been successfully growing in the data warehouse space. It is not clear how Aster Data and Teradata will integrate their product lines. One thing is that Aster Data gives Teradata a scalable offering in the cloud computing space. Teradata has been angling to get into this space for some time, as we heard last summer when Daniel Graham spoke to the BI SIG.

Recently there have been a lot of database system startups, and several of them are still independent. On the other side, there are not a lot of companies that might want to buy a database system vendor. Furthermore, there is a strong movement to NoSQL databases, which are easier to develop and where there are several strong contenders. The buyout prices are good, but apart from Netezza the prices are no blowout. The VCs behind these sales probably decided that they did not want to be left standing when the music stops, and so sold out for a good but not great profit.