Tuesday, April 12, 2011

The Business of Open Source Suites

I have often wondered how a commercial company builds an Open Source suite out of a collection of open source projects. At the last BI SIG meeting, Ian Fyfe, Chief Technology Evangelist at Pentaho, told us how they do it and gave some interesting insights on how Open Source really works. Pentaho offers an Open Source Business Intelligence suite that includes the Kettle data integration project, the Mondrian OLAP project and the Weka data mining project, amongst other projects.

As Ian explained, Pentaho controls these Open Source projects because it employs the project leader and major contributors to each of the projects. In some cases Pentaho also owns the copyright of the code. In other cases, ownership is in doubt because there have been too many contributors, or their contributions have not been tracked carefully enough to say who owns the code. Mondrian is an example of an Open Source project where there have been enough contributors that it is not possible to take control of the whole source code and exert any real rights over it.

The real control that Pentaho exerts over the Open Source components of its suites is that it gets to set their roadmaps and decide how they will evolve. As I noted, Pentaho is driving the various projects to a common metadata layer so that they can become integrated as a single suite of products.

Saturday, April 09, 2011

The Fable of the Good King and the Bad King

A long time ago there were two countries. Each country had a King. One King was a good King and the other King was a bad King, as we will find out. Now, as you all know, a King's main job is to go out and make war on his enemies. It is the reason that Kings exist. If a King is not out making war against his enemies, he will go out hunting and make war on the animals of the forest. A good war will enlarge the kingdom, enhance the King's fame and give him more subjects to rule over. But before a King can make war, he should make sure that his subjects are provided for. For while the subjects of a King owe everything that they have to their King, the King is also responsible for the welfare and well-being of his subjects.

There are many parts to taking care of subjects, such as making good laws and passing down sound judgements, but the most important is making sure that the granaries are filled in times of plenty. For as surely as fat times follow lean times, lean times follow fat times. In times of plenty, the excess harvest should be saved so that in times of need the subjects do not starve. Subjects who are starving are weak and can neither praise their King nor defend his kingdom.

Now in our two countries, these were years of plenty, and the Kings knew that they would go to war. The good King also knew that it was his duty to make sure the granaries were filled, and so he did. However, the bad King wanted to win the battle so badly that he sold off all the grain in his granaries to buy expensive war machines. A little incident happened; it was blown up into a huge crisis, and the two countries went to war. Each King assembled his army and led it to the battleground at the border of their countries, as had happened so many times before. The armies were evenly matched and they fought all day. At the end of the day the army of the bad King held its ground and he was declared the victor. The expensive war machines had helped, but less than hoped for. However, both armies were so weakened and exhausted by the fight that they turned around and went home, as they had so many times before.

The years after this battle were years of want. The harvest had failed and both kingdoms suffered. However, the kingdom of the bad King suffered much more than the kingdom of the good King, for there was no grain in its granaries. When the little incident happened that blew up into a huge crisis, both Kings assembled their armies and marched to the battleground on the border. This time the good King won the battle because his men were stronger.

The good King advanced his army into the country of the bad King. He might not be able to take the whole country, but the good King had to let his men do a little rape and pillage as a reward for winning the battle. The bad King, realizing his precarious position, came out to parley with the good King. The bad King had nothing to offer the good King but some used war machines and the hand of his daughter in marriage. The good King accepted that the daughter of the bad King should marry his son, and that when the two Kings had passed on to the greater battleground in the sky, the son of the good King would rule both countries. Thus the two kingdoms would become one united country, a country that would be large and strong enough to make war on the countries on the far side of the mountains.

The moral of this story is that in times of plenty, make sure that the granaries are filled, for as surely as fat times follow lean times, lean times follow fat times, and the best protection against lean times is full granaries. On this matter, a King must beware of false counsel. When times are good, the false counsellor will say "What could possibly go wrong? The times are fat and everyone is happy. Make the populace happier by selling off the grain in the granary and rewarding the citizens each according to what they have contributed." Even worse, when times are lean the false counsellor will say "Times are awful and getting worse; we must take the grain out of the people's mouths and put it in the granaries, for the harvest next year could be even worse than this year." The point of a granary or any store of wealth is to save the excess during the fat years so that it can be used during the lean years.

Wednesday, March 30, 2011

Cloud Security

Security is not only the number one concern about adopting cloud computing, it is also a serious barrier to its adoption. Also, security considerations are causing the Virtual Machine (VM) operating system to evolve. All this came out at the SDForum Cloud SIG night on Cloud Security (the presentations are on the SIG page). There were three speakers and a lot was said. I am just going to highlight a few things that struck me as important.

Firstly, Dr Chenxi Wang from Forrester Research spoke on cloud security issues and trends. She highlighted the issue of compliance with various regulations and how it clashes with what the cloud providers have to offer. One concern is where data is stored, as countries have different regulations for data privacy and record keeping on individuals. If data from one country happened to be stored in another country, that could create a problem with complex legal ramifications that would be expensive to resolve. On the other side of the equation are the cloud system vendors, who want to provide a generic service with as few constraints as possible. Having to give a guarantee about where data is stored would make their service offering more complicated and expensive to provide.

Another, more specific, example of the clash between compliance and what cloud vendors provide is with the PCI security standard in the credit card industry. One PCI requirement is that all computer systems used for PCI applications are scanned for vulnerabilities at least every three months. Most cloud vendors are unwilling to have their systems scanned for vulnerabilities, for a variety of reasons, one of which I will discuss shortly. The solution may be specialized cloud services that are aimed at specific industries. IBM is experimenting with a cloud service that they claim is PCI compliant. These specific services will be more expensive, and we will have to wait and see whether they succeed.

Chris Richter from Savvis, a cloud provider, spoke next. He mentioned standards as a way to resolve the issues described above. The International Standards Organization is creating the ISO 27000 suite of standards for information security. So far ISO 27001 "Information security management systems — Requirements" and ISO 27002 "Code of practice for information security management" are the most mature and relevant standards. As with other ISO standards like the ISO 9000 quality standard, there is a certification process that will allow cloud providers to make standards-based security claims about the service that they provide.

Finally, Dave Asprey from Trend Micro discussed the evolving nature of the VM technology that underlies cloud computing offerings. The original VMware vision was that a virtual machine would be used to develop software for a real physical machine, so they spent a lot of time and effort on faithfully replicating every aspect of a physical machine in their virtual machine. Now the use case has shifted to making more efficient use of resources. However, a problem is that a set of virtual machines can be brought to a standstill if they all decide to perform the same common operation at the same time.

Again, vulnerability scanning shows the problem. If the company default is that the anti-virus scan is scheduled for lunchtime Wednesday, then the whole virtual machine infrastructure can be brought to its knees when everyone's VM starts its scan at the same time. Furthermore, because many of the files being scanned may be shared by all the virtual machines, having each VM scan them is a huge waste of resources. Anti-virus software companies are working with the VM software vendors to provide a vulnerability scan that is VM aware and that uses new VM APIs to perform its function in an efficient and non-disruptive way. While this is necessary, it seems to run counter to the original notion that each VM is an entirely separate entity that is completely unaware that other VMs exist.
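
To make the idea concrete, here is a minimal sketch, in Python, of the kind of jittered scheduling that avoids the lunchtime-Wednesday stampede. This is my own illustration rather than anything the vendors ship; the VM names and scan window are invented.

    import hashlib
    from datetime import datetime, timedelta

    def jittered_scan_time(vm_id, window_start, window_hours=24):
        """Spread scan start times deterministically across a window.

        Each VM hashes its own identifier into an offset, so the fleet
        fans out over the window instead of all starting at once."""
        digest = hashlib.sha256(vm_id.encode()).digest()
        # Map 4 bytes of the hash onto the length of the window in seconds.
        offset = int.from_bytes(digest[:4], "big") % (window_hours * 3600)
        return window_start + timedelta(seconds=offset)

    # Three hypothetical VMs, all nominally scheduled for the same window.
    window = datetime(2011, 4, 13, 12, 0)  # the old "lunchtime Wednesday" default
    for vm in ("vm-web-01", "vm-web-02", "vm-db-01"):
        print(vm, jittered_scan_time(vm, window))

The same trick applies to any common operation that a whole fleet of VMs would otherwise perform in lockstep.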

Sunday, March 13, 2011

Database System Startups Capitulate

In the last decade, there have been many database system startups, most of them aimed at the analytics market. In the last year, several of the most prominent ones have sold out to large companies. Here are my notes on what has happened.

Netezza to IBM
Netezza is a database appliance that uses hardware assistance to do search. Recently it has been quite successful, with revenues getting into the $200M range. Netezza was founded in 2000 and sold out to IBM for $1.7B. The deal closed in November 2010. The Netezza hardware assistance is a gizmo near the disk head that decides which data to read. Many people, myself included, think that special-purpose hardware in this application is of marginal value at best. You can get better price performance and much more flexibility with commodity hardware and clever software. IBM seems to be keeping Netezza at arm's length as a separate company and brand, which is unusual, as IBM normally integrates the companies it buys into its existing product lines.

Greenplum to EMC
Greenplum is a massive multi-processor database system. For example, Brian Dolan told the BI SIG last year how Fox Interactive Media (MySpace) used a 40-host Greenplum database system to do their data analytics. The company was founded in 2003. The sale to EMC closed in July 2010. The price is rumoured to be somewhere at the top of the $300M to $400M range. EMC is a storage system vendor that has been growing very fast, partly by acquiring successful companies. EMC owns VMware (virtualization), RSA (security) and many other businesses. The Greenplum acquisition adds big data to big storage.

Vertica to HP
Vertica is a columnar database system for analytics. The privately held company started in 2005 with respected database guru Michael Stonebraker as a founder. The sale was announced in February 2011. The sale price has not been announced. I have heard a rumour of $180M, which seems low, although the company received only $30M in VC funding. Initially Vertica seemed to be doing well; however, in the last year it seems to have lost momentum.

The other interesting part of this equation is HP, which used to be a big partner of Oracle for database software. When Oracle bought HP hardware rival Sun Microsystems in 2009, HP was left in a dangerous position, as they did not have a database system to call their own. I was surprised that nobody commented on this at the time. In the analytics area, HP tried to fill in with the NeoView database system, which proved to be such a disaster that they recently cancelled it and bought Vertica instead. NeoView was based on the Tandem transaction processing database system. Firstly, it is difficult to get a database system that is optimized for doing large numbers of small transactions to do large analytic queries well, and the Tandem system is highly optimized for transaction processing. Secondly, the Tandem database system only ran on the most expensive hardware that HP had to offer, so it was very expensive to implement.

Aster Data Systems to Teradata
Aster Data is a massive multi-processor database system, which in theory is a little more flexible about using a cluster of hosts than Greenplum. The company was founded in 2006 and sold out to Teradata for about $300M in March 2011. Teradata, founded in 1979 and acquired by NCR in 1991, was spun out of NCR in 2007 and has since been successfully growing in the data warehouse space. It is not clear how Aster Data and Teradata will integrate their product lines. One thing is that Aster Data gives Teradata a scalable offering in the cloud computing space. Teradata has been angling to get into this space for some time, as we heard last summer when Daniel Graham spoke to the BI SIG.

Recently there have been a lot of database system startups, and several of them are still independent. On the other side, there are not a lot of companies that might want to buy a database system vendor. Furthermore, there is a strong movement to NoSQL databases, which are easier to develop and where there are several strong contenders. The buyout prices are good, but apart from Netezza the prices are no blowout. The VCs behind these sales probably decided that they did not want to be left standing when the music stopped, and so sold out for a good but not great profit.

Saturday, February 26, 2011

The App Store Margin

Recently there has been a lot of discussion about the Apple announcement that they are taking a 30% margin for selling subscriptions through their App Store, and that Apple will also take a 30% margin for Apps that sell virtual products and subscriptions through the App. Unfortunately, most of the discussion has been heat without light. That is, there have been no facts to back up the arguments on either side. I had been curious about the margin in selling goods anyway, so, as I had the data, I computed the gross margin for publicly traded US companies in the various retail categories.

As you can see, the margin varies between 20% and 40%. The overall average is about 25%, dominated by the Grocery and Department & Discount categories. Retailers working in the real world, after paying for their goods, also have to pay for their properties, staff and marketing, so their net margin is considerably less. On the other hand, Apple is just processing payments and delivering virtual goods over the internet. On this basis, a 30% margin seems to be on the high side, although not completely out of line.

Galen Gruman at Infoworld points out that a higher margin tends to favor small app and content providers because they would have high distribution costs anyway. On the other hand, a large content provider resents having to hand over 30% of their revenue to Apple for not doing a lot of work. For this reason, I expect that large content providers will campaign for a bulk discount on the cost of distributing their content. Thus a good, and hopefully likely, outcome is a sliding scale: for example, a 30% margin on the first $20,000 per month, 20% on the next $20,000, 10% on the next $20,000 and so on (I have no insight into the business, so these numbers are invented as an illustration rather than a suggestion as to what the numbers should be).
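
To show how such a sliding scale would play out, here is a small Python sketch that computes the blended fee under the invented tiers above; the 5% residual rate beyond the listed tiers is my own extrapolation of the "and so on".

    def blended_fee(monthly_revenue):
        """Store's cut under the hypothetical sliding scale: 30% on the
        first $20,000, 20% on the next $20,000, 10% on the next $20,000,
        and (an assumed) 5% on everything beyond that."""
        tiers = [(20000, 0.30), (20000, 0.20), (20000, 0.10)]
        remaining, fee = monthly_revenue, 0.0
        for width, rate in tiers:
            slab = min(remaining, width)
            fee += slab * rate
            remaining -= slab
            if remaining <= 0:
                return fee
        return fee + remaining * 0.05  # residual rate is an assumption

    # A small provider pays close to the headline 30%; a large one much less.
    for revenue in (10000, 60000, 1000000):
        fee = blended_fee(revenue)
        print(f"${revenue:>9,}: fee ${fee:>9,.0f} ({fee / revenue:.1%})")

Under this schedule a provider doing $10,000 a month still pays the headline 30%, while one doing $1M a month pays under 6%, which is roughly the shape of bulk discount the large providers would be campaigning for.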

Part of the resentment with Apple is that they have a captive market and their behavior in stating terms appears dictatorial. They would have done much better to follow the standard politically correct procedure: that is, to put out a discussion document and then, after some to and fro, impose their terms as they always intended. It has the same end result while creating goodwill through a patina of choice and consultation.

Saturday, February 19, 2011

Agile BI at Pentaho

Ian Fyfe, Chief Technology Evangelist at Pentaho, showed us what the Pentaho Open Source Business Intelligence Suite is and where they are going with it when he spoke to the February meeting of the SDForum Business Intelligence SIG on "Agile BI". Here are my notes on Pentaho from the meeting.

Ian started off with positioning. Pentaho is an open source Business Intelligence suite with a full set of data integration, reporting, analysis and data mining tools. Their perspective is that 80% of the work in a BI project is acquiring the data and getting it into a suitable form, and the other 20% is reporting and analysis of the data. Thus the centerpiece of their suite is the Kettle data integration tool. They have the strong Mondrian OLAP analysis tool and the Weka data mining tools. Their reporting tool is perhaps not quite as strong as those of other Open Source BI suites that started from a reporting tool. All the code is written in Java. It is fully embeddable in other applications and can be branded for that application.

Ian showed us a simple example of loading data from a spreadsheet, building a data model from the data and then generating reports from the data. All of these things could be done from within the data integration tool, although they can also be done with stand-alone tools. Pentaho is working in the direction of a fully integrated set of tools with common metadata between them all. Currently some of the tools are thick clients and some are web based. They are moving to have all their client tools be web based.
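
For readers who were not at the demo, here is a rough sketch of the same load-model-report loop in plain Python. This is not Pentaho's API; the data and column names are invented for illustration.

    import csv
    import io
    from collections import defaultdict

    # Load: rows as they might arrive from a spreadsheet export (invented sample).
    raw = io.StringIO(
        "region,product,amount\n"
        "West,widgets,1200.50\n"
        "West,gadgets,800.00\n"
        "East,widgets,950.25\n"
    )
    rows = list(csv.DictReader(raw))

    # Model: roll the flat rows up into a small dimensional summary.
    totals = defaultdict(float)
    for row in rows:
        totals[(row["region"], row["product"])] += float(row["amount"])

    # Report: print a plain-text report from the model.
    print(f"{'Region':<8}{'Product':<10}{'Total':>10}")
    for (region, product), amount in sorted(totals.items()):
        print(f"{region:<8}{product:<10}{amount:>10,.2f}")

Pentaho's point is that the three steps share one metadata layer, so the model built in the middle step is immediately visible to the reporting step.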

We had come to hear a presentation on agile BI, and Ian gave us the Pentaho view. In an enterprise, the task of generating useful business intelligence is usually done by the IT department in consultation with the end users who want the product. The IT people are involved because they supposedly know the data sources and they own the expensive BI tools. Also, the tools are complicated, and using them is usually too difficult for the end user. However, IT works to its own schedule, through its own processes, and takes its time to produce the product. Often, by the time IT has produced a report, the need for it has moved on.

Pentaho provides a tightly integrated set of tools with a common metadata layer, so there is no need to export the metadata from one tool and import it into the next one. The idea is that the end-to-end task of generating business intelligence from source data can be done within a single tool or with a tightly integrated suite of tools. This simplifies and speeds up the process of building BI products to the point that they can be delivered while they are still useful. In some cases, the task is simplified to such an extent that it may be done by a power user rather than being thrown over the wall to IT.

The audience was somewhat sceptical of the idea that a sprinkling of common metadata can make for agile BI. All the current BI suites, commercial and open source, have been pulled together from a set of disparate products, and they all have rough edges in the way the components work together. I can see that deep and seamless integration between the tools in a suite will make the work of producing Business Intelligence faster and easier. Whether it will be fast enough to call agile, we will have to learn from experience.

Sunday, February 06, 2011

Revolution: The First 2000 Years of Computing

For years, the Computer History Museum (CHM) has had an open storage area where they put their collection of old computers, but without any interpretation except for docent-led tours. I had no problem wandering through this treasure trove because I knew a lot about what they had on show, from slide rules and abacuses to the Control Data 6600 and the Cray machines. Even then, a docent could help by pointing out features that I would otherwise miss, such as the ash tray on each workstation of the SAGE early warning computer system.

Now the CHM has opened its "Revolution: The First 2000 Years of Computing" exhibition, and I recommend a visit. They still have all the interesting computer hardware that they had in the visible storage area; however, it is placed in a larger space and there are all kinds of interpretive help, from explanations of the exhibits to video clips that you can browse. On my visit, I saw a lot of new things and learned much.

For example, Napier's Bones are an old-time calculation aid that turns long multiplication into addition. The Napier's Bones exhibit explains how they work and allows you to do calculations using a set. The exhibit on computers and rocketry has the guidance computer for a large missile arrayed in a circle around the inside of the missile skin, leaving an ominously empty space in the middle for the payload. In the semiconductor area they have examples of silicon wafers that range from the size of a small coin in the early days to a current wafer the size of a large dinner plate. There is also an interesting video discussion of the marketing of early microprocessors like the 8086, the Z8000 and the M68000, and the absolute importance of landing the design win for the IBM PC that led to the current era where Intel is the biggest and most profitable chip maker. These are just a sample of the many fascinating exhibits there.

I spent over 2 hours in the exhibition and only managed to get through half of it. I am a long-time member of the museum and can go back any time, so this is a warning to non-members to allow enough time for their visit.

Saturday, February 05, 2011

Greenplum at Big Data Camp

I was at the Big Data Camp at the Strata Big Data conference the other day, and one of the breakout sessions was with Greenplum. They had several interesting things to say about map-reduce, performance, and performance in the cloud. Greenplum is a parallel database system that runs on a distributed set of servers. To the user, Greenplum looks like a conventional database server, except that it should be faster and able to handle large data because it farms out the data and the workload over all the hosts in the system. Greenplum also has a map-reduce engine in the server and a distributed Hadoop file system. Thus the user can use Greenplum both as a big data relational database and as a big data NoSQL database.

Map-reduce is good for taking semi-structured data and reducing it to more structured data. The example of map-reduce that I gave some time ago does exactly that. Thus a good use of map-reduce is to do the Transform part of ETL (Extract-Transform-Load), which is the way data gets into a data warehouse. The Greenplum people confirmed that this is a common use pattern for map-reduce in their system.
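
As a minimal illustration of map-reduce doing the Transform step, here is a toy Python sketch that reduces semi-structured log lines to structured per-page counts. The log format is invented, and a real Greenplum or Hadoop job would run the same two phases distributed across hosts.

    from itertools import groupby
    from operator import itemgetter

    # Semi-structured input: raw web log lines (invented sample data).
    log_lines = [
        "2011-02-05 10:01 GET /products/42 user=alice",
        "2011-02-05 10:02 GET /products/42 user=bob",
        "2011-02-05 10:03 GET /products/7 user=alice",
    ]

    def map_phase(line):
        """Map: parse one raw line into a (page, 1) pair."""
        yield line.split()[3], 1  # key on the requested page

    def reduce_phase(pairs):
        """Reduce: sum the counts for each page, yielding structured rows."""
        for page, group in groupby(sorted(pairs), key=itemgetter(0)):
            yield page, sum(count for _, count in group)

    pairs = [pair for line in log_lines for pair in map_phase(line)]
    for page, hits in reduce_phase(pairs):
        print(page, hits)  # structured (page, hit_count) rows, ready to load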

Next was a discussion of performance. Greenplum has compared performance and asserted that their relational database is 100 times faster than their map-reduce engine for doing the same query. I was somewhat surprised by the magnitude of this number; however, I know that at the low end of system and data size a relational database can be much faster than map-reduce, and at the high end there are places you can go with map-reduce that conventional database servers will not go, so it is never a level comparison. I will write more on this in another post.

Finally we got to performance on Virtual Machines (VMs) and in the cloud. Again Greenplum had measured their performance and offered the following. In a place where conditions are well controlled, like a private cloud, they expect to see a 30% performance reduction from running on VMs. In a public cloud like the Amazon EC2 cloud, they see a fourfold performance reduction. The problem in a public cloud is inconsistent speed for data access and networking. They see both an overall speed reduction and inconsistent speeds when the same query is run over and over again.

It is worth remembering that Greenplum and other distributed database systems are designed to run on a set of servers with the same performance. In practice this means that the whole database system tends to run at the speed of the slowest instance. On the other hand, map-reduce is designed to run on distributed systems with inconsistent performance. The workload is dynamically balanced as the map-reduce job progresses, so map-reduce will work relatively better in a public cloud than a distributed database system will.
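
A toy calculation shows the difference. With static partitioning the job finishes when the slowest worker finishes its fixed share, while with dynamic balancing it finishes when the pooled capacity is used up; the worker speeds below are invented.

    # Four workers, one of them slow (speeds in tasks per second, invented).
    speeds = [10.0, 10.0, 10.0, 2.5]
    total_tasks = 400

    # Static partitioning (the distributed database case): each worker gets
    # an equal share up front, so the slowest worker sets the finish time.
    static_time = max((total_tasks / len(speeds)) / s for s in speeds)

    # Dynamic balancing (the map-reduce case): workers pull tasks from a
    # shared queue, so the job finishes when pooled capacity is exhausted.
    dynamic_time = total_tasks / sum(speeds)

    print(f"static partitioning: {static_time:.1f}s")   # 100 / 2.5 = 40.0s
    print(f"dynamic balancing:   {dynamic_time:.1f}s")  # 400 / 32.5 = 12.3s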

Monday, January 31, 2011

Security in the Cloud

Although I am not an expert, I have been asked more than once about security in the cloud. Now I can help, because last week I got an education on best security practices in the cloud at the SDForum Cloud SIG meeting. Dave Asprey, VP of Cloud Security at Trend Micro, gave us 16 best practices for ensuring that data is safe in a public cloud like the Amazon cloud services. I will not list all of them, but here is the gist.

Foremost is to encrypt all data. The cloud is a very dynamic place, with instances being created and destroyed all over the place, and your instances or data storage may be moved about to optimize performance. When this happens, a residual copy of your data can be left behind for the next occupier of that space to see. Although this would happen by accident, you do not want to expose confidential data for others to see. The only cure is to encrypt all data so that whatever may be left behind is not recognizable. Thus you should only use encrypted file systems, encrypt data in shared memory and encrypt all data on the network.
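
Here is a minimal sketch of encrypt-before-upload, assuming the third-party Python cryptography package rather than any particular vendor's tools; the sample data is invented.

    # Requires the third-party "cryptography" package (pip install cryptography).
    from cryptography.fernet import Fernet

    # Generate the key outside the cloud and keep it there; only ciphertext
    # ever lands on cloud storage, so leftover blocks reveal nothing.
    key = Fernet.generate_key()
    cipher = Fernet(key)

    plaintext = b"quarterly revenue figures"  # invented sample data
    ciphertext = cipher.encrypt(plaintext)    # this is what goes to the cloud

    # Decrypt only when needed, then drop the key reference so it does not
    # linger in memory longer than necessary (Python cannot guarantee a wipe).
    assert Fernet(key).decrypt(ciphertext) == plaintext
    del key, cipher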

Management of encryption keys is important. For example, you should only allow the decryption key to enter the cloud when it is needed, and make sure that it is wiped from memory after it has been used. Passwords are a thing of the past. Instead of a password, being able to provide the key to decrypt your data is sufficient to identify you to your cloud system. There should be no password-based authentication, and access to root privileges should not be mediated by a password but should be enabled as needed by mechanisms like encryption keys.

Passive measures are not the end of cloud security. There are system hardening tools and security testing services. Also use an active intrusion detection system, for example OSSEC. Finally, and most importantly, the best advice is to "Write Better Applications!"

Monday, January 17, 2011

The Steve Jobs Media Playbook

Information wants to be free. Steve Jobs is not usually associated with setting information free; however, he set music free and may well be on the way to setting more media free. Here is the playbook that he used to set music free, and an examination of whether he can set other media free.

Back at the turn of the millennium, digital music was starting to make waves, and Apple introduced their first iPod in 2001. At the beginning, it was not a great seller. The next year, the second-generation iPod, which worked with Microsoft Windows, came out and sales started to take off. The next problem in promoting sales of the iPod was to let people buy music directly. In those days, to buy music you had to buy a CD, rip it onto a computer and then sync the music onto the iPod.

The record companies did not like digital music. It was in the process of destroying their business model of selling physical goods, that is CDs, which had been plenty profitable until the internet and file sharing came along. Thus the record companies knew that if they were going to allow anyone to sell digital music, the music content had to be protected by a strong Digital Rights Management (DRM) system. Basically, DRM encrypts digital content so that it can only be accessed by a legitimate user on an accredited device.

Now there is one important thing about any encryption: it depends upon a secret key to unlock the content. If too many people know a secret, it is no longer a secret. So it made perfect sense for Apple to have their own DRM system and be responsible for keeping their secret safe. The only problem was that Apple effectively controlled the music distribution channel because of the DRM system and its secret. By providing exactly what the music business had asked for, Apple managed to wrest control of the distribution channel from them.

In the past I have joked about the music business controlling the industry by controlling the means of production. In fact, they controlled the business by controlling the distribution channel between the artists and the record stores that sold the music. When the iTunes store became the prime music distribution channel, it was game over for the recording industry. They had to climb down and offer their music without DRM to escape from its deadly embrace. DRM-free music has not stopped iTunes, but it does open up other sales channels.

The remaining question is: what will happen with other media? Apple will not dominate the tablet market as it has the music player market, so it will not be able to exert the same influence. On the other hand, other media is not as collectible as music. We collect music because we want to listen to it over and over again. With most other media, we are happy to consume it once and then move on. Thus we do not feel the need to own the media in the same way. I have some more thoughts that will have to wait for another time.