Saturday, February 26, 2011

The App Store Margin

Recently there has been a lot of discussion about Apple's announcement that it will take a 30% margin on subscriptions sold through the App Store, and a 30% margin on virtual products and subscriptions sold from within apps. Unfortunately, most of the discussion has been heat without light; that is, there have been no facts to back up the arguments on either side. I had been curious about the margin on selling goods anyway, and since I had the data, I computed the gross margin for publicly traded US companies in the various retail categories.

As you can see, the gross margin varies between 20% and 40%. The overall average is about 25%, dominated by the Grocery and Department & Discount categories. Retailers working in the real world, after paying for their goods, also have to pay for their properties, staff and marketing, so their net margin is considerably lower. Apple, on the other hand, is just processing payments and delivering virtual goods over the internet. On this basis, a 30% margin seems to be on the high side, although not completely out of line.
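To make the terms concrete, here is a minimal sketch of the gross versus net margin calculation, using invented numbers rather than any company's actual figures.

```python
# Illustrative only: the revenue and cost figures below are invented.
def gross_margin(revenue, cost_of_goods):
    """Gross margin: what is left after paying for the goods themselves."""
    return (revenue - cost_of_goods) / revenue

def net_margin(revenue, cost_of_goods, operating_costs):
    """Net margin: what is left after also paying for property, staff and marketing."""
    return (revenue - cost_of_goods - operating_costs) / revenue

revenue = 1_000_000.0          # hypothetical annual revenue
cost_of_goods = 750_000.0      # hypothetical cost of the goods sold
operating_costs = 200_000.0    # hypothetical property, staff and marketing costs

print(f"gross margin: {gross_margin(revenue, cost_of_goods):.0%}")                 # 25%
print(f"net margin:   {net_margin(revenue, cost_of_goods, operating_costs):.0%}")  # 5%
```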

Galen Gruman at InfoWorld points out that a higher margin tends to favor small app and content providers, because they would face high distribution costs anyway. On the other hand, a large content provider resents handing over 30% of its revenue to Apple for not doing a lot of work. For this reason, I expect large content providers to campaign for a bulk discount on the cost of distributing their content. Thus a good, and hopefully likely, outcome is a sliding scale: for example, a 30% margin on the first $20,000 per month, 20% on the next $20,000, 10% on the next $20,000 and so on (I have no insight into the business, so these numbers are invented as an illustration rather than a suggestion as to what they should be).
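To make the sliding scale concrete, here is a small sketch that computes the store's cut under the illustrative tiers above; the tier widths and rates are the invented numbers from the paragraph, and the treatment of revenue beyond the third tier is just one reading of "and so on".

```python
# Illustrative only: tier widths and rates are the invented numbers from the
# paragraph above, with everything beyond the third tier charged at the lowest rate.
TIERS = [
    (20_000, 0.30),   # first $20,000 per month at 30%
    (20_000, 0.20),   # next $20,000 at 20%
    (20_000, 0.10),   # next $20,000 at 10%
]
FINAL_RATE = 0.10     # one reading of "and so on": 10% on anything above $60,000

def store_cut(monthly_revenue):
    """Compute the store's cut of one month's revenue under the sliding scale."""
    cut, remaining = 0.0, monthly_revenue
    for width, rate in TIERS:
        in_tier = min(remaining, width)
        cut += in_tier * rate
        remaining -= in_tier
    return cut + remaining * FINAL_RATE

for revenue in (10_000, 50_000, 500_000):
    cut = store_cut(revenue)
    print(f"${revenue:>9,}: cut ${cut:>9,.0f} ({cut / revenue:.0%} effective)")
```

Under this toy scale, a small developer making $10,000 a month still pays the full 30%, while a large provider making $500,000 a month pays an effective rate closer to 11%.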

Part of the resentment with Apple is that they have a captive market and their behavior in stating terms appears dictatorial. They would have done much better to follow the standard politically correct procedure: put out a discussion document and then, after some to and fro, impose the terms they always intended. It has the same end result while creating goodwill through a patina of choice and consultation.

Saturday, February 19, 2011

Agile BI at Pentaho

Ian Fyfe, Chief Technology Evangelist at Pentaho, showed us what the Pentaho Open Source Business Intelligence Suite is and where they are going with it when he spoke to the February meeting of the SDForum Business Intelligence SIG on "Agile BI". Here are my notes on Pentaho from the meeting.

Ian started off with positioning. Pentaho is an open source Business Intelligence suite with a full set of data integration, reporting, analysis and data mining tools. Their perspective is that 80% of the work in a BI project is acquiring the data and getting it into a suitable form, and the other 20% is reporting and analysis of the data. Thus the centerpiece of their suite is the Kettle data integration tool. They also have the strong Mondrian OLAP analysis tool and the Weka data mining tools. Their reporting tool is perhaps not quite as strong as those of other open source BI suites that started out as reporting tools. All the code is written in Java; it is fully embeddable in other applications and can be branded for that application.

Ian showed us a simple example of loading data from a spreadsheet, building a data model from the data, and then generating reports from it. All of these things could be done from within the data integration tool, although they can also be done with stand-alone tools. Pentaho is working toward a fully integrated set of tools with common metadata shared between them all. Currently some of the tools are thick clients and some are web-based clients; they are moving to make all their client tools web based.

We had come to hear a presentation on agile BI, and Ian gave us the Pentaho view. In an enterprise, the task of generating useful business intelligence is usually done by the IT department in consultation with the end users who want the product. The IT people are involved because they supposedly know the data sources and they own the expensive BI tools. Also, the tools are complicated, and using them is usually too difficult for the end user. However, IT works to its own schedule, through its own processes, and takes its time to produce the product. Often, by the time IT has produced a report, the need for it has moved on.

Pentaho provides a tightly integrated set of tools with a common metadata layer, so there is no need to export the metadata from one tool and import it into the next one. The idea is that the end-to-end task of generating business intelligence from source data can be done within a single tool, or with a tightly integrated suite of tools. This simplifies and speeds up the process of building BI products to the point that they can be delivered while they are still useful. In some cases, the task is simplified to such an extent that it may be done by a power user rather than being thrown over the wall to IT.
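As a toy illustration of the shared-metadata idea, and emphatically not Pentaho's actual design or API, the sketch below has an ETL step and a reporting step read from one in-memory metadata registry instead of exporting and re-importing metadata between tools. All names and structures here are invented.

```python
# A toy illustration of shared metadata between ETL and reporting steps.
# This is NOT Pentaho's API; names and structures here are invented.
from dataclasses import dataclass, field

@dataclass
class ColumnMeta:
    name: str
    dtype: str          # e.g. "string" or "number"
    label: str          # business-friendly name shown in reports

@dataclass
class TableMeta:
    name: str
    columns: list[ColumnMeta] = field(default_factory=list)

# One registry that every "tool" in the suite reads and writes.
METADATA = {}

def etl_load(table_name, rows):
    """ETL step: load rows and register the table's metadata once."""
    columns = [ColumnMeta(k, "number" if isinstance(v, (int, float)) else "string", k.title())
               for k, v in rows[0].items()]
    METADATA[table_name] = TableMeta(table_name, columns)
    return rows  # in a real tool this would land in a warehouse table

def report(table_name, rows):
    """Reporting step: build headers from the same metadata, no export/import."""
    meta = METADATA[table_name]
    print(" | ".join(c.label for c in meta.columns))
    for row in rows:
        print(" | ".join(str(row[c.name]) for c in meta.columns))

rows = etl_load("sales", [{"region": "West", "amount": 1200}, {"region": "East", "amount": 950}])
report("sales", rows)
```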

The audience was somewhat sceptical of the idea that a sprinkling of common metadata can make for agile BI. All the current BI suites, commercial and open source, have been pulled together from a set of disparate products, and they all have rough edges in the way the components work together. I can see that deep and seamless integration between the tools in a suite will make the work of producing business intelligence faster and easier. Whether it will be fast enough to be called agile, we will have to learn from experience.

Sunday, February 06, 2011

Revolution: The First 2000 Years of Computing

For years, the Computer History Museum (CHM) has had an open storage area where they put their collection of old computers, but without any interpretation except for docent-led tours. I had no problem wandering through this treasure trove because I knew a lot about what they had on show, from slide rules and abacuses to the Control Data 6600 and the Cray machines. Even then, a docent could help by pointing out features that I would otherwise miss, such as the ash tray on each workstation of the SAGE early warning computer system.

Now the CHM has opened their "Revolution: The First 2000 Years of Computing" exhibition, and I recommend a visit. They still have all the interesting computer hardware that was in the visible storage area; however, it is placed in a larger space and there is all kinds of interpretive help, from explanations of the exhibits to video clips that you can browse. In my visit, I saw a lot of new things and learned much.

For example, Napier's Bones are an old-time calculation aid that turns long multiplication into addition. The Napier's Bones exhibit explains how they work and allows you to do calculations using a set. The exhibit on computers and rocketry has the guidance computer for a large missile arrayed in a circle around the inside of the missile skin, leaving an ominously empty space in the middle for the payload. In the semiconductor area they had examples of silicon wafers, ranging from an early wafer the size of a small coin to a current wafer the size of a large dinner plate. There is also an interesting video discussion of the marketing of early microprocessors like the 8086, the Z8000 and the M68000, and the absolute importance of landing the design win for the IBM PC that led to the current era where Intel is the biggest and most profitable chip maker. These are just a sample of the many fascinating exhibits there.
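As a rough sketch of the idea behind the bones, rather than a description of the exhibit: each bone tabulates the multiples of one digit, so a long multiplication reduces to looking up single-digit products and adding shifted partial results.

```python
# A toy sketch of the idea behind Napier's Bones: long multiplication reduced to
# table look-ups plus addition. The "bones" here are just precomputed multiple tables.
BONES = {d: [d * m for m in range(10)] for d in range(10)}  # one "bone" per digit 0-9

def multiply_with_bones(a, b):
    """Multiply a by b using only the bone tables, place-value shifts and addition."""
    total = 0
    for place, digit_char in enumerate(reversed(str(b))):
        digit = int(digit_char)
        # Build one partial product by reading a's digits off this digit's bone.
        partial = 0
        for a_place, a_char in enumerate(reversed(str(a))):
            partial += BONES[digit][int(a_char)] * 10 ** a_place
        total += partial * 10 ** place   # shift by the digit's place and add
    return total

print(multiply_with_bones(46785399, 7))    # 327497793
print(multiply_with_bones(425, 63))        # 26775
```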

I spent over 2 hours in the exhibition and only managed to get through half of it. I am a long-time member of the museum and can go back any time, so this is a warning to non-members to allow enough time for their visit.

Saturday, February 05, 2011

Greenplum at Big Data Camp

I was at the Big Data Camp at the Strata Big Data conference the other day, and one of the breakout sessions was with Greenplum. They had several interesting things to say about map-reduce, about performance, and about performance in the cloud. Greenplum is a parallel database system that runs on a distributed set of servers. To the user, Greenplum looks like a conventional database server, except that it should be faster and able to handle larger data because it farms out the data and the workload over all the hosts in the system. Greenplum also has a map-reduce engine in the server and a distributed Hadoop file system. Thus the user can use Greenplum both as a big data relational database and as a big data NoSQL database.

Map-reduce is good for taking semi-structured data and reducing it to more structured data. The example of map-reduce that I gave some time ago does exactly that. Thus a good use of map-reduce is to do the transformation part of ETL (Extract-Transform-Load), which is the way data gets into a data warehouse. The Greenplum people confirmed that this is a common use pattern for map-reduce in their system.
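As a sketch of that use pattern in plain Python, generic map-reduce rather than Greenplum's engine, the mapper below parses semi-structured log lines into key/value pairs and the reducer aggregates them into structured rows ready to load into a warehouse table. The log format and field names are invented.

```python
# A toy map-reduce job that does the "T" of ETL: turn semi-structured log lines
# into structured (date, product, total_sales) rows. Plain Python, not Greenplum.
from collections import defaultdict

LOG_LINES = [                           # hypothetical semi-structured input
    "2011-02-01 12:01:33 sale product=widget amount=19.99",
    "2011-02-01 12:05:10 sale product=gadget amount=5.00",
    "2011-02-01 13:22:41 sale product=widget amount=19.99",
]

def map_phase(line):
    """Emit ((date, product), amount) pairs from one raw log line."""
    date, _time, _event, product_kv, amount_kv = line.split()
    yield (date, product_kv.split("=")[1]), float(amount_kv.split("=")[1])

def reduce_phase(key, values):
    """Sum the amounts for one (date, product) group."""
    date, product = key
    return {"date": date, "product": product, "total_sales": round(sum(values), 2)}

# Shuffle: group the mapper output by key, then reduce each group.
groups = defaultdict(list)
for line in LOG_LINES:
    for key, value in map_phase(line):
        groups[key].append(value)

for key, values in sorted(groups.items()):
    print(reduce_phase(key, values))   # structured rows, ready for a warehouse fact table
```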

Next was a discussion of performance. Greenplum has compared performance and asserts that their relational database is 100 times faster than their map-reduce engine for the same query. I was somewhat surprised by the magnitude of this number. However, I know that at the low end of system and data size a relational database can be much faster than map-reduce, and at the high end there are places you can go with map-reduce that conventional database servers will not go, so it is never a level comparison. I will write more on this in another post.

Finally we got to performance on Virtual Machines (VMs) and in the cloud. Again Greenplum had measured their performance and offered the following. Where conditions are well controlled, as in a private cloud, they expect to see a 30% performance reduction from running on VMs. In a public cloud like the Amazon EC2 server cloud, they see a 4 times performance reduction. The problem in a public cloud is inconsistent speed for data access and networks. They see both an overall speed reduction and inconsistent speeds when the same query is run over and over again.

It is worth remembering that Greenplum and other distributed database systems are designed to run on a set of servers with the same performance. In practice this means that the whole database system tends to run at the speed of the slowest instance. On the other hand, map-reduce is designed to run on distributed systems with inconsistent performance. The workload is dynamically balanced as the map-reduce job progresses, so map-reduce will work relatively better in a public cloud than a distributed database server.
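A small simulation with invented numbers (not Greenplum's measurements) illustrates the point: a query split statically across nodes finishes only when the slowest node finishes, while dynamically scheduled tasks keep the fast nodes busy, so inconsistent public-cloud speeds hurt the static case much more.

```python
# Toy model: 4 nodes with uneven speeds (work units per second), 400 units of work.
# The numbers are invented for illustration, not measurements from Greenplum.
import heapq

NODE_SPEEDS = [10.0, 10.0, 10.0, 2.5]   # one slow node, as on a noisy public cloud
TOTAL_WORK = 400.0

def static_partition_time(speeds, work):
    """Work is split evenly up front; the job ends when the slowest node finishes."""
    share = work / len(speeds)
    return max(share / s for s in speeds)

def dynamic_tasks_time(speeds, work, task_size=10.0):
    """Work is cut into small tasks handed out as nodes free up (map-reduce style)."""
    free_at = [(0.0, s) for s in speeds]   # (time the node becomes free, node speed)
    heapq.heapify(free_at)
    finish = 0.0
    for _ in range(int(work / task_size)):
        t, s = heapq.heappop(free_at)
        t += task_size / s
        finish = max(finish, t)
        heapq.heappush(free_at, (t, s))
    return finish

print(f"static split:  {static_partition_time(NODE_SPEEDS, TOTAL_WORK):.1f} s")
print(f"dynamic tasks: {dynamic_tasks_time(NODE_SPEEDS, TOTAL_WORK):.1f} s")
```

With these made-up speeds, the static split waits about 40 seconds for the slow node, while the dynamically balanced version finishes in roughly a third of that time.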