Tuesday, November 30, 2010

The Registration Dilemma

To register or not to register, that is the question:
Whether 'tis better to create a new online account,
or just make do with the existing ones,
and so lead a slightly less ennobled life.

Online account registration is a barrier, something that we are all thinking about as this is the season for buying stuff. As I said previously, I have about 70 online accounts where I actively maintain a user identity, and I have created many, many more. Thus every time I am presented with the choice of registering for a new site, I stop and think: do I really want to create another account? In the past couple of weeks I have decided to forgo creating 3 new online accounts and just stick to my well traveled paths.

Registration is not always thought of as a bad thing. For example, Dave McClure, Master of 500 Hats, micro venture capitalist and relentless promoter of analytics to improve web based businesses, has Activation as the second step in his 5 step program for web enterprise success. Now Activation does not necessarily imply Registration; however, Registration is the most common and strongest form of Activation. Dave's perspective is that to succeed on the net, your product needs to be strong enough to overcome any barriers to Activation.

There have been many initiatives to vault over the registration hurdle. The most promising one is OpenID, an open system that allows you to use your account at one web site to log onto other web sites. A couple of years ago I thought that this was a good solution to the Single Sign-on problem and worth promoting. Now OpenID seems to be moribund and it is not widely used. I am not sure what happened, but I did hear rumors of an argument and a split which diminished the organization.

One of the problems with OpenID, and any other such system, is that it tends to favor and strengthen the big players like Yahoo and Google. Another idea that people often trot out is some form of micro-payments system that would obviate the need for registration at many sites. There are a couple of problems with this. Firstly, any payment is its own barrier, and creating many little barriers instead of one is not a path that is likely to lead to success. For a broader discussion of this issue I recommend the book Free by Chris Anderson.

The second problem is that a successful micro-payment system will favor and strengthen the big player that operates it. It has to be a big player, as no one is going to trust their payments to some small and unknown start-up. In practice, the only really successful micro-payment site is iTunes, and it illustrates all these problems. In the beginning we all cheered as Steve Jobs took on the record companies. Now that iTunes is the leading purveyor of music, many people have taken to railing against the power of Apple.

The Registration Dilemma is this. We can either continue with the current system that has a chaos of millions of sites, each with their own registration that we need to manage, or we can give in to consolidation and just deal with a few giants. Every time I think about it, I end up siding with chaos.

Tuesday, November 16, 2010

Yeah, Yeah, Yeah

This morning I woke up to the local newspaper headline "Do you want to know a Secret?", and knew that something was going on. Later they changed their tune to something more like The Wall Street Journal, which starts its piece "Steve Jobs is nearing the end of his long and winding pursuit of the Beatles catalog." Other newspapers had headlines like "All you need is iTunes", "Let it be Available" and "Apple and The Beatles finally come together on iTunes". All in all, it seems like a bunch of stupid headline tricks from the old media, a sure sign that they are getting past it.

Meanwhile the new media is a lot more standoffish. Wired News is like "Yawn". TechCrunch is all business with "All 17 Beatles Albums Are In The Top 100 On iTunes". Of course Fake Steve Jobs had a field day, providing by far the best commentary on the whole event.

Monday, November 15, 2010

Open Source Coopetition

Coopetition is the driving force behind many of the best Open Source projects. In the past, I have written about several different reasons that Open Source projects exist. There are business models like the low cost sales channel. Open Source can act as a home for old software that is still useful, but not commercially essential to a business. There have been attempts to use Open Source as a weapon, to suck the air out of a competitor's lungs by devaluing its intellectual property, although many of these attempts have been less successful than their originators hoped.

A presentation on Hadoop got me thinking about Coopetition and Open Source. Hadoop is a big Open Source project to implement all the components of what I have called the Google Database System and a lot more. The major contributors to Hadoop are Yahoo!, Facebook and Powerset - now a part of Microsoft. While these companies are related in that Microsoft owns a stake in Facebook, has tried to buy Yahoo! and now Yahoo! uses Microsoft's Bing search engine, they are also competitors, fighting each other for the attention of web users.

So is it strange that these three companies should cooperate to build Hadoop, an incredibly useful and widely used Open Source project? Firstly, the genius of Open Source is that they are not cooperating directly with each other; they are all contributing code to a third party, the non-profit Apache Foundation that oversees the Hadoop project. Secondly, by spreading the cost of the software over many contributors, they all gain much more than they contribute. Finally, many eyes and the public nature of the code tend to make it better than code that is bottled up in secret and protected from prying eyes. Because the Open Source model allows for the kind of coopetition that brings us software like Hadoop, we all benefit.

Thursday, November 11, 2010

Write Down Your Password

If Bruce Schneier says that you should write down your password, then write down your password. What he means is that given the choice between a weak password that is so easy to remember that you do not need to write it down and a strong password that you do need to write down to remember, it is better to go for the strong password. However, the problem of online identity management is much more complicated. Note that even the terminology is broken. We need to distinguish "online reputation management", which is about managing your personal brand online, from "online identity management", which is about managing how you authenticate yourself with websites. Often, the term online identity management is used for online reputation management.

The problems of online identity management start long before you need to provide a password. First you have to provide a user name. Each site has its own rules about what your user name should be. About half of web sites use an email address as a user identifier, while the other half insist that you play the game of user name roulette, where you have to keep guessing a user name until you find one that has not been taken. I have enough different user names that I have to write down my user name for each site, before even thinking about writing down a password.

The next problem is the large number of sites where you have an account. I have about 70 sites where I actively maintain a user identity, and there are many more sites where I have registered an identity and then abandoned it. Of those 70 sites, about 15 are sites, like banking sites, that are important to protect with a strong password.

One site that is particularly important to protect is your email account. Use a strong password with your email account and do not use that password on any other account. If your email account is compromised, you are in a lot of trouble. For example, many sites allow you to reset your password by mailing you a new one, and an attacker who gains access to your email account can read your email, including emails from other sites where you are registered. Conversely, many sites store your email address and password, so if one of them is compromised and you use the same password for all accounts, the attacker now has both your email address and the password to your email account.

Another serious problem is any account that gives you access after answering security questions. The security questions are effectively another password, and they encourage answers that are easy to guess. You are better off giving nonsense answers to security questions, except that you then need to write down the answers to those questions as well. All in all, online identity management is a pain.
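To make the nonsense-answer approach concrete, here is a small sketch in Python. The helper name, answer length and sample questions are my own invention for illustration; the point is simply that a randomly generated answer is as strong as a password, and has to be recorded somewhere just like one:

```python
import secrets
import string

def nonsense_answer(length=16):
    """Generate a random nonsense answer for a security question.

    A truthful answer ("mother's maiden name") is often guessable or
    discoverable; a random string is effectively a second password.
    """
    alphabet = string.ascii_lowercase + string.digits
    return "".join(secrets.choice(alphabet) for _ in range(length))

# Hypothetical security questions from some site's registration form.
questions = ["First pet's name?", "Mother's maiden name?"]
answers = {q: nonsense_answer() for q in questions}
for q, a in answers.items():
    print(q, "->", a)  # record these safely, e.g. in a password manager
```

The `secrets` module is used rather than `random` because it draws from a cryptographically strong source, which is what you want for anything password-like.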

Sunday, October 31, 2010

The New OLAP

Just as there are new approaches to database management with the NoSQL movement, so is there a move to a new OLAP, although this movement is just emerging and has not taken a name yet. This month at the SDForum Business Intelligence SIG meeting, Flurry talked about how they put their data on mobile app usage in a giant data-cube. More recently, Chris Riccomini of LinkedIn spoke to the SDForum SAM SIG about the scalable data cubing system that they have developed. Here is what I learned about Avatara, the LinkedIn OLAP server. DJ Cline has also written a report of the event.

If you do not know what OLAP is, I had hoped to just point to an online explanation, but could not find any that made sense. The Wikipedia entries are pretty deplorable, so here is a short description. Conceptually, OLAP stores data in a multi-dimensional data cube, and this allows users to look at the data from different perspectives in real time. For example, take a simple cube of sales data that has three dimensions: a date dimension, a sales person dimension, and a product dimension. In reality, OLAP cubes have more dimensions than this. Each dimension contains a hierarchy, so the sales person dimension groups sales people by state, then sales region, then country. At the base level the cube contains a data point, called a measure, for each sale of each product made by each sales person on the date when the sale was made. OLAP allows the user to look at the data in aggregate, and then drill down on the dimensions. In the example cube, a user could start by looking at the sales of all products grouped by quarter. Then they could drill down to look at the sales in the most recent quarter divided by sales region. Next they could drill down again to look at sales in the most recent quarter by sales person, comparing, say, the Northern region to the Western region, and so on.
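The drill-down described above can be sketched in a few lines of Python. This is a toy illustration of the idea, not how a real OLAP server stores a cube; all the names and figures are invented, and for brevity the date hierarchy starts at the quarter level:

```python
from collections import defaultdict

# Base-level facts: one measure (sale amount) per
# (quarter, region, sales person, product) combination.
sales = [
    ("2010-Q2", "Northern", "Alice", "Widget", 120.0),
    ("2010-Q2", "Western",  "Bob",   "Widget",  90.0),
    ("2010-Q3", "Northern", "Alice", "Gadget", 200.0),
    ("2010-Q3", "Northern", "Carol", "Widget",  50.0),
    ("2010-Q3", "Western",  "Bob",   "Gadget", 160.0),
]

def rollup(facts, *dims):
    """Aggregate the sale measure over the named dimension positions."""
    totals = defaultdict(float)
    for row in facts:
        key = tuple(row[d] for d in dims)
        totals[key] += row[4]
    return dict(totals)

# All products grouped by quarter (dimension 0)...
by_quarter = rollup(sales, 0)
# ...then drill down: quarter by sales region (dimensions 0 and 1).
by_quarter_region = rollup(sales, 0, 1)
print(by_quarter[("2010-Q3",)])                   # 410.0
print(by_quarter_region[("2010-Q3", "Northern")])  # 250.0
```

A real OLAP engine precomputes or caches many of these aggregates so that each drill-down is instantaneous, rather than re-scanning the base facts as this sketch does.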

The new OLAP is enabled by the same forces that are changing databases with NoSQL. Firstly, the rise of commodity hardware that runs Linux, the commodity operating system, allows the creation of cheap server farms that encourage parallel distributed processing. Secondly, the inevitable march of Moore's law is increasing the size of main memory, so that now you can spec a commodity server with more main memory than a commodity server had in disk space 10 years ago. An OLAP data cube can be partitioned along one or more of its dimensions to be distributed over a server farm, although at this stage partitioning is more of a research topic than standard practice. Huge main memory allows large cubes to reside in main memory, giving near instantaneous response to queries. For another perspective on in-memory OLAP, read the free commentary by Nigel Pendse at the BI-Verdict (it used to be called the OLAP Report) on "What in-memory BI ‘revolution’?"

LinkedIn is a fast growing business-oriented social networking site. They have developed Avatara to support their business needs and currently run several cubes on it. The plan is to open source the code later this year. Avatara is an in-memory OLAP server that uses partitioning to provide scalability beyond the capabilities of a single server.

The presentation was fast paced and it has taken me some time to appreciate the full implications of what was said. Here are some thoughts. Avatara offers an API that is reminiscent of JOLAP rather than the MDX language that is the standard way of programming OLAP, probably because an API is easier to implement than a programming language. Avatara does not support hierarchies in its dimensions, but the number of dimensions in a typical cube seems to be higher than usual. It may be that they use more dimensions, rather than hierarchies within a dimension, to represent the same information. This trades roll-up within the cube for slicing along dimensions. Slicing is probably more efficient and easier to implement, while a hierarchy is easier for the user to understand as it allows for drill up and down.

Chris mentioned that most dimensions are small, and that can be true; however, the real problems with OLAP implementations start when you have more than one large dimension and you have to deal with the issue of sparsity in the data cube. Chris spent some time on the problem of a dimension with more than 4 billion elements, and this seems to be a real requirement at LinkedIn. Current OLAP servers seem to be limited to 2 billion elements in a dimension, so they are going to be even more constraining than Avatara.

Sunday, October 24, 2010

Accidental Data Empires

In the new world of big data and analytics, a winning business model is to find a novel way to collect interesting big data. Once you have the data, the ways to exploit it are endless. It is a phenomenon that I have seen several times; the latest example is Flurry, a company that collects and aggregates data from mobile applications. Peter Farago, VP Marketing, and Sean Byrnes, CTO and Co-founder of Flurry, spoke to the October meeting of the SDForum Business Intelligence SIG on "Your Company’s Mobile App Blind Spot".

The Flurry proposition is simple: they offer a toolkit that an app developer combines with their mobile app. The app developer goes to the Flurry website, creates a free account and downloads the toolkit. Whenever an instance of the app with the Flurry code is activated or used, it collects information about the usage that is sent back to Flurry. The amount of information is small, usually about 1.2 kB compressed, so the burden of collection is small. At Flurry, the data is collected, cleansed and put in a gigantic data cube. At any time, an app developer can log into the Flurry website and get reports on how their application is being used. You can get a feel for their service by taking the short Analytics developer tour. Flurry has committed that their Analytics service will always be free.

While there are some issues with data collection that Flurry deals with, the quality of the data is great. Every mobile phone has a unique identifier so there is no problem with identifying individual usage patterns. As the service is free, there is very little friction to its use. Flurry estimates that they are in one in five mobile apps that are out there. In fact, for an app developer, the only reason for not using Flurry is that they have chosen to use a rival data collection service.

In the end however, the big winner is Flurry, who collect huge amounts of information about mobile app and phone usage. In the meeting Peter Farago gave us many different analyses of where the mobile smartphone market is and where it is going, including adoption rates for iPhones versus Android based phones and how the follow on market for apps on each platform is developing. You can get a mouthwatering feel for the information they presented by looking at their blog in which they publish a series of analyses from their data. As I write their latest post shows a graph on the "Revenue Shift from Advertising to Virtual Goods Sales" which shows that apps are growing their revenue from sales of virtual goods, while advertising revenue seems to be stagnant.

With data aggregators, there is always something creepy about discovering just how much data they have on you. Earlier this year there was an incident where a Flurry blog post described some details of the iPad, gleaned from apps running on the new devices in the Apple offices, a few days before the iPad was announced. Steve Jobs was so provoked by this that he called out Flurry by name and changed the iPhone app developer terms of service to prevent apps from collecting certain sorts of data. You can read more about this incident in the blog report on the meeting by my colleague Paul O'Rorke.

The title of this piece is a reference to the entertaining and still readable book Accidental Empires by Robert X. Cringely about the birth of the personal computer industry and the rivalry between Steve Jobs and Bill Gates.

Wednesday, October 13, 2010

A Critique of SQL

SQL is not a perfect solution as I told the audience at the SDForum Business Intelligence SIG September meeting, where I spoke about "Analytics: SQL or NoSQL". The presentation discusses the difference between SQL and structured data on the one hand versus the NoSQL movement and semi-structured data on the other hand. There is more to the presentation than I can fit in one blog post, so here is what I had to say about the SQL language itself. I will write more about the presentation at another time. You can download the presentation from the BI SIG web site.

Firstly the good. SQL has given us a model of a query language that seems so useful as to be essential. Every system that provides persistence has developed a query language. Here is a smattering of examples. The Hibernate object persistence system has Hibernate Query Language (HQL), which has been developed into the Java Persistence Query Language (JPQL). Other Java based object oriented persistence systems either use JPQL or their own variant. Hive is a query interface built on top of the Hadoop Map-Reduce engine. Hive was initially developed by Facebook as a simplified way of accessing their Map-Reduce infrastructure when they discovered that many of the people who needed to write queries did not have the programming skills to handle a raw Map-Reduce environment. XQuery is a language for querying a set of XML documents. It has been adopted into the SQL language and is also used with stand alone XML systems. If data is important enough to persist, there is almost always a requirement to provide a simple and easy to use reporting system on that data. A query language handles the simple reporting requirements easily.

On the other hand, SQL has many problems. Here are my thoughts on the most important ones. The first problem is that SQL is not a programming language, it is a data access language. SQL is not designed for writing complete programs; it is intended to fetch data from the database, and anything more than a simply formatted report is done in another programming language. This concept of a data access language for accessing a database goes back to the original concept of a database as promulgated by the CODASYL committee in the late 1960s.

While most implementations of SQL add extra features to make it a complete programming language, they do not solve the problem, because SQL is a language unlike any of the other programming languages we have. Firstly, SQL is a relational language. Every statement in SQL starts with a table and results in a table. (Table means a table like in a document: a fixed number of columns and as many rows as are required to express the data.) This is a larger chunk of data than programmers are used to handling. The procedural languages that interface to SQL expect to deal with data at most a row at a time. Also, the rigid table of SQL does not fit well into the more flexible data structures of procedural languages.

Moreover, SQL is a declarative language, where you specify the desired results and the database system works out the best way to produce them. Our other programming languages are procedural, where you describe to the system how it should produce the desired result. Programming SQL requires a different mindset from programming in procedural languages. Many programmers, most of whom just dabble in SQL as a sideline, have difficulty making the leap and are frustrated by SQL because it is just not like the programming languages that they are used to. The combination of a relational language and a declarative language creates a costly mismatch between SQL and our other programming systems.
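The declarative versus procedural contrast is easy to see side by side. Here is a small sketch using Python's built-in sqlite3 module; the table and data are invented for illustration:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?)",
                 [("North", 100.0), ("North", 50.0), ("West", 75.0)])

# Declarative: state WHAT you want; the engine decides how to compute it.
declarative = dict(conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region"))

# Procedural: spell out HOW to compute it, one row at a time.
procedural = {}
for region, amount in conn.execute("SELECT region, amount FROM sales"):
    procedural[region] = procedural.get(region, 0.0) + amount

print(declarative == procedural)  # True
```

Both produce the same totals, but notice the shape of the code: the SQL version hands the whole table to the engine in one statement, while the procedural version mirrors how programmers usually think, accumulating state a row at a time.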

Finally, SQL becomes excessively wordy, repetitive and opaque as queries become more complicated. Sub-queries start to abound, and the need for correlated sub-queries, outer joins and pivoting data for presentation causes queries to explode in length and complexity. Analytics is the province of complicated queries, so this is a particular problem for data analysts. In the past I have suggested that persistence is a ripe area for a new programming language; however, although there are many new languages being proposed, none of them are concerned with persistence or analytics. The nearest thing to an analytics programming language is R, which is powerful but neither new nor easy to use.
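To see how quickly a modest requirement turns into a correlated sub-query, consider "find each region's largest sale". This sketch again uses Python's sqlite3 module with invented data; the inner SELECT is conceptually re-evaluated for every row of the outer query:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, person TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)",
                 [("North", "Alice", 100.0), ("North", "Bob", 50.0),
                  ("West", "Carol", 75.0), ("West", "Dave", 125.0)])

# Correlated sub-query: the inner query refers to s.region from the
# outer query, so it must be answered per outer row.
rows = conn.execute("""
    SELECT s.region, s.person, s.amount
    FROM sales s
    WHERE s.amount = (SELECT MAX(s2.amount)
                      FROM sales s2
                      WHERE s2.region = s.region)
    ORDER BY s.region
""").fetchall()
print(rows)  # [('North', 'Alice', 100.0), ('West', 'Dave', 125.0)]
```

Even this tiny example needs two table aliases and a nested SELECT for a one-sentence requirement, which is the wordiness and opacity complained about above.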

Wednesday, October 06, 2010

Vertical Pixels are Disappearing

The quality of monitors for PCs is going backwards. A few years ago, noticing the rise of the widescreen monitor and fearful that all reasonably proportioned monitors would soon disappear, I bought a Samsung Syncmaster 204B (20.1" screen, 1600x1200 pixels). Last night it popped and stopped working. When I went online to research a replacement monitor, the full gravity of the situation became obvious.

Not only is it virtually impossible to find a monitor that is not widescreen, almost all monitors that you can buy, whatever the size of their screen, are 1920x1080 pixels. In the years since I bought the 204B, the number of pixels that we get in the vertical direction has shrunk from 1200 to 1080! Funnily enough, there is a post on Slashdot this morning titled "Why are we losing our vertical pixels" about this topic. The post has drawn many more than the usual number of comments.

For me, the vertical height of the screen is important. I use my computer for reading, writing, programming, editing media and some juggling with numbers. For each activity, having a good height to the screen helps, and width after a certain point does not add much. A television uses 1920x1080 pixels for a full 1080p display. The monitor manufacturers are just giving us monitors made from cheap LCD panels designed for televisions. When I watch TV, I have a much larger screen in another room with more comfortable chairs and more room between me and the screen. Thus, I do not need or want a computer monitor that is expressly designed for watching TV.

The real problem is that 1920x1080 monitors are so ubiquitous that it is difficult to find anything else. After a lot of searching I only found a couple of alternatives. Apple has a 27" widescreen monitor that is 2560x1440 pixels at a cost of ~$1000, and only works well with some recent Apple systems. Dell has a 20" monitor in their small business section that is 1600x1200 and costs ~$400. However, Dell seems to vary the type of LCD panel that they use between production runs, and one type of panel is a lot better than the other. Unfortunately, you do not know which type of panel you are going to get until it arrives at your door. Neither alternative gets me really excited. One thing is certain: technology is supposed to be about progress, and I am not going backwards and accepting fewer pixels in any dimension for my next monitor.

Thursday, September 30, 2010

Tablet Aspect Ratios

One important issue with tablet computers that is getting little attention is the screen aspect ratio. Some time ago I wrote about "aspect ratio hell" while trying to decide how to crop holiday photographs. The answer seems to be that you have to crop each photograph independently for each way the photograph is going to be output or displayed. For photographs, the variety of different aspect ratios is a perplexing problem that has no good answer.

Tablet computers have the same problem, except that the responsibility lies with app developers who need to make their app work well with the aspect ratios of their target platforms. Aspect ratios for a tablet need to take into consideration that it will be used in both portrait and landscape mode. The iPad has an aspect ratio of 4:3 (AR 1.33...), which is the same as the iPod Classic, while the iPhone and iPod touch have an aspect ratio of 3:2 (AR 1.5). Anyone trying to develop apps for Apple products needs to take this difference into account. On the other hand, both Blackberry and Samsung have announced Android based tablets with a 7 inch screen that has an aspect ratio of 128:75 (AR 1.706...), which is close to 16:9 (AR 1.77...).

When we look at media, television uses 16:9 and most cinema has a higher ratio like 2.40:1, except for IMAX (AR 1.44), which is much squarer. Books and newspapers use a 3:2 ratio (AR 1.5), while magazines tend to be broader with a lower aspect ratio. Frankly, anything with an aspect ratio of more than 3:2 tends to look unnaturally skinny when viewed in portrait mode. A cell phone can get away with a higher aspect ratio because it has to be pocketable, but a larger device meant for viewing media in both landscape and portrait mode needs to keep its aspect ratio to 3:2 or less. For example, the Kindle, which is mostly used in portrait mode, has an aspect ratio of 4:3 (AR 1.33...). From this point of view, the Samsung and Blackberry tablets seem to be designed to be used in landscape mode and not in portrait mode. I hope that other tablet makers do not make the same mistake.
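For the record, here is the arithmetic behind the aspect ratios quoted above, worked exactly with Python's fractions module. The device names and ratios are as cited in the text; the 3:2 cutoff for portrait use is my own rule of thumb, not an industry standard:

```python
from fractions import Fraction

# Aspect ratios mentioned above, as exact fractions.
ratios = {
    "iPad (4:3)":             Fraction(4, 3),
    "iPhone (3:2)":           Fraction(3, 2),
    "7-inch tablet (128:75)": Fraction(128, 75),
    "HDTV (16:9)":            Fraction(16, 9),
}

threshold = Fraction(3, 2)  # anything above this looks skinny in portrait
for name, ar in ratios.items():
    verdict = "fine in portrait" if ar <= threshold else "skinny in portrait"
    print(f"{name}: {float(ar):.3f} -> {verdict}")
```

Running this confirms that 128:75 works out to about 1.707, comfortably above the 3:2 line and close to HDTV's 1.778, which is why those 7 inch tablets read as landscape-first devices.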

Saturday, September 04, 2010

Understanding the iPad

Some people still struggle to understand the iPad. When it was first announced, there were shrieks of outrage from techies, complaining that it was not a free and open computer system and so nobody should buy one. Then it came out and was adopted by the millions. Steve Ballmer, CEO of Microsoft, expressed dismay that the iPad is easily outselling any tablet computer that Microsoft has ever had a hand in. More recently an executive from LG told the Wall Street Journal that they would bring out a tablet that would be better than the iPad because it would be oriented towards content creation rather than content consumption.

Then there are many people who get it. For example, Jerry Kaplan, founder of Go Computing, maker of an early slate computer, understood in an interview with Chris O'Brien of the San Jose Mercury News that the iPad is oriented for media consumption, as opposed to the more general purpose Go slate computer. My belief is that the iPad is a new category of device that addresses a new market.

Last year I wrote about Media Convergence, the idea that in the past, each type of media was different. Books were bound paper sold by booksellers, video was delivered as movies in movie theaters and broadcast as television, records were vinyl goods sold in record stores and heard over the radio, magazines were sold by booksellers or delivered by mail, and newspapers had their own content delivery network to ensure that everybody got the previous day's news by the following morning. With the digital revolution, all these different types of media are now the same. They are all just buckets of digital bits that are delivered through the Internet. Given this, the next thing we need is devices for consuming all this media. Audio just needs a device the size of your thumb and headphones, whereas video, books, magazines etc. need a screen that is big enough to see, and that is what the iPad is for.

When thinking about these things, I find it useful to draw up some requirements and use cases and then see how the offered devices match those requirements. Here is what I want from my Personal Information Appliance (PIA - remember that acronym).
  1. Light enough that I can lie in bed and read or view media with it.
  2. Instant on, long battery life, able to handle all media types.
  3. Get media without having to plug it into anything else.
  4. A screen large enough to read or view and small enough to make the device portable.
So how does the iPad match these requirements? At 1.5 pounds it is a little heavier than most "light" reading, but there are plenty of hardback books that weigh more. For the second requirement, Adobe Flash is the major missing media type, however there is probably an app to do that. As for screen size, we are going to have to resign ourselves to having multiple devices with different screen sizes until they work out the technology to project images directly onto the retina.

The funny thing is that even though the iPad is specced as a device for consuming media, it turns out to be capable of much more. Computer games are the newest type of media, and the iPad is a great games platform with a lot of future, as Steve Jobs boasted in the recent iPod announcement event. There are many instances in the business world where it will be useful, for example in sales and marketing for giving a presentation or demonstration to an individual. The other day I was astonished to find my boss using his iPad for email while waiting for his laptop to be repaired.