Just as the NoSQL movement is bringing new approaches to database management, there is a move to a new OLAP, although this movement is just emerging and has not taken a name yet. This month at the SDForum Business Intelligence SIG meeting, Flurry talked about how they put their mobile app usage data in a giant data cube. More recently, Chris Riccomini of LinkedIn spoke to the SDForum SAM SIG about the scalable data cubing system that they have developed. Here is what I learned about Avatara, the LinkedIn OLAP server. DJ Cline has also written a report of the event.
If you do not know what OLAP is, I had hoped to just point to an online explanation, but could not find any that made sense. The Wikipedia entries are pretty deplorable, so here is a short description. Conceptually, OLAP stores data in a multi-dimensional data cube, and this allows users to look at the data from different perspectives in real time. For example, take a simple cube of sales data with three dimensions: a date dimension, a sales person dimension, and a product dimension. In reality, OLAP cubes have more dimensions than this. Each dimension contains a hierarchy, so the sales person dimension groups sales people by state, then sales region, then country. At the base level the cube contains a data point, called a measure, for each sale of each product made by each sales person and the date when the sale was made. OLAP allows the user to look at the data in aggregate, and then drill down on the dimensions. In the example cube, a user could start by looking at the sales of all products grouped by quarter. Then they could drill down to look at the sales in the most recent quarter divided by sales region. Next they could drill down again to look at sales in the most recent quarter by sales person, comparing, say, the Northern region to the Western region, and so on.
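To make the example concrete, here is a minimal sketch in Python of such a cube held as a plain fact list, with an aggregation function standing in for the roll-up and drill-down operations. The quarters, people, products and amounts are invented for illustration.

```python
from collections import defaultdict

# Each fact (measure) is keyed by the three dimensions: date (quarter),
# sales person (with their region), and product. Invented sample data.
facts = [
    ("2010-Q2", ("Northern", "Alice"), "Widget", 1200.0),
    ("2010-Q2", ("Western",  "Bob"),   "Widget",  800.0),
    ("2010-Q3", ("Northern", "Alice"), "Gadget", 1500.0),
    ("2010-Q3", ("Western",  "Bob"),   "Widget",  950.0),
    ("2010-Q3", ("Western",  "Carol"), "Gadget",  400.0),
]

def rollup(facts, key_fn):
    """Aggregate the sales measure over whatever grouping key_fn extracts."""
    totals = defaultdict(float)
    for quarter, (region, person), product, amount in facts:
        totals[key_fn(quarter, region, person, product)] += amount
    return dict(totals)

# All products grouped by quarter ...
print(rollup(facts, lambda q, r, p, prod: q))
# ... drill down to the most recent quarter by sales region ...
q3 = [f for f in facts if f[0] == "2010-Q3"]
print(rollup(q3, lambda q, r, p, prod: r))
# ... and drill down again to individual sales people within each region.
print(rollup(q3, lambda q, r, p, prod: (r, p)))
```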
The new OLAP is enabled by the same forces that are changing databases with NoSQL. Firstly, the rise of commodity hardware that runs Linux, the commodity operating system, allows the creation of cheap server farms that encourage parallel distributed processing. Secondly, the inevitable march of Moore's Law is increasing the size of main memory, so that now you can spec a commodity server with more main memory than a commodity server had in disk space 10 years ago. An OLAP data cube can be partitioned along one or more of its dimensions to be distributed over a server farm, although at this stage partitioning is more of a research topic than standard practice. Huge main memory allows large cubes to reside in main memory, giving near instantaneous response to queries. For another perspective on in-memory OLAP, read the free commentary by Nigel Pendse at the BI-Verdict (it used to be called the OLAP Report) on "What in-memory BI ‘revolution’?"
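As a rough sketch of the partitioning idea (an illustration of the general technique, not a description of any particular product's scheme), each member of one dimension can be hashed to a server so that every server holds an in-memory sub-cube, and a query aggregates locally on each server before merging the partial results:

```python
from collections import defaultdict

NUM_SERVERS = 4  # assumed, illustrative size of the server farm

# Invented facts: (quarter, sales_person, amount)
facts = [
    ("2010-Q2", "Alice", 1200.0), ("2010-Q2", "Bob", 800.0),
    ("2010-Q3", "Alice", 1500.0), ("2010-Q3", "Carol", 400.0),
]

def server_for(member: str) -> int:
    """Partition the cube along the sales person dimension by hashing each member."""
    return hash(member) % NUM_SERVERS

# Route each fact to the server that owns its slice of the partitioned dimension.
partitions = defaultdict(list)
for quarter, person, amount in facts:
    partitions[server_for(person)].append((quarter, person, amount))

# A query aggregates on every server (a loop here stands in for remote calls)
# and then merges the partial results on the coordinator.
def sales_by_quarter(partitions):
    merged = defaultdict(float)
    for server_facts in partitions.values():
        partial = defaultdict(float)
        for quarter, _, amount in server_facts:
            partial[quarter] += amount       # local aggregation on one server
        for quarter, total in partial.items():
            merged[quarter] += total         # merge step
    return dict(merged)

print(sales_by_quarter(partitions))
```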
LinkedIn is a fast growing business-oriented social networking site. They have developed Avatara to support their business needs and currently run several cubes on it. The plan is to open source the code later this year. Avatara is an in-memory OLAP server that uses partitioning to provide scalability beyond the capabilities of a single server.
The presentation was fast paced and it has taken me some time to appreciate the full implications of what was said. Here are some thoughts. Avatara offers an API that is reminiscent of Jolap rather than the MDX language that is the standard way of programming OLAP, probably because an API is easier to implement than a programming language. Avatara does not support hierarchies in its dimensions, but the number of dimensions in a typical cube seems to be higher than usual. It may be that they use more dimensions, rather than hierarchies within a dimension, to represent the same information. This trades roll-up within the cube for slicing on dimensions. Slicing is probably more efficient and easier to implement, while a hierarchy is easier for the user to understand as it allows for drilling up and down.
Chris mentioned that most dimensions are small, and that can be true; however, the real problems with OLAP implementations start when you have more than one large dimension and have to deal with sparsity in the data cube. Chris spent some time on the problem of a dimension with more than 4 billion elements, and this seems to be a real requirement at LinkedIn. Current OLAP servers seem to be limited to 2 billion elements in a dimension, so they are going to be even more constraining than Avatara.
Sunday, October 31, 2010
Sunday, October 24, 2010
Accidental Data Empires
In the new world of big data and analytics, a winning business model is to find a novel way to collect interesting big data. Once you have the data, the ways to exploit it are endless. It is a phenomenon that I have seen several times; the latest example is Flurry, a company that collects and aggregates data from mobile applications. Peter Farago, VP Marketing, and Sean Byrnes, CTO and Co-founder of Flurry, spoke to the October meeting of the SDForum Business Intelligence SIG on "Your Company’s Mobile App Blind Spot".
The Flurry proposition is simple: they offer a toolkit that an app developer combines with their mobile app. The app developer goes to the Flurry website, creates a free account and downloads the toolkit. Whenever an instance of the app with the Flurry code is activated or used, it collects information about the usage and sends it back to Flurry. The amount of information is small, usually about 1.2 kB compressed, so the burden of collection is small. At Flurry, the data is collected, cleansed and put in a gigantic data cube. At any time, an app developer can log into the Flurry website and get reports on how their application is being used. You can get a feel for their service by taking the short Analytics developer tour. Flurry has committed that their Analytics service will always be free.
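As a rough sketch of what that kind of in-app collection might look like (the event fields, app key and endpoint below are hypothetical placeholders, not Flurry's actual API), the embedded code only has to batch a few small events, compress them and post them home:

```python
import gzip
import json
import time
import urllib.request

# Hypothetical fields; the real payload format and collection endpoint are placeholders.
events = [
    {"app_key": "EXAMPLE-APP-KEY", "device_id": "abc123", "event": "session_start", "ts": int(time.time())},
    {"app_key": "EXAMPLE-APP-KEY", "device_id": "abc123", "event": "level_complete", "ts": int(time.time())},
]

payload = gzip.compress(json.dumps(events).encode("utf-8"))
print(f"compressed payload: {len(payload)} bytes")  # a small batch stays on the order of a kilobyte

# Sending it would be a single small HTTP POST (endpoint is a placeholder):
req = urllib.request.Request(
    "https://collector.example.com/ingest",
    data=payload,
    headers={"Content-Encoding": "gzip", "Content-Type": "application/json"},
)
# urllib.request.urlopen(req)  # left commented out because the endpoint is hypothetical
```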
While there are some issues with data collection that Flurry deals with, the quality of the data is great. Every mobile phone has a unique identifier so there is no problem with identifying individual usage patterns. As the service is free, there is very little friction to its use. Flurry estimates that they are in one in five mobile apps that are out there. In fact, for an app developer, the only reason for not using Flurry is that they have chosen to use a rival data collection service.
In the end, however, the big winner is Flurry, who collects huge amounts of information about mobile app and phone usage. In the meeting, Peter Farago gave us many different analyses of where the mobile smartphone market is and where it is going, including adoption rates for iPhones versus Android-based phones and how the follow-on market for apps on each platform is developing. You can get a mouthwatering feel for the information they presented by looking at their blog, in which they publish a series of analyses from their data. As I write, their latest post shows a graph on the "Revenue Shift from Advertising to Virtual Goods Sales", which shows that apps are growing their revenue from sales of virtual goods, while advertising revenue seems to be stagnant.
With data aggregators, there is always something creepy when you discover just how much data they have on you. Earlier this year there was an incident where a Flurry blog post described some details of the iPad, gleaned from apps running on the new devices in the Apple offices, a few days before the iPad was announced. Steve Jobs was so provoked by this that he called out Flurry by name and changed the iPhone app developer terms of service to prevent apps from collecting certain sorts of data. You can read more about this incident in the blog report on the meeting by my colleague Paul O'Rorke.
The title of this piece is a reference to the entertaining and still readable book Accidental Empires by Robert X. Cringely about the birth of the personal computer industry and the rivalry between Steve Jobs and Bill Gates.
Labels:
Analytics,
Apps,
Business Intelligence,
SDForum
Wednesday, October 13, 2010
A Critique of SQL
SQL is not a perfect solution, as I told the audience at the SDForum Business Intelligence SIG September meeting, where I spoke about "Analytics: SQL or NoSQL". The presentation discusses SQL and structured data on the one hand versus the NoSQL movement and semi-structured data on the other. There is more to the presentation than I can fit in one blog post, so here is what I had to say about the SQL language itself. I will write more about the presentation at another time. You can download the presentation from the BI SIG web site.
Firstly the good. SQL has given us a model of a query language that seems so useful as to be essential. Every system that provides persistence has developed a query language. Here is a smattering of examples. The Hibernate object persistence system has Hibernate Query Language (HQL), which has been developed into the Java Persistence Query Language (JPQL). Other Java-based object-oriented persistence systems either use JPQL or their own variant. Hive is a query interface built on top of the Hadoop Map-Reduce engine. Hive was initially developed by Facebook as a simplified way of accessing their Map-Reduce infrastructure when they discovered that many of the people who needed to write queries did not have the programming skills to handle a raw Map-Reduce environment. XQuery is a language for querying a set of XML documents. It has been adopted into the SQL language and is also used with stand-alone XML systems. If data is important enough to persist, there is almost always a requirement to provide a simple and easy to use reporting system on that data. A query language handles the simple reporting requirements easily.
On the other hand, SQL has many problems. Here are my thoughts on the most important ones. The first problem is that SQL is not a programming language, it is a data access language. SQL is not designed for writing complete programs; it is intended to fetch data from the database, and anything more than a simply formatted report is done in another programming language. This concept of a data access language for accessing a database goes back to the original concept of a database as promulgated by the CODASYL committee in the late 1960s.
While most implementations of SQL add extra features to make it a complete programming language, they do not solve the problem, because SQL is a language unlike any of the other programming languages we have. Firstly, SQL is a relational language. Every statement in SQL starts with a table and results in a table. (Table means a table as in a document: a fixed number of columns and as many rows as are required to express the data.) This is a larger chunk of data than programmers are used to handling. The procedural languages that interface to SQL expect to deal with data at most a row at a time. Also, the rigid table of SQL does not fit well into the more flexible data structures of procedural languages.
Moreover, SQL is a declarative language where you specify the desired results and the database system works out the best way to produce them. Our other programming languages are procedural, where you describe to the system how it should produce the desired result. Programming SQL requires a different mindset from programming in procedural languages. Many programmers, most of whom just dabble in SQL as a sideline, have difficulty making the leap and are frustrated by SQL because it is just not like the programming languages that they are used to. The combination of a relational language and a declarative language creates a costly mismatch between SQL and our other programming systems.
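Here is a small sketch of that mismatch using Python and its built-in sqlite3 module: the declarative statement hands back a whole table of per-region totals in one go, while the procedural code on either side of it feeds data in and consumes results a row at a time.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (person TEXT, region TEXT, amount REAL)")
rows = [("Alice", "Northern", 1200.0), ("Bob", "Western", 800.0), ("Carol", "Western", 400.0)]
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", rows)  # fed in row by row

# Declarative: say *what* you want; the engine decides how to compute the whole table.
cursor = conn.execute("SELECT region, SUM(amount) FROM sales GROUP BY region")

# Procedural: the host language consumes the result set one row at a time again.
for region, total in cursor:
    print(region, total)
```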
Finally, SQL becomes excessively wordy, repetitive and opaque as queries become more complicated. Subqueries start to abound, and the need for correlated subqueries, outer joins and pivoting data for presentation causes queries to explode in length and complexity. Analytics is the province of complicated queries, so this is a particular problem for data analysts. In the past I have suggested that persistence is a ripe area for a new programming language; however, although many new languages are being proposed, none of them is concerned with persistence or analytics. The nearest thing to an analytics programming language is R, which is powerful but neither new nor easy to use.
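To illustrate how quickly the queries grow, here is a sketch (again with sqlite3 and an invented sales table): adding a manual pivot with CASE expressions and a correlated subquery for each sales person's largest sale already makes the statement several times longer than the simple aggregate in the previous example.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (person TEXT, quarter TEXT, amount REAL)")
conn.executemany("INSERT INTO sales VALUES (?, ?, ?)", [
    ("Alice", "Q2", 1200.0), ("Alice", "Q3", 1500.0),
    ("Bob",   "Q2",  800.0), ("Bob",   "Q3",  950.0),
])

query = """
SELECT s.person,
       -- manual pivot: one CASE expression per quarter we want as a column
       SUM(CASE WHEN s.quarter = 'Q2' THEN s.amount ELSE 0 END) AS q2_total,
       SUM(CASE WHEN s.quarter = 'Q3' THEN s.amount ELSE 0 END) AS q3_total,
       -- correlated subquery: re-scan the table for each person's largest single sale
       (SELECT MAX(s2.amount) FROM sales s2 WHERE s2.person = s.person) AS biggest_sale
FROM sales s
GROUP BY s.person
"""
for row in conn.execute(query):
    print(row)
```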
Wednesday, October 06, 2010
Vertical Pixels are Disappearing
The quality of monitors for PCs is going backwards. A few years ago, noticing the rise of the widescreen monitor and fearful that all reasonably proportioned monitors would soon disappear, I bought a Samsung Syncmaster 204B (20.1" screen, 1600x1200 pixels). Last night it popped and stopped working. When I went online to research a replacement monitor, the full gravity of the situation became obvious.
Not only is it virtually impossible to find a monitor that is not widescreen, almost all monitors that you can buy, whatever the size of their screen, are 1920x1080 pixels. In the years since I bought the 204B, the number of pixels that we get in the vertical direction has shrunk from 1200 to 1080! Funnily enough, there is a post on Slashdot this morning titled "Why are we losing our vertical pixels" about this topic. The post has drawn many more than the usual number of comments.
For me, the vertical height of the screen is important. I use my computer for reading, writing, programming, editing media and some juggling with numbers. For each activity, having a good height to the screen helps, and width after a certain point does not add much. A television uses 1920x1080 pixels for a full 1080p display. The monitor manufacturers are just giving us monitors made from cheap LCD panels designed for televisions. When I watch TV, I have a much larger screen in another room with more comfortable chairs and more room between me and the screen. Thus, I do not need or want a computer monitor that is expressly designed for watching TV.
The real problem is that 1920x1080 monitors are so ubiquitous that it is difficult to find anything else. After a lot of searching I only found a couple of alternatives. Apple has a 27" widescreen monitor that is 2560x1440 pixels, costs ~$1000, and only works well with some recent Apple systems. Dell has a 20" monitor in their small business section that is 1600x1200 and costs ~$400. However, Dell seems to vary the type of LCD panel that they use between production runs, and one type of panel is a lot better than the other. Unfortunately, you do not know which type of panel you are going to get until it arrives at your door. Neither alternative gets me really excited. One thing is certain: technology is supposed to be about progress, and I am not going backwards and accepting fewer pixels in any dimension for my next monitor.
Thursday, September 30, 2010
Tablet Aspect Ratios
One important issue with tablet computers that is getting little attention is the screen aspect ratio. Some time ago I wrote about "aspect ratio hell" while trying to decide how to crop holiday photographs. The answer seems to be that you have to crop each photograph independently for each way the photograph is going to be output or displayed. For photographs, the variety of different aspect ratios is a perplexing problem that has no good answer.
Tablet computers have the same problem, except that the responsibility lies with app developers, who need to make their app work well with the aspect ratios of their target platforms. The aspect ratio for a tablet needs to take into consideration that it will be used in both portrait and landscape modes. The iPad has an aspect ratio of 4:3 (AR 1.33...), which is the same as the iPod Classic, while the iPhone and iPod touch have an aspect ratio of 3:2 (AR 1.5). Anyone trying to develop apps for Apple products needs to take this difference into account. On the other hand, both Blackberry and Samsung have announced Android-based tablets with a 7 inch screen that has an aspect ratio of 128:75 (AR 1.706...), which is close to 16:9 (AR 1.77...).
When we look to media, television uses 16:9 and most cinema has a higher ratio like 2.40:1, except for IMAX (AR 1.44), which is much squarer. Books and newspapers use a 3:2 ratio (AR 1.5), while magazines tend to be broader, with a lower aspect ratio. Frankly, anything with an aspect ratio of more than 3:2 tends to look unnaturally skinny when viewed in portrait mode. A cell phone can get away with a higher aspect ratio because it has to be pocketable, but a larger device meant for viewing media in both landscape and portrait modes needs to keep its aspect ratio to 3:2 or less. For example, the Kindle, which is mostly used in portrait mode, has an aspect ratio of 4:3 (AR 1.33...). From this point of view, the Samsung and Blackberry tablets seem to be designed to be used in landscape mode and not in portrait mode. I hope that other tablet makers do not make the same mistake.
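The decimal aspect ratios quoted above follow from simple division; this little snippet just recomputes them so the comparison is easy to see at a glance.

```python
# Recompute the decimal aspect ratios mentioned in the post, sorted from squarest to widest.
ratios = {
    "iPad / iPod Classic / Kindle (4:3)": 4 / 3,
    "IMAX":                               1.44,
    "iPhone / iPod touch / books (3:2)":  3 / 2,
    "7-inch Android tablets (128:75)":    128 / 75,
    "Television (16:9)":                  16 / 9,
    "Widescreen cinema (2.40:1)":         2.40,
}
for name, ar in sorted(ratios.items(), key=lambda kv: kv[1]):
    print(f"{name:38s} AR {ar:.3f}")
```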
Saturday, September 04, 2010
Understanding the iPad
Some people still struggle to understand the iPad. When it was first announced, there were shrieks of outrage from techies, complaining that it was not a free and open computer system and so nobody should buy one. Then it came out and was adopted by the millions. Steve Ballmer, CEO of Microsoft, expressed dismay that the iPad is easily outselling any tablet computer that Microsoft has ever had a hand in. More recently, an executive from LG told the Wall Street Journal that they would bring out a tablet that would be better than the iPad because it would be oriented towards content creation rather than content consumption.
Then there are many people who get it. For example, Jerry Kaplan, founder of Go Computing, maker of an early slate computer, understood in an interview with Chris O'Brian of the San Jose Mercury News that the iPad is oriented for media consumption, as opposed to the more general purpose Go slate computer. My belief is that the iPad is a new category of device that addresses a new market.
Last year I wrote about Media Convergence, the idea that in the past, each type of media was different. Books were bound paper sold by booksellers, video was delivered as movies in movie theaters and broadcast as television, records were vinyl goods sold in record stores and heard over the radio, magazines were sold by booksellers or delivered by mail, and newspapers had their own content delivery network to ensure that everybody got the previous day's news by the following morning. With the digital revolution, all these different types of media are now the same. They are all just buckets of digital bits that are delivered through the Internet. Given this, the next thing we need is devices for consuming all this media. Audio just needs a device the size of your thumb and headphones, whereas video, books, magazines etc. need a screen that is big enough to see, and that is what the iPad is for.
When thinking about these things, I find it useful to draw up some requirements and use cases and then see how the offered devices match those requirements. Here is what I want from my Personal Information Appliance (PIA - remember that acronym).
The funny thing is that even though the iPad is specced as a device for consuming media, it turns out to be capable of much more. Computer games are the newest type of media, and the iPad is a great games platform with a lot of future, as Steve Jobs boasted in the recent iPod announcement event. There are many instances in the business world where it will be useful, for example in sales and marketing for giving a presentation or demonstration to an individual. The other day I was astonished to find my boss using his iPad for email while waiting for his laptop to be repaired.
- Light enough that I can lie in bed and read or view media with it.
- Instant on, long battery life, able to handle all media types.
- Get media without having to plug it into anything else.
- A screen large enough to read or view and small enough to make the device portable.
Tuesday, August 31, 2010
Software Update Business Models
These days software updates are a fact of life. If we do not keep our software up to date we risk all sorts of horrendous infections and debilitating attacks. Unfortunately, the providers of our software know this and are starting to use software update to make money or at least remind us that they exist. I have done several software updates recently and noticed this in action.
Adobe just wants to remind me of their presence, so they insist on putting a shortcut to the Adobe Reader on my desktop every time they update. This is relatively benign, as it is a matter of a few seconds at most to confirm that it is a shortcut and delete it. Apple is more pushy. I expect to get a new version of iTunes any day now, and I will need to carefully uncheck boxes to ensure that I do not get several more applications than I want. Most insidious is Java, now owned by Oracle. On one system they offered me the Yahoo toolbar; on another system, which already had the Yahoo toolbar, they offered me some other software, so they obviously look to see what is installed to guide the offer. Judging by the fact that these offers were for third party software, I am sure that they get some sort of compensation for it.
Soon we will see advertisements and offers in the installer, and new ways to confuse us. The tactic that always gets me is to require some input that I forget to fill in, then when I go back to fill in this information, all the boxes I so carefully unchecked have been mysteriously filled in again. In a hurry, I just click "Install" not noticing that I am now getting all the extras that I had carefully tried to avoid. It is coming to a computer near you soon.
Saturday, August 28, 2010
Mad Skills for Big Data
Big Data is a big deal these days, so it was with great interest that we welcomed Brian Dolan to the SDForum Business Intelligence SIG August meeting to speak on "MAD Skills: New Analysis Practices for Big Data". MAD is an acronym for Magnetic, Agile, Deep, and as Brian explained, these skills are all important in handling big data. Brian is a mathematician who came to Fox Interactive Media as Lead Analyst. There he had to help the marketing group decide how to price and serve advertisements to users. As they had tens of millions of users that they often knew quite a lot about, and served billions of advertisements per day, this was a big data problem. They used a 40-node Greenplum parallel database system and also had access to a 105-node map-reduce cluster.
The presentation started with the three skills. Magnetic means drawing the analyst in by giving them free rein over their data and access to use their own methods. At Fox, Brian grappled with a button-down DBA to establish his own private sandbox where he could access and manipulate his own data. There he could bring in his own data sets, both internal and external. Over time the analyst group established a set of mathematical operations that could be run in parallel over the data in the database system, speeding up their analyses by orders of magnitude.
Agile means analytics that adjust, react and learn from your business. Brian talked about the virtuous cycle of analytics, where the analyst first acquires new data to be analyzed, then runs analytics to improve performance, and finally adjusts business practices to suit. He talked through the issues at each step in the cycle and led us through a case study of audience forecasting at Fox which illustrated problems with sampling and scaling results.
Deep analytics is about producing more than reports. In fact, Brian pointed out that even data mining can concentrate on finding a single answer to a single problem, whereas big analytics needs to solve millions of problems at the same time. For example, he suggested that statistical density methods may be better at dealing with big analytics than other more focused techniques. Another problem with deep analysis of big data is that, given the volume of data, it is possible to find data that supports almost any conclusion. Brian used the parable of the Zen Tea Cup to illustrate the issue. The analyst needs to approach their analysis without preconceived notions or they will just find exactly what they are looking for.
Of all the topics that came up during the presentation, the one that caused the most frissons with the audience was dirty data. Brian's experience has been that cleaning data can lose valuable information and that a good analyst can easily handle dirty data as a part of their analysis. When pressed by an audience member he said "well 'clean' only means that it fits your expectation". As an analyst is looking for the nuggets that do not meet obvious expectations, sanitizing data can lose those very nuggets. The recent trend to load data and then do the cleaning transformations in the database means that the original data is in the database as well as the cleaned data. If the original data is saved, the analyst can do their analysis with either data set as they please.
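Here is a minimal sketch of that load-then-clean pattern using sqlite3 (the table, columns and cleaning rule are invented for illustration): the raw rows stay in the database and the "clean" version is just a view over them, so the analyst can query either one.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id TEXT, revenue TEXT)")
conn.executemany("INSERT INTO raw_events VALUES (?, ?)", [
    ("u1", "12.50"), ("u2", "not-a-number"), ("", "3.00"),  # deliberately dirty rows
])

# Cleaning is a transformation *inside* the database; the raw data is never thrown away.
conn.execute("""
CREATE VIEW clean_events AS
SELECT user_id, CAST(revenue AS REAL) AS revenue
FROM raw_events
WHERE user_id <> '' AND revenue GLOB '[0-9]*'
""")

print(list(conn.execute("SELECT COUNT(*) FROM raw_events")))    # all rows, warts and all
print(list(conn.execute("SELECT COUNT(*) FROM clean_events")))  # only rows that fit expectations
```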
Mad Skills also refers to the ability to do amazing and unexpected things, especially in motocross motorbike riding. Brian's personal sensibilities were forged more in punk rock, so you could say that he showed us the "kick out the jams" approach to analytics. You can get the presentation from the BI SIG web site. The original MAD Skills paper was presented at the 2009 VLDB conference and a version of it is available online.
Labels:
Analytics,
Business Intelligence,
SDForum,
Web Analytics
Monday, August 23, 2010
End of Moore's Law
The recent announcement that Intel is buying McAfee, the security software company, has the analysts and pundits talking. The ostensible reason for the deal is that Intel wants the security company to help them add security to their chips. Now, while security is important, I do not believe that is the reason Intel bought McAfee. In my opinion, this purchase signals that Intel sees the coming end of Moore's Law.
In 2005, the Computer History Museum celebrated 40 years of Moore's Law, the technology trend that every 2 years the number of transistors on a silicon chip, and thus its capabilities, doubles. On the stage Gordon Moore told us that throughout the 40 years, "they have always been able to see out about 3 generations of manufacturing technology", where each generation is about 2 years. So Intel can see its technology path for about the next 6 years. At that time Moore told us that they could still see how they were going to carry on Moore's Law for the next three generations.
Now what would happen if Intel looked 6 years into the future and saw that it was no longer there? If they could see the end of Moore's Law, it would mean that they would no longer have the ability to create new and more powerful chips to keep their revenue growing. I believe that they would start looking to buy up other profitable companies in related lines of business to diversify their revenue.
McAfee is a large security software company; its main business is selling security solutions to large enterprises. If Intel had wanted to buy security technology, they could have gone out and bought a security start-up with better technology than McAfee for a few hundred million dollars. Instead they are spending an expensive 8 billion dollars on an enterprise security software company. This deal does not make sense for the reasons given; however, it does make sense if Intel wants to start buying its way into other lines of business.
Now there are many reasons that Intel might want to diversify its business. Perhaps they see the profitable sales of processor chips disappearing as chips gain so many transistors that they do not know what to do with them. However, the most likely reason is that they can see the end of Moore's Law and that it is now time to move on and add some other lines of business.
Saturday, August 14, 2010
Analytics at Work
Analytics has become a major driving force for competitive advantage in business. The new book "Analytics at Work: Smarter Decisions, Better Results" by Thomas H. Davenport, Jeanne G. Harris and Robert Morison discusses what analytics can do for a business, how to manage analytics and how to make a business more analytical.
Analytics at Work has a useful introductory chapter and then divides into two parts. The first part discusses five major aspects of analytics in a business environment. The second part looks at the lifecycle of managing analytics in a business. The organization is good and there is no overlap between the topics in each part; however, the order in which the information is presented seems designed to put the reader off.
The first part starts with a plodding chapter on what needs to be done to get the data organized and related topics, followed by a diffuse chapter called Enterprise. The interesting chapters in this part are the last two chapters. The Targets chapter discusses the important topic of picking targets for analytics. The Analysts chapter discusses how to effectively employ and organize analysts in a large enterprise. Similarly the second part of the book starts with a plodding chapter on how to Embed Analytics in Business Processes, followed by much more inspiring chapters on building an analytical culture, and the need to continually review a business comprehensively as part of an analytics push. If you find yourself stuck reading the book, try skipping to one of the interesting chapters that I have indicated.
Scattered throughout the book are many useful tools. In the introductory chapter there are the six key questions that an analyst asks. We come back to these questions from several places in the book. Running throughout the book is a five-step capability maturity model for judging how analytical an organization is and showing the path to making the organization more analytical. Each chapter in the first part ends with a discussion on how to take that aspect of the organization through the five steps.
It is important to understand the target audience. The book is aimed at senior management and executives, particularly in large enterprises. While the book contains many brief case studies as inspiration and it touches on all the important management issues that need to be considered, it does not go into great depth about what analytics is or the specific analytical techniques and how they can be used. This is not a book for analysts, unless they have ambitions to grow their career beyond analytics. I recommend this book to anyone in the target audience who wants to grow their organization's analytics capabilities.