Wednesday, September 17, 2008

SaaS Data Integration

Data integration is the problem of gathering data, perhaps from many different applications, for the purpose of doing some analysis of the data as a whole. Mike Pittaro, Co-Founder of SnapLogic, spoke to the SDForum Business Intelligence SIG September meeting on "Enhancing SaaS Applications Through Data Integration with SnapLogic".

The big players in data integration are Informatica and Ascential (now IBM Information Integration), who sell large, expensive and complex products. Because of the cost, these products are often not used, particularly for the one-off projects that are so common. Mike helped found SnapLogic in 2005 to bring a new perspective to data integration. SnapLogic is an open source framework and therefore both affordable and extensible by its users.

He showed us the complexity of data integration. It involves dealing with many different access protocols and multiple ways of getting the data, and each type of data has its own metadata format to describe it. This he contrasted with the World Wide Web, where huge amounts of data are pulled back and forth every day without interoperability problems. There are almost 200 million web sites and billions of users, yet the World Wide Web is completely decentralized, with a heterogeneous model that allows for different operating systems, servers, client software applications and frameworks, all of which are compatible and interoperable.

The World Wide Web is based on open standards and protocols and an architectural principle called REST, which stands for REpresentational State Transfer. REST deals with data resources in standardized representations, with each resource identified by a unique identifier such as a URL.

SnapLogic builds on this by turning data sources into standard web resources. With SnapLogic you configure a server to extract data from a datasource like a file or database and transform the data into the form you want. The server presents the datasource as a standard web resource with a URL. These servers are the building blocks for a data integration application.
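To make the idea concrete, here is a minimal sketch in Python of a data source presented as a web resource. It is not SnapLogic's API, just the general pattern using the standard library. The sample CSV data, the /customers path and all of the function names are assumptions made up for the illustration.

```python
import csv
import io
import json
import threading
import urllib.request
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical data source; in practice this could be a file or a database query.
CSV_SOURCE = "id,name\n1,Ada\n2,Grace\n"


def csv_to_records(text):
    """Transform raw CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


class ResourceHandler(BaseHTTPRequestHandler):
    """Serve the transformed data source at the made-up URL path /customers."""

    def do_GET(self):
        if self.path != "/customers":
            self.send_error(404)
            return
        body = json.dumps(csv_to_records(CSV_SOURCE)).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, fmt, *args):
        pass  # keep the demo quiet


def fetch_customers():
    """Start the server on a free local port, GET the resource once, shut down."""
    server = HTTPServer(("127.0.0.1", 0), ResourceHandler)
    thread = threading.Thread(target=server.serve_forever, daemon=True)
    thread.start()
    try:
        url = "http://127.0.0.1:%d/customers" % server.server_address[1]
        with urllib.request.urlopen(url) as response:
            return json.loads(response.read().decode("utf-8"))
    finally:
        server.shutdown()
        server.server_close()
```

Calling fetch_customers() retrieves the rows with a plain HTTP GET, so any client that can speak HTTP can consume the data without knowing that it started life as a CSV file.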

Thursday, September 04, 2008

Chrome

On Tuesday, Google announced their new browser Chrome. Although it has generated huge discussions in various forums and an astonishing adoption rate, I am not going to rush to use it. In fact, I think I will wait until it is out of beta before considering whether to adopt it. That should give me many years before I have to even think about making a change!

Wednesday, September 03, 2008

A Tale of Two Search Engines

At the SDForum Software Architecture and Modeling (SAM) SIG last week, John D. Mitchel, Mad Scientist at MarkMail and previously Chief Architect of Krugle, talked about the architectures of the search engines that he has built for these two companies.

Krugle is a search engine for code and all related programming artifacts. The public engine indexes all the open source software repositories on the web. This system was built a couple of years ago with cheap off-the-shelf commodity hardware, open source software and Network Attached Storage (NAS). In total it has about 150 computer systems in its clusters. The major software components are Lucene (search engine), Nutch (web crawling, etc.) and Hadoop (distributed file system and map-reduce). These are all Open Source projects written in Java and sponsored by the Apache Software Foundation. Krugle sells an enterprise edition that can index and make available all source code in an enterprise.

MarkMail is a search engine for email. It indexes all public mailing lists and is a technology demonstrator for the MarkLogic XML Content Server. MarkMail is built with newer hardware that is more capable. It uses a Storage Area Network (SAN) for storage, which offers higher performance at a greater cost than NAS. The MarkMail search system is built on about 60 computer systems in its clusters.

Saturday, August 16, 2008

Windows Woes

For years it seemed like a good idea: Microsoft produced the software and many vendors sold compatible hardware. Competition kept the hardware innovation flowing and prices low. Then Microsoft turned into a big bloated monopoly that could not create a decent product if it tried. Moreover, Microsoft is not really in control; it is itself hostage to other interests. The result is a horrible user experience. Here are a couple of my recent experiences.

A few months ago I bought a new video card so that I could use the digital input to the monitor. Installing the card was a breeze and the digital input makes the monitor noticeably sharper. The only problem was that the sound had stopped working. After a couple of hours of head scratching and vigorous Googling, the problem turned out to have been caused by Hollywood.

The connection between a computer and its display uses HDMI, a digital interconnect standard that can transmit both video and audio. This allows a PC to connect to a digital television as well as a simple display. It also allows the video and audio content to be encrypted so that you cannot steal it from your own computer. This was mandated by Hollywood and Microsoft meekly acquiesced to it so that they could provide media center software that would display Hollywood movies in high definition.

So after the video card installation, Windows assumed that I was going to use the digital audio output on the video card and ignored all other audio output devices, even though my display does not have any speakers. I had to go into the BIOS and change some low level settings for sound so that Windows would allow me to select the sound settings that I had been using before installing the video card. Any time you have to go into the BIOS to change settings, the user experience loses.

More recently my brother and his family came to visit during a tour of California. He wanted to unload all the pictures on his camera's flash card and write them to a CD, as the flash card was full. I suggested the easy way out, a visit to Fry's Electronics to buy another flash card, but that was deemed too much trouble. In practice it would have been much easier.

We downloaded the flash card to my PC. The first difficulty is that you are presented with a list of 6 competing programs that want to download your pictures. Which one should I use? I know that in practice they are all going to put the pictures in some ridiculous place where you can never find them again (that is the subject of another tirade). I chose the first in the list which happened to be compatible with the brand of digital camera.

The next problem came when we went into Windows Explorer so that we could drag the pictures to the CD ROM folder. Every time we went into the folder where the pictures were, Explorer exited saying that it had an unexpected fault. I knew exactly what the problem was because I had seen it before. There were some movie files taken with the digital camera, and Windows has a problem with these movie (.avi) files. For some reason, Explorer tries to open every file in a folder when it enters the folder, even though I had set it to just list the files and not display thumbnails.

The fix was to open a DOS window, navigate to the folder with the files and rename them so that Windows would not think they were media files. I added the extension .tmp to each .avi file by laborious typing. Then it was possible to do the intuitive drag and drop with Explorer to make a CD ROM. Any time you have to resort to using a DOS window to do a straightforward function in Windows, usability has gone out the window.
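For anyone who wants to avoid the laborious typing, the same rename can be scripted. This is a minimal Python sketch rather than what I actually did at the time, and the function name hide_avi_files is made up for the illustration.

```python
from pathlib import Path


def hide_avi_files(folder):
    """Append .tmp to every .avi file in folder and return the new file names."""
    renamed = []
    # sorted() takes a snapshot of the listing before any file is renamed.
    for path in sorted(Path(folder).iterdir()):
        if path.is_file() and path.suffix.lower() == ".avi":
            target = path.with_name(path.name + ".tmp")
            path.rename(target)
            renamed.append(target.name)
    return renamed
```

Point it at the folder holding the pictures, and the movie files come out as something.avi.tmp while the pictures are left untouched.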

I could go on (as I have in the past), as there have been more problems. With each problem the Apple alternative looks better. Apple is by no means perfect, but the Apple OS is built on a better foundation, and the new features that arrive with each version are both useful and innovative.

Thursday, July 17, 2008

A Gentle Introduction to R

We were given a gentle introduction to the R statistical programming language and its application in Business Intelligence at the July meeting of the SDForum Business Intelligence SIG. The speakers were Jim Porzac (Senior Director of Analytics at Responsys) and Michael Driscoll (Principal at Dataspora). Jim has posted the presentation here.

R is an Open Source project that uses the GNU license. It has a growing user base with a strong support community and a user group (called UseR Group - try Googling that). There are now almost 1500 packages for the language that support various statistical techniques and specialized application areas. Packages include: Bayesian, Econometrics, Genetics, Machine Learning, Natural Language Processing, Pharmacokinetics, Psychometrics, which gives some idea of the range of subjects and techniques that R covers.

Jim did most of the talking, introducing the language and showing us some examples of its use. One example is his data quality package that he uses on each new dataset that he receives for analysis at Responsys. Another example showed off R's reporting capabilities, while a third showed sophisticated graphs and plots used for customer segmentation analysis. Michael showed us how he used R to do some interesting and very practical analyses of baseball statistics.

The audience probed R's strengths and weaknesses. R has the connectivity to get data for analysis from databases and other sources. R also has excellent graphing and reporting capabilities. Currently R works by reading data into memory where it is manipulated, which limits the maximum size of data set that can be analyzed to the many Gigabyte range.

One person asked for a comparison with SAS. R has the advantages of being free and of having an enthusiastic user base that keeps it on the cutting edge. Also R is a more coherent language than SAS, which is a collection of libraries, each of which may be very good but which do not necessarily make a whole.

Jim and Michael are starting a Bay Area chapter of the UseR Group. If you are interested, contact Jim Porzac at Responsys.

Wednesday, July 09, 2008

Social Search

The SDForum Search SIG pulled together an A-List panel for their July meeting on Social Search. Moderator Safa Rashtchy hosted Bret Taylor of FriendFeed, Ari Steinberg of Facebook, Jason Calacanis of Mahalo and Jeremie Miller of Wikia Search. Of the panelists, Jason Calacanis had the most to say, was arguably the most interesting and definitely the most opinionated. He also recorded the event with the camera in his MacBook Air. Valleywag has a better and more concise video excerpt of Jason in action, though it is shaded by their desire to capture controversy.

Facebook and FriendFeed are working on automated search within their social networks, while Mahalo and Wikia Search are working on improving general search by using people to curate the results. Mahalo is paying people, while Wikia Search is trying to use the Wikipedia model of free community involvement.

Most of the audience questions to the panel were about their business models and monetization. I tried to get into technicalities by asking a question about Search Quality, there was a question on privacy, and one audience member argued that none of the panelists' companies were doing social search as he defined it.

Saturday, June 21, 2008

Master Data Management - What, Why, How, Who?

I got two interesting things out of Ramon Chen's talk on Master Data Management (MDM) to the SDForum Business Intelligence SIG June meeting. Ramon is VP Product Marketing at Siperian. The first is the important notion of Data Governance and, as part of governance, the emerging role of the Data Steward. The second is that the big enterprise software vendors are circling.

Large organizations, companies and governments collect vast amounts of information, and Data Governance is the process of looking after that data. First is the problem of cataloging all the data that the organization has. Next, there may be different versions of the same data that need to be reconciled, and the quality of the data needs to be ensured. Finally there is the question of deciding who has access to different parts of the data and ensuring that it is correctly secured. A Data Steward is a person who is responsible for some part of the data.

Ramon had some specific examples of problems with data. One is in the medical field, where gifts to doctors are highly regulated. The problem is in identifying a specific doctor, particularly where a father and son with similar names share a practice, which is not uncommon. Another problem is security. Siperian has implemented security down to the cell level to ensure that each user can only see data that they are allowed to look at.

Ramon also described how MDM software vendors are being consolidated by data providers and the big enterprise software vendors. For example, Purisma, who presented to the BI SIG a couple of years ago, was bought by Dun & Bradstreet last year. IBM has been particularly active in buying small MDM related software vendors, but SAP, Microsoft and Oracle have also bought companies in this area recently.

Thursday, June 12, 2008

Flex, ActionScript, MXML?

It has been over a week since James Ward and Chet Haase of Adobe gave a talk on Flex to the SDForum Java SIG, and I am still trying to get my mind around what it all means. Adobe has a bunch of technologies and products in the Rich Internet Application (RIA) area, but it is difficult to work out what they are, how they fit together and which one I should use for any particular application. Here is the story as I understand it.

Let's start with the programming language ActionScript. ActionScript is ECMAScript, which is JavaScript. There are differences in implementation between ActionScript and other forms of JavaScript, however most of the difference is in the Document model, which can be called the API, but is more like the object environment in which the program executes. JavaScript programs execute in a web page defined by the Document Object Model (DOM). ActionScript started as the language of the Flash player, so it is more oriented to constructing an environment, and this leads to some differences in the objects that it can use.

MXML is an XML-based declarative language that compiles into ActionScript. Basically it is a shorthand for defining the static parts of an ActionScript environment. By the way, when I entered MXML into the Adobe site search engine, the first thing that came back was the question "Do you mean MSXML?", where MSXML is a Microsoft technology.

Next we come to the runtime environments. The Flash player is a lightweight client that executes compiled ActionScript and is most commonly deployed as a browser plug-in (as opposed to, for example, a Browser which contains a JavaScript interpreter). Adobe AIR is a larger and more capable stand-alone client for executing compiled ActionScript as well as HTML, JavaScript, etc. (as opposed to, for example, a Browser which is a client for interpreting HTML, JavaScript, et al).

Flex is the framework, which means that it is an overarching name for the whole pile of technology. The one piece of technology called Flex is Flex Builder, the Eclipse based development environment for ActionScript and MXML. As they have done with other products, Adobe has open sourced a lot of technology surrounding Flex to bring more developers to the platform.

Overall, I am not sure which is more impressive, the melange of technology in Adobe Flex or the marketing effort that tries to make the whole melange of technology seem like one coherent whole.

Monday, June 09, 2008

New iPhone - New Business Model

Steve Jobs announced the widely anticipated new iPhone at the WWDC today. I have seen a lot of comments on features and price, but nothing interesting on the new business model. Here is my take.

In the old business model, Apple and ATT sold the iPhone at full price and, in a highly unusual arrangement, ATT shared its ongoing revenue with Apple. Now Apple and ATT sell the iPhone at a discount. ATT presumably pays Apple for each phone they sell, but there is no ongoing revenue sharing. We will have to see exactly how this plays out when the iPhone goes on sale. It may well be that you have to sign up with ATT to unlock the phone when you register it.

Apple still has a couple of revenue streams which are unusual concessions from a mobile-phone company, especially in the USA. First, Apple gets to sell all the media and games on the phone through its iTunes store. Songs are still $1, movies and TV shows range from $2 to $5, games and applications range from free to $10. This is a useful revenue stream even though it has a margin of only 20% to 30%.

More interesting is the MobileMe storage and syncing service that costs $100 a year. Verizon charges me $10 to move my phone list from an old phone to a new phone when I have to buy a new one. Nobody there or at any other phone company thought of charging $100 for making this service continuously available. At the same time it is a great idea that many have picked up on as a good reason to get the iPhone.

The only problem with MobileMe is the ridiculously small storage capacity of 20GB. The phone has 8GB or 16GB. What is the point of having a backing store that is about the same size as my phone? Particularly as storage is not that expensive these days. Google Apps offers 25GB for $50 per year; Apple ought to offer something equivalent.
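The comparison is easy to put on a per-gigabyte footing. Here is a small Python sketch of the arithmetic using only the prices and capacities quoted above; the function name is made up for the illustration.

```python
def dollars_per_gb_year(price_per_year, capacity_gb):
    """Yearly price divided by storage capacity, in dollars per GB per year."""
    return price_per_year / capacity_gb


mobileme = dollars_per_gb_year(100, 20)    # MobileMe: $100 a year for 20GB
google_apps = dollars_per_gb_year(50, 25)  # Google Apps: $50 a year for 25GB

print("MobileMe:    $%.2f per GB per year" % mobileme)
print("Google Apps: $%.2f per GB per year" % google_apps)
print("MobileMe costs %.1f times as much per GB" % (mobileme / google_apps))
```

On these figures MobileMe works out at $5 per gigabyte per year against $2 for Google Apps, although MobileMe bundles syncing as well as storage, so the two are not strictly like for like.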

Apart from that, the new business model keeps Apple ahead of the game which is exactly where it needs to be.

Monday, June 02, 2008

Still HDTV - Not

It is old news to me, but HDTV has still to turn on viewers. While I was working out at the gym the other day and listening to Harry Shearer's Le Show, he mentioned recent research conducted by the Scripps Network showing that large numbers of viewers who receive an HDTV feed continue to watch the same content in standard definition.

I posted about this problem a couple of years ago. Since then I have successfully trained my family to watch HDTV when it is available, but it took some work to make them HD aware. Still, we do not get a lot of channels in HD and the channels are still in the same obscure 700 range of the "dial".

There is some good programming in HD. Last night we watched 2001: A Space Odyssey on the Universal HD channel and it was stunning. Fortunately there were only a few commercial breaks, because the commercials came in so much louder than the film we had to mute the entire commercial break to remain sane.