Monday, September 29, 2008

42 Revisited

Last week TechCrunch had a post on the State of The Blogosphere: The More You Post, The Higher You Rank. One statistic is that the top 100 bloggers post on average 310 times a month, which sounds quite exhausting. As you know, I post 42 times a year. I am going to promise to my faithful reader that I will stick to my pace. You will not get an unreadable avalanche of overlapping verbiage from this blog.

If I have not posted much recently, it is because I have spent a lot of time reading blog posts on the financial crisis. It is very entertaining to see these extraordinary events unfold around us. Who would have thought that George W. Bush would be known to future generations as the President who nationalized the American financial services industry?

Wednesday, September 17, 2008

SaaS Data Integration

Data integration is the problem of gathering data, perhaps from many different applications, for the purpose of analyzing the data as a whole. Mike Pittaro, Co-Founder of SnapLogic, spoke at the SDForum Business Intelligence SIG September meeting on "Enhancing SaaS Applications Through Data Integration with SnapLogic".

The big players in data integration are Informatica and Ascential (now IBM Information Integration), who sell large, expensive, and complex products. Because of the cost, these products are often not used, particularly for one-off projects, which are common. Mike helped found SnapLogic in 2005 to bring a new perspective to data integration. SnapLogic is an open source framework and therefore both affordable and extensible by its users.

He showed us the complexity of data integration. It involves dealing with many different access protocols and multiple ways of getting the data, and each type of data has its own metadata format. He contrasted this with the World Wide Web, where huge amounts of data are pulled back and forth every day without interoperability problems. There are almost 200 million web sites and billions of users, yet the World Wide Web is completely decentralized, with a heterogeneous model that allows for different operating systems, servers, client software applications, and frameworks, all of which are compatible and interoperable.

The World Wide Web is based on open standards and protocols and an architectural principle called REST, which stands for REpresentational State Transfer. REST deals with data resources in standardized representations, with each resource identified by a unique identifier such as a URL.

SnapLogic builds on this by turning data sources into standard web resources. With SnapLogic you configure a server to extract data from a datasource like a file or database and transform the data into the form you want. The server presents the datasource as a standard web resource with a URL. These servers are the building blocks for a data integration application.
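To make the idea concrete, here is a minimal sketch in plain Python of a data source exposed as a web resource. It is not SnapLogic code and does not use SnapLogic's API; the CSV data, the /customers path, and the function names are all made up for illustration. It only shows the general shape of the approach: extract rows from a source, transform them, and present the result at a URL.

```python
import csv
import io
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

# Hypothetical data source; in practice this might be a file or a database query.
CSV_SOURCE = "id,name,amount\n1,alice,10.50\n2,bob,7.25\n"


def extract(source_text):
    """Read rows from the CSV source as dictionaries of strings."""
    return list(csv.DictReader(io.StringIO(source_text)))


def transform(rows):
    """Reshape rows into the form the consumer wants (typed, tidied fields)."""
    return [
        {"id": int(r["id"]), "name": r["name"].title(), "amount": float(r["amount"])}
        for r in rows
    ]


def render(path):
    """Map a URL path to (status, JSON body). Only /customers is defined (assumed name)."""
    if path == "/customers":
        return 200, json.dumps(transform(extract(CSV_SOURCE)))
    return 404, json.dumps({"error": "no such resource"})


class ResourceHandler(BaseHTTPRequestHandler):
    """Answer GET requests with the JSON representation of the resource."""

    def do_GET(self):
        status, body = render(self.path)
        payload = body.encode("utf-8")
        self.send_response(status)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(payload)))
        self.end_headers()
        self.wfile.write(payload)


def serve(port=8000):
    """Serve the resource on localhost until interrupted."""
    HTTPServer(("localhost", port), ResourceHandler).serve_forever()
```

Calling serve() and then fetching http://localhost:8000/customers with any web client would return the transformed rows as JSON. Because the output is just another web resource, a second server of the same kind could use it as its own input, which is the sense in which such servers could be chained together.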

Thursday, September 04, 2008

Chrome

On Tuesday, Google announced their new browser Chrome. Although it has generated huge discussions in various forums and an astonishing adoption rate, I am not going to rush to use it. In fact, I think I will wait until it is out of beta before considering whether to adopt it. That should give me many years before I have to even think about making a change!

Wednesday, September 03, 2008

A Tale of Two Search Engines

At the SDForum Software Architecture and Modeling (SAM) SIG last week, John D. Mitchel, Mad Scientist at MarkMail and previously Chief Architect of Krugle, talked about the architectures of the search engines that he has built for these two companies.

Krugle is a search engine for code and all related programming artifacts. The public engine indexes all the open source software repositories on the web. This system was built a couple of years ago with cheap off-the-shelf commodity hardware, open source software, and Network Attached Storage (NAS). In total it has about 150 computer systems in its clusters. The major software components are Lucene (search engine), Nutch (web crawling, etc.), and Hadoop (distributed file system and map-reduce). These are all Open Source projects written in Java and sponsored by the Apache Software Foundation. Krugle sells an enterprise edition that can index and make available all source code in an enterprise.
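For readers unfamiliar with how map-reduce and a search index fit together, here is a toy sketch in plain Python. It does not use the Lucene, Nutch, or Hadoop APIs and says nothing about how Krugle actually wires them up; the documents and function names are invented for illustration. The map step emits (word, document) pairs, the reduce step groups them into an inverted index, and a search intersects the document lists for each query word.

```python
from collections import defaultdict


def map_phase(doc_id, text):
    """Map step: emit a (word, doc_id) pair for every word in a document."""
    for word in text.lower().split():
        yield word, doc_id


def reduce_phase(pairs):
    """Reduce step: group pairs by word into an inverted index."""
    index = defaultdict(set)
    for word, doc_id in pairs:
        index[word].add(doc_id)
    return {word: sorted(ids) for word, ids in index.items()}


def build_index(docs):
    """Run the map step over every document, then reduce the combined output."""
    pairs = []
    for doc_id, text in docs.items():
        pairs.extend(map_phase(doc_id, text))
    return reduce_phase(pairs)


def search(index, query):
    """Return the ids of documents that contain every word in the query."""
    words = query.lower().split()
    if not words:
        return []
    result = set(index.get(words[0], []))
    for word in words[1:]:
        result &= set(index.get(word, []))
    return sorted(result)


# Hypothetical documents standing in for crawled source files.
DOCS = {
    "a.java": "public class Parser",
    "b.java": "public interface Lexer",
    "c.java": "class Parser extends Lexer",
}
INDEX = build_index(DOCS)
```

With these made-up documents, search(INDEX, "class parser") returns the two files that contain both words. In a real cluster the map and reduce steps would run in parallel across many machines, which is the part a framework like Hadoop takes care of.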

MarkMail is a search engine for email. It indexes all public mailing lists and is a technology demonstrator for the MarkLogic XML Content Server. MarkMail is built with newer, more capable hardware. It uses a Storage Area Network (SAN) for storage, which offers higher performance than NAS at a greater cost. The MarkMail search system is built on about 60 computer systems in its clusters.