At the SDForum Software Architecture and Modeling (SAM) SIG last week, John D. Mitchel, Mad Scientist at MarkMail and previously Chief Architect of Krugle, talked about the architectures of the search engines that he built for these two companies.
Krugle is a search engine for code and all related programming artifacts. The public engine indexes all the open source software repositories on the web. This system was built a couple of years ago with cheap off-the-shelf commodity hardware, open source software, and Network Attached Storage (NAS). In total, its clusters contain about 150 computer systems. The major software components are Lucene (search engine), Nutch (web crawling and related tasks), and Hadoop (distributed file system and map-reduce). These are all open source projects written in Java and sponsored by the Apache Software Foundation. Krugle also sells an enterprise edition that can index all the source code in an enterprise and make it available.
MarkMail is a search engine for email. It indexes all public mailing lists and is a technology demonstrator for the MarkLogic XML Content Server. MarkMail is built with newer, more capable hardware. It uses a Storage Area Network (SAN) for storage, which offers higher performance than NAS at a greater cost. The MarkMail search system runs on about 60 computer systems in its clusters.