Wednesday, September 03, 2008

A Tale of Two Search Engines

At the SDForum Software Architecture and Modeling (SAM) SIG last week, John D. Mitchell, Mad Scientist at MarkMail and previously Chief Architect of Krugle, talked about the architectures of the search engines that he has built for these two companies.

Krugle is a search engine for code and all related programming artifacts. The public engine indexes all the open source software repositories on the web. This system was built a couple of years ago with cheap off-the-shelf commodity hardware, open source software and Network Attached Storage (NAS). In total it has about 150 computer systems in its clusters. The major software components are Lucene (search engine), Nutch (web crawling, etc.) and Hadoop (distributed file system and map-reduce). These are all open source projects written in Java and sponsored by the Apache Software Foundation. Krugle sells an enterprise edition that can index and make available all source code in an enterprise.

MarkMail is a search engine for email. It indexes all public mailing lists and is a technology demonstrator for the MarkLogic XML Content Server. MarkMail is built with newer, more capable hardware. It uses a Storage Area Network (SAN) for storage, which offers higher performance than NAS at a greater cost. The MarkMail search system is built on about 60 computer systems in its clusters.


John D. Mitchell said...

Hi Richard, thanks for the write up!

FYI, I've made the slides of the talk available on the MarkMail blog.

There is a bit of confusion in your description of the clusters that I'd like to clarify if I may...

Krugle was using about 150 physical servers in total for all of the crazy stuff that it has to do to crawl, process, store, and deliver all of the various services (both internally and externally). That's due to a lot of reasons such as pushing many servers to the edge of their performance, instability of the various virtualization options at the time we started, architectural constraints due to issues with software that we were using that had various problems (such as Hadoop), etc.

On the other hand, MarkMail is heavily virtualized. That is, the live cluster (see the architecture slides in the talk), which is currently serving over 25 million emails with << sub-second response times, is run on a handful of physical machines. I.e., a native MarkLogic Server cluster with 1 'e' ("master") node and 3 data nodes running on bare, physical machines -- because the MarkLogic database server will take advantage of all of the horsepower on the box so as to be as fast as possible. In fact, until recently, we ran the entire database on 1 server -- we only switched to the 4 node cluster so that we'd be well ahead of the curve as we ingest emails at an accelerating rate.

Everything else is virtualized -- all of the rest of the services are run inside OpenVZ virtualized Linux servers.

So, the "60 servers" number that you mention for MarkMail is a bit erroneous. We spool up and destroy instances of the virtual servers all the time -- i.e., development and testing made easy by spooling up a clean instance of the various services that you're working on and then nuking it when you're done. In terms of actual server hardware, we're running all of the MarkMail organization on about 20 physical servers and the super-majority of them are for testing and development.

This new reality of having a mix of services running on bare metal when it's worth it and the rest of the services running virtualized makes for a wonderfully cost effective and manageable solution that delivers excellent performance.

Have fun,

Richard Taylor said...

John, thanks for the clarification. I tried to download the slides of your presentation but get a message that the .pdf file is broken and cannot be repaired.

John D. Mitchell said...

Did your browser get the entire .pdf file? It's large -- 31MB -- so if the connection was lost, that would certainly cause a problem viewing it.