As a starting point, let us consider using the same data structure, a two column table where the first column contains the URL of each web page and the second column contains the contents of that web page. However, this immediately presents a problem. Data in a relational database should be normalized, and the first rule of normalization is that each data item should be atomic. While there is some argument as to exactly what atomic means, everyone would agree that the contents of a web page with multiple links to other web pages is not an an atomic data item, particularly if we are interested in those links.
The obvious relational data structure for this data is a join table with two columns. One column, called PAGE_URL, contains the URL of the page. The other column, called LINK_URL, contains URLS of links on the corresponding page. There is one row in this table (called WWW_LINKS) for every link in the World Wide Web. Given this structure we can write the following SQL query to solve the problem in the example (presuming a function called getdomain that returns the domain from a URL):
SELECT LINK_URL, count(distinct getdomain(PAGE_URL))
FROM WWW_LINKS
GROUP BY LINK_URL
The point of this example is to show that Map-Reduce and SQL aggregate functions both address the same kind of data manipulation. My belief is that most Map-Reduce problems can be similarly expressed by database aggregation. However there are differences. Map-Reduce is obviously more flexible and puts less constraint on how the data is represented.
I strongly believe that every programmer should understand the principals of data normalization and why it is useful, but I am willing to be flexible when it comes to practicalities. In this example, if the WWW_LINKS table is a useful structure that is used in a number of different queries, then it is worth building. However if the only reason for building the table is to do one aggregation on it, the Map-Reduce solution is better.
No comments:
Post a Comment