Wednesday, October 13, 2010

A Critique of SQL

SQL is not a perfect solution as I told the audience at the SDForum Business Intelligence SIG September meeting, where I spoke about "Analytics: SQL or NoSQL". The presentation discusses the difference between SQL and structured data on the one hand versus the NoSQL movement and semi-structured data on the other hand. There is more to the presentation than I can fit in one blog post, so here is what I had to say about the SQL language itself. I will write more about the presentation at another time. You can download the presentation from the BI SIG web site.

Firstly the good. SQL has given us a model of a query language that seems so useful as to be essential. Every system that provides persistence has developed a query language. Here are a smattering of examples. The Hibernate object persistence system has Hibernate Query Language (HQL) which has been developed into the Java Persistence Query language (JPQL). Other Java based object oriented persistence systems either use JPQL or their own variant. Hive is a query interface built on top of the Hadoop Map-Reduce engine. Hive was initially developed by Facebook as a simplified way of accessing their Map-Reduce infrastructure when they discovered that many of the people who need to write queries did not have the programming skills to handle a raw Map-Reduce environment. XQuery is a language for querying a set of XML documents. It has been adopted into the SQL language and is also used with stand alone XML systems. If data is important enough to persist, there is almost always a requirement to provide a simple and easy to use reporting system on that data. A query language handles the simple reporting requirements easily.

On the other hand, SQL has many problems. Here is my thoughts on the most important ones. The first problem is that SQL is not a programming language, it is a data access language. SQL is not designed for writing complete programs, it is intended to fetch data from the database and then anything more than a simply formatted report is done in another programming language. This concept of a data access language for accessing a database goes back to the original concept of a database as promulgated by the CODASYL committee in the late 1960's.

While most implementations of SQL add extra features to make it a complete programming language, they do not solve the problem because SQL is a language unlike any of the other other programming language we have. Firstly, SQL is a relational language. Every statement in SQL starts with a table and results in a table. (Table means a table like in a document, a fixed number of columns and as many rows as are required to express the data.) This is a larger chunk of data than programmers are used to handling. The procedural languages that interface to SQL expects to deal with data at most a row at a time. Also, the rigid table of SQL does not fit well into the more flexible data structures of procedural languages.

Moreover SQL is a declarative language where you specify the desired results and the database system works out the best way to produce them. Our other programming languages are procedural where you describe to the system how it should produce the desired result. Programming SQL requires a different mindset from programming in procedural languages. Many programmers, most of whom just dabble in SQL as a sideline, have difficulty making the leap and are frustrated by SQL because it is just not like the programming languages that they are used to. The combination of a relational language and a declarative language creates a costly mismatch between SQL and our other programming systems.

Finally, SQL becomes excessively wordy, repetitive and opaque as the queries becomes more complicated. Sub-Queries start to abound and the need for correlated sub-queries, outer joins and pivoting data for presentation cause queries to explode in length and complexity. Analytics is the province of complicated queries so this is a particular problem for data analysts. In the past I have suggested that persistence is a ripe area for a new programming language, however although there are many new languages being proposed none of them are concerned with persistence or analytics. The nearest thing to an analytics programming language is R which is powerful but neither new nor easy to use.

2 comments:

The Editor said...

APL + SQL? Or would just further complicate things for procedural programmers?

Richard Taylor said...

I have never written a program in R, and the R programs that I have seen seem to be extremely concise given the results that they produce. However, I hear that R is a very difficult language to use, so I cannot imagine that a combination of SQL and R would result in something more comprehensible than either on their own.