Fine, make me a blog

It's a blog, okay?
Follow What I Follow | My Book Reviews | About Me | Tweets

Oct 4

Why is writing SQL difficult?

The tongue-in-cheek answer is that most people don’t care how hard writing SQL is.  It keeps people employed, and the next milestone in automation is a lot of R&D with scary consequences for most enterprise IT professionals.

But here is an engineering answer:

It seems to me that in SQL, one of my favorite questions becomes particularly important: what optimizations are optional and what optimizations are required, i.e. can be relied upon as part of the semantics of the language (and therefore could be argued to not be rightly called optimizations at all).

It might shock you, but for a long time major open source databases like PostgreSQL considered trivial examples like select COUNT(*) from Transaction as a corner case for correctly implementing multiversion concurrency control (the same works for MAX(expr) and MIN(expr)). In fact, many didn’t even notice the problem because many users had “embarrassingly small” databases or didn’t ever write these sorts of aggregate queries. In addition, prior to the creation of new search-based testing techniques, it was very difficult to generate data sets with distributions that could find flaws with the heuristics used in the Cost-Based Optimizer (CBO) module. And generating a bunch of funky data sets is not good enough - you need to know why that data set is funky. How does it stand out from other data sets?

As I suggest above, this is much more of an implementation issue with concurrency control or the fact the cost-based optimizer is a black box. Different database vendors do it differently. Some vendors have blazing fast rollbacks, which allow programmers to get away with some cockeyed database designs that have ridiculous amounts of trigger logic. Other vendors have blazing fast writes but are penalized more heavily on rollbacks, so if there is a trigger that causes nested transactions then the performance will tank on a large system (so the solution is don’t do that and instead partition your logic across tiers). The difference between blazing fast and slow can be fractions of a second. Having fast rollbacks for most operations can be convenient for other (very good!) reasons, though, such as the fact that rollbacks in general aren’t guaranteed to be linear with respect to a transactions current progress / time accrued.

Still, there are a lot of flaws with SQL itself that work against it being a good IR for representing queries - and there is the additional basic problem that each vendor might have only stress-tested their implementation up to a point and you might push past that point. “Solutions” to the language have been bolted on, not unlike other “reporting languages” like XQuery. Vendors try to stick the the standard, but they also differ in significant ways.

So, getting back to the max problem at hand, some solutions may only be “syntactically hard” if you can prove to yourself that the optimizer makes them “semantically easy.”

Again, this requires knowing what the CBO is doing - a naive CBO could generate plans for eternity, but real-world CBOs employ heuristics e.g. that try to avoid creating plans for “simple queries” where a table scan is sufficient, and also try to use statistics on columns to infer how expensive index seeks would be, etc. And they must do this while pruning the search tree for the “ideal plan” fast enough that they don’t waste system resources generating a bunch of effectively equivalent plans.

On top of this, most CBOs are not deterministic in a way that can be usefully used to predict what the DBMS will actually do. For example, many CBO engines allow you to retrieve an “estimated execution plan”, but many of these estimations are hard-coded based on some dude’s workstation and not your blade servers connected to a SAN.

It is certainly possible to build a naive artificially intelligent agent that could automate what a DBA does manually. But ultimately it is no better than the proverbial ant walking on a beach. You still have a black box performance profile and an under-specified IR. For example, windowing functions behave wildly differently from vendor to vendor, and most high quality ORMs deal with it by using a massive case explosion that can include multiple cases for the same (vendor,version) pair. One case might use a nested select statement if it thinks the result set it is windowing is small, and might shift to using a temporary table if the result set it is windowing is large. For some (vendor,version), the implementation of the DBMS might not allow windowing on “very large data” windows.

Implicit in my question is also the question of what does it mean to be “hard” or “easy,” which I guess sometimes gets discussed (often unproductively) under the banner of expressiveness. What are some objective definitions out there that can make the discussion productive?

Hard or easy — For who? For examples, suppose we are creating a:

  • a graphical application that does automatic presentation of database information. There is a visual query language, VizQL, that tries to tackle this problem.
  • Object-Relational Mapper that manages marshaling data to/from the database
  • Framework for embedding type-safe queries on streams and objects in procedural code (e.g., LINQ)

In terms of defining expressiveness, we can look at each of these examples as a case study and see where they fall apart.

For example, for Tableau Software’s products, if they wish to integrate with every mainstream vendor, then they have to have a case explosion that handles all the specific oddities of each implementation. Since they are targeting a fixed (vendor,version), they have no choice but to target that IR and can’t improve on it. The best thing they can probably do to “make it easy” is to have lots and lots of tests, and have all the tests run on identical data sets for each (vendor,version) pair. (This is an engineering problem, which benefits slightly from age-old compiler wisdom.)

For example, ORMs often provide their own “Object Query Languages” that operate on object graphs. They have to translate an object graph query into a relational query. The code for this in most ORMs is a cocktail of anti-patterns and generally poorly designed such that it is impossible to specialize the translator for different backends, and no ORMs allow developers to write new case rules in an open, extensible fashion. (This is a real language design need.)

For example, Oren Eini recently published an article in the Communications of the ACM about The Pain of Implementing LINQ Providers for NoSQL languages that don’t understand LINQ operatives like GroupBy. Specifically, there is no way to statically control what operatives are allowed or disallowed, and so some operatives throw compile-time exceptions. In some cases there is simply no way to statically detect, at compile-time, potential problems. Like trying to return a windowed result for a ridiculously large table. The variables just aren’t there. (This is a real language design problem, and there just isn’t much research yet on what to do. Barry Jay has some interesting results on XML in his bondi language, but I’m not aware of much else.)

Finally, if you want more examples of SQL difficulties, then the best book I can recommend is fellow LtU member Vadim Tropashko’s SQL Design Patterns.  He references other books, such as ones by CJ Date, and journal and conference articles and comes up with a list of common problems in SQL and (adequate) solutions for those problems.