Relational databases as reality sandwiches: thoughts about C.J. Date's "SQL and Relational Theory"

By Andy Oram
July 15, 2009 | Comments: 7

A Universe in which I am condemned to write statements.
--Allen Ginsberg, Aether, 1960 (Reality Sandwiches)

I recently returned to a book I edited over this past year--SQL and Relational Theory: How to Write Accurate SQL Code by C.J. Date, a leading researcher in the field of relational databases--as I learned more about some of the alternative forms of data storage that are becoming popular for Web-based or text-heavy repositories. C.J. Date's book is aimed at DBAs and programmers using SQL, but the questions he poses to the reader would be useful for anyone trying to extract insights from data.

His book requires careful reading, oriented as it is toward scoring a relatively small number of points at the conclusion of long discussions. But the discussions and concluding points are solid, because Date applies his training as a mathematician to database realities. It was the vision that results from his discipline, more than any particular point, that ultimately allowed his arguments to win my respect--and that of several readers I spoke to (for instance, author and MySQL expert Baron Schwartz).

While a database is a collection of facts and relationships, C.J. Date sees it more fundamentally as statements of truth. The relations define a universe in we define true propositions--by entering data items into relations--and then derive other propositions that we can feel secure to be true. In this mindset, the DBA or programmer has the twin tasks of ensuring that:

  1. Everything going into the database is true.
  2. All operations accurately reflect the real-life questions being asked, so that results coming out of the database are also true.

Some readers of this article may be surprised to see this description of a database--and would be even more surprised to learn that Date applies these principles to very practical ends--while others are shrugging right now and asking where's the novelty. But I'll wager that a large fraction of readers will complain that life is too big and hairy to be categorized so strictly. Data is imprecise, insights are fuzzy, and truth changes with the flutter of a butterfly's wings. Hence the use of XML text repositories, RDF, key-value stores with logic relegated to the application layer, etc.

Both visions have their place, so in this article I'll explore Date's ideas a bit, then find some possible common ground between relational databases and the intriguing new (or sometimes old and refurbished) technologies.

I'll start where Date (almost) ends, where he declares in Appendix A that the relational model holds a unique and unchallenged place among models for structuring data. It is the only model that is logically robust and independent from its various representations. He wisely leaves this claim for the end because a reading of the book is required to understand how he derives applications from logical principles, what the relational model is, and how a proper use of the model produces insights. But the claim is worth ruminating on while we roam the field of databases a bit.

Insights that take time

Don't be surprised that I needed several months, after working intensively on Date's book, to assimilate its vision. The book does not lay down all its cards on the first round of play. The specific, practical tenets Date offers for crafting SQL statements could probably be summarized on a two-page cheat sheet, but to do so would cheat the reader.

You must accept one premise to get through this book: that logical thinking is important to using data properly. I don't have the strongest training in logical thinking (and I'm sure commenters will find proofs of that in this very article) but Date convinced me that he was right to stress it.

After all, databases last for decades (the reason for the Y2K problem) and can handle thousands of statements a second. That's a long time and a lot of queries during which some tiny inconsistency or logical flaw could trigger bad results. Some databases contain hundreds of tables containing millions of rows, and queries flit among them at lightning speed snapping up items of data here and there. Something's got to give.

Date knows that beyond his specific guidelines (add the DISTINCT keyword to SELECTs, don't update a table using a cursor, etc.) lies a knowledge of how a relational database is supposed to work and how SQL actually works. If you don't know the traps embedded in the SQL language as well as its various implementations, you won't know how to apply the guidelines.

Date does throw in, for those with the stamina to follow his path, a delightful training exercise of a more standard kind. A study of the logic behind relational operations leads in the penultimate chapter to a system for transforming business logic into SQL queries and CONSTRAINT clauses. Some business logic requires a lot of transformation, so this is a useful skill to pick up, and conveying it was one of Date's main goals in the book.

Example: the case of the null

Probably the most oft-violated of Date's tenets is his ban on nulls. I don't find his advice here surprising; I actually encountered similar advice while working on a very different book (High Performance MySQL). In Chapter 3, the authors (who include Baron Schwarz, mentioned at the beginning of this article) suggested defining columns NOT NULL, less for structural integrity than simply to improve performance.

Date takes a disciplined approach to the question, as always. Through numerous examples he shows that nulls can lead to incorrect results, even if you jump through hoops to deal with their presence.

But these examples are just icing on the cake. If you see the database as representation of reality, the ban on nulls makes intuitive sense. Wouldn't you be wary of a legal contract that listed a task without assigning it to anyone, or of driving directions that mentioned an intersection without saying which way to turn? Life contains information lapses, so sometimes we must make compromises, but simply leaving a blank is a recipe for a bad outcome.

Relational databases and all the others

Whose eyes collect the streets and mountain tops for storage in his memory?
--Allen Ginsberg, Over Kansas, 1954 (Reality Sandwiches)

A hundred flowers have sprung up over the past few years on what some observers already claim to be the grave of the relational database. Many new models store metadata with the data (for instance, in RDF tags). Most push logic out of the database into the application. In general, they break the strict coordinate-based (tabular) model and let designers associate any item or group of items with other items using arbitrary relations. Thus, an XML file with a couple hundred elements could have thousands of varying links. Even Ted Nelson (the famed inventor of hypertext) has gotten into the act with ZigZag.

Let's find the common ground now. These nonconventional data stores still treat data in many of the same ways relational databases do:

  1. Relationships are still crucial. Triples, the foundation of RDF, are simply binary relationships embedded in text. Fields or columns (which SQL uses to store attributes, the official term used in relational theory as described by Date) can be found in other systems too, under such names as properties. Not all attributes represent relations, but many do.
  2. Filtering is still done in advance, because letting every query process the raw data would be too time-consuming. Thus, pre-built indexes can be found in nearly all systems, representing the administrator's guess about what the queries will be looking for. According to a friend of mine in the field of taxonomy, the "heart" of taxonomy work is predicting what will be queried.
  3. Constraints are universal. For instance, if you want to record temperatures, every system requires that you choose a data type, a physical representation (probably floating-point) and a scale (such as Celsius). You probably also want to do some basic sanity checks, such as requiring the temperature of water to be less than 100.0 Celsius unless you expect it to be severely compressed or to exist as a gas.

Light-weight systems ask the application to handle a lot of the constraints, as well as joins and other processing, that most relational databases handle internally. Modern programming and test frameworks may make reliance on the application safer than it was in the past, by allowing programmers to establish and stick to conventions. But if you take Date's vision of the database as a representation of reality, you'll want the logic and constraints close to the data.

I think most people still feel comfortable using relational databases to store information that's highly structured and calls for precise accuracy, like customer information and the results of scientific surveys. Storage formats and query strategies could be improved--and are being improved all the time--but the joins, keys, constraints, and other accoutrements of the relational model have clear value.

If your data doesn't fit into fixed set of dimensions, you should use something different--but the rigor Date brings to databases is still good thinking. And probably there's space for a relational database somewhere in your architecture, perhaps as a kind of reality sandwich to hold the filtered results of operations on the broader data.

The strength of the relational model is the mathematical discipline that lies behind it. Perhaps if the field adopted technologies closer to this pure model, as Date expounds it, a wide range of data stores would be more robust and scalable. I hope Date applies his formidable analytical skills to this question in a future book!


You might also be interested in:

7 Comments

Hi

Interesting review.

I'm sure most experienced and mathematically trained DBAs would agree wholeheartedly with Date's views. In my experience most DBAs (and programmers) with little or no mathematical training (at university level) neither understand nor do they show interest in the things that make data management easy and useful.

So for 30 years Date has been banging his head against the proverbial brick wall. It's the lack of mathematics that is the real problem. And faced with an inability to comprehend the relational model programmers et al have invented things they understand - even though in most cases they don't work as well.

Personally I find the relational model exceptional.

Having said that, I don't use SQL because if there is a shortcoming in SQL it is the inability of the language or the perhaps the relational model to store, inherently, the table relationships and then use those when evaluating queries. (ie why does the table join exist explicitly in the query).

I've worked on extending the relational model for 30 years in such a way that the integrity of the model is maintained and extended to included true meta data - table relationships, database wide calculations, etc.

Interestingly table relationships answer one of the fundamental implementation problems - what indexes should the table have?

I could go on, but my main point is that if you can't do the math, this book like it's predecessors will just give people more reasons not to use the relational model; and real attempts like mine to extend the model in a consistent and mathematically sound manner will never be understood.

But again, thanks for the review it's good to know that there are people still thinking deeply about the model and not just the hype.

I'm a history major, but I took a math minor and a comp sci minor, and while I can't say I have the stomach for the rigors of pure logic, I have to admire it. And I also have to admire the power of the relational model in bringing logic into the database world. But the main reason the relation model is the best isn't the abstract beauty (though it hurts my mathematician's heart to say so).

Rather the relational model keeps data organized, optimized, and corruption free. So I'd like to thank Date for doing the work of writing a book that actually shows this at a time when so many companies are being swept up in fadish new database models that don't give nearly as much robustness as the relational model.

As an example, take Google's BigTable. When AppEngine was first coming out Google was bragging about how much better their non-relational datastore model was, but at a Google AppEngine Users Meeting I attended a couple months after the launch, I heard from Google employees themselves that they had discovered that for powerful AppEngine applications there was a large potential for read-write errors, they had a plan to solve it, but that plan was essentially the same technique used to deal with read-write errors for relational databases.

People keep trying to re-invent the wheel, but usually they just make it worse

>> If your data doesn't fit into fixed set of dimensions

Then it's just noise not data, no matter what the "unstructured data" zealots say.


>> I hope Date applies his formidable analytical skills to this question in a future book!

He has. For 8 editions. His O'Reilly pamphlets are watered down sections of his "An Introduction to Database Systems". Sorry, but true.


>> Many new models store metadata with the data (for instance, in RDF tags). Most push logic out of the database into the application.

Both are reasons that coders like file data: the removal of data from its constraints into code allows coders to create Towers of Babel, thus lengthening their employment. And are the overwhelming reason applications are a mess.


>> Data is imprecise, insights are fuzzy, and truth changes with the flutter of a butterfly's wings.

Only if you *want* muddle. Coders love muddle.

"the reason for the Y2K problem"

Hardly. 2 digit years in dbs were easy to fix. A sense of context and you could expand the field and guess the correct century.

2 digit years in program fields on the other hand were a pain in the arse. Checking every 2 digit field (including constants) and then altering the code that used them appropriately is not something I want to repeat.

Nulls are unavoidable. Tables (pardon, relations) change, not least of all because the original data modeler doesn't necessarily have a perfect & eternal comprehension of the domain. When change happens, shouldn't all the prior rows (pardon, tuples) contain nulls in the added columns (pardon, attributes) to accurately represent that at the time the data was captured absolutely no value was present (rather than sticking in a 0 or empty string)?

Reedo, you're only talking about one particular use case for NULL. You should familiarise yourself with Date's reasoning before taking up the defence. One way to start is online, http://www.dbdebunk.com/publications.html
http://thethirdmanifesto.com/

Just to add a brief note.

I have long been associated with Chris Date over here in Europe, where I have arranged numerous seminars and meetings for him. I was also one of the reviewers for this book on SQL and relational theory. (Hello Andy!)

One can construct various logical and mathematical arguments for the correctness of Date's approach, many of which will sail straight over the heads of the information practitioners that one might be trying to influence.

The single strongest means of persuading someone to take Chris's arguments seriously is, to my mind, to ask the simple question - can you PROVE that the answers from your database queries are correct?

If you cannot, and you are working in a field such as the aerospace industry, or indeed any other industry in which the consequences of a wrong answer could be loss of life, then you are ultimately vulnerable to litigation in the event of something going wrong. Just imagine the damage that could be done to a company in which a forensic analyst determined that database practices were being followed in which the data retrieved from the database could not be validated as correct.

I have certainly seen instances in which developers have twisted their result set because their queries were not giving the correct answers. I wonder how prevalent this malpractice is?

Once this point has struck home - apply for funds to send your database designers on a course to learn the sort of things that Chris has been teaching for years. His approach absolutely enables proof of correctness.

Oh yes - Chris is doing a 2-day course on his book here in Edinburgh in May this year.

-

News Topics

Recommended for You

Got a Question?