A Universe in which I am condemned to write statements.
--Allen Ginsberg, Aether, 1960 (Reality Sandwiches)
I recently returned to a book I edited over this past year--SQL and Relational Theory: How to Write Accurate SQL Code by C.J. Date, a leading researcher in the field of relational databases--as I learned more about some of the alternative forms of data storage that are becoming popular for Web-based or text-heavy repositories. C.J. Date's book is aimed at DBAs and programmers using SQL, but the questions he poses to the reader would be useful for anyone trying to extract insights from data.
His book requires careful reading, oriented as it is toward scoring a relatively small number of points at the conclusion of long discussions. But the discussions and concluding points are solid, because Date applies his training as a mathematician to database realities. It was the vision that results from his discipline, more than any particular point, that ultimately allowed his arguments to win my respect--and that of several readers I spoke to (for instance, author and MySQL expert Baron Schwartz).
While a database is a collection of facts and relationships, C.J. Date sees it more fundamentally as statements of truth. The relations define a universe in we define true propositions--by entering data items into relations--and then derive other propositions that we can feel secure to be true. In this mindset, the DBA or programmer has the twin tasks of ensuring that:
- Everything going into the database is true.
- All operations accurately reflect the real-life questions being asked, so that results coming out of the database are also true.
Some readers of this article may be surprised to see this description of a database--and would be even more surprised to learn that Date applies these principles to very practical ends--while others are shrugging right now and asking where's the novelty. But I'll wager that a large fraction of readers will complain that life is too big and hairy to be categorized so strictly. Data is imprecise, insights are fuzzy, and truth changes with the flutter of a butterfly's wings. Hence the use of XML text repositories, RDF, key-value stores with logic relegated to the application layer, etc.
Both visions have their place, so in this article I'll explore Date's ideas a bit, then find some possible common ground between relational databases and the intriguing new (or sometimes old and refurbished) technologies.
I'll start where Date (almost) ends, where he declares in Appendix A that the relational model holds a unique and unchallenged place among models for structuring data. It is the only model that is logically robust and independent from its various representations. He wisely leaves this claim for the end because a reading of the book is required to understand how he derives applications from logical principles, what the relational model is, and how a proper use of the model produces insights. But the claim is worth ruminating on while we roam the field of databases a bit.
Don't be surprised that I needed several months, after working intensively on Date's book, to assimilate its vision. The book does not lay down all its cards on the first round of play. The specific, practical tenets Date offers for crafting SQL statements could probably be summarized on a two-page cheat sheet, but to do so would cheat the reader.
You must accept one premise to get through this book: that logical thinking is important to using data properly. I don't have the strongest training in logical thinking (and I'm sure commenters will find proofs of that in this very article) but Date convinced me that he was right to stress it.
After all, databases last for decades (the reason for the Y2K problem) and can handle thousands of statements a second. That's a long time and a lot of queries during which some tiny inconsistency or logical flaw could trigger bad results. Some databases contain hundreds of tables containing millions of rows, and queries flit among them at lightning speed snapping up items of data here and there. Something's got to give.
Date knows that beyond his specific guidelines (add the DISTINCT keyword to SELECTs, don't update a table using a cursor, etc.) lies a knowledge of how a relational database is supposed to work and how SQL actually works. If you don't know the traps embedded in the SQL language as well as its various implementations, you won't know how to apply the guidelines.
Date does throw in, for those with the stamina to follow his path, a delightful training exercise of a more standard kind. A study of the logic behind relational operations leads in the penultimate chapter to a system for transforming business logic into SQL queries and CONSTRAINT clauses. Some business logic requires a lot of transformation, so this is a useful skill to pick up, and conveying it was one of Date's main goals in the book.
Example: the case of the null
Probably the most oft-violated of Date's tenets is his ban on nulls. I
don't find his advice here surprising; I actually encountered similar
advice while working on a very different book
(High Performance MySQL).
In Chapter 3, the authors (who include Baron Schwarz, mentioned at
the beginning of this article) suggested defining columns NOT NULL,
less for structural integrity than simply to improve performance.
Date takes a disciplined approach to the question, as always. Through numerous examples he shows that nulls can lead to incorrect results, even if you jump through hoops to deal with their presence.
But these examples are just icing on the cake. If you see the database as representation of reality, the ban on nulls makes intuitive sense. Wouldn't you be wary of a legal contract that listed a task without assigning it to anyone, or of driving directions that mentioned an intersection without saying which way to turn? Life contains information lapses, so sometimes we must make compromises, but simply leaving a blank is a recipe for a bad outcome.
Whose eyes collect the streets and mountain tops for storage in his memory?
--Allen Ginsberg, Over Kansas, 1954 (Reality Sandwiches)
A hundred flowers have sprung up over the past few years on what some observers already claim to be the grave of the relational database. Many new models store metadata with the data (for instance, in RDF tags). Most push logic out of the database into the application. In general, they break the strict coordinate-based (tabular) model and let designers associate any item or group of items with other items using arbitrary relations. Thus, an XML file with a couple hundred elements could have thousands of varying links. Even Ted Nelson (the famed inventor of hypertext) has gotten into the act with ZigZag.
Let's find the common ground now. These nonconventional data stores still treat data in many of the same ways relational databases do:
- Relationships are still crucial. Triples, the foundation of RDF, are simply binary relationships embedded in text. Fields or columns (which SQL uses to store attributes, the official term used in relational theory as described by Date) can be found in other systems too, under such names as properties. Not all attributes represent relations, but many do.
- Filtering is still done in advance, because letting every query process the raw data would be too time-consuming. Thus, pre-built indexes can be found in nearly all systems, representing the administrator's guess about what the queries will be looking for. According to a friend of mine in the field of taxonomy, the "heart" of taxonomy work is predicting what will be queried.
- Constraints are universal. For instance, if you want to record temperatures, every system requires that you choose a data type, a physical representation (probably floating-point) and a scale (such as Celsius). You probably also want to do some basic sanity checks, such as requiring the temperature of water to be less than 100.0 Celsius unless you expect it to be severely compressed or to exist as a gas.
Light-weight systems ask the application to handle a lot of the constraints, as well as joins and other processing, that most relational databases handle internally. Modern programming and test frameworks may make reliance on the application safer than it was in the past, by allowing programmers to establish and stick to conventions. But if you take Date's vision of the database as a representation of reality, you'll want the logic and constraints close to the data.
I think most people still feel comfortable using relational databases to store information that's highly structured and calls for precise accuracy, like customer information and the results of scientific surveys. Storage formats and query strategies could be improved--and are being improved all the time--but the joins, keys, constraints, and other accoutrements of the relational model have clear value.
If your data doesn't fit into fixed set of dimensions, you should use something different--but the rigor Date brings to databases is still good thinking. And probably there's space for a relational database somewhere in your architecture, perhaps as a kind of reality sandwich to hold the filtered results of operations on the broader data.
The strength of the relational model is the mathematical discipline that lies behind it. Perhaps if the field adopted technologies closer to this pure model, as Date expounds it, a wide range of data stores would be more robust and scalable. I hope Date applies his formidable analytical skills to this question in a future book!