The first reports of BigTable, CouchDB, and the other innovations that have now come to be casually categorized as NoSQL emphasized their architecture. I assumed that these experiments would stand or fall on their fundamental designs. I arrived at Boston's NoSQL Live conference yesterday hoping to untangle the architectural choices behind each technology and lay out their implications in long, straight lines that could lead to decisions about their deployment. Now I wonder whether they'll compete like other projects do, citing their features more than their architecture.
Of course, architecture to a large extent dictates features. But work-arounds and extensions can make up for features that one doesn't expect to find in the architecture itself. My previous blog introducing NoSQL Live explained some of the basic differences between the technologies lumped together at the conference, but these distinctions all blur under the pressure of close examination.
Consider one of the first design decisions to greet a project's creators, and one that would seem to lay out the fate of a project: the choice between a simple key-value store (which goes back several decades, long before relational databases) and a document model, which allows complex structures and rich annotations of data. Bryan Fink from Riak pointed out that these could be placed along a continuum where usage determines a project's place as much as architecture. In a key-value store, if you assign a data type to the value, it starts to take on semantics. If you break it into fields, it takes on even more.
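Fink's continuum can be sketched in a few lines of Python. This is only an illustration: a plain dict stands in for the store, and the key name, the `price_cents` field, and the `get_field` helper are all made up for the example. Nothing here reflects Riak's actual API.

```python
import json

# A plain dict stands in for a key-value store (an assumption for illustration).
store = {}

# Step 1: pure key-value. The store sees only opaque bytes.
store["item:42"] = b"\x00\x01\x02"

# Step 2: typed value. Once the application agrees that the value is JSON
# text, the value carries semantics even though the store is unchanged.
store["item:42"] = json.dumps({"name": "widget", "price_cents": 1999}).encode("utf-8")

# Step 3: fields. Reading one field out of the value is a document-style
# operation, performed here by the application rather than by the store.
def get_field(key, field):
    """Decode the JSON value stored under key and return one of its fields."""
    return json.loads(store[key].decode("utf-8"))[field]

price = get_field("item:42", "price_cents")
```

The store never changes between the three steps; only the way the application uses it does, which is the sense in which usage places a project on the continuum.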
The urge for more and more features impinges on the purity and simplicity of many NoSQL models. At the NoSQL Live conference, I noticed a progression during the day from NoSQL as a ticket to speed, freedom, and flexibility to NoSQL as a set of alternative data structures that could be even more overloaded with semantics than the relational ones. Let's trace NoSQL from the meadows to the gardens.
NoSQL running wild through the meadows
During the morning sessions, NoSQL came across as cool and fluid. Just stream terabytes of whatever you like, rip through it, and pick out what suits your fancy.
Dwight Merriman, founder of the main conference sponsor 10gen, started us off nicely with a definition of NoSQL as data stores that don't support joins and that have "light transactional semantics," allowing superior horizontal scaling. He also set the tone of diversity by saying the day of "one size fits all" in databases is over.
Tim Anglade delivered the keynote, pointing to an impressive chart generated by Google Trends that showed a tremendous and continuing rise in searches for the term "NoSQL" in early 2010. He also tracked a number of popular projects, all of which were gradually on the rise in searches, but 10gen's MongoDB was the only one whose sudden uptick matched that of NoSQL as a whole.
Anglade recommended four initiatives that would raise the movement's prospects in large corporate circles:
- College courses
- An overview book
- An industry body to act as advocate and liaison to corporations, and to link them with researchers
- More conferences on several continents
A session on scaling laid out some of the strengths of the most popular tools. The ceiling they're up against now is partitioning and sharding, which many have solved within a local network but few have dealt with across networks. Google has recently developed an internal tool called Spanner so that its data can cross network boundaries.
Mark Atwood joined the panel to talk about memcached and to comment on the NoSQL movement. Memcached played an odd mix of roles at the conference. It was a grand old man that inspired many of the current NoSQL offerings in its audaciously simple path to success, but it was equally a stand-in for the relational databases that received no other representation at the show, because memcached is usually deployed to enable the use of relational databases behind high-volume web sites and is intertwined with their deployment. Atwood was the one to whisper cautions in the ears of the conference participants as they sped forward on the NoSQL chariot.
The session also touched on administration, with unclear findings. Startup is still a serious administrative task, as is the handling of traffic spikes. Certainly, most of the tools make it easy to add nodes and do replication. But even MySQL is fairly simple in these areas. For instance, MySQL replication can be set up through two or three configuration options and an SQL command on each server. Where replication on MySQL becomes hard is when transactions and database state enter the picture--precisely the areas that NoSQL solutions steer away from.
The next session, covering NoSQL in the cloud, started with an extra spark because it combined two trends that are now of interest to sites saddled with large data sets. Although not stated on the podium, the session treated two quite different types of service. The first applies NoSQL tools to data stored in a service such as Amazon's S3; the tools themselves are still installed and run by the user. The second involves services that run NoSQL solutions and offer APIs to users for uploading and processing their data.
Both types of service hold promise. However, one consideration they both entail is calculating available bandwidth and factoring in its costs. They also raise the classic question faced by anyone who uses a cloud vendor or Software as a Service: what does it take to get data out and switch services?
The discussion of this question raised, for the first time in the day, the question of a standard abstraction layer that would let the same program run on many competing services. Jonathan Ellis, a Rackspace staffer and Cassandra contributor, forcefully laid down the position that such a layer would have too many drawbacks to be of value. As we've seen with operating systems and SQL implementations, each product offers, beyond its historical digressions and idiosyncratic choices, some non-standard features of real value. To give up such features in pursuit of portability would be to lose much of what made a project attractive in the first place.
The morning ended with five lightning talks on tools mostly of interest to niches.
NoSQL tiptoeing through the gardens
A change was rung in after lunch with a session on schema design. Freedom, it seems, derives value from structure, and now the talons of the relational model thrown off during the first half of the day started creeping back.
Indexes were of particular interest. Some tools incorporate them directly, while others typically leave them to the discretion of the user through the use of higher-level tools such as Lucene and Solr. Here we have an illustration of the principle aired at the top of this blog, that features missing from some products can be added through extensions and work-arounds.
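To make the work-around concrete, here is a minimal sketch of an index kept by the application on top of a store that has none. It is not how Lucene or any of the projects at the conference work; the `IndexedStore` class, its method names, and the sample records are hypothetical, and an in-memory dict stands in for the store.

```python
from collections import defaultdict

class IndexedStore:
    """Key-value store with one secondary index maintained by the application."""

    def __init__(self, indexed_field):
        self.indexed_field = indexed_field
        self.data = {}                 # primary: key -> record (a dict)
        self.index = defaultdict(set)  # secondary: field value -> set of keys

    def put(self, key, record):
        # If the key is being overwritten, drop its stale index entry first.
        old = self.data.get(key)
        if old is not None and self.indexed_field in old:
            self.index[old[self.indexed_field]].discard(key)
        self.data[key] = record
        if self.indexed_field in record:
            self.index[record[self.indexed_field]].add(key)

    def find(self, value):
        """Return the keys whose indexed field equals value, sorted."""
        return sorted(self.index.get(value, ()))

users = IndexedStore("city")
users.put("u1", {"name": "Ann", "city": "Boston"})
users.put("u2", {"name": "Bob", "city": "Boston"})
users.put("u1", {"name": "Ann", "city": "Austin"})  # overwrite moves u1 in the index
in_boston = users.find("Boston")
```

The cost of this approach is visible in `put`: every write now has to keep two structures consistent, which is the bookkeeping a built-in index would otherwise hide.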
Attendees asked the panelists whether their schemas could enforce constraints and support validation. The unintimidated panelists reminded the audience that the demands of constraints and validation were exactly the heavy-weight features they jettisoned to make their tools possible. These tasks are normally left up to the application. Nevertheless, some of the panelists opened the door a chink to these features. For instance, Riak will add hooks at some point that permit one to write validation routines, among other things.
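What "left up to the application" means in practice can be shown with a small sketch. The record layout (`sku`, `quantity`), the function names, and the dict used as a store are assumptions made for illustration; this is not the hook interface of Riak or of any other project.

```python
def validate_order(record):
    """Return a list of problems; an empty list means the record is acceptable."""
    problems = []
    sku = record.get("sku")
    if not isinstance(sku, str) or not sku:
        problems.append("sku must be a non-empty string")
    qty = record.get("quantity")
    # bool is a subclass of int in Python, so it is excluded explicitly.
    if not isinstance(qty, int) or isinstance(qty, bool) or qty <= 0:
        problems.append("quantity must be a positive integer")
    return problems

def checked_put(store, key, record):
    """Write the record only if it passes validation; otherwise raise ValueError."""
    problems = validate_order(record)
    if problems:
        raise ValueError("; ".join(problems))
    store[key] = record

store = {}  # stand-in for a schemaless data store
checked_put(store, "order:1", {"sku": "A-100", "quantity": 2})
try:
    checked_put(store, "order:2", {"sku": "", "quantity": 0})
    rejected = False
except ValueError:
    rejected = True
```

Because the check lives in the application, any client that writes to the store directly bypasses it, which is one reason a server-side hook is attractive.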
NoSQL thus gradually shaped up and accrued features as the afternoon went on. But while we were assessing the stakes (I use this term not in the sense of gambling, but of fences that turn a meadow into a garden), they really got raised during the next session on graph data stores.
Graphs, which are collections of nodes and edges such as we see in genealogies, yield extremely supple data structures for representing real-world relationships. Simplicity and litheness are not, however, normally among their attributes. For instance, Borislav Iordanov admitted that partitioning graph databases was hard because of all the interrelationships; nevertheless, his HyperGraphDB does tie together multiple nodes.
Relational databases groan to the breaking point under applications that try to traverse graphs, so there's a place in the computer field for graph databases. They are manipulated by different languages from those normally used with NoSQL projects discussed earlier in the conference. Common languages in graph communities include SPARQL (which resembles SQL but follows data to extract further attributes) and OWL (which uses syllogistic logic to find truths in data relationships). Peter Neubauer of Neo4j mentioned a new language called Gremlin.
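The traversal workload is easy to picture with a small example. In a relational schema, each hop through a parent-child table typically costs another self-join or a recursive query, whereas a graph structure simply follows edges. The sketch below uses a plain adjacency list with invented names; it does not use the API or query language of any graph database mentioned here.

```python
from collections import deque

# Hypothetical genealogy: each person maps to a list of their children.
children = {
    "Ada": ["Ben", "Cleo"],
    "Ben": ["Dov"],
    "Cleo": [],
    "Dov": ["Eve"],
    "Eve": [],
}

def descendants(graph, start):
    """Return every node reachable from start, in breadth-first order."""
    seen = {start}
    order = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:  # guard against cycles and shared descendants
                seen.add(nxt)
                order.append(nxt)
                queue.append(nxt)
    return order

result = descendants(children, "Ada")
```

Here `result` lists Ada's children first, then her grandchild, then her great-grandchild, without the depth of the tree having to be known in advance.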
Even ACID, banned early in the day, returned from exile for a cameo appearance during the graph database session. No one mentioned a RESTful interface, which many of the more familiar NoSQL projects have raced to offer and which has long been the key selling point for CouchDB.
The day reached its formal peak in the final presentation, given by Sandro Hawke. One could hardly find a stronger message that it was time for a movement to grow up and don cufflinks and tie clips than a presentation from a standards organization.
I must admit Hawke made a very congenial standards body emissary, however. He joked that he wouldn't mind if XML died off. More seriously, he assured attendees that, "RDF is described in the theoretical literature in terms of graphs, but I don't personally see it in terms of graphs," and "You can get along while ignoring OWL and RDF Schema."
Hawke gave an overview of W3C standards and the standardization process, but most interesting were his speculations about what standards might become adopted by the NoSQL movement. He reminded us of the controversy earlier in the day over an abstraction layer, and suggested that some people would be willing to make the sacrifices involved in using one. He suggested a modified SQL with no ACID, improved portability, and a RESTful API. He also said RDF (which is associated with XML but can be used without it) and SPARQL might prove useful standards to rally around.
A preview of coming attractions
One expects previews at a one-day conference, and that is mostly what I found at NoSQL Live in Boston. But I left with the impression that features are at the center of NoSQL offerings. Does a particular project support the language I want? Does it have a RESTful interface? How about indexes, referential integrity, etc.?
As I've said, these features don't need to be designed in from the start. Users can find an enhancement to plug in to each hole, drawing on the ingenious contributions of community members or a little hand-coding.
For instance, I talked to a couple of attendees who run an e-commerce firm that just adopted MongoDB (what features attracted them to it, I never found out). They complained that they need to offer tailored content to many different partners, and therefore have to do a lot of joins on their data. In theory, they could store everything in a single MongoDB database and filter it without doing any joins, but this solution didn't seem to satisfy their needs. What they'll probably do is prefetch and cache the data they need for each partner.
Work-arounds like that are a fixture of data processing. After all, memcached is itself a work-around for the performance gaps one experiences making repeated queries to a relational database.
In short, I'm no longer as interested in finding essential and inviolable differences between NoSQL projects. I believe they'll see evolution and convergence. Meanwhile, we'll be scrambling to learn, and have some fun. One participant told me, with a bit of exultant exaggeration, that NoSQL has taken a field that was "dead" (database development) and suddenly brought it back to life.
The creation and adoption of NoSQL projects by such enormous data sites as Google, Amazon, Facebook, Twitter, and LinkedIn prove they're no blip; they're part of the next generation of data stores. The NoSQL projects also help us remember what relational databases are for in the first place, and teach us to use them with more care and finesse. (I'm thinking ahead here to my next major series of blogs, which will come from the MySQL conference in April.) Everyone will learn and grow from this movement.