An Infrastructure for Big Data

By Kurt Cagle
January 12, 2009 | Comments: 2

Recently, Michael Driscoll wrote an intriguing article about the potentially impending paradigm shift brought about by the interconnectivity of Big Data. One of the points that he brings up - and something I've been thinking about a lot lately - is the difficulty of standardizing the data that businesses offer one another, and of keeping it synchronized.

The potential benefits of being able to expose even a portion of data that businesses and organizations produce in a compatible manner would be huge - it would, indeed, be a major boost for businesses that are built on or around the Internet as well as provide the framework to turn much of the economy into a Mashup Economy.

The problem, of course, is standardization.

Businesses, left to themselves, do not standardize at any level beyond their own boundaries (the corporation itself, or its captive supply chain if it is large enough). Common standards provide little immediate material benefit when one is the market leader, while increasing the possibility that critical supply chain components may "defect" to competitors - raising prices for the market leader and offering potential advantage to its competitors.

This is the primary reason why standards almost invariably result from specific government action, whether it be the establishment of a given standard in order to win government contracts, the requirement that reporting documents be provided in a specific format, or the presence of governmental entities in the development of (typically international) standards.

There are significant signs that a sea change is underway throughout much of North America and Europe as government after government has been forced to reassert its primacy in the economic sphere after having been largely pushed aside in favor of the private sector. Poor and frequently corrupt business practices have left the financial sector, energy distribution, health care, education and so forth effectively ruined, and this has essentially meant that, largely against anyone's desire, government now must work towards the common weal, rather than the investors'.

I don't necessarily see the Obama administration - or other governments that seem to be forming in the ensuing collapse of the global financial markets - as all that much more "socialist" than what they replaced. However, what does seem to be a common thread is both the recognition that transparency in business is becoming far more of a necessity, and that the role of government is not necessarily to facilitate business alone but to act more as a referee, assuring that business practices benefit, or at least do not harm, the public weal.

What this means in practice is that certain kinds of documents - what I term life-cycle portfolios, or LCPs - are going to start being very important. The term document here is a little misleading. When you talk about a person's medical records, for instance, you are seldom talking about a single document. Instead, you're talking about a collection of related documents - personal medical history, doctor/patient encounters, insurance transactions, lab reports and so on. If you think of all of these things as documents within a portfolio, however, then it is the portfolio that needs to be standardized.
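A portfolio in this sense might be sketched as a manifest document that simply points at its member records. The element names, IDs, and file paths below are purely hypothetical illustrations, not any existing standard:

```python
import xml.etree.ElementTree as ET

# A hypothetical life-cycle portfolio: one manifest element that
# aggregates references to the individual records it comprises.
portfolio = ET.Element("portfolio", id="patient-12345", type="medical")
for kind, href in [
    ("history", "records/history.xml"),
    ("encounter", "records/encounter-2009-01-07.xml"),
    ("lab-report", "records/lab-2009-01-09.xml"),
    ("insurance-transaction", "records/claim-8872.xml"),
]:
    # Each member record stays a separate document; the portfolio
    # only carries typed links to them.
    ET.SubElement(portfolio, "document", type=kind, href=href)

print(ET.tostring(portfolio, encoding="unicode"))
```

The point of the sketch is that standardizing the portfolio means standardizing the manifest and its link types, not every record format at once.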

This is a Big Data problem, but it is also a network distribution problem and a security problem. There are a lot of companies, of course, that would like to be owners of these portfolios, as well as the ones to determine the standards, but the role of the government is to ensure that any standard that does emerge is one that is not unduly beneficial to any one company.

Another role that government must play in this regard is establishing a blind portfolio standard - one in which the owner of the portfolio is essentially made anonymous, at least to normal analytics techniques. There is a wealth of data that can be mined from medical records that is of enormous value to pharmaceutical companies, biomedical researchers, public health officials and others, but a system will need to be developed that protects the rights of the subjects of those medical records, and that also gives those same subjects the ability to opt out of such protection for specific researchers.
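As a minimal sketch of what blinding might involve - the field names and salt handling here are my own assumptions, not any actual standard - direct identifiers are dropped and the subject ID is replaced with a salted one-way hash, so a researcher can still correlate records belonging to the same subject without learning who that subject is:

```python
import hashlib

# Fields treated as directly identifying; a hypothetical list.
IDENTIFYING_FIELDS = {"name", "address", "phone", "ssn"}

def blind_record(record, salt):
    """Strip direct identifiers and replace the subject ID with a
    salted one-way hash, preserving linkability but not identity."""
    blinded = {k: v for k, v in record.items() if k not in IDENTIFYING_FIELDS}
    token = hashlib.sha256((salt + record["patient_id"]).encode()).hexdigest()
    blinded["patient_id"] = token
    return blinded

record = {"patient_id": "12345", "name": "J. Doe", "diagnosis": "J45.9"}
print(blind_record(record, salt="registry-secret"))
```

A real standard would of course need far stronger guarantees than this (salt custody, re-identification resistance, audit trails), but the shape of the transformation is the same.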

This is a program that a number of companies have tried, though with limited success, in great part because the laws that exist make it remarkably difficult for even patients to get access to their own medical records. Yet a significant part of the rising costs in health care can in fact be directly attributed to this growing impedance mismatch between record formats, questions of ownership and access, and a patchwork of laws and regulations that have failed to keep up with advances in information technology.

Standards such as HL7 v3 could go a long way toward solving these problems, but without a formal mandate from the government that such standards be used wherever the government has some interest, HL7 will be adopted only in fits and starts - and usually in ways that reduce the standard to a pseudostandard, ripe for the rise of "Value Added Resellers" who exist only to transform one company's HL7 into another's, taking a cut of each transaction.

I bring up medical records here because they provide a fairly intuitive example of life-cycle portfolios, but the same holds true for business reporting. A business LCP is essentially a snapshot of a company's financial health, and like a patient LCP it actually consists of a running record (perhaps quarterly, perhaps in time even daily) of a company's cash flows, resource base, capitalization and market performance.

Languages such as XBRL could certainly document much of this, and again one of the big struggles yet to be resolved is the balance between how much information a company is required to expose, and to which agencies. Certainly, blind business LCPs could be used to perform economic research without necessarily exposing business advantages to competitors, while at the same time protecting stockholders (and perhaps helping to identify and penalize criminal trading activity).

The move towards an XBRL document or similar LCP as a business analytics document (rather than just a glossy, data-opaque annual report) would also make it easier for analysts to recommend healthy companies or short ailing ones (which should in turn strengthen the typical corporate balance sheet, and hence the economy overall), for regulators to spot discrepancies, and for economists to better spot trends and movements in the economy.

The role of the government in this case may vary, though certainly it has the potential to be an aggregator of links to such resources. This is a novel idea, and one that I don't think many people have fully explored. One of the big problems with distributed data grids is the question of ownership of documents of record. Most people implicitly distrust the notion that the government should be the holders of these critical documents, and from an architectural standpoint, centralized data management in a distributed environment often defeats the purpose of the distribution in the first place.

What I could see emerging instead is the notion that holders of record would essentially be autonomous agents acting as data repositories for a limited group of clients. A large enterprise, for instance, might act as its own holder of record, while small businesses might join either a registry hosted by the state or other jurisdiction that holds their business license, or a private repository that provides such services. An HMO or insurance company might act as the holder of record for its customers, and so forth. In this case, the role of the government would be to establish a common set of APIs (REST- and SOAP-based) for accessing content at various levels.
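A common API of this kind would presumably boil down to a small set of uniform resource patterns that every holder of record implements, whoever they happen to be. The URL templates below are hypothetical; the matcher just illustrates how one shared pattern could serve a state registry, an enterprise, or a private repository alike:

```python
def match(template, path):
    """Match a concrete path against a URL template such as
    /lcp/{holder}/{subject}, returning the bound parameters,
    or None if the path does not fit the template."""
    tparts = template.strip("/").split("/")
    pparts = path.strip("/").split("/")
    if len(tparts) != len(pparts):
        return None
    params = {}
    for t, p in zip(tparts, pparts):
        if t.startswith("{") and t.endswith("}"):
            params[t[1:-1]] = p  # bind the template variable
        elif t != p:
            return None  # literal segment mismatch
    return params

# The same template works for any holder of record.
print(match("/lcp/{holder}/{subject}", "/lcp/state-registry/biz-4471"))
# → {'holder': 'state-registry', 'subject': 'biz-4471'}
```

The value of government defining the template rather than any one vendor is exactly the point made above: no single company's path structure becomes the de facto toll gate.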

What I see emerging is a DNS-like system for accessing these LCPs as documents and linkages, perhaps tied into something like an OpenID-type system for performing authentication at various levels. The XML in question would be representations of various parts of the LCP, produced through appropriate transformation filters: an investor might get the LCP rendered as a PDF report; a researcher (with appropriate credentials) might receive it as a sanitized XML document that aggregates the various subdocuments into a single report; a regulator would receive the government-access-level LCP; and the CFO of the company could get graphical representations of critical processes, a published report, a spreadsheet, or even the "raw" XML of the company's "books". It becomes the responsibility of the business in question, however, to assure that such information is up to date and accurate (which will likely have the effect of incorporating it into their production pipeline - something that should certainly help the ailing IT industry as well).
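The filtering step can be sketched as a mapping from credential level to the parts of the LCP a requester is allowed to see. The role names and section names here are hypothetical stand-ins for whatever an actual access standard would define:

```python
# Which sections of an LCP each hypothetical role may see.
ACCESS_LEVELS = {
    "investor":   {"summary"},
    "researcher": {"summary", "aggregates"},
    "regulator":  {"summary", "aggregates", "transactions"},
    "owner":      {"summary", "aggregates", "transactions", "raw_books"},
}

def view(lcp, role):
    """Return only the sections of the portfolio that the given
    role's credentials permit; unknown roles see nothing."""
    allowed = ACCESS_LEVELS.get(role, set())
    return {k: v for k, v in lcp.items() if k in allowed}

lcp = {"summary": "S", "aggregates": "A", "transactions": "T", "raw_books": "R"}
print(sorted(view(lcp, "researcher")))  # → ['aggregates', 'summary']
```

In practice the "view" would be a transformation (XSLT to PDF, sanitized XML, and so on) rather than simple key filtering, but the authorization decision has this shape either way.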

One final point - intrastructure is infrastructure. Once the infrastructure is in place for these LCPs, I think it becomes increasingly advantageous for businesses to use that same infrastructure to make other forms of data available - in ways that may still benefit the company but that won't keep the data locked up in silos, each accessible only via one of a thousand unique APIs. This should also help considerably with the likely explosion of information to come from this process: by creating a uniform standard for data sharing, you also lay the foundation for frameworks that can use this information more readily.

It should be an interesting decade.

Kurt Cagle is an Online Editor for O'Reilly Media. You can subscribe to his Atom Feed or follow him on Twitter.


2 Comments

Very good article. Points to the ever-increasing importance of information security and data governance, as we enter the Web 3.0 era.

All this will turn out quite well (eventually). We all know that data/information "quality" is directly proportional to data/information "use". Therefore, it follows that secure syndication of managed information across a semantic web architecture will help us get there.


Kurt - Great series of insights here and thanks for the shout-out. I agree with you: data standards & APIs are public goods. They're the roads, bridges, and postal codes of an information society, and best provided by the public and non-profit sector.

Here's hoping the incoming Obama administration will take the lead in further liberating the Big Data that existing agencies -- SEC, Census, Commerce, Bureau of Labor Statistics, CBP, and others -- already collect.

Mike D.
