eGov Watch: The Importance of Data.Gov

By Kurt Cagle
March 26, 2009 | Comments: 14

The Illinois River is a slow-moving, meandering waterway that originates out of Lake Michigan, flows beneath downtown Chicago, then cuts through the rich Illinois topsoil as it wends its way to Peoria (giving the area its distinctive river bluffs), then through the middle of the state until it finally meets the Mississippi River at Alton, Illinois, on the Missouri border. Given where it begins and ends, the Illinois sees a lot of river traffic, from barges laden with grain to shipping containers to steam-powered paddle-wheel boats that evoke memories of Mark Twain.

When I went to high school in Peoria, my mother worked for the Army Corps of Engineers, where one of her responsibilities (though far from the only one) was to maintain the "river watch". Three times a day she'd go down to the river's edge, lower what amounted to a large dipstick into the river, then record such information as the river height, the degree of turbidity, the depth of the silt, and so forth. This information would be sent to a central dispatcher, who would combine it with similar readings from up and down the river and use it to create a real-time "map" of the state of the river.

This map had a lot of consumers. River barge masters, of course, needed to know which parts of the river were too low to navigate, so that they could keep from beaching on a sand bar - or know to stay tied up in dock until the river levels were higher. The Department of Agriculture needed this information to measure topsoil erosion (which, not surprisingly, ended up as silt in the river) so it could warn the farmers most at risk to put up windbreaks or take similar erosion-prevention measures. The National Weather Service needed this information during flood times to issue alerts, FEMA needed it to help coordinate disaster relief when those floods occurred, and insurance companies would ultimately use this information in myriad ways.

Those who work in IT have tended over the last couple of decades to focus on private-sector data - business information, marketing data about users of social networking sites, stock and bond trading information, and so forth. This information is obviously useful, yet as the shift is made from stand-alone applications to Internet services, one of the increasingly pressing questions being asked by IT departments is "what kind of data can be monetized?"

The reality is that, for all of the data that the private sector produces, outside of a fairly limited scope that data is actually of little use for monetization purposes. There are only so many ways you can slice online demographic data, and there's a certain irony in the fact that, despite concerns for privacy, the actual value of individual marketing data is close to zero, simply because it is measured by so many different competing agents.

On the other hand, one of the critical roles of any government is to monitor the state of its region of jurisdiction (and the regions it interacts with). This isn't usually glamorous work. The cost of measuring the level of a river is generally higher than any immediate gain to be made by monetizing that data, so it's not something that most private companies will rush to do, though in the aggregate that data has immense value to a wide range of consumers.

And therein lies an important point - the matrix of engineers, researchers, auditors, court recorders, census takers, and so forth forms the nervous system of the body politic. Together, they provide a huge amount of information about the state of the system, which ultimately is useful for everyone, and they do so not because of any immediate short-term profit motive, but rather to ensure the much longer-term health and welfare of the people who live in the country, state, county, and so on.

Yet for a variety of reasons, a disconcertingly large amount of that information is inaccessible via the Internet. The last decade has been a real struggle for the people who work in this network, especially when ideological concerns trumped the need for gathering information, leading to severe underfunding of many of these agencies. A big push to privatize government functions, starting in the 1970s and early 1980s, often resulted in companies making what had been public information proprietary, which frequently had the effect of limiting access to this information to only the most well-heeled.

Paradoxically, this also often proved fairly disastrous to both the companies and the organizations that authorized the privatization (usually to save costs). It gave these companies local monopoly power, which was frequently abused; it resulted in reduced service as the privatizing companies realized just how diffuse and low-profit the information gathering was (especially a concern when you need to show quarterly profits); and in many cases it ended with angry citizens filing lawsuits in order to recover and resuscitate damaged systems.

Beyond that, this all occurred against the backdrop of rapid technological change. The amount of information that a country like the United States or Canada produces is staggering, but in all too many cases, that information ends up in databases that grow increasingly antiquated and typically exist outside of any real network. Limited manpower and low IT budgets restrict the ability to make that data available, even where putative political mandates to do so exist, in great part because the databases were simply not set up to provide that data except in a very limited way.

What Obama's Data.gov initiative will do is both simple in concept and stunning in implication. It is data housekeeping. It is a set of requirements, established by federal CIO Vivek Kundra, that will make it possible to establish a web services infrastructure exposing at least partial representations of these databases as streams of XML. This isn't just about making the political process more transparent - it is about making the entire information-gathering apparatus of the United States more transparent.

It's my hope (and looking at the initial reports I'm feeling optimistic about this) that most of these interfaces will be RESTful in nature - exposed via HTTP GET, POST, PUT and DELETE operations - in essence, making such information directly available as XML and JSON content, as appropriate. For now, GET operations will likely predominate - providing the ability to see this information in the first place, and perhaps to query it so that the content can be filtered down to just what is relevant. Even with GET alone, this will "light up" the government's own noosphere to the Internet, making it possible to create "mashups" that correlate various data feeds into various applications.
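
To make that concrete, here is a minimal sketch of what consuming such a RESTful GET interface might look like. The endpoint URL, the "since" parameter, and the element names are all hypothetical stand-ins; the actual Data.gov services will define their own URLs and schemas.

```python
import urllib.request
import xml.etree.ElementTree as ET

# Hypothetical RESTful endpoint for river gauge readings. The URL, the
# "since" parameter, and the element names are illustrative only.
FEED_URL = "http://example.data.gov/rivers/illinois/gauges?since=2009-03-01"

with urllib.request.urlopen(FEED_URL) as response:
    document = ET.parse(response)

# Pull each reading's station name, stage height, and observation time
# out of the (assumed) XML payload.
for reading in document.iterfind(".//reading"):
    station = reading.findtext("station")
    stage = float(reading.findtext("stage"))
    observed = reading.findtext("observed")
    print(f"{observed}  {station}: {stage:.2f} ft")
```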

For instance, it's not hard to take the "Illinois River data", retrieved perhaps as a geo-encoded Atom feed, and transform it into a Google Earth KML file in order to "tour" potential trouble points for barge navigators. Ag Dept. erosion data as XML can be transformed to create overlays showing erosion patterns, providing a clear visual to farmers and urban planners about where potential problems are (such as areas where mudslides or settling earth could cause significant damage). Land-use permits filed with city or county zoning authorities could also be tied in to show where wetland problems may exist or where such actions could increase erosion - not just for the officials who are evaluating them, but for people in the region who may be affected by such actions.
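
A similarly hypothetical sketch of that first mashup: read a geo-encoded (GeoRSS-style) Atom feed and emit a bare-bones KML file that Google Earth can open. The feed URL and the assumption that each entry carries a georss:point element are mine, not part of any announced Data.gov service.

```python
import urllib.request
import xml.etree.ElementTree as ET

ATOM = "{http://www.w3.org/2005/Atom}"
GEORSS = "{http://www.georss.org/georss}"

# Hypothetical geo-encoded Atom feed of river trouble spots.
FEED_URL = "http://example.data.gov/rivers/illinois/alerts.atom"

with urllib.request.urlopen(FEED_URL) as response:
    feed = ET.parse(response)

# Build a minimal KML document with one Placemark per Atom entry.
kml = ET.Element("kml", xmlns="http://www.opengis.net/kml/2.2")
doc = ET.SubElement(kml, "Document")

for entry in feed.iterfind(f".//{ATOM}entry"):
    title = entry.findtext(f"{ATOM}title", default="")
    point = entry.findtext(f"{GEORSS}point", default="")
    if not point:
        continue
    lat, lon = point.split()              # georss:point is "lat lon"
    placemark = ET.SubElement(doc, "Placemark")
    ET.SubElement(placemark, "name").text = title
    ET.SubElement(ET.SubElement(placemark, "Point"),
                  "coordinates").text = f"{lon},{lat}"  # KML wants lon,lat

ET.ElementTree(kml).write("illinois_alerts.kml",
                          xml_declaration=True, encoding="UTF-8")
```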

Yet for all the value that GET operations have here, the ability to POST to those URLs in order to create new records will be even more important. While it is possible to lay out a network of sensors to perform certain measurements, it's worth noting that the vast majority of the information the government collects is recorded by people. Here in Victoria, BC, there are two very popular programs - a count of flowers by type in the region and a count of birds by type.

The bird count in particular is worth mentioning because, beyond making people more aware of the avian diversity in the region, it provides a critical snapshot of the area's biological system - information that can be used to study wildlife trends and highlight potential problems. Suppose you enabled an army of such volunteers to do this count with GPS-enabled phones connected to a government web service, so that people could record the exact position, species, number, actions, and habitat of each of those sightings. Made freely available as XML data, such a database could do everything from showing where recovery programs for endangered species are or aren't working to setting up an early warning system for avian flu.
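
To sketch the write side of that scenario: a GPS-enabled phone (or any other client) could POST each sighting to a collection URL. Everything below - the endpoint, the field names, and the use of JSON rather than XML - is a hypothetical illustration of the pattern, not a description of any existing service.

```python
import json
import urllib.request

# Hypothetical collection URL for citizen-science bird sightings.
SIGHTINGS_URL = "http://example.data.gov/wildlife/bc/sightings"

# One observation, as it might be captured on a GPS-enabled phone.
sighting = {
    "species": "Cyanocitta stelleri",     # Steller's jay
    "count": 3,
    "latitude": 48.4284,                  # Victoria, BC
    "longitude": -123.3656,
    "habitat": "urban garden",
    "observed": "2009-03-26T08:15:00-08:00",
}

request = urllib.request.Request(
    SIGHTINGS_URL,
    data=json.dumps(sighting).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)

with urllib.request.urlopen(request) as response:
    # A RESTful service would typically answer 201 Created and point to
    # the newly created record in its Location header.
    print(response.status, response.headers.get("Location"))
```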

In many respects, this is the real power of participatory government. An initiative such as Data.gov harnesses the power of people to provide both the data and the means for using that data. It provides information that companies can use to build new businesses around solving real problems with private-sector ingenuity, rather than acting simply as gatekeepers that make money by keeping such data scarce and expensive. It enables a nuanced view of the world that's informed by context, making it easier to avoid building up unbalanced situations that can cripple an economy or cause a disaster, and it can help in the allocation of scarce resources at the planning stage, rather than when a project is already well underway and changes are costly.

For IT professionals (especially those of us in the XML community), this should also be seen as a call to arms. This will not be an easy process to achieve - it requires hard-won expertise and a commitment to both open data and open standards, and at least in the short term it should be seen not as a chance to line pockets but rather as a once-in-a-lifetime opportunity to fundamentally shape, for the better, the world in which our children and grandchildren will grow up.

Kurt Cagle is Online Editor for O'Reilly Media and managing editor for XMLToday.org. XMLToday Atom Feed, O'Reilly Atom Feed, Twitter.

14 Comments

When the U.K. Govt. showed a GIS application (pre-Google Earth) that could report the soil type, nearest drain, and regional rainfall for every mile of road, they could almost predict where the potholes would be and which drains were going to stop up. So smart people got together here to create the Data Interchange for Geotechnical and Geoenvironmental Specialists (the URL dates from back in the day when "Modeling Language" was the moniker).

If we don't run out of money, we too may get those potholes filled faster.

--Hank

http://www.diggsml.com/

This article was too long for me to gain anything from it. Can you get to the point?

Nicely written article. I like the way you presented it, but it is quite descriptive. Thanks for sharing.
regards
GIS data conversion
Georectification

I enjoy your articles. I find them informative and thought provoking.

The title is right, but the meaning of "accessible" might give the wrong impression. The real danger is that data will only be accessible via some search query, and only tiny bits can be retrieved at a time. I see this happen all too often - take campaign finance data, for example. In one case, you can make a query and extract a spreadsheet for a certain kind of data from one candidate. What you can't get is all the new transactions that have been filed in the last day, even though this could easily be written out in a tab-delimited text file. The data is hard to get (you need to download everything in thousands of queries and then sort out the new stuff), and hard to parse.

XML, JSON, fancy GETs and PUTs are no solution - they could be a diversion. Even better than XML and API is a gzipped tab-delimited text file with standardized columns, downloadable via normal HTTP/FTP protocols. No CGIs to write, and trivial software to read. It's most important to provide raw data and incremental data (so changes can be derived), then let others implement the best GUI. Unfortunately, data suppliers try to do the GUI and then block data access.

The other major problem I've seen is that the names of organizations are typically abbreviated and spelled differently, so you can't correlate one data file with another. For example, if there is a bill with budget data for an agency, you can't match that up with the budget data of the agency itself because it's not spelled the same, and the hierarchic organization might be undefined so you can't see how it fits into the parent organization's budget.

Nice article.

Making the data GET-able will be a significant step, but it leaves a big gap.

Suppose I want to ask a new question, such as "what CDC departments and people work on possible long term health problems resulting from Katrina?".

There's no way all such possible questions can be anticipated so that programs can be written to answer them using traditional coding.

On the other hand, there is some new technology that can support an evolutionary, social network approach. It's a free Wiki for writing, in executable English, the knowledge needed to link the questions to the data to answer the questions.

Google: "executable English".

Where is the part of the Illinois River shown on the picture? It looks very, very nice...

Ugi,

I'm not precisely sure. I think it's the bluffs just north of Peoria; there are portions that might look like this from the right view. I pulled the picture from an image service, but there are definitely areas that look similar to this.

The pic is from the Illinois River in Oregon. My mom showed me this, so I immediately started looking into the topo maps to check the location, and it doesn't match up. Also, the water color isn't that of a modern-day Midwest meandering river. It'd be nice to have a slice of Oregon here, but can't complain with the Illinois River in Illinois : )

Do you think that now, ten months in, data.gov has lived up to expectations?

I ask as I am currently writing a short analysis of data.gov, what it aims to do, etc.
It would be helpful to see people's views...

I'm interested too. Is it now RESTful in nature? Plain Web Services?

I'm not so sure about the above comment, "Even better than XML and API is a gzipped tab-delimited text file with standardized columns". The structure provided in XML takes up space but definitely serves a purpose for more complex data.

I'm wondering where exactly in the state of Illinois you took your picture of the Illinois River? The terrain and trees don't look consistent for the midwest. In fact, the image is reminding me a lot of the Illinois River in Oregon.

The picture does not look like the Illinois River in Illinois. Is not that picture taken in Oregon?
