Visualizing Stack Overflow's data dump

By Andrew Odewahn
November 18, 2009 | Comments: 1

Stack Overflow releases a monthly XML data dump (CC-licensed) of all the data in their system. (You can read about this at Stack Overflow Creative Commons Data Dump.) Unlike a lot of other data sets that just reflect what developers are buying, this data reflects what developers are actually using and asking questions about, which is pretty cool. I used this dataset to create a topic map that reflects the relationships among the top topics (based on how frequently the topic was used as a tag on a post) for the month of October 2009.

To build it, I first downloaded the raw data from BitTorrent, and then wrote a parser to suck the data into SQLite. Next, I wrote a script to count the number of times topics were used together. So, if someone tagged a post "javascript ajax jquery", there would be 3 links counted: (javascript, ajax), (javascript, jquery), and (ajax, jquery). I then stuffed those counts into Graphviz, which made an image out of them. (NB: you might have to zoom in a bit to read the fonts.)
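As a rough illustration of the pair-counting step, here is a minimal Python sketch. It is not the original script: the function names `count_tag_pairs` and `to_dot` are made up for this example, and it assumes each post's tags arrive as a single space-separated string, as in the "javascript ajax jquery" example. The output is plain Graphviz DOT text for an undirected graph.

```python
from collections import Counter
from itertools import combinations


def count_tag_pairs(tag_strings):
    """Count how often each unordered pair of tags appears on the same post."""
    pair_counts = Counter()
    for tag_string in tag_strings:
        # Deduplicate and sort so (ajax, jquery) and (jquery, ajax) share one key.
        tags = sorted(set(tag_string.split()))
        pair_counts.update(combinations(tags, 2))
    return pair_counts


def to_dot(pair_counts):
    """Render the pair counts as an undirected Graphviz graph in DOT syntax."""
    lines = ["graph tags {"]
    for (tag_a, tag_b), count in sorted(pair_counts.items()):
        # Store the co-occurrence count as the edge weight.
        lines.append(f'    "{tag_a}" -- "{tag_b}" [weight={count}];')
    lines.append("}")
    return "\n".join(lines)


# Hypothetical sample input: two posts and their tags.
posts = ["javascript ajax jquery", "javascript jquery"]
pairs = count_tag_pairs(posts)
dot_text = to_dot(pairs)
print(dot_text)
```

The printed text can be saved to a file and handed to one of the Graphviz layout programs to produce the image.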

It required a bit of filtering to reduce clutter, but what emerged was this picture of the main technology stacks (ha!) for a variety of key areas. For example, Python is strongly tied to Django and google-app-engine, and Java to spring, hibernate, eclipse, swing, and (oddly) homework.
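The post doesn't say exactly how the filtering was done. One plausible approach, sketched below in Python, is to keep only links between the most frequently used tags and to drop links that occur fewer than some minimum number of times. The function name `filter_pairs`, the parameters `top_n` and `min_links`, and the sample numbers are all assumptions for illustration.

```python
from collections import Counter


def filter_pairs(pair_counts, tag_counts, top_n=50, min_links=10):
    """Keep links between the top_n most-used tags that occur at least min_links times."""
    top_tags = {tag for tag, _ in tag_counts.most_common(top_n)}
    return {
        pair: count
        for pair, count in pair_counts.items()
        if count >= min_links and pair[0] in top_tags and pair[1] in top_tags
    }


# Hypothetical counts, not taken from the real data dump.
tag_counts = Counter({"java": 900, "python": 800, "django": 300, "cobol": 2})
pair_counts = Counter({
    ("django", "python"): 250,
    ("java", "python"): 4,
    ("cobol", "java"): 1,
})

kept = filter_pairs(pair_counts, tag_counts, top_n=3, min_links=10)
print(kept)
```

Raising `min_links` or lowering `top_n` thins the graph further, at the cost of hiding weaker relationships.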

Here it is (you'll need to click it to zoom out):


There was a lot of interesting stuff in here. For example, I love the fact that plain old "regex" is the main link between "php" and "c#". Also, it's interesting that only PHP has a direct link to various database topics; I would have expected databases to be more central.

That is really interesting to see. Thanks for generating that.

