Greetings from the DC's kitchen. Mind you don't bump your heads on the hanging pans. I know. So much clutter.
Back in the days of the Y2K scare, as I emerged from the sous-chef ranks, I lost count of how often I heard "so this COBOL business, how can we be so beholden to such a crufty, 40-year-old missing link of a language?" A few times in my career I've had a glimpse into pantries packed with towering stacks of statistical data in labs and agencies all over the world. Recently I've had the unfortunate "stump the chef" challenge of dealing with just such a towering stack, and hacking it into something resembling a data format for the modern palate. I spent all that time grumbling to myself "so this SPSS business, how can we be so beholden to such a crufty, 50-year-old missing link of a language?"
One benefit to SPSS's pedigree (like many a big-name wine, it's all terroir, no taste) is that there is a huge number of tools and tips for dealing with the data format. Unfortunately, most of them are written for SPSS users with licensed copies. For example, there are many tips for using Python with SPSS, now that the developers have picked Python as its core scripting language rather than its ancient, legacy BASIC. The problem is that most of those tips tend to begin with "import spss" or "import SpssClient". You guessed it. You need a licensed copy to do that. Strictly by leave of the maître d'hôtel.
I did manage to serve up the fruit of a few diverse sets of SPSS files, and knowing how widespread this legacy is, here are a few tips, and a pointer to the recipe I ended up using.
The first thing is that you should make sure you have an SPSS portable format file (usually ending in .por). The default SPSS save format (.sav) is much more tightly wedded to the version and platform of the original SPSS program. I was also able to recover some useful information from .sps syntax files, so I'd say it's best if you have both .por and .sps for a given data set, though for my recipe you don't need the .dat files that usually have the meat of the data set referenced in the .sps.
The first thing I tried was GNU PSPP, but it was able to open none of the .por and .sps files I had. Apparently, had it been able, I'd have a nice cream of CSV as simply as:
...select variables here...
But I was having no such luck, and moved swiftly on to the next utensil. In the end, the R project suited well. I've heard that R's SPSS import code originally came from PSPP, but it seems to have evolved far beyond its origins. I installed the needed R bits on Debian as follows:
aptitude install littler
aptitude install r-cran-hmisc
There are several tools, official, and from across the community, for importing and exporting data to/from R. I started out trying the read.spss routine in the foreign library, but soon settled on the spss.get routine in the Hmisc library. There is an XML import/export module, but to be honest, it didn't look like anything I'd want to touch with a ten-foot pizza peel, and I'm an XML geek.
The incantation I settled on was:
library(Hmisc)  # provides spss.get
mydata <- spss.get(file='foo.por')
Which puts out a very nice bouillon of semi-colon-delimited "CSV".
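If you then want to pick up that semicolon-delimited output on the Python side, a minimal sketch might look like the following. The sample text and the read_semicolon_csv name are made up for illustration, and I'm assuming the first row holds the column names.

```python
import csv
import io


def read_semicolon_csv(text):
    """Parse semicolon-delimited text with a header row into a list of dicts."""
    reader = csv.DictReader(io.StringIO(text), delimiter=';')
    return list(reader)


# Made-up sample standing in for the real export.
SAMPLE_CSV = 'AGE;SEX\n34;Male\n27;Female\n'

rows = read_semicolon_csv(SAMPLE_CSV)
for row in rows:
    print(row['AGE'], row['SEX'])
```

Every value comes back as a string, so any numeric conversion is left to whoever is plating the dish.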
Unfortunately, neither spss.get nor read.spss seems to capture the VARIABLE LABELS and VALUE LABELS sections from SPSS files, so the output can be a bit cryptic. I gave up trying to prod these out of the .por, and ended up just applying a Python/regex recipe to get the labels by brute force from the .sps. An excerpt:
import re
# Raw strings keep the regex backslashes (\s, \w, \.) intact
VAR_PAT = re.compile(r'VARIABLE\s+LABELS\s+(((\w+)\s+"([^"]+)"\s*)+)\.')
VALUE_PAT = re.compile(r'VALUE\s+LABELS\s+((/(\w+)\s+(\'(\w+)\'\s+"([^"]+)"\s*)+)+)\.')
Getting each definition takes a few additional steps, because with regex groups that match multiple substrings, you can only access the last match. I ended up with an Akara module to put this all together as a RESTful service. The heart of the module is the parse_spss function, which takes .por content and optional .sps content, returning basic Python data structures. You can trim any imports not needed by that function, if you choose.
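To make those "few additional steps" concrete, here is a rough sketch of one way to do it, not the actual parse_spss code: match the whole block first, then re-apply a smaller pattern to the captured text to recover every pair. The outer patterns mirror the excerpt above; the helper names and the sample .sps text are made up for illustration, and the sample assumes the same /VAR 'code' "label" layout those patterns expect.

```python
import re

# Outer patterns: locate a whole block and capture its body.
VAR_BLOCK = re.compile(r'''VARIABLE\s+LABELS\s+((?:\w+\s+"[^"]+"\s*)+)\.''')
VALUE_BLOCK = re.compile(
    r'''VALUE\s+LABELS\s+((?:/\w+\s+(?:'\w+'\s+"[^"]+"\s*)+)+)\.''')

# Inner patterns: re-applied to a captured body to get every repetition,
# since a repeated group only remembers its last match.
VAR_ITEM = re.compile(r'''(\w+)\s+"([^"]+)"''')
VALUE_GROUP = re.compile(r'''/(\w+)\s+((?:'\w+'\s+"[^"]+"\s*)+)''')
VALUE_ITEM = re.compile(r''''(\w+)'\s+"([^"]+)"''')


def variable_labels(sps_text):
    """Return {variable name: label} from all VARIABLE LABELS blocks."""
    labels = {}
    for block in VAR_BLOCK.finditer(sps_text):
        labels.update(VAR_ITEM.findall(block.group(1)))
    return labels


def value_labels(sps_text):
    """Return {variable name: {code: label}} from all VALUE LABELS blocks."""
    result = {}
    for block in VALUE_BLOCK.finditer(sps_text):
        for var, body in VALUE_GROUP.findall(block.group(1)):
            result.setdefault(var, {}).update(VALUE_ITEM.findall(body))
    return result


# Hypothetical .sps fragment, invented for this example.
SAMPLE = '''
VARIABLE LABELS
  AGE "Age of respondent"
  SEX "Sex of respondent".
VALUE LABELS
  /SEX '1' "Male" '2' "Female"
  /REGION 'N' "North" 'S' "South".
'''

print(variable_labels(SAMPLE))
print(value_labels(SAMPLE))
```

The two-pass approach trades a little redundancy (each pair is matched twice) for patterns that stay readable.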
If you want to give Akara a try (it's currently in very early release stage, but I am successfully using it on a few projects at work), you can drop in the spsstools.py Akara module, allowing you to convert SPSS to JSON as follows:
curl -F "POR=@foo.por" -F "SPSS=@foo.sps" http://localhost:8880/spss.json
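If you'd rather make that request from Python than from curl, a standard-library sketch might look like this. The POR and SPSS field names are read off the curl command above; the build_multipart and post_spss names are my own, and I'm assuming the service accepts an ordinary multipart/form-data upload, which is what curl -F sends.

```python
import urllib.request
import uuid


def build_multipart(files):
    """Encode {field name: (filename, bytes)} as a multipart/form-data body.

    Returns (body, content_type).
    """
    boundary = uuid.uuid4().hex
    parts = []
    for field, (filename, content) in files.items():
        header = (
            '--%s\r\n'
            'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
            'Content-Type: application/octet-stream\r\n\r\n'
            % (boundary, field, filename)
        )
        parts.append(header.encode('utf-8') + content + b'\r\n')
    parts.append(('--%s--\r\n' % boundary).encode('utf-8'))
    return b''.join(parts), 'multipart/form-data; boundary=' + boundary


def post_spss(url, por_path, sps_path=None):
    """POST a .por file (and optionally a .sps file); return the response text."""
    files = {}
    with open(por_path, 'rb') as f:
        files['POR'] = (por_path, f.read())
    if sps_path is not None:
        with open(sps_path, 'rb') as f:
            files['SPSS'] = (sps_path, f.read())
    body, content_type = build_multipart(files)
    request = urllib.request.Request(
        url, data=body, headers={'Content-Type': content_type})
    with urllib.request.urlopen(request) as response:
        return response.read().decode('utf-8')
```

Calling post_spss('http://localhost:8880/spss.json', 'foo.por', 'foo.sps') would then be the rough equivalent of the curl line, assuming the service is running locally.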
There are probably other recipes for tenderizing the toughest SPSS, but eccovi, friends, this is mine.
❧ ❧ ❧
 I got involved in some of the Y2K projects in those crazy days, and was constantly having to balance sensible caution with a "hold the hype" attitude. Indeed, my one moment of infamy in the NY Times (alas, the Business section, not Dining & Wine) comes from a slightly distorted quote of my admission that I'm not entirely above date-related bugs in my own programming. ("Beyond 2000: Further Troubles Lurk in the Future of Computing", The New York Times, July 19, 1999).
 Including our work for the Library of Congress.
 Like the Guinness PLC that became Diageo, I wonder whether the SPSS pedigree will survive their renaming it to PASW. I surely hope not.