Data chef: SPSS Tripe Consommé

By Uche Ogbuji
April 30, 2009 | Comments: 8

Greetings from the DC's kitchen. Mind you don't bump your heads on the hanging pans. I know. So much clutter.

Back in the days of the Y2K scare, as I emerged from the sous-chef ranks[1], I lost count of how often I heard "so this COBOL business, how can we be so beholden to such a crufty, 40-year-old missing link of a language?" A few times in my career I've had a glimpse into pantries packed with towering stacks of statistical data in labs and agencies all over the world. Recently I've had the unfortunate "stump the chef" challenge of dealing with just such a towering stack[2], and hacking it into something resembling a data format for the modern palate. I spent all that time grumbling to myself "so this SPSS business, how can we be so beholden to such a crufty, 50-year-old missing link of a language?"

One benefit to SPSS's pedigree (like many a big-name wine, it's all terroir, no taste[3]) is that there is a huge number of tools and tips for dealing with the data format. Unfortunately, most of them are written for SPSS users with licensed copies to use. For example, there are many tips for using Python with SPSS, now that the developers have picked Python as its core scripting language rather than its ancient, legacy BASIC. The problem is that most of those tips tend to begin with "import spss" or "import SpssClient". You guessed it. You need a licensed copy to do that. Strictly by leave of the maître d'hôtel.

I did manage to serve up the fruit of a few diverse sets of SPSS files, and knowing how widespread this legacy is, here are a few tips, and a pointer to the recipe I ended up using.

The first thing is that you should make sure you have an SPSS portable format file (usually ending in .por). The default SPSS save format (.sav) is much more tightly wedded to the version and platform of the original SPSS program. I was also able to recover some useful information from .sps syntax files, so I'd say it's best if you have both the .por and the .sps for a given data set, though for my recipe you don't need the .dat files that usually have the meat of the data set referenced in the .sps.

The first thing I tried was GNU PSPP, but it was able to open none of the .por and .sps files I had. Apparently, had it been able, I'd have a nice cream of CSV as simply as:


WRITE OUTFILE='spam.csv'
...select variables here...
EXECUTE.

But I was having no such luck, and moved swiftly on to the next utensil. In the end, the R project suited well. I've heard that R's SPSS import code originally came from PSPP, but it seems to have evolved far beyond its origins. I installed the needed R bits on Debian as follows:


aptitude install littler
aptitude install r-cran-hmisc

There are several tools, official, and from across the community, for importing and exporting data to/from R. I started out trying the read.spss routine in the foreign library, but soon settled on the spss.get routine in the Hmisc library. There is an XML import/export module, but to be honest, it didn't look like anything I'd want to touch with a ten-foot pizza peel, and I'm an XML geek.

The incantation I settled on was:


library(Hmisc)
mydata <- spss.get(file='foo.por')
write.csv2(mydata)

Which puts out a very nice bouillon of semi-colon-delimited "CSV".
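
If the next course is prepared in Python, that bouillon needs a little straining, since write.csv2 uses semicolons between fields and a comma as the decimal mark. Here is a minimal sketch of reading it back with the standard csv module. The sample data and the helper names read_csv2 and to_number are made up for illustration, and the sketch assumes R's default leading column of row names is present.

```python
import csv
import io

# Hypothetical sample of what write.csv2 might emit for a tiny data set.
SAMPLE = '"";"AGE";"SCORE"\n"1";34;2,5\n"2";51;3,75\n'

def read_csv2(stream):
    """Read semicolon-delimited text into a list of dicts, one per row."""
    reader = csv.reader(stream, delimiter=';')
    header = next(reader)
    rows = []
    for record in reader:
        # Skip the first column, which holds R's row names.
        rows.append(dict(zip(header[1:], record[1:])))
    return rows

def to_number(text):
    """Turn '2,5' into 2.5; hand back the text unchanged if not numeric."""
    try:
        return float(text.replace(',', '.'))
    except ValueError:
        return text

rows = read_csv2(io.StringIO(SAMPLE))
scores = [to_number(row['SCORE']) for row in rows]
```

In real use you would pass an open file rather than the io.StringIO wrapper used here for the sample.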

Unfortunately, neither spss.get nor read.spss seems to capture the VARIABLE LABELS and VALUE LABELS sections from SPSS files, so the output can be a bit cryptic. I gave up trying to prod these out of the .por, and ended up just applying a Python/regex recipe to get the labels by brute force from the .sps. An excerpt:


import re
# Raw strings keep the regex backslashes away from Python's own string escapes
VAR_PAT = re.compile(r'VARIABLE\s+LABELS\s+(((\w+)\s+"([^"]+)"\s*)+)\.')
VALUE_PAT = re.compile(r'VALUE\s+LABELS\s+((/(\w+)\s+(\'(\w+)\'\s+"([^"]+)"\s*)+)+)\.')

Getting each definition takes a few additional steps, because with regex groups that match multiple substrings, you can only access the last match. I ended up with an Akara module to put this all together as a RESTful service. The heart of the module is the parse_spss function, which takes .por content and optional .sps content, returning basic Python data structures. You can trim any imports not needed by that function, if you choose.
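
Those additional steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the code from the Akara module: the sample .sps text, the function names, and the inner patterns VAR_ITEM_PAT, VALUE_BLOCK_PAT and VALUE_ITEM_PAT are all invented for the example. The idea is to let the outer pattern find the whole section, then run findall over the captured text to recover every repetition rather than only the last one.

```python
import re

# The two patterns from the excerpt above.
VAR_PAT = re.compile(r'VARIABLE\s+LABELS\s+(((\w+)\s+"([^"]+)"\s*)+)\.')
VALUE_PAT = re.compile(r'VALUE\s+LABELS\s+((/(\w+)\s+(\'(\w+)\'\s+"([^"]+)"\s*)+)+)\.')

# Hypothetical inner patterns, applied to the text the outer group captured.
VAR_ITEM_PAT = re.compile(r'(\w+)\s+"([^"]+)"')
VALUE_BLOCK_PAT = re.compile(r'/(\w+)\s+((?:\'\w+\'\s+"[^"]+"\s*)+)')
VALUE_ITEM_PAT = re.compile(r'\'(\w+)\'\s+"([^"]+)"')

# Made-up sample in the shape the patterns expect.
SAMPLE_SPS = '''
VARIABLE LABELS
  Q1 "Favorite soup"
  Q2 "Servings per week".
VALUE LABELS
  /Q1 '1' "Consomme" '2' "Bisque"
  /Q2 '0' "None" '1' "One or more".
'''

def variable_labels(sps_text):
    """Map each variable name to its label."""
    match = VAR_PAT.search(sps_text)
    if match is None:
        return {}
    # group(1) is the whole run of name/label pairs; findall splits it up.
    return dict(VAR_ITEM_PAT.findall(match.group(1)))

def value_labels(sps_text):
    """Map each variable name to a dict of coded value -> label."""
    match = VALUE_PAT.search(sps_text)
    if match is None:
        return {}
    labels = {}
    for var, body in VALUE_BLOCK_PAT.findall(match.group(1)):
        labels[var] = dict(VALUE_ITEM_PAT.findall(body))
    return labels
```

On the sample, VAR_PAT.search(SAMPLE_SPS).group(3) yields only 'Q2', the last repetition, while variable_labels(SAMPLE_SPS) returns both Q1 and Q2 with their labels.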

If you want to give Akara a try (it's currently in very early release stage, but I am successfully using it on a few projects at work), you can drop in the spsstools.py Akara module, allowing you to convert SPSS to JSON as follows:


curl -F "POR=@foo.por" -F "SPSS=@foo.sps" http://localhost:8880/spss.json
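
If you would rather make the same request from Python than from curl, here is a rough sketch using only the standard library. The field names POR and SPSS and the URL come from the curl line above; the helper names encode_multipart and convert_spss are made up for this example, and the sketch assumes the service answers with UTF-8 text. It returns the response body as a string without guessing at the shape of the JSON.

```python
import os
import urllib.request
import uuid

def encode_multipart(files):
    """Build a multipart/form-data body from {field: (filename, bytes)}.

    Returns the body bytes and the matching Content-Type header value.
    """
    boundary = uuid.uuid4().hex
    parts = []
    for field, (filename, content) in files.items():
        head = ('--%s\r\n'
                'Content-Disposition: form-data; name="%s"; filename="%s"\r\n'
                'Content-Type: application/octet-stream\r\n\r\n'
                % (boundary, field, filename))
        parts.append(head.encode('utf-8') + content + b'\r\n')
    parts.append(('--%s--\r\n' % boundary).encode('utf-8'))
    return b''.join(parts), 'multipart/form-data; boundary=%s' % boundary

def convert_spss(por_path, sps_path, url='http://localhost:8880/spss.json'):
    """POST the two files, as curl -F does, and return the response text."""
    files = {}
    for field, path in (('POR', por_path), ('SPSS', sps_path)):
        with open(path, 'rb') as handle:
            files[field] = (os.path.basename(path), handle.read())
    body, content_type = encode_multipart(files)
    request = urllib.request.Request(
        url, data=body, headers={'Content-Type': content_type})
    with urllib.request.urlopen(request) as response:
        # Assumption: the service responds with UTF-8 encoded text.
        return response.read().decode('utf-8')
```

With the Akara service running locally, convert_spss('foo.por', 'foo.sps') should behave like the curl command.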

There are probably other recipes for tenderizing the toughest SPSS, but eccovi, friends, this is mine.

❧ ❧ ❧
[1] I got involved in some of the Y2K projects in those crazy days, and was constantly having to balance sensible caution with a "hold the hype" attitude. Indeed, my one moment of infamy in the NY Times (alas, the Business section, not Dining & Wine) comes from a slightly distorted quote of my admission that I'm not entirely above date-related bugs in my own programming. ("Beyond 2000: Further Troubles Lurk in the Future of Computing", The New York Times, July 19, 1999).

[2] Including our work for the Library of Congress.

[3] Like the Guinness PLC that became Diageo, I wonder whether the SPSS pedigree will survive their renaming it to PASW. I surely hope not.



8 Comments

I have some familiarity with this problem.

SPSS is the dominant software supplier in an industry with many smaller players. Almost all of these have support for an open standard for representing survey data and metadata, called triple-s (http://www.triple-s.org/).
The benefit of triple-s, apart from the fact that the statistical package you wish to use may well support it already, is that the metadata are represented in XML and the case data in ASCII, either fixed column format or delimited files, and this is therefore all very easy to handle.

In the first quarter of this year I too had to deal with processing SPSS data as a source for a package I was developing (http://www.risk-e.net/). I also did not wish to impose the requirement of an SPSS license on users.

The .sav format, although undocumented and version dependent, is the one most often available, and has the most information in it. Fortunately it has not changed much in recent years, and its internal binary format is quite well documented - in the PSPP documentation as it happens - http://www.gnu.org/software/pspp/pspp-dev/html_node/System-File-Format.html#System-File-Format

If we wanted to actually create .sav files, it would be necessary to know much more of the format, but for reading them we can make do with extracting only what we understand.

I wrote a python module to parse .sav files and convert them to triple-s format, which I have used with success on several real-life data sets.

There are still some features of the format which I do not understand, and I don't feel able to commercialize the conversion program as I can't warrant the reliability of something derived from another person's reverse-engineering of the format.

Therefore I have been considering hosting the module as open-source, on Sourceforge perhaps. Then others with perhaps greater knowledge of details of the format can validate or improve it.

It sounds as if I am not the only person irritated with the determination of SPSS to appropriate their customers' data. If feedback suggests it's a worthwhile exercise to publish my code, I'll take the time to do it.

I would be *very* interested in the .sav parsing code! I currently use a command-line invocation of pspp to dump the data and parse it back in, but I wouldn't mind getting rid of that.

Thanks for the insights, Iain.

I think it would be very valuable for you to release your Python code, and I'd be happy to help test and review it. SourceForge, Google Code, or whatever are fine, or if it's just a couple of .py files, consider something even simpler, such as Github or Bitbucket:

http://github.com/
http://bitbucket.org/

Although it is only available for Windows currently, there is a free, no-SPSS license-required i/o module for reading and writing SPSS sav files. You can get it from SPSS Developer Central, www.spss.com/devcentral in the Downloads section.

You have to have a login to download, but those are free to create.

The sav file format is pretty complicated, but the i/o module hides all that.

Which module do I need to download specifically in order to read the sav file?

This apparently does the job. Does it also require Python?

SPSS_IO_Module64bitv17 OR the 32 bit version of SPSS Developer tools.

Hello, can I ask a question? How do I convert an .sps document to a .sav document with SPSS? Thanks

Hey, this was useful! :D

Though in my case, I had a ".sav" file and SPSS solved my problem. It must have synced or evolved during the last year.

Thanks,

ale
~~
