Simplify business research with Google Ajax Search API

By Andrew Odewahn
April 13, 2009 | Comments: 3

Business research usually starts with a list -- brands, competitors, people, products, whatever. Most people begin with Google, trying to discover commonalities, gather basic information, or just find out what all these things are. Then, there's a marathon session of cut and paste, with a hopefully tidy but usually messy document to show for the hours of work.

This post describes a quick Python script that uses the Google Search API to automate the routine parts of the task, giving you more time to analyze and understand the results. Here's what it does:

  • Accepts a list of terms you want to research. These can be whatever you like, as long as there is one term per line.
  • Uses the Google Search API to return the most likely URL and description that Google thinks match the term. (This is sort of like the "I'm Feeling Lucky" button, so you'll still need to double-check the results.)
  • Outputs the results in tab-delimited format so that you can use them in other documents (or scripts).

This diagram should give you the basic idea of what the script does:

[Figure: ano_search_api.png]

Before You Start
You'll need the following stuff to run this script:


  • Python. If you're on a Mac, it's built in. If you're on Windows, you can get it from ActiveState.

  • A JSON processing module for Python. I'm using simplejson in this script (an install command is shown after this list).

  • A list of terms you want to research
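
If you don't already have simplejson, one common way to install it (assuming you have setuptools and its easy_install command on your path) is:

easy_install simplejson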

Overview of the Google Ajax Search API
The Google Search API has a simple REST interface -- you provide a URL with an encoded parameter (called "q") that has your term, and Google returns a JSON structure representing the results. For example, here's how you'd search for "Python":


http://ajax.googleapis.com/ajax/services/search/web?v=1.0&q=python
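
If you want to try this out from Python before running the full script, here's a minimal sketch that builds the same URL (using urllib to escape the "q" parameter) and prints the raw JSON string it returns:

import urllib
import urllib2

# Build the query string, escaping the search term as the "q" parameter
query = urllib.urlencode({'q': 'python'})
url = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s" % query

# Fetch the raw JSON response as a string and print it
print urllib2.urlopen(url).read()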

If all goes well, you'll get back a JSON structure that looks something like this -- the interesting part is the "results" array (I've removed a lot of other fields for clarity):


{
    ...
    "results": [
        {
            "GsearchResultClass": "GwebSearch",
            "cacheUrl": "http://www.google.com/search?q=cache:YSBN_oGSAEYJ:www.python.org",
            "content": "Home page for Python, an interpreted, interactive, object-oriented, extensible programming language. It provides an extraordinary combination of clarity and ...",
            "title": "Python Programming Language -- Official Website",
            "titleNoFormatting": "Python Programming Language -- Official Website",
            "unescapedUrl": "http://www.python.org/",
            "url": "http://www.python.org/",
            "visibleUrl": "www.python.org"
        },
        ...
    ],
    ...
}
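
To pull the best match out of that response, you load the string with simplejson and walk down to the "results" list. Here's a minimal sketch, assuming json_string holds the raw response from the snippet above (the key names come straight from the sample structure):

import simplejson

# Parse the raw response string into Python dictionaries and lists
data = simplejson.loads(json_string)

# The hits live under responseData -> results; the first entry is the best match
results = data['responseData']['results']
if results:
    best = results[0]
    print best['url']
    print best['titleNoFormatting']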

Results are ordered by PageRank, so the first result is the one most likely to match the search term. Obviously, this can break down in lots of ways, but it works surprisingly well for popular brands or products.

The Code
Now that we've gotten the basics out of the way, here's the code for a script called term2url.py:


#
# This is a quick and dirty script to pull the most likely URL and description
# for a list of terms.  Here's how you use it:
#
# python term2url.py < {a txt file with a list of terms} > {a tab delimited file of results}
#
# You must install the simplejson module to use it
#
import urllib
import urllib2
import simplejson
import sys

# Read the terms we want to convert into URLs from stdin (redirected from the command line)
terms = sys.stdin.readlines()

# Now loop through each term in the list and return the highest-ranking result
for term in terms:

    # Define the query to pass to the Google Search API
    query = urllib.urlencode({'q': term.rstrip("\n")})
    url = "http://ajax.googleapis.com/ajax/services/search/web?v=1.0&%s" % (query)

    # Fetch the results and parse them into a JSON structure
    search_results = urllib2.urlopen(url)
    json = simplejson.loads(search_results.read())

    # Process the results by pulling the first record, which has the best match
    results = json['responseData']['results']
    for r in results[:1]:
        url = r['url']
        desc = r['content'].encode('ascii', 'replace')

        # Print the results to stdout. Use redirection to capture the output.
        print "%s\t%s\t%s" % (term.rstrip("\n"), url, desc)

You can download a formatted version from the resources section at the bottom of the post.

Running the Code
Assuming you've installed Python and the simplejson module correctly, running the script is a snap. From the command line, type:


python term2url.py < {input file name} > {output file name}

In my example, the input file is called brands.txt and the output file is called brands_data.txt.

Once you've run the script (it may take a while), you can open the data file with a spreadsheet and format it.
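
To make the formats concrete, here's a hypothetical one-line brands.txt and the kind of tab-delimited line you'd see in brands_data.txt (the URL and description below are taken from the sample response earlier; your actual results will be whatever Google returns that day):

brands.txt:
python

brands_data.txt (columns separated by tabs):
python	http://www.python.org/	Home page for Python, an interpreted, interactive, object-oriented, extensible programming language. ...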

Resources
Here are some resources used in this post:

What Else?
This is a pretty universal technique that you could use for lots of other tasks. What are some of the common research problems you have, and what techniques or tools do you use that you find effective?



3 Comments

Hi Andrew,

I am not a programmer but am trying to use your instructions to get a list of URLs for a list of companies. I have installed ActivePython and simplejson, but I don't know how to tell Python that the simplejson module is there to use.

When I input the info you directed into Python, I get the following. Both .txt files have been created and are in the Documents folder.

>>> python term2url.py brands_data.txt
  File "<stdin>", line 1
    python term2url.py brands_data.txt
                ^
SyntaxError: invalid syntax
>>>

Thank you for your help on this!

Jim Q

Hi, Jim. Looks like you didn't put the "<" between the name of the program (term2url.py) and the file that has the list of terms (brands_data.txt). Try this from the command line:

python term2url.py < brands_data.txt

Google API solution gets outperformed by mapping specialist

I am a web developer and had a two-month project testing both solutions after reading the outcome of the IMFA regarding European business mapping providers.

I noted that the free Google solution took twice as long to develop and had only basic geocoding; everything else (e.g. criteria search, database management) had to be developed from scratch. Also, Google business customers, both paid (up to £7,800 a year) and free (if the solution will not be resold, e.g. for vehicle tracking), have no access to the UK postcode data from the Royal Mail, as Google is not licensed for it (hence the often appalling accuracy), with only 4-digit postcode verification.

On the positive side, Google is a pretty basic platform that is, for the most part, free to use, widely available, and recognised.

The API platform from ViaMichelin (which used a mixture of JavaScript skills) was offered to me on a 45-day free trial and took only a few weeks to complete. Geocoding for address verification was included (ideal for store finders, reserve and collect, etc.), and it gave me access to live human support (to see what else I could do with their API). They provided me a platform with full European coverage and geocoded Ireland, which Google could not offer, for a price cheaper than Google's enterprise and premier tiers.

Bing fell behind when it came to customer support, which was nonexistent; the former Multimap company, now owned by Microsoft, took just over two months to get back to me.

Like for like, the new ViaMichelin API solution wins. For a basic solution, use Google; for businesses looking for real quality, use ViaMichelin.

I still want to see speed bumps and low-bridge notifications on maps as an option.
