PyMOTW: robotparser

By Doug Hellmann
June 21, 2009

robotparser – Internet spider access control

Purpose: Parse the robots.txt file used to control Internet spiders.
Python Version: 2.1.3 and later

robotparser implements a parser for the robots.txt file format, including a simple function for checking whether a given user agent can access a resource. It is intended for use in well-behaved spiders and other crawler applications that need to be throttled or otherwise restricted.

Note

The robotparser module has been renamed urllib.robotparser in Python 3.0. Existing code using robotparser can be updated using 2to3.
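For reference, here is a minimal sketch of the same kind of check under Python 3, written for this note rather than taken from the original examples; it uses the renamed module together with urllib.parse.urljoin.

from urllib import robotparser
from urllib.parse import urljoin

URL_BASE = 'http://www.doughellmann.com/'

parser = robotparser.RobotFileParser()
parser.set_url(urljoin(URL_BASE, 'robots.txt'))
parser.read()

# can_fetch() behaves the same way as in the Python 2 examples below.
print(parser.can_fetch('PyMOTW', '/admin/'))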

robots.txt

The robots.txt file format is a simple text-based access control system for computer programs that automatically access web resources (“spiders”, “crawlers”, etc.). The file is made up of records that specify the user agent identifier for the program followed by a list of URLs (or URL prefixes) the agent may not access.

This is the robots.txt file for http://www.doughellmann.com/:

User-agent: *
Disallow: /admin/
Disallow: /downloads/
Disallow: /media/
Disallow: /static/
Disallow: /codehosting/

It prevents access to some of the expensive parts of my site that would overload the server if a search engine tried to index them. For a more complete set of examples, refer to The Web Robots Page.

Simple Example

Using the data above, a simple crawler can test whether it is allowed to download a page using the RobotFileParser's can_fetch() method.

import robotparser
import urlparse

AGENT_NAME = 'PyMOTW'
URL_BASE = 'http://www.doughellmann.com/'
parser = robotparser.RobotFileParser()
parser.set_url(urlparse.urljoin(URL_BASE, 'robots.txt'))
parser.read()

PATHS = [
    '/',
    '/PyMOTW/',
    '/admin/',
    '/downloads/PyMOTW-1.92.tar.gz',
    ]

for path in PATHS:
    print '%6s : %s' % (parser.can_fetch(AGENT_NAME, path), path)
    url = urlparse.urljoin(URL_BASE, path)
    print '%6s : %s' % (parser.can_fetch(AGENT_NAME, url), url)
    print

The URL argument to can_fetch() can be a path relative to the root of the site or a full URL.

$ python robotparser_simple.py
  True : /
  True : http://www.doughellmann.com/

  True : /PyMOTW/
  True : http://www.doughellmann.com/PyMOTW/

 False : /admin/
 False : http://www.doughellmann.com/admin/

 False : /downloads/PyMOTW-1.92.tar.gz
 False : http://www.doughellmann.com/downloads/PyMOTW-1.92.tar.gz


Long-lived Spiders

An application that takes a long time to process the resources it downloads, or that is throttled to pause between downloads, may want to check for new robots.txt files periodically based on the age of the content it has already downloaded. The age is not managed automatically, but the convenience methods mtime() and modified() make tracking it easier.

import robotparser
import time
import urlparse

AGENT_NAME = 'PyMOTW'
parser = robotparser.RobotFileParser()
# Using the local copy
parser.set_url('robots.txt')
parser.read()
parser.modified()  # record the time the file was read

PATHS = [
    '/',
    '/PyMOTW/',
    '/admin/',
    '/downloads/PyMOTW-1.92.tar.gz',
    ]

for path in PATHS:
    print
    age = int(time.time() - parser.mtime())
    print 'age:', age,
    if age > 1:
        print 're-reading robots.txt'
        parser.read()
        parser.modified()
    else:
        print
    print '%6s : %s' % (parser.can_fetch(AGENT_NAME, path), path)
    # Simulate a delay in processing
    time.sleep(1)

This extreme example downloads a new robots.txt file if the one it has is more than 1 second old.

$ python robotparser_longlived.py

age: 0
  True : /

age: 1
  True : /PyMOTW/

age: 2 re-reading robots.txt
 False : /admin/

age: 1
 False : /downloads/PyMOTW-1.92.tar.gz


A “nicer” version of the long-lived application might request the modification time for the file before downloading the entire thing, as sketched below. On the other hand, robots.txt files are usually fairly small, so it is not much more expensive to just grab the whole document again.
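One way to implement that check, sketched here as an addition rather than taken from robotparser itself, is to issue a HEAD request and compare the server's Last-Modified header against parser.mtime(). The robots_changed_since() helper is hypothetical, and it assumes the server returns a parseable Last-Modified header.

import httplib
import email.utils

def robots_changed_since(host, mtime, path='/robots.txt'):
    """Return True if robots.txt on host appears newer than mtime."""
    conn = httplib.HTTPConnection(host)
    conn.request('HEAD', path)
    response = conn.getresponse()
    last_modified = response.getheader('last-modified')
    conn.close()
    if last_modified is None:
        # No header to compare against, so assume the file changed.
        return True
    return email.utils.mktime_tz(email.utils.parsedate_tz(last_modified)) > mtime

# Re-read only when the server reports a newer file.
if robots_changed_since('www.doughellmann.com', parser.mtime()):
    parser.read()
    parser.modified()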

See also

robotparser
    The standard library documentation for this module.
The Web Robots Page
    Description of the robots.txt format.
PyMOTW Home
    The canonical version of this article.

