PyMOTW: urllib2

By Doug Hellmann
July 19, 2009

urllib2 – Library for opening URLs.

Purpose: A library for opening URLs that can be extended by defining custom protocol handlers.
Python Version: 2.1

The urllib2 module provides an updated API for using internet resources identified by URLs. It is designed to be extended by individual applications to support new protocols or add variations to existing protocols (such as handling HTTP basic authentication).

HTTP GET

Note

The test server for these examples is in BaseHTTPServer_GET.py, from the PyMOTW examples for BaseHTTPServer. Start the server in one terminal window, then run these examples in another.

As with urllib, an HTTP GET operation is the simplest use of urllib2. Simply pass the URL to urlopen() to get a “file-like” handle to the remote data.

import urllib2

response = urllib2.urlopen('http://localhost:8080/')
print 'RESPONSE:', response
print 'URL :', response.geturl()

headers = response.info()
print 'DATE :', headers['date']
print 'HEADERS :'
print '---------'
print headers

data = response.read()
print 'LENGTH :', len(data)
print 'DATA :'
print '---------'
print data

The example server accepts the incoming values and formats a plain text response
to send back. The return value from urlopen() gives access to the headers from
the HTTP server through the info() method, and the data for the remote
resource via methods like read() and readlines().

$ python urllib2_urlopen.py
RESPONSE: <addinfourl at 11940488 whose fp = <socket._fileobject object at 0xb573f0>>
URL : http://localhost:8080/
DATE : Sun, 19 Jul 2009 14:01:31 GMT
HEADERS :
---------
Server: BaseHTTP/0.3 Python/2.6.2
Date: Sun, 19 Jul 2009 14:01:31 GMT

LENGTH : 349
DATA :
---------
CLIENT VALUES:
client_address=('127.0.0.1', 55836) (localhost)
command=GET
path=/
real path=/
query=
request_version=HTTP/1.1

SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0

HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=Python-urllib/2.6
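
If the test server is not running, the same response interface can be explored with a file: URL, since local file access is built in. The following is only a sketch: the temporary file and its contents are made up for illustration, and the guarded imports are an assumption added so the snippet also runs where urllib2 has been folded into urllib.request.

```python
import os
import tempfile

try:
    from urllib import pathname2url                   # Python 2
    from urllib2 import urlopen
except ImportError:
    from urllib.request import pathname2url, urlopen  # Python 3

# Hypothetical scratch file used only for this illustration.
tempdir = tempfile.mkdtemp()
path = os.path.join(tempdir, 'file.txt')
try:
    with open(path, 'wb') as f:
        f.write(b'Contents of file.txt')

    # Open the local file through a URL instead of open().
    response = urlopen('file:' + pathname2url(path))
    headers = response.info()   # header mapping, as with HTTP
    data = response.read()      # body of the resource
    response.close()

    print('URL    : %s' % response.geturl())
    print('TYPE   : %s' % headers['Content-type'])
    print('LENGTH : %s' % headers['Content-length'])
    print('DATA   : %r' % data)
finally:
    os.remove(path)
    os.rmdir(tempdir)
```

The headers here are synthesized by the file handler from the file's size and guessed type, so they are far fewer than a real HTTP server would send.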


The file-like object returned by urlopen() is iterable:

import urllib2

response = urllib2.urlopen('http://localhost:8080/')
for line in response:
    print line.rstrip()

This example strips the trailing newlines and carriage returns before printing the output.

$ python urllib2_urlopen_iterator.py
CLIENT VALUES:
client_address=('127.0.0.1', 55840) (localhost)
command=GET
path=/
real path=/
query=
request_version=HTTP/1.1

SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0

HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=Python-urllib/2.6


Encoding Arguments

Arguments can be passed to the server by encoding them with urllib.urlencode() and
appending them to the URL.

import urllib
import urllib2

query_args = { 'q':'query string', 'foo':'bar' }
encoded_args = urllib.urlencode(query_args)
print 'Encoded:', encoded_args

url = 'http://localhost:8080/?' + encoded_args
print urllib2.urlopen(url).read()

The list of client values returned in the example output contains the encoded query
arguments.

$ python urllib2_http_get_args.py
Encoded: q=query+string&foo=bar
CLIENT VALUES:
client_address=('127.0.0.1', 55849) (localhost)
command=GET
path=/?q=query+string&foo=bar
real path=/
query=q=query+string&foo=bar
request_version=HTTP/1.1

SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0

HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=Python-urllib/2.6
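
The order of the encoded arguments above depends on the dictionary's iteration order. If the order matters to you, one option is to pass a sequence of pairs instead of a dictionary. This is a small sketch rather than part of the original examples; the tag argument is hypothetical, and the guarded import is only there so the snippet runs on Python 3 as well.

```python
try:
    from urllib import urlencode          # Python 2
except ImportError:
    from urllib.parse import urlencode    # Python 3

# A sequence of (name, value) pairs is encoded in the order given.
pairs = [('q', 'query string'), ('foo', 'bar')]
encoded = urlencode(pairs)
print(encoded)

# doseq=True expands a sequence value into repeated arguments
# ('tag' is a made-up argument name for this illustration).
multi = urlencode([('tag', ['a', 'b'])], doseq=True)
print(multi)
```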


HTTP POST

Note

The test server for these examples is in BaseHTTPServer_POST.py, from the
PyMOTW examples for BaseHTTPServer. Start the server in one
terminal window, then run these examples in another.

To POST form-encoded data to the remote server, instead of using GET, simply pass the encoded
query arguments as data to urlopen().

import urllib
import urllib2

query_args = { 'q':'query string', 'foo':'bar' }
encoded_args = urllib.urlencode(query_args)
url = 'http://localhost:8080/'
print urllib2.urlopen(url, encoded_args).read()

The server can decode the form data and access the individual values by name.

$ python urllib2_urlopen_post.py
Client: ('127.0.0.1', 55943)
User-agent: Python-urllib/2.6
Path: /
Form data:
q=query string
foo=bar

Working with Requests Directly

urlopen() is a convenience function that hides some of the details of how the request is made
and handled for you. For more precise control, you may want to instantiate and use a Request
object directly.

Adding Outgoing Headers

As the examples above illustrate, the default User-agent header value is made up of the
constant Python-urllib, followed by the Python interpreter version. If you are creating
an application that will access other people’s web resources, it is a courtesy to include
real user agent information in your requests, so they can identify the source of the hits
more easily. Using a custom agent also allows them to control crawlers using a robots.txt
file (see robotparser).

import urllib2

request = urllib2.Request('http://localhost:8080/')
request.add_header('User-agent', 'PyMOTW (http://www.doughellmann.com/PyMOTW/)')

response = urllib2.urlopen(request)
data = response.read()
print data

After creating a Request object, it is easy to use add_header() to set the user agent
value before opening the request. The last line of the output shows our custom
value.

$ python urllib2_request_header.py
CLIENT VALUES:
client_address=('127.0.0.1', 55876) (localhost)
command=GET
path=/
real path=/
query=
request_version=HTTP/1.1

SERVER VALUES:
server_version=BaseHTTP/0.3
sys_version=Python/2.6.2
protocol_version=HTTP/1.0

HEADERS RECEIVED:
accept-encoding=identity
connection=close
host=localhost:8080
user-agent=PyMOTW (http://www.doughellmann.com/PyMOTW/)
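
You can also check which headers a Request will send without contacting a server at all. The sketch below assumes nothing beyond the Request class itself; the guarded import is an addition so it runs on Python 3 too.

```python
try:
    from urllib2 import Request           # Python 2
except ImportError:
    from urllib.request import Request    # Python 3

request = Request('http://localhost:8080/')
request.add_header('User-agent', 'PyMOTW (http://www.doughellmann.com/PyMOTW/)')

# Header names are stored capitalized ('User-agent'), so use
# that spelling when looking them up again.
print(request.has_header('User-agent'))
print(request.get_header('User-agent'))
print(request.header_items())
```

Headers added this way are only the ones you set explicitly. Values such as Host and Connection are filled in later, when the request is actually opened.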



Posting Form Data

You can set the outgoing data on the Request to post it to the server.

import urllib
import urllib2

query_args = { 'q':'query string', 'foo':'bar' }

request = urllib2.Request('http://localhost:8080/')
print 'Request method before data:', request.get_method()

request.add_data(urllib.urlencode(query_args))
print 'Request method after data :', request.get_method()
request.add_header('User-agent', 'PyMOTW (http://www.doughellmann.com/PyMOTW/)')

print
print 'OUTGOING DATA:'
print request.get_data()

print
print 'SERVER RESPONSE:'
print urllib2.urlopen(request).read()

The HTTP method used by the Request changes from GET to POST after the data is added.

$ python urllib2_request_post.py
Request method before data: GET
Request method after data : POST

OUTGOING DATA:
q=query+string&foo=bar

SERVER RESPONSE:
Client: ('127.0.0.1', 56044)
User-agent: PyMOTW (http://www.doughellmann.com/PyMOTW/)
Path: /
Form data:
q=query string
foo=bar


Note

Although the method is named add_data(), its effect is not cumulative. Each call
replaces the previous data.
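
Because the data is replaced rather than appended, it is often simpler to build the complete body once and hand it to the Request constructor. The following sketch shows that the method is decided purely by whether the request carries data. It uses guarded imports and encodes the body to bytes, both of which are assumptions made so the snippet also runs on Python 3, where add_data() is no longer available.

```python
try:
    from urllib import urlencode          # Python 2
    from urllib2 import Request
except ImportError:
    from urllib.parse import urlencode    # Python 3
    from urllib.request import Request

url = 'http://localhost:8080/'
body = urlencode([('q', 'query string'), ('foo', 'bar')]).encode('ascii')

get_request = Request(url)          # no data: GET
post_request = Request(url, body)   # data supplied: POST

print(get_request.get_method())
print(post_request.get_method())
```

Neither request is sent here; get_method() only reports what would be used.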

Uploading Files

Encoding files for upload requires a little more work than simple forms. A complete MIME
message needs to be constructed in the body of the request, so that the server can
distinguish incoming form fields from uploaded files.

import itertools
import mimetools
import mimetypes
from cStringIO import StringIO
import urllib
import urllib2

class MultiPartForm(object):
    """Accumulate the data to be used when posting a form."""

    def __init__(self):
        self.form_fields = []
        self.files = []
        self.boundary = mimetools.choose_boundary()
        return

    def get_content_type(self):
        return 'multipart/form-data; boundary=%s' % self.boundary

    def add_field(self, name, value):
        """Add a simple field to the form data."""
        self.form_fields.append((name, value))
        return

    def add_file(self, fieldname, filename, fileHandle, mimetype=None):
        """Add a file to be uploaded."""
        body = fileHandle.read()
        if mimetype is None:
            mimetype = mimetypes.guess_type(filename)[0] or 'application/octet-stream'
        self.files.append((fieldname, filename, mimetype, body))
        return

    def __str__(self):
        """Return a string representing the form data, including attached files."""
        # Build a list of lists, each containing "lines" of the
        # request. Each part is separated by a boundary string.
        # Once the list is built, return a string where each
        # line is separated by '\r\n'.
        parts = []
        part_boundary = '--' + self.boundary

        # Add the form fields
        parts.extend(
            [ part_boundary,
              'Content-Disposition: form-data; name="%s"' % name,
              '',
              value,
            ]
            for name, value in self.form_fields
            )

        # Add the files to upload
        parts.extend(
            [ part_boundary,
              'Content-Disposition: file; name="%s"; filename="%s"' % \
                 (field_name, filename),
              'Content-Type: %s' % content_type,
              '',
              body,
            ]
            for field_name, filename, content_type, body in self.files
            )

        # Flatten the list and add closing boundary marker,
        # then return CR+LF separated data
        flattened = list(itertools.chain(*parts))
        flattened.append('--' + self.boundary + '--')
        flattened.append('')
        return '\r\n'.join(flattened)

if __name__ == '__main__':
    # Create the form with simple fields
    form = MultiPartForm()
    form.add_field('firstname', 'Doug')
    form.add_field('lastname', 'Hellmann')

    # Add a fake file
    form.add_file('biography', 'bio.txt',
                  fileHandle=StringIO('Python developer and blogger.'))

    # Build the request
    request = urllib2.Request('http://localhost:8080/')
    request.add_header('User-agent', 'PyMOTW (http://www.doughellmann.com/PyMOTW/)')
    body = str(form)
    request.add_header('Content-type', form.get_content_type())
    request.add_header('Content-length', len(body))
    request.add_data(body)

    print
    print 'OUTGOING DATA:'
    print request.get_data()

    print
    print 'SERVER RESPONSE:'
    print urllib2.urlopen(request).read()

The MultiPartForm class can represent an arbitrary form as a multi-part MIME message
with attached files.

$ python urllib2_upload_files.py

OUTGOING DATA:
--192.168.1.17.527.30074.1248020372.206.1
Content-Disposition: form-data; name="firstname"

Doug
--192.168.1.17.527.30074.1248020372.206.1
Content-Disposition: form-data; name="lastname"

Hellmann
--192.168.1.17.527.30074.1248020372.206.1
Content-Disposition: file; name="biography"; filename="bio.txt"
Content-Type: text/plain

Python developer and blogger.
--192.168.1.17.527.30074.1248020372.206.1--


SERVER RESPONSE:
Client: ('127.0.0.1', 57126)
User-agent: PyMOTW (http://www.doughellmann.com/PyMOTW/)
Path: /
Form data:
lastname=Hellmann
Uploaded biography as "bio.txt" (29 bytes)
firstname=Doug
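
One way to sanity-check a multi-part body before sending it is to parse it back with the email package. The sketch below is not part of the original example: parse_multipart() is a hypothetical helper, the boundary string is made up, and the body is a hand-written copy of the layout shown in the output above.

```python
import email

boundary = 'example-boundary'   # hypothetical boundary for this sketch
body = '\r\n'.join([
    '--' + boundary,
    'Content-Disposition: form-data; name="firstname"',
    '',
    'Doug',
    '--' + boundary,
    'Content-Disposition: file; name="biography"; filename="bio.txt"',
    'Content-Type: text/plain',
    '',
    'Python developer and blogger.',
    '--' + boundary + '--',
    '',
])

def parse_multipart(body, boundary):
    """Return (name, filename, payload) for each part of the body."""
    # The parser needs the Content-Type header that would normally
    # travel in the HTTP request headers, so prepend it here.
    header = 'Content-Type: multipart/form-data; boundary="%s"\r\n\r\n' % boundary
    message = email.message_from_string(header + body)
    parsed = []
    for part in message.get_payload():
        name = part.get_param('name', header='content-disposition')
        parsed.append((name, part.get_filename(), part.get_payload().strip()))
    return parsed

for name, filename, payload in parse_multipart(body, boundary):
    print('%s %s %r' % (name, filename, payload))
```

If a boundary line is malformed or the closing marker is missing, the parts recovered this way will not match what was added to the form, which makes the problem easier to spot than reading the raw body.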


Custom Protocol Handlers

urllib2 has built-in support for HTTP(S), FTP, and local file access. If you need to add
support for other URL types, you can register your own protocol handler to be invoked as
needed. For example, if you want to support URLs pointing to arbitrary files on remote NFS
servers, without requiring your users to mount the path manually, you would create a
class derived from BaseHandler with a method nfs_open().

The protocol open method takes a single argument, the Request instance, and it should return
an object with a read() method that can be used to read the data, an info() method to return
the response headers, and geturl() to return the actual URL of the file being read. A simple
way to achieve that is to create an instance of urllib.addinfourl, passing the open file
handle, headers, and URL to the constructor.

import mimetypes
import os
import tempfile
import urllib
import urllib2

class NFSFile(file):
    def __init__(self, tempdir, filename):
        self.tempdir = tempdir
        file.__init__(self, filename, 'rb')
    def close(self):
        print
        print 'NFSFile:'
        print ' unmounting %s' % self.tempdir
        print ' when %s is closed' % os.path.basename(self.name)
        return file.close(self)

class FauxNFSHandler(urllib2.BaseHandler):

    def __init__(self, tempdir):
        self.tempdir = tempdir

    def nfs_open(self, req):
        url = req.get_selector()
        directory_name, file_name = os.path.split(url)
        server_name = req.get_host()
        print
        print 'FauxNFSHandler simulating mount:'
        print ' Remote path: %s' % directory_name
        print ' Server : %s' % server_name
        # Use the directory given to the constructor rather than
        # relying on a module-level variable of the same name.
        print ' Local path : %s' % self.tempdir
        print ' File name : %s' % file_name
        local_file = os.path.join(self.tempdir, file_name)
        fp = NFSFile(self.tempdir, local_file)
        content_type = mimetypes.guess_type(file_name)[0] or 'application/octet-stream'
        stats = os.stat(local_file)
        size = stats.st_size
        headers = { 'Content-type': content_type,
                    'Content-length': size,
                  }
        return urllib.addinfourl(fp, headers, req.get_full_url())

if __name__ == '__main__':
    tempdir = tempfile.mkdtemp()
    try:
        # Populate the temporary file for the simulation
        with open(os.path.join(tempdir, 'file.txt'), 'wt') as f:
            f.write('Contents of file.txt')

        # Construct an opener with our NFS handler
        # and register it as the default opener.
        opener = urllib2.build_opener(FauxNFSHandler(tempdir))
        urllib2.install_opener(opener)

        # Open the file through a URL.
        response = urllib2.urlopen('nfs://remote_server/path/to/the/file.txt')
        print
        print 'READ CONTENTS:', response.read()
        print 'URL :', response.geturl()
        print 'HEADERS:'
        for name, value in sorted(response.info().items()):
            print ' %-15s = %s' % (name, value)
        response.close()
    finally:
        os.remove(os.path.join(tempdir, 'file.txt'))
        os.removedirs(tempdir)

The FauxNFSHandler and NFSFile classes print messages to illustrate where a real
implementation would add mount and unmount calls. Since this is just a simulation,
FauxNFSHandler is primed with the name of a temporary directory where it should look for all
of its files.

$ python urllib2_nfs_handler.py

FauxNFSHandler simulating mount:
Remote path: /path/to/the
Server : remote_server
Local path : /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-/tmppv5Efn
File name : file.txt

READ CONTENTS: Contents of file.txt
URL : nfs://remote_server/path/to/the/file.txt
HEADERS:
Content-length = 20
Content-type = text/plain

NFSFile:
unmounting /var/folders/9R/9R1t+tR02Raxzk+F71Q50U+++Uw/-Tmp-/tmppv5Efn
when file.txt is closed
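
Stripped of the simulated mount, the handler mechanism comes down to a method named after the URL scheme. The following sketch uses a made-up echo scheme and a hypothetical EchoHandler class, neither of which exists in the standard library, and it calls the opener directly instead of installing it globally. The guarded imports are an addition so it runs on Python 3 as well.

```python
import io

try:
    import urllib2 as urlrequest               # Python 2
    from urllib import addinfourl
except ImportError:
    import urllib.request as urlrequest        # Python 3
    from urllib.response import addinfourl

class EchoHandler(urlrequest.BaseHandler):
    """Hypothetical handler for a made-up 'echo' URL scheme."""

    def echo_open(self, req):
        # The method name is <scheme>_open, so this is called
        # for any URL that starts with 'echo:'.
        body = ('echo: ' + req.get_full_url()).encode('ascii')
        headers = {'Content-type': 'text/plain',
                   'Content-length': len(body),
                   }
        return addinfourl(io.BytesIO(body), headers, req.get_full_url())

opener = urlrequest.build_opener(EchoHandler())
response = opener.open('echo://example/some/path')
data = response.read()
print(data)
print(response.geturl())
```

Using opener.open() keeps the custom handler local to the code that needs it, whereas install_opener() changes what every later call to urlopen() in the process will use.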


See also

urllib2
The standard library documentation for this module.
urllib
Original URL handling library.
urlparse
Work with the URL string itself.
urllib2 – The Missing Manual
Michael Foord’s write-up on using urllib2.
Upload Scripts
Example scripts from Michael Foord that illustrate how to upload a file
using HTTP and then receive the data on the server.
HTTP client to POST using multipart/form-data
Python cookbook recipe showing how to encode and post data, including files,
over HTTP.
Form content types
W3C specification for posting files or large amounts of data via HTTP forms.
mimetypes
Map filenames to mimetype.
mimetools
Tools for parsing MIME messages.

2 Comments

I'm attempting to grab the header information from a urllib2 response. I have the following:


resp = urllib2.urlopen(httprequest)
docHeaders = resp.info()

But anytime I refer to docHeaders, even with just a print, I crash. I'm running this from mod_python so I'm not seeing the crash info, just the fact that it crashed. And it's python 2.4.2. Any thoughts?

You could try trapping the exception and printing a useful error message yourself. I'm sure there's a way to get mod_python to log the error -- is it going to Apache's error_log file?
