urllib2 – The Missing Manual

January 15, 2011
urllib2 is a Python module for fetching URLs (Uniform Resource Locators). It offers a very simple interface, in the form of the urlopen function. This is capable of fetching URLs using a variety of different protocols. It also offers a slightly more complex interface for handling common situations – like basic authentication, cookies, proxies and so on. These are provided by objects called handlers and openers.

urllib2 supports fetching URLs for many “URL schemes” (identified by the string before the “:” in the URL – for example “ftp” is the URL scheme of “ftp://python.org/”) using their associated network protocols (e.g. FTP, HTTP). This tutorial focuses on the most common case, HTTP.

For straightforward situations urlopen is very easy to use. But as soon as you encounter errors or non-trivial cases when opening HTTP URLs, you will need some understanding of the HyperText Transfer Protocol. The most comprehensive and authoritative reference to HTTP is RFC 2616. This is a technical document and not intended to be easy to read. This HOWTO aims to illustrate using urllib2, with enough detail about HTTP to help you through. It is not intended to replace the urllib2 docs, but is supplementary to them.

The simplest way to use urllib2 is to pass a URL to urllib2.urlopen and call .read() on the object it returns. Many uses of urllib2 will be that simple (note that instead of an 'http:' URL we could have used a URL starting with 'ftp:', 'file:', etc.). However, it's the purpose of this tutorial to explain the more complicated cases, concentrating on HTTP.

HTTP is based on requests and responses – the client makes requests and servers send responses. urllib2 mirrors this with a Request object which represents the HTTP request you are making. In its simplest form you create a Request object that specifies the URL you want to fetch. Calling urlopen with this Request object returns a response object for the URL requested. This response is a file-like object, which means you can, for example, call .read() on the response.

Note that urllib2 makes use of the same Request interface to handle all URL schemes. For example, you can make an FTP request by passing an 'ftp:' URL to Request in exactly the same way.

In the case of HTTP, there are two extra things that Request objects allow you to do. First, you can pass data to be sent to the server. Second, you can pass extra information (“metadata”) about the data or about the request itself to the server – this information is sent as HTTP “headers”. Let's look at each of these in turn.

Sometimes you want to send data to a URL (often the URL will refer to a CGI (Common Gateway Interface) script [1] or other web application). With HTTP, this is often done using what's known as a POST request. This is often what your browser does when you submit an HTML form that you filled in on the web. Not all POSTs have to come from forms: you can use a POST to transmit arbitrary data to your own application. In the common case of HTML forms, the data needs to be encoded in a standard way, and then passed to the Request object as the data argument. The encoding is done using a function from the urllib library, not from urllib2.

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data)
    response = urllib2.urlopen(req)
    the_page = response.read()
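The basic fetch described at the start of this section (a Request object, urlopen, and a file-like response) can be tried without a web server, because the same interface also handles 'file:' URLs. The following is only a sketch under some assumptions: the temporary file stands in for a remote page, and the try/except import is added so that the same lines also run on Python 3, where this API lives in urllib.request.

```python
import os
import tempfile

try:
    import urllib2                          # Python 2, as used in this article
    from urllib import pathname2url
except ImportError:                         # assumption: Python 3 fallback
    import urllib.request as urllib2
    from urllib.request import pathname2url

# Create a small local file to stand in for a remote page.
fd, path = tempfile.mkstemp(suffix='.txt')
os.write(fd, b'Hello from a local file\n')
os.close(fd)

try:
    # The same Request interface is used whatever the URL scheme.
    req = urllib2.Request('file:' + pathname2url(path))
    response = urllib2.urlopen(req)
    the_page = response.read()              # the response is file-like
    response.close()
finally:
    os.remove(path)

print(the_page)
```

Swapping the 'file:' URL for an 'http:' one is all it takes to fetch a real page in the same way.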
Note that other encodings are sometimes required (e.g. for file upload from HTML forms – see HTML Specification, Form Submission for more details).

If you do not pass the data argument, urllib2 uses a GET request. One way in which GET and POST requests differ is that POST requests often have “side-effects”: they change the state of the system in some way (for example by placing an order with the website for a hundredweight of tinned spam to be delivered to your door). Though the HTTP standard makes it clear that POSTs are intended to always cause side-effects, and GET requests never to cause side-effects, nothing prevents a GET request from having side-effects, nor a POST request from having no side-effects. Data can also be passed in an HTTP GET request by encoding it in the URL itself: the values are encoded with urllib.urlencode, and the full URL is created by adding a ? to the URL, followed by the encoded values.

We'll discuss here one particular HTTP header, to illustrate how to add headers to your HTTP request.

Some websites (like Google for example) dislike being browsed by programs, or send different versions to different browsers [2]. By default urllib2 identifies itself as Python-urllib/x.y (where x and y are the major and minor version numbers of the Python release, e.g. Python-urllib/2.5), which may confuse the site, or just plain not work. The way a browser identifies itself is through the User-Agent header [3]. When you create a Request object you can pass a dictionary of headers in. The following example makes the same request as above, but identifies itself as a version of Internet Explorer [4].

    import urllib
    import urllib2

    url = 'http://www.someserver.com/cgi-bin/register.cgi'
    user_agent = 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'
    values = {'name': 'Michael Foord',
              'location': 'Northampton',
              'language': 'Python'}
    headers = {'User-Agent': user_agent}

    data = urllib.urlencode(values)
    req = urllib2.Request(url, data, headers)
    response = urllib2.urlopen(req)
    the_page = response.read()
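To see where the data and the headers end up, a Request can be built and inspected without sending anything. This is a minimal sketch, not a definitive recipe: the example.com URL is a placeholder, a list of pairs replaces the dictionary only to keep the encoded order fixed, and the try/except import is an assumption added so the lines also run on Python 3.

```python
try:
    import urllib2                          # Python 2, as used in this article
    from urllib import urlencode
except ImportError:                         # assumption: Python 3 fallback
    import urllib.request as urllib2
    from urllib.parse import urlencode

url = 'http://www.example.com/cgi-bin/register.cgi'    # placeholder URL
values = [('name', 'Michael Foord'), ('language', 'Python')]
encoded = urlencode(values)

headers = {'User-Agent': 'Mozilla/4.0 (compatible; MSIE 5.5; Windows NT)'}

# GET: the encoded values travel in the URL, after a '?'.
get_req = urllib2.Request(url + '?' + encoded, headers=headers)

# POST: the encoded values travel in the request body instead.
# (.encode() is there because Python 3 expects the body as bytes.)
post_req = urllib2.Request(url, encoded.encode('ascii'), headers)

print(get_req.get_full_url())
print(get_req.get_method())
print(post_req.get_method())

# Request stores header names capitalised, so look up 'User-agent'.
print(get_req.get_header('User-agent'))
```

The first line printed is the URL with ?name=Michael+Foord&language=Python appended, followed by GET and POST: the presence of the data argument alone decides the method.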
The response also has two useful methods. See the section on info and geturl, which comes after we have a look at what happens when things go wrong.

urlopen raises URLError when it cannot handle a response (though as usual with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also be raised).

HTTPError is the subclass of URLError raised in the specific case of HTTP URLs.

URLError
Often, URLError is raised because there is no network connection (no route to the specified server), or the specified server doesn't exist. In this case, the exception raised will have a 'reason' attribute, which is a tuple containing an error code and a text error message.

e.g.

    >>> import urllib2
    >>> req = urllib2.Request('http://www.pretend_server.org')
    >>> try: urllib2.urlopen(req)
    ... except urllib2.URLError, e:
    ...     print e.reason
    ...
    (4, 'getaddrinfo failed')
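URLError can also be provoked without any network at all, which is handy for trying out error handling. The sketch below uses a made-up URL scheme (nosuchscheme is an assumption, chosen because no handler will claim it); in this case the reason is a plain string rather than the (code, message) tuple shown above. The try/except import is again only there so the lines also run on Python 3.

```python
try:
    import urllib2                          # Python 2, as used in this article
except ImportError:                         # assumption: Python 3 fallback
    import urllib.request as urllib2

# 'nosuchscheme' is a made-up URL scheme: no handler knows how to open it,
# so urlopen raises URLError before any network access is attempted.
req = urllib2.Request('nosuchscheme://www.example.com/')

reason = None
try:
    urllib2.urlopen(req)
except urllib2.URLError as e:
    reason = e.reason
    print(reason)
```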
HTTPError

Every HTTP response from the server contains a numeric “status code”. Sometimes the status code indicates that the server is unable to fulfil the request. The default handlers will handle some of these responses for you (for example, if the response is a “redirection” that requests the client fetch the document from a different URL, urllib2 will handle that for you). For those it can't handle, urlopen will raise an HTTPError. Typical errors include '404' (page not found), '403' (request forbidden), and '401' (authentication required).

See section 10 of RFC 2616 for a reference on all the HTTP error codes.

The HTTPError instance raised will have an integer 'code' attribute, which corresponds to the error sent by the server.

Because the default handlers handle redirects (codes in the 300 range), and codes in the 100-299 range indicate success, you will usually only see error codes in the 400-599 range.

BaseHTTPServer.BaseHTTPRequestHandler.responses is a useful dictionary of response codes that shows all the response codes used by RFC 2616. The dictionary is reproduced here for convenience:

    # Table mapping response codes to messages; entries have the
    # form {code: (shortmessage, longmessage)}.
    responses = {
        100: ('Continue', 'Request received, please continue'),
        101: ('Switching Protocols',
              'Switching to new protocol; obey Upgrade header'),

        200: ('OK', 'Request fulfilled, document follows'),
        201: ('Created', 'Document created, URL follows'),
        202: ('Accepted',
              'Request accepted, processing continues off-line'),
        203: ('Non-Authoritative Information', 'Request fulfilled from cache'),
        204: ('No Content', 'Request fulfilled, nothing follows'),
        205: ('Reset Content', 'Clear input form for further input.'),
        206: ('Partial Content', 'Partial content follows.'),

        300: ('Multiple Choices',
              'Object has several resources -- see URI list'),
        301: ('Moved Permanently', 'Object moved permanently -- see URI list'),
        302: ('Found', 'Object moved temporarily -- see URI list'),
        303: ('See Other', 'Object moved -- see Method and URL list'),
        304: ('Not Modified',
              'Document has not changed since given time'),
        305: ('Use Proxy',
              'You must use proxy specified in Location to access this '
              'resource.'),
        307: ('Temporary Redirect',
              'Object moved temporarily -- see URI list'),

        400: ('Bad Request',
              'Bad request syntax or unsupported method'),
        401: ('Unauthorized',
              'No permission -- see authorization schemes'),
        402: ('Payment Required',
              'No payment -- see charging schemes'),
        403: ('Forbidden',
              'Request forbidden -- authorization will not help'),
        404: ('Not Found', 'Nothing matches the given URI'),
        405: ('Method Not Allowed',
              'Specified method is invalid for this server.'),
        406: ('Not Acceptable', 'URI not available in preferred format.'),
        407: ('Proxy Authentication Required', 'You must authenticate with '
              'this proxy before proceeding.'),
        408: ('Request Timeout', 'Request timed out; try again later.'),
        409: ('Conflict', 'Request conflict.'),
        410: ('Gone',
              'URI no longer exists and has been permanently removed.'),
        411: ('Length Required', 'Client must specify Content-Length.'),
        412: ('Precondition Failed', 'Precondition in headers is false.'),
        413: ('Request Entity Too Large', 'Entity is too large.'),
        414: ('Request-URI Too Long', 'URI is too long.'),
        415: ('Unsupported Media Type', 'Entity body in unsupported format.'),
        416: ('Requested Range Not Satisfiable',
              'Cannot satisfy request range.'),
        417: ('Expectation Failed',
              'Expect condition could not be satisfied.'),

        500: ('Internal Server Error', 'Server got itself in trouble'),
        501: ('Not Implemented',
              'Server does not support this operation'),
        502: ('Bad Gateway', 'Invalid responses from another server/proxy.'),
        503: ('Service Unavailable',
              'The server cannot process the request due to a high load'),
        504: ('Gateway Timeout',
              'The gateway server did not receive a timely response'),
        505: ('HTTP Version Not Supported', 'Cannot fulfill request.'),
        }
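Rather than copying the table, the dictionary can be consulted directly when reporting an error. The helper below is a hypothetical illustration (describe is not part of any library), and the try/except import is an assumption added so the lines also run on Python 3, where the class lives in http.server.

```python
try:
    from BaseHTTPServer import BaseHTTPRequestHandler    # Python 2
except ImportError:                                       # assumption: Python 3
    from http.server import BaseHTTPRequestHandler


def describe(code):
    """Return 'code short-message: long-message', or a fallback text."""
    entry = BaseHTTPRequestHandler.responses.get(code)
    if entry is None:
        return '%d (unknown status code)' % code
    short_message, long_message = entry
    return '%d %s: %s' % (code, short_message, long_message)


print(describe(404))
print(describe(799))
```

A call such as describe(e.code) inside an except HTTPError block would turn the bare number into something readable for a log message.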
When an error is raised the server responds by returning an HTTP error code and an error page. You can use the HTTPError instance as a response on the page returned. This means that as well as the code attribute, it also has read, geturl, and info methods.

So if you want to be prepared for HTTPError or URLError there are two basic approaches. I prefer the second approach. The first approach catches the two exceptions separately:

    from urllib2 import Request, urlopen, URLError, HTTPError

    req = Request(someurl)
    try:
        response = urlopen(req)
    except HTTPError, e:
        # HTTPError is a subclass of URLError, so it has to be caught first
        print 'The server couldn\'t fulfill the request.'
        print 'Error code: ', e.code
    except URLError, e:
        print 'We failed to reach a server.'
        print 'Reason: ', e.reason
    else:
        # everything is fine
        pass
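The second approach catches URLError only and inspects the exception:

    from urllib2 import Request, urlopen, URLError

    req = Request(someurl)
    try:
        response = urlopen(req)
    except URLError, e:
        # Test for 'code' first: an HTTPError is also a URLError
        # and may carry a 'reason' attribute as well.
        if hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        elif hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
    else:
        # everything is fine
        pass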
Note: URLError is a subclass of the built-in exception IOError. This means that you can avoid importing URLError and use:

    from urllib2 import Request, urlopen

    req = Request(someurl)
    try:
        response = urlopen(req)
    except IOError, e:
        # Test for 'code' first: an HTTPError is also a URLError
        # and may carry a 'reason' attribute as well.
        if hasattr(e, 'code'):
            print 'The server couldn\'t fulfill the request.'
            print 'Error code: ', e.code
        elif hasattr(e, 'reason'):
            print 'We failed to reach a server.'
            print 'Reason: ', e.reason
    else:
        # everything is fine
        pass
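The claim that an HTTPError doubles as a response object, and that it can be caught as an IOError, can be checked without a server. In the sketch below the exception is built by hand purely for illustration (normally urlopen raises it for you); the URL, the status text and the body are made-up values, and the try/except import is an assumption so the lines also run on Python 3.

```python
import io

try:
    import urllib2                          # Python 2, as used in this article
except ImportError:                         # assumption: Python 3 fallback
    import urllib.request as urllib2

# Build an HTTPError by hand. Arguments: url, code, msg, headers, body.
# The headers are left empty because only the URL and body are read below.
error = urllib2.HTTPError('http://www.example.com/missing', 404,
                          'Not Found', {},
                          io.BytesIO(b'<h1>No such page</h1>'))

try:
    raise error
except IOError as e:        # HTTPError -> URLError -> IOError
    code = e.code
    final_url = e.geturl()
    body = e.read()         # the error page sent along with the code

print(code)
print(final_url)
print(body)
print(issubclass(urllib2.HTTPError, urllib2.URLError))
print(issubclass(urllib2.URLError, IOError))
```

Reading the body this way is how you would get at a server's custom error page instead of discarding it.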
Under rare circumstances urllib2 can raise socket.error.

BadStatusLine and HTTPException
There are one or two cases where an exception that doesn't inherit from IOError can be raised. One of these is the BadStatusLine exception defined in the httplib module. This exception can be raised when, for example, the requested page is entirely blank. It doesn't inherit from IOError but instead from HTTPException (again defined in httplib and inheriting directly from Exception).

There may be other circumstances when these exceptions can leak through to users of urllib2. You can either import these exception types from httplib to catch them directly or have a 'catch-all' exception clause (catching Exception) to handle anything that may go wrong.

info and geturl
The response returned by urlopen (or the HTTPError instance) has two useful methods, info and geturl.

geturl - this returns the real URL of the page fetched. This is useful because urlopen (or the opener object used) may have followed a redirect. The URL of the page fetched may not be the same as the URL requested.

info - this returns a dictionary-like object that describes the page fetched, particularly the headers sent by the server. It is currently an httplib.HTTPMessage instance.

Typical headers include 'Content-length', 'Content-type', and so on. See the Quick Reference to HTTP Headers for a useful listing of HTTP headers with brief explanations of their meaning and use.

Openers and Handlers
When you fetch a URL you use an opener (an instance of the perhaps confusingly-named urllib2.OpenerDirector). Normally we have been using the default opener – via urlopen – but you can create custom openers. Openers use handlers. All the “heavy lifting” is done by the handlers. Each handler knows how to open URLs for a particular URL scheme (http, ftp, etc.), or how to handle an aspect of URL opening, for example HTTP redirections or HTTP cookies.

You will want to create openers if you want to fetch URLs with specific handlers installed, for example to get an opener that handles cookies, or to get an opener that does not handle redirections.

To create an opener, instantiate an OpenerDirector, and then call .add_handler(some_handler_instance) repeatedly.

Alternatively, you can use build_opener, which is a convenience function for creating opener objects with a single function call. build_opener adds several handlers by default, but provides a quick way to add more and/or override the default handlers.

Other sorts of handlers you might want can handle proxies, authentication, and other common but slightly specialised situations.

install_opener can be used to make an opener object the (global) default opener. This means that calls to urlopen will use the opener you have installed.

Opener objects have an open method, which can be called directly to fetch URLs in the same way as the urlopen function: there's no need to call install_opener, except as a convenience.

To illustrate creating and installing a handler we will use the HTTPBasicAuthHandler. For a more detailed discussion of this subject – including an explanation of how Basic Authentication works – see the Basic Authentication Tutorial.

When authentication is required, the server sends a header (as well as the 401 error code) requesting authentication. This specifies the authentication scheme and a 'realm'. The header looks like: Www-authenticate: SCHEME realm="REALM".
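To make the shape of that header concrete, here is a small sketch that picks the scheme and realm out of a header value. It is a deliberately simplified, hypothetical helper (parse_challenge and the sample header value are assumptions made up for illustration) and is not how urllib2 parses the header internally; real headers can carry several challenges and extra parameters.

```python
import re

# Simplified pattern: a scheme token, whitespace, then realm="...".
# Multiple challenges and extra parameters are deliberately ignored.
CHALLENGE = re.compile(r'^\s*(\S+)\s+realm\s*=\s*"([^"]*)"', re.IGNORECASE)


def parse_challenge(header_value):
    """Return (scheme, realm) from a Www-authenticate value, or None."""
    match = CHALLENGE.match(header_value)
    if match is None:
        return None
    return match.group(1), match.group(2)


print(parse_challenge('Basic realm="Members Area"'))
print(parse_challenge('nonsense'))
```

The first call prints ('Basic', 'Members Area') and the second prints None. The realm recovered this way is the value a password manager is keyed on.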
The client should then retry the request with the appropriate name and password for the realm included as a header in the request. This is 'basic authentication'. In order to simplify this process we can create an instance of HTTPBasicAuthHandler and an opener to use this handler.

The HTTPBasicAuthHandler uses an object called a password manager to handle the mapping of URLs and realms to passwords and usernames. If you know what the realm is (from the authentication header sent by the server), then you can use a HTTPPasswordMgr. Frequently one doesn't care what the realm is. In that case, it is convenient to use HTTPPasswordMgrWithDefaultRealm. This allows you to specify a default username and password for a URL. This will be supplied in the absence of you providing an alternative combination for a specific realm. We indicate this by providing None as the realm argument to the add_password method.

The top-level URL is the first URL that requires authentication. URLs “deeper” than the URL you pass to .add_password() will also match.

    # create a password manager
    password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()

    # Add the username and password.
    # If we knew the realm, we could use it instead of ``None``.
    top_level_url = "http://example.com/foo/"
    password_mgr.add_password(None, top_level_url, username, password)

    handler = urllib2.HTTPBasicAuthHandler(password_mgr)

    # create "opener" (OpenerDirector instance)
    opener = urllib2.build_opener(handler)

    # use the opener to fetch a URL
    opener.open(a_url)

    # Install the opener.
    # Now all calls to urllib2.urlopen use our opener.
    urllib2.install_opener(opener)
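The matching rule for “deeper” URLs can be checked directly on the password manager, without contacting a server. This is a sketch under stated assumptions: the credentials, realm name and example.com URLs are placeholders, and the try/except import is only there so the lines also run on Python 3.

```python
try:
    import urllib2                          # Python 2, as used in this article
except ImportError:                         # assumption: Python 3 fallback
    import urllib.request as urllib2

password_mgr = urllib2.HTTPPasswordMgrWithDefaultRealm()
# Placeholder credentials; None means "use these for any realm".
password_mgr.add_password(None, 'http://example.com/foo/', 'alice', 'secret')

# A deeper URL under the registered one matches, whatever the realm.
deeper = password_mgr.find_user_password(
    'Some Realm', 'http://example.com/foo/bar/page.html')
# A URL outside /foo/ does not match.
outside = password_mgr.find_user_password(
    'Some Realm', 'http://example.com/other/')

print(deeper)
print(outside)

handler = urllib2.HTTPBasicAuthHandler(password_mgr)
opener = urllib2.build_opener(handler)

# build_opener returns an OpenerDirector with our handler installed.
print(isinstance(opener, urllib2.OpenerDirector))
print(handler in opener.handlers)
```

The two lookups print ('alice', 'secret') and (None, None), and both final checks print True.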
top_level_url is in fact either a full URL (including the 'http:' scheme component and the hostname and optionally the port number) e.g. “http://example.com/” or an “authority” (i.e. the hostname, optionally including the port number) e.g. “example.com” or “example.com:8080” (the latter example includes a port number). The authority, if present, must NOT contain the “userinfo” component – for example “joe@password:example.com” is not correct.

urllib2 will auto-detect your proxy settings and use those. This is through the ProxyHandler which is part of the normal handler chain. Normally that's a good thing, but there are occasions when it may not be helpful [5]. One way to switch it off is to set up our own ProxyHandler, with no proxies defined. This is done using similar steps to setting up a Basic Authentication handler: create the ProxyHandler with an empty dictionary of proxies, build an opener from it, and install that opener.

The Python support for fetching resources from the web is layered. urllib2 uses the httplib library, which in turn uses the socket library.

As of Python 2.3 you can specify how long a socket should wait for a response before timing out. This can be useful in applications which have to fetch web pages. By default the socket module has no timeout and can hang. Before Python 2.6 the socket timeout is not exposed at the httplib or urllib2 levels (from 2.6 onwards urlopen also accepts a timeout argument). However, you can set the default timeout globally for all sockets using:

    import socket
    import urllib2

    # timeout in seconds
    timeout = 10
    socket.setdefaulttimeout(timeout)

    # this call to urllib2.urlopen now uses the default timeout
    # we have set in the socket module
    req = urllib2.Request('http://www.voidspace.org.uk')
    response = urllib2.urlopen(req)
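Both the proxy override and the global timeout described above can be tried without fetching anything. The following is a rough sketch rather than a recommended setup: the variable names are illustrative, the socket is created but never connected, and the try/except import is an assumption so the lines also run on Python 3.

```python
import socket

try:
    import urllib2                          # Python 2, as used in this article
except ImportError:                         # assumption: Python 3 fallback
    import urllib.request as urllib2

# An empty dictionary means "no proxies", overriding auto-detected settings.
proxy_support = urllib2.ProxyHandler({})
opener = urllib2.build_opener(proxy_support)
# urllib2.install_opener(opener) would make urlopen use it globally.

print(proxy_support.proxies)

# Sockets created after setdefaulttimeout() pick up the new default.
previous = socket.getdefaulttimeout()
socket.setdefaulttimeout(10)
try:
    sock = socket.socket()                  # created but never connected
    timeout_seen = sock.gettimeout()
    sock.close()
finally:
    socket.setdefaulttimeout(previous)      # restore the earlier default

print(timeout_seen)
```

Restoring the previous default matters in larger programs, because setdefaulttimeout affects every socket created afterwards, not just those used by urllib2.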
