Note: this page describes the new >=0.4.3a interface. For the old documentation, see the source packages.
import ClientCookie
response = ClientCookie.urlopen("http://foo.bar.com/")
This function behaves identically to urllib2.urlopen, except that it deals with cookies automatically. That's probably all you need to know.
Here is a more complicated example, involving Request objects (useful if you want to pass Requests around, add headers to them, etc.):
import ClientCookie
import urllib2
request = urllib2.Request("http://www.acme.com/")
# note we're using the urlopen from ClientCookie, not urllib2
response = ClientCookie.urlopen(request)
# let's say this next request requires a cookie that was set in response
request2 = urllib2.Request("http://www.acme.com/flying_machines.html")
response2 = ClientCookie.urlopen(request2)
print response2.geturl()
print response2.info()  # headers
print response2.read()  # body (readline and readlines work too)
In these examples, the workings are hidden inside the ClientCookie.urlopen function, which is an extension of urllib2.urlopen. Redirects, proxies and cookies are handled automatically by this function. Cookie processing (etc.) is handled by processor objects, which are similar to urllib2's handlers: HTTPCookieProcessor, HTTPRefererProcessor, SeekableProcessor etc. To use them, simply pass processors to build_opener as if they were handlers. Processor-aware versions of HTTPHandler and HTTPSHandler (if your Python installation has HTTPS support) are also included, along with a bugfixed HTTPRedirectHandler (the bug, related to redirection, is fixed in Python 2.3).
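For example, here is a minimal sketch (the URL is made up) of passing processors to build_opener, handler-style:

import ClientCookie
# processor classes (or instances) go to build_opener just like handlers
opener = ClientCookie.build_opener(ClientCookie.HTTPRefererProcessor,
                                   ClientCookie.SeekableProcessor)
response = opener.open("http://www.example.com/")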
An example at a slightly lower level shows more clearly how the module processes cookies:
# Don't copy this blindly!  You probably want to follow the examples
# above, not this one.
import ClientCookie
import urllib2
request = urllib2.Request("http://www.acme.com/")
response = urllib2.urlopen(request)
c = ClientCookie.CookieJar()
c.extract_cookies(response, request)
# let's say this next request requires a cookie that was set in response
request2 = urllib2.Request("http://www.acme.com/flying_machines.html")
c.add_cookie_header(request2)
response2 = urllib2.urlopen(request2)
The CookieJar class does all the work. There are essentially two operations: extract_cookies extracts HTTP cookies from Set-Cookie (the original Netscape cookie standard) and Set-Cookie2 (RFC 2965) headers from a response if and only if they should be set given the request, and add_cookie_header adds Cookie headers if and only if they are appropriate for a particular HTTP request. Incoming cookies are checked for acceptability based on the host name, etc. Cookies are only set on outgoing requests if they match the request's host name, path, etc. Cookies may also be saved to and loaded from a file. The subclass MozillaCookieJar differs from CookieJar only in storing cookies using a different, Mozilla/Netscape-compatible, file format. This Mozilla-compatible ('cookies.txt') format loses some information when you save cookies to a file. Note that lynx also uses the Mozilla file format. The subclass MSIECookieJar can load (but not save, yet) from Microsoft Internet Explorer's cookie files (on Windows).
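For instance, saving to a Mozilla-format file might look like this (a sketch: the filename is made up, and the fetching step is elided):

import ClientCookie
cookies = ClientCookie.MozillaCookieJar()
# ...fetch some URLs with an opener that uses this CookieJar, then:
cookies.save("cookies.txt")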
Note that if you're using ClientCookie.urlopen (or if you're using ClientCookie.HTTPCookieProcessor by some other means), you don't need to call extract_cookies or add_cookie_header yourself. If, on the other hand, you don't want to use urllib2, you will need to use this pair of methods. You can make your own request and response objects, which must support the interfaces described in the docstrings of extract_cookies and add_cookie_header.
Only use names you can import directly from the ClientCookie package, and that don't start with a single underscore. Everything else is subject to change or disappearance without notice.
The subclass MozillaCookieJar differs from CookieJar only in storing cookies using a different, Netscape/Mozilla-compatible, file format. This Netscape-compatible format can't store RFC 2965 cookies, so they are downgraded to Netscape cookies on saving. CookieJar itself uses a libwww-perl specific format (`Set-Cookie3'). Python and Netscape/Mozilla should be able to share a cookies file (note that the file location here will differ on non-unix OSes):
WARNING: you may want to back up your browser's cookies file if you use MozillaCookieJar to save cookies. I think it works, but there have been bugs in the past!
import os, ClientCookie
cookies = ClientCookie.MozillaCookieJar()
# note: no leading slash on the second argument, or os.path.join would
# discard the first
cookies.load(os.path.join(os.environ["HOME"], ".netscape/cookies.txt"))
# see also the save and revert methods
Note that cookies saved while Mozilla is running will get clobbered by Mozilla - see MozillaCookieJar.__doc__.
MSIECookieJar does the same for Microsoft Internet Explorer (MSIE) 5.x and 6.x on Windows, but does not allow saving cookies in this format. In future, the Windows API calls might be used to load and save (though the index has to be read directly, since there is no API for that, AFAIK).
import ClientCookie
c = ClientCookie.MSIECookieJar(delayload=True)
c.load_from_registry()  # finds cookie index file from registry
A true delayload argument speeds things up.
On Windows 9x (win 95, win 98, win ME), you need to supply a username to the load_from_registry method:
c.load_from_registry(username="jbloggs")
You might want to use your own CookieJar instance in order to use your browser's cookies, to customize CookieJar's behaviour by passing constructor arguments, or to be able to get at the cookies it will hold (for example, for saving cookies between sessions and for debugging).
If you're using the higher-level urllib2-like interface (urlopen, etc.), you'll have to let it know what CookieJar it should use:
import ClientCookie
cookies = ClientCookie.CookieJar()
# build_opener adds standard handlers and processors (such as HTTPHandler
# and HTTPCookieProcessor) by default.  The cookie processor we supply
# will replace the default one.
opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cookies))
r = opener.open("http://acme.com/")  # GET
data = "spam=eggs"  # urlencoded POST data
r = opener.open("http://acme.com/", data)  # POST
The urlopen function uses a global OpenerDirector instance to do its work, so if you want to use urlopen with your own CookieJar, install the OpenerDirector you built with build_opener using the ClientCookie.install_opener function, then proceed as usual:
ClientCookie.install_opener(opener)
r = ClientCookie.urlopen("http://www.acme.com/")
Of course, everyone using urlopen is using the same global CookieJar instance!
You can set a policy object (which must satisfy the interface defined by ClientCookie.CookiePolicy), which determines which cookies are allowed to be set and returned. Use the policy argument to the CookieJar constructor, or just set the policy attribute directly. The default implementation has some useful switches:
from ClientCookie import CookieJar, DefaultCookiePolicy as Policy
cookies = CookieJar()
# turn off RFC 2965 cookies, be more strict about domains when setting and
# returning Netscape cookies, and block some domains from setting cookies
# or having them returned (read the DefaultCookiePolicy docstring for the
# domain matching rules here)
policy = Policy(rfc2965=False, strict_ns_domain=Policy.DomainStrict,
                blocked_domains=["ads.net", ".ads.net"])
cookies.policy = policy
These are implemented as processor classes. Processors are identical in use to urllib2's handlers: you just pass them to build_opener (example code below).
HTTPEquivProcessor
The <META HTTP-EQUIV> tag is a way of including data in HTML to be treated as if it were part of the HTTP headers. ClientCookie can automatically read these tags and add the HTTP-EQUIV headers to the response object's real HTTP headers. The HTML is left unchanged.
HTTPRefreshProcessor
The Refresh HTTP header is a non-standard header which is widely used. It requests that the user-agent follow a URL after a specified time delay. ClientCookie can treat these headers (which may have been set in <META HTTP-EQUIV> tags) as if they were 302 redirections. Exactly when and how Refresh headers are handled is configurable using the constructor arguments.
SeekableProcessor
This makes ClientCookie's response objects seek()able. Seeking is done lazily (ie. the response object only reads from the socket as necessary, rather than slurping in all the data before the response is returned to you). XXX only works for HTTP ATM, I think
HTTPRefererProcessor
The Referer HTTP header lets the server know which URL you've just visited. Some servers use this header as state information, and don't like it if this is not present. It's a chore to add this header by hand every time you make a request. This processor adds it automatically. NOTE: this only makes sense if you use each processor for a single chain of HTTP requests (so, for example, if you use a single HTTPRefererProcessor to fetch a series of URLs extracted from a single page, this will break).
import ClientCookie
opener = ClientCookie.build_opener(ClientCookie.HTTPRefererProcessor,
                                   ClientCookie.HTTPEquivProcessor,
                                   ClientCookie.HTTPRefreshProcessor,
                                   ClientCookie.SeekableProcessor)
opener.open("http://www.rhubarb.com/")
Adding headers is done like so:
import ClientCookie, urllib2
req = urllib2.Request("http://foobar.com/")
req.add_header("Referer", "http://wwwsearch.sourceforge.net/ClientCookie/")
r = ClientCookie.urlopen(req)
You can also use the headers argument to the urllib2.Request constructor.
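For example (a sketch using the standard urllib2.Request signature; the URL and Referer value are as in the previous example):

import urllib2
req = urllib2.Request("http://foobar.com/",
                      headers={"Referer":
                               "http://wwwsearch.sourceforge.net/ClientCookie/"})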
urllib2 (in fact, ClientCookie takes over this task from urllib2) adds some headers to Request objects automatically - see the next section for details.
OpenerDirector automatically adds a User-Agent header to every Request. To change this, use your own OpenerDirector:
import ClientCookie
cookies = ClientCookie.CookieJar()
opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cookies))
opener.addheaders = [("User-agent", "Mozilla/5.0")]
Again, to use urlopen, install your OpenerDirector globally:
ClientCookie.install_opener(opener)
r = ClientCookie.urlopen("http://acme.com/")
Also, a few standard headers (Content-Length, Content-Type and Host) are added when the Request is passed to urlopen (or OpenerDirector.open). ClientCookie explicitly adds these (and User-Agent) to the Request object, unlike urllib2. You shouldn't need to change these headers, but since this is done by HTTPStandardHeadersProcessor, you can change the way it works by replacing that processor or adding a new processor that sets these headers.
ClientCookie knows that redirected transactions are unverifiable, so it'll handle that on its own.
If you want to initiate an unverifiable transaction yourself (which you should if, for example, you're downloading the images from a page, and 'the user' hasn't explicitly OKed those URLs), you need to set a true request.unverifiable attribute on your Request instance, and also set request.origin_req_host to the request-host of the origin transaction (eg. the URL of the page containing the images). If unverifiable is present and true, but origin_req_host is not present, you'll get an AttributeError. XXX None of this is very nice...
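Here is a minimal sketch (the URLs are made up; the attribute names are those described above):

import ClientCookie, urllib2
# fetch an image referenced by a page the user explicitly requested
img_request = urllib2.Request("http://www.example.com/image.gif")
img_request.unverifiable = True
# request-host of the origin transaction (the page containing the image)
img_request.origin_req_host = "www.example.com"
response = ClientCookie.urlopen(img_request)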
First, a few common problems. The most frequent mistake people seem to make is to use both ClientCookie.urlopen and the extract_cookies and add_cookie_header methods of a CookieJar object. If you use ClientCookie.urlopen (or OpenerDirector.open), the module handles extraction and adding of cookies by itself, so you should not call extract_cookies or add_cookie_header.
If things don't seem to be working as expected, the first thing to try is to switch off RFC 2965 handling. This is because few browsers implement it, so it is likely that some servers incorrectly implement it.
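For example (a minimal sketch; a fuller recipe appears in the FAQ section below):

import ClientCookie
cookies = ClientCookie.CookieJar()
cookies.policy.rfc2965 = False  # switch off RFC 2965 handling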
Are you sure the server is sending you any cookies in the first place? Maybe the server is keeping track of state in some other way (HIDDEN HTML form entries (possibly in a separate page referenced by a frame), URL-encoded session keys, IP address, HTTP Referer headers)? Perhaps some embedded script in the HTML is setting cookies (see below)? Maybe you messed up your request, and the server is sending you some standard failure page (even if the page doesn't appear to indicate any failure). Sometimes, a server wants particular headers set to the values it expects, or it won't play nicely. The most frequent offenders here are the Referer [sic] and / or User-Agent HTTP headers (see above for how to set these). The User-Agent header may need to be set to a value like that of a popular browser. The Referer header may need to be set to the URL that the server expects you to have followed a link from. Occasionally, it may even be that operators deliberately configure a server to insist on precisely the headers that the popular browsers (MS Internet Explorer, Netscape/Mozilla, Opera) generate, but remember that incompetence (possibly on your part) is more probable than deliberate sabotage.
When you save to or load/revert from a file, single-session cookies will expire unless you explicitly request otherwise with the ignore_discard argument. This may be your problem if you find cookies are going away after saving and loading.
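For example (a sketch, assuming cookies is a CookieJar instance as in the earlier examples; the file path is made up):

cookies.save("/some/file", ignore_discard=True)
cookies.load("/some/file", ignore_discard=True)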
If none of the advice above solves your problem quickly, try comparing the headers and data that you are sending out with those that a browser emits. Often this will give you the clue you need. Of course, you'll want to check that the browser is able to do manually what you're trying to achieve programmatically before minutely examining the headers. Make sure that what you do manually is exactly the same as what you're trying to do from Python - you may simply be hitting a server bug that only gets revealed if you view pages in a particular order, for example. In order to see what your browser is sending to the server (even if HTTPS is in use), see the General FAQ page. If nothing is obviously wrong with the requests your program is sending and you're out of ideas, you can try the last resort of good old brute force binary-search debugging. Temporarily switch to sending HTTP headers (with httplib). Start by copying Netscape/Mozilla or IE slavishly (apart from session IDs, etc., of course), then begin the tedious process of mutating your headers and data until they match what your higher-level code was sending. This will at least reliably find your problem.
You can globally turn on display of HTTP headers:
import ClientCookie
ClientCookie.HTTP_DEBUG = True
(Note that doing this won't work:
from ClientCookie import HTTP_DEBUG
HTTP_DEBUG = True
If you don't understand that, you've misunderstood what the = operator does.)
Alternatively, you can examine your individual request and response objects to see what's going on. ClientCookie's responses can be made seek()able using SeekableProcessor. It's often useful to use the seek method like this during debugging:
...
response = ClientCookie.urlopen("http://spam.eggs.org/")
print response.read()
response.seek(0)
# rest of code continues as if you'd never .read() the response
...
If you would like to see what is going on in ClientCookie's tiny mind, do this:
ClientCookie.CLIENTCOOKIE_DEBUG = True
This can actually be quite useful, as it explains why particular cookies are accepted or rejected and why they are or are not returned. It also defeats the couple of catch-all except: statements in the code, which would otherwise be very confusing.
Also, note the ClientCookie.REDIRECT_DEBUG switch (which prints information about redirections) and HTTPResponseDebugProcessor (which prints out all response bodies, including those that are read during redirections).
It is possible to embed script in HTML pages (sandwiched between <SCRIPT>here</SCRIPT> tags, and in javascript: URLs) - JavaScript / ECMAScript, VBScript, or even Python - that causes cookies to be set in a browser. If you come across this in a page you want to automate, you have three options. Here they are, roughly in order of simplicity. First, you can simply figure out what the embedded script is doing and imitate it by manually adding cookies to your CookieJar instance. Second, if you're working on a Windows machine (or another platform where the MSHTML COM library is available) you could give up the fight and automate Microsoft Internet Explorer (MSIE) with COM. XXX Mozilla automation & XPCOM / PyXPCOM, Konqueror & KParts / PyKDE? Third, you could get ambitious and delegate the work to an appropriate interpreter (Mozilla's JavaScript interpreter, for instance). I'm working on that approach at the moment.
A function named str2time is provided by the package, which may be useful for parsing dates in HTTP headers. str2time is intended to be liberal, since HTTP date/time formats are poorly standardised in practice. There is no need to use this function in normal operations: CookieJar instances keep track of cookie lifetimes automatically. This function will stay around in some form, though the supported date/time formats may change.
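Typical usage might look like this (a sketch; that the return value is a time in seconds since the epoch is my assumption, based on the function's role in cookie expiry handling):

import ClientCookie
# parse an HTTP date string; assumed to return seconds since the epoch
t = ClientCookie.str2time("Wed, 09 Feb 1994 22:23:32 GMT")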
The various cookie standards and their history form a case study of the terrible things that can happen to a protocol. The long-suffering David Kristol has written a paper about it, if you want to know the gory details.
Here is a summary.
The Netscape protocol (cookie_spec.html) is still the only standard supported by most browsers (including Internet Explorer and Netscape). Be aware that cookie_spec.html is not, and never was, actually followed to the letter (or anything close) by anyone (including Netscape, IE and ClientCookie): the Netscape protocol standard is really defined by the behaviour of Netscape (and now IE). Netscape cookies are also known as V0 cookies, to distinguish them from RFC 2109 or RFC 2965 cookies, which have a version cookie-attribute with a value of 1.
RFC 2109 was introduced to fix some problems identified with the Netscape protocol, while still keeping the same HTTP headers (Cookie and Set-Cookie). The most prominent of these problems is the 'third-party' cookie issue, which was an accidental feature of the Netscape protocol. When one visits www.bland.org, one doesn't expect to get a cookie from www.lurid.com, a site one has never visited. Depending on browser configuration, this can still happen, because the unreconstructed Netscape protocol is happy to accept cookies from, say, an image in a webpage (www.bland.org) that's included by linking to an advertiser's server (www.lurid.com). This kind of event, where your browser talks to a server that you haven't explicitly okayed by some means, is what the RFCs call an 'unverifiable transaction'. In addition to the potential for embarrassment caused by the presence of lurid.com's cookies on one's machine, this may also be used to track your movements on the web, because advertising agencies like doubleclick.net place ads on many sites. RFC 2109 tried to change this by requiring cookies to be turned off during unverifiable transactions with third-party servers - unless the user explicitly asks them to be turned on. This clashed with the business model of advertisers like doubleclick.net, who had started to take advantage of the third-party cookies 'bug'. Since the browser vendors were more interested in the advertisers' concerns than those of the browser users, this arguably doomed both RFC 2109 and its successor, RFC 2965, from the start. Other problems than the third-party cookie issue were also fixed by 2109. However, even ignoring the advertising issue, 2109 was stillborn, because Internet Explorer and Netscape behaved differently in response to its extended Set-Cookie headers. This was not really RFC 2109's fault: it worked the way it did to keep compatibility with the Netscape protocol as implemented by Netscape. Microsoft Internet Explorer (MSIE) was very new when the standard was designed, but was starting to be very popular when the standard was finalised. XXX P3P, and MSIE & Mozilla options
XXX Apparently MSIE implements bits of RFC 2109 - but not very compliantly (surprise). Presumably other browsers do too, as a result. ClientCookie already allows Netscape cookies to have max-age and port cookie-attributes, and as far as I know that's the extent of the support present in MSIE. I haven't tested, though!
RFC 2965 attempted to fix the compatibility problem by introducing two new headers, Set-Cookie2 and Cookie2. Unlike the Cookie header, Cookie2 does not carry cookies to the server - rather, it simply advertises to the server that RFC 2965 is understood. Set-Cookie2 does carry cookies, from server to client: the new header means that both IE and Netscape completely ignore these cookies. This prevents breakage, but introduces a chicken-egg problem that means 2965 may never be widely adopted, especially since Microsoft shows no interest in it. XXX Rumour has it that the European Union is unhappy with P3P, and might introduce legislation that requires something better, forming a gap that RFC 2965 might fill - any truth in this? Opera is the only browser I know of that supports the standard. On the server side, Apache's mod_usertrack supports it. One confusing point to note about RFC 2965 is that it uses the same value (1) of the Version attribute in HTTP headers as does RFC 2109.
Recently, it was discovered that RFC 2965 does not fully take account of issues arising when 2965 and Netscape cookies coexist. At the time of writing (August 2003), the resulting errata are still being thrashed out (actually, the list traffic seems to have died for the moment).
Because Netscape cookies are so poorly specified, the general philosophy of the module's Netscape cookie implementation is to start with RFC 2965 and open holes where required for Netscape protocol-compatibility. RFC 2965 cookies are always treated as RFC 2965 requires, of course!
from ClientCookie import CookieJar
print CookieJar.extract_cookies.__doc__
print CookieJar.add_cookie_header.__doc__
I believe so, but it's not been tested yet.
On Windows, yes. MSIECookieJar does allow loading of cookies from MSIE. Saving may be added in future.
Yes. Use MozillaCookieJar. Note that any cookies you save while the browser is running will get clobbered by Mozilla / Netscape. You have to stop the browser before saving. Also, I recommend backing up your cookies.txt file if you want to save cookies to your browser's file!
The module docstrings are worth reading if you want to do something unusual.
Just call the read or readline methods on your response object as many times as you need. The seek method (which will only be there if you're using SeekableProcessor) still works, because SeekableProcessor's response objects cache read data.
urllib2 used "handlers", but not these "processors". See this Python library RFE.
Maybe the server is keeping track of state in some other way (HIDDEN HTML form controls, for example). Maybe you messed up your request, and the server is sending you some standard failure page (even if the page doesn't appear to indicate any failure).
What if I'm using both urlopen and the extract_cookies / add_cookie_header methods?
Don't do that. Pick one or the other. You probably just want to use urlopen (or OpenerDirector.open), unless you don't have urllib2, in which case you are forced to use the 'manual' mode (extract_cookies / add_cookie_header).
Session cookies (those with no expiry date) are discarded on saving to a file, unless you pass a true ignore_discard argument (the same goes for the load and revert methods); ignore_expires does the same for expired cookies:
import ClientCookie
cookies = ClientCookie.CookieJar()
opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cookies))
ClientCookie.install_opener(opener)
r = ClientCookie.urlopen("http://foobar.com/")
cookies.save("/some/file", ignore_discard=True, ignore_expires=True)
To switch off RFC 2965 cookie handling:

import ClientCookie
cookies = ClientCookie.CookieJar()
cookies.policy.rfc2965 = False
opener = ClientCookie.build_opener(ClientCookie.HTTPCookieProcessor(cookies))
ClientCookie.install_opener(opener)
r = ClientCookie.urlopen("http://foobar.com/")
The server may want the Referer [sic] and / or User-Agent HTTP headers to be set to appropriate values before it'll send you any cookies.
John J. Lee, November 2003.