Review the doc changes for the urllib package creation.
This commit is contained in:
parent
aca8fd7a9d
commit
0f7ede4569
@@ -1,12 +1,12 @@
-*****************************************************
-HOWTO Fetch Internet Resources Using urllib package
-*****************************************************
+***********************************************************
+HOWTO Fetch Internet Resources Using The urllib Package
+***********************************************************

 :Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_

 .. note::

-There is an French translation of an earlier revision of this
+There is a French translation of an earlier revision of this
 HOWTO, available at `urllib2 - Le Manuel manquant
 <http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.

@@ -94,8 +94,8 @@ your browser does when you submit a HTML form that you filled in on the web. Not
 all POSTs have to come from forms: you can use a POST to transmit arbitrary data
 to your own application. In the common case of HTML forms, the data needs to be
 encoded in a standard way, and then passed to the Request object as the ``data``
-argument. The encoding is done using a function from the ``urllib.parse`` library
-*not* from ``urllib.request``. ::
+argument. The encoding is done using a function from the :mod:`urllib.parse`
+library. ::

 import urllib.parse
 import urllib.request
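To illustrate the POST pattern this hunk documents, a minimal hedged sketch, assuming current Python 3 where the ``data`` argument must be bytes; the URL and form values here are invented::

    import urllib.parse
    import urllib.request

    url = 'http://www.example.com/cgi-bin/register.cgi'        # invented endpoint
    values = {'name': 'Michael Foord', 'language': 'Python'}   # invented form fields

    # The encoding step lives in urllib.parse, not urllib.request
    data = urllib.parse.urlencode(values).encode('ascii')

    # Supplying data makes this a POST request
    req = urllib.request.Request(url, data)
    response = urllib.request.urlopen(req)
    the_page = response.read()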
@@ -115,7 +115,7 @@ forms - see `HTML Specification, Form Submission
 <http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
 details).

-If you do not pass the ``data`` argument, urllib.request uses a **GET** request. One
+If you do not pass the ``data`` argument, urllib uses a **GET** request. One
 way in which GET and POST requests differ is that POST requests often have
 "side-effects": they change the state of the system in some way (for example by
 placing an order with the website for a hundredweight of tinned spam to be
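For contrast, a hedged sketch of the GET form described here, where the encoded values are appended to the URL rather than passed as ``data`` (values and URL invented)::

    import urllib.parse
    import urllib.request

    query = urllib.parse.urlencode({'q': 'python', 'lang': 'en'})   # invented query values
    url = 'http://www.example.com/search' + '?' + query             # invented URL

    # No data argument, so urllib issues a GET request
    response = urllib.request.urlopen(url)
    print(response.read()[:100])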
@@ -182,13 +182,15 @@ which comes after we have a look at what happens when things go wrong.
 Handling Exceptions
 ===================

-*urllib.error* raises ``URLError`` when it cannot handle a response (though as usual
+*urlopen* raises ``URLError`` when it cannot handle a response (though as usual
 with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also
 be raised).

 ``HTTPError`` is the subclass of ``URLError`` raised in the specific case of
 HTTP URLs.

+The exception classes are exported from the :mod:`urllib.error` module.
+
 URLError
 --------

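A minimal sketch of the error-handling pattern after the rename, assuming the exception classes live in :mod:`urllib.error` as this hunk states (the URL is made up)::

    import urllib.error
    import urllib.request

    try:
        response = urllib.request.urlopen('http://www.example.com/missing')  # made-up URL
    except urllib.error.HTTPError as e:
        # HTTPError is the URLError subclass raised for HTTP URLs, so catch it first
        print('The server could not fulfil the request. Error code:', e.code)
    except urllib.error.URLError as e:
        print('We failed to reach the server. Reason:', e.reason)
    else:
        print(response.read()[:100])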
@@ -214,7 +216,7 @@ Every HTTP response from the server contains a numeric "status code". Sometimes
 the status code indicates that the server is unable to fulfil the request. The
 default handlers will handle some of these responses for you (for example, if
 the response is a "redirection" that requests the client fetch the document from
-a different URL, urllib.request will handle that for you). For those it can't handle,
+a different URL, urllib will handle that for you). For those it can't handle,
 urlopen will raise an ``HTTPError``. Typical errors include '404' (page not
 found), '403' (request forbidden), and '401' (authentication required).

@@ -380,7 +382,7 @@ info and geturl

 The response returned by urlopen (or the ``HTTPError`` instance) has two useful
 methods ``info`` and ``geturl`` and is defined in the module
-``urllib.response``.
+:mod:`urllib.response`.

 **geturl** - this returns the real URL of the page fetched. This is useful
 because ``urlopen`` (or the opener object used) may have followed a
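A brief hedged illustration of the two methods mentioned here; the exact output depends on the server, and the URL is only an example::

    import urllib.request

    response = urllib.request.urlopen('http://www.example.com/')  # example URL

    # geturl() gives the final URL, after any redirect the opener followed
    print(response.geturl())

    # info() returns an http.client.HTTPMessage holding the response headers
    headers = response.info()
    print(headers['Content-Type'])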
@@ -388,7 +390,7 @@ redirect. The URL of the page fetched may not be the same as the URL requested.

 **info** - this returns a dictionary-like object that describes the page
 fetched, particularly the headers sent by the server. It is currently an
-``http.client.HTTPMessage`` instance.
+:class:`http.client.HTTPMessage` instance.

 Typical headers include 'Content-length', 'Content-type', and so on. See the
 `Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
@@ -508,7 +510,7 @@ not correct.
 Proxies
 =======

-**urllib.request** will auto-detect your proxy settings and use those. This is through
+**urllib** will auto-detect your proxy settings and use those. This is through
 the ``ProxyHandler`` which is part of the normal handler chain. Normally that's
 a good thing, but there are occasions when it may not be helpful [#]_. One way
 to do this is to setup our own ``ProxyHandler``, with no proxies defined. This
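A minimal sketch of the "no proxies" opener the surrounding text describes, mirroring the handler-chain pattern from the HOWTO rather than adding anything new::

    import urllib.request

    # An empty ProxyHandler disables proxy auto-detection for this opener
    proxy_support = urllib.request.ProxyHandler({})
    opener = urllib.request.build_opener(proxy_support)

    # Install it globally so plain urlopen() calls use it as well
    urllib.request.install_opener(opener)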
@@ -528,8 +530,8 @@ is done using similar steps to setting up a `Basic Authentication`_ handler : ::
 Sockets and Layers
 ==================

-The Python support for fetching resources from the web is layered.
-urllib.request uses the http.client library, which in turn uses the socket library.
+The Python support for fetching resources from the web is layered. urllib uses
+the :mod:`http.client` library, which in turn uses the socket library.

 As of Python 2.3 you can specify how long a socket should wait for a response
 before timing out. This can be useful in applications which have to fetch web
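Since the context here mentions the socket-level timeout, the usual hedged sketch of the module-wide default timeout the HOWTO goes on to show::

    import socket
    import urllib.request

    # Timeout in seconds, applied to all new socket connections
    socket.setdefaulttimeout(10)

    # This call now inherits the default timeout set above
    response = urllib.request.urlopen('http://www.example.com/')  # example URL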
@@ -573,9 +575,9 @@ This document was reviewed and revised by John Lee.
 `Quick Reference to HTTP Headers`_.
 .. [#] In my case I have to use a proxy to access the internet at work. If you
 attempt to fetch *localhost* URLs through this proxy it blocks them. IE
-is set to use the proxy, which urllib2 picks up on. In order to test
-scripts with a localhost server, I have to prevent urllib2 from using
+is set to use the proxy, which urllib picks up on. In order to test
+scripts with a localhost server, I have to prevent urllib from using
 the proxy.
-.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
+.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
 <http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.

@@ -98,9 +98,9 @@ Functions provided:
 And lets you write code like this::

 from contextlib import closing
-import urllib.request
+from urllib.request import urlopen

-with closing(urllib.request.urlopen('http://www.python.org')) as page:
+with closing(urlopen('http://www.python.org')) as page:
 for line in page:
 print(line)

@@ -13,8 +13,7 @@

 This module defines classes which implement the client side of the HTTP and
 HTTPS protocols. It is normally not used directly --- the module
-:mod:`urllib.request`
-uses it to handle URLs that use HTTP and HTTPS.
+:mod:`urllib.request` uses it to handle URLs that use HTTP and HTTPS.

 .. note::

@@ -1,73 +0,0 @@
-
-:mod:`robotparser` --- Parser for robots.txt
-=============================================
-
-.. module:: robotparser
-:synopsis: Loads a robots.txt file and answers questions about
-fetchability of other URLs.
-.. sectionauthor:: Skip Montanaro <skip@pobox.com>
-
-
-.. index::
-single: WWW
-single: World Wide Web
-single: URL
-single: robots.txt
-
-This module provides a single class, :class:`RobotFileParser`, which answers
-questions about whether or not a particular user agent can fetch a URL on the
-Web site that published the :file:`robots.txt` file. For more details on the
-structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.
-
-
-.. class:: RobotFileParser()
-
-This class provides a set of methods to read, parse and answer questions
-about a single :file:`robots.txt` file.
-
-
-.. method:: set_url(url)
-
-Sets the URL referring to a :file:`robots.txt` file.
-
-
-.. method:: read()
-
-Reads the :file:`robots.txt` URL and feeds it to the parser.
-
-
-.. method:: parse(lines)
-
-Parses the lines argument.
-
-
-.. method:: can_fetch(useragent, url)
-
-Returns ``True`` if the *useragent* is allowed to fetch the *url*
-according to the rules contained in the parsed :file:`robots.txt`
-file.
-
-
-.. method:: mtime()
-
-Returns the time the ``robots.txt`` file was last fetched. This is
-useful for long-running web spiders that need to check for new
-``robots.txt`` files periodically.
-
-
-.. method:: modified()
-
-Sets the time the ``robots.txt`` file was last fetched to the current
-time.
-
-The following example demonstrates basic use of the RobotFileParser class. ::
-
->>> import robotparser
->>> rp = robotparser.RobotFileParser()
->>> rp.set_url("http://www.musi-cal.com/robots.txt")
->>> rp.read()
->>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
-False
->>> rp.can_fetch("*", "http://www.musi-cal.com/")
-True
-
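Because this hunk removes the old standalone page, here is a hedged sketch of the equivalent session against the renamed :mod:`urllib.robotparser` module, reusing the robots.txt URL from the removed example::

    >>> import urllib.robotparser
    >>> rp = urllib.robotparser.RobotFileParser()
    >>> rp.set_url("http://www.musi-cal.com/robots.txt")
    >>> rp.read()
    >>> rp.can_fetch("*", "http://www.musi-cal.com/")
    True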
@@ -2,47 +2,47 @@
 ==================================================================

 .. module:: urllib.error
-:synopsis: Next generation URL opening library.
+:synopsis: Exception classes raised by urllib.request.
 .. moduleauthor:: Jeremy Hylton <jhylton@users.sourceforge.net>
 .. sectionauthor:: Senthil Kumaran <orsenthil@gmail.com>


-The :mod:`urllib.error` module defines exception classes raise by
-urllib.request. The base exception class is URLError, which inherits from
-IOError.
+The :mod:`urllib.error` module defines the exception classes for exceptions
+raised by :mod:`urllib.request`. The base exception class is :exc:`URLError`,
+which inherits from :exc:`IOError`.

 The following exceptions are raised by :mod:`urllib.error` as appropriate:


 .. exception:: URLError

-The handlers raise this exception (or derived exceptions) when they run into a
-problem. It is a subclass of :exc:`IOError`.
+The handlers raise this exception (or derived exceptions) when they run into
+a problem. It is a subclass of :exc:`IOError`.

 .. attribute:: reason

-The reason for this error. It can be a message string or another exception
-instance (:exc:`socket.error` for remote URLs, :exc:`OSError` for local
-URLs).
+The reason for this error. It can be a message string or another
+exception instance (:exc:`socket.error` for remote URLs, :exc:`OSError`
+for local URLs).


 .. exception:: HTTPError

-Though being an exception (a subclass of :exc:`URLError`), an :exc:`HTTPError`
-can also function as a non-exceptional file-like return value (the same thing
-that :func:`urlopen` returns). This is useful when handling exotic HTTP
-errors, such as requests for authentication.
+Though being an exception (a subclass of :exc:`URLError`), an
+:exc:`HTTPError` can also function as a non-exceptional file-like return
+value (the same thing that :func:`urlopen` returns). This is useful when
+handling exotic HTTP errors, such as requests for authentication.

 .. attribute:: code

-An HTTP status code as defined in `RFC 2616 <http://www.faqs.org/rfcs/rfc2616.html>`_.
-This numeric value corresponds to a value found in the dictionary of
-codes as found in :attr:`http.server.BaseHTTPRequestHandler.responses`.
+An HTTP status code as defined in `RFC 2616
+<http://www.faqs.org/rfcs/rfc2616.html>`_. This numeric value corresponds
+to a value found in the dictionary of codes as found in
+:attr:`http.server.BaseHTTPRequestHandler.responses`.

 .. exception:: ContentTooShortError(msg[, content])

-This exception is raised when the :func:`urlretrieve` function detects that the
-amount of the downloaded data is less than the expected amount (given by the
-*Content-Length* header). The :attr:`content` attribute stores the downloaded
-(and supposedly truncated) data.
+This exception is raised when the :func:`urlretrieve` function detects that
+the amount of the downloaded data is less than the expected amount (given by
+the *Content-Length* header). The :attr:`content` attribute stores the
+downloaded (and supposedly truncated) data.

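A short hedged sketch of the file-like behaviour of :exc:`HTTPError` described above, i.e. the exception can be read like the object :func:`urlopen` returns (made-up URL)::

    import urllib.error
    import urllib.request

    try:
        response = urllib.request.urlopen('http://www.example.com/protected')  # made-up URL
    except urllib.error.HTTPError as e:
        print(e.code)     # numeric status code, e.g. 401 or 404
        body = e.read()   # the error page itself, since HTTPError is file-like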
@@ -26,7 +26,6 @@ following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,

 The :mod:`urllib.parse` module defines the following functions:

-
 .. function:: urlparse(urlstring[, default_scheme[, allow_fragments]])

 Parse a URL into six components, returning a 6-tuple. This corresponds to the
@@ -92,11 +91,11 @@ The :mod:`urllib.parse` module defines the following functions:

 .. function:: urlunparse(parts)

-Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument
-can be any six-item iterable. This may result in a slightly different, but
-equivalent URL, if the URL that was parsed originally had unnecessary delimiters
-(for example, a ? with an empty query; the RFC states that these are
-equivalent).
+Construct a URL from a tuple as returned by ``urlparse()``. The *parts*
+argument can be any six-item iterable. This may result in a slightly
+different, but equivalent URL, if the URL that was parsed originally had
+unnecessary delimiters (for example, a ``?`` with an empty query; the RFC
+states that these are equivalent).


 .. function:: urlsplit(urlstring[, default_scheme[, allow_fragments]])
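For reviewers, a hedged interactive sketch of the urlparse()/urlunparse() round trip documented in this file; the URL is arbitrary::

    >>> from urllib.parse import urlparse, urlunparse
    >>> parts = urlparse('http://www.example.com/path;params?query=1#frag')
    >>> parts.scheme, parts.netloc, parts.path
    ('http', 'www.example.com', '/path')
    >>> urlunparse(parts)
    'http://www.example.com/path;params?query=1#frag'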
@@ -140,19 +139,19 @@ The :mod:`urllib.parse` module defines the following functions:

 .. function:: urlunsplit(parts)

-Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
-URL as a string. The *parts* argument can be any five-item iterable. This may
-result in a slightly different, but equivalent URL, if the URL that was parsed
-originally had unnecessary delimiters (for example, a ? with an empty query; the
-RFC states that these are equivalent).
+Combine the elements of a tuple as returned by :func:`urlsplit` into a
+complete URL as a string. The *parts* argument can be any five-item
+iterable. This may result in a slightly different, but equivalent URL, if the
+URL that was parsed originally had unnecessary delimiters (for example, a ?
+with an empty query; the RFC states that these are equivalent).


 .. function:: urljoin(base, url[, allow_fragments])

 Construct a full ("absolute") URL by combining a "base URL" (*base*) with
 another URL (*url*). Informally, this uses components of the base URL, in
-particular the addressing scheme, the network location and (part of) the path,
-to provide missing components in the relative URL. For example:
+particular the addressing scheme, the network location and (part of) the
+path, to provide missing components in the relative URL. For example:

 >>> from urllib.parse import urljoin
 >>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
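As a companion to the urljoin example already in the diff, a hedged sketch of the urlsplit()/urlunsplit() pair (arbitrary URL again)::

    >>> from urllib.parse import urlsplit, urlunsplit
    >>> parts = urlsplit('http://www.example.com/index.html?lang=en#intro')
    >>> parts.query, parts.fragment
    ('lang=en', 'intro')
    >>> urlunsplit(parts)
    'http://www.example.com/index.html?lang=en#intro'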
@@ -178,10 +177,10 @@ The :mod:`urllib.parse` module defines the following functions:

 .. function:: urldefrag(url)

-If *url* contains a fragment identifier, returns a modified version of *url*
-with no fragment identifier, and the fragment identifier as a separate string.
-If there is no fragment identifier in *url*, returns *url* unmodified and an
-empty string.
+If *url* contains a fragment identifier, return a modified version of *url*
+with no fragment identifier, and the fragment identifier as a separate
+string. If there is no fragment identifier in *url*, return *url* unmodified
+and an empty string.

 .. function:: quote(string[, safe])

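A hedged one-line illustration of urldefrag() as documented here; the result unpacks into the URL and the fragment, and the URL is invented::

    >>> from urllib.parse import urldefrag
    >>> url, fragment = urldefrag('http://www.example.com/guide.html#section-2')
    >>> url
    'http://www.example.com/guide.html'
    >>> fragment
    'section-2'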
@@ -195,9 +194,10 @@ The :mod:`urllib.parse` module defines the following functions:

 .. function:: quote_plus(string[, safe])

-Like :func:`quote`, but also replaces spaces by plus signs, as required for
-quoting HTML form values. Plus signs in the original string are escaped unless
-they are included in *safe*. It also does not have *safe* default to ``'/'``.
+Like :func:`quote`, but also replace spaces by plus signs, as required for
+quoting HTML form values. Plus signs in the original string are escaped
+unless they are included in *safe*. It also does not have *safe* default to
+``'/'``.


 .. function:: unquote(string)
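A hedged sketch contrasting quote() and quote_plus() as the surrounding text describes, including the differing *safe* default (the sample string is invented)::

    >>> from urllib.parse import quote, quote_plus, unquote_plus
    >>> quote('a few words/here')        # '/' is safe by default
    'a%20few%20words/here'
    >>> quote_plus('a few words/here')   # spaces become '+', '/' is escaped
    'a+few+words%2Fhere'
    >>> unquote_plus('a+few+words%2Fhere')
    'a few words/here'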
@@ -209,7 +209,7 @@ The :mod:`urllib.parse` module defines the following functions:

 .. function:: unquote_plus(string)

-Like :func:`unquote`, but also replaces plus signs by spaces, as required for
+Like :func:`unquote`, but also replace plus signs by spaces, as required for
 unquoting HTML form values.


@@ -254,7 +254,6 @@ The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
 subclasses of the :class:`tuple` type. These subclasses add the attributes
 described in those functions, as well as provide an additional method:

-
 .. method:: ParseResult.geturl()

 Return the re-combined version of the original URL as a string. This may differ
@@ -279,13 +278,12 @@ described in those functions, as well as provide an additional method:

 The following classes provide the implementations of the parse results::

-
 .. class:: BaseResult

-Base class for the concrete result classes. This provides most of the attribute
-definitions. It does not provide a :meth:`geturl` method. It is derived from
-:class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
-methods.
+Base class for the concrete result classes. This provides most of the
+attribute definitions. It does not provide a :meth:`geturl` method. It is
+derived from :class:`tuple`, but does not override the :meth:`__init__` or
+:meth:`__new__` methods.


 .. class:: ParseResult(scheme, netloc, path, params, query, fragment)
@@ -7,9 +7,9 @@
 .. sectionauthor:: Moshe Zadka <moshez@users.sourceforge.net>


-The :mod:`urllib.request` module defines functions and classes which help in opening
-URLs (mostly HTTP) in a complex world --- basic and digest authentication,
-redirections, cookies and more.
+The :mod:`urllib.request` module defines functions and classes which help in
+opening URLs (mostly HTTP) in a complex world --- basic and digest
+authentication, redirections, cookies and more.

 The :mod:`urllib.request` module defines the following functions:

@@ -180,7 +180,7 @@ The following classes are provided:
 the ``User-Agent`` header, which is used by a browser to identify itself --
 some HTTP servers only allow requests coming from common browsers as opposed
 to scripts. For example, Mozilla Firefox may identify itself as ``"Mozilla/5.0
-(X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"``, while :mod:`urllib2`'s
+(X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"``, while :mod:`urllib`'s
 default user agent string is ``"Python-urllib/2.6"`` (on Python 2.6).

 The final two arguments are only of interest for correct handling of third-party
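Because this hunk concerns the ``User-Agent`` header, the usual hedged sketch of supplying one through Request; the header value and URL are placeholders::

    import urllib.request

    url = 'http://www.example.com/'                                        # placeholder URL
    headers = {'User-Agent': 'Mozilla/5.0 (compatible; example-client)'}   # placeholder value

    req = urllib.request.Request(url, headers=headers)
    response = urllib.request.urlopen(req)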
@@ -1005,10 +1005,11 @@ HTTPErrorProcessor Objects

 For non-200 error codes, this simply passes the job on to the
 :meth:`protocol_error_code` handler methods, via :meth:`OpenerDirector.error`.
-Eventually, :class:`urllib2.HTTPDefaultErrorHandler` will raise an
+Eventually, :class:`HTTPDefaultErrorHandler` will raise an
 :exc:`HTTPError` if no other handler handles the error.

-.. _urllib2-examples:
+
+.. _urllib-request-examples:

 Examples
 --------
@@ -1180,15 +1181,18 @@ The following example uses no proxies at all, overriding environment settings::
 using the :mod:`ftplib` module, subclassing :class:`FancyURLOpener`, or changing
 *_urlopener* to meet your needs.
+
+
+
 :mod:`urllib.response` --- Response classes used by urllib.
 ===========================================================

 .. module:: urllib.response
 :synopsis: Response classes used by urllib.

 The :mod:`urllib.response` module defines functions and classes which define a
-minimal file like interface, including read() and readline(). The typical
-response object is an addinfourl instance, which defines and info() method and
-that returns headers and a geturl() method that returns the url.
+minimal file like interface, including ``read()`` and ``readline()``. The
+typical response object is an addinfourl instance, which defines and ``info()``
+method and that returns headers and a ``geturl()`` method that returns the url.
 Functions defined by this module are used internally by the
 :mod:`urllib.request` module.

@@ -1,9 +1,8 @@
-
 :mod:`urllib.robotparser` --- Parser for robots.txt
 ====================================================

 .. module:: urllib.robotparser
-:synopsis: Loads a robots.txt file and answers questions about
+:synopsis: Load a robots.txt file and answer questions about
 fetchability of other URLs.
 .. sectionauthor:: Skip Montanaro <skip@pobox.com>

@@ -25,42 +24,37 @@ structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.
 This class provides a set of methods to read, parse and answer questions
 about a single :file:`robots.txt` file.


 .. method:: set_url(url)

 Sets the URL referring to a :file:`robots.txt` file.


 .. method:: read()

 Reads the :file:`robots.txt` URL and feeds it to the parser.


 .. method:: parse(lines)

 Parses the lines argument.


 .. method:: can_fetch(useragent, url)

 Returns ``True`` if the *useragent* is allowed to fetch the *url*
 according to the rules contained in the parsed :file:`robots.txt`
 file.


 .. method:: mtime()

 Returns the time the ``robots.txt`` file was last fetched. This is
 useful for long-running web spiders that need to check for new
 ``robots.txt`` files periodically.


 .. method:: modified()

 Sets the time the ``robots.txt`` file was last fetched to the current
 time.

-The following example demonstrates basic use of the RobotFileParser class. ::
+The following example demonstrates basic use of the RobotFileParser class.

 >>> import urllib.robotparser
 >>> rp = urllib.robotparser.RobotFileParser()
@@ -150,8 +150,8 @@ There are a number of modules for accessing the internet and processing internet
 protocols. Two of the simplest are :mod:`urllib.request` for retrieving data
 from urls and :mod:`smtplib` for sending mail::

->>> import urllib.request
->>> for line in urllib.request.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
+>>> from urllib.request import urlopen
+>>> for line in urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
 ... if 'EST' in line or 'EDT' in line: # look for Eastern Time
 ... print(line)
