Review the doc changes for the urllib package creation.
parent aca8fd7a9d
commit 0f7ede4569
@ -1,12 +1,12 @@
|
||||
*****************************************************
|
||||
HOWTO Fetch Internet Resources Using urllib package
|
||||
*****************************************************
|
||||
***********************************************************
|
||||
HOWTO Fetch Internet Resources Using The urllib Package
|
||||
***********************************************************
|
||||
|
||||
:Author: `Michael Foord <http://www.voidspace.org.uk/python/index.shtml>`_
|
||||
|
||||
.. note::
|
||||
|
||||
There is an French translation of an earlier revision of this
|
||||
There is a French translation of an earlier revision of this
|
||||
HOWTO, available at `urllib2 - Le Manuel manquant
|
||||
<http://www.voidspace.org.uk/python/articles/urllib2_francais.shtml>`_.
|
||||
|
||||
@ -18,7 +18,7 @@ Introduction
|
||||
.. sidebar:: Related Articles
|
||||
|
||||
You may also find useful the following article on fetching web resources
|
||||
with Python :
|
||||
with Python:
|
||||
|
||||
* `Basic Authentication <http://www.voidspace.org.uk/python/articles/authentication.shtml>`_
|
||||
|
||||
@ -94,8 +94,8 @@ your browser does when you submit a HTML form that you filled in on the web. Not
|
||||
all POSTs have to come from forms: you can use a POST to transmit arbitrary data
|
||||
to your own application. In the common case of HTML forms, the data needs to be
|
||||
encoded in a standard way, and then passed to the Request object as the ``data``
|
||||
argument. The encoding is done using a function from the ``urllib.parse`` library
|
||||
*not* from ``urllib.request``. ::
|
||||
argument. The encoding is done using a function from the :mod:`urllib.parse`
|
||||
library. ::
|
||||
|
||||
import urllib.parse
|
||||
import urllib.request
|
||||
@ -115,7 +115,7 @@ forms - see `HTML Specification, Form Submission
|
||||
<http://www.w3.org/TR/REC-html40/interact/forms.html#h-17.13>`_ for more
|
||||
details).
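The form-encoding step described above can be sketched as follows. This is a minimal illustration rather than the HOWTO's own listing: the URL and the form fields are hypothetical, and no request is actually sent. It also shows that a ``Request`` built without ``data`` defaults to GET.

```python
import urllib.parse
import urllib.request

# Hypothetical endpoint and form fields, used only for illustration.
url = 'http://www.example.com/cgi-bin/register.cgi'
values = {'name': 'Michael Foord', 'language': 'Python'}

# urlencode() builds the application/x-www-form-urlencoded string.
# Request expects bytes for ``data``, so the string is encoded afterwards.
encoded = urllib.parse.urlencode(values)
data = encoded.encode('ascii')

post_request = urllib.request.Request(url, data)  # data given: POST
get_request = urllib.request.Request(url)         # no data: GET

print(encoded)
print(post_request.get_method())
print(get_request.get_method())
```

Nothing is transmitted until one of these objects is passed to ``urlopen``.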
|
||||
|
||||
If you do not pass the ``data`` argument, urllib.request uses a **GET** request. One
|
||||
If you do not pass the ``data`` argument, urllib uses a **GET** request. One
|
||||
way in which GET and POST requests differ is that POST requests often have
|
||||
"side-effects": they change the state of the system in some way (for example by
|
||||
placing an order with the website for a hundredweight of tinned spam to be
|
||||
@ -182,13 +182,15 @@ which comes after we have a look at what happens when things go wrong.
|
||||
Handling Exceptions
|
||||
===================
|
||||
|
||||
*urllib.error* raises ``URLError`` when it cannot handle a response (though as usual
|
||||
*urlopen* raises ``URLError`` when it cannot handle a response (though as usual
|
||||
with Python APIs, builtin exceptions such as ValueError, TypeError etc. may also
|
||||
be raised).
|
||||
|
||||
``HTTPError`` is the subclass of ``URLError`` raised in the specific case of
|
||||
HTTP URLs.
|
||||
|
||||
The exception classes are exported from the :mod:`urllib.error` module.
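A minimal sketch of catching both exception classes, assuming a hypothetical helper named ``fetch``. The ``HTTPError`` clause has to come first because it is a subclass of ``URLError``. The sample call uses a deliberately unknown URL scheme so that the failure happens locally, without contacting any server.

```python
from urllib.error import HTTPError, URLError
from urllib.request import urlopen


def fetch(url):
    """Return the body of *url*, or None if the request fails."""
    try:
        with urlopen(url, timeout=10) as response:
            return response.read()
    except HTTPError as exc:
        # Must come before URLError, because HTTPError is a subclass of it.
        print('The server could not fulfil the request:', exc.code)
    except URLError as exc:
        print('Failed to reach a server:', exc.reason)
    return None


# No handler exists for this made-up scheme, so URLError is raised locally.
result = fetch('nosuchscheme://example.invalid/')
```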
|
||||
|
||||
URLError
|
||||
--------
|
||||
|
||||
@ -214,7 +216,7 @@ Every HTTP response from the server contains a numeric "status code". Sometimes
|
||||
the status code indicates that the server is unable to fulfil the request. The
|
||||
default handlers will handle some of these responses for you (for example, if
|
||||
the response is a "redirection" that requests the client fetch the document from
|
||||
a different URL, urllib.request will handle that for you). For those it can't handle,
|
||||
a different URL, urllib will handle that for you). For those it can't handle,
|
||||
urlopen will raise an ``HTTPError``. Typical errors include '404' (page not
|
||||
found), '403' (request forbidden), and '401' (authentication required).
|
||||
|
||||
@ -380,7 +382,7 @@ info and geturl
|
||||
|
||||
The response returned by urlopen (or the ``HTTPError`` instance) has two useful
|
||||
methods ``info`` and ``geturl`` and is defined in the module
|
||||
``urllib.response``.
|
||||
:mod:`urllib.response`.
|
||||
|
||||
**geturl** - this returns the real URL of the page fetched. This is useful
|
||||
because ``urlopen`` (or the opener object used) may have followed a
|
||||
@ -388,7 +390,7 @@ redirect. The URL of the page fetched may not be the same as the URL requested.
|
||||
|
||||
**info** - this returns a dictionary-like object that describes the page
|
||||
fetched, particularly the headers sent by the server. It is currently an
|
||||
``http.client.HTTPMessage`` instance.
|
||||
:class:`http.client.HTTPMessage` instance.
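The two methods can be tried without network access by opening a ``data:`` URL. This is only a sketch, and it assumes a Python version whose ``urllib.request`` handles ``data:`` URLs. For such a URL the object returned by ``info()`` is a generic message object rather than an ``HTTPMessage``, but it supports the same dictionary-like lookups.

```python
from urllib.request import urlopen

# A data: URL carries its own payload, so nothing is fetched over the network.
url = 'data:text/plain;charset=utf-8,hello'

with urlopen(url) as response:
    headers = response.info()      # dictionary-like header object
    final_url = response.geturl()  # URL actually retrieved
    body = response.read()

print(final_url)
print(headers['Content-type'])
print(body)
```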
|
||||
|
||||
Typical headers include 'Content-length', 'Content-type', and so on. See the
|
||||
`Quick Reference to HTTP Headers <http://www.cs.tut.fi/~jkorpela/http.html>`_
|
||||
@ -508,7 +510,7 @@ not correct.
|
||||
Proxies
|
||||
=======
|
||||
|
||||
**urllib.request** will auto-detect your proxy settings and use those. This is through
|
||||
**urllib** will auto-detect your proxy settings and use those. This is through
|
||||
the ``ProxyHandler`` which is part of the normal handler chain. Normally that's
|
||||
a good thing, but there are occasions when it may not be helpful [#]_. One way
|
||||
to do this is to setup our own ``ProxyHandler``, with no proxies defined. This
|
||||
@ -528,8 +530,8 @@ is done using similar steps to setting up a `Basic Authentication`_ handler : ::
|
||||
Sockets and Layers
|
||||
==================
|
||||
|
||||
The Python support for fetching resources from the web is layered.
|
||||
urllib.request uses the http.client library, which in turn uses the socket library.
|
||||
The Python support for fetching resources from the web is layered. urllib uses
|
||||
the :mod:`http.client` library, which in turn uses the socket library.
|
||||
|
||||
As of Python 2.3 you can specify how long a socket should wait for a response
|
||||
before timing out. This can be useful in applications which have to fetch web
|
||||
@ -573,9 +575,9 @@ This document was reviewed and revised by John Lee.
|
||||
`Quick Reference to HTTP Headers`_.
|
||||
.. [#] In my case I have to use a proxy to access the internet at work. If you
|
||||
attempt to fetch *localhost* URLs through this proxy it blocks them. IE
|
||||
is set to use the proxy, which urllib2 picks up on. In order to test
|
||||
scripts with a localhost server, I have to prevent urllib2 from using
|
||||
is set to use the proxy, which urllib picks up on. In order to test
|
||||
scripts with a localhost server, I have to prevent urllib from using
|
||||
the proxy.
|
||||
.. [#] urllib2 opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
|
||||
.. [#] urllib opener for SSL proxy (CONNECT method): `ASPN Cookbook Recipe
|
||||
<http://aspn.activestate.com/ASPN/Cookbook/Python/Recipe/456195>`_.
|
||||
|
||||
|
@ -98,9 +98,9 @@ Functions provided:
|
||||
And lets you write code like this::
|
||||
|
||||
from contextlib import closing
|
||||
import urllib.request
|
||||
from urllib.request import urlopen
|
||||
|
||||
with closing(urllib.request.urlopen('http://www.python.org')) as page:
|
||||
with closing(urlopen('http://www.python.org')) as page:
|
||||
    for line in page:
|
||||
        print(line)
|
||||
|
||||
|
@ -13,8 +13,7 @@
|
||||
|
||||
This module defines classes which implement the client side of the HTTP and
|
||||
HTTPS protocols. It is normally not used directly --- the module
|
||||
:mod:`urllib.request`
|
||||
uses it to handle URLs that use HTTP and HTTPS.
|
||||
:mod:`urllib.request` uses it to handle URLs that use HTTP and HTTPS.
|
||||
|
||||
.. note::
|
||||
|
||||
|
@ -1,73 +0,0 @@
|
||||
|
||||
:mod:`robotparser` --- Parser for robots.txt
|
||||
=============================================
|
||||
|
||||
.. module:: robotparser
|
||||
:synopsis: Loads a robots.txt file and answers questions about
|
||||
fetchability of other URLs.
|
||||
.. sectionauthor:: Skip Montanaro <skip@pobox.com>
|
||||
|
||||
|
||||
.. index::
|
||||
single: WWW
|
||||
single: World Wide Web
|
||||
single: URL
|
||||
single: robots.txt
|
||||
|
||||
This module provides a single class, :class:`RobotFileParser`, which answers
|
||||
questions about whether or not a particular user agent can fetch a URL on the
|
||||
Web site that published the :file:`robots.txt` file. For more details on the
|
||||
structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.
|
||||
|
||||
|
||||
.. class:: RobotFileParser()
|
||||
|
||||
This class provides a set of methods to read, parse and answer questions
|
||||
about a single :file:`robots.txt` file.
|
||||
|
||||
|
||||
.. method:: set_url(url)
|
||||
|
||||
Sets the URL referring to a :file:`robots.txt` file.
|
||||
|
||||
|
||||
.. method:: read()
|
||||
|
||||
Reads the :file:`robots.txt` URL and feeds it to the parser.
|
||||
|
||||
|
||||
.. method:: parse(lines)
|
||||
|
||||
Parses the lines argument.
|
||||
|
||||
|
||||
.. method:: can_fetch(useragent, url)
|
||||
|
||||
Returns ``True`` if the *useragent* is allowed to fetch the *url*
|
||||
according to the rules contained in the parsed :file:`robots.txt`
|
||||
file.
|
||||
|
||||
|
||||
.. method:: mtime()
|
||||
|
||||
Returns the time the ``robots.txt`` file was last fetched. This is
|
||||
useful for long-running web spiders that need to check for new
|
||||
``robots.txt`` files periodically.
|
||||
|
||||
|
||||
.. method:: modified()
|
||||
|
||||
Sets the time the ``robots.txt`` file was last fetched to the current
|
||||
time.
|
||||
|
||||
The following example demonstrates basic use of the RobotFileParser class. ::
|
||||
|
||||
>>> import robotparser
|
||||
>>> rp = robotparser.RobotFileParser()
|
||||
>>> rp.set_url("http://www.musi-cal.com/robots.txt")
|
||||
>>> rp.read()
|
||||
>>> rp.can_fetch("*", "http://www.musi-cal.com/cgi-bin/search?city=San+Francisco")
|
||||
False
|
||||
>>> rp.can_fetch("*", "http://www.musi-cal.com/")
|
||||
True
|
||||
|
@ -2,47 +2,47 @@
|
||||
==================================================================
|
||||
|
||||
.. module:: urllib.error
|
||||
:synopsis: Next generation URL opening library.
|
||||
:synopsis: Exception classes raised by urllib.request.
|
||||
.. moduleauthor:: Jeremy Hylton <jhylton@users.sourceforge.net>
|
||||
.. sectionauthor:: Senthil Kumaran <orsenthil@gmail.com>
|
||||
|
||||
|
||||
The :mod:`urllib.error` module defines exception classes raise by
|
||||
urllib.request. The base exception class is URLError, which inherits from
|
||||
IOError.
|
||||
The :mod:`urllib.error` module defines the exception classes for exceptions
|
||||
raised by :mod:`urllib.request`. The base exception class is :exc:`URLError`,
|
||||
which inherits from :exc:`IOError`.
|
||||
|
||||
The following exceptions are raised by :mod:`urllib.error` as appropriate:
|
||||
|
||||
|
||||
.. exception:: URLError
|
||||
|
||||
The handlers raise this exception (or derived exceptions) when they run into a
|
||||
problem. It is a subclass of :exc:`IOError`.
|
||||
The handlers raise this exception (or derived exceptions) when they run into
|
||||
a problem. It is a subclass of :exc:`IOError`.
|
||||
|
||||
.. attribute:: reason
|
||||
|
||||
The reason for this error. It can be a message string or another exception
|
||||
instance (:exc:`socket.error` for remote URLs, :exc:`OSError` for local
|
||||
URLs).
|
||||
The reason for this error. It can be a message string or another
|
||||
exception instance (:exc:`socket.error` for remote URLs, :exc:`OSError`
|
||||
for local URLs).
|
||||
|
||||
|
||||
.. exception:: HTTPError
|
||||
|
||||
Though being an exception (a subclass of :exc:`URLError`), an :exc:`HTTPError`
|
||||
can also function as a non-exceptional file-like return value (the same thing
|
||||
that :func:`urlopen` returns). This is useful when handling exotic HTTP
|
||||
errors, such as requests for authentication.
|
||||
Though being an exception (a subclass of :exc:`URLError`), an
|
||||
:exc:`HTTPError` can also function as a non-exceptional file-like return
|
||||
value (the same thing that :func:`urlopen` returns). This is useful when
|
||||
handling exotic HTTP errors, such as requests for authentication.
|
||||
|
||||
.. attribute:: code
|
||||
|
||||
An HTTP status code as defined in `RFC 2616 <http://www.faqs.org/rfcs/rfc2616.html>`_.
|
||||
This numeric value corresponds to a value found in the dictionary of
|
||||
codes as found in :attr:`http.server.BaseHTTPRequestHandler.responses`.
|
||||
An HTTP status code as defined in `RFC 2616
|
||||
<http://www.faqs.org/rfcs/rfc2616.html>`_. This numeric value corresponds
|
||||
to a value found in the dictionary of codes as found in
|
||||
:attr:`http.server.BaseHTTPRequestHandler.responses`.
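As a rough illustration of that lookup, the sketch below turns a numeric status code into a short description. The helper name ``describe_status`` is hypothetical and not part of any module.

```python
from http.server import BaseHTTPRequestHandler


def describe_status(code):
    """Return 'code short-message' using the standard responses table."""
    # Each entry maps a status code to a (short message, long message) pair.
    short, _long = BaseHTTPRequestHandler.responses.get(code, ('Unknown', ''))
    return '{} {}'.format(code, short)


for code in (401, 403, 404):
    print(describe_status(code))
```

The same lookup can be applied to the ``code`` attribute of a caught ``HTTPError``.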
|
||||
|
||||
.. exception:: ContentTooShortError(msg[, content])
|
||||
|
||||
This exception is raised when the :func:`urlretrieve` function detects that the
|
||||
amount of the downloaded data is less than the expected amount (given by the
|
||||
*Content-Length* header). The :attr:`content` attribute stores the downloaded
|
||||
(and supposedly truncated) data.
|
||||
This exception is raised when the :func:`urlretrieve` function detects that
|
||||
the amount of the downloaded data is less than the expected amount (given by
|
||||
the *Content-Length* header). The :attr:`content` attribute stores the
|
||||
downloaded (and supposedly truncated) data.
|
||||
|
||||
|
@ -20,13 +20,12 @@ to an absolute URL given a "base URL."
|
||||
The module has been designed to match the Internet RFC on Relative Uniform
|
||||
Resource Locators (and discovered a bug in an earlier draft!). It supports the
|
||||
following URL schemes: ``file``, ``ftp``, ``gopher``, ``hdl``, ``http``,
|
||||
``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``,
|
||||
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
|
||||
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.
|
||||
``https``, ``imap``, ``mailto``, ``mms``, ``news``, ``nntp``, ``prospero``,
|
||||
``rsync``, ``rtsp``, ``rtspu``, ``sftp``, ``shttp``, ``sip``, ``sips``,
|
||||
``snews``, ``svn``, ``svn+ssh``, ``telnet``, ``wais``.
|
||||
|
||||
The :mod:`urllib.parse` module defines the following functions:
|
||||
|
||||
|
||||
.. function:: urlparse(urlstring[, default_scheme[, allow_fragments]])
|
||||
|
||||
Parse a URL into six components, returning a 6-tuple. This corresponds to the
|
||||
@ -92,11 +91,11 @@ The :mod:`urllib.parse` module defines the following functions:
|
||||
|
||||
.. function:: urlunparse(parts)
|
||||
|
||||
Construct a URL from a tuple as returned by ``urlparse()``. The *parts* argument
|
||||
can be any six-item iterable. This may result in a slightly different, but
|
||||
equivalent URL, if the URL that was parsed originally had unnecessary delimiters
|
||||
(for example, a ? with an empty query; the RFC states that these are
|
||||
equivalent).
|
||||
Construct a URL from a tuple as returned by ``urlparse()``. The *parts*
|
||||
argument can be any six-item iterable. This may result in a slightly
|
||||
different, but equivalent URL, if the URL that was parsed originally had
|
||||
unnecessary delimiters (for example, a ``?`` with an empty query; the RFC
|
||||
states that these are equivalent).
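A short sketch of the round trip described above, using a hypothetical URL that ends in a bare ``?``. The second call shows that a plain six-item list is accepted as well.

```python
from urllib.parse import urlparse, urlunparse

# The trailing '?' introduces an empty query string.
original = 'http://www.example.com/index.html?'
rebuilt = urlunparse(urlparse(original))

# Any six-item iterable works: scheme, netloc, path, params, query, fragment.
from_list = urlunparse(['http', 'www.example.com', '/index.html', '', 'q=spam', ''])

print(rebuilt)
print(from_list)
```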
|
||||
|
||||
|
||||
.. function:: urlsplit(urlstring[, default_scheme[, allow_fragments]])
|
||||
@ -140,19 +139,19 @@ The :mod:`urllib.parse` module defines the following functions:
|
||||
|
||||
.. function:: urlunsplit(parts)
|
||||
|
||||
Combine the elements of a tuple as returned by :func:`urlsplit` into a complete
|
||||
URL as a string. The *parts* argument can be any five-item iterable. This may
|
||||
result in a slightly different, but equivalent URL, if the URL that was parsed
|
||||
originally had unnecessary delimiters (for example, a ? with an empty query; the
|
||||
RFC states that these are equivalent).
|
||||
Combine the elements of a tuple as returned by :func:`urlsplit` into a
|
||||
complete URL as a string. The *parts* argument can be any five-item
|
||||
iterable. This may result in a slightly different, but equivalent URL, if the
|
||||
URL that was parsed originally had unnecessary delimiters (for example, a ``?``
|
||||
with an empty query; the RFC states that these are equivalent).
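For illustration, a five-item list can be passed directly, without calling :func:`urlsplit` first. The component values below are made up.

```python
from urllib.parse import urlsplit, urlunsplit

# scheme, netloc, path, query, fragment
parts = ['https', 'www.example.org', '/docs/index.html', 'lang=en', 'intro']
url = urlunsplit(parts)
print(url)

# Splitting the result gives back the same five components.
round_trip = list(urlsplit(url))
print(round_trip)
```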
|
||||
|
||||
|
||||
.. function:: urljoin(base, url[, allow_fragments])
|
||||
|
||||
Construct a full ("absolute") URL by combining a "base URL" (*base*) with
|
||||
another URL (*url*). Informally, this uses components of the base URL, in
|
||||
particular the addressing scheme, the network location and (part of) the path,
|
||||
to provide missing components in the relative URL. For example:
|
||||
particular the addressing scheme, the network location and (part of) the
|
||||
path, to provide missing components in the relative URL. For example:
|
||||
|
||||
>>> from urllib.parse import urljoin
|
||||
>>> urljoin('http://www.cwi.nl/%7Eguido/Python.html', 'FAQ.html')
|
||||
@ -178,10 +177,10 @@ The :mod:`urllib.parse` module defines the following functions:
|
||||
|
||||
.. function:: urldefrag(url)
|
||||
|
||||
If *url* contains a fragment identifier, returns a modified version of *url*
|
||||
with no fragment identifier, and the fragment identifier as a separate string.
|
||||
If there is no fragment identifier in *url*, returns *url* unmodified and an
|
||||
empty string.
|
||||
If *url* contains a fragment identifier, return a modified version of *url*
|
||||
with no fragment identifier, and the fragment identifier as a separate
|
||||
string. If there is no fragment identifier in *url*, return *url* unmodified
|
||||
and an empty string.
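A small sketch of both cases, using hypothetical URLs. The result unpacks like a two-item tuple.

```python
from urllib.parse import urldefrag

# With a fragment: the URL is returned without it, plus the fragment itself.
url, fragment = urldefrag('http://www.example.com/page.html#section-2')
print(url, fragment)

# Without a fragment: the URL comes back unchanged with an empty string.
plain_url, empty = urldefrag('http://www.example.com/page.html')
print(plain_url, repr(empty))
```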
|
||||
|
||||
.. function:: quote(string[, safe])
|
||||
|
||||
@ -195,9 +194,10 @@ The :mod:`urllib.parse` module defines the following functions:
|
||||
|
||||
.. function:: quote_plus(string[, safe])
|
||||
|
||||
Like :func:`quote`, but also replaces spaces by plus signs, as required for
|
||||
quoting HTML form values. Plus signs in the original string are escaped unless
|
||||
they are included in *safe*. It also does not have *safe* default to ``'/'``.
|
||||
Like :func:`quote`, but also replace spaces by plus signs, as required for
|
||||
quoting HTML form values. Plus signs in the original string are escaped
|
||||
unless they are included in *safe*. It also does not have *safe* default to
|
||||
``'/'``.
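The differences from :func:`quote` can be seen in a short comparison. The input strings are arbitrary examples.

```python
from urllib.parse import quote, quote_plus

text = '/tinned spam/'

# quote() keeps '/' (its default *safe*) and encodes the space as %20.
quoted = quote(text)
# quote_plus() encodes '/' as well and turns the space into '+'.
plus_quoted = quote_plus(text)

# A literal '+' is escaped unless it is listed in *safe*.
escaped_plus = quote_plus('a+b c')
kept_plus = quote_plus('a+b c', safe='+')

print(quoted, plus_quoted, escaped_plus, kept_plus)
```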
|
||||
|
||||
|
||||
.. function:: unquote(string)
|
||||
@ -209,7 +209,7 @@ The :mod:`urllib.parse` module defines the following functions:
|
||||
|
||||
.. function:: unquote_plus(string)
|
||||
|
||||
Like :func:`unquote`, but also replaces plus signs by spaces, as required for
|
||||
Like :func:`unquote`, but also replace plus signs by spaces, as required for
|
||||
unquoting HTML form values.
|
||||
|
||||
|
||||
@ -254,7 +254,6 @@ The result objects from the :func:`urlparse` and :func:`urlsplit` functions are
|
||||
subclasses of the :class:`tuple` type. These subclasses add the attributes
|
||||
described in those functions, as well as provide an additional method:
|
||||
|
||||
|
||||
.. method:: ParseResult.geturl()
|
||||
|
||||
Return the re-combined version of the original URL as a string. This may differ
|
||||
@ -279,13 +278,12 @@ described in those functions, as well as provide an additional method:
|
||||
|
||||
The following classes provide the implementations of the parse results:
|
||||
|
||||
|
||||
.. class:: BaseResult
|
||||
|
||||
Base class for the concrete result classes. This provides most of the attribute
|
||||
definitions. It does not provide a :meth:`geturl` method. It is derived from
|
||||
:class:`tuple`, but does not override the :meth:`__init__` or :meth:`__new__`
|
||||
methods.
|
||||
Base class for the concrete result classes. This provides most of the
|
||||
attribute definitions. It does not provide a :meth:`geturl` method. It is
|
||||
derived from :class:`tuple`, but does not override the :meth:`__init__` or
|
||||
:meth:`__new__` methods.
|
||||
|
||||
|
||||
.. class:: ParseResult(scheme, netloc, path, params, query, fragment)
|
||||
|
@ -7,9 +7,9 @@
|
||||
.. sectionauthor:: Moshe Zadka <moshez@users.sourceforge.net>
|
||||
|
||||
|
||||
The :mod:`urllib.request` module defines functions and classes which help in opening
|
||||
URLs (mostly HTTP) in a complex world --- basic and digest authentication,
|
||||
redirections, cookies and more.
|
||||
The :mod:`urllib.request` module defines functions and classes which help in
|
||||
opening URLs (mostly HTTP) in a complex world --- basic and digest
|
||||
authentication, redirections, cookies and more.
|
||||
|
||||
The :mod:`urllib.request` module defines the following functions:
|
||||
|
||||
@ -180,7 +180,7 @@ The following classes are provided:
|
||||
the ``User-Agent`` header, which is used by a browser to identify itself --
|
||||
some HTTP servers only allow requests coming from common browsers as opposed
|
||||
to scripts. For example, Mozilla Firefox may identify itself as ``"Mozilla/5.0
|
||||
(X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"``, while :mod:`urllib2`'s
|
||||
(X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11"``, while :mod:`urllib`'s
|
||||
default user agent string is ``"Python-urllib/2.6"`` (on Python 2.6).
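A minimal sketch of overriding the header, assuming a hypothetical target URL. The request is only constructed here, not sent. Note that ``Request`` stores header names in capitalized form, so the lookup key is ``'User-agent'``.

```python
import urllib.request

# Browser-style identification string, taken from the example above.
user_agent = 'Mozilla/5.0 (X11; U; Linux i686) Gecko/20071127 Firefox/2.0.0.11'

request = urllib.request.Request(
    'http://www.example.com/',
    headers={'User-Agent': user_agent},
)

# Header names are normalized with str.capitalize() when they are stored.
print(request.has_header('User-agent'))
print(request.get_header('User-agent'))
```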
|
||||
|
||||
The final two arguments are only of interest for correct handling of third-party
|
||||
@ -1005,10 +1005,11 @@ HTTPErrorProcessor Objects
|
||||
|
||||
For non-200 error codes, this simply passes the job on to the
|
||||
:meth:`protocol_error_code` handler methods, via :meth:`OpenerDirector.error`.
|
||||
Eventually, :class:`urllib2.HTTPDefaultErrorHandler` will raise an
|
||||
Eventually, :class:`HTTPDefaultErrorHandler` will raise an
|
||||
:exc:`HTTPError` if no other handler handles the error.
|
||||
|
||||
.. _urllib2-examples:
|
||||
|
||||
.. _urllib-request-examples:
|
||||
|
||||
Examples
|
||||
--------
|
||||
@ -1180,15 +1181,18 @@ The following example uses no proxies at all, overriding environment settings::
|
||||
using the :mod:`ftplib` module, subclassing :class:`FancyURLOpener`, or changing
|
||||
*_urlopener* to meet your needs.
|
||||
|
||||
|
||||
|
||||
:mod:`urllib.response` --- Response classes used by urllib.
|
||||
===========================================================
|
||||
|
||||
.. module:: urllib.response
|
||||
:synopsis: Response classes used by urllib.
|
||||
|
||||
The :mod:`urllib.response` module defines functions and classes which define a
|
||||
minimal file like interface, including read() and readline(). The typical
|
||||
response object is an addinfourl instance, which defines and info() method and
|
||||
that returns headers and a geturl() method that returns the url.
|
||||
minimal file-like interface, including ``read()`` and ``readline()``. The
|
||||
typical response object is an addinfourl instance, which defines an ``info()``
|
||||
method that returns headers and a ``geturl()`` method that returns the URL.
|
||||
Functions defined by this module are used internally by the
|
||||
:mod:`urllib.request` module.
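To make the interface concrete, the sketch below builds an ``addinfourl`` by hand around an in-memory stream. This is for illustration only: the headers and URL are invented, and in normal use such objects are created by :mod:`urllib.request` rather than by application code.

```python
import io
from email.message import Message
from urllib.response import addinfourl

# Invented headers and URL; a real response gets these from the server.
headers = Message()
headers['Content-Type'] = 'text/plain'
url = 'http://www.example.com/'

response = addinfourl(io.BytesIO(b'first line\nsecond line\n'), headers, url)

first = response.readline()   # file-like reading
rest = response.read()
same_headers = response.info() is headers
reported_url = response.geturl()
response.close()

print(first, rest, same_headers, reported_url)
```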
|
||||
|
||||
|
@ -1,9 +1,8 @@
|
||||
|
||||
:mod:`urllib.robotparser` --- Parser for robots.txt
|
||||
====================================================
|
||||
|
||||
.. module:: urllib.robotparser
|
||||
:synopsis: Loads a robots.txt file and answers questions about
|
||||
:synopsis: Load a robots.txt file and answer questions about
|
||||
fetchability of other URLs.
|
||||
.. sectionauthor:: Skip Montanaro <skip@pobox.com>
|
||||
|
||||
@ -25,42 +24,37 @@ structure of :file:`robots.txt` files, see http://www.robotstxt.org/orig.html.
|
||||
This class provides a set of methods to read, parse and answer questions
|
||||
about a single :file:`robots.txt` file.
|
||||
|
||||
|
||||
.. method:: set_url(url)
|
||||
|
||||
Sets the URL referring to a :file:`robots.txt` file.
|
||||
|
||||
|
||||
.. method:: read()
|
||||
|
||||
Reads the :file:`robots.txt` URL and feeds it to the parser.
|
||||
|
||||
|
||||
.. method:: parse(lines)
|
||||
|
||||
Parses the lines argument.
|
||||
|
||||
|
||||
.. method:: can_fetch(useragent, url)
|
||||
|
||||
Returns ``True`` if the *useragent* is allowed to fetch the *url*
|
||||
according to the rules contained in the parsed :file:`robots.txt`
|
||||
file.
|
||||
|
||||
|
||||
.. method:: mtime()
|
||||
|
||||
Returns the time the ``robots.txt`` file was last fetched. This is
|
||||
useful for long-running web spiders that need to check for new
|
||||
``robots.txt`` files periodically.
|
||||
|
||||
|
||||
.. method:: modified()
|
||||
|
||||
Sets the time the ``robots.txt`` file was last fetched to the current
|
||||
time.
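The methods above can also be exercised offline by handing rules straight to ``parse()`` instead of calling ``read()``. The rules and URLs in this sketch are hypothetical.

```python
import urllib.robotparser

# Hypothetical rules, supplied as lines instead of being fetched with read().
rules = [
    'User-agent: *',
    'Disallow: /cgi-bin/',
]

rp = urllib.robotparser.RobotFileParser()
rp.parse(rules)

blocked = rp.can_fetch('*', 'http://www.example.com/cgi-bin/search?city=San+Francisco')
allowed = rp.can_fetch('*', 'http://www.example.com/')

print(blocked)
print(allowed)
```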
|
||||
|
||||
The following example demonstrates basic use of the RobotFileParser class. ::
|
||||
|
||||
The following example demonstrates basic use of the RobotFileParser class.
|
||||
|
||||
>>> import urllib.robotparser
|
||||
>>> rp = urllib.robotparser.RobotFileParser()
|
||||
|
@ -150,8 +150,8 @@ There are a number of modules for accessing the internet and processing internet
|
||||
protocols. Two of the simplest are :mod:`urllib.request` for retrieving data
|
||||
from urls and :mod:`smtplib` for sending mail::
|
||||
|
||||
>>> import urllib.request
|
||||
>>> for line in urllib.request.urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
|
||||
>>> from urllib.request import urlopen
|
||||
>>> for line in urlopen('http://tycho.usno.navy.mil/cgi-bin/timer.pl'):
|
||||
...     if b'EST' in line or b'EDT' in line:  # look for Eastern Time (lines are bytes)
|
||||
...         print(line)
|
||||
|
||||
|