[ http://issues.apache.org/jira/browse/NUTCH-339?page=all ]
Andrzej Bialecki updated NUTCH-339:
------------------------------------
Attachment: patch4-trunk.txt
These patches implement a queue-based Fetcher, where
fetching threads don't spin-wait for blocking entries.
A few comments on the architecture of Fetcher2:
* per-host blocking is disabled in lib-http if plugins are
used with Fetcher2, because in this case Fetcher2 itself
handles blocking - otherwise the plugin works as before (the
end effect is that you can still use the plain Fetcher with
these patches).
* RobotRules can now be obtained from any protocol, and a
default dummy implementation is provided for those protocols
that normally don't define them.
* fetchlist records are read by a separate thread
(QueueFeeder), and stuffed into a set of queues, based on a
combination of protocol + host name (or host address,
depending on a setting); e.g. for a URL
"http://www.cnn.com/SPORT/index.html" the queueID will be
either "http://www.cnn.com" or "http://64.236.24.28".
QueueFeeder maintains a fixed total size of the queues
(N * number of fetcher threads), until it exhausts all input
records.
* each proto/host queue keeps its own information about:
- max number of threads (maxThreads) for this proto/host
combination
- crawlDelay (when maxThreads == 1) and minCrawlDelay
(when maxThreads > 1)
- a set of items currently being processed (inProgress)
- time when the last fetch request was finished (endTime)
Items are picked from the queue in a FIFO fashion, if
inProgress.size() < maxThreads and if endTime +
crawlDelay < now. Picked items are recorded in the
inProgress set.
* there is one global set of queues in the fetcher, with
some utility methods to keep track of the total number of
queued items, and to get the first eligible item from any
queue.
* FetcherThreads try to pick up new work items from the
queues, or spin-wait if none are available yet.
* when both the input and the queues are exhausted, the
fetcher finishes its map operation.
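To illustrate the per-queue rules listed above, here is a minimal sketch in Java. It is not the code from the patch: the class name HostQueueSketch, its methods, and the use of plain strings as work items are all assumptions made for the example. It only shows how a queueID can be derived from protocol + host (or address) and how the eligibility test (inProgress.size() < maxThreads and endTime + crawlDelay < now) might look.

```java
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.URI;
import java.net.UnknownHostException;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch of one proto/host queue; all names are hypothetical. */
class HostQueueSketch {
    private final int maxThreads;
    private final long delayMs;                                // effective delay between fetches
    private final Deque<String> pending = new ArrayDeque<>();  // FIFO of URLs to fetch
    private final Set<String> inProgress = new HashSet<>();    // items being fetched now
    private long endTime = 0L;                                 // when the last fetch finished

    HostQueueSketch(int maxThreads, long crawlDelayMs, long minCrawlDelayMs) {
        this.maxThreads = maxThreads;
        // crawlDelay applies with a single thread, minCrawlDelay with several.
        this.delayMs = (maxThreads > 1) ? minCrawlDelayMs : crawlDelayMs;
    }

    /** Builds "protocol://host", or "protocol://address" when byIp is set. */
    static String queueId(String url, boolean byIp) {
        URI u = URI.create(url);
        String host = u.getHost();
        if (byIp) {
            try {
                host = InetAddress.getByName(host).getHostAddress(); // DNS lookup
            } catch (UnknownHostException e) {
                throw new UncheckedIOException(e);
            }
        }
        return u.getScheme() + "://" + host;
    }

    void add(String url) {
        pending.addLast(url);
    }

    /** Returns the next item in FIFO order, or null if the queue is not eligible. */
    String poll(long now) {
        if (inProgress.size() >= maxThreads) return null; // all slots for this host are busy
        if (endTime + delayMs >= now) return null;        // still inside the crawl delay
        String url = pending.pollFirst();
        if (url != null) inProgress.add(url);
        return url;
    }

    /** Records that a fetch finished, which starts the crawl delay. */
    void finish(String url, long now) {
        inProgress.remove(url);
        endTime = now;
    }

    public static void main(String[] args) {
        System.out.println(queueId("http://www.cnn.com/SPORT/index.html", false));
        HostQueueSketch q = new HostQueueSketch(1, 5000L, 0L);
        q.add("http://www.cnn.com/SPORT/index.html");
        q.add("http://www.cnn.com/other.html"); // made-up second URL
        long now = System.currentTimeMillis();
        String first = q.poll(now);
        System.out.println("picked: " + first);
        System.out.println("while first is in progress: " + q.poll(now));
        q.finish(first, now);
        System.out.println("right after finish: " + q.poll(now + 1));
        System.out.println("after the crawl delay: " + q.poll(now + 5001));
    }
}
```

Passing the current time into poll() and finish() instead of reading the clock inside is only a convenience for the sketch; it makes the delay logic easy to check by hand.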
In my limited experiments I didn't notice the thread
starvation seen previously, because threads don't block when
they can't process the current item. However, there are still
issues with very slow sites (most probably we need to
terminate such threads). Also, with slow sites and many
pages from the same host, fetch items still tend to
accumulate, so towards the end of the fetch the speed may
still be slightly lower.
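The reason threads no longer starve can be shown with a small sketch of the global set of queues: a thread asking for work skips any host that is still blocked and takes the first eligible item from another queue. This is only an illustration under simplifying assumptions, not the patch itself: the names GlobalQueuesSketch and getFirstEligible are made up, maxThreads is ignored, and the delay is counted from the moment an item is picked rather than from the end of the fetch.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.LinkedHashMap;
import java.util.Map;

/** Illustrative sketch of the global queue set; all names are hypothetical. */
class GlobalQueuesSketch {
    /** Minimal per-host state: pending URLs and the time the host is free again. */
    private static final class HostQueue {
        final Deque<String> pending = new ArrayDeque<>();
        long nextFetchTime = 0L;
    }

    private final Map<String, HostQueue> queues = new LinkedHashMap<>();
    private int totalSize = 0; // total number of queued items across all hosts

    void add(String queueId, String url) {
        queues.computeIfAbsent(queueId, k -> new HostQueue()).pending.addLast(url);
        totalSize++;
    }

    int totalSize() {
        return totalSize;
    }

    /** First eligible item from any queue, or null if every host is blocked or empty. */
    String getFirstEligible(long now, long crawlDelayMs) {
        for (HostQueue q : queues.values()) {
            if (q.pending.isEmpty() || now < q.nextFetchTime) {
                continue; // skip this host instead of waiting on it
            }
            q.nextFetchTime = now + crawlDelayMs; // simplification: delay starts at pick time
            totalSize--;
            return q.pending.pollFirst();
        }
        return null; // the caller would spin-wait and retry
    }

    public static void main(String[] args) {
        GlobalQueuesSketch g = new GlobalQueuesSketch();
        g.add("http://a.example", "http://a.example/1");
        g.add("http://a.example", "http://a.example/2");
        g.add("http://b.example", "http://b.example/1");
        System.out.println(g.getFirstEligible(1000L, 5000L)); // host a is free
        System.out.println(g.getFirstEligible(1001L, 5000L)); // host a is blocked, so host b is used
        System.out.println(g.getFirstEligible(1002L, 5000L)); // nothing eligible right now
        System.out.println("still queued: " + g.totalSize());
    }
}
```

When getFirstEligible() returns null the thread has nothing to do for the moment, which corresponds to the spin-wait case; it never sits blocked on one particular host while other hosts have work.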
The advantage of this new architecture is that it's much
easier to understand how blocking occurs, and also that
reading from input is decoupled from further processing,
which should make it easier to move later on to NIO-based
(non-blocking) processing.
Some open issues:
* it was quite difficult to consistently measure the
fetching speed. Due to changing network conditions results
vary even for the same fetchlist, and even with the same
implementation - and differences can be significant (like 15
pages/s for one run vs. 3 pages/s for another run with
exactly the same parameters).
* I decided for now not to use NIO. The reason is that
protocol plugins don't support it, so if we switched to
select-based modus operandi we would have to rewrite all
protocol plugins.
Please give it a try - comments, suggestions and patches are
welcome!
> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
> Key: NUTCH-339
> URL: http://issues.apache.org/jira/browse/NUTCH-339
> Project: Nutch
> Issue Type: Task
> Components: fetcher
> Affects Versions: 0.8
> Environment: n/a
> Reporter: Sami Siren
> Assigned To: Sami Siren
> Fix For: 0.9.0
>
> Attachments: patch.txt, patch2.txt, patch3.txt, patch4-trunk.txt
>
>
> As I (and Stefan?) see it, there are two major areas where the
> current fetcher could be improved (as in speed):
> 1. The politeness code and how it is implemented is the biggest
> problem of the current fetcher (together with robots.txt handling).
> Simple code changes, like replacing it with a PriorityQueue-based
> solution, showed very promising results in increased IO.
> 2. Changing the fetcher to use non-blocking IO (this requires a
> great amount of work, as we need to implement the protocols from
> scratch again).
> I would like to start working towards #1 by first refactoring
> the current code (plugins, actually) in the following way:
> 1. Move robots.txt handling away from the (lib-http) plugin.
> Even if this is related only to http, leaving it in lib-http
> does not allow other kinds of scheduling strategies to be
> implemented (it is hardcoded to fetch robots.txt from the same
> thread when requesting a page from a site from which it hasn't
> yet tried to load robots.txt).
> 2. Move the politeness code away from the (lib-http) plugin.
> It is really usable outside http, and the current design also
> limits changing the implementation (to a queue-based one).
> Where to move these? My suggestion is the nutch core; does anybody
> see problems with this?
> These code refactoring activities are to be done in a way that
> none of the current functionality is (at least deliberately)
> changed, leaving current functionality as is and thus leaving room
> and possibility to build the next generation fetcher(s) without
> destroying the old one at the same time.
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira