Fetcher2 would be a great help to me, but it doesn't seem to
integrate with Nutch 0.8.1.
Any advice on how to use it with 0.8.1?
----- Original Message -----
From: "Andrzej Bialecki" <ab getopt.org>
To: <nutch-dev lucene.apache.org>
Sent: Thursday, January 18, 2007 5:18 AM
Subject: Fetcher2
> Hi all,
>
> I just committed a new implementation of the venerable fetcher, called
> Fetcher2. It uses a producer/consumers model with a set of per-host
> queues. Theoretically it should be able to achieve a much higher
> throughput, especially for fetchlists with a lot of contention (many
> URLs from the same hosts).
>
> It should be possible to achieve the same fetching rate with a smaller
> number of threads, and most importantly to avoid the dreaded "Exceeded
> http.max.delays: retry later" error.
>
> It is available through "bin/nutch fetch2".
>
> From the javadoc:
>
> "A queue-based fetcher.
>
> This fetcher uses a well-known model of one producer (a QueueFeeder) and
> many consumers (FetcherThread-s).
>
> QueueFeeder reads input fetchlists and populates a set of
> FetchItemQueue-s, which hold FetchItem-s that describe the items to be
> fetched. There are as many queues as there are unique hosts, but at any
> given time the total number of fetch items in all queues is less than a
> fixed number (currently set to a multiple of the number of threads).
>
> As items are consumed from the queues, the QueueFeeder continues to add
> new input items, so that their total count stays fixed (FetcherThread-s
> may also add new items to the queues, e.g. as a result of redirection) -
> until all input items are exhausted, at which point the number of items
> in the queues begins to decrease. When this number reaches 0, the fetcher
> will finish.
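
The feeder/consumer bookkeeping described in the quoted javadoc could be sketched roughly as below. This is only an illustration, not the Fetcher2 source: the class `HostQueuesSketch`, its methods, and the `maxQueued` parameter are hypothetical names, and the sketch keeps the total at or below the cap rather than reproducing whatever exact limit Fetcher2 applies.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch of per-host queues with a cap on the total number
// of queued items. Not taken from the Nutch code base.
class HostQueuesSketch {
    private final Map<String, Deque<String>> queues = new LinkedHashMap<>();
    private final int maxQueued;   // assumed cap across all host queues
    private int totalQueued = 0;

    HostQueuesSketch(int maxQueued) {
        this.maxQueued = maxQueued;
    }

    // Feeder side: pull URLs from the fetchlist until the cap is reached
    // or the input is exhausted. Returns how many items were added.
    synchronized int feed(Iterator<String> fetchlist) {
        int added = 0;
        while (totalQueued < maxQueued && fetchlist.hasNext()) {
            String url = fetchlist.next();
            String host = java.net.URI.create(url).getHost();
            queues.computeIfAbsent(host, h -> new ArrayDeque<>()).addLast(url);
            totalQueued++;
            added++;
        }
        return added;
    }

    // Consumer side: take an item from the first non-empty host queue,
    // or return null when every queue is empty.
    synchronized String poll() {
        for (Deque<String> q : queues.values()) {
            String url = q.pollFirst();
            if (url != null) {
                totalQueued--;
                return url;
            }
        }
        return null;
    }

    synchronized int totalQueued() {
        return totalQueued;
    }

    synchronized int hostCount() {
        return queues.size();
    }
}
```

Calling `feed` again after each `poll` tops the queues back up to the cap, which gives the "total count stays fixed until the input is exhausted" behaviour; once the fetchlist runs out, `totalQueued()` falls to 0.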
>
> This fetcher implementation handles per-host blocking itself, instead of
> delegating this work to protocol-specific plugins. Each per-host queue
> handles its own "politeness" settings, such as the maximum number of
> concurrent requests and crawl delay between consecutive requests - and
> also a list of requests in progress, and the time the last request was
> finished. As FetcherThread-s ask for new items to be fetched, queues may
> return eligible items or null if for "politeness" reasons this host's
> queue is not yet ready.
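
A per-host queue that enforces its own politeness rules might look something like the following. Again this is a sketch under assumptions, not the real FetchItemQueue: `PoliteQueueSketch`, `maxInProgress`, and `crawlDelayMs` are made-up names, a simple counter stands in for the list of requests in progress, and the current time is passed in explicitly so the behaviour is easy to follow.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical per-host queue with "politeness" limits.
// Not taken from the Nutch code base.
class PoliteQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final int maxInProgress;   // assumed max concurrent requests per host
    private final long crawlDelayMs;   // assumed delay after a request finishes
    private int inProgress = 0;        // counter instead of a list of requests
    private long lastFinishedMs = 0;
    private boolean anyFinished = false;

    PoliteQueueSketch(int maxInProgress, long crawlDelayMs) {
        this.maxInProgress = maxInProgress;
        this.crawlDelayMs = crawlDelayMs;
    }

    synchronized void add(String url) {
        pending.addLast(url);
    }

    // Returns the next eligible item, or null if the host is not ready
    // (too many requests in progress, or the crawl delay has not elapsed)
    // or the queue is empty.
    synchronized String next(long nowMs) {
        if (inProgress >= maxInProgress) {
            return null;
        }
        if (anyFinished && nowMs - lastFinishedMs < crawlDelayMs) {
            return null;
        }
        String url = pending.pollFirst();
        if (url != null) {
            inProgress++;
        }
        return url;
    }

    // Called by a fetcher thread when a request to this host completes.
    synchronized void finished(long nowMs) {
        inProgress--;
        lastFinishedMs = nowMs;
        anyFinished = true;
    }
}
```

With `maxInProgress = 1` and a 5000 ms delay, a second call to `next` returns null while the first request is still running, and keeps returning null until 5000 ms after `finished` was called.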
>
> If there are still unfetched items on the queues, but none of the items
> are ready, FetcherThread-s will spin-wait until either some items become
> available, or a timeout is reached (at which point the Fetcher will
> abort, assuming the task is hung)."
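
The spin-wait with a timeout could be approximated as below. This is a hedged sketch only: `SpinWaitSketch`, `pollWithTimeout`, and the `sleepMs` back-off are hypothetical, and how Fetcher2 actually measures the timeout or aborts is not shown in the quoted javadoc.

```java
import java.util.function.Supplier;

// Hypothetical helper: poll a source until it yields an item or a
// timeout expires. Not taken from the Nutch code base.
class SpinWaitSketch {
    // Returns the first non-null item from the source, or null if the
    // timeout is reached or the thread is interrupted. A caller would
    // treat null as "assume the task is hung and abort".
    static <T> T pollWithTimeout(Supplier<T> source, long timeoutMs, long sleepMs) {
        long deadline = System.nanoTime() + timeoutMs * 1_000_000L;
        while (true) {
            T item = source.get();
            if (item != null) {
                return item;
            }
            if (System.nanoTime() - deadline >= 0) {
                return null;   // timed out
            }
            try {
                Thread.sleep(sleepMs);   // brief pause between polls
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return null;
            }
        }
    }
}
```

A fetcher thread could pass something like a call to a per-host queue's "next item" method as the `source`; the short sleep between polls keeps the loop from burning a full CPU core while all hosts are blocked.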
>
>
> --
> Best regards,
> Andrzej Bialecki <><
> ___. ___ ___ ___ _ _ __________________________________
> [__ || __|__/|__||/| Information Retrieval, Semantic Web
> ___|||__|| | || | Embedded Unix, System Integration
> http://www.sigram.com Contact: info at sigram dot com
>
>