Thread: Re: Fetcher2




Re: Fetcher2
user name
2007-01-22 09:02:42
Fetcher2 should be a great help for me, but it seems it can't be
integrated with Nutch 0.8.1.
Any advice on how to use it with 0.8.1?
----- Original Message ----- 
From: "Andrzej Bialecki" <abgetopt.org>
To: <nutch-devlucene.apache.org>
Sent: Thursday, January 18, 2007 5:18 AM
Subject: Fetcher2


> Hi all,
> 
> I just committed a new implementation of the venerable fetcher, called
> Fetcher2. It uses a producer/consumers model with a set of per-host
> queues. Theoretically it should be able to achieve a much higher
> throughput, especially for fetchlists with a lot of contention (many
> URLs from the same hosts).
> 
> It should be possible to achieve the same fetching rate with a smaller
> number of threads, and most importantly to avoid the dreaded "Exceeded
> http.max.delays: retry later" error.
> 
> It is available through "bin/nutch fetch2".
> 
> From the javadoc:
> 
> "A queue-based fetcher.
> 
> This fetcher uses a well-known model of one producer (a QueueFeeder) and
> many consumers (FetcherThread-s).
> 
> QueueFeeder reads input fetchlists and populates a set of
> FetchItemQueue-s, which hold FetchItem-s that describe the items to be
> fetched. There are as many queues as there are unique hosts, but at any
> given time the total number of fetch items in all queues is less than a
> fixed number (currently set to a multiple of the number of threads).
> 
> As items are consumed from the queues, the QueueFeeder continues to add
> new input items, so that their total count stays fixed (FetcherThread-s
> may also add new items to the queues e.g. as a result of redirection) -
> until all input items are exhausted, at which point the number of items
> in the queues begins to decrease. When this number reaches 0 fetcher
> will finish.
> 
> This fetcher implementation handles per-host blocking itself, instead of
> delegating this work to protocol-specific plugins. Each per-host queue
> handles its own "politeness" settings, such as the maximum number of
> concurrent requests and crawl delay between consecutive requests - and
> also a list of requests in progress, and the time the last request was
> finished. As FetcherThread-s ask for new items to be fetched, queues may
> return eligible items or null if for "politeness" reasons this host's
> queue is not yet ready.
> 
> If there are still unfetched items on the queues, but none of the items
> are ready, FetcherThread-s will spin-wait until either some items become
> available, or a timeout is reached (at which point the Fetcher will
> abort, assuming the task is hung)."
> 
> 
> -- 
> Best regards,
> Andrzej Bialecki     <><
> ___. ___ ___ ___ _ _   __________________________________
> [__ || __|__/|__||/|  Information Retrieval, Semantic Web
> ___|||__||  |  ||  |  Embedded Unix, System Integration
> http://www.sigram.com
> Contact: info at sigram dot com
> 
> 
>
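
[Editorial illustration, not part of either message.] The per-host
"politeness" behaviour described in the quoted javadoc can be sketched
roughly as below. This is a minimal sketch, not the Fetcher2 source: the
class names HostQueue and HostQueues, every method name, and the choice to
pass the clock in as a parameter are assumptions made for illustration.
Each queue tracks requests in progress and the earliest time the next
request may start, and returns null when the host is not yet ready.

```java
import java.util.ArrayDeque;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Queue;

/** Hypothetical per-host queue holding its own politeness state. */
class HostQueue {
    private final Queue<String> pending = new ArrayDeque<>();
    private final int maxConcurrent;  // max requests in flight for this host
    private final long crawlDelayMs;  // required gap after a request finishes
    private int inProgress = 0;       // requests currently being fetched
    private long nextEligibleMs = Long.MIN_VALUE; // nothing finished yet

    HostQueue(int maxConcurrent, long crawlDelayMs) {
        this.maxConcurrent = maxConcurrent;
        this.crawlDelayMs = crawlDelayMs;
    }

    void add(String url) { pending.add(url); }

    /** Returns the next URL, or null if the queue is empty or not ready. */
    String poll(long nowMs) {
        if (inProgress >= maxConcurrent || nowMs < nextEligibleMs) return null;
        String url = pending.poll();
        if (url != null) inProgress++;
        return url;
    }

    /** Records that one request finished and restarts the crawl delay. */
    void finish(long nowMs) {
        inProgress--;
        nextEligibleMs = nowMs + crawlDelayMs;
    }

    int pendingCount() { return pending.size(); }
}

/** Hypothetical set of per-host queues shared by the fetcher threads. */
class HostQueues {
    // LinkedHashMap keeps hosts in insertion order, so polling is deterministic.
    private final Map<String, HostQueue> byHost = new LinkedHashMap<>();
    private final int maxConcurrent;
    private final long crawlDelayMs;

    HostQueues(int maxConcurrent, long crawlDelayMs) {
        this.maxConcurrent = maxConcurrent;
        this.crawlDelayMs = crawlDelayMs;
    }

    synchronized void add(String host, String url) {
        byHost.computeIfAbsent(host, h -> new HostQueue(maxConcurrent, crawlDelayMs))
              .add(url);
    }

    /** Returns an eligible URL from any host, or null if none is ready. */
    synchronized String next(long nowMs) {
        for (HostQueue q : byHost.values()) {
            String url = q.poll(nowMs);
            if (url != null) return url;
        }
        return null;
    }

    synchronized void finish(String host, long nowMs) {
        byHost.get(host).finish(nowMs);
    }

    /** Total number of items still waiting in all queues. */
    synchronized int pendingCount() {
        int total = 0;
        for (HostQueue q : byHost.values()) total += q.pendingCount();
        return total;
    }
}
```

In this sketch a consumer thread would call next() in a loop, sleep
briefly whenever it gets null while pendingCount() is still above zero,
and call finish() after each fetch; a feeder thread would call add()
whenever the pending total drops below its fixed cap.
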
Re: Fetcher2
user name
2007-01-22 10:09:05
chee wu wrote:
> Fetcher2 should be a great help for me, but it seems it can't be
> integrated with Nutch 0.8.1.
> Any advice on how to use it with 0.8.1?
>   

You would have to port it to Nutch 0.8.1 - e.g. change all Text
occurrences to UTF8, and most likely make other changes too ...

-- 
Best regards,
Andrzej Bialecki     <><
 ___. ___ ___ ___ _ _   __________________________________
[__ || __|__/|__||/|  Information Retrieval, Semantic Web
___|||__||  |  ||  |  Embedded Unix, System Integration
http://www.sigram.com 
Contact: info at sigram dot com


