[ http://issues.apache.org/jira/browse/NUTCH-339?page=comments#action_12453820 ]
Andrzej Bialecki commented on NUTCH-339:
-----------------------------------------
This looks weird; if anything, it seems to be caused by a
bug in Hadoop. Are you able to run "readseg -dump" on this fetchlist?
Another idea: do you have any "lease expired" messages in your log
around that time? It looks like the underlying input stream may have
been closed.
> Refactor nutch to allow fetcher improvements
> --------------------------------------------
>
> Key: NUTCH-339
> URL: http://issues.apache.org/jira/browse/NUTCH-339
> Project: Nutch
> Issue Type: Task
> Components: fetcher
> Affects Versions: 0.8
> Environment: n/a
> Reporter: Sami Siren
> Assigned To: Andrzej Bialecki
> Fix For: 0.9.0
>
> Attachments: patch.txt, patch2.txt, patch3.txt, patch4-fixed.txt, patch4-trunk.txt
>
>
> As I (and Stefan?) see it, there are two major areas where the
> current fetcher could be improved (in terms of speed):
> 1. The politeness code and how it is implemented is the biggest
> problem of the current fetcher (together with robots.txt handling).
> Simple code changes, like replacing it with a PriorityQueue-based
> solution, showed very promising results in increased IO.
> 2. Changing the fetcher to use non-blocking IO (this requires a
> great amount of work, as we need to implement the protocols from
> scratch again).
> I would like to start working towards #1 by first refactoring
> the current code (plugins, actually) in the following way:
> 1. Move robots.txt handling away from the (lib-http) plugin.
> Even if this is related only to http, leaving it in lib-http
> does not allow other kinds of scheduling strategies to be implemented
> (it is hardcoded to fetch robots.txt from the same thread when
> requesting a page from a site from which it hasn't yet tried to load
> robots.txt).
> 2. Move the politeness code away from the (lib-http) plugin.
> It is really usable outside http, and the current design also limits
> changing the implementation (to a queue-based one).
> Where to move these? My suggestion is the nutch core; does anybody
> see problems with this?
> These code refactoring activities are to be done in such a way that
> none of the current functionality is (at least deliberately) changed,
> thus leaving room to build the next generation fetcher(s) without
> destroying the old one at the same time.
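
The PriorityQueue-based politeness idea from point 1 of the description could be sketched roughly as follows. This is only a hypothetical illustration and is not taken from any of the attached patches: the class name PoliteFetchQueue, its add/poll methods, and the single fixed per-host delay are all assumptions. Hosts are ordered by the earliest time they may be contacted again, so a fetcher thread can pick the next eligible host without blocking on a busy one.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;
import java.util.PriorityQueue;

/** Hypothetical sketch of a queue-based politeness scheduler; not Nutch code. */
public class PoliteFetchQueue {

    /** A host together with the earliest time (ms) it may be contacted again. */
    private static final class HostSlot implements Comparable<HostSlot> {
        final String host;
        final long nextFetchTime;

        HostSlot(String host, long nextFetchTime) {
            this.host = host;
            this.nextFetchTime = nextFetchTime;
        }

        @Override
        public int compareTo(HostSlot other) {
            return Long.compare(nextFetchTime, other.nextFetchTime);
        }
    }

    // Hosts with pending URLs, ordered by when they become eligible.
    private final PriorityQueue<HostSlot> ready = new PriorityQueue<>();
    // Pending URLs per host; a host is present only while it has URLs queued.
    private final Map<String, Deque<String>> urlsByHost = new HashMap<>();
    // Earliest allowed contact time per host, kept even when its queue is empty.
    private final Map<String, Long> nextAllowed = new HashMap<>();
    // Assumed fixed delay between two requests to the same host.
    private final long crawlDelayMs;

    public PoliteFetchQueue(long crawlDelayMs) {
        this.crawlDelayMs = crawlDelayMs;
    }

    /** Queues a URL for the given host; 'now' is the current time in ms. */
    public void add(String host, String url, long now) {
        Deque<String> pending = urlsByHost.get(host);
        if (pending == null) {
            pending = new ArrayDeque<>();
            urlsByHost.put(host, pending);
            Long allowed = nextAllowed.get(host);
            long eligibleAt = (allowed == null) ? now : Math.max(now, allowed);
            ready.add(new HostSlot(host, eligibleAt));
        }
        pending.add(url);
    }

    /** Returns the next URL that may be fetched at time 'now', or null if none. */
    public String poll(long now) {
        HostSlot head = ready.peek();
        if (head == null || head.nextFetchTime > now) {
            return null; // nothing queued, or every host is still in its delay
        }
        ready.poll();
        Deque<String> pending = urlsByHost.get(head.host);
        String url = pending.poll();
        long next = now + crawlDelayMs;
        nextAllowed.put(head.host, next);
        if (pending.isEmpty()) {
            urlsByHost.remove(head.host);
        } else {
            ready.add(new HostSlot(head.host, next)); // re-queue with the new delay
        }
        return url;
    }
}
```

Passing the current time in as a parameter, rather than reading the system clock inside the class, is a choice made here only to keep the sketch easy to exercise; a real fetcher would also need per-host delays from robots.txt and thread safety, which this sketch leaves out.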
--
This message is automatically generated by JIRA.
-
If you think it was sent incorrectly contact one of the
administrators: http://issues.apache.org/jira/secure/Administrators.jspa
-
For more information on JIRA, see: http://www.atlassian.com/software/jira