
Thread: Problem with number of urls fetched in nutch-hadoop-dfs environment




user name
2007-10-23 15:08:53
I've set up a nutch-hadoop-dfs environment on a single system. It uses
only one machine, which acts as both namenode and datanode. I ran the
command "bin/nutch crawl urls -depth 2 -dir crawl_test" and then
collected statistics on the crawldb folder with
"bin/nutch readdb crawl_test/crawldb -stats", which showed:

TOTAL urls:     375
retry 0:        375
min score:      0.0
avg score:      0.0070
max score:      1.019
status 1 (db_unfetched):        334
status 2 (db_fetched):  38
status 5 (db_redir_perm):       3
CrawlDb statistics: done


I then set up the nutch-hadoop-dfs environment with 5 systems,
including the namenode. After running the same crawl, the statistics
are as follows:

TOTAL urls:     141
retry 0:        140
retry 1:        1
min score:      0.0
avg score:      0.015
max score:      1.003
status 1 (db_unfetched):        131
status 2 (db_fetched):  8
status 4 (db_redir_temp):       1
status 5 (db_redir_perm):       1
CrawlDb statistics: done
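
To put the two runs side by side, here is a minimal sketch that parses the
two "readdb -stats" outputs above and computes the share of known URLs that
were fetched in each run. The helper names (parse_stats, fetched_fraction)
are made up for illustration and are not part of Nutch; the script only
assumes the "label: value" layout shown in the output above.

```python
import re

# The two outputs pasted above, copied verbatim.
SINGLE_NODE = """\
TOTAL urls:     375
retry 0:        375
min score:      0.0
avg score:      0.0070
max score:      1.019
status 1 (db_unfetched):        334
status 2 (db_fetched):  38
status 5 (db_redir_perm):       3
CrawlDb statistics: done
"""

FIVE_NODES = """\
TOTAL urls:     141
retry 0:        140
retry 1:        1
min score:      0.0
avg score:      0.015
max score:      1.003
status 1 (db_unfetched):        131
status 2 (db_fetched):  8
status 4 (db_redir_temp):       1
status 5 (db_redir_perm):       1
CrawlDb statistics: done
"""


def parse_stats(text):
    """Turn 'label:   number' lines into a dict; non-numeric lines are skipped."""
    stats = {}
    for line in text.splitlines():
        m = re.match(r"^(.*?):\s+([0-9.]+)\s*$", line.strip())
        if m:
            stats[m.group(1).strip()] = float(m.group(2))
    return stats


def fetched_fraction(stats):
    """Share of all URLs in the crawldb that have status db_fetched."""
    return stats["status 2 (db_fetched)"] / stats["TOTAL urls"]


if __name__ == "__main__":
    for name, raw in (("1 node", SINGLE_NODE), ("5 nodes", FIVE_NODES)):
        s = parse_stats(raw)
        print(f"{name}: {int(s['status 2 (db_fetched)'])} fetched of "
              f"{int(s['TOTAL urls'])} total ({fetched_fraction(s):.1%})")
```

So the single-node run fetched 38 of 375 URLs (about 10.1%), while the
5-node run fetched 8 of 141 (about 5.7%).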

Could someone please explain why the number of URLs fetched differs
when the number of datanodes is increased from 1 to 5?
Thanks in advance.