I set up a Nutch/Hadoop DFS environment on a single machine, which acts as
both the namenode and the datanode. I then ran the crawl command

    bin/nutch crawl urls -depth 2 -dir crawl_test

and collected statistics on the crawldb folder with

    bin/nutch readdb crawl_test/crawldb -stats

which showed:
TOTAL urls: 375
retry 0: 375
min score: 0.0
avg score: 0.0070
max score: 1.019
status 1 (db_unfetched): 334
status 2 (db_fetched): 38
status 5 (db_redir_perm): 3
CrawlDb statistics: done
I then set up the Nutch/Hadoop DFS environment on 5 machines, including the
namenode. After running the same crawl, the statistics were as follows:
TOTAL urls: 141
retry 0: 140
retry 1: 1
min score: 0.0
avg score: 0.015
max score: 1.003
status 1 (db_unfetched): 131
status 2 (db_fetched): 8
status 4 (db_redir_temp): 1
status 5 (db_redir_perm): 1
CrawlDb statistics: done
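To make the two runs easier to compare, here is a minimal Python sketch that parses the two outputs above and prints them side by side. The names parse_stats, single_node and five_nodes are my own and are not part of Nutch. The script only rearranges the numbers already shown; it does not explain the difference.

```python
def parse_stats(text):
    """Parse 'key: value' lines of readdb -stats output into a dict of floats."""
    stats = {}
    for line in text.splitlines():
        key, sep, value = line.rpartition(":")
        if not sep:
            continue  # line has no "key: value" shape
        try:
            stats[key.strip()] = float(value)
        except ValueError:
            continue  # non-numeric line such as "CrawlDb statistics: done"
    return stats


# Output of the single-machine run, copied from above.
single_node = parse_stats("""\
TOTAL urls: 375
retry 0: 375
min score: 0.0
avg score: 0.0070
max score: 1.019
status 1 (db_unfetched): 334
status 2 (db_fetched): 38
status 5 (db_redir_perm): 3
CrawlDb statistics: done
""")

# Output of the 5-machine run, copied from above.
five_nodes = parse_stats("""\
TOTAL urls: 141
retry 0: 140
retry 1: 1
min score: 0.0
avg score: 0.015
max score: 1.003
status 1 (db_unfetched): 131
status 2 (db_fetched): 8
status 4 (db_redir_temp): 1
status 5 (db_redir_perm): 1
CrawlDb statistics: done
""")

# Print every key seen in either run; a key missing from one run counts as 0.
print(f"{'key':28s} {'1 node':>10s} {'5 nodes':>10s} {'diff':>10s}")
for key in sorted(set(single_node) | set(five_nodes)):
    a = single_node.get(key, 0.0)
    b = five_nodes.get(key, 0.0)
    print(f"{key:28s} {a:10.3f} {b:10.3f} {b - a:+10.3f}")
```

The "diff" column is simply the 5-machine value minus the single-machine value for each line of the stats output.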
Could someone please explain why the number of URLs fetched differs when the
number of datanodes is increased from 1 to 5?

Thanks in advance.