
Thread: Nutch site crawling




Nutch site crawling
user name
2006-12-07 10:47:20
Hi,

 

Is it possible to have Nutch crawl only a limited set of documents at a time?

 

I have set up Nutch with the following options:

topN 20
depth 2

 

Therefore, I wanted Nutch to crawl my directory only as deep as 2 links from the root directory. The root directory itself contains more than 20 files, but my understanding of topN is that the crawler fetches 20 documents and then indexes them. At the next crawl, it chooses another 20 files from the directory, then fetches and indexes them.
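
To make the behaviour I expected concrete, here is a rough sketch in plain Python. It is not Nutch code and says nothing about how Nutch actually works. The function next_batch, the file names and the 50-file directory are made up for illustration only.

```python
def next_batch(all_files, already_indexed, top_n=20):
    """Return up to top_n files that have not been indexed yet."""
    pending = [f for f in all_files if f not in already_indexed]
    return pending[:top_n]


# Hypothetical directory listing with 50 files.
files = [f"doc{i:03d}.txt" for i in range(50)]
indexed = set()
rounds = []

while True:
    batch = next_batch(files, indexed)
    if not batch:
        break
    # Stand-in for "fetch and index this batch".
    indexed.update(batch)
    rounds.append(len(batch))

print(rounds)  # [20, 20, 10]
```

Each round picks up only files that earlier rounds have not handled, so no file is fetched twice.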

 

My problem is that when Nutch crawls, it keeps fetching the same files over and over again. That is a severe issue in my case because I have to run Nutch on a directory with more than 100 GB of data. It is more efficient to crawl and index a small set of files at a time than to try to fetch all the data before indexing. Can you suggest a workaround for this, or just let me know what I am doing wrong?

 

Thanks in advance.

 

Regards,

 

Armel
