Hi,
I have installed Nutch 0.9 and crawled a news website, and I got search hits for it.
After that I recrawled the same site, but this time I did not get hits for the new pages.
However, I did see the updated URLs in the log file.
Example: I crawled on the 17th and recrawled on the 23rd. I saw the URLs from the 23rd in the log file, like this:
"indexer.Indexer - Indexing [http://......./2007.07.23.html] with analyzer org.apache.nutch.analysis.NutchDocumentAnalyzer 3c9217 (null)"
Does "org.apache.nutch.analysis.NutchDocumentAnalyzer 3c9217 (null)" indicate an error?
Please help me understand how to recrawl a website.
I used the following script for the recrawl:
bin/nutch generate $1/crawldb $1/segments -adddays 5
segment=`ls -d $1/segments/* | tail -1 | grep "[a-zA-Z0-9/]*"`
bin/nutch fetch $segment
bin/nutch updatedb $1/crawldb $segment
bin/nutch generate $1/crawldb $1/segments -adddays 5
s2=`ls -d $1/segments/2* | tail -1`
bin/nutch fetch $s2
bin/nutch updatedb $1/crawldb $s2
bin/nutch generate $1/crawldb $1/segments -adddays 5
s3=`ls -d $1/segments/2* | tail -1`
bin/nutch fetch $s3
bin/nutch updatedb $1/crawldb $s3
rm -r $1/indexes
bin/nutch invertlinks $1/linkdb $1/segments/*
bin/nutch index $1/indexes $1/crawldb $1/linkdb $1/segments/*
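For readability, the three generate/fetch/updatedb rounds can also be written as a loop. This is only a sketch of the same steps, not a fix for the missing hits: the names latest_segment, recrawl and NUTCH are my own and not part of Nutch, and it assumes segment directory names start with "2" (timestamps), as in the script above. The one deliberate difference is rm -rf, so the first run does not stop when no indexes directory exists yet.

```shell
#!/bin/sh
# Sketch only: latest_segment, recrawl and NUTCH are hypothetical names,
# not part of Nutch. The nutch subcommands are the ones used in the script above.

NUTCH=${NUTCH:-bin/nutch}

# Print the newest segment directory. Segment names are timestamps,
# so the last entry in ls order is the most recent one.
latest_segment() {
  ls -d "$1"/segments/2* | tail -1
}

# Run three generate/fetch/updatedb rounds, then rebuild the linkdb and indexes.
recrawl() {
  crawl=$1
  for round in 1 2 3; do
    "$NUTCH" generate "$crawl/crawldb" "$crawl/segments" -adddays 5 || return 1
    segment=$(latest_segment "$crawl")
    "$NUTCH" fetch "$segment" || return 1
    "$NUTCH" updatedb "$crawl/crawldb" "$segment" || return 1
  done
  # -f so that a missing indexes directory is not treated as an error
  rm -rf "$crawl/indexes"
  "$NUTCH" invertlinks "$crawl/linkdb" "$crawl"/segments/* || return 1
  "$NUTCH" index "$crawl/indexes" "$crawl/crawldb" "$crawl/linkdb" "$crawl"/segments/*
}
```

With these functions saved in a file and loaded into the shell, calling recrawl with the crawl directory as its only argument would run the same sequence as the script above.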
Thanks in advance.
Regards,
Anuradha.