Thread: Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help
2007-09-19 14:03:25

Hmm, yeah, in my case I merge the indexes into a temp dir, then I delete
the existing index dir and move mine in there:

rm -rf crawl/MERGEDindexes
$NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes

# in nutch-site, hadoop.tmp.dir points to crawl/tmp
rm -rf crawl/tmp/*

# we have to stop tomcat because sometimes it is still accessing the
# index file
sudo /etc/init.d/tomcat5.5 stop

# replace indexes with indexes_merged
rm -rf crawl/OLDindexes
mv --verbose crawl/index crawl/OLDindexes
mv --verbose crawl/MERGEDindexes crawl/index

echo "----- Restarting Tomcat (Step 10 of $steps) -----"
sudo /etc/init.d/tomcat5.5 start

I also found I have to stop the Tomcat service, otherwise I can't delete
the index files. You may not need to do this if you aren't using
Tomcat.

-Jeff
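
The swap above could be wrapped in a small function so that the live index is only replaced when the merge has succeeded. This is just a sketch under assumptions: the function name swap_merged_index and the NUTCH variable are made up for illustration, and the Tomcat stop/start steps are left out.

```shell
#!/bin/sh
# Sketch only: merge the new indexes into a temp dir, then swap it in.
# NUTCH and swap_merged_index are hypothetical names, not part of Nutch.
NUTCH="${NUTCH:-$NUTCH_HOME/bin/nutch}"

swap_merged_index() {
    crawl_dir="$1"

    # Start from a clean temp output dir for the merge.
    rm -rf "$crawl_dir/MERGEDindexes"
    if ! $NUTCH merge "$crawl_dir/MERGEDindexes" "$crawl_dir/NEWindexes"; then
        echo "merge failed, leaving $crawl_dir/index untouched" >&2
        return 1
    fi

    # Keep one previous generation around as a backup.
    rm -rf "$crawl_dir/OLDindexes"
    if [ -d "$crawl_dir/index" ]; then
        mv "$crawl_dir/index" "$crawl_dir/OLDindexes"
    fi
    mv "$crawl_dir/MERGEDindexes" "$crawl_dir/index"
}
```

It would be called as swap_merged_index crawl, between stopping and starting Tomcat if the web application holds the index files open.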

>>> alexisvotta gmail.com 9/19/2007 10:34 AM >>>
The recrawl script for 0.9 I found at
http://wiki.apache.org/nutch/IntranetRecrawl is not working. It works
successfully the first time. The second time, it fails with this error:

merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory crawl/index already exists!
    at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:74)
    at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:148)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:111)

I am trying this with the latest version available in trunk. Please
help me to rectify this.

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help
2007-09-19 14:20:25

Hi Jeff,

Your block of code comes from the Nutch 0.9 crawl script, which is a
different article: http://wiki.apache.org/nutch/Crawl
I am facing the problem with the Nutch 0.9 recrawl script, which I found
in this article: http://wiki.apache.org/nutch/IntranetRecrawl

Even if I follow your approach, I lose the index of the previous crawl.
You are merging only the new indexes in this line:

$NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes

I want to merge the new indexes with the old index, which is what the
Nutch 0.9 recrawl script tries to do, but it fails with this error:

merging indexes to: crawl/index
IndexMerger: org.apache.hadoop.mapred.FileAlreadyExistsException: Output directory crawl/index already exists!
    at org.apache.nutch.indexer.IndexMerger.merge(IndexMerger.java:74)
    at org.apache.nutch.indexer.IndexMerger.run(IndexMerger.java:148)
    at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
    at org.apache.nutch.indexer.IndexMerger.main(IndexMerger.java:111)

Has anyone used the re-crawl script successfully with trunk? Does it
work for Nutch 0.9?

On 9/20/07, Jeff Van Boxtel <jboxtel grpmack.com> wrote:
> rm -rf crawl/MERGEDindexes
> $NUTCH_HOME/bin/nutch merge crawl/MERGEDindexes crawl/NEWindexes

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help
2007-09-20 01:54:28

I've been having problems with the merge portion of the script too.
My solution was to check the exit status of the merge ($?), and
if it failed, try again, or wait until next time.

nutch_bin/nutch mergesegs "$merged_segment" -dir "$segments"
if [ $? -ne 0 ]
then
    echo "merging segments failed, let's abort now in case of large failure"
    echo "this was the main sticking point for some reason"
    echo "removing the merged segment in case of fire"
    rm -r "$merged_segment"
    # exit non-zero so that a calling script can see the failure
    exit 1
else
    # replace the old segments with the merged one
    rm -r "$segments"/*
    mv "$merged_segment"/* "$segments"/
    rm -r "$merged_segment"
fi
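
The "try again" part is not shown in the script above. A minimal sketch of what a retry wrapper might look like, assuming a hypothetical helper called retry; the attempt count and delay are arbitrary parameters, not values from this thread:

```shell
#!/bin/sh
# Sketch only: run a command up to N times, pausing between attempts.
# retry is a hypothetical helper, not part of Nutch.
retry() {
    max_attempts="$1"
    delay_seconds="$2"
    shift 2

    attempt=1
    while [ "$attempt" -le "$max_attempts" ]; do
        if "$@"; then
            return 0
        fi
        echo "attempt $attempt of $max_attempts failed: $*" >&2
        attempt=$((attempt + 1))
        # Only sleep if another attempt will follow.
        if [ "$attempt" -le "$max_attempts" ]; then
            sleep "$delay_seconds"
        fi
    done
    return 1
}
```

Usage would be along the lines of: retry 3 60 nutch_bin/nutch mergesegs "$merged_segment" -dir "$segments". A failed attempt may leave a partial merged segment behind, so removing it between attempts is probably wise.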

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help
2007-09-20 05:40:25

Hi,

I had the same problem using the re-crawl scripts from the wiki. They all
work fine with Nutch versions up to and including 0.9, but with
nutch-1.0-dev (from trunk) they break at the merge of indexes. The reason
is that merge in nutch-0.9 (from the re-crawl scripts):

bin/nutch merge crawl/indexes crawl/NEWindexes

merged the old indexes from crawl/indexes and the new indexes
from crawl/NEWindexes and stored the result in crawl/indexes. But with
nutch-1.0-dev (from trunk), merge requires a new output folder that does
not exist yet.

The solution that works (I have tried it) is to do the following:

bin/nutch merge crawl/index crawl/indexes crawl/NEWindexes

where crawl/index is the new (output) folder, crawl/indexes holds the old
indexes and crawl/NEWindexes holds the new indexes. It is important to
know that you can do this with as many indexes as you want to merge (as
many re-crawls); you only have to do:

bin/nutch merge crawl/index crawl/indexes1 crawl/indexes2 ...

but crawl/index must not exist (delete it or back it up).

The Nutch search web application will use the merged index from
crawl/index; this is from my web application log:

2007-09-09 20:30:58,949 INFO searcher.NutchBean - creating new bean
2007-09-09 20:30:59,128 INFO searcher.NutchBean - opening merged index in /home/nutch/test/trunk/crawl/index

Hope this will help,
Tomislav
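
The "output must not exist" rule could be handled by a small wrapper that backs up any previous output before merging. This is a sketch, not a tested recrawl script: merge_indexes, the NUTCH variable and the ".bak" suffix are hypothetical names chosen for illustration.

```shell
#!/bin/sh
# Sketch only: merge any number of index dirs into a fresh output dir.
# NUTCH and merge_indexes are hypothetical names, not part of Nutch.
NUTCH="${NUTCH:-bin/nutch}"

merge_indexes() {
    output_dir="$1"
    shift

    # Fail early, before touching anything, if an input dir is missing.
    for index_dir in "$@"; do
        if [ ! -d "$index_dir" ]; then
            echo "missing index dir: $index_dir" >&2
            return 1
        fi
    done

    # Back up a previous output, since the merge needs a dir that does not exist.
    if [ -e "$output_dir" ]; then
        rm -rf "$output_dir.bak"
        mv "$output_dir" "$output_dir.bak"
    fi

    $NUTCH merge "$output_dir" "$@"
}
```

For the case in this message it would be called as: merge_indexes crawl/index crawl/indexes crawl/NEWindexes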
On Thu, 2007-09-20 at 14:54 +0800, Lyndon Maydwell wrote:
> /nutch mergesegs $merged_segment -dir $segments
> if [ $? -ne 0 ]

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help
2007-09-20 06:33:26

Hi Tomislav and Nutch users,

I could not solve the problem with your instructions.

I crawled two times. The re-crawl generated crawl/NEWindexes;
crawl/indexes was generated in the first crawl.

I merged ==> bin/nutch merge crawl/index crawl/indexes/ crawl/NEWindexes/

Now search.jsp is showing an error.

type Exception report

message

description The server encountered an internal error () that prevented
it from fulfilling this request.

exception

org.apache.jasper.JasperException: java.lang.RuntimeException: java.lang.NullPointerException
    org.apache.jasper.servlet.JspServletWrapper.handleJspException(JspServletWrapper.java:532)
    org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:426)
    org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
    org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

root cause

java.lang.RuntimeException: java.lang.NullPointerException
    org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)
    org.apache.nutch.searcher.NutchBean.getSummary(NutchBean.java:342)
    org.apache.jsp.search_jsp._jspService(search_jsp.java:247)
    org.apache.jasper.runtime.HttpJspBase.service(HttpJspBase.java:70)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:803)
    org.apache.jasper.servlet.JspServletWrapper.service(JspServletWrapper.java:384)
    org.apache.jasper.servlet.JspServlet.serviceJspFile(JspServlet.java:320)
    org.apache.jasper.servlet.JspServlet.service(JspServlet.java:266)
    javax.servlet.http.HttpServlet.service(HttpServlet.java:803)

root cause

java.lang.NullPointerException
    org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:159)
    org.apache.nutch.searcher.FetchedSegments$SummaryThread.run(FetchedSegments.java:177)

Is there any Crawl guru who can help?

On 9/20/07, Tomislav Poljak <tpoljak gmail.com> wrote:
> Solution that works (I have tried it) is to do following:
>
> bin/nutch merge crawl/index crawl/indexes crawl/NEWindexes

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help
2007-09-20 08:27:13

Hi Alexis,

I think that your problem is not so much in the index (or in merging
indexes) but in the segments, because if you look at the exception you
will see the root cause:

java.lang.RuntimeException: java.lang.NullPointerException
    org.apache.nutch.searcher.FetchedSegments.getSummary(FetchedSegments.java:204)

I guess you have the segments from the old crawl in one place (dir) and
the segments from the re-crawl in another. All segments should be in the
same place (I think so), because the web application says (from the web
app startup log):

2007-09-09 20:30:59,461 INFO searcher.NutchBean - opening segments in /home/nutch/test/trunk/crawl/segments

Tomislav
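
One way to check the segments directory that the web application opens is to look for segments that lack their parse output. The sketch below is only a guess at a useful check: check_segments is a hypothetical helper, and the assumption that summaries need the parse_text and parse_data subdirectories of each segment is not confirmed anywhere in this thread.

```shell
#!/bin/sh
# Sketch only: report segment dirs that lack the parts summaries are read from.
# check_segments is a hypothetical helper; the required subdirectory names
# (parse_text, parse_data) are an assumption about the segment layout.
check_segments() {
    segments_dir="$1"
    problems=0

    for segment in "$segments_dir"/*; do
        [ -d "$segment" ] || continue
        for part in parse_text parse_data; do
            if [ ! -d "$segment/$part" ]; then
                echo "incomplete segment: $segment (no $part)"
                problems=$((problems + 1))
            fi
        done
    done

    # Succeed only if every segment looked complete.
    [ "$problems" -eq 0 ]
}
```

Running check_segments crawl/segments before restarting the web application would at least show whether a half-written or unparsed segment ended up in that directory.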
On Thu, 2007-09-20 at 17:03 +0530, Alexis Votta wrote:
> s showing error.
> type Exception report

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help
2007-09-20 08:29:58

I have merged the old as well as the new segments into the segments dir.
I still get the same error.

On 9/20/07, Tomislav Poljak <tpoljak gmail.com> wrote:
> I guess you have segments from old crawl in one place (dir) and segments
> from re-crawl in other.

Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help
2007-09-20 08:53:59

We can do two things to solve this problem.

SOLUTION 'A'

1. Once the 'depth' loop is complete, merge the segments in
'crawl/segments/'. ('crawl/segments/' will have one merged segment of
the past plus all the segments generated in the depth loop, one for
each iteration of the loop.) They are now merged as a single segment
in MERGEDsegments with the following command.

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

2. Now replace 'crawl/segments' with 'crawl/MERGEDsegments'.

rm -rf crawl/segments
mv $MVARGS crawl/MERGEDsegments crawl/segments

3. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb $segment

4. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

5. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

6. $NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

7. Delete crawl/NEWindexes. We are done!

I think this is very similar to Jeff's solution. Alexis argued that:

> I am losing index of previous crawl.

The thing to notice here is that we can safely delete crawl/NEWindexes
(or OLDindexes in Jeff's case) because, in step 4, the indexes are
generated from a merged segment into which the old segments have also
been merged. So, we are not losing anything.
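
Solution 'A' could be put together as one function, roughly as below. This is a sketch under several assumptions: NUTCH and recrawl_solution_a are hypothetical names; step 3 as written uses $segment, which Solution 'A' never sets, so the sketch assumes the merged segment under crawl/segments was meant; and because merge on trunk fails when crawl/index already exists, the old index is first moved aside to a made-up index.old directory.

```shell
#!/bin/sh
# Sketch only: Solution 'A' as one function. NUTCH and recrawl_solution_a are
# hypothetical names; crawl_dir is assumed to hold crawldb, linkdb and segments.
NUTCH="${NUTCH:-$NUTCH_HOME/bin/nutch}"

recrawl_solution_a() {
    crawl_dir="$1"

    # Steps 1-2: merge every segment into one and swap it in.
    rm -rf "$crawl_dir/MERGEDsegments"
    $NUTCH mergesegs "$crawl_dir/MERGEDsegments" "$crawl_dir"/segments/* || return 1
    rm -rf "$crawl_dir/segments"
    mv "$crawl_dir/MERGEDsegments" "$crawl_dir/segments"

    # Step 3: assumption, invert links over the merged segment.
    $NUTCH invertlinks "$crawl_dir/linkdb" "$crawl_dir"/segments/* || return 1

    # Steps 4-5: index everything, then remove duplicates.
    $NUTCH index "$crawl_dir/NEWindexes" "$crawl_dir/crawldb" \
        "$crawl_dir/linkdb" "$crawl_dir"/segments/* || return 1
    $NUTCH dedup "$crawl_dir/NEWindexes" || return 1

    # Step 6: move the old index aside first (index.old is a made-up name),
    # because the merge needs an output dir that does not exist.
    rm -rf "$crawl_dir/index.old"
    if [ -d "$crawl_dir/index" ]; then
        mv "$crawl_dir/index" "$crawl_dir/index.old"
    fi
    $NUTCH merge "$crawl_dir/index" "$crawl_dir/NEWindexes" || return 1

    # Step 7: the new indexes are no longer needed.
    rm -rf "$crawl_dir/NEWindexes"
}
```

It would be called once after the depth loop, for example as recrawl_solution_a crawl.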

SOLUTION 'B'

1. Generate the new segments in another directory, say, NEWsegments.

$NUTCH_HOME/bin/nutch generate crawl/crawldb crawl/NEWsegments $topN -adddays $adddays

2. After the depth loop is over, merge the new segments into
crawl/segments. 'crawl/segments' may have multiple merged segments
(one for each past crawl) if this is not the first crawl.

bin/nutch mergesegs crawl/segments crawl/NEWsegments/*

So, now 'crawl/segments' contains multiple merged segments (one for
each crawl in the past) and another merged segment from the current
re-crawl. Now we don't need 'crawl/NEWsegments', so we can delete it.

3. Store the latest merged segment in a variable.

segment=`ls -d crawl/segments/* | tail -1`

(From now onwards, we won't do the remaining operations for all the
segments like we did in solution 'A'. We will do the remaining
operations for the merged segment we have just created.)

4. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb $segment

5. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb $segment

6. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

(So with steps 3-6, we generated indexes only for the new merged segment
generated with this crawl.)

7. Let's assume the past indexes were saved as 'crawl/indexes1',
'crawl/indexes2', etc. Now, all of them can be merged as follows.

$NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes/ crawl/indexes1/ crawl/indexes2
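
The part of Solution 'B' that runs after the depth loop (steps 2-7) might look like this as a function. It is a sketch, not a tested script: NUTCH and recrawl_solution_b are hypothetical names, the saved index dirs of past crawls are passed in as extra arguments, and removing a previous crawl/index before the merge is an assumption based on the trunk behaviour described earlier in the thread.

```shell
#!/bin/sh
# Sketch only: the post-loop part of Solution 'B' (steps 2-7).
# NUTCH and recrawl_solution_b are hypothetical names; any extra arguments
# are treated as the saved index dirs of past crawls.
NUTCH="${NUTCH:-$NUTCH_HOME/bin/nutch}"

recrawl_solution_b() {
    crawl_dir="$1"
    shift

    # Step 2: merge this crawl's segments into crawl/segments, then drop them.
    $NUTCH mergesegs "$crawl_dir/segments" "$crawl_dir"/NEWsegments/* || return 1
    rm -rf "$crawl_dir/NEWsegments"

    # Step 3: segment names are timestamps, so the last one listed is the newest.
    segment=$(ls -d "$crawl_dir"/segments/* | tail -1)

    # Steps 4-6: link inversion, indexing and dedup for the new segment only.
    $NUTCH invertlinks "$crawl_dir/linkdb" "$segment" || return 1
    $NUTCH index "$crawl_dir/NEWindexes" "$crawl_dir/crawldb" \
        "$crawl_dir/linkdb" "$segment" || return 1
    $NUTCH dedup "$crawl_dir/NEWindexes" || return 1

    # Step 7: assumption, clear the previous output so the merge can create it,
    # then merge the new indexes with the saved ones.
    rm -rf "$crawl_dir/index"
    $NUTCH merge "$crawl_dir/index" "$crawl_dir/NEWindexes" "$@" || return 1
}
```

An example call would be recrawl_solution_b crawl crawl/indexes1 crawl/indexes2. The message does not say how crawl/NEWindexes gets saved for the next run, so the sketch leaves it in place.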

I think this is what Tomislav must have done, whereas what Alexis must
have done is a mix of solution 'A' and solution 'B'.

For example, if you generate 'crawl/NEWindexes' from all the merged
segments (new as well as old) and merge this NEWindexes (which is not
strictly new) with the old indexes, you will probably get that error.

To summarize, in solution 'A' we are generating 'crawl/NEWindexes'
from all the merged segments (new as well as old), so it is not
strictly new. While merging, we merge only this index because it has
everything.

In solution 'B' we are generating 'crawl/NEWindexes' from the most
recent merged segment, so it is strictly new. So, while merging, we
merge NEWindexes with the old indexes into 'crawl/index'.

Regards,
Susam Pal
http://susam.in/

On 9/20/07, Alexis Votta <alexisvotta gmail.com> wrote:
> I merged ==> bin/nutch merge crawl/index crawl/indexes/ crawl/NEWindexes/
>
> Now search.jsp is showing error.

RE: Nutch recrawl script for 0.9 doesn't work with trunk. Help
2007-10-18 10:04:59

Based on the org.apache.nutch.crawl.Crawl class, I've created a ReCrawl
class that can be run similarly to how a Nutch intranet crawl is run (or
how the recrawl scripts work). I've written the ReCrawl class to function
in the manner of "SOLUTION 'A'" below. However, mine doesn't reload
the web application for you, since that wasn't something I needed to
include for my uses. The usage is something like:

bin/nutch recrawl -dir existingCrawlDir -depth i -add addDays -topN topN

Is there a reason this functionality wasn't previously built into
Nutch? Once I test this a bit more, would the developers like a patch
with my additions?

Jeff

-----Original Message-----
From: Susam Pal [mailto:susam.pal gmail.com]
Sent: Thursday, September 20, 2007 9:54 AM
To: nutch-user lucene.apache.org
Subject: Re: Nutch recrawl script for 0.9 doesn't work with trunk. Help

We can do two things to solve this problem.

SOLUTION 'A'

1. Once the 'depth' loop is complete, merge the segments in
'crawl/segments/'. ('crawl/segments/' will have one merged segment of
the past plus all the segments generated in the depth loop, one for
each iteration of the loop.) They are now merged as a single segment
in MERGEDsegments with the following command.

$NUTCH_HOME/bin/nutch mergesegs crawl/MERGEDsegments crawl/segments/*

2. Now replace 'crawl/segments' with 'crawl/MERGEDsegments'.

rm -rf crawl/segments
mv $MVARGS crawl/MERGEDsegments crawl/segments

3. $NUTCH_HOME/bin/nutch invertlinks crawl/linkdb $segment

4. $NUTCH_HOME/bin/nutch index crawl/NEWindexes crawl/crawldb crawl/linkdb crawl/segments/*

5. $NUTCH_HOME/bin/nutch dedup crawl/NEWindexes

6. $NUTCH_HOME/bin/nutch merge crawl/index crawl/NEWindexes

7. Delete crawl/NEWindexes. We are done!