|
Thread: Incremental crawl again ... (Please explain)
|
|
|
2006-05-22 15:45:57 |
I am currently using the latest nightly nutch-0.8-dev build, and I am really confused about how to proceed after I have done two different "whole web" incremental crawls.
The tutorial is not clear to me on how to merge the results of the two crawls so that I can search them.
Could someone please give me a hint on what the right procedure is?
Here is what I am doing:
1. create an initial urls file /tmp/dmoz/urls.txt
2. hadoop dfs -put /tmp/urls/ url
3. nutch inject test/crawldb dmoz
4. nutch generate test/crawldb test/segments
5. nutch fetch test/segments/20060522144050
6. nutch updatedb test/crawldb test/segments/20060522144050
7. nutch invertlinks linkdb test/segments/20060522144050
8. nutch index test/indexes test/crawldb linkdb test/segments/20060522144050
...and now I am able to search.
Now I run
9. nutch generate test/crawldb test/segments -topN 1000
and I end up with a new segment: test/segments/20060522151957
10. nutch fetch test/segments/20060522151957
11. nutch updatedb test/crawldb test/segments/20060522151957
From this point on I cannot make much progress.
A) I have tried to merge the two segments into a new one, with the idea of rerunning invertlinks and index on it, but:
nutch mergesegs test/segments -dir test/segments
Whatever I specify as outputdir or outputsegment, I get errors.
B) I have also tried to run invertlinks on all of test/segments, with the goal of then running the nutch index command to produce a second indexes directory, say test/indexes1, and finally merging the indexes into index2:
nutch invertlinks test/linkdb -dir test/segments
This created a new linkdb directory *NOT* under test as specified, but as <user>/linkdb-1108390519.
nutch index test/indexes1 test/crawldb linkdb test/segments/20060522144050
nutch merge index2 test/indexes test/indexes1
Now I am not sure what to do. If I remove test/indexes and rename test/index2 to test/indexes, I am no longer able to search.
-Corrado
|
|
|
2006-05-22 15:58:53 |
Please follow the link below:
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
I have been able to follow the thread as explained there and merge multiple crawls. It works like a champ.
Thanks
Sudhi
Sudhi Seshachala
http://sudhilogs.blogspot.com/
|
|
2006-05-25 01:31:45 |
I looked at the referenced message at
http://www.mail-archive.com/nutch-user@lucene.apache.org/msg03990.html
but I am still having problems.
I am running the latest checkout from Subversion.
These are the commands I've run:
bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
bin/nutch generate crawl/crawldb crawl/segments -topN 500
lastsegment=`ls -d crawl/segments/2* | tail -1`
bin/nutch fetch $lastsegment
bin/nutch updatedb crawl/crawldb $lastsegment
bin/nutch index crawl/indexes crawl/crawldb crawl/linkdb $lastsegment
This last command fails with a java.io.IOException saying: "Output directory /home/nutch/nutch/crawl/indexes already exists"
So I'm confused, because it seems like I did exactly what was described in the referenced email, but it didn't work for me. Can someone help me figure out what I'm doing wrong or what I need to do instead?
Thanks,
Jacob
--
http://JacobBrunson.com
|
|
|
2006-05-25 11:56:16 |
On 5/25/06, Jacob Brunson <jacob.brunson gmail.com> wrote:
> These are the commands which I've run:
> bin/nutch crawl myurls/ -dir crawl -threads 4 -depth 3 -topN 10000
bin/nutch crawl is a one-shot command that generates, fetches, and indexes in a single run to build a Nutch index. I would NOT recommend using this one-shot command.
Please take the long route, which will give you more control over your tasks. The long route means: inject, generate, fetch, updatedb, index, dedup, merge. Please see the "Whole-web Crawling" section of the tutorial:
http://lucene.apache.org/nutch/tutorial8.html#Whole-web+Crawling
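As a rough sketch of the long route listed above, one round might look like the script below. It assumes a crawl/ directory layout, a urls seed directory, and a local filesystem; the NUTCH, CRAWL and SEEDS variables and the long_route function are names made up for this example. By default it is a dry run that only prints each command.

```shell
#!/bin/sh
# Dry run by default: every step is printed instead of executed.
# Set NUTCH=bin/nutch to run the real commands.
NUTCH=${NUTCH:-"echo bin/nutch"}
CRAWL=${CRAWL:-crawl}    # assumed crawl directory
SEEDS=${SEEDS:-urls}     # assumed seed URL directory

long_route() {
    $NUTCH inject "$CRAWL/crawldb" "$SEEDS" || return 1
    $NUTCH generate "$CRAWL/crawldb" "$CRAWL/segments" || return 1
    # Newest segment on a local filesystem; a dry run has none yet,
    # so fall back to a visible placeholder.
    segment=$(ls -d "$CRAWL"/segments/2* 2>/dev/null | tail -1)
    segment=${segment:-"$CRAWL/segments/SEGMENT"}
    $NUTCH fetch "$segment" || return 1
    $NUTCH updatedb "$CRAWL/crawldb" "$segment" || return 1
    # invertlinks is not in the list above, but index takes a linkdb argument.
    $NUTCH invertlinks "$CRAWL/linkdb" "$segment" || return 1
    $NUTCH index "$CRAWL/indexes" "$CRAWL/crawldb" "$CRAWL/linkdb" "$segment" || return 1
    $NUTCH dedup "$CRAWL/indexes" || return 1
    $NUTCH merge "$CRAWL/index" "$CRAWL/indexes" || return 1
}
```

Calling long_route with the defaults prints the eight commands in order, which makes it easy to review the sequence before running anything for real.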
Cheers
|
|
|
2006-05-25 21:16:44 |
Additional comments and a test case below.
On 5/25/06, Zaheed Haque <zaheed.haque gmail.com> wrote:
> bin/nutch crawl - is a one shot command to fetch/generate/index a
> nutch index. I would NOT recommend one to use this one shot command.
That's funny, because when I look at the source code for the crawl command, it does pretty much the same thing as the "whole web crawling" method.
> Please take the long route which will give you more control over your
> tasks. The long route meaning - inject, generate, fetch, updatedb,
> index, dedup, merge. Please see the following -
> Whole web crawling...
> http://lucene.apache.org/nutch/tutorial8.html#Whole-web+Crawling
Yes, I've gone through that tutorial as well and followed it, and I'm having the same problem. The tutorial does not describe how to add to the original index. If you can help me figure this out, I would be glad to add to the tutorial and make it more complete.
Just to be perfectly clear, this is the complete set of steps I take to get the error. (I'm running Java 1.5; only the urls/ directory exists at the beginning.)
$ svn update
$ ant
$ bin/nutch inject crawl.test/crawldb urls/
$ bin/nutch generate crawl.test/crawldb crawl.test/segments -topN 20
$ lastsegment=`ls -d crawl.test/segments/2* | tail -1`
$ bin/nutch fetch $lastsegment
$ bin/nutch updatedb crawl.test/crawldb $lastsegment
$ bin/nutch invertlinks crawl.test/linkdb $lastsegment
$ bin/nutch index crawl.test/indexes crawl.test/crawldb crawl.test/linkdb $lastsegment
$ bin/nutch merge crawl.test/index crawl.test/indexes
$ bin/nutch generate crawl.test/crawldb crawl.test/segments -topN 20
$ lastsegment=`ls -d crawl.test/segments/2* | tail -1`
$ bin/nutch fetch $lastsegment
$ bin/nutch updatedb crawl.test/crawldb $lastsegment
$ bin/nutch invertlinks crawl.test/linkdb $lastsegment
$ bin/nutch index crawl.test/indexes crawl.test/crawldb crawl.test/linkdb $lastsegment
And at this point, I have my problem. I get the following output:
060525 171327 Indexer: adding segment: crawl.test/segments/20060525165518
Exception in thread "main" java.io.IOException: Output directory /home/nutch/nutch/crawl.test/indexes already exists.
    at org.apache.hadoop.mapred.OutputFormatBase.checkOutputSpecs(OutputFormatBase.java:37)
    at org.apache.hadoop.mapred.JobClient.submitJob(JobClient.java:263)
    at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:311)
    at org.apache.nutch.indexer.Indexer.index(Indexer.java:287)
    at org.apache.nutch.indexer.Indexer.main(Indexer.java:304)
So if you could help me figure out what I need to do differently, I will be sure to update the tutorial on the wiki to help others who might have the same problem.
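Judging only from the stack trace above, the failure comes from the job's output check (OutputFormatBase.checkOutputSpecs), which refuses an output directory that already exists. One way to sidestep it, sketched here under the assumption of a local filesystem, is to give every index run its own output directory named after the segment. The index_dir_for helper and the indexes-<segment> naming scheme are invented for this illustration and are not part of Nutch.

```shell
#!/bin/sh
# Hypothetical helper: derive a per-segment index directory so that
# "bin/nutch index" never targets an existing output path.
# Assumes the crawl data lives on the local filesystem.
index_dir_for() {
    crawl_dir=$1
    segment=$2
    out="$crawl_dir/indexes-$(basename "$segment")"
    if [ -e "$out" ]; then
        echo "refusing to reuse existing output directory: $out" >&2
        return 1
    fi
    echo "$out"
}

# Possible usage with the paths from the test case above:
# out=$(index_dir_for crawl.test "$lastsegment") &&
#     bin/nutch index "$out" crawl.test/crawldb crawl.test/linkdb "$lastsegment"
```

The separate per-segment indexes would still have to be combined afterwards, for example with bin/nutch merge.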
Thanks,
Jacob
|
|
|
2006-05-26 08:39:09 |
I am not at all a Nutch expert, I am just experimenting a little bit, but as far as I understand it you can remove the indexes directory and re-index the segments.
In my case, after step 8 (see my first message) I have only one segment:
test/segments/20060522144050
After step 9 I have a second segment:
test/segments/20060522151957
Now what we can do is remove the test/indexes directory and re-index the two segments.
This is what I did:
hadoop dfs -rm test/indexes
nutch index test/indexes test/crawldb linkdb test/segments/20060522144050 test/segments/20060522151957
Hope it helps
-Corrado
|
|
|
2006-05-26 09:32:24 |
Yes, I see what you mean about re-indexing all the segments. However, indexing takes a lot of time, and I was hoping that merging many smaller indexes would be a much more efficient method.
Besides, deleting the index and re-indexing just doesn't seem like *The Right Thing(tm)*.
--
http://JacobBrunson.com
|
|
|
2006-05-26 11:05:36 |
I haven't tried this yet, but could you maybe:
- move the new segments somewhere independent of the existing ones
- create a separate linkdb for them (to my understanding the linkdb is only needed when indexing)
- create a separate index on that
- then move the segment into the segments dir and the new index into the indexes dir as "part-XXXX"
- just merge the indexes (should work relatively fast)
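The untested plan above could be sketched roughly as follows. Everything here is an assumption made for illustration: the add_segment_index function, its four arguments, the scratch directory layout, and in particular the idea that the indexer writes a single part named part-00000. It is a dry run that only prints the commands.

```shell
#!/bin/sh
# Dry run: every command is only printed. Set RUN= (empty) to execute.
RUN=${RUN-echo}

# Hypothetical sketch of the plan above.
#   $1 crawl dir holding crawldb, segments/ and indexes/
#   $2 scratch dir holding the newly fetched segment under segments/
#   $3 name of the new segment
#   $4 name for the new index part inside indexes/ (the "part-XXXX" above)
add_segment_index() {
    crawl=$1 scratch=$2 seg=$3 part=$4
    # Separate linkdb and index for the new segment only.
    $RUN bin/nutch invertlinks "$scratch/linkdb" "$scratch/segments/$seg" || return 1
    $RUN bin/nutch index "$scratch/indexes" "$crawl/crawldb" "$scratch/linkdb" "$scratch/segments/$seg" || return 1
    # Move the segment and its index next to the existing ones.
    # Assumption: the new index consists of one part called part-00000.
    $RUN mv "$scratch/segments/$seg" "$crawl/segments/$seg" || return 1
    $RUN mv "$scratch/indexes/part-00000" "$crawl/indexes/$part" || return 1
    # Merge all index parts into one searchable index.
    $RUN bin/nutch merge "$crawl/index" "$crawl/indexes" || return 1
}
```

Whether merge accepts an output directory that already exists is not established anywhere in this thread, so a real run may need a fresh output name for the merged index.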
In the long term your segments, indexes, etc. add up, so in that case you may need to think about merging segments as well.
Also, this is "only" my current understanding of the topic. It would be nice to get feedback, and maybe easier solutions, from others as well.
Regards,
Stefan
|
|
|
2006-05-26 12:31:21 |
I just tried this and it looks like it is working:
nutch index test/indexes1 test/crawldb linkdb test/segments/20060522181136
nutch index test/indexes2 test/crawldb linkdb test/segments/20060522181136
nutch merge test/index test/indexes1 test/indexes2
Querying also works. I have set searcher.dir in nutch-site.xml to "test" and used the following line to query:
/opt/nutch-0.8-dev/bin/nutch org.apache.nutch.searcher.NutchBean computer
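For reference, the searcher.dir setting mentioned above is an ordinary property entry in nutch-site.xml. The small helper below, whose name is made up for this example, only prints such an entry so it can be pasted inside the file's top-level element; the entry layout is assumed from the stock configuration files rather than taken from this thread.

```shell
#!/bin/sh
# Print a nutch-site.xml property block that points searcher.dir at the
# given crawl directory ("test" in the message above).
searcher_dir_property() {
    cat <<EOF
<property>
  <name>searcher.dir</name>
  <value>$1</value>
</property>
EOF
}
```

Running searcher_dir_property test prints the four-line entry for the directory used above.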
I am just experimenting, so I do not know if this is the right way to do things.
-Corrado