Thread: Re: Exception in DeleteDuplicates in nutch-nightly




Re: Exception in DeleteDuplicates in nutch-nightly
Germany
2007-03-29 04:00:59
I guess the problem lies in the Configuration that I create with
NutchConfiguration.create(), because Nutch runs the DeleteDuplicates
class on the indexes anyway after finishing a crawl, right?
What is really odd to me is that the number of documents reported by
LUKE 0.7 differs from the number reported by Nutch-nightly at the end
of the crawl. I am referring to the number of documents merged at the
end of each crawl.
Does anybody have an idea what could cause this inconsistency?

Tim Benke wrote:
> Hello,
>
> I downloaded nutch-2007-03-27_06-52-06 and crawling works fine. I get
> an error when trying to run DeleteDuplicates directly in Eclipse. The
> corresponding "crawl1\index" opens fine in LUKE 0.7 and queries also
> work. When I try to run it with args "crawl1\indexes", the output in
> hadoop.log is:
>
> 2007-03-27 23:14:33,151 INFO  indexer.DeleteDuplicates - Dedup: starting
> 2007-03-27 23:14:33,198 INFO  indexer.DeleteDuplicates - Dedup: adding indexes in: crawl1/indexes
> 2007-03-27 23:14:33,792 WARN  mapred.LocalJobRunner - job_uyjjzt
> java.lang.ArrayIndexOutOfBoundsException: Array index out of range: 550
>   at org.apache.lucene.util.BitVector.get(BitVector.java:72)
>   at org.apache.lucene.index.SegmentReader.isDeleted(SegmentReader.java:346)
>   at org.apache.nutch.indexer.DeleteDuplicates$InputFormat$DDRecordReader.next(DeleteDuplicates.java:176)
>   at org.apache.hadoop.mapred.MapTask$1.next(MapTask.java:157)
>   at org.apache.hadoop.mapred.MapRunner.run(MapRunner.java:46)
>   at org.apache.hadoop.mapred.MapTask.run(MapTask.java:175)
>   at org.apache.hadoop.mapred.LocalJobRunner$Job.run(LocalJobRunner.java:126)
> 2007-03-27 23:14:34,495 FATAL indexer.DeleteDuplicates - DeleteDuplicates: java.io.IOException: Job failed!
>   at org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:604)
>   at org.apache.nutch.indexer.DeleteDuplicates.dedup(DeleteDuplicates.java:439)
>   at org.apache.nutch.indexer.DeleteDuplicates.run(DeleteDuplicates.java:506)
>   at org.apache.hadoop.util.ToolBase.doMain(ToolBase.java:189)
>   at org.apache.nutch.indexer.DeleteDuplicates.main(DeleteDuplicates.java:490)
>
> Another thing I don't understand is that after crawling, Nutch claims
> 551 documents while LUKE states the index has only 473 documents.
>
> thanks in advance,
>
> Tim Benke


Re: Exception in DeleteDuplicates in nutch-nightly
Germany
2007-03-29 11:00:42
I just wanted to tell you that I found my error. With the former Nutch
version it never happened to me that documents were deleted after
crawling, but this time that was the problem.
DeleteDuplicates needs an optimized index to work on. What I mean is
that all deletions have to be flushed already, because only then is the
number of documents in the index correct and no
ArrayIndexOutOfBoundsException can occur.
And of course the duplicates were also deleted, as far as I can tell,
because the index is much smaller now...
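
To illustrate the two document counts discussed above, here is a minimal
sketch in plain Java. It is a toy model, not Lucene's or Nutch's actual
code: the class SegmentSketch and its methods are hypothetical names, and
it assumes that the 551 reported by Nutch counts every document slot
(including ones marked deleted) while the 473 shown by LUKE counts only
live documents. Under that assumption 78 documents were marked deleted,
and an optimize-style compaction makes both counts agree.

```java
import java.util.BitSet;

/** Toy model of an index segment with deletions (hypothetical, not Lucene code). */
class SegmentSketch {
    private final int maxDoc;      // number of document slots, including deleted ones
    private final BitSet deleted;  // bit i set => document i is marked deleted

    SegmentSketch(int maxDoc) {
        this.maxDoc = maxDoc;
        this.deleted = new BitSet(maxDoc);
    }

    /** Marks a document as deleted without freeing its slot. */
    void delete(int doc) {
        if (doc < 0 || doc >= maxDoc) {
            throw new IllegalArgumentException("no such document: " + doc);
        }
        deleted.set(doc);
    }

    /** Count of all slots, deleted or not. */
    int maxDoc() {
        return maxDoc;
    }

    /** Count of live documents only. */
    int numDocs() {
        return maxDoc - deleted.cardinality();
    }

    /** Models an optimize: deleted slots are dropped and the rest renumbered. */
    SegmentSketch optimized() {
        return new SegmentSketch(numDocs());
    }

    public static void main(String[] args) {
        SegmentSketch segment = new SegmentSketch(551);
        // Mark 78 arbitrary documents as deleted (assumed figure: 551 - 473).
        for (int i = 0; i < 78; i++) {
            segment.delete(i * 7);
        }
        System.out.println("before: maxDoc=" + segment.maxDoc()
                + " numDocs=" + segment.numDocs());

        SegmentSketch compact = segment.optimized();
        System.out.println("after:  maxDoc=" + compact.maxDoc()
                + " numDocs=" + compact.numDocs());
    }
}
```

In the real index, the corresponding figures would presumably come from
Lucene's IndexReader.maxDoc() and IndexReader.numDocs(); the sketch only
shows why the two can differ until deletions are merged away.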


