Thread: Out of order key while in reduce phase




Out of order key while in reduce phase
2007-10-19 18:40:30
I'm trying to perform a mapreduce of IntWritable/{URL,CrawlDatum} ->
URL/CrawlDatum, but I want the output to be sorted by the initial
IntWritable and the partitioner to partition by host.  I wrote a
mapreduce with an identity mapper, a partitioner that pulls out the
host from the URL, and a reducer that outputs just URL and CrawlDatum.
However, every time I run it, as soon as the reduce phase begins
(Reduce > Reduce), it gives me this error:

java.io.IOException: key out of order: http://web1.incl.ne.jp/ after http://who2.com/
        at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:169)
        at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:155)
        at org.apache.hadoop.mapred.MapFileOutputFormat$1.write(MapFileOutputFormat.java:56)
        at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:340)
        at org.apache.nutch.crawl.TimeSorter$FinalTimeSortMR.reduce(TimeSorter.java:96)
        at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:355)
        at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1707)


When I checked out the MapFileOutputFormat.append() method, it says
the keys must be sorted, so I figured a quick change to
job.setOutputFormat(SequenceFileOutputFormat.class) would fix it, but
I still see the exact same error message.  Is this something others
have seen, or would this be a better fit for the hadoop-user mailing
list?

Thanks,
Ned
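
To illustrate the check named in the trace above: the following is a minimal plain-Java sketch of a writer that rejects keys arriving out of order. It is not Hadoop's MapFile.Writer source; the class names OrderCheckingWriter and OrderCheckDemo are hypothetical, and the assumption is that the reducer receives records ordered by the IntWritable but emits the URL as the output key, so the URLs reach the writer in integer order rather than URL order.

```java
import java.io.IOException;

// Hypothetical stand-in for a writer that requires sorted keys.
// This is NOT Hadoop's MapFile.Writer, only an illustration of the idea.
class OrderCheckingWriter {
    private String lastKey = null;
    private int count = 0;

    // Rejects any key that sorts before the previously appended key.
    void append(String key, String value) throws IOException {
        if (lastKey != null && key.compareTo(lastKey) < 0) {
            throw new IOException("key out of order: " + key + " after " + lastKey);
        }
        lastKey = key;
        count++;
    }

    int size() {
        return count;
    }
}

class OrderCheckDemo {
    // Emits URLs in the order a sort on the integer key might deliver them.
    // Returns the error message, or null if every append succeeded.
    static String emitInIntegerOrder() {
        OrderCheckingWriter writer = new OrderCheckingWriter();
        // Assumed order after sorting by the IntWritable, not by URL.
        String[] urls = {"http://who2.com/", "http://web1.incl.ne.jp/"};
        try {
            for (String url : urls) {
                writer.append(url, "datum");
            }
            return null;
        } catch (IOException e) {
            return e.getMessage();
        }
    }

    public static void main(String[] args) {
        System.out.println(emitInIntegerOrder());
    }
}
```

In this sketch the second append fails because "http://web1.incl.ne.jp/" sorts before "http://who2.com/" as a string, which is the same pair of URLs reported in the exception.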

Re: Out of order key while in reduce phase
2007-10-21 00:05:44
Hey Ned,

SequenceFile: Support for flat files of binary key/value pairs.
SequenceFileOutputFormat: plain files with name as data-xxxxx
MapFile: A file-based map from keys to values
MapFileOutputFormat: A dir for each MapFile containing a "data" file
and an "index" file.

So my thought is that both file types are still representations of a
map.

As far as map reduce is concerned, I think this solution might work
for you: create a new key made up of the URL and the INT.  Its compare
function should use the INT values, and the custom partitioner will
still be able to partition on the basis of the host of the URL.
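
A rough sketch of the composite-key idea above, in plain Java rather than against the Hadoop interfaces. The names TimeUrlKey, HostPartition and CompositeKeyDemo are hypothetical; in a real job the key would presumably implement WritableComparable and the partition logic would live in a Partitioner. The tie-break on URL is an added assumption, used only so that equal integers still have a defined order.

```java
import java.net.URI;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Hypothetical composite key: ordered by the integer, carrying the URL along.
class TimeUrlKey implements Comparable<TimeUrlKey> {
    final int time;
    final String url;

    TimeUrlKey(int time, String url) {
        this.time = time;
        this.url = url;
    }

    @Override
    public int compareTo(TimeUrlKey other) {
        int byTime = Integer.compare(time, other.time);
        // Assumption: break ties on the URL so the order is total.
        return byTime != 0 ? byTime : url.compareTo(other.url);
    }
}

class HostPartition {
    // Picks a partition from the URL's host only, ignoring the integer.
    static int partitionFor(TimeUrlKey key, int numPartitions) {
        String host = URI.create(key.url).getHost();
        if (host == null) {
            host = key.url; // fall back to the raw string if no host parses
        }
        return (host.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}

class CompositeKeyDemo {
    // Sorts a few sample keys and returns their URLs in sorted order.
    static List<String> sortedUrls() {
        List<TimeUrlKey> keys = new ArrayList<>();
        keys.add(new TimeUrlKey(2, "http://web1.incl.ne.jp/"));
        keys.add(new TimeUrlKey(1, "http://who2.com/"));
        keys.add(new TimeUrlKey(3, "http://who2.com/about"));
        Collections.sort(keys);
        List<String> urls = new ArrayList<>();
        for (TimeUrlKey key : keys) {
            urls.add(key.url);
        }
        return urls;
    }
}
```

With a key like this, the sort order and the partitioning no longer depend on the same field: keys compare on the integer, while two URLs on the same host land in the same partition regardless of their integers.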




Ned Rockson wrote:
> I'm trying to perform a mapreduce of IntWritable/{URL,CrawlDatum} ->
> URL/CrawlDatum, but I want the output to be sorted by the initial
> IntWritable and the partitioner to partition by host.  I wrote a
> mapreduce with an identity mapper, a partitioner that pulls out the
> host from the URL, and a reducer that outputs just URL and CrawlDatum.
> However, every time I run it, as soon as the reduce phase begins
> (Reduce > Reduce), it gives me this error:
>
> java.io.IOException: key out of order: http://web1.incl.ne.jp/ after http://who2.com/
>         at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:169)
>         at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:155)
>         at org.apache.hadoop.mapred.MapFileOutputFormat$1.write(MapFileOutputFormat.java:56)
>         at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:340)
>         at org.apache.nutch.crawl.TimeSorter$FinalTimeSortMR.reduce(TimeSorter.java:96)
>         at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:355)
>         at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1707)
>
> When I checked out the MapFileOutputFormat.append() method, it says
> the keys must be sorted, so I figured a quick change to
> job.setOutputFormat(SequenceFileOutputFormat.class) would fix it, but
> I still see the exact same error message.  Is this something others
> have seen, or would this be a better fit for the hadoop-user mailing
> list?
>
> Thanks,
> Ned

