Hey Ned,
SequenceFile : Support for flat files of binary key/value pairs.
SequenceFileOutputFormat : plain files with name as data-xxxxx
MapFile : A file-based map from keys to values
MapFileOutputFormat : A dir for each MapFile containing a "data" file and an "index" file.
So my thought is that both file types are still representations of a map.
As far as MapReduce is concerned, I think this solution might work for you:
Create a new key.
KEY : URL and INT. The compare function should use the INT values, and
the custom partitioner will be able to partition on the basis of the
host of the URL.
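
A rough sketch of that key, in plain Java so it runs without Hadoop on the classpath. The names UrlIntKey, sortValue and partitionFor are made up for illustration. In a real job the key would also have to implement Hadoop's WritableComparable, and the partition logic would sit inside a Partitioner implementation:

```java
import java.net.URI;

// Hypothetical composite key: the INT drives the sort order, the URL rides along.
final class UrlIntKey implements Comparable<UrlIntKey> {
    final int sortValue;
    final String url;

    UrlIntKey(int sortValue, String url) {
        this.sortValue = sortValue;
        this.url = url;
    }

    // Compare on the INT first; fall back to the URL so the ordering is total.
    @Override
    public int compareTo(UrlIntKey other) {
        int byInt = Integer.compare(sortValue, other.sortValue);
        return byInt != 0 ? byInt : url.compareTo(other.url);
    }

    // Partition on the host of the URL only, ignoring the INT.
    // Assumption: if the URL has no parseable host, the whole URL string is used.
    static int partitionFor(UrlIntKey key, int numPartitions) {
        String host = URI.create(key.url).getHost();
        String basis = (host != null) ? host : key.url;
        // Mask the sign bit so the result is never negative.
        return (basis.hashCode() & Integer.MAX_VALUE) % numPartitions;
    }
}
```

With a key like this, two URLs from the same host land in the same partition whatever their INT values are, while the ordering inside each partition follows the INT.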
Ned Rockson wrote:
> I'm trying to perform a mapreduce of IntWritable/{URL,CrawlDatum} ->
> URL/CrawlDatum but I want the output to be sorted by the initial
> IntWritable and the partitioner to partition by host. I wrote a
> mapreduce with an identity mapper, a partitioner that pulls out the
> host from the url and the reducer outputs just url, crawldatum,
> however every time I run it, as soon as the reduce phase begins
> (Reduce > Reduce) it gives me this error:
>
> java.io.IOException: key out of order: http://web1.incl.ne.jp/ after http://who2.com/
>     at org.apache.hadoop.io.MapFile$Writer.checkKey(MapFile.java:169)
>     at org.apache.hadoop.io.MapFile$Writer.append(MapFile.java:155)
>     at org.apache.hadoop.mapred.MapFileOutputFormat$1.write(MapFileOutputFormat.java:56)
>     at org.apache.hadoop.mapred.ReduceTask$3.collect(ReduceTask.java:340)
>     at org.apache.nutch.crawl.TimeSorter$FinalTimeSortMR.reduce(TimeSorter.java:96)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:355)
>     at org.apache.hadoop.mapred.TaskTracker$Child.main(TaskTracker.java:1707)
>
>
> When I checked out the MapFileOutputFormat.append() method, it says
> the keys must be sorted, so I figured a quick change to
> job.setOutputFormat(SequenceFileOutputFormat.class) would fix it, but
> I still see the exact same error message. Is this something others
> have seen or would this be better fit in the hadoop-user mailing list?
>
> Thanks,
> Ned
>
>