Thread: sort speeds under java, c++, and streaming

sort speeds under java, c++, and streaming
2007-11-08 19:03:01
I set up a little benchmark on a 39-node cluster to sort 40 GB of
random text data (generated by RandomTextWriter using key length:
1-10 words and value length: 0-200 words, data uncompressed). The
runtimes in minutes are:

Java:			4:22
C++ (Pipes):		3:50
Streaming:		4:44

I was surprised to find that Pipes outperformed Java, even with the
extra process. I suspect it was because of the buffering between the
input and output of Pipes.

-- Owen
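
For readers who want to compare the three timings directly, here is a rough sketch that converts them to seconds, to a ratio against the Java run, and to aggregate throughput. It assumes the figures are minutes:seconds and that "40gb" means 40 * 2^30 bytes; the names to_seconds, summarize, RUNTIMES and DATA_BYTES are made up for this illustration and are not part of Hadoop.

```python
# Sketch only: assumes the posted runtimes are minutes:seconds and that
# the sorted input is 40 * 2**30 bytes. Neither is stated precisely.

RUNTIMES = {"Java": "4:22", "C++ (Pipes)": "3:50", "Streaming": "4:44"}
DATA_BYTES = 40 * 2**30  # assumption: "40gb" read as 40 GiB


def to_seconds(mmss: str) -> int:
    """Convert a 'minutes:seconds' string to whole seconds."""
    minutes, seconds = mmss.split(":")
    return int(minutes) * 60 + int(seconds)


def summarize(runtimes: dict, data_bytes: int) -> dict:
    """Return seconds, ratio to the Java run, and MiB/s for each run."""
    java_seconds = to_seconds(runtimes["Java"])
    rows = {}
    for name, mmss in runtimes.items():
        secs = to_seconds(mmss)
        rows[name] = {
            "seconds": secs,
            "vs_java": secs / java_seconds,
            "mib_per_s": data_bytes / 2**20 / secs,
        }
    return rows


if __name__ == "__main__":
    for name, row in summarize(RUNTIMES, DATA_BYTES).items():
        print(f"{name:12s} {row['seconds']:4d} s  "
              f"{row['vs_java']:.2f}x Java  {row['mib_per_s']:.0f} MiB/s")
```

Under these assumptions the Pipes run takes about 88% of the Java time and the Streaming run about 108%.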

Re: sort speeds under java, c++, and streaming
2007-11-08 19:11:08
Neat benchmark. I've been meaning to do exactly that myself. And that
is a surprise about Pipes!

Thanks for the data
- Aaron

Owen O'Malley wrote:
> I set up a little benchmark on a 39-node cluster to sort 40 GB of random
> text data (generated by RandomTextWriter using key length: 1-10 words
> and value length: 0-200 words, data uncompressed). The runtimes in
> minutes are:
>
> Java:            4:22
> C++ (Pipes):     3:50
> Streaming:       4:44
>
> I was surprised to find that Pipes outperformed Java, even with the
> extra process. I suspect it was because of the buffering between the
> input and output of Pipes.
>
> -- Owen

RE: sort speeds under java, c++, and streaming
2007-11-08 19:35:53
Doesn't the sorting and merging all still happen in Java-land?

Re: sort speeds under java, c++, and streaming
2007-11-08 19:39:46
Hi Owen,

Can you provide more details of your test?  In particular, what was the
Java Map-reduce program that you ran?  Was it
src/examples/org/apache/hadoop/examples/Sort.java?  Also, I can't find
anything called "RandomTextWriter" in the source tarball; can you point
me to it?  Thanks.

- Doug

Re: sort speeds under java, c++, and streaming
2007-11-08 21:00:11
On Nov 8, 2007, at 5:39 PM, Doug Judd wrote:

> Can you provide more details of your test?

Sure, I guess I should have been more specific to start with. *grin*

The data was generated with:

bin/hadoop jar hadoop-0.15.0-dev-examples.jar randomtextwriter \
    -conf gridmix-text.xml \
    -outFormat org.apache.hadoop.mapred.TextOutputFormat \
    /gridmix/data/sort/text

Contents of gridmix-text.xml:

<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="nutch-conf.xsl"?>

<configuration>

<property>
  <name>test.randomtextwrite.total_bytes</name>
  <value>429496729600</value>
</property>

<property>
  <name>test.randomtextwrite.min_words_key</name>
  <value>1</value>
</property>

<property>
  <name>test.randomtextwrite.max_words_key</name>
  <value>10</value>
</property>

<property>
  <name>test.randomtextwrite.min_words_value</name>
  <value>0</value>
</property>

<property>
  <name>test.randomtextwrite.max_words_value</name>
  <value>200</value>
</property>

</configuration>

And then ran the sort as:

Java:
bin/hadoop jar hadoop-0.15.0-dev-examples.jar sort \
    -inFormat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -outFormat org.apache.hadoop.mapred.TextOutputFormat \
    -outKey org.apache.hadoop.io.Text \
    -outValue org.apache.hadoop.io.Text \
    /gridmix/data/sort/text/part-*0 java-out

Pipes:
bin/hadoop pipes -input /gridmix/data/sort/text/part-*0 -output pipe-out \
    -inputformat org.apache.hadoop.mapred.KeyValueTextInputFormat \
    -program /gridmix/programs/pipes-sort -reduces 78 \
    -jobconf mapred.output.key.class=org.apache.hadoop.io.Text,mapred.output.value.class=org.apache.hadoop.io.Text \
    -writer org.apache.hadoop.mapred.TextOutputFormat

Streaming:
bin/hadoop jar contrib/hadoop-0.15.0-dev-streaming.jar \
    -input /gridmix/data/sort/text/part-*0 -output stream-out \
    -mapper cat -reducer cat \
    -numReduceTasks 78

Note that these are the commands I used, although they generate 400 GB
of data and then only sort 10% of it. Clearly, it is a bit faster to
just generate 40 GB and sort all of it. I'm just going to run the
bigger sort in the next couple of days.

> In particular, what was the Java Map-reduce program that you ran?
> Was it src/examples/org/apache/hadoop/examples/Sort.java?

Yes

> Also, I can't find anything called "RandomTextWriter" in the source
> tarball, can you point me to it?

It is in the examples directory of 0.15 too. The only remaining piece
is the pipes sort program, and I'll upload that to HADOOP-2127.

-- Owen
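
As a quick check of the "generate 400gb, sort 10%" arithmetic, here is a small sketch. It assumes the generated output files follow the usual part-NNNNN naming and that the job produced some round number of them (1000 is a made-up count, not taken from this thread). Python's fnmatch is used only to mimic the part-*0 glob; the real expansion is done by Hadoop, and the function names are made up for this illustration.

```python
from fnmatch import fnmatch

# Value of test.randomtextwrite.total_bytes from gridmix-text.xml.
TOTAL_BYTES = 429496729600


def gib(num_bytes: int) -> float:
    """Express a byte count in units of 2**30 bytes."""
    return num_bytes / 2**30


def matching_fraction(num_parts: int, pattern: str = "part-*0") -> float:
    """Fraction of part-NNNNN files whose names match the glob pattern.

    num_parts is hypothetical; the thread does not say how many
    output files randomtextwriter produced.
    """
    names = [f"part-{i:05d}" for i in range(num_parts)]
    matched = [name for name in names if fnmatch(name, pattern)]
    return len(matched) / len(names)


if __name__ == "__main__":
    print(f"generated: {gib(TOTAL_BYTES):.0f} GiB")
    fraction = matching_fraction(1000)
    print(f"part-*0 selects {fraction:.0%} of files, "
          f"about {gib(TOTAL_BYTES) * fraction:.0f} GiB")
```

The glob keeps only files whose names end in 0, which is one file in ten when the files are numbered consecutively, so roughly 40 of the 400 units are sorted if the files are of similar size.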

Re: sort speeds under java, c++, and streaming
2007-11-08 22:39:16
Thanks, Owen.  Did it look like the system was CPU bound?  It would be
interesting to see some top output for the various runs.  It would also
be interesting to profile the Java stuff in both Pipes mode and
non-Pipes mode.

- Doug

Re: sort speeds under java, c++, and streaming
2007-11-09 02:15:04
On Nov 8, 2007, at 8:39 PM, Doug Judd wrote:

> Thanks, Owen.  Did it look like the system was CPU bound?

I looked while the Java one was running and it was working a couple
of the CPUs pretty hard. (I was only running with the default
2 tasks/node, which is really low given these are nice 8-CPU machines.)

I should also mention that I was using a 500 node hdfs cluster that
is a superset of the 39 node + 1 job tracker map/reduce cluster, so
most of the hdfs reads and writes were outside of the map/reduce
cluster.

> It would be interesting to see some top output for the various runs.
> It would also be interesting to profile the Java stuff in both Pipes
> mode and non-Pipes mode.

What I'm doing is putting together a somewhat representative workload
to look at increasing utilization, so at some point I'll deep dive
into the detail, but the first pass will be looking at the top level
issues.

-- Owen
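
For reference, the task-slot arithmetic implied by the figures in this thread can be sketched as follows. This is only an illustration: it assumes each running task keeps roughly one CPU busy, and that the 78 reduces in the Pipes and Streaming commands were chosen to match 39 nodes times 2 tasks per node. Neither is stated in the thread, and the function names are made up.

```python
# Figures quoted in the thread.
NODES = 39            # map/reduce nodes (excluding the job tracker)
TASKS_PER_NODE = 2    # default concurrent tasks per node
CPUS_PER_NODE = 8     # CPUs per machine
REDUCES = 78          # -reduces / -numReduceTasks in the sort commands


def concurrent_tasks(nodes: int, tasks_per_node: int) -> int:
    """Number of tasks the cluster can run at the same time."""
    return nodes * tasks_per_node


def cpu_share_in_use(tasks_per_node: int, cpus_per_node: int) -> float:
    """Share of a node's CPUs kept busy, assuming one CPU per task."""
    return tasks_per_node / cpus_per_node


if __name__ == "__main__":
    slots = concurrent_tasks(NODES, TASKS_PER_NODE)
    print(f"concurrent tasks: {slots} (reduces requested: {REDUCES})")
    print(f"CPU share in use per node: "
          f"{cpu_share_in_use(TASKS_PER_NODE, CPUS_PER_NODE):.0%}")
```

With these numbers the cluster runs 78 tasks at once, and under the one-CPU-per-task assumption only a quarter of each machine's CPUs are occupied, which fits the remark that 2 tasks/node is low for these machines.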


