
Thread: Re: sort speeds under java, c++, and streaming




Re: sort speeds under java, c++, and streaming
2007-11-08 19:14:44
When you figure it out, could you please suggest an
optimization for streaming?

Does Pipes deserialize and serialize data for the identity
mappers, or does it just "pass it through"? (Streaming
converts input to text, afaik.)

- milind


----- Original Message -----
From: Owen O'Malley <oomyahoo-inc.com>
To: hadoop-userlucene.apache.org <hadoop-userlucene.apache.org>
Sent: Thu Nov 08 17:03:01 2007
Subject: sort speeds under java, c++, and streaming

I set up a little benchmark on a 39-node cluster to sort 40 GB of
random text data (generated by RandomTextWriter using key length:
1-10 words and value length: 0-200 words, data uncompressed). The
runtimes in minutes are:

Java:			4:22
C++ (Pipes):		3:50
Streaming:		4:44

I was surprised to find that Pipes outperformed Java, even with the
extra process. I suspect it was because of the buffering between the
input and output of Pipes.

-- Owen
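
To put the three runtimes above on a common scale, the following sketch converts each one to seconds and expresses it relative to the Java run. It assumes the figures are minutes:seconds; the relative differences come out the same under any consistent reading. The helper names `toSeconds` and `percentVsBaseline` are made up for this illustration.

```cpp
#include <cstddef>
#include <string>

// Convert a runtime written as "M:SS" (assumed minutes:seconds) to seconds.
int toSeconds(const std::string& mmss) {
    const std::size_t colon = mmss.find(':');
    return std::stoi(mmss.substr(0, colon)) * 60 +
           std::stoi(mmss.substr(colon + 1));
}

// Percentage difference of a runtime against a baseline runtime.
// Negative means faster than the baseline, positive means slower.
double percentVsBaseline(int seconds, int baselineSeconds) {
    return 100.0 * (seconds - baselineSeconds) / baselineSeconds;
}
```

With Java (4:22) as the baseline, `percentVsBaseline(toSeconds("3:50"), toSeconds("4:22"))` is roughly -12, and the same call for Streaming (4:44) is roughly +8. In other words, Pipes took about 12% less time than Java and Streaming about 8% more.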
Re: sort speeds under java, c++, and streaming
2007-11-08 21:10:30
On Nov 8, 2007, at 5:14 PM, Milind A Bhandarkar wrote:

> Does Pipes deserialize and serialize data for the identity
> mappers, or does it just "pass it through"? (Streaming converts
> input to text, afaik.)

Pipes serializes the objects to bytes and sends them to the C++
program. The C++ program gets them as C++ strings, which are
effectively byte arrays. Pipes does not do the conversion to Java
strings that Streaming does. Therefore, Pipes can support arbitrary
Writable objects. Hopefully in the future, we can change the
map/reduce API to provide access to the raw bytes in the mapper and
reducer as an option. In that case, Pipes would not need to serialize
at all.

-- Owen
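
As an illustration of "C++ strings are effectively byte arrays", here is a minimal sketch of how a binary value can travel through a `std::string` without any text conversion. It assumes an integer serialized as 4 big-endian bytes, which is the layout Java's `DataOutput.writeInt` produces. The helper names `encodeInt32` and `decodeInt32` are hypothetical and are not part of the Pipes API.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <string>

// Encode a 32-bit integer as 4 big-endian bytes (assumed wire layout,
// matching what Java's DataOutput.writeInt writes).
std::string encodeInt32(std::int32_t value) {
    std::uint32_t u;
    std::memcpy(&u, &value, sizeof u);  // reinterpret the bits as unsigned
    std::string bytes(4, '\0');
    bytes[0] = static_cast<char>((u >> 24) & 0xFF);
    bytes[1] = static_cast<char>((u >> 16) & 0xFF);
    bytes[2] = static_cast<char>((u >> 8) & 0xFF);
    bytes[3] = static_cast<char>(u & 0xFF);
    return bytes;
}

// Decode 4 big-endian bytes back into a 32-bit integer.
std::int32_t decodeInt32(const std::string& bytes) {
    assert(bytes.size() == 4);
    std::uint32_t u = 0;
    for (unsigned char c : bytes) {
        u = (u << 8) | c;  // shift in one byte at a time, high byte first
    }
    std::int32_t value;
    std::memcpy(&value, &u, sizeof value);
    return value;
}
```

Note that `encodeInt32(1)` holds three NUL bytes followed by 0x01, yet `size()` still reports 4: a `std::string` tracks its length explicitly, so embedded zero bytes survive. That property is what lets a byte-oriented interface carry arbitrary serialized objects, whereas a text-oriented one has to convert them first.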
