Thread: Hadoop overhead




2008-01-15 11:15:50
Hi.

I believe someone posted about this a while back, but it's worth
mentioning again.

I just ran a job on our 10-node cluster where the input data was
~70 empty sequence files. With our default settings, this ran ~200
mappers and ~70 reducers.

The job took almost exactly two minutes to finish.

How can we reduce this overhead?

* Pick the number of mappers and reducers more dynamically,
   depending on the size of the input?
* JVM reuse: one JVM per job instead of one per task?

Any other ideas?

/Johan
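
To make the first suggestion concrete, here is a minimal sketch of sizing the task count from the input instead of using a fixed default. The class, method, and the 64 MB per-task figure are hypothetical choices for illustration, not part of Hadoop. In the old `mapred` API the result could presumably be passed to `JobConf.setNumReduceTasks(int)`; note that `JobConf.setNumMapTasks(int)` is only a hint to the framework.

```java
// Hypothetical helper, not part of Hadoop: derive a task count from input size.
final class TaskCountSketch {

    /**
     * Returns ceil(inputBytes / bytesPerTask), clamped to the range [1, maxTasks].
     * Empty input therefore yields a single task rather than the cluster default.
     */
    static int tasksFor(long inputBytes, long bytesPerTask, int maxTasks) {
        if (bytesPerTask <= 0 || maxTasks < 1) {
            throw new IllegalArgumentException("bytesPerTask and maxTasks must be positive");
        }
        long bytes = Math.max(0L, inputBytes);
        // Ceiling division written to avoid overflow on very large inputs.
        long needed = bytes / bytesPerTask + (bytes % bytesPerTask == 0 ? 0 : 1);
        return (int) Math.max(1L, Math.min(needed, (long) maxTasks));
    }

    public static void main(String[] args) {
        long perTask = 64L * 1024 * 1024; // assumed amount of input per task
        int cap = 70;                     // assumed upper bound, e.g. the current default

        System.out.println(tasksFor(0L, perTask, cap));             // empty input
        System.out.println(tasksFor(10 * perTask + 1, perTask, cap)); // just over 10 units
        System.out.println(tasksFor(1000 * perTask, perTask, cap)); // large input, capped
    }
}
```

With empty input this picks one task, so the per-task startup cost is paid once rather than ~70 times.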

Re: Hadoop overhead
2008-01-15 11:22:05
Why so many mappers and reducers relative to the number of machines
you have?  This just causes excess heartache when running the job.

My standard practice is to run slightly more tasks per machine than
the number of cores I have (for instance, 3 tasks on a 2-core
machine).  In fact, I find it most helpful to let the cluster
defaults rule the choice, except in a few cases where I want one
reducer or a few more than the standard 4 reducers.
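
The rule of thumb above can be sketched as simple arithmetic. The class and method names are hypothetical, and the 2 cores per node is an assumption taken from the example in this message, not a stated fact about the 10-node cluster in question.

```java
// Hypothetical sketch of the "a few more tasks than cores" rule of thumb.
final class SlotSketch {

    /** Concurrent tasks per node: cores times a small factor, rounded up, at least 1. */
    static int slotsPerNode(int cores, double factor) {
        return Math.max(1, (int) Math.ceil(cores * factor));
    }

    /** Total concurrent tasks the cluster can run at once. */
    static int clusterSlots(int nodes, int cores, double factor) {
        return nodes * slotsPerNode(cores, factor);
    }

    /** Number of sequential rounds needed to run all tasks on the given slots. */
    static int waves(int tasks, int slots) {
        return (tasks + slots - 1) / slots; // ceiling division
    }

    public static void main(String[] args) {
        int nodes = 10;      // from the original post
        int cores = 2;       // assumption: cores per node are not stated in the thread
        double factor = 1.5; // 3 tasks on a 2-core machine

        int slots = clusterSlots(nodes, cores, factor);
        System.out.println("slots per node: " + slotsPerNode(cores, factor));
        System.out.println("cluster slots:  " + slots);
        System.out.println("waves for 200 mappers: " + waves(200, slots));
    }
}
```

Under these assumptions, ~200 mappers would run in several rounds, each paying task startup cost even when the input is empty.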


On 1/15/08 9:15 AM, "Johan Oskarsson" <johanoskarsson.nu> wrote:

> Hi.
>
> I believe someone posted about this a while back, but it's worth
> mentioning again.
>
> I just ran a job on our 10 node cluster where the input data was
> ~70 empty sequence files, with our default settings this ran
> about ~200 mappers and ~70 reducers.
>
> The job took almost exactly two minutes to finish.
>
> How can we reduce this overhead?
>
> * Pick number of mappers and reducers in a more dynamic way,
>    depending on the size of the input?
> * JVM reuse, one jvm per job instead of one per task?
>
> Any other ideas?
>
> /Johan

