Why so many mappers and reducers relative to the number of machines you
have? This just causes excess heartache when running the job.

My standard practice is to run with a number of tasks that is a small
factor larger than the number of cores that I have (for instance, 3 tasks
on a 2-core machine). In fact, I find it most helpful to have the cluster
defaults rule the choice, except in a few cases where I want one reducer
or a few more than the standard 4 reducers.
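
As a rough illustration of that rule of thumb, here is a minimal sketch
of the arithmetic. The class and method names are made up for this
example and are not part of Hadoop; the factor of 1.5 is simply what
"3 tasks on a 2 core machine" works out to, and the 2 cores per node
used for the 10 node cluster is an assumption, since the core count of
that cluster was never stated.

```java
public class TaskSlots {
    // Hypothetical helper: concurrent tasks per machine, computed as
    // cores * factor and rounded up to a whole task.
    static int slotsPerNode(int cores, double factor) {
        if (cores < 1 || factor <= 0) {
            throw new IllegalArgumentException("cores must be >= 1 and factor > 0");
        }
        return (int) Math.ceil(cores * factor);
    }

    // Hypothetical helper: total concurrent tasks across identical nodes.
    static int clusterSlots(int nodes, int cores, double factor) {
        if (nodes < 1) {
            throw new IllegalArgumentException("nodes must be >= 1");
        }
        return nodes * slotsPerNode(cores, factor);
    }

    public static void main(String[] args) {
        // 2 cores with a factor of 1.5 gives the 3 tasks mentioned above.
        System.out.println(slotsPerNode(2, 1.5));
        // 10 nodes, assuming 2 cores each (the core count is a guess).
        System.out.println(clusterSlots(10, 2, 1.5));
    }
}
```

Under those assumptions the cluster runs about 30 tasks at once, so a
job split into ~200 mappers and ~70 reducers goes through several waves
of task startup even when the input is empty.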
On 1/15/08 9:15 AM, "Johan Oskarsson" <johan oskarsson.nu> wrote:
> Hi.
>
> I believe someone posted about this a while back, but it's worth
> mentioning again.
>
> I just ran a job on our 10 node cluster where the input data was
> ~70 empty sequence files, with our default settings this ran about
> ~200 mappers and ~70 reducers.
>
> The job took almost exactly two minutes to finish.
>
> How can we reduce this overhead?
>
> * Pick number of mappers and reducers in a more dynamic way,
>   depending on the size of the input?
> * JVM reuse, one jvm per job instead of one per task?
>
> Any other ideas?
>
> /Johan