"mapred.tasktracker.tasks.maximum" does apply per task type.

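As a rough check of the per-type reading against the numbers in your message, here is a small Python sketch. It assumes one tasktracker per node with every slot usable, and the variable names are only illustrative, not Hadoop settings or APIs.

```python
import math

# Figures taken from the quoted message.
nodes = 21
tasks_maximum = 2      # value of mapred.tasktracker.tasks.maximum
map_tasks = 544
reduce_tasks = 41

# Assumption: one tasktracker per node, and the limit is applied
# separately to map tasks and to reduce tasks.
map_slots = nodes * tasks_maximum
reduce_slots = nodes * tasks_maximum

# Number of rounds needed to run all maps if every slot stays busy.
map_waves = math.ceil(map_tasks / map_slots)

# True when every reduce task can hold a slot from the start of the job.
all_reduces_fit = reduce_tasks <= reduce_slots

print(map_slots, reduce_slots, map_waves, all_reduces_fit)
```

Under these assumptions the cluster has 42 map slots, which is consistent with the 42 concurrent map tasks you observed, and all 41 reduce tasks fit into the separate reduce slots without taking any slot away from the maps.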
The reason reduce tasks launch from the get-go is that they collect the
output from map tasks as soon as it's available. The observation is that
the shuffle of data from map tasks to reduce tasks over the network is
often the number one bottleneck of the entire job, so starting it early
and keeping the network saturated throughout job execution reduces the
total job execution time.

In your case, ideally your 41 reducers will have almost all of their
input ready and waiting when the map tasks complete, and will
immediately start sorting and reducing. More likely, the maps will
complete faster than the data can be shipped to the reducers, so the
reducers will still have to wait for it, but for less time than if they
had only just been launched, because data was being shipped to them
throughout map execution.

Yoram
> -----Original Message-----
> From: Kalbande, Manish [mailto:mkalbande shopping.com]
> Sent: Thursday, July 20, 2006 11:32 AM
> To: hadoop-user lucene.apache.org
> Subject: Task type priorities during scheduling ?
>
> Hi,
>
> I am running a cluster of 21 nodes.
> While running any task I observed that reduce tasks are getting
> scheduled much before all the map tasks are finished.
> As a result, reduce tasks are waiting for map tasks to finish, and the
> total time for map tasks is longer because they are not getting
> scheduled quickly.
>
> It would be better if reduce tasks were scheduled only after there are
> no map tasks left to be performed.
>
> For example, during the generate job, we had a total of 544 map tasks
> and 41 reduce tasks.
> All 41 reduce tasks got scheduled, and only 42 map tasks could be
> scheduled at a time.
>
> My current configuration:
>
> mapred.map.tasks = 83
> mapred.reduce.tasks=41
> mapred.tasktracker.tasks.maximum=2
>
> Also, does "mapred.tasktracker.tasks.maximum" apply per task type,
> or is it for all tasks? From my observation it appears to be per task
> type.
>
> thanks
> Manish
>
>