Thread: Created: (HADOOP-1965) Handle map output buffers better




Created: (HADOOP-1965) Handle map output buffers better
2007-09-28 08:20:50
Handle map output buffers better
--------------------------------

                 Key: HADOOP-1965
                 URL: https://issues.apache.org/jira/browse/HADOOP-1965
             Project: Hadoop
          Issue Type: Improvement
          Components: mapred
            Reporter: Devaraj Das
            Assignee: Amar Kamat


Today, the map task stops calling the map method while
sort/spill is using the (single instance of the) map output
buffer. One way to improve the performance of the map task
is to have a second buffer for writing the map outputs to
while sort/spill is using the first buffer.
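
To make the proposal concrete, here is a minimal sketch of the
two-buffer idea in plain Java. It is not Hadoop's MapTask code:
the class name DoubleBufferedCollector, the record-count capacity
(a stand-in for a byte limit), and the in-memory list of sorted
runs (a stand-in for spill files) are all assumptions made for
illustration.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.concurrent.ArrayBlockingQueue;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

/** Hypothetical sketch of double-buffered map output; not Hadoop's actual code. */
class DoubleBufferedCollector {
    private final int capacity;  // records per buffer (stand-in for a byte limit)
    private final BlockingQueue<List<String>> free = new ArrayBlockingQueue<>(2);
    private final ExecutorService spiller = Executors.newSingleThreadExecutor();
    private final List<List<String>> spills =
            Collections.synchronizedList(new ArrayList<List<String>>());
    private List<String> active;

    DoubleBufferedCollector(int capacity) {
        this.capacity = capacity;
        this.active = new ArrayList<>(capacity);
        free.add(new ArrayList<String>(capacity));  // the second buffer
    }

    /** Called by the map side for each output record. */
    void collect(String record) throws InterruptedException {
        if (active.size() == capacity) {
            List<String> full = active;
            // Blocks only if the other buffer is still being sorted/spilled.
            active = free.take();
            spillAsync(full, true);
        }
        active.add(record);
    }

    private void spillAsync(final List<String> buf, final boolean recycle) {
        spiller.execute(() -> {
            List<String> run = new ArrayList<>(buf);
            Collections.sort(run);   // stand-in for sort (+ combine)
            spills.add(run);         // stand-in for writing a spill file
            if (recycle) {
                buf.clear();
                free.add(buf);       // hand the emptied buffer back to the map side
            }
        });
    }

    /** Spills the last partial buffer and waits for the spill thread to finish. */
    List<List<String>> close() throws InterruptedException {
        if (!active.isEmpty()) {
            spillAsync(active, false);
        }
        spiller.shutdown();
        spiller.awaitTermination(1, TimeUnit.MINUTES);
        return new ArrayList<>(spills);
    }

    /** Convenience driver: feeds all records through and returns the sorted runs. */
    static List<List<String>> runAll(int capacity, String... records) {
        DoubleBufferedCollector c = new DoubleBufferedCollector(capacity);
        try {
            for (String r : records) {
                c.collect(r);
            }
            return c.close();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch the map side stalls only when both buffers are
busy, that is, when it fills the second buffer before the spill
thread has finished with the first.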

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.


Commented: (HADOOP-1965) Handle map output buffers better
2007-09-28 11:22:50
    [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531078 ]

Runping Qi commented on HADOOP-1965:
------------------------------------


This may improve the mapper performance significantly when
the map output size is large (either due to mapper expansion
or due to input decompression, which is pretty common in the
applications I deal with).




Commented: (HADOOP-1965) Handle map output buffers better
2007-09-28 13:33:50
    [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531113 ]

Sameer Paranjpye commented on HADOOP-1965:
------------------------------------------

Is sorting/spilling done in a separate thread? If not,
adding a separate thread for combine/sort/spill is a
pre-requisite for this proposal.



Commented: (HADOOP-1965) Handle map output buffers better
2007-09-28 14:54:51
    [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531135 ]

Doug Cutting commented on HADOOP-1965:
--------------------------------------

How much better would this do than simply running more map
tasks per node at a time?  While one is spilling, another
can be mapping, no?  I guess this might improve task latency
a bit, but would it really help throughput much?



Commented: (HADOOP-1965) Handle map output buffers better
2007-09-28 15:35:50
    [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531151 ]

Runping Qi commented on HADOOP-1965:
------------------------------------

Keeping the map input small enough by running more mappers
will certainly avoid the problem of spills.
On the other hand, you cannot make the input size too small;
otherwise the overhead associated with task startup and
shuffling becomes significant.
And in practice, it is very hard to choose the right number
of mappers.




Commented: (HADOOP-1965) Handle map output buffers better
2007-10-02 14:43:50
    [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531891 ]

Sameer Paranjpye commented on HADOOP-1965:
------------------------------------------

We need to benchmark anything we implement for this issue.
The interesting use case is tasks that spill, since these
make no progress while the sort/spill is happening. Running
more total maps or more concurrent maps per node are also
options. But these seem like orthogonal strategies;
improving latency will help regardless, no?





Commented: (HADOOP-1965) Handle map output buffers better
2007-10-02 17:48:50
    [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531939 ]

Doug Cutting commented on HADOOP-1965:
--------------------------------------

> We need to benchmark anything we implement for this issue.

+1

> improving latency will help regardless, no?

Improving task latency alone will not improve job latency
much in most cases.  If we run more tasks per node than
there are CPU cores, and there are significantly more input
splits than task slots (as there normally should be) then
job latency might not be improved much.

I'm not arguing that this won't help at all, rather that it
might not help much.  But then again, it might.  It's
certainly worth a try.
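
The distinction between task latency and job throughput can be
illustrated with a toy model. This is only a sketch: the class
and method names are hypothetical, the numbers in the usage note
are invented, and it treats sort/spill as pure CPU work, which
ignores the disk I/O that a real spill performs.

```java
/** Toy model of task latency vs. saturated-node throughput; all inputs are hypothetical. */
class LatencyVsThroughput {
    /** Wall-clock seconds for one task running alone on an otherwise idle node. */
    static double taskSeconds(double mapSec, double spillSec, boolean overlapped) {
        // Overlapping hides the shorter phase behind the longer one.
        return overlapped ? Math.max(mapSec, spillSec) : mapSec + spillSec;
    }

    /**
     * Lower bound on job seconds when enough tasks run concurrently to keep
     * every core busy: total CPU work divided by the number of cores. How the
     * work is arranged inside a single task does not change this bound.
     */
    static double saturatedJobSeconds(int tasks, double mapSec, double spillSec, int cores) {
        return tasks * (mapSec + spillSec) / cores;
    }
}
```

For example, with 60 s of map work and 40 s of sort/spill work
per task, overlapping cuts a lone task from 100 s to 60 s, while
the saturated bound for 100 tasks on 4 cores stays at 2500 s
either way. Under these assumptions the threaded spill helps most
when cores would otherwise sit idle.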




Updated: (HADOOP-1965) Handle map output buffers better
2007-10-17 06:38:50
     [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Amar Kamat updated HADOOP-1965:
-------------------------------

    Attachment: 1965_single_proc_150mb_gziped.pdf

I am attaching results that compare a threaded spill [ two
DataOutputBuffers (each half the size of io.sort.mb) and a
separate thread for sorting and spilling ] with a sequential
spill [ as done now ], for the case where the data is
non-splittable and must be consumed by a single map task.
The setup is as follows:
* Data source: Random text using random-text-writer
* Key size : 10 words
* Data size : ~150mb
* Input type : Gzip
* DFS-block-size : 200mb
* # nodes : 1
* Job type : wordcount
* io.sort.mb : {5,15,25,50,75,100,125}
----
Results: see the attachment 1965_single_proc_150mb_gziped.pdf
----
Comments?
Kindly let me know if a text format of the comparison is
required.
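
As a rough aid to reading the setup above, the following sketch
estimates how many spills each configuration would trigger. It
rests on assumptions not stated in this thread: the map output
size (150 MB here) is a placeholder rather than a measured value,
a spill is assumed to happen exactly when a buffer fills, and
per-record bookkeeping overhead is ignored. The class and method
names are hypothetical.

```java
/** Back-of-the-envelope spill counts; the output size is an assumed placeholder. */
class SpillCountEstimate {
    /** Number of buffer-fulls needed to hold outputMb of map output. */
    static long spills(double outputMb, double bufferMb) {
        return (long) Math.ceil(outputMb / bufferMb);
    }

    public static void main(String[] args) {
        int[] ioSortMb = {5, 15, 25, 50, 75, 100, 125};  // values from the setup
        double outputMb = 150.0;                          // assumption, not measured
        for (int mb : ioSortMb) {
            long sequential = spills(outputMb, mb);        // one buffer of io.sort.mb
            long threaded = spills(outputMb, mb / 2.0);    // two buffers of half size
            System.out.printf("io.sort.mb=%d sequential=%d threaded=%d%n",
                    mb, sequential, threaded);
        }
    }
}
```

Because each threaded buffer is half of io.sort.mb, the threaded
variant spills about twice as often, but each spill sorts a
smaller run and can overlap with the map calls.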



Commented: (HADOOP-1965) Handle map output buffers better
2007-10-17 07:12:51
    [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12535529 ]

Runping Qi commented on HADOOP-1965:
------------------------------------


It seems clear that threaded spill performed much better
than sequential spill.

One surprising thing is that the spill times got worse as
io.sort.mb increased.

This sounds counterintuitive. Any insights/explanations?




Commented: (HADOOP-1965) Handle map output buffers better
2007-10-23 04:12:50
    [ https://issues.apache.org/jira/browse/HADOOP-1965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12536948 ]

Amar Kamat commented on HADOOP-1965:
------------------------------------

A breakdown of the sort and combine+spill timings is attached
in 1965_single_proc_150mb_gziped_breakup.png. The
observations are as follows:
* The time required for sort and spill is proportional to
io.sort.mb.
* Sort timings increase rapidly as compared to combine+spill
with an increase in io.sort.mb.
Sort becomes a bottleneck as io.sort.mb is increased and
performs worse than spill, causing the overall map timings
to increase with io.sort.mb.
----
A test was made to see the effects of having multiple map
tasks running concurrently on the same node. The observation
is that, in this case, the performance gained by using a
threaded sort-spill over a sequential one is not very
significant compared to the case of running a single map
task per node.


