Thread: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base




Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
2007-10-25 12:19:59
So I managed to get my fast InputFormat working. It does still use the
FS, but in such a way that it improves mapper startup by over 2X. And
last night I got a prototype working that allows the map task to run
under the JVM of the TaskTracker, rather than spawning a new JVM.

The initial performance looks really, really good. I just ran a
1000-map, single-input-record job (mappers doing no work, however) in a
one-master, one-slave setup, on my laptop. It completed in a couple
thousand seconds, or a couple of seconds per map. Earlier I did a
smaller 100-map job with a stable, quiesced system and it came in at
about 130 seconds.

So this prototype can start and end map tasks in 1-2 seconds, and should
scale flatly with respect to nodes in the setup.
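
The per-map figures follow from simple division of the job times quoted
above. A minimal sketch of that arithmetic, assuming "a couple thousand
seconds" means roughly 2000 s (the message gives no exact number) and
using a hypothetical helper name, secondsPerMap:

```java
// Rough per-map cost from the timings quoted in the message above.
// The 2000 s figure is an assumption standing in for "a couple thousand seconds".
class PerMapOverhead {

    /** Average wall-clock seconds spent per map task in a job. */
    static double secondsPerMap(double totalSeconds, int mapCount) {
        if (mapCount <= 0) {
            throw new IllegalArgumentException("mapCount must be positive");
        }
        return totalSeconds / mapCount;
    }

    public static void main(String[] args) {
        // 1000-map job, assumed ~2000 s total.
        System.out.println("1000 maps: " + secondsPerMap(2000.0, 1000) + " s/map");
        // 100-map job on the quiesced system, about 130 s total.
        System.out.println("100 maps:  " + secondsPerMap(130.0, 100) + " s/map");
    }
}
```

Both results fall inside the 1-2 second range stated in the message.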



From: "Owen O'Malley" <oomyahoo-inc.com>
Date: 10/24/2007 01:05 PM
To: hadoop-userlucene.apache.org
Subject: Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
Please respond to: hadoop-userlucene.apache.org




On Oct 24, 2007, at 12:42 PM, Doug Cutting wrote:

> Lance Amundsen wrote:
>> OK, that is encouraging.  I'll take another pass at it.  I succeeded
>> yesterday with an in-memory only InputFormat, but only after I
>> commented out some of the split referencing code, like the following
>> in MapTask.java
>>     if (instantiatedSplit instanceof FileSplit) {
>>       FileSplit fileSplit = (FileSplit) instantiatedSplit;
>>       job.set("map.input.file", fileSplit.getPath().toString());
>>       job.setLong("map.input.start", fileSplit.getStart());
>>       job.setLong("map.input.length", fileSplit.getLength());
>>     }
>
> Yes, that code should not exist, but it shouldn't affect you
> either. You should be subclassing InputSplit, not FileSplit, so
> this code shouldn't operate on your splits.

That code doesn't do anything if they are non-file splits, so it
absolutely shouldn't break anything. Applications depend on those
attributes to know which split they are working on, and there isn't a
better fix until we move to context objects. I know that non-file
splits work because there are unit tests to make sure they don't
break anything.

-- Owen
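
To illustrate the point made above, here is a minimal, self-contained
sketch of why an instanceof FileSplit guard leaves other split types
alone. The InputSplit, FileSplit, and InMemorySplit types below are
simplified stand-ins written for this example, not the Hadoop classes,
and describeSplit is a hypothetical helper that uses a plain Map where
MapTask would use the job configuration:

```java
import java.util.HashMap;
import java.util.Map;

class SplitGuardSketch {

    // Stand-in for illustration only; NOT org.apache.hadoop.mapred.InputSplit.
    interface InputSplit {
        long getLength();
    }

    // Stand-in for a file-backed split (path, start offset, length).
    static class FileSplit implements InputSplit {
        private final String path;
        private final long start;
        private final long length;

        FileSplit(String path, long start, long length) {
            this.path = path;
            this.start = start;
            this.length = length;
        }

        String getPath() { return path; }
        long getStart() { return start; }
        public long getLength() { return length; }
    }

    // Hypothetical in-memory split: implements InputSplit directly,
    // so it is not a FileSplit and the guard below skips it.
    static class InMemorySplit implements InputSplit {
        private final byte[] data;

        InMemorySplit(byte[] data) { this.data = data; }

        public long getLength() { return data.length; }
    }

    // Mirrors the shape of the quoted MapTask snippet: the map.input.*
    // attributes are set only when the split is file-backed.
    static Map<String, String> describeSplit(InputSplit split) {
        Map<String, String> conf = new HashMap<>();
        if (split instanceof FileSplit) {
            FileSplit fileSplit = (FileSplit) split;
            conf.put("map.input.file", fileSplit.getPath());
            conf.put("map.input.start", Long.toString(fileSplit.getStart()));
            conf.put("map.input.length", Long.toString(fileSplit.getLength()));
        }
        return conf;
    }

    public static void main(String[] args) {
        System.out.println(describeSplit(new FileSplit("/data/part-0", 0L, 128L)));
        System.out.println(describeSplit(new InMemorySplit(new byte[4])));
    }
}
```

With these stand-ins, the in-memory split yields an empty attribute map,
which matches the claim that the guarded block is a no-op for splits
that do not extend FileSplit.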



Re: InputFiles, Splits, Maps, Tasks Questions 1.3 Base
2007-10-25 12:38:23
I did a patch last year that got similar improvements while still using
an external process. (I really like the idea of keeping user code out of
the JobTracker and the TaskTracker. It makes things more stable.) See
HADOOP-249. It reuses the JVM for a task, which avoids the JVM restart
hit. This hit is really bad for cases such as yours. It also avoids the
performance hit of doing socket I/O for progress and task info, and
instead uses the process pipe, which also gives a big performance
improvement.

Unfortunately, it was never incorporated, and now the patch no longer
applies. It's really not a big change, but the Hadoop code path to spawn
the JVM is a bit convoluted, which made it hard to do the change and
makes it hard to bring the patch up to date.

ben
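
The general idea described above (one long-lived worker JVM that handles
several tasks and reports progress over its pipe instead of a socket)
can be sketched as follows. This is not the HADOOP-249 patch; the class
name, the one-task-id-per-line input, and the START/DONE report lines
are all assumptions made up for this example. In a real child process
the streams would be System.in and System.out; the demo uses in-memory
streams so that it runs on its own:

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;

class ReusableTaskWorker {

    /**
     * Reads one task id per line from 'in' and writes progress lines to
     * 'out' until the input is closed. Returns the number of tasks handled.
     * The line protocol here is hypothetical.
     */
    static int serve(InputStream in, OutputStream out) throws IOException {
        BufferedReader tasks =
                new BufferedReader(new InputStreamReader(in, StandardCharsets.UTF_8));
        PrintStream report = new PrintStream(out, true, "UTF-8");
        int handled = 0;
        String taskId;
        while ((taskId = tasks.readLine()) != null) {
            if (taskId.isEmpty()) {
                continue; // ignore blank lines
            }
            report.println("START " + taskId);
            // The actual map work for this task would run here.
            report.println("DONE " + taskId);
            handled++;
        }
        return handled;
    }

    public static void main(String[] args) throws IOException {
        // Stand-in for the parent's side of the pipe: two tasks, then end of input.
        InputStream in = new ByteArrayInputStream(
                "task_0001\ntask_0002\n".getBytes(StandardCharsets.UTF_8));
        ByteArrayOutputStream out = new ByteArrayOutputStream();

        int handled = serve(in, out);

        System.out.print(out.toString("UTF-8"));
        System.out.println("handled " + handled + " tasks in one JVM");
    }
}
```

The loop only exits when its input closes, so the JVM startup cost is
paid once per worker rather than once per task, and user code still
stays outside the TaskTracker process.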

On Thursday 25 October 2007 10:19:59 Lance Amundsen wrote:
> So I managed to get my fast InputFormat working. It does still use the
> FS, but in such a way that it improves mapper startup by over 2X. And
> last night I got a prototype working that allows the map task to run
> under the JVM of the TaskTracker, rather than spawning a new JVM.
>
> The initial performance looks really, really good. I just ran a
> 1000-map, single-input-record job (mappers doing no work, however) in a
> one-master, one-slave setup, on my laptop. It completed in a couple
> thousand seconds, or a couple of seconds per map. Earlier I did a
> smaller 100-map job with a stable, quiesced system and it came in at
> about 130 seconds.
>
> So this prototype can start and end map tasks in 1-2 seconds, and should
> scale flatly with respect to nodes in the setup.


