Thread: Re: a question on number of parallel tasks




Re: a question on number of parallel tasks
2008-01-16 10:42:09
The part nomenclature does not refer to splits.  It refers to how many
reduce processes were involved in actually writing the output file.  Files
are split at read-time as necessary.

You will get more of them if you have more reducers.
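
To make the reducer/part-file relationship concrete, here is a rough sketch in plain Java with no Hadoop on the classpath. The class and method names (PartFileSketch, partitionFor, partFileName, assign) and the example URLs are made up for illustration. The partition arithmetic mirrors what Hadoop's default hash partitioner does, and in a real job the reducer count would be set with something like JobConf.setNumReduceTasks(n). It assumes every reduce task writes its own part file, even an empty one.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: models which part-NNNNN file each key would end up in.
class PartFileSketch {

    // Mask off the sign bit, then take the remainder by the reducer count.
    static int partitionFor(String key, int numReducers) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReducers;
    }

    // Reducer i writes the file part-0000i (five digits, zero padded).
    static String partFileName(int partition) {
        return String.format(Locale.ROOT, "part-%05d", partition);
    }

    // Group keys by the output file of the reducer that would receive them.
    static Map<String, List<String>> assign(List<String> keys, int numReducers) {
        Map<String, List<String>> files = new TreeMap<>();
        // Assumption: each reducer creates its file even if it gets no keys.
        for (int r = 0; r < numReducers; r++) {
            files.put(partFileName(r), new ArrayList<>());
        }
        for (String key : keys) {
            files.get(partFileName(partitionFor(key, numReducers))).add(key);
        }
        return files;
    }

    public static void main(String[] args) {
        // Hypothetical crawl output keys.
        List<String> urls = Arrays.asList(
                "http://example.org/a", "http://example.org/b",
                "http://example.org/c", "http://example.org/d");

        // One reducer: everything lands in part-00000.
        System.out.println(assign(urls, 1).keySet()); // [part-00000]

        // Four reducers: four part files, however the keys happen to hash.
        System.out.println(assign(urls, 4).keySet());
    }
}
```

With one reducer there is only ever part-00000, no matter how many map tasks ran or how many URLs came out of them.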


On 1/16/08 8:25 AM, "Jim the Standing Bear" <standingbeargmail.com> wrote:

> Thanks Ted.  I just didn't ask it right.  Here is a stupid 101
> question, which I am sure the answer lies in the documentation
> somewhere, just that I was having some difficulties in finding it...
> 
> when I do an "ls" on the dfs,  I would see this:
> /user/bear/output/part-00000 <r 4>
> 
> I probably got confused on what the part-##### means... I thought
> part-##### tells how many splits a file has... so far, I have only
> seen part-00000.  When will it have part-00001, 00002, etc?
> 
> 
> 
> On Jan 16, 2008 11:04 AM, Ted Dunning <tdunningveoh.com> wrote:
>> 
>> 
>> Parallelizing the processing of data occurs at two steps.  The first is
>> during the map phase where the input data file is (hopefully) split across
>> multiple tasks.  This should happen transparently most of the time unless
>> you have a perverse data format or use unsplittable compression on your
>> file.
>> 
>> This parallelism can occur whether you have one input file or many.
>> 
>> The second level of parallelism is at reduce phase.  You set this by setting
>> the number of reducers.  This will also determine the number of output files
>> that you get.
>> 
>> Depending on your algorithm, it may help or hurt to have one or many
>> reducers.  The recent example of a program to find the 10 largest elements
>> is an example that pretty much requires a single reducer.  Other programs
>> where the mapper produces huge amounts of output would be better served by
>> having many reducers.
>> 
>> This is a general answer since the question is kind of non-specific.
>> 
>> 
>> 
>> On 1/16/08 7:59 AM, "Jim the Standing Bear" <standingbeargmail.com> wrote:
>> 
>>> Hi,
>>> 
>>> How do I make hadoop split its output?  The program I am writing
>>> crawls a catalog tree from a single url, so initially the input
>>> contains only one entry.  after a few iterations, it will have tens of
>>> thousands of urls.  But what I noticed is that the file is always in
>>> one block (part-00000).   What I would like to have is once the number
>>> of entries increases, it can parallelize the job.  Currently it
>>> doesn't seem to be case.
>> 
>> 
> 
> 


Re: a question on number of parallel tasks
2008-01-16 10:48:00
hmm.. interesting... these are supposed to be the output from mappers
(and default reducers since I didn't specify any for those jobs)...
but shouldn't the number of reducers match the number of mappers?  If
there was only one reducer, it would mean I only had one mapper task
running??  That is why I asked my question in the first place, because
I suspect my jobs were not being run in parallel.
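
For what it's worth, a back-of-the-envelope way to picture the two counts, written as a plain Java sketch with no Hadoop involved. The names (TaskCountSketch, mapTasks, outputFiles) and the 64 MB split size are assumptions for illustration. The model is deliberately simplified: it treats the input as a single splittable file and ignores record boundaries and any tuning knobs. The point is that the map count follows from the input size, while the reduce count is a separate job setting (1 unless changed, as far as I know), so a single part file does not by itself say how many mappers ran.

```java
// Illustrative only: a simplified model of how many tasks a job gets.
class TaskCountSketch {

    // Roughly one map task per split: ceil(inputBytes / splitBytes).
    static long mapTasks(long inputBytes, long splitBytes) {
        if (inputBytes <= 0) {
            return 0;
        }
        return (inputBytes + splitBytes - 1) / splitBytes;
    }

    // One output file (part-NNNNN) per reduce task.
    static int outputFiles(int numReducers) {
        return numReducers;
    }

    public static void main(String[] args) {
        long split = 64L * 1024 * 1024; // assumed split size of 64 MB
        int reducers = 1;               // assumed: reducer count left at its default

        // A tiny input (a few hundred bytes of URLs) fits in one split.
        System.out.println(mapTasks(200L, split) + " map task(s), "
                + outputFiles(reducers) + " output file(s)");
        // prints "1 map task(s), 1 output file(s)"

        // A 1 GB input gets many map tasks but still one output file.
        System.out.println(mapTasks(1024L * 1024 * 1024, split) + " map task(s), "
                + outputFiles(reducers) + " output file(s)");
        // prints "16 map task(s), 1 output file(s)"
    }
}
```

Under these assumptions, an input of a few tens of thousands of URLs would still fit in one split, so a single mapper is plausible here regardless of the reducer count.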




-- 
--------------------------------------
Standing Bear Has Spoken
--------------------------------------
