hmm.. interesting... these are supposed to be the output
from mappers
(and default reducers since I didn't specify any for those
jobs)...
but shouldn't the number of reducers match the number of
mappers? If
there was only one reducer, it would mean I only had one
mapper task
running?? That is why I asked my question in the first
place, because
I suspect my jobs were not being running in parallel.
On Jan 16, 2008 11:42 AM, Ted Dunning <tdunning veoh.com> wrote:
>
> The part nomenclature does not refer to splits. It
refers to how many
> reduce processes were involved in actually writing the
output file. Files
> are split at read-time as necessary.
>
> You will get more of them if you have more reducers.
>
>
>
> On 1/16/08 8:25 AM, "Jim the Standing Bear"
<standingbear gmail.com> wrote:
>
> > Thanks Ted. I just didn't ask it right. Here is
a stupid 101
> > question, which I am sure the answer lies in the
documentation
> > somewhere, just that I was having some
difficulties in finding it...
> >
> > when I do an "ls" on the dfs, I would
see this:
> > /user/bear/output/part-00000 <r 4>
> >
> > I probably got confused on what the part-#####
means... I thought
> > part-##### tells how many splits a file has... so
far, I have only
> > seen part-00000. When will it have part-00001,
00002, etc?
> >
> >
> >
> > On Jan 16, 2008 11:04 AM, Ted Dunning
<tdunning veoh.com> wrote:
> >>
> >>
> >> Parallelizing the processing of data occurs at
two steps. The first is
> >> during the map phase where the input data file
is (hopefully) split across
> >> multiple tasks. This should happen
transparently most of the time unless
> >> you have a perverse data format or use
unsplittable compression on your
> >> file.
> >>
> >> This parallelism can occur whether you have
one input file or many.
> >>
> >> The second level of parallelism is at reduce
phase. You set this by setting
> >> the number of reducers. This will also
determine the number of output files
> >> that you get.
> >>
> >> Depending on your algorithm, it may help or
hurt to have one or many
> >> reducers. The recent example of a program to
find the 10 largest elements
> >> is an example that pretty much requires a
single reducer. Other programs
> >> where the mapper produces huge amounts of
output would be better served by
> >> having many reducers.
> >>
> >> This is a general answer since the question is
kind of non-specific.
> >>
> >>
> >>
> >> On 1/16/08 7:59 AM, "Jim the Standing
Bear" <standingbear gmail.com> wrote:
> >>
> >>> Hi,
> >>>
> >>> How do I make hadoop split its output?
The program I am writing
> >>> crawls a catalog tree from a single url,
so initially the input
> >>> contains only one entry. after a few
iterations, it will have tens of
> >>> thousands of urls. But what I noticed is
that the file is always in
> >>> one block (part-00000). What I would
like to have is once the number
> >>> of entries increases, it can parallelize
the job. Currently it
> >>> doesn't seem to be case.
> >>
> >>
> >
> >
>
>
--
--------------------------------------
Standing Bear Has Spoken
--------------------------------------
|