List Info

Thread: RE: Poly-reduce?




RE: Poly-reduce?
user name
2007-08-24 13:49:59
Would be cool to get an option to reduce replication factor
for reduce
outputs.

Hard to buy the argument that there's gonna be no
performance win with
direct streaming between jobs. Currently reduce jobs start
reading map
outputs before all maps are complete - and I am sure this
results in
significant speedup. Using the same logic, streaming reduce
outputs to
the next map and reduce steps (before the first reduce is
complete)
should also provide speedup. 

If the streaming option were available, the programmer would
have a
clear choice: excellent best case/poor worst case
performance with
streaming or good best case/good worst case performance with
hdfs based
checkpointing. I think this is a choice that the job-writer
is competent
enough to make.


To Owen's reference to PIG - I am curious whether the PIG
codebase also
frequently chains multiple map-reduce jobs to perform a
single Pig
operation? (especially since my experience resulted from the
need to
write some complicated multi-way joins). Anyone from Pig
developer
community who can chime in?

Joydeep




-----Original Message-----
From: Doug Cutting [mailto:cuttingapache.org] 
Sent: Friday, August 24, 2007 9:54 AM
To: hadoop-userlucene.apache.org
Subject: Re: Poly-reduce?

Ted Dunning wrote:
> It isn't hard to implement these programs as multiple
fully fledged
> map-reduces, but it appears to me that many of them
would be better
> expressed as something more like a map-reduce-reduce
program.
> 
> [ ... ]
> 
> Expressed conventionally, this would have write all of
the user
sessions to
> HDFS and a second map phase would generate the pairs
for counting.
The
> opportunity for efficiency would come from the ability
to avoid
writing
> intermediate results to the distributed data store.
>     
> Has anybody looked at whether this would help and
whether it would be
hard
> to do?

It would job tracker more complicated, and might not help
job execution 
time that much.

Consider implementing this as multiple map reduce steps, but
using a 
replication level of one for intermediate data.  That would
mostly have 
the performance characteristics you want.  But if a node
died, things 
could not intelligently automatically re-create just the
missing data. 
Instead the application would have to re-run the entire job,
or subsets 
of it, in order to re-create the un-replicated data.

Under poly-reduce, if a node failed, all tasks that were
incomplete on 
that node would need to be restarted.  But first, their
input data would

need to be located.  If you saved all intermediate data in
the course of

a job (which would be expensive) then the inputs that need
re-creation 
would mostly just be those that were created on the failed
node.  But 
this failure would generally cascade all the way back to the
initial map

stage.  So a single machine failure in the last phase could
double the 
run time of the job, with most of the cluster idle.

If, instead, you used normal mapreduce, with intermediate
data 
replicated in the filesystem, a single machine failure in
the last phase

would only require re-running tasks from the last job.

Perhaps, when chaining mapreduces, one should use a lower
replication 
level for intermediate data, like two.  Additionally, one
might wish to 
relax the one-replica-off-rack criterion for such files, so
that 
replication is faster, and since whole-rack failures are
rare.  This 
might give good chained performance, but keep machine
failures from 
knocking tasks back to the start of the chain.  Currently
its not 
possible to disable the one-replica-off-rack preference, but
that might 
be a reasonable feature request.

Doug


Re: Poly-reduce?
country flaguser name
United States
2007-08-24 14:11:01
Joydeep Sen Sarma wrote:
> Would be cool to get an option to reduce replication
factor for reduce
> outputs.

This should be as simple as setting dfs.replication on a
job.  If that 
does not work, it's a bug.

> Hard to buy the argument that there's gonna be no
performance win with
> direct streaming between jobs. Currently reduce jobs
start reading map
> outputs before all maps are complete - and I am sure
this results in
> significant speedup. 

That's not exactly true.  No reduces can be performed until
all maps are 
complete.  However the shuffle (transfer/sort of
intermediate data) 
happens in parallel with mapping.

> Using the same logic, streaming reduce outputs to
> the next map and reduce steps (before the first reduce
is complete)
> should also provide speedup. 

Perhaps, but the bookeeping required in the jobtracker might
be onerous. 
  The failure modes are more complex, complicating
recovery.

> If the streaming option were available, the programmer
would have a
> clear choice: excellent best case/poor worst case
performance with
> streaming or good best case/good worst case performance
with hdfs based
> checkpointing. I think this is a choice that the
job-writer is competent
> enough to make.

Please feel free to try to implement this.  If you can
develop a patch 
which implements this reliably in maintainable code, then it
would 
probably be committed.

Doug

[1-2]

about | contact  Other archives ( Real Estate discussion Medical topics )