"Owen O'Malley" <oom yahoo-inc.com> wrote on 07/30/2007 02:31:04 PM:
>
> On Jul 26, 2007, at 5:02 PM, John M Cieslewicz wrote:
> > The combiner semantics, however, are the same as the reducer’s and
> > there is nothing to prevent a programmer from implementing a
> > combiner that changes the value of the key or outputs more or less
> > than one key-value pair.
>
> The combiner and reducer share an interface. However, the semantics
> are different. In particular,
>   1. Combiners may be invoked once or many times on each of the map
>      outputs, while reduces will be invoked exactly once on each key.
>   2. As a result of that, combiners effectively can not have side
>      effects, while reduces can.
>   3. Reduces can emit different types than their inputs, combiners
>      can not.
>   4. Reduces can change the key, while combiners are required not
>      to. Currently this is not checked dynamically, although it
>      should be. (Things will break badly if combiners do this...)

Rather than checking this dynamically, I think it would be easier, and
clearer for the programmer writing a combiner, to define the combiner
interface along the lines suggested by Doug:

  public interface Combiner {
    /** Combine all values passed into a single value that is returned. */
    public Writable combine(WritableComparable key, Iterator values);
  }

If such a change seems reasonable, I would be happy to implement the
necessary changes.
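
To make the proposal concrete, here is a minimal sketch of a combiner written against an interface of that shape. It is only an illustration: SingleValueCombiner, SumCombiner and SumCombinerDemo are hypothetical names, and the type parameters K and V stand in for WritableComparable and Writable so that the sketch runs without Hadoop on the classpath.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical stand-in for the proposed interface: K and V replace
// WritableComparable and Writable so the sketch runs without Hadoop.
interface SingleValueCombiner<K, V> {
    /** Combine all values passed into a single value that is returned. */
    V combine(K key, Iterator<V> values);
}

// Example combiner: sums the partial counts seen for a key.
class SumCombiner implements SingleValueCombiner<String, Long> {
    @Override
    public Long combine(String key, Iterator<Long> values) {
        long sum = 0;
        while (values.hasNext()) {
            sum += values.next();
        }
        return sum;
    }
}

class SumCombinerDemo {
    public static void main(String[] args) {
        SumCombiner c = new SumCombiner();
        // Combining all three values in one pass...
        Long once = c.combine("word", Arrays.asList(1L, 1L, 1L).iterator());
        // ...or in two passes yields the same partial aggregate.
        Long partial = c.combine("word", Arrays.asList(1L, 1L).iterator());
        Long twice = c.combine("word", Arrays.asList(partial, 1L).iterator());
        System.out.println(once + " " + twice); // prints "3 3"
    }
}
```

Because the method returns exactly one value and never sees a way to emit a key, the "do not change the key" and "one output per key" rules hold by construction rather than by a runtime check.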
>
> Note that currently Hadoop invokes the combiner exactly once. There
> is a jira issue filed to fix that. *smile*
>
Could you point me to the jira issue related to this? I have already
implemented some things that are potentially related to it, such as
combining across map spills during the merge at the completion of a
map task.
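
As a rough illustration of what I mean by combining across spills (this is a sketch under simplifying assumptions, not the actual code): each spill is modeled as a key-sorted in-memory list, the combiner is reduced to a java.util.function.BinaryOperator for brevity, and SpillMergeSketch, Cursor and mergeAndCombine are hypothetical names.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Map;
import java.util.PriorityQueue;
import java.util.function.BinaryOperator;

class SpillMergeSketch {
    // Cursor over one key-sorted spill: the current head record plus the rest.
    private static final class Cursor {
        Map.Entry<String, Long> head;
        final Iterator<Map.Entry<String, Long>> rest;

        Cursor(Iterator<Map.Entry<String, Long>> it) {
            rest = it;
            head = it.next();
        }

        boolean advance() {
            if (!rest.hasNext()) return false;
            head = rest.next();
            return true;
        }
    }

    /** Merge key-sorted spills, combining values of equal keys into one record. */
    static List<Map.Entry<String, Long>> mergeAndCombine(
            List<List<Map.Entry<String, Long>>> spills,
            BinaryOperator<Long> combiner) {
        // Min-heap ordered by the key at the head of each spill.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> a.head.getKey().compareTo(b.head.getKey()));
        for (List<Map.Entry<String, Long>> spill : spills) {
            if (!spill.isEmpty()) heap.add(new Cursor(spill.iterator()));
        }
        List<Map.Entry<String, Long>> out = new ArrayList<>();
        String currentKey = null;
        Long currentValue = null;
        while (!heap.isEmpty()) {
            Cursor c = heap.poll();
            String key = c.head.getKey();
            Long value = c.head.getValue();
            if (key.equals(currentKey)) {
                // Same key as the running group: fold into the partial value.
                currentValue = combiner.apply(currentValue, value);
            } else {
                // Key changed: emit the finished group and start a new one.
                if (currentKey != null) out.add(new SimpleEntry<>(currentKey, currentValue));
                currentKey = key;
                currentValue = value;
            }
            if (c.advance()) heap.add(c);
        }
        if (currentKey != null) out.add(new SimpleEntry<>(currentKey, currentValue));
        return out;
    }

    public static void main(String[] args) {
        List<Map.Entry<String, Long>> spill1 = Arrays.asList(
            new SimpleEntry<>("a", 1L), new SimpleEntry<>("b", 2L));
        List<Map.Entry<String, Long>> spill2 = Arrays.asList(
            new SimpleEntry<>("a", 3L), new SimpleEntry<>("c", 4L));
        // prints "[a=4, b=2, c=4]"
        System.out.println(mergeAndCombine(Arrays.asList(spill1, spill2), Long::sum));
    }
}
```

Note that the values being merged here may already be combiner output from the individual spills, so this is exactly the case where the combiner ends up applied more than once to the same data.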
> > This leads to a number of limitations, chief among them the fact
> > that the combiner cannot be applied more than once because there
> > are no guarantees regarding the effects of repeatedly using the
> > combiner (as implemented, the combiner could produce more than one
> > output pair or change the key).
>
> As I said in the previous point, the combiner can be invoked more
> than once and should be. It currently does not. Applications are
> required to keep the combiners pure. I hope it does not break too
> many applications when we fix this.
>
> > A summary of desirable semantics:
> >   1 The map function produces as output partial aggregate values
> >     representing singletons.
> >   2 A new combiner function that explicitly performs partial to
> >     partial aggregation over one or more values, creating one new
> >     output value of the same type as the input value and not
> >     changing the key.
> >   3 A reducer which takes as input partial aggregates and produces
> >     final values of any format.
>
> Basically, we already have this, except that we allow the combiner to
> emit multiple records. Multiple records out of the combiner is not as
> clean, but in practice I don't think it hurts anything.
>
A potential problem with allowing a combiner to output multiple
records is that it could break future optimizations. With a combiner
defined, one could, for instance, pipeline some combining within the
reducer using a means other than sorting, such as a tree or hash
table. In those cases, one might require the combiner to produce a
single new value for the given key.
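
A small sketch of the hash-table case may make the concern clearer. It is hypothetical throughout: OneValueCombiner, HashCombiningBuffer and HashCombiningDemo are made-up names, and K and V stand in for Hadoop's key and value types so the code runs on its own.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Iterator;
import java.util.Map;

// Hypothetical single-output combiner; K and V stand in for Hadoop's types.
interface OneValueCombiner<K, V> {
    V combine(K key, Iterator<V> values);
}

// Hash-based partial aggregation: at most one partial value is held per key.
class HashCombiningBuffer<K, V> {
    private final Map<K, V> partials = new HashMap<>();
    private final OneValueCombiner<K, V> combiner;

    HashCombiningBuffer(OneValueCombiner<K, V> combiner) {
        this.combiner = combiner;
    }

    /** Fold an incoming record into the stored partial for its key. */
    void add(K key, V value) {
        V existing = partials.get(key);
        if (existing == null) {
            partials.put(key, value);
        } else {
            // This works only because combine returns exactly one value for
            // the same key; several output records would not fit in the slot.
            partials.put(key,
                combiner.combine(key, Arrays.asList(existing, value).iterator()));
        }
    }

    Map<K, V> partials() {
        return partials;
    }
}

class HashCombiningDemo {
    public static void main(String[] args) {
        // Example combiner: keeps the maximum value seen for a key.
        OneValueCombiner<String, Long> max = (key, values) -> {
            long best = Long.MIN_VALUE;
            while (values.hasNext()) best = Math.max(best, values.next());
            return best;
        };
        HashCombiningBuffer<String, Long> buf = new HashCombiningBuffer<>(max);
        buf.add("a", 3L);
        buf.add("b", 1L);
        buf.add("a", 7L);
        System.out.println(buf.partials().get("a")); // prints "7"
    }
}
```

If the combiner were free to emit several records, or a record under a different key, there would be no single slot in the table to store its output, which is why such a structure would want the stricter contract.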
-John