Thread: Created: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce




Created: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
2007-10-02 15:57:50
Add support for a general serialization mechanism for Map Reduce
----------------------------------------------------------------

                 Key: HADOOP-1986
                 URL: https://issues.apache.org/jira/browse/HADOOP-1986
             Project: Hadoop
          Issue Type: New Feature
          Components: mapred
            Reporter: Tom White
             Fix For: 0.16.0


Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.


Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
2007-10-02 16:01:52
    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531912 ]

Tom White commented on HADOOP-1986:
-----------------------------------

Here's an initial plan:

1. Remove the requirement for Map Reduce types to extend
WritableComparable/Writable. So for example Mapper would
become:


public interface Mapper<K1, V1, K2, V2>


2. Create a serialization class that can turn objects into
byte streams and vice versa. Something like:


public interface Serializer<T> {
  void serialize(T t, OutputStream out) throws IOException;
  void deserialize(T t, InputStream in) throws IOException;
}


3. Add a configuration property to specify the Serializer to
use. This would default to WritableSerializer (an
implementation of Serializer<Writable>).

4. Change the type of the output key comparator to be
Comparator<T>, with default WritableComparator (an
implementation of Comparator<Writable>).

5. In MapTask use a Serializer to write map outputs to the
output files.

6. In ReduceTask use a Serializer to read sorted map outputs
for the reduce phase.

I've played with some of this and it looks like it would
work, however there is a problem with the Serializer
interface above as it stands. The serialize method of
WritableSerializer looks like this:


public void serialize(Writable w, OutputStream out)
    throws IOException {
  w.write(new DataOutputStream(out));
}


Clearly it is not acceptable to create a new object on every
write. This is a general problem - Writables write to a
DataOutputStream, Thrift objects write to a TProtocol, etc.
So the solution is probably to make the Serializer stateful,
having the same lifetime as the wrapped stream. Something
like:


public interface Serializer<T> {
  void open(OutputStream out);
  void serialize(T t);
  void close();
}

(There would be a similar Deserializer interface.)

Could this work?
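
A minimal sketch of the stateful variant, to make the lifetime
argument concrete. The Writable interface here is a cut-down
stand-in for org.apache.hadoop.io.Writable, IntRecord and writeAll
are hypothetical helpers, and the IOException clauses on the
interface methods are an assumption that the comment above does
not spell out. The point is only that the DataOutputStream is
created once in open() and reused for every record.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;

final class StatefulSerializerSketch {

  /** Stand-in for org.apache.hadoop.io.Writable (write side only). */
  interface Writable {
    void write(DataOutput out) throws IOException;
  }

  /** The stateful interface from the comment; the throws clauses are assumed. */
  interface Serializer<T> {
    void open(OutputStream out) throws IOException;
    void serialize(T t) throws IOException;
    void close() throws IOException;
  }

  /** Wraps the stream once in open() and reuses the wrapper for every record. */
  static final class WritableSerializer implements Serializer<Writable> {
    private DataOutputStream dataOut;

    @Override
    public void open(OutputStream out) {
      dataOut = new DataOutputStream(out);
    }

    @Override
    public void serialize(Writable w) throws IOException {
      w.write(dataOut);
    }

    @Override
    public void close() throws IOException {
      dataOut.close();
    }
  }

  /** Hypothetical record type used only for the demonstration. */
  static final class IntRecord implements Writable {
    private final int value;

    IntRecord(int value) {
      this.value = value;
    }

    @Override
    public void write(DataOutput out) throws IOException {
      out.writeInt(value);
    }
  }

  /** Serializes all values through a single serializer and returns the bytes. */
  static byte[] writeAll(int... values) {
    ByteArrayOutputStream bytes = new ByteArrayOutputStream();
    WritableSerializer serializer = new WritableSerializer();
    try {
      serializer.open(bytes);
      for (int v : values) {
        serializer.serialize(new IntRecord(v));
      }
      serializer.close();
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return bytes.toByteArray();
  }
}
```

A Deserializer would presumably mirror this with open(InputStream),
a read method and close().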



Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
2007-10-02 16:24:50
    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531916 ]

Owen O'Malley commented on HADOOP-1986:
---------------------------------------

Actually, I'd probably set it up so that you could configure
the list of Serializers with something like:


<property>
  <name>hadoop.serializers</name>
  <value>org.apache.hadoop.io.WritableSerializer,org.apache.hadoop.io.ThriftSerializer</value>
</property>


and a serializer could also have a target class:


public interface Serializer<T> {
  void serialize(T t, OutputStream out) throws IOException;
  void deserialize(T t, InputStream in) throws IOException;
  // Get the base class that this serializer will work on
  Class getTargetClass();
}
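
One way the configured list and getTargetClass() could fit
together is sketched below. It is an illustration only: the
Serializer interface is reduced to the single method the lookup
needs, Writable, Text, Record and RecordSerializer are
hypothetical stand-ins, and the rule "first entry in the list
whose target class is a supertype wins" is an assumption rather
than something the comment specifies.

```java
import java.util.ArrayList;
import java.util.List;

final class SerializerLookupSketch {

  /** Reduced to the method the lookup needs; serialize/deserialize omitted. */
  interface Serializer<T> {
    Class<T> getTargetClass();
  }

  /** Hypothetical stand-ins for two unrelated type hierarchies. */
  interface Writable {}
  interface Record {}

  static final class Text implements Writable {}

  static final class WritableSerializer implements Serializer<Writable> {
    @Override
    public Class<Writable> getTargetClass() {
      return Writable.class;
    }
  }

  static final class RecordSerializer implements Serializer<Record> {
    @Override
    public Class<Record> getTargetClass() {
      return Record.class;
    }
  }

  static final class SerializerList {
    private final List<Serializer<?>> serializers = new ArrayList<>();

    /** Instantiates each class named in a comma-separated property value. */
    static SerializerList load(String classNames) {
      SerializerList list = new SerializerList();
      for (String name : classNames.split(",")) {
        try {
          Object instance =
              Class.forName(name.trim()).getDeclaredConstructor().newInstance();
          list.serializers.add((Serializer<?>) instance);
        } catch (ReflectiveOperationException e) {
          throw new IllegalArgumentException("Cannot instantiate " + name, e);
        }
      }
      return list;
    }

    /** Returns the first serializer whose target class is a supertype of c, or null. */
    Serializer<?> find(Class<?> c) {
      for (Serializer<?> s : serializers) {
        if (s.getTargetClass().isAssignableFrom(c)) {
          return s;
        }
      }
      return null;
    }
  }
}
```

With this shape, a type that matches no configured target class
simply has no serializer, which is the situation a framework
without a common base class would be in.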




Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
2007-10-02 17:16:50
    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531930 ]

Tom White commented on HADOOP-1986:
-----------------------------------

Is the idea here to use the target class to decide which
Serializer to use? If so, it might not work too well if the
serialization framework doesn't have a base class or marker
interface (e.g. Thrift).

I was thinking that the Serializer would be specified per MR
job - which should be simpler.



Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
2007-10-02 17:46:50
    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531938 ]

Owen O'Malley commented on HADOOP-1986:
---------------------------------------

First, I'd argue that Thrift should have a top level
serialization interface... but clearly that belongs on their
development list. *smile*

But if the serializer is specific to the job, you wouldn't
be able to mix Writables and Thrift objects. If you wanted
to translate, for instance, you'd like:

map input: MyWritableKey, MyWritableValue
map output: MyThriftKey, MyThriftValue

If I have to have a single serializer for my job, that is a
pain. Of course, without a Thrift record super class, you
really can't write ThriftSerializer anyway. (You need a
standard interface to generate the bytes...)



Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
2007-10-02 18:09:51
    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531954 ]

Doug Cutting commented on HADOOP-1986:
--------------------------------------

> But if the serializer is specific to the job, you
> wouldn't be able to mix Writables and Thrift objects.

We need a serializer and deserializer specified per job so
that the mapred kernel can store intermediate data.  Then
the InputFormat may use a deserializer, and the OutputFormat
may use a serializer.  So I don't see that Tom's proposal
(at least not as I interpret it) prohibits such intermixing.
The job's serializer only applies to the map output.  The
InputFormat's deserializer would apply to the map input, and
the OutputFormat's serializer would apply to reduce
output.  Does that make sense?



Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
2007-10-02 18:57:51
    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531962 ]

Owen O'Malley commented on HADOOP-1986:
---------------------------------------

But it is strictly more powerful to allow a serializer per
class (or hierarchy). Furthermore, it means you only have to
configure the small number of serializers rather than worry
about which context you need to set which serializer for. I
think it is more confusing if you have to say:

FileInputFormat.setSerializer(conf, Bar.class);
job.setMapOutputSerializer(Foo.class);
FileOutputFormat.setSerializer(conf, Baz.class);

and it still would prevent you from mixing serializers
between keys and values. Unless you are proposing the even
more verbose:

FileInputFormat.setKeySerializer(conf, BarKey.class);
FileInputFormat.setValueSerializer(conf, BarValue.class);
job.setMapOutputKeySerializer(FooKey.class);
job.setMapOutputValueSerializer(FooValue.class);
FileOutputFormat.setKeySerializer(conf, BazKey.class);
FileOutputFormat.setValueSerializer(conf, BazValue.class);

I think the Serializers for a given type are constant,
rather than the Serializers for a given context being
constant.





Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
2007-10-02 19:50:50
    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12531975 ]

Doug Cutting commented on HADOOP-1986:
--------------------------------------

> FileInputFormat.setSerializer(conf, Bar.class);

I think what we'd ideally have is something like:

job.setInputFormat(SequenceFile<Foo,Bar>);

Which isn't Java.  So, yes, binding serializers by key/value
class would be nice.  But how we get there from here is the
question.  In particular, how can we handle something like
Thrift, whose instances don't all implement some interface?

> I think the Serializers for a given type are constant,
> rather than the Serializers for a given context being
> constant.

That sounds mostly reasonable.  Do you have a proposal for
how to implement this?

On a related note, I think that, if we go this way then,
within Hadoop, we should deprecate Writable and directly
implement the serializer API.  Moving away from implementing
serialization directly to the class permits us to work with
other serialization systems, but there's no need for us to
take the indirection hit internally.



Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
2007-10-03 15:35:51
    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532243 ]

Tom White commented on HADOOP-1986:
-----------------------------------

> Do you have a proposal for how to implement this?

If we follow Owen's suggestion then we can construct a map
of types to Serializer classes. Then, when running 
MapTask or ReduceTask we can use the map to instantiate an
appropriate Serializer for each of the key and the value
types.

> In particular, how can we handle something like Thrift,
> whose instances don't all implement some interface?

The target class would have to be Object. However, for this
to work we would need to have some notion of precedence so
more specific subtypes (like Writable) match first. Also,
this wouldn't allow you to use two different serialization
frameworks whose instances only have a common type of
Object. I'm not sure how much of a problem this would be in
practice though.
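
The precedence rule can be sketched as follows. This is only one
possible reading: serializers are represented by plain names
instead of Serializer classes, Writable, IntWritable and
ThriftPoint are hypothetical stand-ins, and "most specific"
is taken to mean a registered target that is a subtype of every
other registered target that also matches.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class SerializerPrecedenceSketch {

  /** Hypothetical stand-ins for the real types. */
  interface Writable {}

  static final class IntWritable implements Writable {}

  /** Stands in for a generated class with no common base type. */
  static final class ThriftPoint {}

  /** Maps target types to serializer names (names stand in for classes). */
  static final class SerializerMap {
    private final Map<Class<?>, String> byTarget = new LinkedHashMap<>();

    void register(Class<?> target, String serializerName) {
      byTarget.put(target, serializerName);
    }

    /**
     * Returns the serializer registered for the most specific target that
     * c is assignable to, or null if no target matches. If two unrelated
     * targets both match, the one registered first is kept.
     */
    String resolve(Class<?> c) {
      Class<?> best = null;
      for (Class<?> target : byTarget.keySet()) {
        if (!target.isAssignableFrom(c)) {
          continue;
        }
        // Prefer the new target only if it is a subtype of the current best.
        if (best == null || best.isAssignableFrom(target)) {
          best = target;
        }
      }
      return best == null ? null : byTarget.get(best);
    }
  }
}
```

Registering Object as a target makes it the fallback for every
type, so a second framework whose only common type is also Object
could not be told apart from the first, which is the limitation
noted above.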

(I just had a look at a Thrift class, generated with release
20070917, and it is tagged with java.io.Serializable. It
would be more useful though if it implemented an interface
that defined the read/write fields.)



Commented: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
2007-10-03 15:48:51
    [ https://issues.apache.org/jira/browse/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532251 ]

Owen O'Malley commented on HADOOP-1986:
---------------------------------------

I think to actually do something other than use Java
serialization, you pretty much need a superclass that
supports conversion to and from bytes. Realistically, with
the current Thrift implementation, you'd need to register a
serializer for each specific class. *Yuck* 

However, according to the Thrift developers, adding a base
class would be easy and they are considering doing it.


