|
List Info
Thread: Created: (HADOOP-1986) Add support for a general serialization mechanism for Map Reduce
|
|
| Commented: (HADOOP-1986) Add support
for a general serialization mechanism
for Map Reduce |
  United States |
2007-10-18 16:04:51 |
[ https://issues.apache.org/jira/browse
/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536048 ]
Doug Cutting commented on HADOOP-1986:
--------------------------------------
I think I'd opt for another level of indirection, for two
reasons:
- to bind together serializers and deserializers, which are
almost always paired;
- to permit generation of more specialized serializers and
deserializers
E.g.,
<property>
<name>io.serialization</name>
<value>WritableSerialization</value>
</property>
public interface Serialization {
Serializer getSerializer();
Deserializer getDeserializer();
}
public class SerializationFactory {
public Serializer getSerializer(Class c) { return
getSerialization(c).getSerializer(); }
public Deserializer getDeserializer(Class c) { return
getSerialization(c).getDeserializer(); }
public getSerialization(Class c) { ... infer from c's
superclasses & interfaces ... }
}
> Add support for a general serialization mechanism for
Map Reduce
>
------------------------------------------------------------
----
>
> Key: HADOOP-1986
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java,
serializer-v1.patch
>
>
> Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-1986) Add support
for a general serialization mechanism
for Map Reduce |
  United States |
2007-10-18 16:18:51 |
[ https://issues.apache.org/jira/browse
/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536053 ]
Doug Cutting commented on HADOOP-1986:
--------------------------------------
Also, instead of using introspection, we might use a method,
e.g.:
public interface Serialization {
boolean accept(Class c);
Serializer getSerializer();
Deserializer getDeserializer();
}
SerializationFactory#getSerialization(Class) would then just
iterate through the defined serializations calling accept().
It could cache an instance of each defined serialization.
Note also that, for primitive types, we could pass things
like Integer.TYPE. So then one might even, e.g., define a
mapper that takes <int,long> pairs.
> Add support for a general serialization mechanism for
Map Reduce
>
------------------------------------------------------------
----
>
> Key: HADOOP-1986
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java,
serializer-v1.patch
>
>
> Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-1986) Add support
for a general serialization mechanism
for Map Reduce |
  United States |
2007-10-18 16:26:50 |
[ https://issues.apache.org/jira/browse
/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536055 ]
Owen O'Malley commented on HADOOP-1986:
---------------------------------------
Doug,
The boolean accept method would be fine. Your proposal
would let you split apart serializers from deserializers,
which doesn't seem very useful to me. If I use serializer X,
I pretty much always want to use deserializer X too. I don't
see the value is saying for type X use serializer Y and
deserializer Z.
> Add support for a general serialization mechanism for
Map Reduce
>
------------------------------------------------------------
----
>
> Key: HADOOP-1986
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java,
serializer-v1.patch
>
>
> Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-1986) Add support
for a general serialization mechanism
for Map Reduce |
  United States |
2007-10-18 16:41:51 |
[ https://issues.apache.org/jira/browse
/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536061 ]
Doug Cutting commented on HADOOP-1986:
--------------------------------------
Owen, sure, maybe something like:
<property>
<name>io.serialization.factories</name>
<value>WritableSerializerFactory</value>
</property>
public interface Serializer {
void serialize(Object, OutputStream);
Object deserialize(Object reuse, InputStream);
}
public interface SerializerFactory {
Serializer getSerializer(Class c);
}
public class Serialization {
public Serializer getSerializer(Class c, Configuration
conf) {
for (factory defined in conf) {
Serializer s = factory.getSerializer(c);
if (s != null)
return s;
}
}
}
> Add support for a general serialization mechanism for
Map Reduce
>
------------------------------------------------------------
----
>
> Key: HADOOP-1986
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java,
serializer-v1.patch
>
>
> Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-1986) Add support
for a general serialization mechanism
for Map Reduce |
  United States |
2007-10-18 18:46:51 |
[ https://issues.apache.org/jira/browse
/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536076 ]
Joydeep Sen Sarma commented on HADOOP-1986:
-------------------------------------------
question - what's the compatibility path for all
applications written with 'Writable'?
> Add support for a general serialization mechanism for
Map Reduce
>
------------------------------------------------------------
----
>
> Key: HADOOP-1986
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java,
serializer-v1.patch
>
>
> Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-1986) Add support
for a general serialization mechanism
for Map Reduce |
  United States |
2007-10-19 03:58:51 |
[ https://issues.apache.org/jira/browse
/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536148 ]
Tom White commented on HADOOP-1986:
-----------------------------------
Doug/Owen
These changes generally look good - I'll try to work them
into a new patch.
In the current patch Serializers and Deserializers are
stateful with open/close methods and that was the reason
that led me to separate them. We could combine them in a
single object, but this would be at the expense of muddying
the method names: (e.g. closeSerializer and
closeDeserializer), so I'm reluctant to do that - I would
stick with Doug's first SerializationFactory proposal (plus
the accept method).
Another aspect that the current patch doesn't address is who
instantiates objects during deserialization. (Doug - I think
you're alluding to this in the "reuse" object in
the Serializer class above?) For Writables and Thrift the
serialization framework does not instantiate objects - it
merely populates the supplied object with the representation
from the stream. For Java Serialization the serialization
framework reads the type from the stream and instantiates an
object for that type. To cater for this difference we need
to make the Deserializer expose whether it can reuse types
so that the client (for example ReduceTask) knows whether to
hand it an object or not. This is needed for efficiency (so
the client doesn't needlessly create objects that aren't
used) and also since some serialization frameworks don't
require classes to have no-arg constructors (so the client
would not be able to create the required object in any
case).
> Add support for a general serialization mechanism for
Map Reduce
>
------------------------------------------------------------
----
>
> Key: HADOOP-1986
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java,
serializer-v1.patch
>
>
> Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-1986) Add support
for a general serialization mechanism
for Map Reduce |
  United States |
2007-10-19 05:40:51 |
[ https://issues.apache.org/jira/browse
/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536176 ]
Vivek Ratan commented on HADOOP-1986:
-------------------------------------
>> Another aspect that the current patch doesn't
address is who instantiates objects during deserialization.
Good point. Maybe this is what Doug and/or you mean, but you
could have something like this:
boolean acceptObjectReference(); // returns true if the
framework deserializes into an object, false if it creates
an object and returns it
Object deserialize(Object reuse, InputStream);
Frameworks that expect the caller to create an object before
deserializing it (Thrift, Record I/O), would return NULL,
but others that create their own objects would accept a NULL
value for the 'reuse' parameter. The
_acceptObjectReference()_ method tells a client which option
the deserializer prefers, though most clients would already
know before-hand. Is this similar to what you were thinking
about?
Another option is to have separate deserialize methods:
Object deserialize(InputStream);
void deserialize(Object, InputStream);
Most frameworks would implement only one of these methods,
perhaps throwing an exception for the one they don't
implement (you'd also need a way for a client to dynamically
find out which frameworks implements which method).
I lean towards the former - it's more compact, though the
latter is a little more cleaner (less confusion).
> Add support for a general serialization mechanism for
Map Reduce
>
------------------------------------------------------------
----
>
> Key: HADOOP-1986
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java,
serializer-v1.patch
>
>
> Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-1986) Add support
for a general serialization mechanism
for Map Reduce |
  United States |
2007-10-19 07:33:00 |
[ https://issues.apache.org/jira/browse
/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536201 ]
Tom White commented on HADOOP-1986:
-----------------------------------
Vivek,
I was thinking of the first way.
> Frameworks that expect the caller to create an object
before deserializing it (Thrift, Record I/O), would return
NULL,
> but others that create their own objects would accept a
NULL value for the 'reuse' parameter.
It's a small point but I think the return value would always
be non-null for convenience. The contract would be that the
deserialize method always returns a deserialized object.
Deserializers for Thrift, Record I/O etc. would just return
the "reuse" object.
> Add support for a general serialization mechanism for
Map Reduce
>
------------------------------------------------------------
----
>
> Key: HADOOP-1986
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java,
serializer-v1.patch
>
>
> Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-1986) Add support
for a general serialization mechanism
for Map Reduce |
  United States |
2007-10-19 11:28:51 |
[ https://issues.apache.org/jira/browse
/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536254 ]
Doug Cutting commented on HADOOP-1986:
--------------------------------------
I don't see the need for 'acceptObjectReference()'. Clients
which wish to reuse objects can, the first time, pass null.
For subsequent calls they can pass the value returned from
prior call. This will work well with implementations which
reuse and with implementations that don't, no?
Note also that we don't need an explicit 'accept()' method:
if factory.getSerializer(c) returns null, then that factory
does not know how to serialize instances of the class.
Finally, if we have stateful serializers, then we have to
make it clear that they must not do any buffering. An
application should be able to call
'serializer.setOutput(out)', serialize an instance or two,
do other, non-serializer output to 'out', and then serialize
more instances, without calling 'setOutput' again, right?
In the case of Writable, setOutput would just set the
protected 'out' field of a DataOutputStream, and this would
all work fine. That could be instead done on each call to a
'serialize(Object, OutputStream)' method, but perhaps its
better to factor it out of inner loops. Is that the
intent?
> Add support for a general serialization mechanism for
Map Reduce
>
------------------------------------------------------------
----
>
> Key: HADOOP-1986
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java,
serializer-v1.patch
>
>
> Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-1986) Add support
for a general serialization mechanism
for Map Reduce |
  United States |
2007-10-19 14:28:51 |
[ https://issues.apache.org/jira/browse
/HADOOP-1986?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536322 ]
Tom White commented on HADOOP-1986:
-----------------------------------
> Clients which wish to reuse objects can, the first
time, pass null.
Except there might not be enough type information to
construct an object. For example if a WritableSerializer
were deserializing a LongWritable how would it know to
create a LongWritable object?
> In the case of Writable, setOutput would just set the
protected 'out' field of a DataOutputStream, and this
> would all work fine. That could be instead done on each
call to a 'serialize(Object, OutputStream)' method,
> but perhaps its better to factor it out of inner loops.
Is that the intent?
>From an API point of view I prefer serialize(Object,
OutputStream), but it's not clear to me that you can
implement this efficiently for any serialization framework.
For example, I don't think the technique you describe of
setting the 'out' field would work for Java Serialization.
And creating a new ObjectOutputStream for every call to the
serialize(Object, OutputStream) method would be prohibitive.
So unless there's another way of getting round this then I
think we're stuck with stateful serializers. (I'd love to be
proved wrong on this!)
> Add support for a general serialization mechanism for
Map Reduce
>
------------------------------------------------------------
----
>
> Key: HADOOP-1986
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-1986
> Project: Hadoop
> Issue Type: New Feature
> Components: mapred
> Reporter: Tom White
> Assignee: Tom White
> Fix For: 0.16.0
>
> Attachments: SerializableWritable.java,
serializer-v1.patch
>
>
> Currently Map Reduce programs have to use
WritableComparable-Writable key-value pairs. While it's
possible to write Writable wrappers for other serialization
frameworks (such as Thrift), this is not very convenient: it
would be nicer to be able to use arbitrary types directly,
without explicit wrapping and unwrapping.
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
|
|