Thread: problem with IdentityMapper
|
|
| problem with IdentityMapper |

|
2008-01-10 16:51:05 |
Hi,
I'm running into a problem where IdentityMapper seems to produce way
too much data. For example, I have a job that reads a sequence file
using IdentityMapper and then uses IdentityReducer to write everything
back out to another sequence file. My input is a ~60MB sequence file,
and after the map phase has completed, the job tracker UI reports about
10GB for "Map output bytes". It seems like the output collector does not
get properly reset, so each map output record that gets emitted has the
correct key but the value ends up being all the data encountered up to
that point. I think this is a known issue but I can't seem to find any
discussion about it right now. Has anyone else run into this, and if
so, is there a solution? I'm using the latest code in the 0.15 branch.
Thanks
Mike
|
|
| RE: problem with IdentityMapper |

|
2008-01-10 17:01:00 |
What are the key and value types in the SequenceFile?
It seems that the MapRunner calls createKey and createValue just once.
So if the value serializes out its entire allocated memory (and not
just what it last read), it would cause this problem.
(I have periodically shot myself in the foot with this bullet.)
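To make the suspicion concrete, here is a minimal, self-contained sketch of
the reuse pattern. It does not use Hadoop at all: LeakyValue and ReuseDemo
are hypothetical names invented for illustration, and the loop only imitates
a runner that creates one value object and refills it for every record. The
assumption being demonstrated is a readFields() that appends to existing
state instead of replacing it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

/** Hypothetical value type whose readFields() appends instead of replacing. */
class LeakyValue {
    private final List<Integer> items = new ArrayList<>();

    void readFields(DataInput in) throws IOException {
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            items.add(in.readInt()); // state from earlier records is never cleared
        }
    }

    void write(DataOutput out) throws IOException {
        out.writeInt(items.size());
        for (int item : items) {
            out.writeInt(item);
        }
    }
}

class ReuseDemo {
    /** Replays `records` 8-byte records through ONE reused value; returns bytes written per record. */
    static int[] outputSizes(int records) {
        try {
            // Build the input: each record is a count (1) followed by a single int.
            ByteArrayOutputStream input = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(input);
            for (int i = 0; i < records; i++) {
                dos.writeInt(1);
                dos.writeInt(i);
            }
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(input.toByteArray()));

            LeakyValue value = new LeakyValue(); // created once, then reused
            int[] sizes = new int[records];
            for (int i = 0; i < records; i++) {
                value.readFields(in);
                ByteArrayOutputStream out = new ByteArrayOutputStream();
                value.write(new DataOutputStream(out));
                sizes[i] = out.size();
            }
            return sizes;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Every input record is 8 bytes, but this prints 8, 12, 16.
        for (int size : outputSizes(3)) {
            System.out.println(size);
        }
    }
}
```

Each emitted value carries everything read so far, so total output grows
roughly with the square of the number of records handled by one reused object.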
________________________________
From: Mike Forrest [mailto:mforrest trailfire.com]
Sent: Thu 1/10/2008 2:51 PM
To: hadoop-user lucene.apache.org
Subject: problem with IdentityMapper
|
|
| Re: problem with IdentityMapper |

|
2008-01-10 17:20:29 |
I'm using Text for the keys and MapWritable for the values.
Joydeep Sen Sarma wrote:
> what are the key value types in the Sequencefile?
|
|
| RE: problem with IdentityMapper |
2008-01-10 17:27:41 |
That explains it.
The key/value objects are reused through the cycle of RecordReader read
and mapper calls.
The MapWritable reader perhaps does not reset the MapWritable object
passed to it.
Runping
|
|
| RE: problem with IdentityMapper |

|
2008-01-10 17:29:26 |
Ouch. MapWritable does not reset the hash table on a readFields. The
hash table would just grow and grow, and the write method dumps the
entire hash out.
The patch is simple: just do an instance.clear() in the readFields()
call. (But I haven't looked at the base class.)
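As a rough illustration of that patch, the sketch below uses a small
stand-in class rather than Hadoop's actual MapWritable source.
SimpleMapValue, ClearFixDemo, and the clearOnRead flag are hypothetical
names made up for this example; only the field name "instance" and the
clear() call in readFields() come from the suggestion above. The flag exists
solely so the unpatched and patched behavior can be compared side by side.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical stand-in for a map-backed value; not the real MapWritable. */
class SimpleMapValue {
    private final Map<String, String> instance = new HashMap<>();
    private final boolean clearOnRead;

    SimpleMapValue(boolean clearOnRead) {
        this.clearOnRead = clearOnRead;
    }

    void readFields(DataInput in) throws IOException {
        if (clearOnRead) {
            instance.clear(); // the suggested fix: drop entries left by the previous record
        }
        int entries = in.readInt();
        for (int i = 0; i < entries; i++) {
            String key = in.readUTF();
            String value = in.readUTF();
            instance.put(key, value);
        }
    }

    void write(DataOutput out) throws IOException {
        out.writeInt(instance.size());
        for (Map.Entry<String, String> e : instance.entrySet()) {
            out.writeUTF(e.getKey());
            out.writeUTF(e.getValue());
        }
    }

    int size() {
        return instance.size();
    }
}

class ClearFixDemo {
    /** Reads `records` single-entry maps into one reused object; returns its final entry count. */
    static int entriesAfterReading(boolean clearOnRead, int records) {
        try {
            ByteArrayOutputStream buf = new ByteArrayOutputStream();
            DataOutputStream dos = new DataOutputStream(buf);
            for (int i = 0; i < records; i++) {
                dos.writeInt(1);          // one entry per record
                dos.writeUTF("k" + i);    // distinct key each time
                dos.writeUTF("v" + i);
            }
            DataInputStream in =
                new DataInputStream(new ByteArrayInputStream(buf.toByteArray()));

            SimpleMapValue value = new SimpleMapValue(clearOnRead);
            for (int i = 0; i < records; i++) {
                value.readFields(in);
            }
            return value.size();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(entriesAfterReading(false, 100)); // 100: entries pile up
        System.out.println(entriesAfterReading(true, 100));  // 1: only the last record
    }
}
```

With the clear() in place, write() serializes only the entries of the record
that was just read, which is what a reused value object needs.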
|
|
| Re: problem with IdentityMapper |

|
2008-01-10 17:44:22 |
You were exactly right. Your simple patch has completely fixed my
problem. Thank you, Joydeep and Runping.
|
|