Just in case someone's curious.
Stop and restart dfs with 0.13.1:
- master name node says:
2007-08-24 18:31:27,318 INFO org.apache.hadoop.dfs.NameNode: Namenode up at: hadoop001.sf2p.facebook.com/10.16.159.101:9000
2007-08-24 18:31:28,560 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedDelete: failed to remove /tmp/pu3 because it does not exist
2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/part-00044 to /user/facebook/chatter/rawcounts/2007-08-04/part-00044 because destination exists
2007-08-24 18:31:28,571 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000044_0/.part-00044.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00044.crc because destination exists
2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/part-00040 to /user/facebook/chatter/rawcounts/2007-08-04/part-00040 because destination exists
2007-08-24 18:31:28,572 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000040_0/.part-00040.crc to /user/facebook/chatter/rawcounts/2007-08-04/.part-00040.crc because destination exists
2007-08-24 18:31:28,573 WARN org.apache.hadoop.dfs.StateChange: DIR* FSDirectory.unprotectedRenameTo: failed to rename /user/facebook/chatter/rawcounts/2007-08-04/_task_0001_r_000052_0/part-00052 to /user/facebook/chatter/rawcounts/2007-08-04/part-00052 because destination exists
...
There's a serious blast of these (replaying the edit log?). In any case,
after this is done the namenode enters safemode - I presume the fs is
already corrupted by then. At the exact same time, the datanodes are busy
deleting blocks:
2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: Starting DataNode in: FSDataset{dirpath='/var/hadoop/tmp/dfs/data/current'}
2007-08-24 18:31:33,243 INFO org.apache.hadoop.dfs.DataNode: using BLOCKREPORT_INTERVAL of 3588023msec
2007-08-24 18:31:34,252 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9223045762536565560 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir18/blk_-9223045762536565560
2007-08-24 18:31:34,269 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9214178286744587840 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir12/blk_-9214178286744587840
2007-08-24 18:31:34,370 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9213127144044535407 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir20/blk_-9213127144044535407
2007-08-24 18:31:34,386 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9211625398030978419 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir26/blk_-9211625398030978419
2007-08-24 18:31:34,418 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9189558923884323865 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir24/blk_-9189558923884323865
2007-08-24 18:31:34,419 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9115468136273900585 file /var/hadoop/tmp/dfs/data/current/subdir10/blk_-9115468136273900585
Ouch - I guess those are all the blocks that fsck is now reporting
missing. Known bug? Operator error? (Well, I did do a clean shutdown...)
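
For anyone who wants to check that guess rather than eyeball it, here is a minimal Python sketch. It pulls the block IDs out of "Deleting block" entries like the ones in the datanode log above and intersects them with a list of blocks reported missing. The function names and the `missing_ids` input are hypothetical: the sketch assumes the missing block IDs have been collected by hand from the fsck report, and it does not assume anything about fsck's output format.

```python
import re

# Matches the "Deleting block blk_<id>" entries shown in the datanode log above.
# \s+ also spans line breaks, so entries wrapped by a mail client still match.
DELETE_RE = re.compile(r"Deleting\s+block\s+(blk_-?\d+)")


def deleted_block_ids(log_text):
    """Return the set of block IDs the datanode log says were deleted."""
    return set(DELETE_RE.findall(log_text))


def deleted_and_missing(log_text, missing_ids):
    """Return block IDs that were deleted and are also reported missing.

    missing_ids is a hypothetical input: an iterable of block IDs gathered
    by hand from the fsck report (no fsck output format is assumed here).
    """
    return deleted_block_ids(log_text) & set(missing_ids)


# Two entries copied from the datanode log above; the first is left wrapped
# on purpose to show that a line break inside the entry does not matter.
sample_log = """\
2007-08-24 18:31:34,252 INFO org.apache.hadoop.dfs.DataNode: Deleting
block blk_-9223045762536565560 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir18/blk_-9223045762536565560
2007-08-24 18:31:34,269 INFO org.apache.hadoop.dfs.DataNode: Deleting block blk_-9214178286744587840 file /var/hadoop/tmp/dfs/data/current/subdir14/subdir12/blk_-9214178286744587840
"""

if __name__ == "__main__":
    for block_id in sorted(deleted_block_ids(sample_log)):
        print(block_id)
```

If every missing block also shows up in the deleted set, that would support the guess that the datanode deletions account for what fsck reports. It would not explain why the deletions were issued.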
-----Original Message-----
From: Joydeep Sen Sarma [mailto:jssarma facebook.com]
Sent: Friday, August 24, 2007 7:21 PM
To: hadoop-user lucene.apache.org
Subject: RE: secondary namenode errors
I wish I had read the bug more carefully - I thought the issue was
fixed in 0.13.1.
Of course not, the issue persists. Meanwhile, half the files are
corrupted after the upgrade (I followed the upgrade wiki and tried to
restore the backed-up metadata and the old version - to no avail).
Sigh - have a nice weekend everyone,
Joydeep
-----Original Message-----
From: Koji Noguchi [mailto:knoguchi yahoo-inc.com]
Sent: Friday, August 24, 2007 8:29 AM
To: hadoop-user lucene.apache.org
Subject: Re: secondary namenode errors
Joydeep,
I think you're hitting this bug.
http://issues.apache.org/jira/browse/HADOOP-1076
In any case, as Raghu suggested, please use 0.13.1 and not
0.13.
Koji
Raghu Angadi wrote:
> Joydeep Sen Sarma wrote:
>> Thanks for replying.
>>
>> Can you please clarify - is it the case that the secondary namenode
>> stuff only works in 0.13.1? And what's the connection with replication
>> factor?
>>
>> We lost the file system completely once, trying to make sure we can
>> avoid it the next time.
>
> I am not sure if the problem you reported still exists in 0.13.1. You
> might still have the problem and you can ask again. But you should
> move to 0.13.1 since it has some critical fixes. See release notes for
> 0.13.1 or HADOOP-1603. You should always upgrade to the latest minor
> release version when moving to next major version.
>
> Raghu.
>
>> Joydeep
>>
>> -----Original Message-----
>> From: Raghu Angadi [mailto:rangadi yahoo-inc.com]
>> Sent: Thursday, August 23, 2007 9:44 PM
>> To: hadoop-user lucene.apache.org
>> Subject: Re: secondary namenode errors
>>
>>
>> On a related note, please don't use 0.13.0, use the latest released
>> version for 0.13 (I think it is 0.13.1). If the secondary namenode
>> actually works, then it will result in all the replications being set to 1.
>>
>> Raghu.
>>
>> Joydeep Sen Sarma wrote:
>>> Hi folks,