Odd how corruption is in key hbase files. For example, here
is
complaint about the hbase -ROOT- block from namenode log:
2007-11-24 23:43:31,966 INFO
org.apache.hadoop.dfs.StateChange: BLOCK*
NameSystem.allocateBlock:
/tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/356202
0000090370572/data. blk_3947372034074506449
2007-11-24 23:43:31,966 WARN
org.apache.hadoop.fs.FSNamesystem: Not able to place enough
replicas, still in need of 1
2007-11-24 23:43:31,966 INFO org.apache.hadoop.ipc.Server:
IPC Server handler 7 on 54310, call
addBlock(/tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfil
es/3562020000090370572/data, DFSClient_550313675) from
172.16.6.57:37157: error: java.io.IOException: File
/tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/356202
0000090370572/data could only be replicated to 0 nodes,
instead of 1
java.io.IOException: File
/tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/356202
0000090370572/data could only be replicated to 0 nodes,
instead of 1
at
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSName
system.java:1003)
at
org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
at sun.reflect.GeneratedMethodAccessor24.invoke(Unknown
Source)
at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
thodAccessorImpl.java:25)
at java.lang.reflect.Method.invoke(Method.java:585)
at org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
at
org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
It looks like your FS goes bad of a sudden and is no longer
able to
allocate blocks (Nothing in /var/log/messages on any of the
HDFS nodes
58 and 59?). If you run w/ DEBUG enabled, perhaps it will
give a clue
as to why its going bad? Perhaps the HDFS experts from this
list have
input on what could be going wrong? You might also try w/ a
replication-level > 1?
The NotServingRegionException in regionserver logs is
'normal'. They
are running fine until they run into the
failure-to-replicate complaint
from the HDFS.
Thanks Kareem,
St.Ack
Kareem Dana wrote:
> I cleaned the logs, cleaned and reformatted the DFS.
Started up DFS,
> mapred, and hbase.
> Ran fsck just to make sure:
> kcd hadoop07:~/hadoop$ bin/hadoop fsck /
> .....Status: HEALTHY
> Total size: 450 B
> Total blocks: 3 (avg. block size 150 B)
> Total dirs: 16
> Total files: 5
> Over-replicated blocks: 0 (0.0 %)
> Under-replicated blocks: 0 (0.0 %)
> Target replication factor: 1
> Real replication factor: 1.0
>
>
> The filesystem under path '/' is HEALTHY
>
> Started the PE Test:
> kcd hadoop07:~/hadoop$ bin/hadoop jar
>
/usr/research/home/kcd/hadoop/build/contrib/hbase/hadoop-0.1
5.0-dev-hbase-test.jar
> sequentialWrite 2
> 07/11/24 23:14:03 INFO hbase.PerformanceEvaluation:
Table name:
> TestTable, families: {info:={name: info, max versions:
3, compression:
> NONE, in memory: false, max length: 2147483647, bloom
filter: none}}
> created
> 07/11/24 23:14:04 INFO mapred.FileInputFormat: Total
input paths to process : 1
> 07/11/24 23:14:06 INFO mapred.JobClient: Running job:
job_200711242311_0001
> 07/11/24 23:14:07 INFO mapred.JobClient: map 0% reduce
0%
> 07/11/24 23:20:53 INFO mapred.JobClient: map 5% reduce
0%
> 07/11/24 23:21:03 INFO mapred.JobClient: map 5% reduce
1%
> 07/11/24 23:22:35 INFO mapred.JobClient: map 10%
reduce 1%
> 07/11/24 23:22:46 INFO mapred.JobClient: map 10%
reduce 3%
> 07/11/24 23:24:10 INFO mapred.JobClient: map 14%
reduce 3%
> 07/11/24 23:24:22 INFO mapred.JobClient: map 19%
reduce 3%
> 07/11/24 23:24:34 INFO mapred.JobClient: map 19%
reduce 5%
> 07/11/24 23:29:31 INFO mapred.JobClient: map 24%
reduce 5%
> 07/11/24 23:29:42 INFO mapred.JobClient: map 24%
reduce 6%
> 07/11/24 23:31:37 INFO mapred.JobClient: map 29%
reduce 6%
> 07/11/24 23:31:45 INFO mapred.JobClient: map 30%
reduce 6%
> 07/11/24 23:31:56 INFO mapred.JobClient: map 30%
reduce 10%
> 07/11/24 23:32:12 INFO mapred.JobClient: map 35%
reduce 10%
> 07/11/24 23:32:18 INFO mapred.JobClient: map 35%
reduce 11%
> 07/11/24 23:37:33 INFO mapred.JobClient: map 40%
reduce 11%
> 07/11/24 23:37:39 INFO mapred.JobClient: map 40%
reduce 13%
> 07/11/24 23:37:45 INFO mapred.JobClient: map 45%
reduce 13%
> 07/11/24 23:37:54 INFO mapred.JobClient: map 45%
reduce 15%
> 07/11/24 23:42:02 INFO mapred.JobClient: map 50%
reduce 15%
> 07/11/24 23:52:14 INFO mapred.JobClient: map 100%
reduce 100%
> java.io.IOException: Job failed!
> at
org.apache.hadoop.mapred.JobClient.runJob(JobClient.java:831
)
> at
org.apache.hadoop.hbase.PerformanceEvaluation.runNIsMoreThan
One(PerformanceEvaluation.java:248)
> at
org.apache.hadoop.hbase.PerformanceEvaluation.runTest(Perfor
manceEvaluation.java:588)
> at
org.apache.hadoop.hbase.PerformanceEvaluation.doCommandLine(
PerformanceEvaluation.java:680)
> at
org.apache.hadoop.hbase.PerformanceEvaluation.main(Performan
ceEvaluation.java:699)
> at
sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
> at
sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAcce
ssorImpl.java:39)
> at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
thodAccessorImpl.java:25)
> at
java.lang.reflect.Method.invoke(Method.java:585)
> at
org.apache.hadoop.util.RunJar.main(RunJar.java:155)
>
>
> Checked DFS status again:
> kcd hadoop07:~/hadoop$ bin/hadoop fsck /
> .....
>
/tmp/hadoop-kcd/hbase/hregion_-ROOT-,,0/info/mapfiles/356202
0000090370572/data:
> MISSING 1 blocks of total size 0 B.
> .......
>
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/434460
773960343228/data:
> MISSING 1 blocks of total size 0 B.
> ....................................
>
/tmp/hadoop-kcd/hbase/hregion_TestTable,1152738,800744514559
5019431/info/mapfiles/5636229081894866081/data:
> MISSING 1 blocks of total size 0 B.
> ....................................................
> ....................................................
>
/tmp/hadoop-kcd/hbase/hregion_TestTable,8426559,750141718782
5603851/info/mapfiles/6826621922173844792/data:
> MISSING 1 blocks of total size 0 B.
> ...........
>
/tmp/hadoop-kcd/hbase/log_172.16.6.56_-1556193815149242895_6
0020/hlog.dat.025:
> MISSING 1 blocks of total size 0 B.
> .Status: CORRUPT
> Total size: 1823865158 B
> Total blocks: 174 (avg. block size 10481983 B)
> Total dirs: 180
> Total files: 164
> Over-replicated blocks: 0 (0.0 %)
> Under-replicated blocks: 0 (0.0 %)
> Target replication factor: 1
> Real replication factor: 1.0
>
>
> The filesystem under path '/' is CORRUPT
>
> All my logs are uploaded at http://www.
cs.duke.edu/~kcd/hadoop-logs2/
>
> The namenode log looks fine until the very end when it
has a few of
> these errors:
> 2007-11-24 23:43:00,798 INFO
org.apache.hadoop.dfs.StateChange: BLOCK*
> NameSystem.allocateBlock:
>
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/914606
8477937578655/data.
> blk_-6879410450276751752
> 2007-11-24 23:43:00,798 WARN
org.apache.hadoop.fs.FSNamesystem: Not
> able to place enough replicas, still in need of 1
> 2007-11-24 23:43:00,808 INFO
org.apache.hadoop.ipc.Server: IPC Server
> handler 1 on 54310, call
>
addBlock(/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfil
es/9146068477937578655/data,
> DFSClient_-1049549396) from 172.16.6.56:56970: error:
> java.io.IOException: File
>
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/914606
8477937578655/data
> could only be replicated to 0 nodes, instead of 1
> java.io.IOException: File
>
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/914606
8477937578655/data
> could only be replicated to 0 nodes, instead of 1
> at
org.apache.hadoop.dfs.FSNamesystem.getAdditionalBlock(FSName
system.java:1003)
> at
org.apache.hadoop.dfs.NameNode.addBlock(NameNode.java:293)
> at
sun.reflect.GeneratedMethodAccessor24.invoke(Unknown
Source)
> at
sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMe
thodAccessorImpl.java:25)
> at java.lang.reflect.Method.invoke(Method.java:585)
> at
org.apache.hadoop.ipc.RPC$Server.call(RPC.java:379)
> at
org.apache.hadoop.ipc.Server$Handler.run(Server.java:596)
>
> The hbase-region servers seem to be having a lot of
problems though
> according to their logs. Not quite sure what could be
the cause.
>
> - Kareem
>
> On Nov 24, 2007 9:56 PM, Ted Dunning <tdunning veoh.com> wrote:
>
>> Can you clear the logs, clean out the file system
and run the test again?
>>
>> The namenode logs should tell an interesting
story.
>>
>>
>>
>> On 11/24/07 6:12 PM, "Kareem Dana"
<kareem.dana gmail.com> wrote:
>>
>>
>>> I ran hadoop fsck and sure enough the DFS was
corrupted. It seems that
>>> the PerformanceEvaluation test is corrupting
it. Before I run the
>>> test, I ran fsck and the DFS was reported as
HEALTHY. Once the PE
>>> fails, the DFS is reported as corrupt. I tried
to simplify my setup
>>> and run the PE again. My new config is as
follows:
>>>
>>> hadoop07 - DFS Master, Mapred master, hbase
master
>>> hadoop09-10 - 2 hbase region servers
>>> hadoop11-12 - 2 datanodes, task trackers
>>>
>>> mapred.map.tasks = 2
>>> mapred.reduce.tasks = 1
>>> dfs.replication = 1
>>>
>>> I ran the distributed PE in that configuration
and it still failed
>>> with similar errors. The output of the hadoop
fsck for this run was:
>>>
>>> ..........
>>>
>>>
>>
/tmp/hadoop-kcd/hbase/hregion_.META.,,1/info/mapfiles/643488
1831082231493/data>
>> :
>>
>>> MISSING 1 blocks of total size 0 B.
>>> ......................................
>>>
/tmp/hadoop-kcd/hbase/hregion_TestTable,11566878,12270926815
44002579/info/mapf
>>> iles/5263238643231358600/data:
>>> MISSING 1 blocks of total size 0 B.
>>> ....
>>>
/tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,16520624110
16999689/info/mapf
>>> iles/2024298319068625138/data:
>>> MISSING 1 blocks of total size 0 B.
>>> ....
>>>
/tmp/hadoop-kcd/hbase/hregion_TestTable,12612310,16520624110
16999689/info/mapf
>>> iles/5071453667327337040/data:
>>> MISSING 1 blocks of total size 0 B.
>>> .........
>>>
/tmp/hadoop-kcd/hbase/hregion_TestTable,13932,47381927475213
22482/info/mapfile
>>> s/4400784113695734765/data:
>>> MISSING 1 blocks of total size 0 B.
>>> ...................................
>>>
............................................................
............
>>>
/tmp/hadoop-kcd/hbase/log_172.16.6.56_-1823376333333123807_6
0020/hlog.dat.027:
>>> MISSING 1 blocks of total size 0 B.
>>> .Status: CORRUPT
>>> Total size: 1890454330 B
>>> Total blocks: 180 (avg. block size 10502524
B)
>>> Total dirs: 190
>>> Total files: 173
>>> Over-replicated blocks: 0 (0.0 %)
>>> Under-replicated blocks: 0 (0.0 %)
>>> Target replication factor: 1
>>> Real replication factor: 1.0
>>>
>>>
>>> The filesystem under path '/' is CORRUPT
>>>
>>>
>>> On Nov 24, 2007 6:21 PM, Ted Dunning
<tdunning veoh.com> wrote:
>>>
>>>> I think that stack was suggesting an HDFS
fsck, not a disk level fsck.
>>>>
>>>> Try [hadoop fsck /]
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On 11/24/07 4:09 PM, "Kareem
Dana" <kareem.dana gmail.com> wrote:
>>>>
>>>>
>>>>> I do not have root access on the xen
cluster I'm using. I will ask the
>>>>> admin to make sure the disk is working
properly. Regarding the
>>>>> mismatch versions though, are you
suggesting that different region
>>>>> servers might be running different
versions of hbase/hadoop? They are
>>>>> all running the same code from the same
shared storage. There isn't
>>>>> even another version of hadoop anywhere
for the other nodes to run. I
>>>>> think I'll try dropping my cluster down
to 2 nodes and working back
>>>>> up... maybe I can pin point a specific
problem node. Thanks for taking
>>>>> a look at my logs.
>>>>>
>>>>> On Nov 24, 2007 5:49 PM, stack
<stack duboce.net> wrote:
>>>>>
>>>>>> I took a quick look Kareem. As
with the last time, hbase keeps having
>>>>>> trouble w/ the hdfs. Things start
out fine around 16:00 then go bad
>>>>>> because can't write reliably to the
hdfs -- a variety of reasons. You
>>>>>> then seem to restart the cluster
around 17:37 or so and things seem to
>>>>>> go along fine for a while until
19:05 when again, all regionservers
>>>>>> report trouble writing the hdfs.
Have you run an fsck?
>>>>>>
>>>>
>>
|