Thread: Created: (HADOOP-1998) No recovery when trying to replicate on marginal datanode
Created: (HADOOP-1998) No recovery when trying to replicate on marginal datanode | United States | 2007-10-05 13:45:51
No recovery when trying to replicate on marginal datanode
---------------------------------------------------------
Key: HADOOP-1998
URL: https://issues.apache.org/jira/browse/HADOOP-1998
Project: Hadoop
Issue Type: Bug
Components: dfs
Affects Versions: 0.15.0
Environment: Sep 14 nightly build with a couple of
mapred-related patches
Reporter: Christian Kunz
We have been uploading a lot of data to HDFS, running about
400 scripts in parallel that call Hadoop's command-line
utility in a distributed fashion. Many of them started to hang
when copying large files (>120GB), repeating the
following message without end:
07/10/05 15:44:25 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:26 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:27 INFO fs.DFSClient: Could not complete file, retrying...
07/10/05 15:44:28 INFO fs.DFSClient: Could not complete file, retrying...
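The log above shows the client retrying roughly twice per second with no upper bound. As an illustration only, and not the DFSClient code, the following Python sketch shows one way such a loop could be bounded with capped exponential backoff. The callable `try_complete`, the exception `CompleteFileTimeout`, and all default values are hypothetical names and numbers chosen for the example.

```python
import time


class CompleteFileTimeout(Exception):
    """Raised when the file could not be completed within the retry budget."""


def complete_with_backoff(try_complete, max_attempts=8, base_delay=0.4,
                          max_delay=30.0, sleep=time.sleep):
    """Call try_complete() until it returns True or the budget is spent.

    try_complete is a hypothetical stand-in for the client's
    "complete file" request; it returns True once the request succeeds.
    Returns the number of attempts that were needed.
    """
    delay = base_delay
    for attempt in range(1, max_attempts + 1):
        if try_complete():
            return attempt
        if attempt < max_attempts:
            sleep(delay)
            # Double the wait after each failure, up to max_delay.
            delay = min(delay * 2, max_delay)
    # Give up and surface the failure instead of retrying forever.
    raise CompleteFileTimeout(
        "could not complete file after %d attempts" % max_attempts)
```

With a bound like this, a stuck upload ends in a visible error that the calling script can act on, rather than hanging indefinitely.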
In the namenode log I eventually found repeated messages
like:
2007-10-05 14:40:08,063 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_3124504920241431462
2007-10-05 14:40:11,876 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_3124504920241431462 to datanode(s) <IP4_1>:50010
2007-10-05 14:45:08,069 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_8533614499490422104
2007-10-05 14:45:08,070 WARN org.apache.hadoop.fs.FSNamesystem: PendingReplicationMonitor timed out block blk_7741954594593177224
2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_7741954594593177224 to datanode(s) <IP4_2>:50010
2007-10-05 14:45:13,973 INFO org.apache.hadoop.dfs.StateChange: BLOCK* NameSystem.pendingTransfer: ask <IP4>:50010 to replicate blk_8533614499490422104 to datanode(s) <IP4_3>:50010
I could not ssh to the node with IP address <IP4>, but
the datanode server apparently still sent heartbeats. After
rebooting the node it was okay again, and a few files and a
few clients recovered, but not all.
I restarted the clients that had not recovered and they
completed this time (before noticing the marginal node, we had
restarted the clients twice without success).
I conclude that the marginal node must have caused loss of
blocks, at least in the tracking mechanism, in addition to the
endless retries.
In summary, dfs should be able to handle datanodes that send
good heartbeats but otherwise fail to do their job. This should
include datanodes that have a high rate of socket connection
timeouts.
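To make the request in the summary concrete, here is a minimal Python sketch of judging a datanode by its recent transfer outcomes rather than by heartbeats alone. It is an illustration under stated assumptions, not the namenode's actual logic: the class `TransferHealthTracker`, its methods, and the window and threshold values are all hypothetical.

```python
from collections import defaultdict, deque


class TransferHealthTracker:
    """Tracks recent transfer outcomes per datanode (hypothetical helper)."""

    def __init__(self, window=20, min_samples=5, max_failure_rate=0.5):
        self.min_samples = min_samples
        self.max_failure_rate = max_failure_rate
        # Keep only the last `window` outcomes for each node.
        self._outcomes = defaultdict(lambda: deque(maxlen=window))

    def record(self, node, ok):
        """Record one transfer attempt; ok=False covers timeouts and errors."""
        self._outcomes[node].append(bool(ok))

    def failure_rate(self, node):
        outcomes = self._outcomes.get(node)
        if not outcomes:
            return 0.0
        return outcomes.count(False) / len(outcomes)

    def is_marginal(self, node):
        """True if the node has enough samples and fails too often."""
        outcomes = self._outcomes.get(node)
        if outcomes is None or len(outcomes) < self.min_samples:
            return False
        return self.failure_rate(node) > self.max_failure_rate

    def pick_source(self, replicas):
        """Return the first replica holder not flagged as marginal, or None."""
        for node in replicas:
            if not self.is_marginal(node):
                return node
        return None
```

In this sketch, a node that keeps sending heartbeats but times out on most transfers would be flagged after `min_samples` attempts, so replication requests could be directed to another replica holder instead of being re-sent to the same failing node.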
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
Commented: (HADOOP-1998) No recovery when trying to replicate on marginal datanode | United States | 2007-10-05 14:01:50
[ https://issues.apache.org/jira/browse/HADOOP-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532751 ]
Raghu Angadi commented on HADOOP-1998:
--------------------------------------
I think the problem might have been less severe with the fix
for HADOOP-1955 (though that still does not explain quite a few
things about this case). What percentage of clients do you
think were stuck like this?
Commented: (HADOOP-1998) No recovery when trying to replicate on marginal datanode | United States | 2007-10-05 18:34:51
[ https://issues.apache.org/jira/browse/HADOOP-1998?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12532795 ]
Christian Kunz commented on HADOOP-1998:
----------------------------------------
About 5% of the clients got stuck.