|
List Info
Thread: Created: (HADOOP-2076) extensive map tasks failures because of SocketTimeoutException during
|
|
| Created: (HADOOP-2076) extensive map
tasks failures because of
SocketTimeoutException during |
  United States |
2007-10-18 13:33:51 |
extensive map tasks failures because of
SocketTimeoutException during statusUpdate
------------------------------------------------------------
----------------------
Key: HADOOP-2076
URL: htt
ps://issues.apache.org/jira/browse/HADOOP-2076
Project: Hadoop
Issue Type: Bug
Affects Versions: 0.16.0
Environment: Oct 17 #718 nightly build with patches
2033 and 2048
Reporter: Christian Kunz
A job with 3600 tasks on a cluster of 1350 nodes (up 3 tasks
per node) shows extensive map tasks failures because of
connection timeouts at the end of the task (c++ application
using pipes interface completed successfully)
More than 600 tasks failed, slowing down the job because of
retries. Only a portion of the tasks fail because of the
timeout issue, but they spawn other failures because retries
and speculatively executed tasks cannot even get a
connection and fail just after a few seconds.
JobTracker is running with 60 handlers. We allow up to 10
attempts for maps.
I attach the log of a task failing because of timeout (which
includes a thread dump), and the log of one task which could
not start.
2007-10-18 15:58:41,743 INFO
org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM
Metrics with processName=MAP, sessionId=
2007-10-18 15:58:41,827 INFO
org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
2007-10-18 16:12:28,918 INFO
org.apache.hadoop.util.NativeCodeLoader: Loaded the
native-hadoop library
2007-10-18 16:12:28,920 INFO
org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully
loaded & initialized native-zlib library
2007-10-18 17:43:00,785 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:484)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
at java.lang.Thread.run(Thread.java:619)
2007-10-18 17:44:03,833 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:484)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
at java.lang.Thread.run(Thread.java:619)
2007-10-18 17:45:06,838 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
at org.apache.hadoop.ipc.Client.call(Client.java:484)
at org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
at org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
at java.lang.Thread.run(Thread.java:619)
2007-10-18 17:45:40,258 INFO
org.apache.hadoop.mapred.TaskRunner: Process Thread Dump:
Communication exception
8 active threads
Thread 13 (Comm thread for
task_200710172336_0016_m_000071_0):
State: RUNNABLE
Blocked count: 0
Waited count: 4128
Stack:
sun.management.ThreadImpl.getThreadInfo0(Native Method)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(Refle
ctionUtils.java:114)
org.apache.hadoop.util.ReflectionUtils.logThreadInfo(Reflect
ionUtils.java:162)
org.apache.hadoop.mapred.Task$1.run(Task.java:315)
java.lang.Thread.run(Thread.java:619)
Thread 12 (org.apache.hadoop.dfs.DFSClient$LeaseChecker 141b571):
State: TIMED_WAITING
Blocked count: 0
Waited count: 6403
Stack:
java.lang.Thread.sleep(Native Method)
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.j
ava:558)
java.lang.Thread.run(Thread.java:619)
Thread 9 (IPC Client connection to /127.0.0.1:49458):
State: RUNNABLE
Blocked count: 21
Waited count: 2063
Stack:
java.net.SocketInputStream.socketRead0(Native Method)
java.net.SocketInputStream.read(SocketInputStream.java:129)
java.io.FilterInputStream.read(FilterInputStream.java:116)
org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:1
81)
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
java.io.BufferedInputStream.read(BufferedInputStream.java:23
7)
java.io.DataInputStream.readInt(DataInputStream.java:370)
org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
Thread 8 (org.apache.hadoop.io.ObjectWritable Connection
Culler):
State: TIMED_WAITING
Blocked count: 0
Waited count: 6402
Stack:
java.lang.Thread.sleep(Native Method)
org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.jav
a:404)
Thread 4 (Signal Dispatcher):
State: RUNNABLE
Blocked count: 0
Waited count: 0
Stack:
Thread 3 (Finalizer):
State: WAITING
Blocked count: 398
Waited count: 2270
Waiting on java.lang.ref.ReferenceQueue$Lock c278b5
Stack:
java.lang.Object.wait(Native Method)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:1
59)
Thread 2 (Reference Handler):
State: WAITING
Blocked count: 257
Waited count: 2269
Waiting on java.lang.ref.Reference$Lock 1c66ec7
Stack:
java.lang.Object.wait(Native Method)
java.lang.Object.wait(Object.java:485)
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:
116)
Thread 1 (main):
State: RUNNABLE
Blocked count: 1
Waited count: 10
Stack:
java.io.FileInputStream.readBytes(Native Method)
java.io.FileInputStream.read(FileInputStream.java:199)
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStre
am.read(RawLocalFileSystem.java:105)
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
java.io.BufferedInputStream.read1(BufferedInputStream.java:2
58)
java.io.BufferedInputStream.read(BufferedInputStream.java:31
7)
java.io.DataInputStream.read(DataInputStream.java:132)
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.readChunk(ChecksumFileSystem.java:200)
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInpu
tChecker.java:234)
org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java
:176)
org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.jav
a:193)
org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java
:157)
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java
:359)
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.seek(ChecksumFileSystem.java:254)
org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStrea
m.java:37)
org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.j
ava:1793)
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1217)
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1142)
2007-10-18 17:45:40,258 WARN
org.apache.hadoop.mapred.TaskRunner: Last retry, killing
task_200710172336_0016_m_000071_0
Log of task that could not start:
2007-10-18 17:43:55,766 INFO org.apache.hadoop.ipc.Client:
Retrying connect to server: /127.0.0.1:53972. Already tried
1 time(s).
2007-10-18 17:43:56,768 INFO org.apache.hadoop.ipc.Client:
Retrying connect to server: /127.0.0.1:53972. Already tried
2 time(s).
2007-10-18 17:43:57,770 INFO org.apache.hadoop.ipc.Client:
Retrying connect to server: /127.0.0.1:53972. Already tried
3 time(s).
2007-10-18 17:43:58,772 INFO org.apache.hadoop.ipc.Client:
Retrying connect to server: /127.0.0.1:53972. Already tried
4 time(s).
2007-10-18 17:43:59,774 INFO org.apache.hadoop.ipc.Client:
Retrying connect to server: /127.0.0.1:53972. Already tried
5 time(s).
2007-10-18 17:44:00,776 INFO org.apache.hadoop.ipc.Client:
Retrying connect to server: /127.0.0.1:53972. Already tried
6 time(s).
2007-10-18 17:44:01,778 INFO org.apache.hadoop.ipc.Client:
Retrying connect to server: /127.0.0.1:53972. Already tried
7 time(s).
2007-10-18 17:44:02,780 INFO org.apache.hadoop.ipc.Client:
Retrying connect to server: /127.0.0.1:53972. Already tried
8 time(s).
2007-10-18 17:44:03,783 INFO org.apache.hadoop.ipc.Client:
Retrying connect to server: /127.0.0.1:53972. Already tried
9 time(s).
2007-10-18 17:44:04,785 INFO org.apache.hadoop.ipc.Client:
Retrying connect to server: /127.0.0.1:53972. Already tried
10 time(s).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-2076) extensive map
tasks failures because of
SocketTimeoutException durin |
  United States |
2007-10-18 13:45:50 |
[ https://issues.apache.org/jira/browse
/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536011 ]
Owen O'Malley commented on HADOOP-2076:
---------------------------------------
Christian, please try a build with hadoop-2070, which was
committed last night. I suspect you are seeing the same
problem.
> extensive map tasks failures because of
SocketTimeoutException during statusUpdate
>
------------------------------------------------------------
----------------------
>
> Key: HADOOP-2076
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-2076
> Project: Hadoop
> Issue Type: Bug
> Affects Versions: 0.16.0
> Environment: Oct 17 #718 nightly build with
patches 2033 and 2048
> Reporter: Christian Kunz
>
> A job with 3600 tasks on a cluster of 1350 nodes (up 3
tasks per node) shows extensive map tasks failures because
of connection timeouts at the end of the task (c++
application using pipes interface completed successfully)
> More than 600 tasks failed, slowing down the job
because of retries. Only a portion of the tasks fail because
of the timeout issue, but they spawn other failures because
retries and speculatively executed tasks cannot even get a
connection and fail just after a few seconds.
> JobTracker is running with 60 handlers. We allow up to
10 attempts for maps.
> I attach the log of a task failing because of timeout
(which includes a thread dump), and the log of one task
which could not start.
> 2007-10-18 15:58:41,743 INFO
org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM
Metrics with processName=MAP, sessionId=
> 2007-10-18 15:58:41,827 INFO
org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
> 2007-10-18 16:12:28,918 INFO
org.apache.hadoop.util.NativeCodeLoader: Loaded the
native-hadoop library
> 2007-10-18 16:12:28,920 INFO
org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully
loaded & initialized native-zlib library
> 2007-10-18 17:43:00,785 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:44:03,833 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:06,838 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:40,258 INFO
org.apache.hadoop.mapred.TaskRunner: Process Thread Dump:
Communication exception
> 8 active threads
> Thread 13 (Comm thread for
task_200710172336_0016_m_000071_0):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 4128
> Stack:
> sun.management.ThreadImpl.getThreadInfo0(Native
Method)
>
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(Refle
ctionUtils.java:114)
>
org.apache.hadoop.util.ReflectionUtils.logThreadInfo(Reflect
ionUtils.java:162)
> org.apache.hadoop.mapred.Task$1.run(Task.java:315)
> java.lang.Thread.run(Thread.java:619)
> Thread 12
(org.apache.hadoop.dfs.DFSClient$LeaseChecker 141b571):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 6403
> Stack:
> java.lang.Thread.sleep(Native Method)
>
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.j
ava:558)
> java.lang.Thread.run(Thread.java:619)
> Thread 9 (IPC Client connection to /127.0.0.1:49458):
> State: RUNNABLE
> Blocked count: 21
> Waited count: 2063
> Stack:
> java.net.SocketInputStream.socketRead0(Native
Method)
>
java.net.SocketInputStream.read(SocketInputStream.java:129)
>
java.io.FilterInputStream.read(FilterInputStream.java:116)
>
org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:1
81)
>
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
>
java.io.BufferedInputStream.read(BufferedInputStream.java:23
7)
>
java.io.DataInputStream.readInt(DataInputStream.java:370)
>
org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
> Thread 8 (org.apache.hadoop.io.ObjectWritable
Connection Culler):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 6402
> Stack:
> java.lang.Thread.sleep(Native Method)
>
org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.jav
a:404)
> Thread 4 (Signal Dispatcher):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 0
> Stack:
> Thread 3 (Finalizer):
> State: WAITING
> Blocked count: 398
> Waited count: 2270
> Waiting on java.lang.ref.ReferenceQueue$Lock c278b5
> Stack:
> java.lang.Object.wait(Native Method)
>
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:1
59)
> Thread 2 (Reference Handler):
> State: WAITING
> Blocked count: 257
> Waited count: 2269
> Waiting on java.lang.ref.Reference$Lock 1c66ec7
> Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
>
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:
116)
> Thread 1 (main):
> State: RUNNABLE
> Blocked count: 1
> Waited count: 10
> Stack:
> java.io.FileInputStream.readBytes(Native Method)
>
java.io.FileInputStream.read(FileInputStream.java:199)
>
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStre
am.read(RawLocalFileSystem.java:105)
>
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
>
java.io.BufferedInputStream.read1(BufferedInputStream.java:2
58)
>
java.io.BufferedInputStream.read(BufferedInputStream.java:31
7)
>
java.io.DataInputStream.read(DataInputStream.java:132)
>
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
>
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.readChunk(ChecksumFileSystem.java:200)
>
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInpu
tChecker.java:234)
>
org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java
:176)
>
org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.jav
a:193)
>
org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java
:157)
>
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
>
org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java
:359)
>
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.seek(ChecksumFileSystem.java:254)
>
org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStrea
m.java:37)
>
org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.j
ava:1793)
>
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1217)
>
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1142)
> 2007-10-18 17:45:40,258 WARN
org.apache.hadoop.mapred.TaskRunner: Last retry, killing
task_200710172336_0016_m_000071_0
> Log of task that could not start:
> 2007-10-18 17:43:55,766 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 1 time(s).
> 2007-10-18 17:43:56,768 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 2 time(s).
> 2007-10-18 17:43:57,770 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 3 time(s).
> 2007-10-18 17:43:58,772 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 4 time(s).
> 2007-10-18 17:43:59,774 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 5 time(s).
> 2007-10-18 17:44:00,776 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 6 time(s).
> 2007-10-18 17:44:01,778 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 7 time(s).
> 2007-10-18 17:44:02,780 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 8 time(s).
> 2007-10-18 17:44:03,783 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 9 time(s).
> 2007-10-18 17:44:04,785 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 10 time(s).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-2076) extensive map
tasks failures because of
SocketTimeoutException durin |
  United States |
2007-10-18 19:18:50 |
[ https://issues.apache.org/jira/browse
/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12536082 ]
Christian Kunz commented on HADOOP-2076:
----------------------------------------
Owen, I am not convinced that this is related to
Hadoop-2070.
As a matter of fact, as far as I can see the original patch
for HADOOP-1788 flushed the buffer when the application is
done, but missed two cases: when the application does not
use a java inputformat and when it gets aborted. That's why
the first 3 out of 4 unit tests were successful; the 4th
unit test does not use a java input format, therefore, it
failed. I would have thought that without any flushing all
unit tests would have failed.
Our application uses a java inputformat and has the input
flushed (it actually completes). Also the symptom of missing
flushing is that the task it self times out.
The current issue seems to be related to bad communication
between task and taskTracker. I checked the tasktracker log
around the same time, and found that it was running out of
threads
2007-10-18 19:27:43,719 INFO
org.mortbay.http.SocketListener: LOW ON THREADS
((40-40+0)<1) on SocketListener0 0.0.0.0:50060
2007-10-18 19:27:44,432 INFO
org.apache.hadoop.mapred.TaskTracker:
task_200710172336_0016_m_000045_1 <status msg>
2007-10-18 19:27:44,432 INFO
org.apache.hadoop.mapred.TaskTracker:
task_200710172336_0016_r_000889_1 <statusmsg> >
reduce
2007-10-18 19:27:44,437 INFO
org.apache.hadoop.mapred.TaskTracker:
task_200710172336_0016_r_003314_0 0.70760524% reduce >
reduce
2007-10-18 19:27:44,997 INFO
org.apache.hadoop.mapred.TaskTracker:
task_200710172336_0016_r_003312_0 0.6736239% reduce >
reduce
2007-10-18 19:27:45,430 INFO
org.apache.hadoop.mapred.TaskTracker:
task_200710172336_0016_m_001001_1 0<statusmsg>
2007-10-18 19:27:45,498 WARN
org.mortbay.http.SocketListener: OUT OF THREADS:
SocketListener0 0.0.0.0:50060
2007-10-18 19:27:47,436 INFO
org.apache.hadoop.mapred.TaskTracker:
task_200710172336_0016_m_000045_1 <statusmsg>
> extensive map tasks failures because of
SocketTimeoutException during statusUpdate
>
------------------------------------------------------------
----------------------
>
> Key: HADOOP-2076
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-2076
> Project: Hadoop
> Issue Type: Bug
> Affects Versions: 0.16.0
> Environment: Oct 17 #718 nightly build with
patches 2033 and 2048
> Reporter: Christian Kunz
>
> A job with 3600 tasks on a cluster of 1350 nodes (up 3
tasks per node) shows extensive map tasks failures because
of connection timeouts at the end of the task (c++
application using pipes interface completed successfully)
> More than 600 tasks failed, slowing down the job
because of retries. Only a portion of the tasks fail because
of the timeout issue, but they spawn other failures because
retries and speculatively executed tasks cannot even get a
connection and fail just after a few seconds.
> JobTracker is running with 60 handlers. We allow up to
10 attempts for maps.
> I attach the log of a task failing because of timeout
(which includes a thread dump), and the log of one task
which could not start.
> 2007-10-18 15:58:41,743 INFO
org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM
Metrics with processName=MAP, sessionId=
> 2007-10-18 15:58:41,827 INFO
org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
> 2007-10-18 16:12:28,918 INFO
org.apache.hadoop.util.NativeCodeLoader: Loaded the
native-hadoop library
> 2007-10-18 16:12:28,920 INFO
org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully
loaded & initialized native-zlib library
> 2007-10-18 17:43:00,785 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:44:03,833 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:06,838 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:40,258 INFO
org.apache.hadoop.mapred.TaskRunner: Process Thread Dump:
Communication exception
> 8 active threads
> Thread 13 (Comm thread for
task_200710172336_0016_m_000071_0):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 4128
> Stack:
> sun.management.ThreadImpl.getThreadInfo0(Native
Method)
>
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(Refle
ctionUtils.java:114)
>
org.apache.hadoop.util.ReflectionUtils.logThreadInfo(Reflect
ionUtils.java:162)
> org.apache.hadoop.mapred.Task$1.run(Task.java:315)
> java.lang.Thread.run(Thread.java:619)
> Thread 12
(org.apache.hadoop.dfs.DFSClient$LeaseChecker 141b571):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 6403
> Stack:
> java.lang.Thread.sleep(Native Method)
>
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.j
ava:558)
> java.lang.Thread.run(Thread.java:619)
> Thread 9 (IPC Client connection to /127.0.0.1:49458):
> State: RUNNABLE
> Blocked count: 21
> Waited count: 2063
> Stack:
> java.net.SocketInputStream.socketRead0(Native
Method)
>
java.net.SocketInputStream.read(SocketInputStream.java:129)
>
java.io.FilterInputStream.read(FilterInputStream.java:116)
>
org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:1
81)
>
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
>
java.io.BufferedInputStream.read(BufferedInputStream.java:23
7)
>
java.io.DataInputStream.readInt(DataInputStream.java:370)
>
org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
> Thread 8 (org.apache.hadoop.io.ObjectWritable
Connection Culler):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 6402
> Stack:
> java.lang.Thread.sleep(Native Method)
>
org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.jav
a:404)
> Thread 4 (Signal Dispatcher):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 0
> Stack:
> Thread 3 (Finalizer):
> State: WAITING
> Blocked count: 398
> Waited count: 2270
> Waiting on java.lang.ref.ReferenceQueue$Lock c278b5
> Stack:
> java.lang.Object.wait(Native Method)
>
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:1
59)
> Thread 2 (Reference Handler):
> State: WAITING
> Blocked count: 257
> Waited count: 2269
> Waiting on java.lang.ref.Reference$Lock 1c66ec7
> Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
>
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:
116)
> Thread 1 (main):
> State: RUNNABLE
> Blocked count: 1
> Waited count: 10
> Stack:
> java.io.FileInputStream.readBytes(Native Method)
>
java.io.FileInputStream.read(FileInputStream.java:199)
>
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStre
am.read(RawLocalFileSystem.java:105)
>
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
>
java.io.BufferedInputStream.read1(BufferedInputStream.java:2
58)
>
java.io.BufferedInputStream.read(BufferedInputStream.java:31
7)
>
java.io.DataInputStream.read(DataInputStream.java:132)
>
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
>
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.readChunk(ChecksumFileSystem.java:200)
>
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInpu
tChecker.java:234)
>
org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java
:176)
>
org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.jav
a:193)
>
org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java
:157)
>
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
>
org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java
:359)
>
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.seek(ChecksumFileSystem.java:254)
>
org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStrea
m.java:37)
>
org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.j
ava:1793)
>
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1217)
>
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1142)
> 2007-10-18 17:45:40,258 WARN
org.apache.hadoop.mapred.TaskRunner: Last retry, killing
task_200710172336_0016_m_000071_0
> Log of task that could not start:
> 2007-10-18 17:43:55,766 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 1 time(s).
> 2007-10-18 17:43:56,768 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 2 time(s).
> 2007-10-18 17:43:57,770 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 3 time(s).
> 2007-10-18 17:43:58,772 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 4 time(s).
> 2007-10-18 17:43:59,774 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 5 time(s).
> 2007-10-18 17:44:00,776 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 6 time(s).
> 2007-10-18 17:44:01,778 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 7 time(s).
> 2007-10-18 17:44:02,780 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 8 time(s).
> 2007-10-18 17:44:03,783 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 9 time(s).
> 2007-10-18 17:44:04,785 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 10 time(s).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Updated: (HADOOP-2076) extensive map
tasks failures because of
SocketTimeoutException during |
  United States |
2007-10-19 21:27:50 |
[ https://issues.apache.org/jira/browse/HADOOP-2076?page=co
m.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
a> ]
Christian Kunz updated HADOOP-2076:
-----------------------------------
With input from Owen and Devaraj, I will try to use less
jetty threads to reduce contention and see whether this
helps
> extensive map tasks failures because of
SocketTimeoutException during statusUpdate
>
------------------------------------------------------------
----------------------
>
> Key: HADOOP-2076
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-2076
> Project: Hadoop
> Issue Type: Bug
> Affects Versions: 0.16.0
> Environment: Oct 17 #718 nightly build with
patches 2033 and 2048
> Reporter: Christian Kunz
>
> A job with 3600 tasks on a cluster of 1350 nodes (up 3
tasks per node) shows extensive map tasks failures because
of connection timeouts at the end of the task (c++
application using pipes interface completed successfully)
> More than 600 tasks failed, slowing down the job
because of retries. Only a portion of the tasks fail because
of the timeout issue, but they spawn other failures because
retries and speculatively executed tasks cannot even get a
connection and fail just after a few seconds.
> JobTracker is running with 60 handlers. We allow up to
10 attempts for maps.
> I attach the log of a task failing because of timeout
(which includes a thread dump), and the log of one task
which could not start.
> 2007-10-18 15:58:41,743 INFO
org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM
Metrics with processName=MAP, sessionId=
> 2007-10-18 15:58:41,827 INFO
org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
> 2007-10-18 16:12:28,918 INFO
org.apache.hadoop.util.NativeCodeLoader: Loaded the
native-hadoop library
> 2007-10-18 16:12:28,920 INFO
org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully
loaded & initialized native-zlib library
> 2007-10-18 17:43:00,785 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:44:03,833 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:06,838 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:40,258 INFO
org.apache.hadoop.mapred.TaskRunner: Process Thread Dump:
Communication exception
> 8 active threads
> Thread 13 (Comm thread for
task_200710172336_0016_m_000071_0):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 4128
> Stack:
> sun.management.ThreadImpl.getThreadInfo0(Native
Method)
>
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(Refle
ctionUtils.java:114)
>
org.apache.hadoop.util.ReflectionUtils.logThreadInfo(Reflect
ionUtils.java:162)
> org.apache.hadoop.mapred.Task$1.run(Task.java:315)
> java.lang.Thread.run(Thread.java:619)
> Thread 12
(org.apache.hadoop.dfs.DFSClient$LeaseChecker 141b571):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 6403
> Stack:
> java.lang.Thread.sleep(Native Method)
>
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.j
ava:558)
> java.lang.Thread.run(Thread.java:619)
> Thread 9 (IPC Client connection to /127.0.0.1:49458):
> State: RUNNABLE
> Blocked count: 21
> Waited count: 2063
> Stack:
> java.net.SocketInputStream.socketRead0(Native
Method)
>
java.net.SocketInputStream.read(SocketInputStream.java:129)
>
java.io.FilterInputStream.read(FilterInputStream.java:116)
>
org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:1
81)
>
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
>
java.io.BufferedInputStream.read(BufferedInputStream.java:23
7)
>
java.io.DataInputStream.readInt(DataInputStream.java:370)
>
org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
> Thread 8 (org.apache.hadoop.io.ObjectWritable
Connection Culler):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 6402
> Stack:
> java.lang.Thread.sleep(Native Method)
>
org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.jav
a:404)
> Thread 4 (Signal Dispatcher):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 0
> Stack:
> Thread 3 (Finalizer):
> State: WAITING
> Blocked count: 398
> Waited count: 2270
> Waiting on java.lang.ref.ReferenceQueue$Lock c278b5
> Stack:
> java.lang.Object.wait(Native Method)
>
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:1
59)
> Thread 2 (Reference Handler):
> State: WAITING
> Blocked count: 257
> Waited count: 2269
> Waiting on java.lang.ref.Reference$Lock 1c66ec7
> Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
>
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:
116)
> Thread 1 (main):
> State: RUNNABLE
> Blocked count: 1
> Waited count: 10
> Stack:
> java.io.FileInputStream.readBytes(Native Method)
>
java.io.FileInputStream.read(FileInputStream.java:199)
>
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStre
am.read(RawLocalFileSystem.java:105)
>
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
>
java.io.BufferedInputStream.read1(BufferedInputStream.java:2
58)
>
java.io.BufferedInputStream.read(BufferedInputStream.java:31
7)
>
java.io.DataInputStream.read(DataInputStream.java:132)
>
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
>
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.readChunk(ChecksumFileSystem.java:200)
>
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInpu
tChecker.java:234)
>
org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java
:176)
>
org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.jav
a:193)
>
org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java
:157)
>
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
>
org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java
:359)
>
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.seek(ChecksumFileSystem.java:254)
>
org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStrea
m.java:37)
>
org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.j
ava:1793)
>
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1217)
>
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1142)
> 2007-10-18 17:45:40,258 WARN
org.apache.hadoop.mapred.TaskRunner: Last retry, killing
task_200710172336_0016_m_000071_0
> Log of task that could not start:
> 2007-10-18 17:43:55,766 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 1 time(s).
> 2007-10-18 17:43:56,768 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 2 time(s).
> 2007-10-18 17:43:57,770 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 3 time(s).
> 2007-10-18 17:43:58,772 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 4 time(s).
> 2007-10-18 17:43:59,774 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 5 time(s).
> 2007-10-18 17:44:00,776 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 6 time(s).
> 2007-10-18 17:44:01,778 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 7 time(s).
> 2007-10-18 17:44:02,780 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 8 time(s).
> 2007-10-18 17:44:03,783 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 9 time(s).
> 2007-10-18 17:44:04,785 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 10 time(s).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Commented: (HADOOP-2076) extensive map
tasks failures because of
SocketTimeoutException durin |
  United States |
2007-11-05 13:35:50 |
[ https://issues.apache.org/jira/browse
/HADOOP-2076?page=com.atlassian.jira.plugin.system.issuetabp
anels:comment-tabpanel#action_12540257 ]
Christian Kunz commented on HADOOP-2076:
----------------------------------------
A couple of days ago, I ran the job successfully with 20
(instead of 30) jetty threads (although with some long tail
because of the socket connection timeouts).
To save time I would like to not repeat the job as this
time, because we made some progress in a batch of jobs and
are further down the line.
Once we hit the problem again (requires a job with some
substantial intermediate output), I will try your
suggestion.
> extensive map tasks failures because of
SocketTimeoutException during statusUpdate
>
------------------------------------------------------------
----------------------
>
> Key: HADOOP-2076
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-2076
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.0
> Environment: Oct 17 #718 nightly build with
patches 2033 and 2048
> Reporter: Christian Kunz
> Priority: Blocker
>
> A job with 3600 tasks on a cluster of 1350 nodes (up 3
tasks per node) shows extensive map tasks failures because
of connection timeouts at the end of the task (c++
application using pipes interface completed successfully)
> More than 600 tasks failed, slowing down the job
because of retries. Only a portion of the tasks fail because
of the timeout issue, but they spawn other failures because
retries and speculatively executed tasks cannot even get a
connection and fail just after a few seconds.
> JobTracker is running with 60 handlers. We allow up to
10 attempts for maps.
> I attach the log of a task failing because of timeout
(which includes a thread dump), and the log of one task
which could not start.
> 2007-10-18 15:58:41,743 INFO
org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM
Metrics with processName=MAP, sessionId=
> 2007-10-18 15:58:41,827 INFO
org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
> 2007-10-18 16:12:28,918 INFO
org.apache.hadoop.util.NativeCodeLoader: Loaded the
native-hadoop library
> 2007-10-18 16:12:28,920 INFO
org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully
loaded & initialized native-zlib library
> 2007-10-18 17:43:00,785 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:44:03,833 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:06,838 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:40,258 INFO
org.apache.hadoop.mapred.TaskRunner: Process Thread Dump:
Communication exception
> 8 active threads
> Thread 13 (Comm thread for
task_200710172336_0016_m_000071_0):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 4128
> Stack:
> sun.management.ThreadImpl.getThreadInfo0(Native
Method)
>
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(Refle
ctionUtils.java:114)
>
org.apache.hadoop.util.ReflectionUtils.logThreadInfo(Reflect
ionUtils.java:162)
> org.apache.hadoop.mapred.Task$1.run(Task.java:315)
> java.lang.Thread.run(Thread.java:619)
> Thread 12
(org.apache.hadoop.dfs.DFSClient$LeaseChecker 141b571):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 6403
> Stack:
> java.lang.Thread.sleep(Native Method)
>
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.j
ava:558)
> java.lang.Thread.run(Thread.java:619)
> Thread 9 (IPC Client connection to /127.0.0.1:49458):
> State: RUNNABLE
> Blocked count: 21
> Waited count: 2063
> Stack:
> java.net.SocketInputStream.socketRead0(Native
Method)
>
java.net.SocketInputStream.read(SocketInputStream.java:129)
>
java.io.FilterInputStream.read(FilterInputStream.java:116)
>
org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:1
81)
>
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
>
java.io.BufferedInputStream.read(BufferedInputStream.java:23
7)
>
java.io.DataInputStream.readInt(DataInputStream.java:370)
>
org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
> Thread 8 (org.apache.hadoop.io.ObjectWritable
Connection Culler):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 6402
> Stack:
> java.lang.Thread.sleep(Native Method)
>
org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.jav
a:404)
> Thread 4 (Signal Dispatcher):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 0
> Stack:
> Thread 3 (Finalizer):
> State: WAITING
> Blocked count: 398
> Waited count: 2270
> Waiting on java.lang.ref.ReferenceQueue$Lock c278b5
> Stack:
> java.lang.Object.wait(Native Method)
>
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:1
59)
> Thread 2 (Reference Handler):
> State: WAITING
> Blocked count: 257
> Waited count: 2269
> Waiting on java.lang.ref.Reference$Lock 1c66ec7
> Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
>
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:
116)
> Thread 1 (main):
> State: RUNNABLE
> Blocked count: 1
> Waited count: 10
> Stack:
> java.io.FileInputStream.readBytes(Native Method)
>
java.io.FileInputStream.read(FileInputStream.java:199)
>
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStre
am.read(RawLocalFileSystem.java:105)
>
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
>
java.io.BufferedInputStream.read1(BufferedInputStream.java:2
58)
>
java.io.BufferedInputStream.read(BufferedInputStream.java:31
7)
>
java.io.DataInputStream.read(DataInputStream.java:132)
>
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
>
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.readChunk(ChecksumFileSystem.java:200)
>
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInpu
tChecker.java:234)
>
org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java
:176)
>
org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.jav
a:193)
>
org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java
:157)
>
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
>
org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java
:359)
>
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.seek(ChecksumFileSystem.java:254)
>
org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStrea
m.java:37)
>
org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.j
ava:1793)
>
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1217)
>
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1142)
> 2007-10-18 17:45:40,258 WARN
org.apache.hadoop.mapred.TaskRunner: Last retry, killing
task_200710172336_0016_m_000071_0
> Log of task that could not start:
> 2007-10-18 17:43:55,766 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 1 time(s).
> 2007-10-18 17:43:56,768 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 2 time(s).
> 2007-10-18 17:43:57,770 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 3 time(s).
> 2007-10-18 17:43:58,772 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 4 time(s).
> 2007-10-18 17:43:59,774 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 5 time(s).
> 2007-10-18 17:44:00,776 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 6 time(s).
> 2007-10-18 17:44:01,778 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 7 time(s).
> 2007-10-18 17:44:02,780 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 8 time(s).
> 2007-10-18 17:44:03,783 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 9 time(s).
> 2007-10-18 17:44:04,785 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 10 time(s).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
| Updated: (HADOOP-2076) extensive map
tasks failures because of
SocketTimeoutException during |
  United States |
2007-11-06 10:36:50 |
[ https://issues.apache.org/jira/browse/HADOOP-2076?page=co
m.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
a> ]
Sameer Paranjpye updated HADOOP-2076:
-------------------------------------
Fix Version/s: 0.15.1
> extensive map tasks failures because of
SocketTimeoutException during statusUpdate
>
------------------------------------------------------------
----------------------
>
> Key: HADOOP-2076
> URL: htt
ps://issues.apache.org/jira/browse/HADOOP-2076
> Project: Hadoop
> Issue Type: Bug
> Components: mapred
> Affects Versions: 0.16.0
> Environment: Oct 17 #718 nightly build with
patches 2033 and 2048
> Reporter: Christian Kunz
> Priority: Blocker
> Fix For: 0.15.1
>
>
> A job with 3600 tasks on a cluster of 1350 nodes (up 3
tasks per node) shows extensive map tasks failures because
of connection timeouts at the end of the task (c++
application using pipes interface completed successfully)
> More than 600 tasks failed, slowing down the job
because of retries. Only a portion of the tasks fail because
of the timeout issue, but they spawn other failures because
retries and speculatively executed tasks cannot even get a
connection and fail just after a few seconds.
> JobTracker is running with 60 handlers. We allow up to
10 attempts for maps.
> I attach the log of a task failing because of timeout
(which includes a thread dump), and the log of one task
which could not start.
> 2007-10-18 15:58:41,743 INFO
org.apache.hadoop.metrics.jvm.JvmMetrics: Initializing JVM
Metrics with processName=MAP, sessionId=
> 2007-10-18 15:58:41,827 INFO
org.apache.hadoop.mapred.MapTask: numReduceTasks: 3600
> 2007-10-18 16:12:28,918 INFO
org.apache.hadoop.util.NativeCodeLoader: Loaded the
native-hadoop library
> 2007-10-18 16:12:28,920 INFO
org.apache.hadoop.io.compress.zlib.ZlibFactory: Successfully
loaded & initialized native-zlib library
> 2007-10-18 17:43:00,785 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:44:03,833 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:06,838 INFO
org.apache.hadoop.mapred.TaskRunner: Communication
exception: java.net.SocketTimeoutException: timed out
waiting for rpc response
> at org.apache.hadoop.ipc.Client.call(Client.java:484)
> at
org.apache.hadoop.ipc.RPC$Invoker.invoke(RPC.java:184)
> at
org.apache.hadoop.mapred.$Proxy0.statusUpdate(Unknown
Source)
> at org.apache.hadoop.mapred.Task$1.run(Task.java:293)
> at java.lang.Thread.run(Thread.java:619)
> 2007-10-18 17:45:40,258 INFO
org.apache.hadoop.mapred.TaskRunner: Process Thread Dump:
Communication exception
> 8 active threads
> Thread 13 (Comm thread for
task_200710172336_0016_m_000071_0):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 4128
> Stack:
> sun.management.ThreadImpl.getThreadInfo0(Native
Method)
>
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:147)
>
sun.management.ThreadImpl.getThreadInfo(ThreadImpl.java:123)
>
org.apache.hadoop.util.ReflectionUtils.printThreadInfo(Refle
ctionUtils.java:114)
>
org.apache.hadoop.util.ReflectionUtils.logThreadInfo(Reflect
ionUtils.java:162)
> org.apache.hadoop.mapred.Task$1.run(Task.java:315)
> java.lang.Thread.run(Thread.java:619)
> Thread 12
(org.apache.hadoop.dfs.DFSClient$LeaseChecker 141b571):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 6403
> Stack:
> java.lang.Thread.sleep(Native Method)
>
org.apache.hadoop.dfs.DFSClient$LeaseChecker.run(DFSClient.j
ava:558)
> java.lang.Thread.run(Thread.java:619)
> Thread 9 (IPC Client connection to /127.0.0.1:49458):
> State: RUNNABLE
> Blocked count: 21
> Waited count: 2063
> Stack:
> java.net.SocketInputStream.socketRead0(Native
Method)
>
java.net.SocketInputStream.read(SocketInputStream.java:129)
>
java.io.FilterInputStream.read(FilterInputStream.java:116)
>
org.apache.hadoop.ipc.Client$Connection$1.read(Client.java:1
81)
>
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
>
java.io.BufferedInputStream.read(BufferedInputStream.java:23
7)
>
java.io.DataInputStream.readInt(DataInputStream.java:370)
>
org.apache.hadoop.ipc.Client$Connection.run(Client.java:258)
> Thread 8 (org.apache.hadoop.io.ObjectWritable
Connection Culler):
> State: TIMED_WAITING
> Blocked count: 0
> Waited count: 6402
> Stack:
> java.lang.Thread.sleep(Native Method)
>
org.apache.hadoop.ipc.Client$ConnectionCuller.run(Client.jav
a:404)
> Thread 4 (Signal Dispatcher):
> State: RUNNABLE
> Blocked count: 0
> Waited count: 0
> Stack:
> Thread 3 (Finalizer):
> State: WAITING
> Blocked count: 398
> Waited count: 2270
> Waiting on java.lang.ref.ReferenceQueue$Lock c278b5
> Stack:
> java.lang.Object.wait(Native Method)
>
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:116)
>
java.lang.ref.ReferenceQueue.remove(ReferenceQueue.java:132)
>
java.lang.ref.Finalizer$FinalizerThread.run(Finalizer.java:1
59)
> Thread 2 (Reference Handler):
> State: WAITING
> Blocked count: 257
> Waited count: 2269
> Waiting on java.lang.ref.Reference$Lock 1c66ec7
> Stack:
> java.lang.Object.wait(Native Method)
> java.lang.Object.wait(Object.java:485)
>
java.lang.ref.Reference$ReferenceHandler.run(Reference.java:
116)
> Thread 1 (main):
> State: RUNNABLE
> Blocked count: 1
> Waited count: 10
> Stack:
> java.io.FileInputStream.readBytes(Native Method)
>
java.io.FileInputStream.read(FileInputStream.java:199)
>
org.apache.hadoop.fs.RawLocalFileSystem$LocalFSFileInputStre
am.read(RawLocalFileSystem.java:105)
>
java.io.BufferedInputStream.fill(BufferedInputStream.java:21
8)
>
java.io.BufferedInputStream.read1(BufferedInputStream.java:2
58)
>
java.io.BufferedInputStream.read(BufferedInputStream.java:31
7)
>
java.io.DataInputStream.read(DataInputStream.java:132)
>
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
>
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.readChunk(ChecksumFileSystem.java:200)
>
org.apache.hadoop.fs.FSInputChecker.readChecksumChunk(FSInpu
tChecker.java:234)
>
org.apache.hadoop.fs.FSInputChecker.fill(FSInputChecker.java
:176)
>
org.apache.hadoop.fs.FSInputChecker.read1(FSInputChecker.jav
a:193)
>
org.apache.hadoop.fs.FSInputChecker.read(FSInputChecker.java
:157)
>
org.apache.hadoop.fs.FSInputChecker.readFully(FSInputChecker
.java:378)
>
org.apache.hadoop.fs.FSInputChecker.seek(FSInputChecker.java
:359)
>
org.apache.hadoop.fs.ChecksumFileSystem$ChecksumFSInputCheck
er.seek(ChecksumFileSystem.java:254)
>
org.apache.hadoop.fs.FSDataInputStream.seek(FSDataInputStrea
m.java:37)
>
org.apache.hadoop.io.SequenceFile$Reader.seek(SequenceFile.j
ava:1793)
>
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1217)
>
org.apache.hadoop.io.SequenceFile$Reader.(SequenceFile.java:
1142)
> 2007-10-18 17:45:40,258 WARN
org.apache.hadoop.mapred.TaskRunner: Last retry, killing
task_200710172336_0016_m_000071_0
> Log of task that could not start:
> 2007-10-18 17:43:55,766 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 1 time(s).
> 2007-10-18 17:43:56,768 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 2 time(s).
> 2007-10-18 17:43:57,770 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 3 time(s).
> 2007-10-18 17:43:58,772 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 4 time(s).
> 2007-10-18 17:43:59,774 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 5 time(s).
> 2007-10-18 17:44:00,776 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 6 time(s).
> 2007-10-18 17:44:01,778 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 7 time(s).
> 2007-10-18 17:44:02,780 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 8 time(s).
> 2007-10-18 17:44:03,783 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 9 time(s).
> 2007-10-18 17:44:04,785 INFO
org.apache.hadoop.ipc.Client: Retrying connect to server:
/127.0.0.1:53972. Already tried 10 time(s).
--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue
online.
|
|
[1-6]
|
|