
Thread: LAM: bufferd (dtry_send): No child processes




LAM: bufferd (dtry_send): No child processes
Malaysia
2007-10-08 20:57:03

Hi,
 
I was running an MPI timing program (2cz-prog1) through qsub job.1 (attached at the end of message).
A few runs have been successfully executed until the end of the job file.
However there was one job run terminated with the following message (see result1 file for full output):
 
::
bufferd (dtry_send): No child processes
mpirun (rpwait): Connection reset by peer
Broken pipe
::
 
Then I resubmitted the same job file and it was ok. Can someone enlighten me what happened that caused the message above?
 
Thanks.
Daniel.
 
 
======================== begin of job.1 ==============================
#!/bin/bash
#$ -cwd
#$ -o ./result1
#$ -j y
#$ -r y
echo "Job started."
lamboot -v hosts/hostfile
lamnodes
echo --------------------------- RUN 1 ---------------------------
date
# Each mpirun below creates one line of output
mpirun n0-8  ./2cz-prog1 482 1.9584 0.00001 t
mpirun n0-10 ./2cz-prog1 482 1.9584 0.00001 t
mpirun n0-12 ./2cz-prog1 482 1.9584 0.00001 t
mpirun n0-15 ./2cz-prog1 482 1.9584 0.00001 t
date
mpirun n0-8  ./2cz-prog1 962 1.9784 0.00001 t
mpirun n0-10 ./2cz-prog1 962 1.9784 0.00001 t
mpirun n0-12 ./2cz-prog1 962 1.9784 0.00001 t
mpirun n0-15 ./2cz-prog1 962 1.9784 0.00001 t
date
mpirun n0-8  ./2cz-prog1 1442 1.9853 0.00001 t
mpirun n0-10 ./2cz-prog1 1442 1.9853 0.00001 t
mpirun n0-12 ./2cz-prog1 1442 1.9853 0.00001 t
mpirun n0-15 ./2cz-prog1 1442 1.9853 0.00001 t
date
======================== end of job.1 ==============================
 

======================== begin of result1 ==============================
Warning: no access to tty (Bad file descriptor).
Thus no job control in this shell.
Batch job started.
n-1<23242> ssi:boot:base:linear: booting n0 (aurora.local)
n-1<23242> ssi:boot:base:linear: booting n1 (compute-0-0.local)
n-1<23242> ssi:boot:base:linear: booting n2 (compute-0-1.local)
n-1<23242> ssi:boot:base:linear: booting n3 (compute-0-2.local)
n-1<23242> ssi:boot:base:linear: booting n4 (compute-0-3.local)
n-1<23242> ssi:boot:base:linear: booting n5 (compute-0-4.local)
n-1<23242> ssi:boot:base:linear: booting n6 (compute-0-5.local)
n-1<23242> ssi:boot:base:linear: booting n7 (compute-0-6.local)
n-1<23242> ssi:boot:base:linear: booting n8 (compute-0-7.local)
n-1<23242> ssi:boot:base:linear: booting n9 (compute-0-8.local)
n-1<23242> ssi:boot:base:linear: booting n10 (compute-0-9.local)
n-1<23242> ssi:boot:base:linear: booting n11 (compute-0-10.local)
n-1<23242> ssi:boot:base:linear: booting n12 (compute-0-11.local)
n-1<23242> ssi:boot:base:linear: booting n13 (compute-0-12.local)
n-1<23242> ssi:boot:base:linear: booting n14 (compute-0-13.local)
n-1<23242> ssi:boot:base:linear: booting n15 (compute-0-14.local)
n-1<23242> ssi:boot:base:linear: finished
 
LAM 7.1.1/MPI 2 C++/ROMIO - Indiana University
 
n0 aurora.cs.usm.my:1:
n1 compute-0-0.local:1:
n2 compute-0-1.local:1:
n3 compute-0-2.local:1:
n4 compute-0-3.local:1:
n5 compute-0-4.local:1:
n6 compute-0-5.local:1:
n7 compute-0-6.local:1:
n8 compute-0-7.local:1:
n9 compute-0-8.local:1:
n10 compute-0-9.local:1:
n11 compute-0-10.local:1:
n12 compute-0-11.local:1:
n13 compute-0-12.local:1:
n14 compute-0-13.local:1:
n15 compute-0-14.local:1:origin,this_node
--------------------------- RUN 1 ---------------------------
Tue Oct  9 08:48:51 MYT 2007
0.499175 prog1 n=482 nW=8 panel=15 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:51AM
0.494605 prog1 n=482 nW=10 panel=12 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:52AM
0.451702 prog1 n=482 nW=12 panel=10 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:53AM
0.558133 prog1 n=482 nW=15 panel=8 w=1.9584 Itr=305 Theo=263 eps=1e-05 maxE=2.3613e-04 09/10/2007 Tue 08:48:55AM
Tue Oct  9 08:48:55 MYT 2007
2.700609 prog1 n=962 nW=8 panel=30 w=1.9784 Itr=592 Theo=525 eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:48:58AM
2.189707 prog1 n=962 nW=10 panel=24 w=1.9784 Itr=592 Theo=525 eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:49:01AM
1.897054 prog1 n=962 nW=12 panel=20 w=1.9784 Itr=592 Theo=525 eps=1e-05 maxE=5.4471e-04 09/10/2007 Tue 08:49:04AM
bufferd (dtry_send): No child processes
mpirun (rpwait): Connection reset by peer
Broken pipe
Tue Oct  9 08:49:06 MYT 2007
-----------------------------------------------------------------------------
It seems that there is no lamd running on the host compute-0-14.local.
 
This indicates that the LAM/MPI runtime environment is not operating.
The LAM/MPI runtime environment is necessary for the "mpirun" command.
 
Please run the "lamboot" command the start the LAM/MPI runtime
environment.  See the LAM/MPI documentation for how to invoke
"lamboot" across multiple machines.
-----------------------------------------------------------------------------
:
:  .... The "there is no lamd running" message repeats ....
:
 
======================== end of result1 ==============================
 
 
 
 
Re: LAM: bufferd (dtry_send): No child processes
United States
2007-10-09 02:23:22
Here's a guess: your time limit expired on the job and PBS killed a
bunch of LAM daemons / processes such that internal LAM communication
started failing, eventually resulting in mpirun's failing because the
local lamd was already dead.  Or PBS (or some entity) killed lamd's
for some other reason.
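
If lamds really are disappearing mid-job, it may help to have the job
script stop at the first sign of trouble instead of running every
remaining mpirun against a dead universe.  A minimal sketch, assuming
that lamnodes exits non-zero when no lamd is reachable; run_step and
lam_alive are hypothetical helper names, not LAM commands:

```bash
#!/bin/bash
# Sketch: abort a job script as soon as the LAM universe is gone.
# Assumption: lamnodes exits non-zero when no lamd is reachable.

lam_alive() {
  lamnodes > /dev/null 2>&1
}

# run_step NODES PROGRAM [ARGS...]  (hypothetical helper, not part of LAM)
run_step() {
  local nodes=$1; shift
  if ! lam_alive; then
    # Record when the universe was first found dead, then give up.
    echo "$(date): lamd not reachable before 'mpirun $nodes $*'; aborting" >&2
    return 1
  fi
  mpirun "$nodes" "$@"
}

# Usage inside job.1, replacing the bare mpirun lines:
#   run_step n0-8 ./2cz-prog1 482 1.9584 0.00001 t || exit 1
```

The timestamp in the abort message narrows down when the daemons went
away, which can then be compared against the scheduler's own logs.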



-- 
Jeff Squyres
Cisco Systems

_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/

Re: LAM: bufferd (dtry_send): No child processes
Malaysia
2007-10-09 04:59:05
Hi Jeff,

Actually, that unsuccessful run was the last job of a queue (8 jobs in
total) that had lasted for about 11 h 23 min.
The exact running times for the 7 jobs are (hh:mm:ss):
      5:18:11 1:14:32 1:36:44 0:28:36 1:17:43 1:03:38 0:24:07
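
As a quick check, the seven times listed above add up to 11:23:31,
which matches the "about 11 h 23" figure.  A small sketch of the
arithmetic in bash (sum_durations is just an illustrative helper name):

```bash
#!/bin/bash
# Sum hh:mm:ss durations and print the total as h:mm:ss.
sum_durations() {
  local total=0 h m s t
  for t in "$@"; do
    IFS=: read -r h m s <<< "$t"
    # 10# forces base 10 so fields such as "08" are not read as octal.
    total=$(( total + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
  done
  printf '%d:%02d:%02d\n' \
    $(( total / 3600 )) $(( total % 3600 / 60 )) $(( total % 60 ))
}

sum_durations 5:18:11 1:14:32 1:36:44 0:28:36 1:17:43 1:03:38 0:24:07
# prints 11:23:31
```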


Is there a default total time limit set for a user to run a series of jobs?




Re: LAM: bufferd (dtry_send): No child processes
United States
2007-10-09 05:04:19
On Oct 9, 2007, at 11:59 AM, Daniel Ng wrote:

> Actually that instance of unsuccessful run was the last job of a queue
> (total 8 jobs) that had lasted for about 11 h 23 mins.
> The exact running times for the 7 jobs are (hh:mm:ss)
>       5:18:11 1:14:32 1:36:44 0:28:36 1:17:43 1:03:38 0:24:07
>
> Is there a default total time limit set for a user to run a series
> of jobs?

Not within LAM, no.
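
Any such limit would be configured in the batch scheduler rather than
in LAM.  As a rough sketch only: the "#$" directives in job.1 suggest
a Grid Engine style scheduler, and if that is the case, the queue's
run-time limits can be listed with qconf.  The queue name "all.q" and
the helper name show_time_limits are assumptions for illustration:

```bash
#!/bin/bash
# Sketch: print the soft/hard wall-clock limits configured on a queue.
# Assumptions: a Grid Engine style scheduler (suggested by the "#$"
# directives in job.1); "all.q" below is only an example queue name.

show_time_limits() {
  local queue=$1
  # qconf -sq prints the queue configuration, one attribute per line;
  # s_rt and h_rt are the soft and hard run-time limits.
  qconf -sq "$queue" | grep -E '^(s_rt|h_rt)[[:space:]]'
}

# Usage on the submit host:
#   show_time_limits all.q
```

A value of INFINITY for both attributes would mean the queue itself
imposes no run-time limit, so the cause would have to be elsewhere.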

-- 
Jeff Squyres
Cisco Systems
