| More info to help debug:
==================================
started (7.1.2), uid 24017, gid 1001
kernel: initialized
Link 0: node: 0, cpus: 1, type: 0, ip: 158.140.147.7, port 41548
Link 1: node: 1, cpus: 1, type: 384, ip: 158.140.147.91, port 35759
kio_req: new client on fd=14
kouter: attached process pid=18210, pri=1095, fd=14
flatd: flqload - successfully created file /tmp/lam-yamend end-cheetah/lam-flatd0
flatd: flqload - file descriptor 15
flatd: flqload - successfully appended 2059 bytes to /tmp/lam-yamend end-cheetah/lam-flatd0
kenyad: pqcreating with rtf 0x1b310
kenyad: looking for executable "/sbox/yamend/r33/amd64_linux24/64/bin/TWTgen" in directory "/afs/tda.cadence.com/project/tg/12/regression/btv2"
kenyad: found "/sbox/yamend/r33/amd64_linux24/64/bin/TWTgen"
kenyad: creating new user process...
kenyad: attempting to receive stdout/stderr file descriptors
kenyad: recv_stdio_fds: happiness
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory /afs/tda.cadence.com/project/tg/12/regression/btv2
kenyad: fork/exec succeeded, pid 18211, index 11, rtf 0x1b312
kenyad: create succeeded, process running
died: caught child death; trying to detach
died: detaching table entry 10
kouter: kqdetach detached process pid=18210
kouter: kqdetach calling kio_close
kouter: kqdetach calling knuke
==================================
==================================
started (7.1.2), uid 24017, gid 1001
kernel: initialized
Link 0: node: 0, cpus: 1, type: 0, ip: 158.140.147.7, port 41548
Link 1: node: 1, cpus: 1, type: 384, ip: 158.140.147.91, port 35759
flatd: flqload - successfully created file /tmp/lam-yamend end-leopard/lam-flatd0
flatd: flqload - file descriptor 16
flatd: flqload - successfully appended 2061 bytes to /tmp/lam-yamend end-leopard/lam-flatd0
kenyad: pqcreating with rtf 0x40b310
kenyad: checking for directory /afs/tda.cadence.com/project/tg/12/regression/btv2
kenyad: looking for executable "/sbox/yamend/r33/amd64_linux24/64/bin/TWTgenfm" in directory "/afs/tda.cadence.com/project/tg/12/regression/btv2"
kenyad: found "/sbox/yamend/r33/amd64_linux24/64/bin/TWTgenfm"
kenyad: creating new user process...
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory /afs/tda.cadence.com/project/tg/12/regression/btv2
kenyad: fork/exec succeeded, pid 11690, index 11, rtf 0x40b312
kenyad: create succeeded, process running
died: caught child death; trying to detach
died: detaching table entry 10
Can you tell me why the slave is dying during MPI_Init?
Thank you,
YoungHui
I've narrowed this problem down to mpirun.c in the otb/mpirun
directory.
This file contains the get_mpi_world function.
After it calls nrecv(msg), it performs the following check:
    if (msg.nh_type == 1) {
        char node[32];

        if (fl_very_verbose)
            printf("mpirun: someone died before MPI_INIT -- rank %d\n",
                   msg.nh_node);
        snprintf(node, sizeof(node), "%d", msg.nh_node);
        show_help("mpirun", "no-init", node, NULL);
        errno = EMPINOINIT;
        return LAMERROR;
    }
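To make the question concrete, the branch above can be mimicked outside LAM with a stand-in message type. This is only a sketch under stated assumptions: `struct mock_nmsg` and `classify_message` are hypothetical names invented for illustration and are not LAM definitions. Only the field names nh_type and nh_node and the comparison against 1 come from the excerpt.

```c
#include <stdio.h>
#include <stddef.h>

/* Hypothetical stand-in for the message received in get_mpi_world.
 * Only the two field names are taken from the excerpt. */
struct mock_nmsg {
    int nh_type;   /* type tag; the excerpt treats 1 as "died before MPI_INIT" */
    int nh_node;   /* printed as the rank in the verbose message */
};

/* Mirrors the shape of the quoted check: returns -1 and fills `node`
 * with the decimal value of nh_node when nh_type is 1; otherwise
 * returns 0 and leaves `node` as an empty string. */
static int classify_message(const struct mock_nmsg *msg, char *node, size_t len)
{
    if (msg->nh_type == 1) {
        snprintf(node, len, "%d", msg->nh_node);
        return -1;
    }
    node[0] = '\0';
    return 0;
}
```

Calling classify_message with nh_type set to 1 takes the error branch regardless of nh_node, so the open question is purely which sender sets that type value.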
Under what conditions is nh_type set to 1 on the nsend side?
One difference I noticed is that PTY_IS_DEFAULT is 0 in 6.3
but 1 in 7.1. What does the PTY support do?
I would appreciate any help you can give me.
Thank you for your prompt attention,
YoungHui Amend
Hi,
I'm in the process of upgrading from version 6.3 to 7.1.
I've got the lam daemons running on my master and slave machines. When I
then execute mpirun with an application schema, I get:
MPI_Init: LAM error: Unknown error 471
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
///////////////////////////////////////////////////////////////////////////////////////
My master and slave processes do call MPI_Init. I think the "Unknown
error 471" message comes from the slave process, which then quits before
my master process gets a chance to call MPI_Init, and that in turn
generates the message about not invoking MPI_INIT before quitting.
This part of the code works fine with version 6.3. Are there changes
between the releases that I'm not aware of?
I've seen some conflicting documentation: one place says MPI_Init needs
to be called by all processes, while another help file says only the
master or one of the slave machines needs to call MPI_Init. In either
case, what is Unknown error 471, and which LAM/MPI source file does it
come from?
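One small check that may help narrow this down: the phrase "Unknown error 471" resembles what the C library's strerror commonly returns for a number it has no text for. If so, 471 would not be an ordinary system errno and would have to be defined elsewhere, possibly by LAM itself. That is an assumption, not something the output above confirms. The sketch below only asks the local C library what it says for a given code; `describe_errno` is a hypothetical helper name.

```c
#include <stdio.h>
#include <string.h>

/* Formats "<code>: <C library text>" into buf and returns buf.
 * strerror is standard C; the exact wording for codes the library
 * does not know is implementation-defined. */
static const char *describe_errno(int code, char *buf, size_t len)
{
    const char *text = strerror(code);
    snprintf(buf, len, "%d: %s", code, text ? text : "(no text)");
    return buf;
}
```

Printing describe_errno(471, buf, sizeof buf) next to a known code such as ENOENT shows whether the local library has a message for 471. If it does not, the definition has to be looked for outside the system headers.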
Here is my command:
mpirun -t -c2c -O -w -x $LAM_EXPORT myapp
where LAM_EXPORT=PATH,LD_LIBRARY_PATH,DISPLAY,LAMHOME
The myapp file contains:
n0 /afs/tda/sti/r33/prod/linux24_64/tools/tb/bin-64/TWTgen parallelprocess=yes experiment=ya lbist=yes
n1 /afs/tda/sti/r33/prod/linux24_64/tools/tb/bin-64/TWTgenfm experiment=ya lbist=yes parallelprocess=yes
I'm running TWTgen on the master and TWTgenfm on the slave. They are the
same program with different entry points. |