Thread: LAM: RHEL4 and lam 7.0.6 problem




LAM: RHEL4 and lam 7.0.6 problem
user name
2006-05-17 13:21:52
Hi all!

Maybe someone can help me with this issue.

I have a problem with LAM/MPI 7.0.6 on RHEL4 that did not occur on
Red Hat 9 with LAM 6.5.9.
We use simulation packages (e.g. ls-dyna) on dual-processor machines.
To take full advantage of the 64-bit architecture and the OS with this
software, we need to run it in parallel mode, which is done through
LAM/MPI (the developers compiled the software specifically for use with
LAM/MPI on the 64-bit EM64T architecture).

To do this, I need to start the LAM/MPI runtime by issuing the "lamboot"
command, which should start the MPI processes and enable both CPUs.

When I issue "lamboot", I get the following message:

$ lamboot -v

  LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n-1<4268> ssi:boot:base:linear: booting n0 (localhost)
n-1<4268> ssi:boot:base:linear: finished

I take the above to mean that the process exited without enabling node1,
and that the initialization therefore fails.
At first I thought this was because rsh returned some messages when I
logged in:

$ rsh redhat2
connect to address 192.168.1.11: Connection refused
Trying krb4 rlogin...
connect to address 192.168.1.11: Connection refused
trying normal rlogin (/usr/bin/rlogin)
Last login: Thu May 11 11:22:38 from redhat15

After fiddling enough and managing to get rid of the above, the problem
still persists.
When I try to run the job, it exits miserably...
The lamboot -d command returns the following:

$ lamboot $HOME/cluster -d
n-1<8737> ssi:boot: Opening
n-1<8737> ssi:boot: opening module globus
n-1<8737> ssi:boot: initializing module globus
n-1<8737> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<8737> ssi:boot: module not available: globus
n-1<8737> ssi:boot: opening module rsh
n-1<8737> ssi:boot: initializing module rsh
n-1<8737> ssi:boot:rsh: module initializing
n-1<8737> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
n-1<8737> ssi:boot:rsh:username: <same>
n-1<8737> ssi:boot:rsh:verbose: 1000
n-1<8737> ssi:boot:rsh:algorithm: linear
n-1<8737> ssi:boot:rsh:priority: 10
n-1<8737> ssi:boot: module available: rsh, priority: 10
n-1<8737> ssi:boot: finalizing module globus
n-1<8737> ssi:boot:globus: finalizing
n-1<8737> ssi:boot: closing module globus
n-1<8737> ssi:boot: Selected boot module rsh

LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University

n-1<8737> ssi:boot:base: looking for boot schema in following directories:
n-1<8737> ssi:boot:base:   <current directory>
n-1<8737> ssi:boot:base:   $TROLLIUSHOME/etc
n-1<8737> ssi:boot:base:   $LAMHOME/etc
n-1<8737> ssi:boot:base:   /etc/lam
n-1<8737> ssi:boot:base: looking for boot schema file:
n-1<8737> ssi:boot:base:   /home/catusr/cluster
n-1<8737> ssi:boot:base: found boot schema: /home/catusr/cluster
n-1<8737> ssi:boot:rsh: found the following hosts:
n-1<8737> ssi:boot:rsh:   n0 redhat2 (cpu=2)
n-1<8737> ssi:boot:rsh: resolved hosts:
n-1<8737> ssi:boot:rsh:   n0 redhat2 --> 192.168.1.11 (origin)
n-1<8737> ssi:boot:rsh: starting RTE procs
n-1<8737> ssi:boot:base:linear: starting
n-1<8737> ssi:boot:base:server: opening server TCP socket
n-1<8737> ssi:boot:base:server: opened port 33121
n-1<8737> ssi:boot:base:linear: booting n0 (redhat2)
n-1<8737> ssi:boot:rsh: starting lamd on (redhat2)
n-1<8737> ssi:boot:rsh: starting on n0 (redhat2): hboot -t -c lam-conf.lamd -d -I -H 192.168.1.11 -P 33121 -n 0 -o 0
n-1<8737> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-catusrredhat2/lam-killfile
tkill: removing socket file ...
tkill: socket file: /tmp/lam-catusrredhat2/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-catusrredhat2/lam-io-socket
tkill: f_kill = "/tmp/lam-catusrredhat2/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 8718 ...
tkill: killed
tkill: all finished
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1]   8740 lamd -H 192.168.1.11 -P 33121 -n 0 -o 0 -d
n-1<8737> ssi:boot:rsh: successfully launched on n0 (redhat2)
n-1<8737> ssi:boot:base:server: expecting connection from finite list
n-1<8740> ssi:boot: Opening
n-1<8740> ssi:boot: opening module globus
n-1<8740> ssi:boot: initializing module globus
n-1<8740> ssi:boot:globus: globus-job-run not found, globus boot will not run
n-1<8740> ssi:boot: module not available: globus
n-1<8740> ssi:boot: opening module rsh
n-1<8740> ssi:boot: initializing module rsh
n-1<8740> ssi:boot:rsh: module initializing
n-1<8740> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
n-1<8740> ssi:boot:rsh:username: <same>
n-1<8740> ssi:boot:rsh:verbose: 1000
n-1<8740> ssi:boot:rsh:algorithm: linear
n-1<8740> ssi:boot:rsh:priority: 10
n-1<8740> ssi:boot: module available: rsh, priority: 10
n-1<8740> ssi:boot: finalizing module globus
n-1<8740> ssi:boot:globus: finalizing
n-1<8740> ssi:boot: closing module globus
n-1<8740> ssi:boot: Selected boot module rsh
n-1<8737> ssi:boot:base:server: got connection from 192.168.1.11
n-1<8737> ssi:boot:base:server: this connection is expected (n0)
n-1<8737> ssi:boot:base:server: remote lamd is at 192.168.1.11:32772
n-1<8737> ssi:boot:base:server: closing server socket
n-1<8737> ssi:boot:base:server: connecting to lamd at 192.168.1.11:33122
n-1<8737> ssi:boot:base:server: connected
n-1<8737> ssi:boot:base:server: sending number of links (1)
n-1<8737> ssi:boot:base:server: sending info: n0 (redhat2)
n-1<8737> ssi:boot:base:server: finished sending
n-1<8737> ssi:boot:base:server: disconnected from 192.168.1.11:33122
n-1<8737> ssi:boot:base:linear: finished
n-1<8737> ssi:boot:rsh: all RTE procs started
n-1<8737> ssi:boot:rsh: finalizing
n-1<8737> ssi:boot: Closing
n-1<8740> ssi:boot:rsh: finalizing
n-1<8740> ssi:boot: Closing

It looks to me as if the LAM process dies without any evident reason.

I had a look at /tmp/lam-debug-log.txt and I can see that the process
exits, but without telling me what is wrong with it all............ :-(
(The lam-debug-log.txt is inline at the bottom of the message.)

Does anybody have an idea on how to solve the problem ?
Any help will be greatly appreciated !

Thank you
Cheers



started (7.0.6), uid 300, gid 304
kernel:  initialized
Link 0: node: 0, cpus: 2, type: 384, ip: 192.168.1.11
kio_req: new client on fd=13
kouter: attached process pid=4560, pri=1095, fd=13
flatd: flqload - successfully created file /tmp/lam-catusrredhat2/lam-flatd0
flatd: flqload - file descriptor 14
flatd: flqload - successfully appended 74 bytes to /tmp/lam-catusrredhat2/lam-flatd0
kenyad: pqcreating with rtf 0x79010
kenyad: looking for executable "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_dynamic" in directory "/usr2/CAE"
kenyad: found "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_dynamic"
kenyad: creating new user process...
kenyad: attempting to receive stdout/stderr file descriptors
kenyad: recv_stdio_fds: happiness
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory /usr2/CAE
kenyad: fork/exec succeeded, pid 4561, index 11, rtf 0x79012
kenyad: create succeeded, process running
flatd: flqload - successfully created file /tmp/lam-catusrredhat2/lam-flatd1
flatd: flqload - file descriptor 14
flatd: flqload - successfully appended 74 bytes to /tmp/lam-catusrredhat2/lam-flatd1
kenyad: pqcreating with rtf 0x79010
kenyad: looking for executable "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_dynamic" in directory "/usr2/CAE"
kenyad: found "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_dynamic"
kenyad: creating new user process...
kenyad: attempting to receive stdout/stderr file descriptors
kenyad: recv_stdio_fds: happiness
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory /usr2/CAE
kenyad: fork/exec succeeded, pid 4562, index 12, rtf 0x79012
kenyad: create succeeded, process running
died: caught child death; trying to detach
kouter: kqdetach detached process pid=4560
kouter: kqdetach calling kio_close
kouter: kqdetach calling knuke
kio_req: new client on fd=13
kouter: attached process pid=4563, pri=1095, fd=13
flatd: flqload - successfully created file /tmp/lam-catusrredhat2/lam-flatd2
flatd: flqload - file descriptor 14
flatd: flqload - successfully appended 73 bytes to /tmp/lam-catusrredhat2/lam-flatd2
kenyad: pqcreating with rtf 0x79010
kenyad: looking for executable "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_static" in directory "/usr2/CAE"
kenyad: found "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_static"
kenyad: creating new user process...
kenyad: attempting to receive stdout/stderr file descriptors
kenyad: recv_stdio_fds: happiness
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory /usr2/CAE
kenyad: fork/exec succeeded, pid 4564, index 11, rtf 0x79012
kenyad: create succeeded, process running
flatd: flqload - successfully created file /tmp/lam-catusrredhat2/lam-flatd3
flatd: flqload - file descriptor 14
flatd: flqload - successfully appended 73 bytes to /tmp/lam-catusrredhat2/lam-flatd3
kenyad: pqcreating with rtf 0x79010
kenyad: looking for executable "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_static" in directory "/usr2/CAE"
kenyad: found "/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_static"
kenyad: creating new user process...
kenyad: attempting to receive stdout/stderr file descriptors
kenyad: recv_stdio_fds: happiness
kenyad: setting environment variables to pass to new process
kenyad: setting TROLLIUSFD
kenyad: setting TROLLIUSRTF
kenyad: setting LAMJOBID
kenyad: setting LAMKENYAPID
kenyad: setting LAMWORLD
kenyad: setting LAMPARENT
kenyad: setting LAMRANK
kenyad: checking for working directory flag
kenyad: working directory set explicitly
kenyad: running in directory /usr2/CAE
kenyad: fork/exec succeeded, pid 4565, index 12, rtf 0x79012
kenyad: create succeeded, process running
died: caught child death; trying to detach
kouter: kqdetach detached process pid=4563
kouter: kqdetach calling kio_close
kouter: kqdetach calling knuke
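
A log like the one above can be skimmed by counting how many user processes were launched and how many "child death" events followed. The following is only a minimal sketch: the function name summarize_lam_log is made up for illustration, and the two message strings it matches are simply the ones that appear in this particular log, so other LAM versions may word them differently.

```shell
# Count process launches and child-death events in a LAM debug log.
# summarize_lam_log is a hypothetical helper, not part of LAM/MPI.
summarize_lam_log() {
  log=$1
  # grep -c exits non-zero when the count is 0, hence the "|| true".
  started=$(grep -c 'kenyad: fork/exec succeeded' "$log" || true)
  died=$(grep -c 'died: caught child death' "$log" || true)
  echo "started=$started died=$died"
}

# Path taken from the message above; skipped when the file is absent.
if [ -r /tmp/lam-debug-log.txt ]; then
  summarize_lam_log /tmp/lam-debug-log.txt
fi
```

The counts only show that processes exited, not why. The reason would have to come from the application's own output.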





-- 
--------------------------------
|            __    __          | Valter DAL BO
|           /  \ /| |'-.       | e-mail: dalbotesco.it
|          .\__/ || |   |      |
|       _ /  `._ \|_|_.-'      | Tesco TS S.p.A.
|      | /  \__.`=._) (_       | http://www.tesco.it
|      |/ ._/ |"""""""""|     |
|      |'.  `\ |         |     | tel.: +390113011711
|      ;"""/ / |         |     | fax : +390113140362
|       ) /_/| |.-------.|     | mobile: +393357707810
|      '  `-`' "         "     | C.so Tazzoli 10137 Torino ITALY
--------------------------------



_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/
LAM: RHEL4 and lam 7.0.6 problem
user name
2006-05-17 15:48:15
Valter Dal Bo wrote:
> Hi all !
> 
> Maybe someone can help me with this issue.
> 
> I have a problem with lam-mpi using RHEL4 with lam v.7.0.6 that did not
> occour on Redhat9 with lam v.6.5.9.

You can't mix an application built for lam-6.5 with lam-7.0.  I think LSTC
prefers to support lam-7.0.3 (built with --enable-shared etc.) or Intel or
HP MPI for x86-64 systems.  In most cases, you have to rebuild lam yourself,
using the same version your application expects, including matching
--enable-shared or non-shared, and matching 32- or 64-bit compilers.  As
you appear to be using a version built with the ifort fce/8.1 compiler,
that may enter into your consideration.  You should be able to build lam
with gcc/g77 or gcc/gfortran if you aren't relinking your application.
I think you can see how lam support has become more difficult on
32/64-bit systems.
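
One way to spot such a mismatch before running anything is to compare what the executable appears to expect with what is installed. The sketch below assumes that the "lam659" part of the executable name encodes the LAM version it was built against. That is a guess based on the file names in the debug log, not a documented convention, and lam_tag is a hypothetical helper. The laminfo and ldd calls only run if those tools and the binary are present.

```shell
# Extract the "lamNNN" tag from an executable name (assumed naming convention).
lam_tag() {
  printf '%s\n' "$1" | sed -n 's/.*_lam\([0-9][0-9]*\)_.*/\1/p'
}

# Executable path as it appears in the debug log of this thread.
exe=/swcae/ls-dyna/mpp970_s_6763_em64t_linux_lam659_dynamic
built_for=$(lam_tag "$exe")
echo "binary name suggests LAM tag: $built_for"

# Show the installed LAM version, if laminfo is available.
if command -v laminfo >/dev/null 2>&1; then
  laminfo | head -n 5
fi

# For a dynamically linked binary, list the LAM/MPI libraries it asks for.
if [ -x "$exe" ] && command -v ldd >/dev/null 2>&1; then
  ldd "$exe" | grep -i -E 'lam|mpi' || true
fi
```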

LAM: RHEL4 and lam 7.0.6 problem
user name
2006-05-18 03:10:22
On May 17, 2006, at 9:21 AM, Valter Dal Bo wrote:

> Hi all !
>
> Maybe someone can help me with this issue.
>
> I have a problem with lam-mpi using RHEL4 with lam v.7.0.6 that did not
> occour on Redhat9 with lam v.6.5.9.
> We use simulation packages (eg. ls-dyna) and work with biprocessor
> machines.
> In order to take full advantage of the 64bits architecture and the OS
> using the mentioned software, we need to run it in parallel mode; thing
> that would be done by using lam-mpi (the software has been compiled on
> the purpose of using the lam-mpi on 64bits EM64T architecture by the
> developers).
>
> To use the previous, I need to start the lam-mpi process by issuing the
> "lamboot" command which should start the mpi process enabling the 2
> cpus.
>
> Well, issuing the "lamboot" I get the following message:
>
> $ lamboot -v
>
>   LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> n-1<4268> ssi:boot:base:linear: booting n0 (localhost)
> n-1<4268> ssi:boot:base:linear: finished
>
> The above means that the process exited without enabling node1 and
> therefore it fails the initialization.
> At first I thought it was due to the fact that rsh'ing I was getting
> some messages in return:

Actually, since you didn't give any host file to lamboot, it did
exactly what it should.  It defaulted to a hostfile of "localhost"
and started a universe there.  So at this point, LAM/MPI looks ok.
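
For comparison, booting with an explicit host file could look roughly like the sketch below. The schema path is made up for illustration, the host entry is the one reported in the debug output of this thread ("n0 redhat2 (cpu=2)"), and RUN_LAM is a hypothetical guard variable so that nothing is booted unless explicitly requested.

```shell
# Write a one-host boot schema (hypothetical path) matching the host in this thread.
schema="${TMPDIR:-/tmp}/lam-bootschema.example"
printf '%s\n' 'redhat2 cpu=2' > "$schema"

# Boot only when explicitly requested and when lamboot is installed.
if [ "${RUN_LAM:-0}" = 1 ] && command -v lamboot >/dev/null 2>&1; then
  lamboot -v "$schema"
else
  echo "would run: lamboot -v $schema"
fi
```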

> After fiddling enough and managing to get rid of the above, the
> problem still persists.
> Trying to run the job, the latter exits miserably...
> The lamboot -d command returns the following:
>
> $ lamboot $HOME/cluster -d
> n-1<8737> ssi:boot: Opening
> n-1<8737> ssi:boot: opening module globus
> n-1<8737> ssi:boot: initializing module globus
> n-1<8737> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<8737> ssi:boot: module not available: globus
> n-1<8737> ssi:boot: opening module rsh
> n-1<8737> ssi:boot: initializing module rsh
> n-1<8737> ssi:boot:rsh: module initializing
> n-1<8737> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
> n-1<8737> ssi:boot:rsh:username: <same>
> n-1<8737> ssi:boot:rsh:verbose: 1000
> n-1<8737> ssi:boot:rsh:algorithm: linear
> n-1<8737> ssi:boot:rsh:priority: 10
> n-1<8737> ssi:boot: module available: rsh, priority: 10
> n-1<8737> ssi:boot: finalizing module globus
> n-1<8737> ssi:boot:globus: finalizing
> n-1<8737> ssi:boot: closing module globus
> n-1<8737> ssi:boot: Selected boot module rsh
>
> LAM 7.0.6/MPI 2 C++/ROMIO - Indiana University
>
> n-1<8737> ssi:boot:base: looking for boot schema in following directories:
> n-1<8737> ssi:boot:base:   <current directory>
> n-1<8737> ssi:boot:base:   $TROLLIUSHOME/etc
> n-1<8737> ssi:boot:base:   $LAMHOME/etc
> n-1<8737> ssi:boot:base:   /etc/lam
> n-1<8737> ssi:boot:base: looking for boot schema file:
> n-1<8737> ssi:boot:base:   /home/catusr/cluster
> n-1<8737> ssi:boot:base: found boot schema: /home/catusr/cluster
> n-1<8737> ssi:boot:rsh: found the following hosts:
> n-1<8737> ssi:boot:rsh:   n0 redhat2 (cpu=2)
> n-1<8737> ssi:boot:rsh: resolved hosts:
> n-1<8737> ssi:boot:rsh:   n0 redhat2 --> 192.168.1.11 (origin)
> n-1<8737> ssi:boot:rsh: starting RTE procs
> n-1<8737> ssi:boot:base:linear: starting
> n-1<8737> ssi:boot:base:server: opening server TCP socket
> n-1<8737> ssi:boot:base:server: opened port 33121
> n-1<8737> ssi:boot:base:linear: booting n0 (redhat2)
> n-1<8737> ssi:boot:rsh: starting lamd on (redhat2)
> n-1<8737> ssi:boot:rsh: starting on n0 (redhat2): hboot -t -c lam-conf.lamd -d -I -H 192.168.1.11 -P 33121 -n 0 -o 0
> n-1<8737> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-catusrredhat2/lam-killfile
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-catusrredhat2/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-catusrredhat2/lam-io-socket
> tkill: f_kill = "/tmp/lam-catusrredhat2/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 8718 ...
> tkill: killed
> tkill: all finished
> hboot: booting...
> hboot: fork /usr/bin/lamd
> hboot: attempting to execute
> [1]   8740 lamd -H 192.168.1.11 -P 33121 -n 0 -o 0 -d
> n-1<8737> ssi:boot:rsh: successfully launched on n0 (redhat2)
> n-1<8737> ssi:boot:base:server: expecting connection from finite list
> n-1<8740> ssi:boot: Opening
> n-1<8740> ssi:boot: opening module globus
> n-1<8740> ssi:boot: initializing module globus
> n-1<8740> ssi:boot:globus: globus-job-run not found, globus boot will not run
> n-1<8740> ssi:boot: module not available: globus
> n-1<8740> ssi:boot: opening module rsh
> n-1<8740> ssi:boot: initializing module rsh
> n-1<8740> ssi:boot:rsh: module initializing
> n-1<8740> ssi:boot:rsh:agent: /usr/bin/ssh -x -a
> n-1<8740> ssi:boot:rsh:username: <same>
> n-1<8740> ssi:boot:rsh:verbose: 1000
> n-1<8740> ssi:boot:rsh:algorithm: linear
> n-1<8740> ssi:boot:rsh:priority: 10
> n-1<8740> ssi:boot: module available: rsh, priority: 10
> n-1<8740> ssi:boot: finalizing module globus
> n-1<8740> ssi:boot:globus: finalizing
> n-1<8740> ssi:boot: closing module globus
> n-1<8740> ssi:boot: Selected boot module rsh
> n-1<8737> ssi:boot:base:server: got connection from 192.168.1.11
> n-1<8737> ssi:boot:base:server: this connection is expected (n0)
> n-1<8737> ssi:boot:base:server: remote lamd is at 192.168.1.11:32772
> n-1<8737> ssi:boot:base:server: closing server socket
> n-1<8737> ssi:boot:base:server: connecting to lamd at 192.168.1.11:33122
> n-1<8737> ssi:boot:base:server: connected
> n-1<8737> ssi:boot:base:server: sending number of links (1)
> n-1<8737> ssi:boot:base:server: sending info: n0 (redhat2)
> n-1<8737> ssi:boot:base:server: finished sending
> n-1<8737> ssi:boot:base:server: disconnected from 192.168.1.11:33122
> n-1<8737> ssi:boot:base:linear: finished
> n-1<8737> ssi:boot:rsh: all RTE procs started
> n-1<8737> ssi:boot:rsh: finalizing
> n-1<8737> ssi:boot: Closing
> n-1<8740> ssi:boot:rsh: finalizing
> n-1<8740> ssi:boot: Closing
>
> And it looks to me that the lam process dies without any evident
> reasons.
>
> I had a look at the /tmp/lam-debug-log.txt and I can see that the
> process exits but without letting me know what is wrong with it
> all............ :-(  (The lam-debug-log.txt is inline at the bottom of
> the msg....)
>
> Does anybody have an idea on how to solve the problem ?
> Any help will be greatly appreciated !

I'm not sure what you mean - the above information looks perfect.
Lamboot should exit when it's done, and it looks like it finished as
expected.  It started a universe on the node "redhat2" with a "cpu
count" of 2.  Once lamboot is finished, you can run lamnodes to see
what nodes are in the newly booted environment, mpirun to run
processes, and lamhalt to take down the environment.  Is one of
these commands not working properly?
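
To narrow down which step fails, the three commands can be run in sequence after lamboot, stopping at the first one that reports an error. This is only a sketch: lam_smoke_test is a hypothetical helper name, and the program passed to it must be an MPI executable built against the installed LAM version.

```shell
# Run the post-boot commands named above, stopping at the first failure.
# lam_smoke_test is a hypothetical helper, not part of LAM/MPI.
lam_smoke_test() {
  prog=$1        # MPI executable to launch
  np=${2:-2}     # number of processes, defaulting to the 2 CPUs in this thread
  for tool in lamnodes mpirun lamhalt; do
    command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool"; return 1; }
  done
  lamnodes || return 1                              # list nodes in the universe
  mpirun -np "$np" "$prog" || { lamhalt; return 1; } # run the job
  lamhalt                                           # take the universe down
}

# Usage, after a successful lamboot:
#   lam_smoke_test ./my_mpi_program 2
```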

Brian


-- 
   Brian Barrett
   LAM/MPI developer and all around nice guy
   Have a LAM/MPI day: http://www.lam-mpi.org/

