Thread: Re: LAM: Please help: lamboot call back problem




Re: LAM: Please help: lamboot call back problem
2007-04-02 08:35:49
    Hello Jeff Squyres,

    Thank you for your quick response. That is really odd! I spent some
time looking into the problem.

 (1) You are right about configuring without "memory-manager";

 (2) There is no firewall software running;

 (3) Instead of using the multiprocessor machine, I installed LAM/MPI on
a single-processor machine with the same Itanium processor. I then ran
lamboot with only the Itanium station in the hostfile (with two
workstations, it gives the same error). It produces the same error
message as before, as you can see in the following output:


================================================================ output on the screen:
biscaye 173 :  lamboot -v -d -ssi boot rsh hostfile
n-1<28363> ssi:boot:open: opening
n-1<28363> ssi:boot:open: looking for boot module named rsh
n-1<28363> ssi:boot:open: opening boot module rsh
n-1<28363> ssi:boot:open: opened boot module rsh
n-1<28363> ssi:boot:select: initializing boot module rsh
n-1<28363> ssi:boot:rsh: module initializing
n-1<28363> ssi:boot:rsh:agent: /usr/bin/remsh
n-1<28363> ssi:boot:rsh:username: <same>
n-1<28363> ssi:boot:rsh:verbose: 1000
n-1<28363> ssi:boot:rsh:algorithm: linear
n-1<28363> ssi:boot:rsh:no_n: 0
n-1<28363> ssi:boot:rsh:no_profile: 0
n-1<28363> ssi:boot:rsh:fast: 0
n-1<28363> ssi:boot:rsh:ignore_stderr: 0
n-1<28363> ssi:boot:rsh:priority: 10
n-1<28363> ssi:boot:select: boot module available: rsh, priority: 10
n-1<28363> ssi:boot:select: selected boot module rsh

LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University

n-1<28363> ssi:boot:base: looking for boot schema in following directories:
n-1<28363> ssi:boot:base:   <current directory>
n-1<28363> ssi:boot:base:   $TROLLIUSHOME/etc
n-1<28363> ssi:boot:base:   $LAMHOME/etc
n-1<28363> ssi:boot:base:   /homi/truong/Local/lam-mpi_itanium-inst/etc
n-1<28363> ssi:boot:base: looking for boot schema file:
n-1<28363> ssi:boot:base:   hostfile
n-1<28363> ssi:boot:base: found boot schema: hostfile
n-1<28363> ssi:boot:rsh: found the following hosts:
n-1<28363> ssi:boot:rsh:   n0 biscaye (cpu=1)
n-1<28363> ssi:boot:rsh: resolved hosts:
n-1<28363> ssi:boot:rsh:   n0 biscaye --> 125.1.2.17 (origin)
n-1<28363> ssi:boot:rsh: starting RTE procs
n-1<28363> ssi:boot:base:linear: starting
n-1<28363> ssi:boot:base:server: opening server TCP socket
n-1<28363> ssi:boot:base:server: opened port 62915
n-1<28363> ssi:boot:base:linear: booting n0 (biscaye)
n-1<28363> ssi:boot:rsh: starting lamd on (biscaye)
n-1<28363> ssi:boot:rsh: starting on n0 (biscaye): hboot -t -c lam-conf.lamd -d -v -I -H 125.1.2.17 -P 62915 -n 0 -o 0
n-1<28363> ssi:boot:rsh: launching locally
hboot: performing tkill
hboot: tkill -d
tkill: setting prefix to (null)
tkill: setting suffix to (null)
tkill: got killname back: /tmp/lam-truongbiscaye/lam-killfile
tkill: f_kill = "/tmp/lam-truongbiscaye/lam-killfile"
tkill: killing LAM...
tkill: killing PID (SIGHUP) 28275 ...
tkill:  already dead
tkill: removing socket file ...
tkill: socket file: /tmp/lam-truongbiscaye/lam-kernel-socketd
tkill: removing IO daemon socket file ...
tkill: IO daemon socket file: /tmp/lam-truongbiscaye/lam-io-socket
tkill: all finished
hboot: booting...
hboot: fork /homi/truong/Local/lam-mpi_itanium-inst/bin/lamd
[1]  28366 lamd -H 125.1.2.17 -P 62915 -n 0 -o 0 -d
n-1<28363> ssi:boot:rsh: successfully launched on n0 (biscaye)
n-1<28363> ssi:boot:base:server: expecting connection from finite list
hboot: attempting to execute
-----------------------------------------------------------------------------
The lamboot agent timed out while waiting for the newly-booted process
to call back and indicated that it had successfully booted.

*** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
*** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
*** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
*** MAILING LIST.

As far as LAM could tell, the remote process started properly, but
then never called back.  Possible reasons that this may happen:

        - There are network filters between the lamboot agent host and
          the remote host such that communication on random TCP ports
          is blocked
        - Network routing from the remote host to the local host isn't
          properly configured (this is uncommon)

You can check these things by watching the output from "lamboot -d".

1. On the command line for hboot, there are two important parameters:
   one is the IP address of where the lamboot agent was invoked, the
   other is the port number that the lamboot agent is expecting the
   newly-booted process to call back on (this will be a random
   integer).

2. Manually login to the remote machine and try to telnet to the port
   indicated on the hboot command line.  For example,
       telnet <ipnumber> <portnumber>
   If all goes well, you should get a "Connection refused" error.  If
   you get any other kind of error, it could indicate either of the
   two conditions above.  Consult with your system/network
   administrator.
-----------------------------------------------------------------------------
n-1<28363> ssi:boot:base:server: failed to connect to remote lamd!
n-1<28363> ssi:boot:base:server: closing server socket
n-1<28363> ssi:boot:base:linear: aborted!
lamboot did NOT complete successfully

==================================================================================

    I used compilation flags on the Itanium station equivalent to those
on the PA-RISC station. It seems that the hboot command doesn't work.
Regarding the telnet suggestion: if I open a socket on another station,
I can telnet to that port from the Itanium station.
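
The telnet check from the error message can also be scripted. The following is only a minimal sketch in Python, assuming a Python interpreter is available on the machine being tested; `probe_tcp` is a hypothetical helper name and is not part of LAM. It makes one TCP connection attempt and reports whether it connected, was refused, or timed out. As the LAM message says, "refused" is the healthy answer once lamboot has closed its port; a timeout would point toward filtering or routing.

```python
import socket


def probe_tcp(host, port, timeout=5.0):
    """Make one TCP connection attempt and classify the outcome.

    Hypothetical helper, roughly equivalent to `telnet <ipnumber> <portnumber>`.
    """
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return "connected"
    except ConnectionRefusedError:
        # The host answered, but nothing is listening on that port.
        return "refused"
    except socket.timeout:
        # No answer at all within the timeout (filtered or unroutable).
        return "timed out"
    except OSError as exc:
        # Anything else, e.g. "network is unreachable".
        return "error: %s" % exc


# Example call, using the address and port from the lamboot output above:
#   print(probe_tcp("125.1.2.17", 62915))
```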

    Would you have any suggestions for further testing?

     Best regards,

  V.Khiem Truong
  Onera - France
    




 >On Mar 29, 2007, at 4:01 AM, Van-Khiem Truong wrote:
 >
 >> I would like to get help for the "lamboot" procedure. I have
 >> installed the code LAM-MPI on two machines HP-UX, the first one is a
 >> PA-Risc 2.0, the second one is a multiprocessor HP Itanium.
 >>
 >> The installation seems to be fine, except for the module ptmalloc2
 >> (/share/memory/ptmalloc2) where I need to change the "Makefile " to
 >> remove the file malloc.c, otherwise the code tells me that variables
 >> are already declared.
 >
 >You should configure with --without-memory-manager and then you won't
 >have this problem.
 >
 >> So on the machine HP PA-Risc , I can start the procedure "lamboot"
 >> and connect to another PA-Risc HP. However for the machine HP
 >> multiprocessor, it tells me that it boots but the call back doesn't
 >> work. I attach hereby the file containing the error message.
 >
 >See below.
 >
 >> [snip]
 >> n-1<1698> ssi:boot:base: looking for boot schema file:
 >> n-1<1698> ssi:boot:base: hostfile
 >> n-1<1698> ssi:boot:base: found boot schema: hostfile
 >> n-1<1698> ssi:boot:rsh: found the following hosts:
 >> n-1<1698> ssi:boot:rsh: n0 nanopus (cpu=1)
 >> n-1<1698> ssi:boot:rsh: n1 hudson (cpu=1)
 >> n-1<1698> ssi:boot:rsh: resolved hosts:
 >> n-1<1698> ssi:boot:rsh: n0 nanopus --> 125.1.5.218 (origin)
 >> n-1<1698> ssi:boot:rsh: n1 hudson --> 125.1.7.17
 >> [snip]
 >> n-1<1698> ssi:boot:rsh: starting on n0 (nanopus): hboot -t -c lam-conf.lamd -d -v -I -H 125.1.5.218 -P 49939 -n 0 -o 0
 >> n-1<1698> ssi:boot:rsh: launching locally
 >> [snip]
 >> hboot: attempting to execute [1] 1701 lamd -H 125.1.5.218 -P 49939 -n 0 -o 0 -d
 >> n-1<1698> ssi:boot:rsh: successfully launched on n0 (nanopus)
 >> n-1<1698> ssi:boot:base:server: expecting connection from finite list
 >> -----------------------------------------------------------------------------
 >> The lamboot agent timed out while waiting for the newly-booted process
 >> to call back and indicated that it had successfully booted.
 >> [snip]
 >
 >What is truly odd here is that the lamd that lamboot is waiting for
 >is the *local* lamd.
 >
 >Did you check that you have no TCP filtering / firewall software
 >running?
 >
 >--
 >Jeff Squyres
 >Cisco Systems

_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/

Re: LAM: Please help: lamboot call back problem
2007-04-03 07:20:43
On Apr 2, 2007, at 9:35 AM, Van-Khiem Truong wrote:

>     Hello Jeff Squyres,
>
>     Thank you for your quick response. That is really odd! I spend some
> time to check about the trouble.
>
>  (1) You are right about the configuration without "memory-manager";
>
>  (2) There is no firewall software running;
>
>  (3)  Instead of using the multiprocessor machine, I installed the
> Lam-MPI on a single processor machine with the same processor Itanium.
> Then  I make the lamboot call with only the Itanium station alone (with
> two work stations, it results into the same error):
>  it results into the same error  message as before, as you can see on
> the following file:

It's actually not hboot that is failing, but the lamd (hboot is
mainly a wrapper around fork/exec'ing the lamd).  The lamd is trying
to open a socket back to 125.1.2.17 port 62915 (which *should* be the
same as the local host).
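
The callback described above can be imitated without LAM. This is only a rough sketch in Python, under stated assumptions: Python is available on the machine, `callback_roundtrip` is a hypothetical name, and binding the listener to all interfaces is a guess for illustration, not a statement of how lamboot or lamd is implemented. It opens a listening TCP socket on an ephemeral port and then connects back to it through a given address, which is the same round trip that is timing out.

```python
import socket


def callback_roundtrip(callback_ip, timeout=5.0):
    """Listen on an ephemeral port, then connect back to it via callback_ip.

    Hypothetical helper. Returns True if the connection is established and
    accepted, False on refusal, timeout, or any other socket error.
    """
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
        # Assumption: listen on all interfaces; the kernel picks the port.
        server.bind(("", 0))
        server.listen(1)
        server.settimeout(timeout)
        port = server.getsockname()[1]
        try:
            # Play the role of the newly started daemon calling back.
            with socket.create_connection((callback_ip, port), timeout=timeout):
                conn, _peer = server.accept()
                conn.close()
                return True
        except OSError:
            return False


# Example call, using the address lamboot resolved for the local host:
#   print(callback_roundtrip("125.1.2.17"))
```

If this returns False for 125.1.2.17 but True for 127.0.0.1 on the same machine, the problem is likely in how that address is reached rather than in LAM itself.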

Do you, perchance, have multiple IP addresses on this machine?  I'm
wondering if LAM is using the "wrong" IP address such that it can't
open a socket back to 125.1.2.17 properly.
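
One way to look into the address question is to list what the host name actually resolves to and compare that with the address lamboot printed. A minimal sketch in Python, assuming Python is available; `resolved_ipv4` and `resolves_to` are hypothetical names. It only shows name resolution, so the result still has to be compared by hand against the interface addresses reported by the system's own network tools.

```python
import socket


def resolved_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses that hostname resolves to."""
    infos = socket.getaddrinfo(
        hostname, None, family=socket.AF_INET, type=socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # for IPv4, sockaddr is (address, port).
    return sorted({info[4][0] for info in infos})


def resolves_to(hostname, expected_ip):
    """True if expected_ip is among the addresses hostname resolves to."""
    return expected_ip in resolved_ipv4(hostname)


# Example calls, using the host name and address from the lamboot output:
#   print(resolved_ipv4("biscaye"))
#   print(resolves_to("biscaye", "125.1.2.17"))
```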


-- 
Jeff Squyres
Cisco Systems
