On Apr 2, 2007, at 9:35 AM, Van-Khiem Truong wrote:
> Hello Jeff Squyres,
>
> Thank you for your quick response. That is really odd! I spent some
> time looking into the problem.
>
> (1) You are right about the configuration without "memory-manager";
>
> (2) There is no firewall software running;
>
> (3) Instead of using the multiprocessor machine, I installed
> LAM/MPI on a single-processor machine with the same Itanium
> processor. I then ran lamboot with only the Itanium station in the
> hostfile (with two workstations, the result is the same). It
> produces the same error message as before, as you can see in the
> following output:
It's actually not hboot that is failing, but the lamd (hboot is
mainly a wrapper around fork/exec'ing the lamd). The lamd is trying
to open a socket back to 125.1.2.17 port 62915 (which *should* be the
same as the local host).

Do you, perchance, have multiple IP addresses on this machine? I'm
wondering if LAM is using the "wrong" IP address such that it can't
open a socket back to 125.1.2.17 properly.
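
One rough way to check both questions from the affected machine is sketched below. This is not LAM code: it is a small standalone Python script, and the helper names `resolve_ipv4` and `can_call_back` are made up for illustration. It lists the IPv4 addresses that the local hostname resolves to, then, for each one, listens on an ephemeral port and tries to connect back to it, which only approximates what lamboot and the lamd do.

```python
import socket


def resolve_ipv4(hostname):
    """Return the sorted, de-duplicated IPv4 addresses the resolver gives for hostname."""
    infos = socket.getaddrinfo(hostname, None, socket.AF_INET, socket.SOCK_STREAM)
    return sorted({info[4][0] for info in infos})


def can_call_back(address, timeout=5.0):
    """Listen on an ephemeral port bound to `address`, then connect to it locally.

    Returns False if the address is not assigned to a local interface
    (bind fails) or if the connection cannot be made within `timeout`.
    """
    try:
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as server:
            server.bind((address, 0))          # port 0: let the OS pick a free port
            server.listen(1)
            port = server.getsockname()[1]
            with socket.create_connection((address, port), timeout=timeout):
                return True
    except OSError:
        return False


if __name__ == "__main__":
    host = socket.gethostname()
    try:
        addresses = resolve_ipv4(host)
    except socket.gaierror as exc:
        addresses = []
        print(f"{host}: cannot resolve ({exc})")
    for addr in addresses:
        status = "callback ok" if can_call_back(addr) else "callback FAILED"
        print(f"{host} -> {addr}: {status}")
```

If the hostname resolves to more than one address, or the address shown as "(origin)" in the lamboot output reports a failed callback here, that would point at name resolution or interface configuration rather than at hboot itself.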
>
>
> ================================================================
> output on the screen:
> biscaye 173 : lamboot -v -d -ssi boot rsh hostfile
> n-1<28363> ssi:boot:open: opening
> n-1<28363> ssi:boot:open: looking for boot module named rsh
> n-1<28363> ssi:boot:open: opening boot module rsh
> n-1<28363> ssi:boot:open: opened boot module rsh
> n-1<28363> ssi:boot:select: initializing boot module rsh
> n-1<28363> ssi:boot:rsh: module initializing
> n-1<28363> ssi:boot:rsh:agent: /usr/bin/remsh
> n-1<28363> ssi:boot:rsh:username: <same>
> n-1<28363> ssi:boot:rsh:verbose: 1000
> n-1<28363> ssi:boot:rsh:algorithm: linear
> n-1<28363> ssi:boot:rsh:no_n: 0
> n-1<28363> ssi:boot:rsh:no_profile: 0
> n-1<28363> ssi:boot:rsh:fast: 0
> n-1<28363> ssi:boot:rsh:ignore_stderr: 0
> n-1<28363> ssi:boot:rsh:priority: 10
> n-1<28363> ssi:boot:select: boot module available: rsh, priority: 10
> n-1<28363> ssi:boot:select: selected boot module rsh
>
> LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
>
> n-1<28363> ssi:boot:base: looking for boot schema in following directories:
> n-1<28363> ssi:boot:base: <current directory>
> n-1<28363> ssi:boot:base: $TROLLIUSHOME/etc
> n-1<28363> ssi:boot:base: $LAMHOME/etc
> n-1<28363> ssi:boot:base: /homi/truong/Local/lam-mpi_itanium-inst/etc
> n-1<28363> ssi:boot:base: looking for boot schema file:
> n-1<28363> ssi:boot:base: hostfile
> n-1<28363> ssi:boot:base: found boot schema: hostfile
> n-1<28363> ssi:boot:rsh: found the following hosts:
> n-1<28363> ssi:boot:rsh: n0 biscaye (cpu=1)
> n-1<28363> ssi:boot:rsh: resolved hosts:
> n-1<28363> ssi:boot:rsh: n0 biscaye --> 125.1.2.17 (origin)
> n-1<28363> ssi:boot:rsh: starting RTE procs
> n-1<28363> ssi:boot:base:linear: starting
> n-1<28363> ssi:boot:base:server: opening server TCP socket
> n-1<28363> ssi:boot:base:server: opened port 62915
> n-1<28363> ssi:boot:base:linear: booting n0 (biscaye)
> n-1<28363> ssi:boot:rsh: starting lamd on (biscaye)
> n-1<28363> ssi:boot:rsh: starting on n0 (biscaye): hboot -t -c lam-conf.lamd -d -v -I -H 125.1.2.17 -P 62915 -n 0 -o 0
> n-1<28363> ssi:boot:rsh: launching locally
> hboot: performing tkill
> hboot: tkill -d
> tkill: setting prefix to (null)
> tkill: setting suffix to (null)
> tkill: got killname back: /tmp/lam-truong@biscaye/lam-killfile
> tkill: f_kill = "/tmp/lam-truong@biscaye/lam-killfile"
> tkill: killing LAM...
> tkill: killing PID (SIGHUP) 28275 ...
> tkill: already dead
> tkill: removing socket file ...
> tkill: socket file: /tmp/lam-truong@biscaye/lam-kernel-socketd
> tkill: removing IO daemon socket file ...
> tkill: IO daemon socket file: /tmp/lam-truong@biscaye/lam-io-socket
> tkill: all finished
> hboot: booting...
> hboot: fork /homi/truong/Local/lam-mpi_itanium-inst/bin/lamd
> [1] 28366 lamd -H 125.1.2.17 -P 62915 -n 0 -o 0 -d
> n-1<28363> ssi:boot:rsh: successfully launched on n0 (biscaye)
> n-1<28363> ssi:boot:base:server: expecting connection from finite list
> hboot: attempting to execute
> -----------------------------------------------------------------------------
> The lamboot agent timed out while waiting for the newly-booted process
> to call back and indicated that it had successfully booted.
>
> *** PLEASE READ THIS ENTIRE MESSAGE, FOLLOW ITS SUGGESTIONS, AND
> *** CONSULT THE "BOOTING LAM" SECTION OF THE LAM/MPI FAQ
> *** (http://www.lam-mpi.org/faq/) BEFORE POSTING TO THE LAM/MPI USER'S
> *** MAILING LIST.
>
> As far as LAM could tell, the remote process started properly, but
> then never called back. Possible reasons that this may happen:
>
> - There are network filters between the lamboot agent host and
>   the remote host such that communication on random TCP ports
>   is blocked
> - Network routing from the remote host to the local host isn't
>   properly configured (this is uncommon)
>
> You can check these things by watching the output from "lamboot -d".
>
> 1. On the command line for hboot, there are two important parameters:
>    one is the IP address of where the lamboot agent was invoked, the
>    other is the port number that the lamboot agent is expecting the
>    newly-booted process to call back on (this will be a random
>    integer).
>
> 2. Manually login to the remote machine and try to telnet to the port
>    indicated on the hboot command line. For example,
>        telnet <ipnumber> <portnumber>
>    If all goes well, you should get a "Connection refused" error. If
>    you get any other kind of error, it could indicate either of the
>    two conditions above. Consult with your system/network
>    administrator.
>
> -----------------------------------------------------------------------------
> n-1<28363> ssi:boot:base:server: failed to connect to remote lamd!
> n-1<28363> ssi:boot:base:server: closing server socket
> n-1<28363> ssi:boot:base:linear: aborted!
> lamboot did NOT complete successfully
>
>
> ================================================================
>
> I used compilation flags on the Itanium station equivalent to those
> on the PA-RISC station. It seems that the hboot command doesn't
> work. Regarding the telnet suggestion: if I open a socket on another
> station, I can telnet to that port from the Itanium station.
>
> Would you have any suggestion for testing further?
>
> Best regards,
>
> V.Khiem Truong
> Onera - France
>
>
>
>
>
>> On Mar 29, 2007, at 4:01 AM, Van-Khiem Truong wrote:
>>
>>> I would like to get help with the "lamboot" procedure. I have
>>> installed LAM/MPI on two HP-UX machines: the first is a
>>> PA-RISC 2.0, the second is a multiprocessor HP Itanium.
>>>
>>> The installation seems to be fine, except for the ptmalloc2 module
>>> (/share/memory/ptmalloc2), where I need to change the Makefile to
>>> remove the file malloc.c; otherwise the build reports that
>>> variables are already declared.
>>
>> You should configure with --without-memory-manager and then you won't
>> have this problem.
>>
>>> So on the HP PA-RISC machine, I can run "lamboot" and connect to
>>> another PA-RISC HP. However, on the multiprocessor HP machine, it
>>> tells me that it boots but the callback doesn't work. I have
>>> attached the file containing the error message.
>
>> See below.
>>
>>> [snip]
>>> n-1<1698> ssi:boot:base: looking for boot schema file:
>>> n-1<1698> ssi:boot:base: hostfile
>>> n-1<1698> ssi:boot:base: found boot schema: hostfile
>>> n-1<1698> ssi:boot:rsh: found the following hosts:
>>> n-1<1698> ssi:boot:rsh: n0 nanopus (cpu=1)
>>> n-1<1698> ssi:boot:rsh: n1 hudson (cpu=1)
>>> n-1<1698> ssi:boot:rsh: resolved hosts:
>>> n-1<1698> ssi:boot:rsh: n0 nanopus --> 125.1.5.218 (origin)
>>> n-1<1698> ssi:boot:rsh: n1 hudson --> 125.1.7.17
>>> [snip]
>>> n-1<1698> ssi:boot:rsh: starting on n0 (nanopus): hboot -t -c lam-conf.lamd -d -v -I -H 125.1.5.218 -P 49939 -n 0 -o 0
>>> n-1<1698> ssi:boot:rsh: launching locally
>>> [snip]
>>> hboot: attempting to execute
>>> [1] 1701 lamd -H 125.1.5.218 -P 49939 -n 0 -o 0 -d
>>> n-1<1698> ssi:boot:rsh: successfully launched on n0 (nanopus)
>>> n-1<1698> ssi:boot:base:server: expecting connection from finite list
>>> -----------------------------------------------------------------------------
>>> The lamboot agent timed out while waiting for the newly-booted process
>>> to call back and indicated that it had successfully booted.
>>> [snip]
>>
>> What is truly odd here is that the lamd that lamboot is waiting for
>> is the *local* lamd.
>>
>> Did you check that you have no TCP filtering / firewall software
>> running?
>>
>> --
>> Jeff Squyres
>> Cisco Systems
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
--
Jeff Squyres
Cisco Systems