
Thread: LAM: mpirun get stuck (lamboot works fine)

user name
2006-07-30 04:40:57
I am having a strange problem for which I could not find an answer on the list or the web.

I just replaced a compute node in a cluster with a new machine. The cluster sits behind a head node (which does not compute). Jobs are run by logging into one of the compute nodes, changing to the directory on the head node that holds the executable, running lamboot on a machinefile, and then running mpirun.
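The machinefile (boot schema) itself is not reproduced in this post. Judging only from the host list that lamboot prints in the three-node run further down, it would look roughly like the sketch below: one host name per line. The comment line and the absence of any per-host options such as a CPU count are assumptions, not a copy of the real file.

```text
# machinefile: assumed contents, reconstructed from lamboot's host list
node08
node19
node22
```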

The problem is that while mpirun works on the new machine by itself, mpirun hangs with any boot schema that combines it with other nodes. A boot schema with any combination of the other machines works fine; only the new node (node22) causes problems. Here is what I get when I run the code with a boot schema containing just node22.

node22.local.net 27: lamboot -v -d mac

LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University

lamboot: boot schema file: mac
lamboot: opening hostfile mac
lamboot: found the following hosts:
lamboot:   n0 node22
lamboot: resolved hosts:
lamboot:   n0 node22 --> 192.168.0.22
lamboot: found 1 host node(s)
lamboot: origin node is 0 (node22)
Executing hboot on n0 (node22 - 2 CPUs)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H 192.168.0.22 -P 33629 -n 0 -o 0     ""
hboot: process schema = "/etc/lam/lam-conf.lam "
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1]  10912 lamd -H 192.168.0.22 -P 33629 -n 0 -o 0 -d
topology done
lamboot completed successfully

And when I run mpirun I get:

node22.local.net 28: mpirun -np 4 a.out
 hello world from processor 3
 hello world from processor 0
 hello world from processor 1
 hello world from processor 2

However, a boot schema with node08, node19 and node22, followed by mpirun, does the following (node08 and node19 are just examples here; the other nodes behave the same way, and only node22 causes problems).

node08.local.net 31: lamboot -v -d machinefile

LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University

lamboot: boot schema file: machinefile
lamboot: opening hostfile machinefile
lamboot: found the following hosts:
lamboot:   n0 node08
lamboot:   n1 node19
lamboot:   n2 node22
lamboot: resolved hosts:
lamboot:   n0 node08 --> 192.168.0.8
lamboot:   n1 node19 --> 192.168.0.19
lamboot:   n2 node22 --> 192.168.0.22
lamboot: found 3 host node(s)
lamboot: origin node is 0 (node08)
Executing hboot on n0 (node08 - 1 CPU)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H 192.168.0.8 -P 33101 -n 0 -o 0     ""
hboot: process schema = "/etc/lam/lam-conf.lam";
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1]  18080 lamd -H 192.168.0.8 -P 33101 -n 0 -o 0 -d
Executing hboot on n1 (node19 - 1 CPU)...
lamboot: attempting to execute "rsh node19 -n echo $SHELL"
lamboot: got remote shell /bin/tcsh
lamboot: attempting to execute "rsh node19 -n hboot -t -c lam-conf.lam -d -v -s -I "-H 192.168.0.8 -P 33101 -n 1 -o 0    ""
hboot: process schema = "/etc/lam/lam-conf.lam";
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1]  15483 lamd -H 192.168.0.8 -P 33101 -n 1 -o 0 -d
Executing hboot on n2 (node22 - 1 CPU)...
lamboot: attempting to execute "rsh node22 -n echo $SHELL"
lamboot: got remote shell /bin/tcsh
lamboot: attempting to execute "rsh node22 -n hboot -t -c lam-conf.lam -d -v -s -I "-H 192.168.0.8 -P 33101 -n 2 -o 0    ""
hboot: process schema = "/etc/lam/lam-conf.lam";
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1]  10965 lamd -H 192.168.0.8 -P 33101 -n 2 -o 0 -d
topology done
lamboot completed successfully
node08.local.net 33: mpirun -v -np 2 a.out
18088 a.out running on n0 (o)
15485 a.out running on n1
 hello world from processor 0
 hello world from processor 1
node08.local.net 34: mpirun -v -np 3 a.out
18090 a.out running on n0 (o)
15486 a.out running on n1

Suspended
node08.local.net 35:

I had to press Ctrl+Z to get the prompt back. I can rsh back and forth between the nodes, and tping also works before the run. Any ideas what's going wrong?