Thread: LAM: mpirun get stuck (lamboot works fine)
2006-07-30 04:40:57
I am having a strange problem for which I could not find an answer on the list or the web.
I just replaced a compute node in a cluster with a new machine. The cluster sits behind a head node (which does not compute). Jobs are run by logging into one of the compute nodes, changing to the executable directory on the head node, running lamboot on a machinefile, and then running mpirun.
The problem is that while mpirun works on the new machine by itself, it hangs with any boot schema that also contains other nodes. I should mention that a boot schema with any combination of the other machines works fine. Only the new node (node22) gives problems. Here is what I get when I run the code with a boot schema containing just node22.
node22.local.net 27: lamboot -v -d mac
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
lamboot: boot schema file: mac
lamboot: opening hostfile mac
lamboot: found the following hosts:
lamboot: n0 node22
lamboot: resolved hosts:
lamboot: n0 node22 --> 192.168.0.22
lamboot: found 1 host node(s)
lamboot: origin node is 0 (node22)
Executing hboot on n0 (node22 - 2 CPUs)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H 192.168.0.22 -P 33629 -n 0 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1] 10912 lamd -H 192.168.0.22 -P 33629 -n 0 -o 0 -d
topology done
lamboot completed successfully
When I then run mpirun, I get:
node22.local.net 28: mpirun -np 4 a.out
hello world from processor 3
hello world from processor 0
hello world from processor 1
hello world from processor 2
However, a boot schema with node08, node19 and node22, followed by mpirun, does the following. (node08 and node19 are just an example here; the other nodes are fine too. Only node22 causes problems.)
node08.local.net 31: lamboot -v -d machinefile
LAM 6.5.9/MPI 2 C++/ROMIO - Indiana University
lamboot: boot schema file: machinefile
lamboot: opening hostfile machinefile
lamboot: found the following hosts:
lamboot: n0 node08
lamboot: n1 node19
lamboot: n2 node22
lamboot: resolved hosts:
lamboot: n0 node08 --> 192.168.0.8
lamboot: n1 node19 --> 192.168.0.19
lamboot: n2 node22 --> 192.168.0.22
lamboot: found 3 host node(s)
lamboot: origin node is 0 (node08)
Executing hboot on n0 (node08 - 1 CPU)...
lamboot: attempting to execute "hboot -t -c lam-conf.lam -d -v -I " -H 192.168.0.8 -P 33101 -n 0 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
hboot: attempting to execute
[1] 18080 lamd -H 192.168.0.8 -P 33101 -n 0 -o 0 -d
Executing hboot on n1 (node19 - 1 CPU)...
lamboot: attempting to execute "rsh node19 -n echo $SHELL"
lamboot: got remote shell /bin/tcsh
lamboot: attempting to execute "rsh node19 -n hboot -t -c lam-conf.lam -d -v -s -I "-H 192.168.0.8 -P 33101 -n 1 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 15483 lamd -H 192.168.0.8 -P 33101 -n 1 -o 0 -d
Executing hboot on n2 (node22 - 1 CPU)...
lamboot: attempting to execute "rsh node22 -n echo $SHELL"
lamboot: got remote shell /bin/tcsh
lamboot: attempting to execute "rsh node22 -n hboot -t -c lam-conf.lam -d -v -s -I "-H 192.168.0.8 -P 33101 -n 2 -o 0 ""
hboot: process schema = "/etc/lam/lam-conf.lam"
hboot: found /usr/bin/lamd
hboot: performing tkill
hboot: tkill
hboot: booting...
hboot: fork /usr/bin/lamd
[1] 10965 lamd -H 192.168.0.8 -P 33101 -n 2 -o 0 -d
topology done
lamboot completed successfully
node08.local.net 33: mpirun -v -np 2 a.out
18088 a.out running on n0 (o)
15485 a.out running on n1
hello world from processor 0
hello world from processor 1
node08.local.net 34: mpirun -v -np 3 a.out
18090 a.out running on n0 (o)
15486 a.out running on n1
Suspended
node08.local.net 35:
I had to press Ctrl+Z to get my prompt back. I can rsh back and forth between the nodes, and tping also works before the run. Any ideas what is going wrong?
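
For anyone comparing the two boot logs, the host-to-address mapping that lamboot reports can be pulled out with a few lines of Python. This is only a rough sketch: the function name `resolved_hosts` and the regular expression are my own, and it assumes the `lamboot -v -d` output looks like the "n2 node22 --> 192.168.0.22" entries shown above. It checks nothing about the network itself; it just makes the node numbers and addresses easy to compare against what each machine expects.

```python
import re

# Matches entries such as "n2 node22 --> 192.168.0.22" in `lamboot -v -d` output.
# The pattern is an assumption based on the log format in this post.
RESOLVED = re.compile(r"\bn(\d+)\s+(\S+)\s+-->\s+(\d{1,3}(?:\.\d{1,3}){3})")


def resolved_hosts(log_text):
    """Return {node_index: (hostname, ip)} for every resolved-host entry."""
    # Collapse whitespace so entries that wrapped across lines still match.
    flat = " ".join(log_text.split())
    return {int(n): (host, ip) for n, host, ip in RESOLVED.findall(flat)}


# Excerpt copied from the three-node boot log in this post.
SAMPLE = """lamboot: resolved hosts: lamboot: n0 node08 --> 192.168.0.8 lamboot: n1 node19 -->
192.168.0.19 lamboot: n2 node22 --> 192.168.0.22 lamboot: found 3 host node(s)"""

if __name__ == "__main__":
    for index, (host, ip) in sorted(resolved_hosts(SAMPLE).items()):
        print(f"n{index}: {host} -> {ip}")
```

Feeding it the saved output of lamboot from each node should make it obvious if node22 is ever listed under an address different from 192.168.0.22.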