Thread: Re: LAM: lamboot is ok, mpirun is not
| Re: LAM: lamboot is ok, mpirun is not |

|
2007-05-23 22:32:29 |
|
> Try some simple tests:
>
> - Does "tping -c 3" run successfully? (It should ping all the lamd's)

[ter uftoscar test]$ tping -c 3 n0-13
1 byte from 13 remote nodes and 1 local node: 0.006 secs
1 byte from 13 remote nodes and 1 local node: 0.005 secs
1 byte from 13 remote nodes and 1 local node: 0.005 secs

3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
roundtrip min/avg/max: 0.005/0.005/0.006

> - Does "lamexec N hostname" run successfully? (It should run
> "hostname" on all the booted nodes)

No, it doesn't work. It only shows the head node's hostname. See below:

[ter uftoscar ~]$ lamexec N hostname
uftoscar.latech
<freeze>

I can, however, execute "cexec hostname" with no problem.

> - When you "mpirun -np 15 ring.out", do you see ring.out executing on
> all the nodes? (i.e., if you ssh into each of the nodes and run ps,
> do you see it running?)

I only see one ring.out running on the head node, and none running on
the other nodes.

Thanks
Kulathep
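
Because the failing commands in this message hang rather than exit, it may help to run each of the three checks under a time limit so that a freeze is reported instead of blocking the terminal. The following is only a minimal sketch: it assumes GNU coreutils `timeout` is installed on the head node, and the `run_check` helper and the 30-second limit are hypothetical names and values chosen for illustration, not part of LAM.

```shell
#!/bin/sh
# run_check LABEL SECONDS COMMAND [ARGS...]
# Hypothetical helper: runs COMMAND under a time limit and prints one
# status line on stdout: OK (exit 0), HANG (time limit reached), or
# FAIL (any other exit code). The command's own output goes to stderr.
run_check() {
    label=$1
    limit=$2
    shift 2
    timeout "$limit" "$@" 1>&2
    status=$?
    if [ "$status" -eq 0 ]; then
        echo "$label: OK"
    elif [ "$status" -eq 124 ]; then
        # GNU timeout exits with 124 when the time limit is reached.
        echo "$label: HANG after ${limit}s"
    else
        echo "$label: FAIL (exit $status)"
    fi
}

# Example usage on a booted LAM universe (commands taken from the thread):
#   run_check tping   30 tping -c 3 n0-13
#   run_check lamexec 30 lamexec N hostname
#   run_check mpirun  30 mpirun -np 15 ring.out
```

With the symptoms described above, one would expect the tping check to report OK and the lamexec and mpirun checks to report HANG.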
| Re: LAM: lamboot is ok, mpirun is not |
2007-05-24 09:04:10 |
That is just weird -- I don't think I've seen a case where tping
worked (implying that inter-lamd communication is working), but
running applications did not.

The only thing that I can think of is that there is some firewalling
in place that only allows arbitrary UDP traffic through...?
(inter-lamd traffic is UDP, not TCP) That doesn't seem to make sense,
though, if MPICH works (cexec uses ssh, which is most certainly
allowed). But can you triple check that there are no firewall rules
in place that restrict UDP/TCP traffic? (e.g., iptables)

Also try running tping / mpirun / lamexec from a node other than the
origin (i.e., the node you lambooted from).
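
One way to triple check for restrictive rules, as suggested above, is to scan a dump of the active iptables rule set for DROP or REJECT targets. This is a rough sketch only: it assumes `iptables-save` is installed and run as root on each node, and `restrictive_rules` is a hypothetical helper name. It only inspects iptables and would not notice filtering done elsewhere (for example on a switch).

```shell
#!/bin/sh
# restrictive_rules (hypothetical helper): read iptables-save output on
# stdin and print every chain policy or rule whose target is DROP or
# REJECT. Exit status is 0 if at least one such line was found, else 1.
restrictive_rules() {
    awk '
        # Chain policy lines look like ":INPUT DROP [0:0]".
        /^:/ && $2 == "DROP" { print; found = 1; next }
        # Rule lines look like "-A INPUT -p tcp -j DROP".
        /^-A / {
            for (i = 1; i < NF; i++)
                if ($i == "-j" && ($(i + 1) == "DROP" || $(i + 1) == "REJECT")) {
                    print
                    found = 1
                    break
                }
        }
        END { exit (found ? 0 : 1) }
    '
}

# Example usage (as root on a node):
#   iptables-save | restrictive_rules || echo "no DROP/REJECT rules found"
```

An empty result on every node would argue against the firewall theory; any printed TCP rule would be worth a closer look.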
--
Jeff Squyres
Cisco Systems
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/
| Re: LAM: lamboot is ok, mpirun is not |

|
2007-05-24 14:47:26 |
|
On 5/24/07, Jeff Squyres <jsquyres cisco.com> wrote:
> That is just weird -- I don't think I've seen a case where tping
> worked (implying that inter-lamd communication is working), but
> running applications did not.
>
> The only thing that I can think of is that there is some firewalling
> in place that only allows arbitrary UDP traffic through...?
> (inter-lamd traffic is UDP, not TCP) That doesn't seem to make sense,
> though, if MPICH works (cexec uses ssh, which is most certainly
> allowed). But can you triple check that there are no firewall rules
> in place that restrict UDP/TCP traffic? (e.g., iptables)
>
> Also try running tping / mpirun / lamexec from a node other than the
> origin (i.e., the node you lambooted from).

I did. Same problem.
| Re: LAM: lamboot is ok, mpirun is not |

|
2007-05-24 15:05:45 |
Check to see if the lamd's are still running on all nodes when this
problem occurs. If they are dying for some reason (or being killed),
that could explain this behavior.
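
A quick way to make this check across many nodes is to filter the cexec output for node sections that contain no lamd process. The sketch below is illustrative only: `nodes_without_lamd` is a hypothetical helper name, and it assumes the cexec section-header format that appears elsewhere in this thread ("--------- oscarnode1---------") and node names without dashes or spaces.

```shell
#!/bin/sh
# nodes_without_lamd (hypothetical helper): read the output of
#   cexec "ps -ef | grep lamd | grep -v grep"
# on stdin and print the name of every node section with no lamd line.
nodes_without_lamd() {
    awk '
        # Section headers look like "--------- oscarnode1---------".
        /^-+ *[^- ]+ *-+$/ {
            if (node != "" && !seen) print node
            node = $0
            gsub(/[- ]/, "", node)   # strip dashes and spaces, leaving the node name
            seen = 0
            next
        }
        # Any line mentioning lamd counts as a running daemon for this node.
        /lamd/ { seen = 1 }
        END { if (node != "" && !seen) print node }
    '
}

# Example usage from the head node:
#   cexec "ps -ef | grep lamd | grep -v grep" | nodes_without_lamd
```

No output means every listed node still has a lamd; running it both before and after the failing mpirun shows whether any daemon died in between.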
On May 24, 2007, at 3:47 PM, K. Charoenpornwattana Ter wrote:

> On 5/24/07, Jeff Squyres <jsquyres cisco.com> wrote:
> > That is just weird -- I don't think I've seen a case where tping
> > worked (implying that inter-lamd communication is working), but
> > running applications did not.
>
> Yes, it's kinda weird. I just noticed something: after running
> mpirun, tping doesn't work anymore. See below.
>
> [ter uftoscar test]$ lamboot -v host
> LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University
>
> n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
> ...
> n-1<12514> ssi:boot:base:linear: finished
> [ter uftoscar test]$ tping -c 3 n0-13
> 1 byte from 13 remote nodes and 1 local node: 0.007 secs
> 1 byte from 13 remote nodes and 1 local node: 0.005 secs
> 1 byte from 13 remote nodes and 1 local node: 0.006 secs
>
> 3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
> roundtrip min/avg/max: 0.005/0.006/0.007
> [ter uftoscar test]$ mpicc ring.c -o ring.out    <--- LAM's mpicc
> [ter uftoscar test]$ mpirun -np 13 ring.out
> <freeze> (so I pressed Ctrl-C to cancel)
>
> ********************* WARNING ***********************
> This is a vulnerable region. Exiting the application
> now may lead to improper cleanup of temporary objects
> To exit the application, press Ctrl-C again
> ********************* WARNING ************************
> [ter uftoscar test]$ tping -c 3 n0-13
> <freeze> :-(
>
> > The only thing that I can think of is that there is some firewalling
> > in place that only allows arbitrary UDP traffic through...?
> > (inter-lamd traffic is UDP, not TCP) That doesn't seem to make
> > sense, though, if MPICH works (cexec uses ssh, which is most
> > certainly allowed). But can you triple check that there are no
> > firewall rules in place that restrict UDP/TCP traffic? (e.g.,
> > iptables)
>
> I did. No firewall is running on any node.
>
> [root uftoscar ~]# service iptables status
> Firewall is stopped.
> [root uftoscar ~]# service pfilter status
> pfilter is stopped
> [root uftoscar ~]# cexec service iptables status
> ************************* oscar_cluster *************************
> --------- oscarnode1---------
> Firewall is stopped.
> .....
> --------- oscarnode13---------
> Firewall is stopped.
>
> [root uftoscar ~]# cexec service pfilter status    <-- I already removed pfilter.
> ************************* oscar_cluster *************************
> --------- oscarnode1---------
> pfilter: unrecognized service
> ....
> --------- oscarnode13---------
> pfilter: unrecognized service
>
> > Also try running tping / mpirun / lamexec from a node other than the
> > origin (i.e., the node you lambooted from).
>
> I did. Same problem.
--
Jeff Squyres
Cisco Systems
| Re: LAM: lamboot is ok, mpirun is not |

|
2007-05-24 16:32:46 |
|
Hi,

lamd is running on every node in the cluster, both before and after
running mpirun on the head node. See below:

[ter uftoscar ~]$ cexec "ps -ef | grep lamd | grep -v grep"
************************* oscar_cluster *************************
--------- oscarnode1---------
ter 5292 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 1 -o 0
--------- oscarnode2---------
ter 5002 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 2 -o 0
--------- oscarnode3---------
ter 5002 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 3 -o 0
--------- oscarnode4---------
ter 5002 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 4 -o 0
--------- oscarnode5---------
ter 5002 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 5 -o 0
--------- oscarnode6---------
ter 5058 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 6 -o 0
--------- oscarnode7---------
ter 5016 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 7 -o 0
--------- oscarnode8---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 8 -o 0
--------- oscarnode9---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 9 -o 0
--------- oscarnode10---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 10 -o 0
--------- oscarnode11---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 11 -o 0
--------- oscarnode12---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 12 -o 0
--------- oscarnode13---------
ter 4950 1 0 16:22 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 13 -o 0

[ter uftoscar ~]$ ps -ef | grep lamd | grep -v grep
ter 13808 1 0 16:23 ? 00:00:00 /opt/lam-7.1.3/bin//lamd -H 192.168.99.1 -P 57625 -n 0 -o 0

Thanks
Kulathep