
Thread: Re: LAM: lamboot is ok, mpirun is not




Re: LAM: lamboot is ok, mpirun is not
user name
2007-05-24 20:36:11
Sorry, I see you did that earlier. Have you tried mpirun with the -v parameter as well?

From: lam-bounceslam-mpi.org [mailto:lam-bounceslam-mpi.org] On Behalf Of K. Charoenpornwattana Ter
Sent: 24 May 2007 19:57
To: General LAM/MPI mailing list
Subject: Re: LAM: lamboot is ok, mpirun is not

[teruftoscar ~]$ which mpirun
/opt/lam-7.1.3/bin/mpirun
[teruftoscar ~]$ cexec which mpirun
************************* oscar_cluster *************************
--------- oscarnode1---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode2---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode3---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode4---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode5---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode6---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode7---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode8---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode9---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode10---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode11---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode12---------
/opt/lam-7.1.3/bin/mpirun
--------- oscarnode13---------
/opt/lam-7.1.3/bin/mpirun

Thanks

On 5/24/07, McCalla, Mac <macmccallahess.com> wrote:
Hi,
    just for grins, what does "which mpirun" show?
 
mac mccalla


From: lam-bounceslam-mpi.org [mailto:lam-bounceslam-mpi.org] On Behalf Of K. Charoenpornwattana Ter
Sent: 24 May 2007 14:47
To: General LAM/MPI mailing list
Subject: Re: LAM: lamboot is ok, mpirun is not

On 5/24/07, Jeff Squyres <jsquyrescisco.com> wrote:
That is just weird -- I don't think I've seen a case where tping
worked (implying that inter-lamd communication is working), but
running applications did not.

Yes, it's kind of weird. I just noticed something: after running mpirun, tping doesn't work anymore. See below.

[teruftoscar test]$ lamboot -v host
LAM 7.1.3/MPI 2 C++/ROMIO - Indiana University

n-1<12514> ssi:boot:base:linear: booting n0 (uftoscar)
...
n-1<12514> ssi:boot:base:linear: finished
[teruftoscar test]$ tping -c 3 n0-13
  1 byte from 13 remote nodes and 1 local node: 0.007 secs
  1 byte from 13 remote nodes and 1 local node: 0.005 secs
  1 byte from 13 remote nodes and 1 local node: 0.006 secs

3 messages, 3 bytes (0.003K), 0.017 secs (0.340K/sec)
roundtrip min/avg/max: 0.005/0.006/0.007
[teruftoscar test]$ mpicc ring.c -o ring.out          <--- LAM's mpicc
[teruftoscar test]$ mpirun -np 13 ring.out
<freeze> (so I pressed Ctrl-C to cancel)

********************* WARNING ***********************
This is a vulnerable region. Exiting the application
now may lead to improper cleanup of temporary objects
To exit the application, press Ctrl-C again
********************* WARNING ************************
[teruftoscar test]$ tping -c 3 n0-13
<freeze> :-(

The only thing that I can think of is that there is some firewalling
in place that only allows arbitrary UDP traffic through...? (inter-lamd
traffic is UDP, not TCP.) That doesn't seem to make sense, though, if
MPICH works (cexec uses ssh, which is most certainly allowed). But can
you triple check that there are no firewall rules in place that
restrict UDP/TCP traffic? (e.g., iptables)

I did. No firewall is running on any node.

[rootuftoscar ~]# service iptables status
Firewall is stopped.
[rootuftoscar ~]# service pfilter status
pfilter is stopped
[rootuftoscar ~]# cexec service iptables status
************************* oscar_cluster *************************
--------- oscarnode1---------
Firewall is stopped.
.....
--------- oscarnode13---------
Firewall is stopped.

[rootuftoscar ~]# cexec service pfilter status    <-- I already removed pfilter.
************************* oscar_cluster *************************
--------- oscarnode1---------
pfilter: unrecognized service
....
--------- oscarnode13---------
pfilter: unrecognized service

Also try running tping / mpirun / lamexec from a node other than the
origin (i.e., the node you lambooted from).

I did. same problem.
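
On the firewall question above: one way to double check TCP reachability between nodes, independent of LAM, is a small probe like the sketch below. It only attempts a plain TCP connect to a given host and port and reports whether that succeeded. The function names (tcp_reachable, check_nodes) are made up for illustration and are not part of LAM/MPI. It assumes something is already listening on the chosen port on the remote node. Port 22 would only confirm what cexec already shows, so a listener on some other unprivileged port is the more useful test.

```python
import socket


def tcp_reachable(host, port, timeout=3.0):
    """Return True if a TCP connection to (host, port) succeeds in time.

    Hypothetical helper for illustration only; not a LAM/MPI tool.
    """
    try:
        # create_connection resolves the name and performs the TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution errors.
        return False


def check_nodes(hosts, port, timeout=3.0):
    """Probe each host once and return a {host: bool} mapping."""
    return {host: tcp_reachable(host, port, timeout) for host in hosts}
```

For example, check_nodes(["oscarnode1", "oscarnode2"], port) run from the head node, and then again from one of the compute nodes, would show whether TCP connects work in both directions. A False result only says the connect failed. It does not say why (no listener, a filter, or a name that resolves to the wrong address would all look the same).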

On May 23, 2007, at 11:32 PM, K. Charoenpornwattana Ter wrote:

> Try some simple tests:
>
> - Does "tping -c 3" run successfully? (It should ping all the lamd's)
>
> [teruftoscar test]$ tping -c 3 n0-13
>   1 byte from 13 remote nodes and 1 local node: 0.006 secs
>   1 byte from 13 remote nodes and 1 local node: 0.005 secs
>   1 byte from 13 remote nodes and 1 local node: 0.005 secs
>
> 3 messages, 3 bytes (0.003K), 0.016 secs (0.368K/sec)
> roundtrip min/avg/max: 0.005/0.005/0.006
>
>
> - Does "lamexec N hostname" run successfully? (It should run
> "hostname" on all the booted nodes)
>
> No, it doesn't work. It only shows the head node's hostname. See below:
>
> [teruftoscar ~]$ lamexec N hostname
> uftoscar.latech
> <freeze>
>
> I can, however, execute "cexec hostname" with no problem.
>
> - When you "mpirun -np 15 ring.out", do you see ring.out executing on
> all the nodes? (i.e., if you ssh into each of the nodes and run ps,
> do you see it running?)
>
> I only see one ring.out running on the head node; no ring.out is
> running on the other nodes.
>
>
> Thanks
> Kulathep
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/


--
Jeff Squyres
Cisco Systems

_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/



Re: LAM: lamboot is ok, mpirun is not
user name
2007-05-24 20:43:55
Yes,

[teruftoscar test]$ mpirun -np 14 -v ring.out
17119 ring.out running on n0 (o)
<freeze>

Ummm, I guess I will just remove everything and install it again.

Thanks anyway,
Kulathep

On 5/24/07, McCalla, Mac <macmccallahess.com> wrote:
Sorry, I see you did that earlier. Have you tried mpirun with the -v parameter as well?


