Thread: LAM: caused collective abort of all ranks
|
|
| LAM: caused collective abort of all ranks |
  United States |
2008-02-11 21:32:28 |
|
Hello All,
I am an MPI newbie and am having a problem. I intend to run a binary on different processors with different data as input, i.e., different files are processed by the same binary. My binary takes these arguments: binary -in file1 -out file2
file1 and file2 change for each node.
I wrote a small program (small is all I can write right now :( ). The code seems to work as intended while it runs, but it exits partway through with an error.
In the program, the command for each node is given in a small bash file: 1-exec, 2-exec, etc.
error:
##########################
rank 3 in job 83 sapphire.bw01.uic.edu_56332 caused collective abort of all ranks
exit status of rank 3: return code 0
######################################################
The code:
#include "mpi.h"
#include "stdio.h"
#include <unistd.h>

int main( int argc, char *argv[] )
{
    int numprocs, myrank, work, namelen;
    char *file;
    char processor_name[MPI_MAX_PROCESSOR_NAME];

    MPI_Init(&argc, &argv);
    MPI_Comm_size(MPI_COMM_WORLD, &numprocs);
    MPI_Comm_rank(MPI_COMM_WORLD, &myrank);

    if (myrank) printf("My process rank ==> %d\n", myrank);

    if (myrank==1) execl("/bin/bash","bash","1-exec",0);
    if (myrank==2) execl("/bin/bash","bash","2-exec",0);
    if (myrank==3) execl("/bin/bash","bash","3-exec",0);
    if (myrank==4) execl("/bin/bash","bash","4-exec",0);
    if (myrank==5) execl("/bin/bash","bash","5-exec",0);
    if (myrank==6) execl("/bin/bash","bash","6-exec",0);
    if (myrank==7) execl("/bin/bash","bash","7-exec",0);
    if (myrank==8) execl("/bin/bash","bash","8-exec",0);
    if (myrank==9) execl("/bin/bash","bash","9-exec",0);
    if (myrank==10) execl("/bin/bash","bash","10-exec",0);
    if (myrank==11) execl("/bin/bash","bash","11-exec",0);
    if (myrank==12) execl("/bin/bash","bash","12-exec",0);
    if (myrank==13) execl("/bin/bash","bash","13-exec",0);
    if (myrank==14) execl("/bin/bash","bash","14-exec",0);
    if (myrank==15) execl("/bin/bash","bash","15-exec",0);
    if (myrank==16) execl("/bin/bash","bash","16-exec",0);

    MPI_Finalize();
}
****************************
Any suggestions? I tried to find the possible cause of the error, but it has been discussed very little. All I could learn was that it might be due to too few processing units, but the program fails even with fewer processes: for example, I have at least 6 processing units, and running the program with even 1 process exits with the error.
Thanks a lot,
Fahad Saeed
| Re: LAM: caused collective abort of all ranks |

|
2008-02-13 13:45:38 |
It looks like you're trying to use MPI as a parallel launcher, which isn't quite what MPI is for.
Indeed, you're calling execl(), which will replace the MPI process with your new /bin/bash process. Therefore, MPI_FINALIZE will not be executed. As such, LAM treats that as an error.
Note, too, that system() is technically not supported. It'll work fine on TCP and shared memory, but will not work properly on Myrinet or IB networks.
You might simply want to use a resource manager to launch your serial applications in parallel. That might be a bit easier than using MPI.
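Editorial sketch (not part of the original message): the point about execl() replacing the process can be illustrated with a helper that runs the script in a child process and waits for it, so the calling process survives and could still reach MPI_Finalize(). The function name run_script_and_wait is hypothetical. Forking from inside an MPI process is an assumption that probably carries the same caveat as system() above, so this may only be usable on TCP or shared-memory transports.

```c
#include <assert.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run `bash <script>` in a child process and wait.
 * Returns the child's exit status, or -1 if it could not be started or
 * did not exit normally. The calling process keeps running afterwards. */
int run_script_and_wait(const char *script)
{
    assert(script != NULL);

    pid_t pid = fork();
    if (pid < 0)
        return -1;                      /* fork failed */

    if (pid == 0) {
        /* Child: this process image is replaced by bash. */
        execl("/bin/bash", "bash", script, (char *)NULL);
        _exit(127);                     /* reached only if execl failed */
    }

    /* Parent: block until the child has finished. */
    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}
```

In the posted program, each `if (myrank==N) execl(...)` line would become something like `if (myrank==N) run_script_and_wait("N-exec");`, after which control falls through to MPI_Finalize(). Whether LAM tolerates the fork on a given interconnect is not confirmed here.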
--
Jeff Squyres
Cisco Systems
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/
|
|
| Re: LAM: caused collective abort of all ranks |
  United States |
2008-02-13 13:51:19 |
|
What would I have to do if I have to use MPI? I mean, isn't it true that MPI is for message passing, and that what I am trying to do is a kind of message passing?
Could you please elaborate on the resource manager: what kind of resource manager?
Thanks a lot,
Fahad
| Re: LAM: caused collective abort of all ranks |

|
2008-02-13 15:02:16 |
On Feb 13, 2008, at 2:51 PM, fahad saeed wrote:
> What would I have to do, if i have to use MPI.I mean isnt this true
> that MPI is for message passing, and the thing that i am trying to
> do is a kind of message passing ?
I don't see you doing any message passing in your app -- you call MPI_INIT and then execl() (thereby replacing the MPI process with bash). There's no sending of messages anywhere.
> Could you please eloborate more on resource manager, what kind of
> resource manager.
A resource manager to allocate the nodes in your cluster, such as SLURM or Torque, etc.
--
Jeff Squyres
Cisco Systems
|
|
| Re: LAM: caused collective abort of all ranks |
  United States |
2008-02-13 15:05:07 |
|
OK, let me put it this way: how can I use message passing to execute a binary on different nodes?
Fahad
| Re: LAM: caused collective abort of all ranks |

|
2008-02-13 20:27:19 |
On Feb 13, 2008, at 4:05 PM, fahad saeed wrote:
> Ok.Let me put it this way.How can i use message passing to execute a
> binary on different nodes.?
Message passing is not the same thing as launching an executable on different nodes. Message passing is sending messages from one process to another. The fact that mpirun *also* starts processes on multiple nodes is a side-effect -- we have to start the processes before we can exchange messages between them.
What it sounds like you want to do is use MPI as a bootstrap to launch several executables on multiple nodes. While MPI can do that, it's not really what it's designed for. And the fact that you call MPI_INIT and then don't call MPI_FINALIZE will always cause errors with LAM/MPI.
LAM has a launcher explicitly for non-MPI applications; you probably want to use that instead. See the man page for lamexec(1).
--
Jeff Squyres
Cisco Systems
|
|
| Re: LAM: caused collective abort of all ranks |
  United States |
2008-02-13 21:32:21 |
|
> What it sounds like you want to do is use MPI as a bootstrap to launch
> several executables on multiple nodes. While MPI can do that, it's
> not really what it's designed for.
How? The way I am doing it gives errors, for the reason you just explained.
mpiexec or lamexec are used to execute non-MPI programs, but how would that help in this case? Would a scheduler be used along with mpiexec to do the task in question?
Thanks a lot,
Fahad
| Re: LAM: caused collective abort of all ranks |
  Germany |
2008-02-14 04:56:09 |
On Wed, 13 Feb 2008, fahad saeed wrote:
> mpiexec or lamexec, are used to execute non-MPI programs, but how
> would that help in this case.Would a scheduler be used along with
> mpiexec to do the task under question.
You seem not to understand what message passing means and what it can do, plus you haven't really explained what you want to achieve, you've only shown us the errors that you get. So please try to write down a description of what your goals are and maybe we can find together a solution. Also remember that this is the LAM/MPI list, dedicated to issues related to LAM/MPI and not to clustering in general.
--
Bogdan Costescu
IWR, University of Heidelberg, INF 368, D-69120 Heidelberg, Germany
Phone: +49 6221 54 8869/8240, Fax: +49 6221 54 8868/8850
E-mail: bogdan.costescu iwr.uni-heidelberg.de
|
|
| Re: LAM: caused collective abort of all ranks |
  United States |
2008-02-14 12:37:08 |
|
OK. This is what I am trying to accomplish: I am trying to run a single binary on different nodes with different data sets, i.e., a single binary has to run on a different data set on each node. I understand that this can be done using ssh or rsh, I guess, but I want to do it using the MPI library.
For example:
node1 may run --------> ./binary -in file1 -out file1-output
node2 may run --------> ./binary -in file2 -out file2-output
and so on, where in my MPI program this line (./binary -in file1 -out file1-output) is in "1-exec", and so on.
I have tried to distribute the 'load' across the nodes using the rank, but I get the errors I discussed.
Thanks,
Fahad
| Re: LAM: caused collective abort of all ranks |

|
2008-02-14 13:59:12 |
On Feb 14, 2008, at 1:37 PM, fahad saeed wrote:
> This is what I am trying to accomplish. I am trying to run a single
> binary on different nodes for different data sets.i.e. a single
> binary has to run on different data sets on different nodes....I
> understand that it can be done using ssh or rsh i guess....
Yes.
> but I want to do this using MPI library.....
For the code you showed, you're not using any MPI function calls in a meaningful way (i.e., you could get the rank in different ways). It seems like MPI is not the right tool for what you're trying to do.
> For example....
>
> node1 may run --------> ./binary -in file1 -out file1-output
> node2 may run --------> ./binary -in file2 -out file2-output
If you're running a different command (e.g., different argv) on every node, and you're not using MPI for message passing, why not just use rsh/ssh or lamexec?
If it comes down to selecting which in/out file to use, doing it with
> so on and so forth....
> where in my mpi program this line(./binary -in file1 -out file1-
> output) is in "1-exec" and so on.....
>
> I have tried to distribute the 'load' on each node using the
> rank.......but the get errors as I discussed...
Did you look at the man page for lamexec(1)?
An alternative would be to put MPI_Init / MPI_Finalize in the ./binary program itself and have them figure out their argv based on their MPI_COMM_WORLD rank. Then you would just do "mpirun C ./binary" and they would figure out their in and out filenames themselves.
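Editorial sketch (not part of the original message): the alternative just described could look roughly like the helper below, which derives the in and out filenames from the rank. The function name make_io_names and the mapping of rank 0 to file1 are assumptions for illustration; the thread only gives the file1 / file1-output naming pattern. The MPI calls are shown in a comment so the helper itself builds without an MPI installation.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical helper: build the input and output file names for a rank.
 * Assumed convention: rank 0 -> file1 / file1-output, rank 1 -> file2 / ...
 * Returns 0 on success, -1 if either buffer is too small. */
int make_io_names(int rank, char *in, size_t in_len,
                  char *out, size_t out_len)
{
    assert(rank >= 0);

    int n = snprintf(in, in_len, "file%d", rank + 1);
    if (n < 0 || (size_t)n >= in_len)
        return -1;

    n = snprintf(out, out_len, "file%d-output", rank + 1);
    if (n < 0 || (size_t)n >= out_len)
        return -1;

    return 0;
}

/* Inside ./binary, main() would then look roughly like this:
 *
 *   char in[64], out[64];
 *   int rank;
 *   MPI_Init(&argc, &argv);
 *   MPI_Comm_rank(MPI_COMM_WORLD, &rank);
 *   if (make_io_names(rank, in, sizeof in, out, sizeof out) == 0) {
 *       ... read `in`, do the work, write `out` ...
 *   }
 *   MPI_Finalize();
 */
```

With this arrangement every rank runs the same executable under "mpirun C ./binary", nothing is exec'ed after MPI_Init, and each process reaches MPI_Finalize normally.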
--
Jeff Squyres
Cisco Systems
|
|
|
|