
Thread: Re: LAM: MPI_BARRIER problem




Re: LAM: MPI_BARRIER problem
2008-05-07 19:36:39
> > Hi, I'm studying parallel processing right now, and I'm kinda
> > new to this world.
> > When I test my cluster implementation using an inverse matrix, 1 mill
> > times 1 mill, MPI_Barrier always fails with an error. Is there a way
> > to remove this error? I can assure you that the source code I've been
> > using doesn't have any errors, since I used it to test the same inverse
> > matrix, but only up to about 6000 times 6000.
>
> It's kind of hard to help when you don't include the error message
> that MPI_BARRIER caused.  That being said, generally errors in barrier
> are caused by a previous problem, such as memory corruption from
> overwriting an array.  You might want to use a memory debugger such as
> valgrind to make sure you don't have any issues in your code.  Just
> because something works at one matrix size does not mean that it's
> correct -- we've seen many times where one matrix size works and
> another doesn't, simply because of what was placed directly after the
> array, depending on the whims of the compiler / allocator.
>
> Brian
>
> -- 
>    Brian Barrett
>    LAM/MPI Developer
>    Make today a LAM/MPI day!

I have attached the code that I'm using. It would be a great help to
me if you could help me with this problem.
My error output is as follows:
 MPI_Recv: process in local group is dead (rank 2, MPI_COMM_WORLD)
 Rank (2, MPI_COMM_WORLD): Call stack within LAM:
 MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
 Rank (1, MPI_COMM_WORLD): Call stack within LAM:
 Rank (2, MPI_COMM_WORLD):  - MPI_Recv()
 Rank (2, MPI_COMM_WORLD):  - MPI_Barrier()
 Rank (2, MPI_COMM_WORLD):  - main()
 Rank (1, MPI_COMM_WORLD):  - MPI_Recv()
 Rank (1, MPI_COMM_WORLD):  - MPI_Barrier()
 Rank (1, MPI_COMM_WORLD):  - main()
 --------------------------------------------------------------------------
 One of the processes started by mpirun has exited with a nonzero exit
 code.  This typically indicates that the process finished in error.
 If your process did not finish in error, be sure to include a "return
 0" or "exit(0)" in your C code before exiting the application.

Thank you,
Sincerely,
Richard


     
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/
  
Re: LAM: MPI_BARRIER problem
2008-05-07 19:42:04
On May 7, 2008, at 8:36 PM, richard pan wrote:

> MPI_Recv: process in local group is dead (rank 2, MPI_COMM_WORLD)
> Rank (2, MPI_COMM_WORLD): Call stack within LAM:
> MPI_Recv: process in local group is dead (rank 1, MPI_COMM_WORLD)
> --------------------------------------------------------------------------
> One of the processes started by mpirun has exited with a nonzero exit
> code.  This typically indicates that the process finished in error.
> If your process did not finish in error, be sure to include a "return
> 0" or "exit(0)" in your C code before exiting the application.


I believe this error message says it all -- one of your processes has
died.

Specifically: MPI_BARRIER isn't what caused your app to die;
MPI_BARRIER is the function that noticed that the other process was
dead, reported the problem, and then aborted all remaining MPI
processes.

Your program is a bit too long for me to debug; Brian's advice of
running through debuggers is probably your best bet.  Also check for
corefiles that may indicate where your program died.

-- 
Jeff Squyres
Cisco Systems
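
One way a rank can die before it ever reaches MPI_Barrier is an allocation
that fails and is never checked. The attached program is not shown in this
archive, so the following is only a sketch under an assumption: that the
matrix is stored densely as doubles. In that case 6000 x 6000 needs about
288 MB per matrix, while 1,000,000 x 1,000,000 would need about 8 TB. The
helper names matrix_bytes and alloc_matrix are hypothetical and are not
taken from the original code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper (not from the attached program): compute the number
 * of bytes needed for a dense n x n matrix of doubles.
 * Returns 0 on success, -1 if the size does not fit in size_t. */
static int matrix_bytes(size_t n, size_t *out)
{
    assert(out != NULL);
    if (n != 0 && n > SIZE_MAX / n)
        return -1;                          /* n * n would overflow */
    size_t elems = n * n;
    if (elems > SIZE_MAX / sizeof(double))
        return -1;                          /* elems * 8 would overflow */
    *out = elems * sizeof(double);
    return 0;
}

/* Hypothetical helper: allocate the matrix and report a failure right
 * away, instead of letting a NULL pointer be dereferenced later on. */
static double *alloc_matrix(size_t n)
{
    size_t bytes;
    if (matrix_bytes(n, &bytes) != 0) {
        fprintf(stderr, "matrix %zu x %zu: size overflows size_t\n", n, n);
        return NULL;
    }
    double *m = malloc(bytes);
    if (m == NULL)
        fprintf(stderr, "matrix %zu x %zu: malloc of %zu bytes failed\n",
                n, n, bytes);
    return m;
}
```

In an MPI program, a rank that gets NULL back from such a helper could call
MPI_Abort(MPI_COMM_WORLD, 1) at that point, so the failure is reported where
it happens rather than surfacing later in another rank's MPI_Barrier. This
does not replace running the real program under valgrind or inspecting
corefiles, as suggested above.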

