
Thread: LAM: a blcr problem, please help




LAM: a blcr problem, please help
China
2007-03-13 09:19:21
Hi everyone,
I am working with BLCR and LAM/MPI, but something is going wrong and I don't know why.
First I installed BLCR and LAM/MPI. Then I tested checkpoint/restart (c/r) on an MPI program running on a single node, and everything went smoothly: the program ran, the checkpoint commands executed successfully, context files were generated, and the restart also ran properly. Later I tried the same experiment on a 2-node cluster, and it failed. I started the MPI program with the command:
 
mpirun -np 2 -ssi rpi crtcp -ssi cr blcr C ./lamtest
 
While the program was running, I took a checkpoint with the command:
 
lamcheckpoint -ssi cr blcr -pid 10411   (*10411 is the pid of mpirun.)
 
The command then hung and did not return until I pressed Ctrl-C.
I checked the working directory, i.e., my home directory, and no context file was found. However, some temporary files named .context-xxxxx-xx.tmp were present.
Could someone please tell me what the problem is? I would much appreciate it.
Thanks.


Re: LAM: a blcr problem, please help
United States
2007-03-22 12:24:33
Interesting. What version of LAM/MPI and BLCR are you using? Can you checkpoint/restart a non-MPI application on both of these machines you are using individually?
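
A rough sketch of that per-node check, in case it helps. It assumes the BLCR utilities cr_run, cr_checkpoint and cr_restart are on your PATH and the BLCR kernel modules are loaded. ./serial_test is a hypothetical stand-in for any long-running non-MPI program, and the function names are made up for illustration.

```shell
#!/bin/sh
# Sketch: checkpoint and restart a plain (non-MPI) program with BLCR on one node.
# APP is a hypothetical placeholder; point it at any long-running serial program.
APP=${APP:-./serial_test}

# Return 0 only if every named tool is found on PATH.
have_tools() {
    for tool in "$@"; do
        command -v "$tool" >/dev/null 2>&1 || { echo "missing: $tool" >&2; return 1; }
    done
}

blcr_smoke_test() {
    have_tools cr_run cr_checkpoint cr_restart || return 1
    cr_run "$APP" &                    # start the program under BLCR control
    pid=$!
    sleep 5                            # give it a moment to get going
    cr_checkpoint "$pid" || return 1   # by default writes context.<pid> in the current directory
    kill "$pid"
    wait "$pid" 2>/dev/null
    cr_restart "context.$pid"          # resume from the saved context file
}
```

Run blcr_smoke_test separately on each of the two nodes. For the version question, laminfo prints the LAM/MPI version together with the SSI modules that were built in.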

If you can do all of that I'd be interested in seeing a debugging backtrace (say from gdb) of mpirun, and the processes launched. That should tell us where they got stuck or what they are waiting on.
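
One way to collect those backtraces, as a rough sketch: it assumes gdb is installed on every node and that you are allowed to attach to the processes (same user, or root). The function name, the DRY_RUN switch and the bt.<pid>.txt output names are illustrative only and are not part of LAM/MPI or BLCR.

```shell
#!/bin/sh
# Sketch: dump a backtrace of every thread in each given process.
# Set DRY_RUN=1 to print the gdb commands instead of running them.
backtrace_pids() {
    for pid in "$@"; do
        out="bt.$pid.txt"    # hypothetical output file name
        if [ "${DRY_RUN:-0}" = "1" ]; then
            echo "gdb -p $pid -batch -ex 'thread apply all bt' > $out"
        else
            # Attach, print all thread backtraces, then detach and exit.
            gdb -p "$pid" -batch -ex 'thread apply all bt' > "$out" 2>&1
        fi
    done
}
```

On the node where mpirun is running, call backtrace_pids 10411 while lamcheckpoint is hung. Then, on each node, call it with the PIDs of the ./lamtest processes (for example as reported by ps).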

Cheers,
Josh

On Mar 13, 2007, at 10:19 AM, Fu HongYi wrote:

> Hi everyone,
> I am working with BLCR and LAM/MPI, but something is going wrong and I don't know why.
> First I installed BLCR and LAM/MPI. Then I tested checkpoint/restart (c/r) on an MPI program running on a single node, and everything went smoothly: the program ran, the checkpoint commands executed successfully, context files were generated, and the restart also ran properly. Later I tried the same experiment on a 2-node cluster, and it failed. I started the MPI program with the command:
>
> mpirun -np 2 -ssi rpi crtcp -ssi cr blcr C ./lamtest
>
> While the program was running, I took a checkpoint with the command:
>
> lamcheckpoint -ssi cr blcr -pid 10411   (*10411 is the pid of mpirun.)
>
> The command then hung and did not return until I pressed Ctrl-C.
> I checked the working directory, i.e., my home directory, and no context file was found. However, some temporary files named .context-xxxxx-xx.tmp were present.
> Could someone please tell me what the problem is? I would much appreciate it.
> Thanks.
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/

----
Josh Hursey
jjhurseyopen-mpi.org
http://www.open-mpi.org/


