Interesting. What version of LAM/MPI and BLCR are you using? Can you
checkpoint/restart a non-MPI application on each of the machines you
are using, individually?

If you can do all of that, I'd be interested in seeing a debugging
backtrace (say, from gdb) of mpirun and of the processes it launched.
That should tell us where they got stuck or what they are waiting on.
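
For reference, the two checks above could be scripted roughly as
follows. This is only a sketch: the helper names blcr_smoke_test and
dump_backtrace are made up for illustration, and it assumes BLCR's
cr_run, cr_checkpoint and cr_restart utilities and gdb are on the PATH
and that checkpoints land in the default context.<pid> file in the
current directory.

```shell
#!/bin/sh
# Sketch only: helper names are hypothetical, not part of BLCR or LAM/MPI.

# Checkpoint and restart a plain (non-MPI) program on the local node.
# Usage: blcr_smoke_test ./some_long_running_program [args...]
blcr_smoke_test() {
    [ $# -ge 1 ] || { echo "usage: blcr_smoke_test PROGRAM [ARGS...]" >&2; return 2; }
    cr_run "$@" &                    # start the program with BLCR support loaded
    pid=$!
    sleep 5                          # give it a moment to get going
    cr_checkpoint "$pid" || return 1 # assumed to write context.<pid> in the cwd
    kill "$pid"                      # stop the original so its pid is free again
    wait "$pid" 2>/dev/null
    cr_restart "context.$pid"        # resume from the saved context file
}

# Print a backtrace of every thread in a running process.
# Usage: dump_backtrace PID   (e.g. the pid of mpirun or of one MPI rank)
dump_backtrace() {
    [ $# -eq 1 ] || { echo "usage: dump_backtrace PID" >&2; return 2; }
    gdb -p "$1" -batch -ex "thread apply all bt"
}
```

Run blcr_smoke_test on each node separately, then, while lamcheckpoint
is hung, run dump_backtrace against mpirun and against each launched
process on its own node.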
Cheers,
Josh
On Mar 13, 2007, at 10:19 AM, Fu HongYi wrote:
> hi everyone.
> i am working with blcr/lammpi, but something goes wrong and i don't
> know what's the reason.
> first i installed blcr and lammpi properly. hence i tested c/r on a
> mpi program running on a single node. everything went smoothly. the
> program ran, checkpoint commands were executed successfully,
> context files were generated, and restart process as well ran
> properly. later i tried the same experiment on a 2-node cluster, in
> which i got failed. i started the mpi program with command:
>
> mpirun -np 2 -ssi rpi crtcp -ssi cr blcr C ./lamtest
>
> while the program was running, i did checkpoints using command:
>
> lamcheckpoint -ssi cr blcr -pid 10411 (*10411 is the pid of mpirun.)
>
> thus the command stopped there and never returned until ctrl-c.
> i checked the working directory, i. e., my home directory, and no
> context file was found. however, some temporary files named
> as .context-xxxxx-xx.tmp presented.
> so someone please tell me what's the problem and i will be much
> appreciated.
> thanks .
>
> _______________________________________________
> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
----
Josh Hursey
jjhursey open-mpi.org
http://www.open-mpi.org/