I now have a build of LAM/MPI 7.1.3 w/ BLCR on an old RedHat 9 box.
I am unable to reproduce any problems w/ BLCR 0.5.0 or 0.5.1.

However, I can warn LAM/MPI users away from 0.5.2. It appears that
while I fixed one mmap() bug, I simultaneously created a new one. The
symptom is a "Bad address" error (an indication of errno=EFAULT),
shown below:

$ lamrestart -ssi cr blcr -ssi cr_blcr_context_file context.mpirun.24543
Restart failed: Bad address
-----------------------------------------------------------------------------
It seems that [at least] one of the processes that was started with
mpirun did not invoke MPI_INIT before quitting (it is possible that
more than one process did not invoke MPI_INIT -- mpirun was only
notified of the first one, which was on node n0).

mpirun can *only* be used with MPI programs (i.e., programs that
invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
to run non-MPI programs over the lambooted nodes.
-----------------------------------------------------------------------------
Restart failed: Bad address
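
For anyone who wants to confirm the mapping between the message and the
errno value, here is a minimal sketch in Python. It assumes a Linux/glibc
system, where strerror() reports EFAULT as "Bad address"; the helper name
describe_errno is made up for illustration.

```python
import errno
import os

def describe_errno(code):
    """Return (symbolic name, system message) for an errno value."""
    # errno.errorcode maps numeric values to names such as "EFAULT".
    name = errno.errorcode.get(code, "UNKNOWN")
    # os.strerror() returns the platform's message text for the value.
    return name, os.strerror(code)

if __name__ == "__main__":
    name, message = describe_errno(errno.EFAULT)
    print(f"{name}: {message}")
```

The exact message text comes from the C library, so it may differ on
other platforms.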

I've already identified the source of this new problem, and it is fixed
in CVS. The 0.5.3 release of BLCR (whenever that comes out) will
include that fix.

In the meantime, I recommend LAM/MPI users stick to 0.5.0 or 0.5.1.
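
For scripts that need to guard against the affected release, the
recommendation can be sketched as a small check. This is Python and
purely illustrative: the function name and the idea of passing in a
version string are assumptions, not part of BLCR or LAM/MPI. Only the
version numbers come from this message.

```python
# Versions named in this message; everything else here is hypothetical.
KNOWN_GOOD = {"0.5.0", "0.5.1"}  # no problems reproduced with LAM/MPI 7.1.3
KNOWN_BAD = {"0.5.2"}            # restart fails with "Bad address"

def blcr_advice(version):
    """Classify a BLCR version string for use with LAM/MPI."""
    version = version.strip()
    if version in KNOWN_BAD:
        return "avoid"
    if version in KNOWN_GOOD:
        return "ok"
    # Anything else (e.g. a future 0.5.3) is not covered by this message.
    return "unknown"

if __name__ == "__main__":
    for v in ("0.5.0", "0.5.1", "0.5.2", "0.5.3"):
        print(v, blcr_advice(v))
```

How the installed version string is obtained is left out on purpose,
since that depends on the local installation.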
-Paul
Josh Hursey wrote:
> Unfortunately I cannot reproduce this. I am using the latest build of
> LAM/MPI (7.1.3) and BLCR (0.5.0), and all seems well. :(
>
> Can you upgrade to the latest LAM/MPI doing a fresh build? This will
> help reduce the number of variables that could be the issue.
>
> The failure of the SSI types upon checkpoint worries me a bit, since
> I have never seen that error thrown before. It makes me think that
> some memory is getting corrupted across the checkpoint.
> Can you try checkpointing/restarting a simple MPI program to see if
> it has the same problem? Something like a hello world program in a
> wait loop. This will give you enough time to checkpoint the process,
> terminate it, and restart it.
>
> -- Josh
>
> On Mar 27, 2007, at 12:39 PM, Paul H. Hargrove wrote:
>
>> Yuan,
>>
>> I've certainly not seen anything like that before. The fact that the
>> error message changed after adding "-ssi rpi crtcp" suggests to me that
>> Josh was on the right track. However, the new failure mode looks even
>> more ominous.
>>
>> My best guess would be that something changed in either BLCR or FC6 that
>> has broken the assumptions being made by the crtcp rpi module in
>> LAM/MPI. I don't currently have a system on which to test LAM/MPI+BLCR,
>> so I can't verify this.
>>
>> Depending on what has broken, the fix might belong in either LAM/MPI or
>> BLCR. I am afraid I probably won't have any chance to look at this in
>> detail for a couple weeks at least.
>>
>> Not sure about the 2 mpirun instances, but would guess that one of them
>> might be internal to lamcheckpoint operation. Passing an option such as
>> "-f" or "-l" to ps would give the parent id (PPID) and make it clear
>> who/what started the 2nd mpirun. As for the 3 cr_checkpoint
>> instances, they correspond to the 3 context files you would eventually
>> get: one for the mpirun and one for each of the two "rotating"
>> processes.
>>
>> -Paul
>>
>> Yuan Wan wrote:
>>> On Mon, 26 Mar 2007, Paul H. Hargrove wrote:
>>>
>>> Hi Paul,
>>>
>>> Thanks for your reply.
>>>
>>> I have tried to explicitly use "crtcp" module, but it caused a
>>> failure on checkpoint:
>>>
>>> $ mpirun -np 2 -ssi cr blcr -ssi rpi crtcp ./rotating
>>> $ lamcheckpoint -ssi cr blcr -pid 17256
>>>
>>>
>>> -----------------------------------------------------------------------------
>>>
>>> Encountered a failure in the SSI types while continuing from
>>> checkpoint. Aborting in despair :-(
>>>
>>> -----------------------------------------------------------------------------
>>>
>>> And the code never exits after it reaches the end.
>>> I checked the 'ps' list and found there are two 'mpirun' and
>>> three 'checkpoint' processes running:
>>> ---------------------------------------
>>> 17255 ? 00:00:00 lamd
>>> 17256 pts/2 00:00:00 mpirun
>>> 17257 ? 00:00:15 rotating
>>> 17258 ? 00:00:15 rotating
>>> 17263 pts/3 00:00:00 lamcheckpoint
>>> 17264 pts/3 00:00:00 cr_checkpoint
>>> 17265 pts/2 00:00:00 mpirun
>>> 17266 ? 00:00:00 cr_checkpoint
>>> 17267 ? 00:00:00 cr_checkpoint
>>> ---------------------------------------
>>>
>>> --Yuan
>>>
>>>
>>>
>>>> Yuan,
>>>>
>>>> I've not encountered this problem before. It looks as if something is
>>>> triggering a LAM-internal error message. It is possible that this is
>>>> a result of a BLCR problem, or it could be a LAM/MPI problem. If the
>>>> problem *is* in BLCR, then there is not enough information here to try
>>>> to find it.
>>>> I see that you have also asked on the LAM/MPI mailing list, and that
>>>> Josh Hursey made a suggestion there. I am monitoring that thread and
>>>> will make any BLCR-specific comments if I can. However, at this point
>>>> I don't have any ideas beyond Josh's suggestion to explicitly set the
>>>> rpi module to crtcp.
>>>>
>>>> -Paul
>>>>
>>>> Yuan Wan wrote:
>>>>> Hi all,
>>>>>
>>>>> I got some problem when checkpointing lam/mpi code using blcr.
>>>>>
>>>>> My platform is a 2-cpu machine running Fedora Core 6 (kernel 2.6.19)
>>>>> I have built blcr-0.5.0 and it works well with serial codes.
>>>>>
>>>>> I built LAM/MPI 7.1.2
>>>>> ---------------------------------------------
>>>>> $ ./configure --prefix=/home/pst/lam --with-rsh="ssh -x" --with-cr-blcr=/home/pst/blcr
>>>>> $ make
>>>>> $ make install
>>>>> ---------------------------------------------
>>>>>
>>>>> The laminfo output is
>>>>> -----------------------------------------------------
>>>>> LAM/MPI: 7.1.2
>>>>> Prefix: /home/pst/lam
>>>>> Architecture: i686-pc-linux-gnu
>>>>> Configured by: pst
>>>>> Configured on: Sat Mar 24 00:40:42 GMT 2007
>>>>> Configure host: master00
>>>>> Memory manager: ptmalloc2
>>>>> C bindings: yes
>>>>> C++ bindings: yes
>>>>> Fortran bindings: yes
>>>>> C compiler: gcc
>>>>> C++ compiler: g++
>>>>> Fortran compiler: g77
>>>>> Fortran symbols: double_underscore
>>>>> C profiling: yes
>>>>> C++ profiling: yes
>>>>> Fortran profiling: yes
>>>>> C++ exceptions: no
>>>>> Thread support: yes
>>>>> ROMIO support: yes
>>>>> IMPI support: no
>>>>> Debug support: no
>>>>> Purify clean: no
>>>>> SSI boot: globus (API v1.1, Module v0.6)
>>>>> SSI boot: rsh (API v1.1, Module v1.1)
>>>>> SSI boot: slurm (API v1.1, Module v1.0)
>>>>> SSI coll: lam_basic (API v1.1, Module v7.1)
>>>>> SSI coll: shmem (API v1.1, Module v1.0)
>>>>> SSI coll: smp (API v1.1, Module v1.2)
>>>>> SSI rpi: crtcp (API v1.1, Module v1.1)
>>>>> SSI rpi: lamd (API v1.0, Module v7.1)
>>>>> SSI rpi: sysv (API v1.0, Module v7.1)
>>>>> SSI rpi: tcp (API v1.0, Module v7.1)
>>>>> SSI rpi: usysv (API v1.0, Module v7.1)
>>>>> SSI cr: blcr (API v1.0, Module v1.1)
>>>>> SSI cr: self (API v1.0, Module v1.0)
>>>>> --------------------------------------------------------
>>>>>
>>>>>
>>>>> My parallel code works well with lam without any checkpoint
>>>>> $ mpirun -np 2 ./job
>>>>>
>>>>> Then I run my parallel job in checkpointable way
>>>>> $ mpirun -np 2 -ssi cr blcr ./rotating
>>>>>
>>>>> And checkpoint this job in another window
>>>>> $ lamcheckpoint -ssi cr blcr -pid 11928
>>>>>
>>>>> This operation produces a context file for mpirun
>>>>>
>>>>> "context.mpirun.11928"
>>>>>
>>>>> plus two context files for the job
>>>>>
>>>>> "context.11928-n0-11929"
>>>>> "context.11928-n0-11930"
>>>>>
>>>>> Seems so far so good
>>>>> -------------------------------------------------------
>>>>>
>>>>> However, when I restart the job with the context file:
>>>>> $ lamrestart -ssi cr blcr -ssi cr_blcr_context_file ~/context.mpirun.11928
>>>>>
>>>>> I got the following error:
>>>>>
>>>>> Results CORRECT on rank 0 ["This line is the output in code"]
>>>>>
>>>>> MPI_Finalize: internal MPI error: Invalid argument (rank 137389200,
>>>>> MPI_COMM_WORLD)
>>>>> Rank (0, MPI_COMM_WORLD): Call stack within LAM:
>>>>> Rank (0, MPI_COMM_WORLD): - MPI_Finalize()
>>>>> Rank (0, MPI_COMM_WORLD): - main()
>>>>>
>>>>>
>>>>> -----------------------------------------------------------------------------
>>>>> It seems that [at least] one of the processes that was started with
>>>>> mpirun did not invoke MPI_INIT before quitting (it is possible that
>>>>> more than one process did not invoke MPI_INIT -- mpirun was only
>>>>> notified of the first one, which was on node n0).
>>>>>
>>>>> mpirun can *only* be used with MPI programs (i.e., programs that
>>>>> invoke MPI_INIT and MPI_FINALIZE). You can use the "lamexec" program
>>>>> to run non-MPI programs over the lambooted nodes.
>>>>>
>>>>> -----------------------------------------------------------------------------
>>>>>
>>>>> Has anyone met this problem before and knows how to solve it?
>>>>>
>>>>> Many Thanks
>>>>>
>>>>>
>>>>> --Yuan
>>>>>
>>>>>
>>>>> Yuan Wan
>>>>
>>>>
>>
>> --
>> Paul H. Hargrove PHHargrove lbl.gov
>> Future Technologies Group
>> HPC Research Department Tel: +1-510-495-2352
>> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
>> _______________________________________________
>> This list is archived at http://www.lam-mpi.org/MailArchives/lam/
>
> ----
> Josh Hursey
> jjhursey open-mpi.org
> http://www.open-mpi.org/
>
--
Paul H. Hargrove PHHargrove lbl.gov
Future Technologies Group
HPC Research Department Tel: +1-510-495-2352
Lawrence Berkeley National Laboratory Fax: +1-510-486-6900