Thread: LAM: Error message from LAM-7.1.2




LAM: Error message from LAM-7.1.2
user name
2006-10-11 15:40:22
Jeff Squyres wrote:
>
> These error messages mean that processes 2-7 tried to do a receive from
> someone who they later found out was dead, so they aborted.
>
What would be a "standard" (that is, a portable) way for one of the peer
processes to get notified about such a death? For example, if one of the
processes dies, I'd like the process of rank 0 to know about it in order
to change the strategy.

Cheers,
-- Sasha

_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/
LAM: Error message from LAM-7.1.2
user name
2006-10-13 20:41:33
LAM is -- at best -- only pseudo-able to handle the death of an MPI
process.  Specifically, I wouldn't recommend trying to write a fault
tolerant MPI application using LAM/MPI that could withstand the death
of a process in MPI_COMM_WORLD.

Keep in mind that MPI [quite intentionally] does not specify what
happens when a process dies, so it's totally up to the implementation
as to what to do.  Most MPI's, LAM/MPI included, simply kill the rest
of the job.  FT-MPI out of the University of Tennessee allows you to
do some interesting things, but you need to specifically write code
to their API, etc.

Work is ongoing in Open MPI to be able to handle these kinds of
errors.  The first step is adding checkpoint/restart capabilities in
Open MPI (the hardest part of which is all the infrastructure needed
to make that possible), and then we'll do more interesting things
after that (to include FT-MPI-like things).


On Oct 11, 2006, at 8:40 AM, Alexander L. Belikoff wrote:

> Jeff Squyres wrote:
>>
>> These error messages mean that processes 2-7 tried to do a receive
>> from someone who they later found out was dead, so they aborted.
>>
> What would be a "standard" (that is, a portable) way for one of the
> peer processes to get notified about such a death? For example, if one
> of the processes dies, I'd like the process of rank 0 to know about it
> in order to change the strategy.
>
> Cheers,
> -- Sasha


-- 
Jeff Squyres
Server Virtualization Business Unit
Cisco Systems

LAM: Error message from LAM-7.1.2
user name
2006-10-17 17:36:13
Hmm... checkpoints and restarts are good stuff in general, but they
require a lot of redesign and a significant amount of added complexity.

The reason I raised the question in the first place was that I was
toying with the idea of transitioning a distributed application that
currently uses PVM to MPI. In PVM, handling peer process termination, as
well as controlling the application topology in the VM, is fairly easy.
Since we are obsessive (for a good reason) about fault tolerance, our
application (that is, its "rank 0" master process) knows when a "slave"
dies and resubmits the job to another one. Moreover, we can also do cool
things like "blacklisting" some slaves on a certain machine when we are
confident the machine is not doing well, and get those slaves restarted
after some period of time - all from within the master process!
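
To make the master-side behaviour described above concrete, here is a
minimal sketch of the bookkeeping only: resubmitting the job of a dead
slave, blacklisting a machine after repeated failures, and readmitting it
after a cool-down. It is a toy model in plain Python, not PVM or MPI
code. The class `Master`, its methods, and the `max_failures` and
`cooldown` parameters are all hypothetical names chosen for illustration,
and it assumes that something else tells the master when a worker died
and that a dead worker is respawned on the same machine.

```python
from collections import deque


class Master:
    """Toy model of a "rank 0" master (all names are hypothetical)."""

    def __init__(self, workers, max_failures=2, cooldown=3):
        # workers maps a worker id to the machine it runs on.
        self.workers = dict(workers)
        self.max_failures = max_failures  # failures before blacklisting a machine
        self.cooldown = cooldown          # ticks a machine stays blacklisted
        self.failures = {}                # machine -> failure count
        self.blacklist = {}               # machine -> tick at which it is readmitted
        self.pending = deque()            # jobs waiting for a worker
        self.running = {}                 # worker id -> job
        self.done = []                    # finished jobs
        self.tick = 0                     # logical clock

    def submit(self, job):
        self.pending.append(job)

    def _usable(self, worker):
        # A worker can take a job if it is idle and its machine is not blacklisted.
        machine = self.workers[worker]
        return worker not in self.running and machine not in self.blacklist

    def schedule(self):
        """Hand pending jobs to idle workers on machines that are not blacklisted."""
        for worker in sorted(self.workers):
            if not self.pending:
                break
            if self._usable(worker):
                self.running[worker] = self.pending.popleft()

    def on_worker_death(self, worker):
        """Requeue the dead worker's job and count the failure against its machine."""
        job = self.running.pop(worker, None)
        if job is not None:
            self.pending.appendleft(job)
        machine = self.workers[worker]
        self.failures[machine] = self.failures.get(machine, 0) + 1
        if self.failures[machine] >= self.max_failures:
            self.blacklist[machine] = self.tick + self.cooldown

    def on_job_finished(self, worker):
        self.done.append(self.running.pop(worker))

    def advance(self):
        """Move the clock forward and readmit machines whose cool-down expired."""
        self.tick += 1
        for machine, until in list(self.blacklist.items()):
            if self.tick >= until:
                del self.blacklist[machine]
                self.failures[machine] = 0
```

In this sketch the master would call `on_worker_death` from whatever
notification mechanism the messaging layer offers, then `schedule` to
place the requeued job on a healthy worker.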

Unfortunately, I don't see how this level of service can be achieved in
MPI (at least in a fairly standard-compliant implementation) -
especially given your response. That is somewhat sad, since in a
distributed application (which is MPI's "raison d'etre") there are
plenty of points of failure, and many failures are not critical enough to
justify a full application restart. It would be great to see a fairly
simple API (no need for transactions/restarts/checkpoints) achieving just
that - in my opinion, it would make MPI much more suitable for
reasonably fault-tolerant applications (a requirement for many large
systems, including the one I'm dealing with).

Regards,
-- Sasha

Jeff Squyres wrote:
>
> LAM is -- at best -- only pseudo-able to handle the death of an MPI
> process.  Specifically, I wouldn't recommend trying to write a fault
> tolerant MPI application using LAM/MPI that could withstand the death
> of a process in MPI_COMM_WORLD.
>
> Keep in mind that MPI [quite intentionally] does not specify what
> happens when a process dies, so it's totally up to the implementation
> as to what to do.  Most MPI's, LAM/MPI included, simply kill the rest
> of the job.  FT-MPI out of the University of Tennessee allows you to
> do some interesting things, but you need to specifically write code
> to their API, etc.
>
> Work is ongoing in Open MPI to be able to handle these kinds of
> errors.  The first step is adding checkpoint/restart capabilities in
> Open MPI (the hardest part of which is all the infrastructure needed
> to make that possible), and then we'll do more interesting things
> after that (to include FT-MPI-like things).
>
