Hmm... checkpoints and restarts are good stuff in general, but they
require a lot of redesign and a significant amount of added complexity.

The reason I raised the question in the first place is that I was
toying with the idea of porting a distributed application that
currently uses PVM over to MPI. In PVM, handling peer process
termination, as well as controlling the application topology in the VM,
is fairly easy.
Since we are obsessive (for a good reason) about fault tolerance, our
application (that is, its "rank 0" master process) knows when a "slave"
dies and resubmits the job to another one. Moreover, we can also do
cool things like "blacklisting" some slaves on a certain machine when we
are confident the machine is not doing well, and getting those slaves
restarted after some period of time - all from within the master
process!

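For illustration only, the master-side bookkeeping described above (resubmit a job when its slave dies, blacklist a machine that keeps failing, let it back in after a while) might look roughly like the sketch below. It is plain Python, not PVM or MPI code. The class `Master`, its methods, and the `max_failures` and `cooldown` parameters are all hypothetical names, and how the master actually learns about a death is deliberately left out. The sketch also assumes a dead slave is respawned at once and is only kept idle while its machine is blacklisted.

```python
class Master:
    """Hypothetical bookkeeping for a master that farms jobs out to slaves."""

    def __init__(self, workers, max_failures=2, cooldown=60.0):
        self.host_of = dict(workers)   # worker id -> machine name
        self.idle = list(workers)      # workers ready to take a job
        self.pending = []              # jobs waiting to be assigned
        self.running = {}              # worker id -> job in flight
        self.failures = {}             # machine -> deaths seen so far
        self.blacklist = {}            # machine -> time it becomes usable again
        self.max_failures = max_failures
        self.cooldown = cooldown

    def submit(self, job):
        self.pending.append(job)

    def dispatch(self, now):
        """Hand pending jobs to idle workers on machines that are not blacklisted."""
        # Machines whose cooldown has expired are allowed back in.
        for host, until in list(self.blacklist.items()):
            if now >= until:
                del self.blacklist[host]
                self.failures[host] = 0
        assigned = []
        for worker in list(self.idle):
            if not self.pending:
                break
            if self.host_of[worker] in self.blacklist:
                continue
            self.idle.remove(worker)
            job = self.pending.pop(0)
            self.running[worker] = job
            assigned.append((worker, job))
        return assigned

    def worker_done(self, worker):
        """The worker finished its job and is available again."""
        self.running.pop(worker)
        self.idle.append(worker)

    def worker_died(self, worker, now):
        """Requeue the dead worker's job and count the failure against its machine."""
        job = self.running.pop(worker, None)
        if job is not None:
            self.pending.insert(0, job)   # resubmit ahead of newer jobs
        host = self.host_of[worker]
        self.failures[host] = self.failures.get(host, 0) + 1
        if self.failures[host] >= self.max_failures:
            self.blacklist[host] = now + self.cooldown
        # Assumption: the worker is respawned immediately; dispatch() simply
        # skips it for as long as its machine stays blacklisted.
        self.idle.append(worker)
```

The point of the sketch is that all of the policy lives in the master: it only needs a reliable "this peer died" event and a way to start peers again, not checkpoints or transactions.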
Unfortunately, I don't see how this level of service can be achieved in
MPI (at least in a fairly standard-compliant implementation) -
especially given your response. Which is somewhat sad, since in a
distributed application (which is MPI's "raison d'etre") there are
plenty of points of failure, and many failures are not critical enough
to justify a full application restart. It would be great to see a
fairly simple API (no need for transactions/restarts/checkpoints)
achieving just that - in my opinion, it would make MPI much more
suitable for reasonably fault-tolerant applications (a requirement for
many large systems, including the one I'm dealing with).

Regards,
-- Sasha
Jeff Squyres wrote:
>
> LAM is -- at best -- only pseudo-able to handle the death of an MPI
> process. Specifically, I wouldn't recommend trying to write a fault
> tolerant MPI application using LAM/MPI that could withstand the death
> of a process in MPI_COMM_WORLD.
>
> Keep in mind that MPI [quite intentionally] does not specify what
> happens when a process dies, so it's totally up to the implementation
> as to what to do. Most MPI's, LAM/MPI included, simply kill the rest
> of the job. FT-MPI out of the University of Tennessee allows you to
> do some interesting things, but you need to specifically write code
> to their API, etc.
>
> Work is ongoing in Open MPI to be able to handle these kinds of
> errors. The first step is adding checkpoint/restart capabilities in
> Open MPI (the hardest part of which is all the infrastructure needed
> to make that possible), and then we'll do more interesting things
> after that (to include FT-MPI-like things).
>
_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/