Thread: LAM: Virtual Machine sub-node failure tolerant




LAM: Virtual Machine sub-node failure tolerant
France
2007-05-15 09:23:27
Hi,

I have developed an application with LAM/MPI, and I have just one small
problem.

My application takes a very long time to converge and uses many
computers. I run it on the "student" computers, and these machines are
not very reliable...

I have seen that there is a fault-tolerant version of LAM/MPI, but I
don't need that (it is too complex for my small application). I just
want to know whether there is a version of LAM that doesn't stop
running if one sub-node (other than node "0") on the grid reboots or
fails. Or else another distribution, a module, or a startup option.

For now I use the stable version shipped with the Debian Linux
distribution (lam4). Because I have added a system to detect node
failures, I can continue my work if a machine crashes, but not if a
machine reboots. (In the first case, the virtual machine doesn't know
about the node failure and so continues its work; in the second, the
node warns the global virtual machine, which then crashes... I think.)

Thanks for your help

Sincerely,
Marc

_______________________________________________
This list is archived at http://www.lam-mpi.org/MailArchives/lam/

Re: LAM: Virtual Machine sub-node failure tolerant
United States
2007-05-26 18:08:39
On May 15, 2007, at 8:23 AM, Sauget Marc wrote:

> [...] I just want to know whether there is a version of LAM that
> doesn't stop running if one sub-node (other than node "0") on the
> grid reboots or fails. [...]

There is rudimentary support for what you are trying to do in LAM/MPI,
but it is not well tested and definitely not supported. If you run
lamboot with the -x option, it will enable "fault tolerance" in the LAM
universe. The LAM daemons will detect a node failure and fail all
communication pending to that node.

LAM's fault tolerance is really only useful for manager/worker codes
where the worker is launched with MPI_COMM_SPAWN. Have a look at
examples/fault/README in any recent LAM tarball for more information.
If you need more fault tolerance than this provides, you might want to
look at FT-MPI from the University of Tennessee, Knoxville.
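
As a rough illustration of why the manager/worker shape suits this kind
of fault tolerance, here is a minimal sketch of the manager's
bookkeeping. It is an assumption-laden model, not LAM code: it does not
call LAM/MPI at all, and the names run_manager, WorkerFailed, flaky_run
and the node names are hypothetical. In a real program the workers would
be started with MPI_COMM_SPAWN, and WorkerFailed would stand for a
communication that failed because the node was lost.

```python
from collections import deque


class WorkerFailed(Exception):
    """Hypothetical stand-in for a failed communication with a lost node."""


def run_manager(tasks, workers, run_task):
    """Hand tasks to workers in turn; requeue a task if its worker fails."""
    pending = deque(tasks)   # tasks still to be computed
    alive = deque(workers)   # workers believed to be reachable
    results = {}
    while pending:
        if not alive:
            raise RuntimeError("all workers have failed")
        task = pending.popleft()
        worker = alive[0]
        try:
            results[task] = run_task(worker, task)
            alive.rotate(-1)       # move this worker to the back of the line
        except WorkerFailed:
            alive.popleft()        # forget the lost worker
            pending.append(task)   # the task will be retried elsewhere
    return results


def flaky_run(worker, task):
    """Toy task runner: 'node2' is always down, the others square the task."""
    if worker == "node2":
        raise WorkerFailed(worker)
    return task * task


if __name__ == "__main__":
    done = run_manager([1, 2, 3, 4], ["node1", "node2", "node3"], flaky_run)
    print(sorted(done.items()))
```

The point of the sketch is only that the manager keeps all the state
needed to reassign work, so losing a worker costs one retried task
rather than the whole run. Note that node "0" (the manager) remains a
single point of failure in this scheme.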

Hope this helps,

Brian

-- 
   Brian Barrett
   LAM/MPI Developer
   Make today a LAM/MPI day!



Re: LAM: Virtual Machine sub-node failure tolerant
France
2007-06-05 08:08:30
Brian Barrett wrote:
> There is rudimentary support for what you are trying to do in LAM/MPI,
> but it is not well tested and definitely not supported. If you run
> lamboot with the -x option, it will enable "fault tolerance" in the
> LAM universe. [...]
>
> [...] Have a look at examples/fault/README in any recent LAM tarball
> for more information. [...]

Sorry for the delay, and thanks for the answer.

I had already found this information by reading the man page, and I
have used this example. Shame on me for asking this question.

Thanks

++ Marc
