[mpich-discuss] Question on fault tolerance

WONG Benedict -PICKERING benedict.wong at opg.com
Mon Mar 9 12:55:43 CDT 2009


To all,

I am more or less answering my own question: after searching for fault
tolerance (FT) in MPI, I found the following paper, which describes a
method we could use.

Thanks!


Fault Tolerance in MPI Programs

By William Gropp & Ewing Lusk


By Ben Wong
Benedict.Wong at opg.com
Currently at KG Kipling with no fixed office space
(416)231-4111(x5417 to x5419 and ask for Ben)

Permanent office at 
230 Westney Road South, 
Annandale, L1S 7R3
(905)428-4000x5458
(Fax) (905)619-5453


>  -----Original Message-----
> From: 	WONG Benedict -PICKERING  
> Sent:	Monday, March 09, 2009 11:10 AM
> To:	'mpich-discuss at mcs.anl.gov'
> Subject:	Question on fault tolerance
> 
> We are planning to use MPI (namely MPICH2) to implement our program.
> It will run as a production system on ~30 machines, each with 2
> Xeon L5430 2.66 GHz CPUs and 24 GB of RAM.
> 
> We are designing our program so that it contains client instances,
> manager instances and worker instances.  It is a basic client/server
> structure: clients start up whenever needed and talk to the server
> (the manager instances), and the manager instances assign jobs to
> worker instances.
> 
> We want to build fault tolerance into our program, so we will have a
> backup server (a backup manager instance).  For worker instances, if
> any of them goes down, the system should simply keep working with
> degraded performance.
> 
> I have read "Beowulf Cluster Computing with Linux" and the "Using MPI"
> two-volume set, so I understand most of the details of MPI.
> 
> However, here are the questions I have not found answers to yet:
> 
> After I set up a communicator with n processes, if one (or more) of
> them dies, can the rest of the processes continue to communicate and
> function?  Will collective functions such as MPI_Bcast and MPI_Barrier
> halt all processes?  How about the simple MPI_Send and MPI_Recv: could
> they return some meaningful error so that the rest of the processes
> can continue?
> 
> 
> Regards,
> 
> 
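
The degraded-mode idea in the quoted question (if a worker goes down, the
manager stops using it and hands its job to a surviving worker) can be
sketched without any MPI at all. What follows is only a plain Python
simulation under stated assumptions: WorkerDied, make_worker and manager
are hypothetical names, the exception merely stands in for an error
reported by a failed send or receive, and none of it is taken from MPICH2
or from the paper cited above.

```python
from collections import deque


class WorkerDied(Exception):
    """Hypothetical stand-in for an error reported by a failed send/receive."""


def make_worker(fail_on=None):
    """Return a fake worker; it 'dies' when handed the job `fail_on`."""
    def run(job):
        if job == fail_on:
            raise WorkerDied(job)
        return job * job  # placeholder computation
    return run


def manager(jobs, workers):
    """Hand out jobs round-robin; drop dead workers and requeue their jobs."""
    pending = deque(jobs)
    alive = dict(workers)  # worker name -> callable
    results = {}
    while pending:
        if not alive:
            raise RuntimeError("no workers left")
        for name in list(alive):
            if not pending:
                break
            job = pending.popleft()
            try:
                results[job] = alive[name](job)
            except WorkerDied:
                del alive[name]      # stop talking to the failed worker
                pending.append(job)  # a surviving worker gets the job later
    return results, sorted(alive)


results, survivors = manager(
    jobs=range(6),
    workers={
        "w0": make_worker(),
        "w1": make_worker(fail_on=1),  # this worker fails on its first job
        "w2": make_worker(),
    },
)
print(results, survivors)
```

In this sketch the run still finishes every job after w1 fails; it just
does so with two workers instead of three. A real MPI program would
additionally have to detect the failure and avoid communicating with the
dead process, which the simulation does not attempt to model.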

