[mpich-discuss] Question on fault tolerance

Rajeev Thakur thakur at mcs.anl.gov
Mon Mar 9 13:17:14 CDT 2009


Ben,
    Yes, that paper describes the scenarios well. However, the current
version of MPICH2 doesn't support that level of fault tolerance yet,
although we plan to do so in the near term.

Rajeev
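
For the "degraded performance" requirement in the quoted question, here is a
minimal sketch of the manager-side bookkeeping only. It is an illustration
under stated assumptions, not something MPICH2 provides. The names
worker_pool, pool_init, pool_live, pool_dispatch, send_fn and demo_send are
all hypothetical. The send_fn hook stands in for a real send that reports
failure through its return code, for example MPI_Send on a communicator
whose error handler has been set to MPI_ERRORS_RETURN instead of the default
MPI_ERRORS_ARE_FATAL. Whether a given MPI implementation actually returns
such an error after a peer dies is not guaranteed.

```c
#include <assert.h>

#define MAX_WORKERS 64

/* Hypothetical transport hook: returns 0 on success, nonzero on failure.
 * In a real program this might wrap MPI_Send on a communicator that uses
 * MPI_ERRORS_RETURN; that wiring is an assumption and is not shown here. */
typedef int (*send_fn)(int worker, int job);

/* Manager-side view of the worker set. */
typedef struct {
    int alive[MAX_WORKERS]; /* 1 while the worker is believed reachable */
    int count;              /* total number of workers ever registered */
    int next;               /* round-robin cursor */
} worker_pool;

/* Mark all `count` workers as alive and reset the cursor. */
void pool_init(worker_pool *p, int count) {
    assert(count > 0 && count <= MAX_WORKERS);
    p->count = count;
    p->next = 0;
    for (int i = 0; i < count; i++) {
        p->alive[i] = 1;
    }
}

/* Number of workers still believed alive. */
int pool_live(const worker_pool *p) {
    int n = 0;
    for (int i = 0; i < p->count; i++) {
        n += p->alive[i];
    }
    return n;
}

/* Try each worker at most once, starting at the cursor. A worker whose
 * send fails is marked dead and skipped from then on. Returns the index
 * of the worker that accepted the job, or -1 if no live worker took it. */
int pool_dispatch(worker_pool *p, int job, send_fn send) {
    for (int tried = 0; tried < p->count; tried++) {
        int w = p->next;
        p->next = (p->next + 1) % p->count;
        if (!p->alive[w]) {
            continue;
        }
        if (send(w, job) == 0) {
            return w;
        }
        p->alive[w] = 0; /* degrade: stop using this worker */
    }
    return -1;
}

/* Demo stub only: pretends that worker 1 has died. */
int demo_send(int worker, int job) {
    (void)job;
    return worker == 1 ? 1 : 0;
}
```

With three workers and demo_send, the first job goes to worker 0, the second
fails on worker 1 and falls through to worker 2, and from then on the pool
alternates between workers 0 and 2. The design choice is that a failed send
is treated as permanent; a real system would probably also want timeouts and
a way to re-admit a restarted worker.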
 

> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of WONG 
> Benedict -PICKERING
> Sent: Monday, March 09, 2009 12:56 PM
> To: WONG Benedict -PICKERING; mpich-discuss at mcs.anl.gov
> Subject: Re: [mpich-discuss] Question on fault tolerance
> 
> 
> To all,
> 
> I am kind of answering my own question: after searching for fault
> tolerance (FT) in MPI, I found the following paper, which describes the
> method we could use!
> 
> Thanks!
> 
> 
> Fault Tolerance in MPI Programs
> 
> By William Gropp & Ewing Lusk
> 
> 
> By Ben Wong
> Benedict.Wong at opg.com
> Currently at KG Kipling with no fixed office space
> (416)231-4111(x5417 to x5419 and ask for Ben)
> 
> Permanent office at 
> 230 Westney Road South, 
> Annandale, L1S 7R3
> (905)428-4000x5458
> (Fax) (905)619-5453
> 
> 
> >  -----Original Message-----
> > From: 	WONG Benedict -PICKERING  
> > Sent:	Monday, March 09, 2009 11:10 AM
> > To:	'mpich-discuss at mcs.anl.gov'
> > Subject:	Question on fault tolerance
> > 
> > We are planning to use MPI (namely MPICH2) to implement our program.
> > It will be used as a production system with ~30 machines, each with
> > two Xeon L5430 2.66 GHz CPUs and 24 GB of RAM.
> > 
> > We are designing our program so that it contains client instances,
> > manager instances and worker instances.  This is a basic client and
> > server type of structure: clients start up whenever needed and talk
> > to the server (the manager instance), and the manager instance
> > assigns jobs to worker instances.
> > 
> > We want to build fault tolerance into our program, so we will have a
> > backup server (backup manager instance).  As for the worker instances,
> > if any of them goes down, the system will just keep working with
> > degraded performance.
> > 
> > I have read "Beowulf Cluster Computing with Linux" and the "Using MPI"
> > two-volume set, so I understand most of the details of MPI.
> > 
> > However, here are the questions I haven't found an answer to yet:
> > 
> > After I set up a communicator with n processes, if one (or more) of
> > them dies, can the rest of the processes continue to communicate and
> > function?  Will functions such as MPI_Bcast and MPI_Barrier halt all
> > processes?  How about the simple MPI_Send and MPI_Recv: could they
> > return some meaningful error so that the rest of the processes can
> > continue?
> > 
> > 
> > Regards,
> > 
> > By Ben Wong
> > Benedict.Wong at opg.com
> > 
> > 


