[mpich-discuss] mpich-discuss Digest, Vol 6, Issue 12

Rajeev Thakur thakur at mcs.anl.gov
Tue Mar 10 22:09:48 CDT 2009


It supports MPI_ERRORS_RETURN, but not in the case of process failure.
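
For reference, a minimal sketch of that error-return style of
checking (illustrative only, not specific to MPICH2; as noted, it
does not help once a process has actually died):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rc, rank, token = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Replace the default MPI_ERRORS_ARE_FATAL handler so MPI
           calls return an error code instead of aborting the job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        rc = MPI_Bcast(&token, 1, MPI_INT, 0, MPI_COMM_WORLD);
        if (rc != MPI_SUCCESS) {
            char msg[MPI_MAX_ERROR_STRING];
            int len;
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d: MPI_Bcast failed: %s\n",
                    rank, msg);
            /* application-specific recovery would go here */
        }

        MPI_Finalize();
        return 0;
    }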

Rajeev 


> -----Original Message-----
> From: WONG Benedict -PICKERING [mailto:benedict.wong at opg.com] 
> Sent: Tuesday, March 10, 2009 6:24 PM
> To: mpich-discuss at mcs.anl.gov
> Cc: thakur at mcs.anl.gov; WONG Benedict -PICKERING
> Subject: RE: mpich-discuss Digest, Vol 6, Issue 12
> 
> Rajeev,
> 
> Thanks for your reply.  Are you saying that the current 
> version (1.0.8) of MPICH2 doesn't support MPI_ERRORS_RETURN?
> 
> 
> We are planning to 'recover' the process by checking the
> error status on each MPI call.  By 'recover', we mean
> saving the results from the processes that are still
> functioning, and then restarting all (or as many as
> possible) of the processes in that communicator.
> 
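> As a very rough sketch, the restart step might look something like
> this (the program name, counts, and recovery details here are
> placeholders, and it assumes the manager process itself survives):
> 
>     /* Respawn replacement workers with MPI_Comm_spawn; "worker"
>        and nfailed are placeholders for the real program name and
>        the number of processes lost. */
>     MPI_Comm newworkers;
>     int nfailed = 2;  /* example: two workers lost */
>     int rc = MPI_Comm_spawn("worker", MPI_ARGV_NULL, nfailed,
>                             MPI_INFO_NULL, 0 /* root */,
>                             MPI_COMM_SELF, &newworkers,
>                             MPI_ERRCODES_IGNORE);
>     if (rc == MPI_SUCCESS) {
>         /* re-send the saved state and pending jobs to the
>            new workers over the intercommunicator */
>     }
> 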
> Regards,
> 
> By Ben Wong
> Benedict.Wong at opg.com
> Currently at KG Kipling with no fixed office space
> (416)231-4111 (x5417 to x5419 and ask for Ben)
> 
> Permanent office at
> 230 Westney Road South,
> Annandale, L1S 7R3
> (905)428-4000x5458
> (Fax) (905)619-5453
> 
> 
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of 
> mpich-discuss-request at mcs.anl.gov
> Sent: Tuesday, March 10, 2009 1:00 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: mpich-discuss Digest, Vol 6, Issue 12
> 
> 
> Send mpich-discuss mailing list submissions to
> 	mpich-discuss at mcs.anl.gov
> 
> To subscribe or unsubscribe via the World Wide Web, visit
> 	https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> or, via email, send a message with subject or body 'help' to
> 	mpich-discuss-request at mcs.anl.gov
> 
> You can reach the person managing the list at
> 	mpich-discuss-owner at mcs.anl.gov
> 
> When replying, please edit your Subject line so it is more 
> specific than
> "Re: Contents of mpich-discuss digest..."
> 
> 
> Today's Topics:
> 
>    1. Re:  Question on fault tolerance (WONG Benedict -PICKERING)
>    2. Re:  Question on fault tolerance (Rajeev Thakur)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 9 Mar 2009 13:55:43 -0400
> From: "WONG Benedict -PICKERING" <benedict.wong at opg.com>
> Subject: Re: [mpich-discuss] Question on fault tolerance
> To: "WONG Benedict -PICKERING" <benedict.wong at opg.com>,
> 	<mpich-discuss at mcs.anl.gov>
> Message-ID:
> 	<06F6D8CC436B0345904D188806D4BE03568233 at CATOU-OGMAPUW02.corp.opg.com>
> Content-Type: text/plain;	charset="us-ascii"
> 
> 
> To all,
> 
> I sort of answered my own question after doing a search on fault
> tolerance (FT) in MPI, and I found the following paper describing
> the method we could use!
> 
> Thanks!
> 
> 
> Fault Tolerance in MPI Programs
> 
> By William Gropp & Ewing Lusk
> 
> 
> By Ben Wong
> Benedict.Wong at opg.com
> Currently at KG Kipling with no fixed office space
> (416)231-4111 (x5417 to x5419 and ask for Ben)
> 
> Permanent office at
> 230 Westney Road South,
> Annandale, L1S 7R3
> (905)428-4000x5458
> (Fax) (905)619-5453
> 
> 
> > -----Original Message-----
> > From: 	WONG Benedict -PICKERING  
> > Sent:	Monday, March 09, 2009 11:10 AM
> > To:	'mpich-discuss at mcs.anl.gov'
> > Subject:	Question on fault tolerance
> > 
> > We are planning to use MPI (namely MPICH2) to implement our
> > program.  This will be used as a production system with ~30
> > machines, each with two Xeon L5430 2.66 GHz CPUs and 24 GB of RAM.
> > 
> > We are designing our program so that it contains client instances,
> > manager instances and worker instances.  So this is a basic client
> > and server type of structure: clients start up whenever needed and
> > talk to the server (manager instances), and the manager instance
> > assigns jobs to worker instances.
> > 
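> > As a rough sketch, the dispatch part of the manager might look
> > like this (the tags, job count, and int payloads are placeholders
> > for our real job descriptions; each worker sends a result,
> > initially a dummy one, to ask for its next job):
> > 
> >     enum { JOB_TAG = 1, RESULT_TAG = 2 };       /* placeholder tags */
> >     MPI_Status status;
> >     int result, next_job = 0, num_jobs = 100;   /* placeholder count */
> > 
> >     /* Hand the next job to whichever worker reports in, using
> >        MPI_ANY_SOURCE so one slow worker doesn't block the rest. */
> >     while (next_job < num_jobs) {
> >         MPI_Recv(&result, 1, MPI_INT, MPI_ANY_SOURCE, RESULT_TAG,
> >                  MPI_COMM_WORLD, &status);
> >         MPI_Send(&next_job, 1, MPI_INT, status.MPI_SOURCE, JOB_TAG,
> >                  MPI_COMM_WORLD);
> >         next_job++;
> >     }
> > 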
> > We want to build fault tolerance into our program, so we will have
> > a backup server (backup manager instance); and for the worker
> > instances, if any of them goes down, the system will just keep
> > working with degraded performance.
> > 
> > I have read "Beowulf Cluster Computing with Linux" and the "Using
> > MPI" two-volume set, so I understand most of the details of MPI.
> > 
> > However, here are the questions I haven't found answers to yet:
> > 
> > After I set up a communicator with n processes, if one (or more)
> > of them dies, can the rest of the processes continue to
> > communicate and function?  Will functions such as MPI_Bcast and
> > MPI_Barrier halt all processes?  How about simple MPI_Send &
> > MPI_Recv: could they return some meaningful error so that the
> > rest of the processes can continue?
> > 
> > 
> > Regards,
> > 
> > By Ben Wong
> > Benedict.Wong at opg.com
> > Currently at KG Kipling with no fixed office space
> > (416)231-4111 (x5417 to x5419 and ask for Ben)
> > 
> > Permanent office at
> > 230 Westney Road South,
> > Annandale, L1S 7R3
> > (905)428-4000x5458
> > (Fax) (905)619-5453
> > 
> > 
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 9 Mar 2009 13:17:14 -0500
> From: "Rajeev Thakur" <thakur at mcs.anl.gov>
> Subject: Re: [mpich-discuss] Question on fault tolerance
> To: <mpich-discuss at mcs.anl.gov>
> Message-ID: <55E609AE364D44E49E845BF51DBB6C1D at mcs.anl.gov>
> Content-Type: text/plain;	charset="us-ascii"
> 
> Ben,
>     Yes, that paper describes the scenarios well. However, 
> the current version of MPICH2 doesn't support that level of 
> fault tolerance yet, although we plan to do so in the near term.
> 
> Rajeev
>  
> 
> > -----Original Message-----
> > From: mpich-discuss-bounces at mcs.anl.gov
> > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> > WONG Benedict -PICKERING
> > Sent: Monday, March 09, 2009 12:56 PM
> > To: WONG Benedict -PICKERING; mpich-discuss at mcs.anl.gov
> > Subject: Re: [mpich-discuss] Question on fault tolerance
> > 
> > [...]
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> 
> End of mpich-discuss Digest, Vol 6, Issue 12
> ********************************************
> 
> 


