[mpich-discuss] Question on fault tolerance

Rajeev Thakur thakur at mcs.anl.gov
Wed Mar 11 18:01:28 CDT 2009


> If we set up the communicator with MPI_ERRORS_RETURN and one of the
> machines running MPI node(s) dies, would the rest of the nodes in the
> communicator receive an error in MPICH2?  

No, right now, the whole job will abort. As I mentioned earlier, we plan to
change that over the course of the year.
 
Rajeev

> -----Original Message-----
> From: WONG Benedict -PICKERING [mailto:benedict.wong at opg.com] 
> Sent: Wednesday, March 11, 2009 5:40 PM
> To: mpich-discuss at mcs.anl.gov; thakur at mcs.anl.gov
> Cc: WONG Benedict -PICKERING
> Subject: Re: [mpich-discuss] Question on fault tolerance
> 
> We are planning to create a 30-machine cluster, and we are worried
> about hardware failures that force a machine to shut down without
> the proper shutdown sequence, or network connection problems that
> disconnect a machine from the cluster.
> 
> What we are trying to solve is an embarrassingly parallel problem, so
> even if a node (i.e. a worker node) disappears, we could still use the
> results from the rest of the nodes, and we could recover by reassigning
> the jobs from the disappeared node to the others.
> 
> If we set up the communicator with MPI_ERRORS_RETURN and one of the
> machines running MPI node(s) dies, would the rest of the nodes in the
> communicator receive an error in MPICH2?  And could we 'handle' the
> error after that (i.e. create a new communicator, restart all nodes
> with a new machine list, and so on)?
> 
> Regards,
> 
> By Ben Wong
> Benedict.Wong at opg.com
> Currently at KG Kipling with no fixed office space 
> (416)231-4111(x5417 to
> x5419 and ask for Ben)
> 
> Permanent office at 
> 230 Westney Road South, 
> Annandale, L1S 7R3
> (905)428-4000x5458
> (Fax) (905)619-5453
> 
> 
> 
> -----Original Message-----
> ...
> ...
> ------------------------------
> Message: 7
> Date: Tue, 10 Mar 2009 22:09:48 -0500
> From: "Rajeev Thakur" <thakur at mcs.anl.gov>
> Subject: Re: [mpich-discuss] mpich-discuss Digest, Vol 6, Issue 12
> To: "'WONG Benedict -PICKERING'" <benedict.wong at opg.com>,
> 	<mpich-discuss at mcs.anl.gov>
> Message-ID: <B396A44D767E4734BC956E2B9DD66832 at thakurlaptop>
> Content-Type: text/plain;	charset="US-ASCII"
> 
> It supports MPI_ERRORS_RETURN, but not in the case of process failure. 
> 
> Rajeev 
> 
> 
> > -----Original Message-----
> > From: WONG Benedict -PICKERING [mailto:benedict.wong at opg.com]
> > Sent: Tuesday, March 10, 2009 6:24 PM
> > To: mpich-discuss at mcs.anl.gov
> > Cc: thakur at mcs.anl.gov; WONG Benedict -PICKERING
> > Subject: RE: mpich-discuss Digest, Vol 6, Issue 12
> > 
> > Rajeev,
> > 
> > Thanks for your reply.  Are you saying that the current
> > version (1.0.8) of MPICH2 doesn't support MPI_ERRORS_RETURN?
> > 
> > 
> > We are planning to 'recover' the process by checking the
> > error status on each MPI call.  By 'recover', we mean
> > saving the results from the processes that are still
> > functioning and then restarting all (or whatever is possible)
> > processes in that communicator.
> > 
> > Regards,
> > 
> > By Ben Wong
> > 
> > 
> > 
> > -----Original Message-----
> > From: mpich-discuss-bounces at mcs.anl.gov 
> > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of 
> > mpich-discuss-request at mcs.anl.gov
> > Sent: Tuesday, March 10, 2009 1:00 PM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: mpich-discuss Digest, Vol 6, Issue 12
> > 
> > 
> > 
> > Message: 1
> > Date: Mon, 9 Mar 2009 13:55:43 -0400
> > From: "WONG Benedict -PICKERING" <benedict.wong at opg.com>
> > Subject: Re: [mpich-discuss] Question on fault tolerance
> > To: "WONG Benedict -PICKERING" <benedict.wong at opg.com>,
> > 	<mpich-discuss at mcs.anl.gov>
> > Message-ID:
> > 	
> > 
> <06F6D8CC436B0345904D188806D4BE03568233 at CATOU-OGMAPUW02.corp.opg.com>
> > Content-Type: text/plain;	charset="us-ascii"
> > 
> > 
> > To all,
> > 
> > I kind of answered my own question after doing a search on fault
> > tolerance (FT) in MPI, and I found the following paper, which
> > describes the method we could use!
> > 
> > Thanks!
> > 
> > 
> > Fault Tolerance in MPI Programs
> > 
> > By William Gropp & Ewing Lusk
> > 
> > 
> > By Ben Wong
> > 
> > 
> > >  -----Original Message-----
> > > From: 	WONG Benedict -PICKERING  
> > > Sent:	Monday, March 09, 2009 11:10 AM
> > > To:	'mpich-discuss at mcs.anl.gov'
> > > Subject:	Question on fault tolerance
> > > 
> > > We are planning to use MPI (namely MPICH2) to implement our
> > > program.  This will be used as a production system with ~30
> > > machines, each with 2 Xeon L5430 2.66 GHz CPUs and 24 GB of RAM.
> > > 
> > > We are designing our program so that it contains client instances,
> > > manager instances, and worker instances.  So this is a basic
> > > client/server type of structure: clients start up whenever needed
> > > and talk to the server (manager instances), and the manager
> > > instances assign jobs to worker instances.
> > > 
> > > We want to build fault tolerance into our program, so we will have
> > > a backup server (a backup manager instance), and for the worker
> > > instances, if any of them goes down, the system will just keep
> > > working with degraded performance.
> > > 
> > > I read "Beowulf Cluster Computing with Linux" and the "Using MPI"
> > > two-volume set, so I understand most of the details of MPI.
> > > 
> > > However, here are the questions I haven't found an answer to yet:
> > > 
> > > After I set up a communicator with n processes, if one (or more)
> > > dies, can the rest of the processes continue to communicate and
> > > function?  Functions such as MPI_Bcast and MPI_Barrier: will they
> > > halt all processes?  How about simple MPI_Send & MPI_Recv: could
> > > they return some meaningful error so that the rest of the processes
> > > can continue?
> > > 
> > > 
> > > Regards,
> > > 
> > > By Ben Wong
> > > 
> > > 
> > -----------------------------------------
> > THIS MESSAGE IS ONLY INTENDED FOR THE USE OF THE INTENDED
> > RECIPIENT(S) AND MAY CONTAIN INFORMATION THAT IS PRIVILEGED,
> > PROPRIETARY AND/OR CONFIDENTIAL. If you are not the intended 
> > recipient, you are hereby notified that any review, 
> > retransmission, dissemination, distribution, copying, 
> > conversion to hard copy or other use of this communication is 
> > strictly prohibited. If you are not the intended recipient 
> > and have received this message in error, please notify me by 
> > return e-mail and delete this message from your system. 
> > Ontario Power Generation Inc.
> > 
> > 
> > ------------------------------
> > 
> > Message: 2
> > Date: Mon, 9 Mar 2009 13:17:14 -0500
> > From: "Rajeev Thakur" <thakur at mcs.anl.gov>
> > Subject: Re: [mpich-discuss] Question on fault tolerance
> > To: <mpich-discuss at mcs.anl.gov>
> > Message-ID: <55E609AE364D44E49E845BF51DBB6C1D at mcs.anl.gov>
> > Content-Type: text/plain;	charset="us-ascii"
> > 
> > Ben,
> >     Yes, that paper describes the scenarios well. However,
> > the current version of MPICH2 doesn't support that level of 
> > fault tolerance yet, although we plan to do so in the near term.
> > 
> > Rajeev
> >  
> > 
> > 
> > ------------------------------
> > 
> > _______________________________________________
> > mpich-discuss mailing list
> > mpich-discuss at mcs.anl.gov 
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> > 
> > 
> > End of mpich-discuss Digest, Vol 6, Issue 12
> > ********************************************
> > 
> > 
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> 
> End of mpich-discuss Digest, Vol 6, Issue 13
> ********************************************
> 
> 


