[mpich-discuss] Question on fault tolerance

WONG Benedict -PICKERING benedict.wong at opg.com
Wed Mar 11 17:39:48 CDT 2009


We are planning to build a 30-machine cluster, and we are worried
about hardware failures that force a machine to shut down without a
proper shutdown sequence, or network problems that disconnect a
machine from the cluster.

The problem we are trying to solve is embarrassingly parallel, so even
if a node (i.e. a worker node) disappears, we could still use the
results from the rest of the nodes.  We could then recover by
reassigning the job from the disappeared node to the others.
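
To make the reassignment idea concrete, here is a minimal sketch in plain
Python. It does not use MPI at all: the worker names, the `run_jobs`
function, and the way a failure is detected (a set of known-failed workers)
are all assumptions made for illustration only.

```python
from collections import deque


def run_jobs(jobs, workers, failed_workers):
    """Assign jobs round-robin; requeue any job whose worker has disappeared.

    `failed_workers` is a hypothetical stand-in for real failure detection.
    """
    pending = deque(jobs)
    alive = list(workers)
    results = {}
    turn = 0
    while pending:
        if not alive:
            raise RuntimeError("no workers left to run the remaining jobs")
        job = pending.popleft()
        worker = alive[turn % len(alive)]
        if worker in failed_workers:
            # Simulated failure: drop the worker and put its job back in the queue.
            alive.remove(worker)
            pending.append(job)
            continue
        results[job] = (worker, job * job)  # stand-in for the real computation
        turn += 1
    return results, alive


results, alive = run_jobs(
    jobs=range(6), workers=["w0", "w1", "w2"], failed_workers={"w1"}
)
```

Because the jobs are independent, losing a worker only costs the job it was
holding; everything already in `results` stays valid.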

If we set up the communicator with MPI_ERRORS_RETURN and one of the
machines running MPI process(es) dies, would the rest of the processes
in the communicator receive an error in MPICH2?  And could we then
'handle' the error (i.e. create a new communicator, restart all nodes
with a new machine list, and so on)?
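
The control flow being asked about can be sketched as below. This is a toy
simulation in plain Python, not MPI: `SimulatedGroup`, `recv_result`, the
status codes, and the host names are all hypothetical. It only shows what
"check the status of every call, keep the surviving results, then rebuild"
would look like if each call did return an error code; whether a real
MPICH2 call returns at all after a peer dies is exactly the open question
in this thread.

```python
SUCCESS, ERR_PEER_LOST = 0, 1  # hypothetical status codes, not MPI constants


class SimulatedGroup:
    """Toy stand-in for a communicator: maps rank -> host name."""

    def __init__(self, hosts):
        self.hosts = dict(enumerate(hosts))
        self.dead = set()

    def recv_result(self, rank):
        """Return (status, value) instead of raising, like an error-returning call."""
        if self.hosts[rank] in self.dead:
            return ERR_PEER_LOST, None
        return SUCCESS, rank * 10  # stand-in for a worker's result


def collect(group):
    """Gather results from every rank, checking the status of each call."""
    saved, lost = {}, []
    for rank in group.hosts:
        status, value = group.recv_result(rank)
        if status == SUCCESS:
            saved[rank] = value
        else:
            lost.append(group.hosts[rank])
    return saved, lost


group = SimulatedGroup(["node01", "node02", "node03"])
group.dead.add("node02")  # simulate a machine dropping out
saved, lost = collect(group)

# 'Handle' the error: build a fresh group from the machines that still respond.
new_group = SimulatedGroup([h for h in group.hosts.values() if h not in lost])
```

Here `saved` holds the results from the surviving ranks, and `new_group`
plays the role of the recreated communicator with a new machine list.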

Regards,

By Ben Wong
Benedict.Wong at opg.com
Currently at KG Kipling with no fixed office space (416)231-4111 (x5417 to
x5419, and ask for Ben)

Permanent office at 
230 Westney Road South, 
Annandale, L1S 7R3
(905)428-4000x5458
(Fax) (905)619-5453



-----Original Message-----
...
...
------------------------------
Message: 7
Date: Tue, 10 Mar 2009 22:09:48 -0500
From: "Rajeev Thakur" <thakur at mcs.anl.gov>
Subject: Re: [mpich-discuss] mpich-discuss Digest, Vol 6, Issue 12
To: "'WONG Benedict -PICKERING'" <benedict.wong at opg.com>,
	<mpich-discuss at mcs.anl.gov>
Message-ID: <B396A44D767E4734BC956E2B9DD66832 at thakurlaptop>
Content-Type: text/plain;	charset="US-ASCII"

MPICH2 supports MPI_ERRORS_RETURN, but not in the case of process failure.

Rajeev 


> -----Original Message-----
> From: WONG Benedict -PICKERING [mailto:benedict.wong at opg.com]
> Sent: Tuesday, March 10, 2009 6:24 PM
> To: mpich-discuss at mcs.anl.gov
> Cc: thakur at mcs.anl.gov; WONG Benedict -PICKERING
> Subject: RE: mpich-discuss Digest, Vol 6, Issue 12
> 
> Rajeev,
> 
> Thanks for your reply.  Are you saying that the current
> version (1.0.8) of MPICH2 doesn't support MPI_ERRORS_RETURN?
> 
> 
> We are planning to 'recover' by checking the error status
> of each MPI call.  By 'recover', we mean saving the results
> from the processes that are still functioning, and then
> restarting all (or as many as possible) of the processes
> in that communicator.
> 
> Regards,
> 
> By Ben Wong
> 
> 
> 
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov 
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of 
> mpich-discuss-request at mcs.anl.gov
> Sent: Tuesday, March 10, 2009 1:00 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: mpich-discuss Digest, Vol 6, Issue 12
> 
> 
> 
> Today's Topics:
> 
>    1. Re:  Question on fault tolerance (WONG Benedict -PICKERING)
>    2. Re:  Question on fault tolerance (Rajeev Thakur)
> 
> 
> ----------------------------------------------------------------------
> 
> Message: 1
> Date: Mon, 9 Mar 2009 13:55:43 -0400
> From: "WONG Benedict -PICKERING" <benedict.wong at opg.com>
> Subject: Re: [mpich-discuss] Question on fault tolerance
> To: "WONG Benedict -PICKERING" <benedict.wong at opg.com>,
> 	<mpich-discuss at mcs.anl.gov>
> Message-ID: <06F6D8CC436B0345904D188806D4BE03568233 at CATOU-OGMAPUW02.corp.opg.com>
> Content-Type: text/plain;	charset="us-ascii"
> 
> 
> To all,
> 
> I am more or less answering my own question: after searching for
> fault tolerance (FT) in MPI, I found the following paper, which
> describes the method we could use!
> 
> Thanks!
> 
> 
> Fault Tolerance in MPI Programs
> 
> By William Gropp & Ewing Lusk
> 
> 
> By Ben Wong
> 
> 
> >  -----Original Message-----
> > From: 	WONG Benedict -PICKERING  
> > Sent:	Monday, March 09, 2009 11:10 AM
> > To:	'mpich-discuss at mcs.anl.gov'
> > Subject:	Question on fault tolerance
> > 
> > We are planning to use MPI (namely MPICH2) to implement our program.
> > It will be used as a production system with ~30 machines, each with
> > two 2.66 GHz Xeon L5430 CPUs and 24 GB of RAM.
> > 
> > We are designing our program so that it contains client instances,
> > manager instances and worker instances.  It is a basic client-server
> > structure: clients start up whenever needed and talk to the server
> > (the manager instances), and the manager instances assign jobs to
> > the worker instances.
> > 
> > We want to build fault tolerance into our program, so we will have a
> > backup server (a backup manager instance); for worker instances, if
> > any of them goes down, the system will just keep working with
> > degraded performance.
> > 
> > I have read "Beowulf Cluster Computing with Linux" and the "Using MPI"
> > two-volume set, so I understand most of the details of MPI.
> > 
> > However, here are the questions I haven't found an answer to yet:
> > 
> > After I set up a communicator with n processes, if one (or more) of
> > them dies, can the rest of the processes continue to communicate and
> > function?  Will functions such as MPI_Bcast and MPI_Barrier halt all
> > processes?  What about simple MPI_Send and MPI_Recv: could they
> > return some meaningful error so that the rest of the processes can
> > continue?
> > 
> > 
> > Regards,
> > 
> > By Ben Wong
> > 
> > 
> -----------------------------------------
> THIS MESSAGE IS ONLY INTENDED FOR THE USE OF THE INTENDED
> RECIPIENT(S) AND MAY CONTAIN INFORMATION THAT IS PRIVILEGED,
> PROPRIETARY AND/OR CONFIDENTIAL. If you are not the intended 
> recipient, you are hereby notified that any review, 
> retransmission, dissemination, distribution, copying, 
> conversion to hard copy or other use of this communication is 
> strictly prohibited. If you are not the intended recipient 
> and have received this message in error, please notify me by 
> return e-mail and delete this message from your system. 
> Ontario Power Generation Inc.
> 
> 
> ------------------------------
> 
> Message: 2
> Date: Mon, 9 Mar 2009 13:17:14 -0500
> From: "Rajeev Thakur" <thakur at mcs.anl.gov>
> Subject: Re: [mpich-discuss] Question on fault tolerance
> To: <mpich-discuss at mcs.anl.gov>
> Message-ID: <55E609AE364D44E49E845BF51DBB6C1D at mcs.anl.gov>
> Content-Type: text/plain;	charset="us-ascii"
> 
> Ben,
>     Yes, that paper describes the scenarios well. However,
> the current version of MPICH2 doesn't support that level of 
> fault tolerance yet, although we plan to do so in the near term.
> 
> Rajeev
>  
> 
> 
> 
> 
> ------------------------------
> 
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov 
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> 
> End of mpich-discuss Digest, Vol 6, Issue 12
> ********************************************
> 
> 



------------------------------



End of mpich-discuss Digest, Vol 6, Issue 13
********************************************