[mpich-discuss] Question on fault tolerance
Rajeev Thakur
thakur at mcs.anl.gov
Wed Mar 11 18:01:28 CDT 2009
> If we set up the communicator with MPI_ERRORS_RETURN and one of the
> machines running MPI node(s) dies, would the rest of the nodes in the
> communicator receive an error in MPICH2?
No, right now, the whole job will abort. As I mentioned earlier, we plan to
change that over the course of the year.
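
The error-handler setup itself is the right starting point, though.
A minimal sketch of it (assuming an implementation that survives the
failure, which MPICH2 currently does not for process failure): with
MPI_ERRORS_RETURN installed on a communicator, calls on it return an
error code instead of aborting, and MPI_Error_string decodes it.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[])
    {
        int rank, buf = 0;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Return error codes to the caller instead of aborting the job. */
        MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

        if (rank == 0) {
            int err = MPI_Recv(&buf, 1, MPI_INT, 1, 0, MPI_COMM_WORLD,
                               MPI_STATUS_IGNORE);
            if (err != MPI_SUCCESS) {
                char msg[MPI_MAX_ERROR_STRING];
                int len;
                MPI_Error_string(err, msg, &len);
                fprintf(stderr, "MPI_Recv failed: %s\n", msg);
                /* recovery logic would go here */
            }
        } else if (rank == 1) {
            MPI_Send(&buf, 1, MPI_INT, 0, 0, MPI_COMM_WORLD);
        }

        MPI_Finalize();
        return 0;
    }
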
Rajeev
> -----Original Message-----
> From: WONG Benedict -PICKERING [mailto:benedict.wong at opg.com]
> Sent: Wednesday, March 11, 2009 5:40 PM
> To: mpich-discuss at mcs.anl.gov; thakur at mcs.anl.gov
> Cc: WONG Benedict -PICKERING
> Subject: Re: [mpich-discuss] Question on fault tolerance
>
> We are planning to create a 30-machine cluster, and we are worried
> about a hardware failure that forces a machine to shut down without a
> proper shutdown sequence, or even a network problem that disconnects a
> machine from the cluster.
>
> What we are trying to solve is an embarrassingly parallel problem, so
> even if a node (i.e., a worker node) disappears, we could still use
> the results from the rest of the nodes, and we could recover by
> reassigning the jobs of the lost node to the others.
>
> If we set up the communicator with MPI_ERRORS_RETURN and one of the
> machines running MPI node(s) dies, would the rest of the nodes in the
> communicator receive an error in MPICH2? Could we then 'handle' the
> error (i.e., create a new communicator, restart all nodes with a new
> machine list, and so on)?
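>
> For illustration, the manager-side handling we have in mind looks
> roughly like this (an untested sketch; nworkers, TAG_RESULT, alive[],
> next_job() and reassign() stand in for our job bookkeeping):
>
>     /* Poll workers; a failed receive marks the worker dead and its
>        job is handed to someone else. */
>     MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN);
>     for (int w = 1; w < nworkers; w++) {
>         int result;
>         int err = MPI_Recv(&result, 1, MPI_INT, w, TAG_RESULT,
>                            comm, MPI_STATUS_IGNORE);
>         if (err != MPI_SUCCESS) {
>             alive[w] = 0;           /* worker w is gone            */
>             reassign(next_job(w));  /* give its job to another one */
>         }
>     }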
>
> Regards,
>
> By Ben Wong
> Benedict.Wong at opg.com
> Currently at KG Kipling with no fixed office space
> (416) 231-4111 (x5417 to x5419, ask for Ben)
>
> Permanent office at
> 230 Westney Road South,
> Annandale, L1S 7R3
> (905)428-4000x5458
> (Fax) (905)619-5453
>
>
>
> -----Original Message-----
> ...
> ...
> ------------------------------
> Message: 7
> Date: Tue, 10 Mar 2009 22:09:48 -0500
> From: "Rajeev Thakur" <thakur at mcs.anl.gov>
> Subject: Re: [mpich-discuss] mpich-discuss Digest, Vol 6, Issue 12
> To: "'WONG Benedict -PICKERING'" <benedict.wong at opg.com>,
> <mpich-discuss at mcs.anl.gov>
> Message-ID: <B396A44D767E4734BC956E2B9DD66832 at thakurlaptop>
> Content-Type: text/plain; charset="US-ASCII"
>
> It supports MPI_ERRORS_RETURN, but not in the case of process failure.
>
> Rajeev
>
>
> > -----Original Message-----
> > From: WONG Benedict -PICKERING [mailto:benedict.wong at opg.com]
> > Sent: Tuesday, March 10, 2009 6:24 PM
> > To: mpich-discuss at mcs.anl.gov
> > Cc: thakur at mcs.anl.gov; WONG Benedict -PICKERING
> > Subject: RE: mpich-discuss Digest, Vol 6, Issue 12
> >
> > Rajeev,
> >
> > Thanks for your reply. Are you saying that the current
> > version (1.0.8) of MPICH2 doesn't support MPI_ERRORS_RETURN?
> >
> >
> > We are planning to 'recover' by checking the error status on each
> > MPI call. By 'recover', we mean saving the results from processes
> > that are still functioning, and then restarting all (or as many as
> > possible) processes in that communicator.
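> >
> > Concretely, the per-call check we have in mind is roughly this (a
> > sketch; checkpoint_results() is a hypothetical helper that saves the
> > partial results before we restart):
> >
> >     /* Wrap every MPI call; on failure, save what we have and exit
> >        so the whole communicator can be restarted. */
> >     #define CHECK_MPI(call)                                     \
> >         do {                                                    \
> >             int err_ = (call);                                  \
> >             if (err_ != MPI_SUCCESS) {                          \
> >                 checkpoint_results();  /* save partial work */  \
> >                 MPI_Abort(MPI_COMM_WORLD, err_);                \
> >             }                                                   \
> >         } while (0)
> >
> >     CHECK_MPI(MPI_Send(buf, n, MPI_DOUBLE, dest, tag, comm));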
> >
> > Regards,
> >
> > By Ben Wong
> > Benedict.Wong at opg.com
> > Currently at KG Kipling with no fixed office space
> > (416) 231-4111 (x5417 to x5419, ask for Ben)
> >
> > Permanent office at
> > 230 Westney Road South,
> > Annandale, L1S 7R3
> > (905)428-4000x5458
> > (Fax) (905)619-5453
> >
> >
> >
> > -----Original Message-----
> > From: mpich-discuss-bounces at mcs.anl.gov
> > [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> > mpich-discuss-request at mcs.anl.gov
> > Sent: Tuesday, March 10, 2009 1:00 PM
> > To: mpich-discuss at mcs.anl.gov
> > Subject: mpich-discuss Digest, Vol 6, Issue 12
> >
> > Today's Topics:
> >
> > 1. Re: Question on fault tolerance (WONG Benedict -PICKERING)
> > 2. Re: Question on fault tolerance (Rajeev Thakur)
> >
> >
> >
> > ----------------------------------------------------------------------
> >
> > Message: 1
> > Date: Mon, 9 Mar 2009 13:55:43 -0400
> > From: "WONG Benedict -PICKERING" <benedict.wong at opg.com>
> > Subject: Re: [mpich-discuss] Question on fault tolerance
> > To: "WONG Benedict -PICKERING" <benedict.wong at opg.com>,
> > <mpich-discuss at mcs.anl.gov>
> > Message-ID:
> > <06F6D8CC436B0345904D188806D4BE03568233 at CATOU-OGMAPUW02.corp.opg.com>
> > Content-Type: text/plain; charset="us-ascii"
> >
> >
> > To all,
> >
> > I kind of answered my own question: after doing a search on fault
> > tolerance in MPI, I found the following paper, which describes the
> > method we could use!
> >
> > Thanks!
> >
> >
> > Fault Tolerance in MPI Programs
> >
> > By William Gropp & Ewing Lusk
> >
> >
> > By Ben Wong
> > Benedict.Wong at opg.com
> > Currently at KG Kipling with no fixed office space
> > (416) 231-4111 (x5417 to x5419, ask for Ben)
> >
> > Permanent office at
> > 230 Westney Road South,
> > Annandale, L1S 7R3
> > (905)428-4000x5458
> > (Fax) (905)619-5453
> >
> >
> > > -----Original Message-----
> > > From: WONG Benedict -PICKERING
> > > Sent: Monday, March 09, 2009 11:10 AM
> > > To: 'mpich-discuss at mcs.anl.gov'
> > > Subject: Question on fault tolerance
> > >
> > > We are planning to use MPI (namely MPICH2) to implement our program.
> > > This will be used as a production system with ~30 machines, each
> > > with two Xeon L5430 2.66 GHz CPUs and 24 GB of RAM.
> > >
> > > We are designing our program so that it contains client instances,
> > > manager instances and worker instances. This is a basic
> > > client/server type of structure: clients start up whenever needed
> > > and talk to the server (manager instances), and the manager
> > > instances assign jobs to worker instances.
> > >
> > > We want to build fault tolerance into our program, so we will have
> > > a backup server (backup manager instance), and for worker
> > > instances, if any of them goes down, the system will just keep
> > > working with degraded performance.
> > >
> > > I read "Beowulf Cluster Computing with Linux" and the "Using MPI"
> > > two-volume set, so I understand most of the details in MPI.
> > >
> > > However, here are the questions I haven't found an answer to yet:
> > >
> > > After I set up a communicator with n processes, if one (or more) of
> > > them dies, can the rest of the processes continue to communicate
> > > and function? Will functions such as MPI_Bcast and MPI_Barrier halt
> > > all processes? How about plain MPI_Send and MPI_Recv: could they
> > > return some meaningful error so that the rest of the processes can
> > > continue?
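> > >
> > > (If an error is returned at all, presumably we could decode it
> > > along these lines; a sketch that assumes the call returns instead
> > > of hanging:)
> > >
> > >     int err = MPI_Bcast(buf, count, MPI_INT, root, comm);
> > >     if (err != MPI_SUCCESS) {
> > >         int eclass, len;
> > >         char msg[MPI_MAX_ERROR_STRING];
> > >         MPI_Error_class(err, &eclass);
> > >         MPI_Error_string(err, msg, &len);
> > >         fprintf(stderr, "MPI_Bcast failed (class %d): %s\n",
> > >                 eclass, msg);
> > >     }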
> > >
> > >
> > > Regards,
> > >
> > > By Ben Wong
> > > Benedict.Wong at opg.com
> > > Currently at KG Kipling with no fixed office space
> > > (416) 231-4111 (x5417 to x5419, ask for Ben)
> > >
> > > Permanent office at
> > > 230 Westney Road South,
> > > Annandale, L1S 7R3
> > > (905)428-4000x5458
> > > (Fax) (905)619-5453
> > >
> > >
> >
> >
> > ------------------------------
> >
> > Message: 2
> > Date: Mon, 9 Mar 2009 13:17:14 -0500
> > From: "Rajeev Thakur" <thakur at mcs.anl.gov>
> > Subject: Re: [mpich-discuss] Question on fault tolerance
> > To: <mpich-discuss at mcs.anl.gov>
> > Message-ID: <55E609AE364D44E49E845BF51DBB6C1D at mcs.anl.gov>
> > Content-Type: text/plain; charset="us-ascii"
> >
> > Ben,
> > Yes, that paper describes the scenarios well. However,
> > the current version of MPICH2 doesn't support that level of
> > fault tolerance yet, although we plan to do so in the near term.
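> >
> > One pattern that paper describes for master/worker codes is to give
> > each worker its own intercommunicator (e.g., via MPI_Comm_spawn), so
> > that a failure is confined to a single communicator. A rough sketch,
> > where NWORKERS and "./worker" are assumed names:
> >
> >     /* Manager: spawn each worker on its own intercommunicator. */
> >     MPI_Comm worker[NWORKERS];
> >     for (int i = 0; i < NWORKERS; i++) {
> >         MPI_Comm_spawn("./worker", MPI_ARGV_NULL, 1, MPI_INFO_NULL,
> >                        0, MPI_COMM_SELF, &worker[i],
> >                        MPI_ERRCODES_IGNORE);
> >         MPI_Comm_set_errhandler(worker[i], MPI_ERRORS_RETURN);
> >     }
> >     /* All traffic with worker i goes over worker[i]; if that worker
> >        dies, only calls on worker[i] are affected. */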
> >
> > Rajeev