[mpich-discuss] mpich-discuss Digest, Vol 6, Issue 12
Rajeev Thakur
thakur at mcs.anl.gov
Tue Mar 10 22:09:48 CDT 2009
MPICH2 supports MPI_ERRORS_RETURN, but not in the case of process failure.
Rajeev
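
The plan discussed in this thread (set MPI_ERRORS_RETURN, check the status of
every call, and keep going with whichever workers still respond) can be
sketched independently of MPI. The following is only a minimal sketch, not
anything taken from MPICH2: worker_pool, pool_dispatch, send_fn, and
fail_one_send are hypothetical names, and the transport is a callback that
returns 0 on success. In a real program that callback would wrap MPI_Send or
MPI_Recv after MPI_Comm_set_errhandler(comm, MPI_ERRORS_RETURN) and report
whether the call returned MPI_SUCCESS. Whether an error is actually returned
when a peer process dies depends on the MPI implementation.

```c
#include <assert.h>

#define MAX_WORKERS 64

/* Hypothetical transport hook: returns 0 on success, nonzero on failure.
 * In an MPI program it would wrap MPI_Send/MPI_Recv and map the return
 * code (MPI_SUCCESS or not) onto this convention. */
typedef int (*send_fn)(int worker, int job, void *ctx);

typedef struct {
    int n_workers;
    int alive[MAX_WORKERS]; /* 1 while the worker is still considered usable */
    int next;               /* round-robin cursor */
} worker_pool;

void pool_init(worker_pool *p, int n_workers) {
    assert(n_workers > 0 && n_workers <= MAX_WORKERS);
    p->n_workers = n_workers;
    p->next = 0;
    for (int i = 0; i < n_workers; i++) p->alive[i] = 1;
}

int pool_alive_count(const worker_pool *p) {
    int n = 0;
    for (int i = 0; i < p->n_workers; i++) n += p->alive[i];
    return n;
}

/* Hand `job` to the next usable worker in round-robin order. A failed send
 * marks that worker as dead and the job is retried on the following one.
 * Returns the index of the worker that accepted the job, or -1 if no
 * usable worker is left. */
int pool_dispatch(worker_pool *p, int job, send_fn send, void *ctx) {
    for (int tried = 0; tried < p->n_workers; tried++) {
        int w = p->next;
        p->next = (p->next + 1) % p->n_workers;
        if (!p->alive[w]) continue;
        if (send(w, job, ctx) == 0) return w;
        p->alive[w] = 0; /* degraded mode: stop using this worker */
    }
    return -1;
}

/* Stand-in transport for trying the pool without MPI: fails only for the
 * worker whose index is stored in *ctx. */
int fail_one_send(int worker, int job, void *ctx) {
    (void)job;
    return worker == *(int *)ctx ? 1 : 0;
}
```

With three workers and fail_one_send configured to fail for worker 1, the
first job goes to worker 0, the second is attempted on worker 1, fails, and
lands on worker 2, and later jobs skip worker 1 entirely. Restarting failed
workers, as proposed in the quoted message, is left out of this sketch.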
> -----Original Message-----
> From: WONG Benedict -PICKERING [mailto:benedict.wong at opg.com]
> Sent: Tuesday, March 10, 2009 6:24 PM
> To: mpich-discuss at mcs.anl.gov
> Cc: thakur at mcs.anl.gov; WONG Benedict -PICKERING
> Subject: RE: mpich-discuss Digest, Vol 6, Issue 12
>
> Rajeev,
>
> Thanks for your reply. Are you saying that the current
> version (1.0.8) of MPICH2 doesn't support MPI_ERRORS_RETURN?
>
> We are planning to 'recover' by checking the error status
> of each MPI call. By 'recover', we mean saving the results
> from the processes that are still functioning and then
> restarting all (or as many as possible) of the processes in
> that communicator.
>
> Regards,
>
> By Ben Wong
>
> -----Original Message-----
> From: mpich-discuss-bounces at mcs.anl.gov
> [mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
> mpich-discuss-request at mcs.anl.gov
> Sent: Tuesday, March 10, 2009 1:00 PM
> To: mpich-discuss at mcs.anl.gov
> Subject: mpich-discuss Digest, Vol 6, Issue 12
>
>
> Today's Topics:
>
> 1. Re: Question on fault tolerance (WONG Benedict -PICKERING)
> 2. Re: Question on fault tolerance (Rajeev Thakur)
>
>
> ----------------------------------------------------------------------
>
> Message: 1
> Date: Mon, 9 Mar 2009 13:55:43 -0400
> From: "WONG Benedict -PICKERING" <benedict.wong at opg.com>
> Subject: Re: [mpich-discuss] Question on fault tolerance
> To: "WONG Benedict -PICKERING" <benedict.wong at opg.com>,
> <mpich-discuss at mcs.anl.gov>
> Message-ID:
>
> <06F6D8CC436B0345904D188806D4BE03568233 at CATOU-OGMAPUW02.corp.opg.com>
> Content-Type: text/plain; charset="us-ascii"
>
>
> To all,
>
> I am kind of answering my own question: after searching for
> fault tolerance (FT) in MPI, I found the following paper,
> which describes the method we could use!
>
> Thanks!
>
> "Fault Tolerance in MPI Programs"
> by William Gropp & Ewing Lusk
>
> By Ben Wong
>
> > -----Original Message-----
> > From: WONG Benedict -PICKERING
> > Sent: Monday, March 09, 2009 11:10 AM
> > To: 'mpich-discuss at mcs.anl.gov'
> > Subject: Question on fault tolerance
> >
> > We are planning to use MPI (namely MPICH2) to implement our program.
> > It will be used as a production system with ~30 machines, each with
> > two Xeon L5430 2.66 GHz CPUs and 24 GB of RAM.
> >
> > We are designing our program so that it contains client instances,
> > manager instances, and worker instances. It is a basic client-server
> > structure: clients start up whenever needed and talk to the server
> > (the manager instances), and a manager instance assigns jobs to
> > worker instances.
> >
> > We want to build fault tolerance into our program, so we will have a
> > backup server (a backup manager instance). As for the worker
> > instances, if any of them goes down, the system will just keep
> > working with degraded performance.
> >
> > I have read "Beowulf Cluster Computing with Linux" and the "Using MPI"
> > two-volume set, so I understand most of the details of MPI.
> >
> > However, here are the questions I haven't found an answer to yet:
> >
> > After I set up a communicator with n processes, if one (or more) of
> > them dies, can the rest of the processes continue to communicate and
> > function? Will functions such as MPI_Bcast and MPI_Barrier halt all
> > processes? How about the simple MPI_Send and MPI_Recv: could they
> > return some meaningful error so that the rest of the processes can
> > continue?
> >
> >
> > Regards,
> >
> > By Ben Wong
>
> ------------------------------
>
> Message: 2
> Date: Mon, 9 Mar 2009 13:17:14 -0500
> From: "Rajeev Thakur" <thakur at mcs.anl.gov>
> Subject: Re: [mpich-discuss] Question on fault tolerance
> To: <mpich-discuss at mcs.anl.gov>
> Message-ID: <55E609AE364D44E49E845BF51DBB6C1D at mcs.anl.gov>
> Content-Type: text/plain; charset="us-ascii"
>
> Ben,
> Yes, that paper describes the scenarios well. However,
> the current version of MPICH2 doesn't support that level of
> fault tolerance yet, although we plan to do so in the near term.
>
> Rajeev
>
>
> ------------------------------
>
> End of mpich-discuss Digest, Vol 6, Issue 12
> ********************************************
>
>