[mpich-discuss] mpich-discuss Digest, Vol 6, Issue 12
WONG Benedict -PICKERING
benedict.wong at opg.com
Tue Mar 10 18:24:26 CDT 2009
Rajeev,
Thanks for your reply. Are you saying that the current version (1.0.8)
of MPICH2 doesn't support MPI_ERRORS_RETURN?
We are planning to 'recover' the process by checking the error status on
each MPI call. By 'recover', we mean saving the results from the
processes that are still functioning and then restarting all (or as many
as possible) of the processes in that communicator.
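
As a rough illustration of that recovery idea, here is a toy sketch in plain
Python. It is not MPI code and does not call MPICH2: the Worker class, the
manage function and the OK/FAILED codes are all hypothetical names made up
for this example. It only shows the bookkeeping, assuming every call hands
back a status that the manager checks: keep the results from workers that
still function, drop a failed worker, and requeue its job.

```python
from collections import deque

# Hypothetical status codes standing in for a per-call error status.
# These are NOT MPI constants.
OK, FAILED = 0, 1


class Worker:
    """Toy worker (hypothetical). `fail_on` is the job on which it dies."""

    def __init__(self, name, fail_on=None):
        self.name = name
        self.fail_on = fail_on
        self.alive = True

    def run(self, job):
        # Return (status, result) instead of raising, so the caller
        # checks the status after every call.
        if not self.alive or job == self.fail_on:
            self.alive = False
            return FAILED, None
        return OK, job * job


def manage(jobs, workers):
    """Hand out jobs round-robin; keep good results, requeue failed jobs."""
    pending = deque(jobs)
    live = list(workers)
    results = {}
    while pending and live:
        job = pending.popleft()
        worker = live[0]
        status, value = worker.run(job)
        if status == OK:
            results[job] = value       # save work from a functioning worker
            live.append(live.pop(0))   # rotate to the next worker
        else:
            live.pop(0)                # stop using the failed worker
            pending.append(job)        # retry the job on a survivor
    return results, [w.name for w in live], list(pending)


results, survivors, unfinished = manage(
    [1, 2, 3, 4],
    [Worker("w0"), Worker("w1", fail_on=2), Worker("w2")],
)
print(results, survivors, unfinished)
```

In this run the second worker fails on job 2, the job is retried on a
surviving worker, and the system finishes with degraded capacity. Whether a
real MPI library reports such failures as return codes is exactly the
question above, so this sketch assumes it rather than shows it.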
Regards,
By Ben Wong
Benedict.Wong at opg.com
Currently at KG Kipling with no fixed office space
(416)231-4111(x5417 to x5419 and ask for Ben)
Permanent office at
230 Westney Road South,
Annandale, L1S 7R3
(905)428-4000x5458
(Fax) (905)619-5453
-----Original Message-----
From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of
mpich-discuss-request at mcs.anl.gov
Sent: Tuesday, March 10, 2009 1:00 PM
To: mpich-discuss at mcs.anl.gov
Subject: mpich-discuss Digest, Vol 6, Issue 12
Today's Topics:
1. Re: Question on fault tolerance (WONG Benedict -PICKERING)
2. Re: Question on fault tolerance (Rajeev Thakur)
----------------------------------------------------------------------
Message: 1
Date: Mon, 9 Mar 2009 13:55:43 -0400
From: "WONG Benedict -PICKERING" <benedict.wong at opg.com>
Subject: Re: [mpich-discuss] Question on fault tolerance
To: "WONG Benedict -PICKERING" <benedict.wong at opg.com>,
<mpich-discuss at mcs.anl.gov>
Message-ID:
<06F6D8CC436B0345904D188806D4BE03568233 at CATOU-OGMAPUW02.corp.opg.com>
Content-Type: text/plain; charset="us-ascii"
To all,
I am more or less answering my own question: after searching for fault
tolerance (FT) in MPI, I found the following paper, which describes the
method we could use!
Thanks!
Fault Tolerance in MPI Programs
By William Gropp & Ewing Lusk
By Ben Wong
> -----Original Message-----
> From: WONG Benedict -PICKERING
> Sent: Monday, March 09, 2009 11:10 AM
> To: 'mpich-discuss at mcs.anl.gov'
> Subject: Question on fault tolerance
>
> We are planning to use MPI (namely MPICH2) to implement our program.
> It will be used as a production system with ~30 machines, each with 2
> Xeon L5430 2.66 GHz CPUs and 24 GB of RAM.
>
> We are designing our program so that it contains client instances,
> manager instances and worker instances. This is a basic client/server
> structure: clients start up whenever needed and talk to the server
> (manager instances), and the manager instances assign jobs to worker
> instances.
>
> We want to build fault tolerance into our program, so we will have a
> backup server (backup manager instance); as for the worker instances, if
> any of them goes down, the system will simply keep working with degraded
> performance.
>
> I have read "Beowulf Cluster Computing with Linux" and the "Using MPI"
> two-volume set, so I understand most of the details of MPI.
>
> However, here are the questions I haven't found answers to yet:
>
> After I set up a communicator with n processes, if one (or more) of them
> dies, can the rest of the processes continue to communicate and
> function? Will functions such as MPI_Bcast and MPI_Barrier halt
> all processes? How about the simple MPI_Send and MPI_Recv: could they
> return some meaningful error so that the rest of the processes can
> continue?
>
>
> Regards,
>
> By Ben Wong
-----------------------------------------
THIS MESSAGE IS ONLY INTENDED FOR THE USE OF THE INTENDED
RECIPIENT(S) AND MAY CONTAIN INFORMATION THAT IS PRIVILEGED, PROPRIETARY
AND/OR CONFIDENTIAL. If you are not the intended recipient, you are
hereby notified that any review, retransmission, dissemination,
distribution, copying, conversion to hard copy or other use of this
communication is strictly prohibited. If you are not the intended
recipient and have received this message in error, please notify me by
return e-mail and delete this message from your system. Ontario Power
Generation Inc.
------------------------------
Message: 2
Date: Mon, 9 Mar 2009 13:17:14 -0500
From: "Rajeev Thakur" <thakur at mcs.anl.gov>
Subject: Re: [mpich-discuss] Question on fault tolerance
To: <mpich-discuss at mcs.anl.gov>
Message-ID: <55E609AE364D44E49E845BF51DBB6C1D at mcs.anl.gov>
Content-Type: text/plain; charset="us-ascii"
Ben,
Yes, that paper describes the scenarios well. However, the current
version of MPICH2 doesn't support that level of fault tolerance yet,
although we plan to do so in the near term.
Rajeev
------------------------------
_______________________________________________
mpich-discuss mailing list
mpich-discuss at mcs.anl.gov
https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
End of mpich-discuss Digest, Vol 6, Issue 12
********************************************