[mpich-discuss] Question on fault tolerance

WONG Benedict -PICKERING benedict.wong at opg.com
Mon Mar 9 10:09:33 CDT 2009


We are planning to use MPI (namely MPICH2) to implement our program.
It will run as a production system on ~30 machines, each with two
Xeon L5430 2.66 GHz CPUs and 24 GB of RAM.

We are designing our program around client instances, manager
instances, and worker instances.  It is a basic client/server
structure: clients start up whenever they are needed and talk to the
server (a manager instance), and the manager instance assigns jobs to
the worker instances.

We want to build fault tolerance into our program, so we will have a
backup server (a backup manager instance).  As for the worker
instances, if any of them goes down, the system should simply keep
working with degraded performance.
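
To make the degraded-performance idea concrete, here is a rough sketch
of the bookkeeping I imagine the manager instance keeping.  It is plain
C with no MPI calls, and every name in it (worker_pool, pool_init,
pool_mark_dead, pool_alive_count, pool_next_worker) is made up for
illustration; how the manager would actually learn that a worker is
down is not covered here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

#define MAX_WORKERS 64

/* Hypothetical bookkeeping kept by the manager instance. */
typedef struct {
    size_t n_workers;           /* workers the system started with */
    bool   alive[MAX_WORKERS];  /* false once a worker is known to be down */
    size_t next;                /* round-robin cursor */
} worker_pool;

/* Start with every worker marked as alive. */
void pool_init(worker_pool *p, size_t n_workers)
{
    assert(n_workers > 0 && n_workers <= MAX_WORKERS);
    p->n_workers = n_workers;
    p->next = 0;
    for (size_t i = 0; i < n_workers; i++)
        p->alive[i] = true;
}

/* Record that a worker has gone down; it gets no further jobs. */
void pool_mark_dead(worker_pool *p, size_t worker)
{
    assert(worker < p->n_workers);
    p->alive[worker] = false;
}

/* Number of workers still available (the remaining capacity). */
size_t pool_alive_count(const worker_pool *p)
{
    size_t n = 0;
    for (size_t i = 0; i < p->n_workers; i++)
        if (p->alive[i])
            n++;
    return n;
}

/* Pick the next live worker in round-robin order, skipping dead ones.
 * Returns the worker index, or -1 if no worker is left. */
int pool_next_worker(worker_pool *p)
{
    for (size_t tried = 0; tried < p->n_workers; tried++) {
        size_t w = p->next;
        p->next = (p->next + 1) % p->n_workers;
        if (p->alive[w])
            return (int)w;
    }
    return -1;
}
```

The manager would call pool_next_worker() for each job and
pool_mark_dead() whenever it decides a worker is gone, so jobs keep
flowing to the remaining workers at reduced throughput.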

I have read "Beowulf Cluster Computing with Linux" and the two-volume
"Using MPI" set, so I understand more of the details of MPI.

However, here are the questions that I have not found answers to yet:

After I set up a communicator with n processes, if one (or more) of
them dies, can the rest of the processes continue to communicate and
function?  Will collective functions such as MPI_Bcast and MPI_Barrier
halt all processes?  And what about simple MPI_Send and MPI_Recv calls:
could they return a meaningful error so that the rest of the processes
can continue?
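
To make the last question concrete, this is roughly the error handling
I would like to rely on.  It is only a sketch: it switches the
communicator from the default MPI_ERRORS_ARE_FATAL handler to
MPI_ERRORS_RETURN and checks the return code of each call.  As far as I
understand the MPI standard, it does not guarantee that a program can
continue after an error has been returned, so whether the checks below
are ever reached when a peer has died is exactly what I am asking.  The
payload value and the message tag are arbitrary.

```c
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
    int rank, size, rc, len;
    int payload = 42;                 /* arbitrary example data */
    char msg[MPI_MAX_ERROR_STRING];

    MPI_Init(&argc, &argv);

    /* Ask MPI to return error codes instead of aborting the whole job. */
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);

    MPI_Comm_rank(MPI_COMM_WORLD, &rank);
    MPI_Comm_size(MPI_COMM_WORLD, &size);

    if (rank == 0) {
        /* Manager: send one message to every worker and check each call. */
        for (int dest = 1; dest < size; dest++) {
            rc = MPI_Send(&payload, 1, MPI_INT, dest, 0, MPI_COMM_WORLD);
            if (rc != MPI_SUCCESS) {
                MPI_Error_string(rc, msg, &len);
                fprintf(stderr, "send to rank %d failed: %s\n", dest, msg);
                /* Here the manager would stop assigning jobs to dest. */
            }
        }
    } else {
        /* Worker: receive the message from the manager. */
        rc = MPI_Recv(&payload, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                      MPI_STATUS_IGNORE);
        if (rc != MPI_SUCCESS) {
            MPI_Error_string(rc, msg, &len);
            fprintf(stderr, "rank %d: receive failed: %s\n", rank, msg);
        }
    }

    MPI_Finalize();
    return 0;
}
```

I would build this with mpicc and start it with several processes, then
kill one of the workers to see what the others report.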


Regards,

By Ben Wong
Benedict.Wong at opg.com
Currently at KG Kipling with no fixed office space
(416)231-4111(x5417 to x5419 and ask for Ben)

Permanent office at 
230 Westney Road South, 
Annandale, L1S 7R3
(905)428-4000x5458
(Fax) (905)619-5453



