[mpich-discuss] If one process of Cluster crashes

Jayesh Krishna jayesh at mcs.anl.gov
Tue Oct 13 11:37:15 CDT 2009


Hi,
 Did you try using the MPI error handlers (MPI_Comm_create_errhandler() /
MPI_ERRORS_RETURN)?
 
Regards,
Jayesh
 
  _____  

From: abhishek pandey [mailto:hipandey at gmail.com] 
Sent: Tuesday, October 13, 2009 11:02 AM
To: Jayesh Krishna
Subject: Re: [mpich-discuss] If one process of Cluster crashes


Hi Jayesh,

Thanks for the reply.

This is an application/network error. I am running several instances of my
application on different machines for a very long time, so it is possible
that one process crashes or that a machine loses network connectivity.
At present, the whole cluster goes down when that happens. I want the other
processes to keep running regardless of the failure of one or more
processes.

Is there any way I can handle this situation?

Thanks,
Abhishek


On Tue, Oct 13, 2009 at 8:20 PM, Jayesh Krishna <jayesh at mcs.anl.gov>
wrote:


Hi,
 We are currently working on adding fault tolerance to MPICH2, so in a
couple of months we might have something that you can work with.
 On a side note, what kind of process crash do you see? Is this an
application error (which you should fix anyway)? Is it due to an internal
MPICH2 error? Please provide us with more details.
 
Regards,
Jayesh

  _____  

From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of abhishek pandey
Sent: Tuesday, October 13, 2009 7:23 AM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] If one process of Cluster crashes


Hi,

I am using MPICH2 on Windows, and sometimes one process in the cluster
crashes. Is there any way to handle this? I do not want to start the
cluster all over again.
As far as I know, if one process of the cluster goes down for any reason,
the whole cluster goes down as well.


Thanks,
Abhishek.


