[mpich-discuss] If one process of Cluster crashes

Jayesh Krishna jayesh at mcs.anl.gov
Tue Oct 13 15:06:29 CDT 2009


Hi,
 You can spawn processes dynamically with MPI using MPI_Comm_spawn() (is
that what you mean?). How many processes are you trying to launch on your
cluster? The number that can be launched depends on how many the OS can
handle, and performance may suffer if you launch too many processes on a
single node.
 
Regards,
Jayesh

  _____  

From: abhishek pandey [mailto:hipandey at gmail.com] 
Sent: Tuesday, October 13, 2009 12:42 PM
To: Jayesh Krishna
Subject: Re: [mpich-discuss] If one process of Cluster crashes


Hi Jayesh,

I haven't, I'll try it. Thanks.

BTW, can processes be dynamically added to or removed from a cluster? Is
there an upper limit on the number of processes in a cluster on Windows?

Thanks,
Abhishek


On Wed, Oct 14, 2009 at 2:37 AM, Jayesh Krishna <jayesh at mcs.anl.gov>
wrote:


Hi,
 Did you try using the MPI error handlers (MPI_Comm_create_errhandler() /
MPI_ERRORS_RETURN)?
 
Regards,
Jayesh
 
  _____  

From: abhishek pandey [mailto:hipandey at gmail.com] 
Sent: Tuesday, October 13, 2009 11:02 AM
To: Jayesh Krishna
Subject: Re: [mpich-discuss] If one process of Cluster crashes


Hi Jayesh,

Thanks for the reply.

This is an application/network error. I run several instances of my
application on different machines for a very long time, so one process may
crash or network connectivity to a machine may be lost. When that happens,
the whole cluster currently goes down. I want the remaining processes to
keep running regardless of one or more process failures.

Is there any way I can handle this situation?

Thanks,
Abhishek


On Tue, Oct 13, 2009 at 8:20 PM, Jayesh Krishna <jayesh at mcs.anl.gov>
wrote:


Hi,
 We are currently working on adding fault tolerance to MPICH2, so in a
couple of months we might have something that you can work with.
 On a side note, what kind of process crash do you see? Is this an
application error (which you should fix anyway)? Is it due to an internal
MPICH2 error? Please give us more details.
 
Regards,
Jayesh

  _____  

From: mpich-discuss-bounces at mcs.anl.gov
[mailto:mpich-discuss-bounces at mcs.anl.gov] On Behalf Of abhishek pandey
Sent: Tuesday, October 13, 2009 7:23 AM
To: mpich-discuss at mcs.anl.gov
Subject: [mpich-discuss] If one process of Cluster crashes


Hi,

I am using MPICH2 on Windows, and sometimes one process in the cluster
crashes. Is there any way to handle this? I do not want to start the
cluster all over again.
As far as I know, if one process of the cluster goes down for any reason,
the whole cluster goes down too.


Thanks,
Abhishek.



