[mpich-discuss] failed processes

Jayesh Krishna jayesh at mcs.anl.gov
Thu Nov 4 11:05:57 CDT 2010


SMPD (the only process manager available on Windows right now) does not support disabling auto-cleanup of processes. You can track progress on this feature at https://trac.mcs.anl.gov/projects/mpich2/ticket/1132 .
As an alternative, you can try installing MPICH2 (configure/make/make install) on Cygwin (when configuring MPICH2, pass "--disable-auto-cleanup" as per Darius's suggestion).

-Jayesh
----- Original Message -----
From: Darius Buntinas <buntinas at mcs.anl.gov>
To: mpich-discuss at mcs.anl.gov
Sent: Thu, 04 Nov 2010 10:34:18 -0500 (CDT)
Subject: Re: [mpich-discuss] failed processes


To set the error handler, do this after MPI_Init:
    MPI_Comm_set_errhandler(MPI_COMM_WORLD, MPI_ERRORS_RETURN);
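This sets the error handler for MPI_COMM_WORLD and for any communicators subsequently created from it. MPI calls on those communicators then return an error code instead of aborting, so you need to check their return values. You can read more about error handling in Section 8.3 of the MPI-2.2 standard: http://www.mpi-forum.org/docs/docs.html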

Jayesh:  Do you know how to disable the "auto-cleanup" feature in smpd?

-d

On Nov 4, 2010, at 1:34 AM, Harun Raşit ER wrote:

> Darius, thanks for your help. I am on Windows and new to MPI, so I don't know how to pass "-disable-auto-cleanup" to mpiexec. How can I do that? Could you explain it and send a simple code sample that sets MPI_ERRORS_RETURN?
> 
> On Wed, Nov 3, 2010 at 6:29 PM, Darius Buntinas <buntinas at mcs.anl.gov> wrote:
> 
> Hi Harun,
> 
> If you use MPICH2 1.3, and pass the -disable-auto-cleanup parameter to mpiexec, then your app will not automatically be killed when a process dies before calling MPI_Finalize.  You'll then need to set the default error handler in MPI to MPI_ERRORS_RETURN, so that the application won't abort when an error is detected.
> 
> The MPICH2 library should allow you to continue communicating with other processes if a process dies.  However, collective operations on a communicator that includes a dead process will most likely hang some processes.
> 
> I hope this helps.
> 
> -d
> 
> On Nov 3, 2010, at 4:17 AM, Harun Raşit ER wrote:
> 
> > When one of the processes fails, my whole job is aborted. There must be a solution that I cannot find! I would like to continue without the failed process and finish the job with the remaining processes. Is there any idea or solution?
> >
> > Thanks for your help.
> > _______________________________________________
> > mpich-discuss mailing list
> > mpich-discuss at mcs.anl.gov
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 



