[MPICH] How do I get the communicator of the spawned group in the spawnee?
Rajeev Thakur
thakur at mcs.anl.gov
Tue Jul 5 11:44:48 CDT 2005
David,
> Is MPICH an open source project? I mean if I changed the code
> to be "fault tolerant" in this situation would you consider
> adding the changes to the code base? Is there a way to allow
> this kind of behavior within the bounds of the "official" 2.0
> standard? The two behaviors I "need" are:
> -----Original Message-----
> From: owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of David Minor
> Sent: Tuesday, July 05, 2005 12:05 AM
> To: 'mpich-discuss at mcs.anl.gov'
> Subject: RE: [MPICH] How do I get the communicator of the
> spawned group in the spawnee?
>
> Hello Rajeev,
> Interestingly, if I send signal 9 (kill -9) to any one of
> the children after calling disconnect on the comworld of the
> parent, all the children die gracefully, the parent remains
> alive, and I can restart the children. I tried this many times
> and didn't see any problems, so it seems the infrastructure
> is in place to handle this kind of thing. Is
> MPICH an open source project? I mean if I changed the code
> to be "fault tolerant" in this situation would you consider
> adding the changes to the code base? Is there a way to allow
> this kind of behavior within the bounds of the "official" 2.0
> standard? The two behaviors I "need" are:
>
> 1) Ability to kill and restart all the children without
> affecting the parent (this is in case a child goes into a
> near infinite loop on an algorithm).
> 2) That if one child dies all the children will die without
> affecting the parent.
>
> Since our application runs user code not under our control,
> these are "essential" features for us. Unfortunately, Windows
> compatibility is another "essential" feature, so we are
> somewhat limited in our choice of MPI implementations.
>
> Regards,
> David
>
> -----Original Message-----
> From: Rajeev Thakur [mailto:thakur at mcs.anl.gov]
> Sent: Monday, July 04, 2005 6:58 PM
> To: David Minor
> Subject: RE: [MPICH] How do I get the communicator of
> the spawned group in the spawnee?
>
>
> David,
> The communicator passed to MPI_Abort must be a
> valid communicator on the process calling MPI_Abort.
> Therefore, you cannot abort only the child. However, a child
> could die on its own, and one would like this case to be
> handled gracefully, without taking down the parent. This is
> up to the implementation to handle. A "fault tolerant"
> implementation will try to do this. MPICH2 doesn't support it
> yet, but we hope to do it sometime in the future.
>
> Rajeev
>
>
>
> ________________________________
>
> From: David Minor [mailto:david-m at orbotech.com]
> Sent: Monday, July 04, 2005 12:11 AM
> To: 'Rajeev Thakur'
> Subject: RE: [MPICH] How do I get the
> communicator of the spawned group in the spawnee?
>
>
> Hello Rajeev,
> Using the intercommunicator I can communicate
> with the spawned processes, but I cannot call an abort on
> them without aborting the parent. I would like the
> spawned processes to be able to crash, or be aborted, without
> crashing the parent process, which could then spawn them
> again. I thought that if the parent process could get a
> communicator only to the spawned processes I'd be able to do this.
> David
>
> -----Original Message-----
> From: Rajeev Thakur [mailto:thakur at mcs.anl.gov]
> Sent: Sunday, July 03, 2005 6:48 PM
> To: David Minor; mpich-discuss at mcs.anl.gov
> Subject: RE: [MPICH] How do I get the
> communicator of the spawned group in the spawnee?
>
>
> The intercommunicator returned by
> MPI_Comm_spawn is the one you are looking for.
>
> MPI_Comm_get_parent on the spawned
> processes returns an intercommunicator that has the spawned
> processes in one group and the parent processes in the other
> group. MPI_Comm_spawn on the parent processes returns the same
> intercommunicator, which can be used for communication with
> the spawned processes.
>
> Rajeev
>
>
>
> ________________________________
>
> From:
> owner-mpich-discuss at mcs.anl.gov
> [mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of David Minor
> Sent: Sunday, July 03, 2005 9:15 AM
> To: mpich-discuss at mcs.anl.gov
> Subject: [MPICH] How do I get
> the communicator of the spawned group in the spawnee?
>
>
>
> Hello List,
>
> intercomm.Get_parent() from the
> spawned processes returns me the communicator to the spawner,
> but how do I get the communicator of the spawned processes
> from the spawner? intercomm.Get_remote_group() returns me the
> group, but how do I get the communicator?
>
> Thanks,
> David Minor
>
>