[MPICH] How do I get the communicator of the spawned group in the spawnee?

David Minor david-m at orbotech.com
Tue Jul 5 00:05:03 CDT 2005


Hello Rajeev,
    Interestingly, if I send signal 9 (kill -9) to any one of the children
after calling disconnect on the comworld of the parent, all the children die
gracefully, the parent remains alive, and I can restart the children. I tried
this many times and didn't see any problems, so it seems the infrastructure
is already in place to handle this kind of thing. Is MPICH an open source
project? That is, if I changed the code to be "fault tolerant" in this
situation, would you consider adding the changes to the code base? Is there a
way to allow this kind of behavior within the bounds of the "official" MPI 2.0
standard? The two behaviors I "need" are:
 
1) The ability to kill and restart all the children without affecting the
parent (in case a child goes into a near-infinite loop on an algorithm).
2) If one child dies, all the children die, without affecting the parent.
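
As a rough illustration of these two behaviors, here is a minimal sketch that
uses plain OS processes through Python's standard subprocess module instead of
MPI, so it says nothing about how MPICH itself reacts to a dying child. The
helper names start_children, kill_all, and supervise, the polling interval,
and the timeout used to detect a runaway child are all assumptions made for
this example.

```python
import subprocess
import sys
import time


def start_children(n, code):
    """Start n child interpreters that each run the given Python source."""
    return [subprocess.Popen([sys.executable, "-c", code]) for _ in range(n)]


def kill_all(children):
    """Kill every child that is still running, then reap all of them."""
    for child in children:
        if child.poll() is None:
            child.kill()
    for child in children:
        child.wait()


def supervise(children, timeout, poll_interval=0.05):
    """Watch a group of children from the parent.

    Returns "ok" if every child exits with status 0, "failed" if any child
    dies with a non-zero status (the rest are then killed), or "timeout" if
    the deadline passes first (all children are then killed). The parent
    keeps running in every case.
    """
    deadline = time.monotonic() + timeout
    while True:
        codes = [child.poll() for child in children]
        if any(code not in (None, 0) for code in codes):
            kill_all(children)  # behavior 2: one death takes down the group
            return "failed"
        if all(code == 0 for code in codes):
            return "ok"
        if time.monotonic() > deadline:
            kill_all(children)  # behavior 1: stop runaway children
            return "timeout"
        time.sleep(poll_interval)


if __name__ == "__main__":
    # Hypothetical runaway user code: loops until it is killed.
    looping = "import time\nwhile True: time.sleep(0.1)"
    group = start_children(3, looping)
    print(supervise(group, timeout=1.0))

    # The parent is still alive and can start a fresh group.
    group = start_children(3, "pass")
    print(supervise(group, timeout=10.0))
```

Polling keeps the sketch portable: Popen.kill() sends SIGKILL on POSIX and
calls TerminateProcess on Windows, so the same supervisor logic should apply
on both. In an MPI setting the children would instead be started with
MPI_Comm_spawn, and whether the parent survives depends on the implementation.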
 
Since our application runs user code that is not under our control, these are
"essential" features for us. Unfortunately, Windows compatibility is another
"essential" feature, so we are somewhat limited in our choice of MPI
implementations.
 
Regards,
David

-----Original Message-----
From: Rajeev Thakur [mailto:thakur at mcs.anl.gov]
Sent: Monday, July 04, 2005 6:58 PM
To: David Minor
Subject: RE: [MPICH] How do I get the communicator of the spawned group in
the spawnee?


David,
         The communicator passed to MPI_Abort must be a valid communicator
on the process calling MPI_Abort. Therefore, you cannot abort only the
child. However, a child could die on its own, and one would like this case
to be handled gracefully, without taking down the parent. This is up to the
implementation to handle. A "fault tolerant" implementation will try to do
this. MPICH2 doesn't support it yet, but we hope to do it sometime in the
future.
 
Rajeev
 


  _____  

From: David Minor [mailto:david-m at orbotech.com] 
Sent: Monday, July 04, 2005 12:11 AM
To: 'Rajeev Thakur'
Subject: RE: [MPICH] How do I get the communicator of the spawned group in
the spawnee?


Hello Rajeev,
Using the intercommunicator I can communicate with the spawned processes,
but I cannot call an abort on them without aborting the parent. I would like
the spawned processes to be able to crash, or be aborted, without
crashing the parent process, which could then spawn them again. I thought
that if the parent process could get a communicator only to the spawned
processes I'd be able to do this.
David

-----Original Message-----
From: Rajeev Thakur [mailto:thakur at mcs.anl.gov]
Sent: Sunday, July 03, 2005 6:48 PM
To: David Minor; mpich-discuss at mcs.anl.gov
Subject: RE: [MPICH] How do I get the communicator of the spawned group in
the spawnee?


The intercommunicator returned by MPI_Comm_spawn is the one you are looking
for.
 
MPI_Comm_get_parent on the spawned processes returns an intercommunicator
that has the spawned processes in one group and the parent processes in the
other group. MPI_Comm_spawn on the spawning (parent) processes returns the
same intercommunicator, which can be used for communication with the spawned
processes.
 
Rajeev
 


  _____  

From: owner-mpich-discuss at mcs.anl.gov
[mailto:owner-mpich-discuss at mcs.anl.gov] On Behalf Of David Minor
Sent: Sunday, July 03, 2005 9:15 AM
To: mpich-discuss at mcs.anl.gov
Subject: [MPICH] How do I get the communicator of the spawned group in the
spawnee?



Hello List, 

intercomm.Get_parent() called from the spawned processes returns the
communicator of the spawner, but how do I get the communicator of the
spawned processes from the spawner? intercomm.Get_remote_group() returns
the group, but how do I get the communicator?

Thanks, 
David Minor 
