[mpich-discuss] Detecting process exit

Steve Krueger Steve.Krueger at sas.com
Sat Mar 31 15:14:21 CDT 2012


I added your changes to my 1.4.1p1 version. They help: the case where
the ranks are on the same machine now works, but the case where the
ranks are on different machines still does not. Does that work
for you? If so, I might have missed one of your changes, or something
else in 1.4.1p1 is not compatible with your changes.

sk

> -----Original Message-----
> From: Pavan Balaji [mailto:balaji at mcs.anl.gov]
> Sent: Friday, March 30, 2012 10:18 PM
> To: mpich-discuss at mcs.anl.gov
> Cc: Steve Krueger
> Subject: Re: [mpich-discuss] Detecting process exit
> 
> Hi Steve,
> 
> Thanks for reporting this issue.
> 
> The root cause of the problem is that mpiexec is only launching the
> shell script and hence only knows about that process, not other
> processes launched by this shell script.  When something goes wrong
> (such as a process terminating badly) and mpiexec wants to clean up the
> remaining processes, it sends a signal to this shell script.
> Unfortunately, that signal is not forwarded to the child processes by
> the script, and the actual MPI processes never get the signal (and hence
> are not killed).
> 
> The solution for this would be to send these signals to all processes in
> the tree, rather than just the first child process.  I can't figure out
> a fully portable way to do this, but on Linux we can create a new
> process group for the child processes and send signals to the entire
> group.  Here's what I did:
> 
> https://trac.mcs.anl.gov/projects/mpich2/changeset/9660
> 
> For other platforms, mpiexec will fall back to its original behavior.
> 
> If you'd like to try it out, you can download a new mpiexec from here
> (pick r9660 or greater):
> 
> http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra/
> 
> You can also download a full mpich2 tarball from here (r9660 or higher):
> 
> http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/trunk/
> 
>   -- Pavan
> 
> On 03/30/2012 03:00 PM, Steve Krueger wrote:
> > I'm using mpich2 1.4.1 on a Linux x86-64 machine.
> >
> > mpirun -n 2 mpitest
> >
> > where mpitest is a simple standalone exe that does MPI_Init(),
> > sleep(100), MPI_Finalize(). When I kill -9 the mpitest process on one
> > of the machines, the whole MPI world comes down as expected.
> >
> > However, if I do:
> >
> > mpirun -n 2 mpitest.sh
> >
> > where mpitest.sh is a shell script that just runs the mpitest exe, and
> > then kill the exe on one machine, the other machine does not detect
> > this, and the other rank stays up.
> >
> > Is the notion of running a .sh from mpirun legal/supported? If so, is
> > there an option that I should specify to hydra so that it will detect
> > the death of the mpi process launched under a script?
> >
> > sk
> >
> > _______________________________________________
> > mpich-discuss mailing list     mpich-discuss at mcs.anl.gov
> > To manage subscription options or unsubscribe:
> > https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
> 
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji
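
For illustration, the process-group approach described in the quoted
reply can be sketched in C roughly as follows. This is not the code from
changeset 9660; it is a minimal sketch assuming a POSIX system, and the
helper names spawn_in_new_group() and signal_group() are hypothetical.
The launcher puts each child in its own process group, so a later signal
to the group reaches the wrapper script and anything the script started.

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: run argv[0] in a brand-new process group.
 * Returns the child's pid, which is also the group id, or -1 if
 * fork() fails. */
pid_t spawn_in_new_group(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: become the leader of a new process group, then exec. */
        setpgid(0, 0);
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    /* Parent: make the same call so the group exists before we return,
     * whichever side runs first. If the child has already exec'd, this
     * fails with EACCES, which is harmless because the child has set
     * its own group by then. */
    setpgid(pid, pid);
    return pid;
}

/* Hypothetical helper: deliver sig to every process in the group.
 * A negative pid argument tells kill() to target a process group. */
int signal_group(pid_t pgid, int sig)
{
    return kill(-pgid, sig);
}
```

With helpers like these, a launcher would call spawn_in_new_group() for
the script and, during cleanup, signal_group(pgid, SIGTERM) or SIGKILL
instead of signalling only the script's pid. A descendant that moves
itself into a different group or session would still be missed.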



