[mpich-discuss] Detecting process exit
Pavan Balaji
balaji at mcs.anl.gov
Fri Mar 30 21:18:18 CDT 2012
Hi Steve,
Thanks for reporting this issue.
The root cause of the problem is that mpiexec is only launching the
shell script and hence only knows about that process, not other
processes launched by this shell script. When something goes wrong
(such as a process terminating abnormally) and mpiexec wants to clean up
the remaining processes, it sends a signal only to this shell script.
Unfortunately, that signal is not forwarded to the child processes by
the script, and the actual MPI processes never get the signal (and hence
are not killed).
The solution for this would be to send these signals to all processes in
the tree, rather than just the first child process. I couldn't find
a fully portable way to do this, but on Linux we can create a new
process group for the child processes and send signals to the entire
group. Here's what I did:
https://trac.mcs.anl.gov/projects/mpich2/changeset/9660
For other platforms, mpiexec will fall back to its original behavior.
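
To illustrate the process-group idea outside of mpiexec, here is a minimal Python sketch. It is not the code from the changeset above. The wrapper command (a stand-in for mpitest.sh) and the function name leaves_survivors are hypothetical, and it assumes a POSIX system with sh and sleep available. It also uses start_new_session, which calls setsid() and so creates a new session as well as a new process group; the group-wide signal works the same way either way.

```python
import os
import select
import signal
import subprocess

# Hypothetical stand-in for mpitest.sh: the shell starts the "real" program
# (here just sleep) as a child, reports that it has done so, and waits.
WRAPPER = ["sh", "-c", "sleep 100 & echo ready; wait"]


def leaves_survivors(whole_group, timeout=2.0):
    """Kill the wrapper (or its whole group); report whether a child outlives it."""
    proc = subprocess.Popen(
        WRAPPER,
        stdout=subprocess.PIPE,
        bufsize=0,
        # setsid() in the child: it becomes leader of a new process group
        # (and session), so the group ID equals proc.pid.
        start_new_session=True,
    )
    proc.stdout.readline()  # returns once the shell has forked its child

    if whole_group:
        os.killpg(proc.pid, signal.SIGKILL)  # signal every process in the group
    else:
        os.kill(proc.pid, signal.SIGKILL)    # signal the wrapper shell only
    proc.wait()

    # The child inherited the stdout pipe, so we see EOF only after every
    # process holding the write end has exited.
    readable, _, _ = select.select([proc.stdout], [], [], timeout)
    survivors = not (readable and proc.stdout.read(1) == b"")

    if survivors:
        os.killpg(proc.pid, signal.SIGKILL)  # clean up the orphaned child
    proc.stdout.close()
    return survivors


if __name__ == "__main__":
    print("signal wrapper only ->", leaves_survivors(whole_group=False))
    print("signal whole group  ->", leaves_survivors(whole_group=True))
```

Signalling only the wrapper should leave the sleep process running, which mirrors the behavior Steve reported. Signalling the group should take down the shell and its child together.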
If you'd like to try it out, you can download a new mpiexec from here
(pick r9660 or greater):
http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra/
You can also download a full mpich2 tarball from here (r9660 or higher):
http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/trunk/
-- Pavan
On 03/30/2012 03:00 PM, Steve Krueger wrote:
> I'm using mpich2 1.4.1 on a Linux x86-64 machine.
>
> mpirun -n 2 mpitest
>
> where mpitest is a simple standalone exe that does MPI_Init(), sleep(100), MPI_Finalize().
>
> If I kill -9 the mpitest process on one of the machines, the whole MPI world
> comes down as expected.
>
> However, if I do:
>
> mpirun -n 2 mpitest.sh
>
> where mpitest.sh is a shell script that just runs the mpitest exe, and then kill the
> exe on one machine, the other machine does not detect this, and the other rank
> stays up.
>
> Is the notion of running a .sh from mpirun legal/supported? If so, is there an option
> that I should specify to hydra so that it will detect the death of the mpi process
> launched under a script?
>
> sk
>
--
Pavan Balaji
http://www.mcs.anl.gov/~balaji