[mpich-discuss] Detecting process exit

Pavan Balaji balaji at mcs.anl.gov
Fri Mar 30 21:18:18 CDT 2012


Hi Steve,

Thanks for reporting this issue.

The root cause of the problem is that mpiexec launches only the 
shell script, so it knows about that process but not about the 
processes the script starts.  When something goes wrong (such as a 
process terminating abnormally) and mpiexec wants to clean up the 
remaining processes, it sends a signal to the shell script. 
Unfortunately, the script does not forward that signal to its child 
processes, so the actual MPI processes never receive it (and hence 
are not killed).

The solution is to send these signals to all processes in the tree, 
rather than just the first child process.  I can't figure out a fully 
portable way to do this, but on Linux we can create a new process 
group for the child processes and send signals to the entire group. 
Here's what I did:

https://trac.mcs.anl.gov/projects/mpich2/changeset/9660

For other platforms, mpiexec will fall back to its original behavior.
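
To illustrate the process-group idea, here is a minimal sketch.  It is 
not the code from the changeset, and the helper names 
spawn_in_new_group and kill_group are made up for this example.  It 
assumes a POSIX system: the launcher makes each child the leader of a 
new process group, and later signals the whole group by passing a 
negative pid to kill().

```c
#define _POSIX_C_SOURCE 200809L
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper: start argv[0] as the leader of a new process
 * group.  Returns the child's pid (also its group id), or -1 if
 * fork() fails. */
pid_t spawn_in_new_group(char *const argv[])
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;

    if (pid == 0) {
        /* Child: become a group leader before exec, so that anything
         * the script starts later inherits the same group. */
        setpgid(0, 0);
        execvp(argv[0], argv);
        _exit(127);             /* only reached if exec failed */
    }

    /* Parent: make the same call to close the race with the child.
     * A failure here is harmless if the child has already done it. */
    setpgid(pid, pid);
    return pid;
}

/* Hypothetical helper: deliver sig to every process in the group. */
int kill_group(pid_t pgid, int sig)
{
    return kill(-pgid, sig);    /* negative pid addresses the group */
}
```

With helpers like these, a launcher would call spawn_in_new_group() 
with the script's argv, and on cleanup call kill_group() with the 
returned pid, so the signal reaches the script and the processes it 
started (as long as they have not moved to a different group).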

If you'd like to try it out, you can download a new mpiexec from here 
(pick r9660 or greater):

http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/hydra/

You can also download a full mpich2 tarball from here (r9660 or higher):

http://www.mcs.anl.gov/research/projects/mpich2/downloads/tarballs/nightly/trunk/

  -- Pavan

On 03/30/2012 03:00 PM, Steve Krueger wrote:
> I'm using mpich2 1.4.1 on a Linux x86-64 machine.
>
> mpirun -n 2 mpitest
>
> where mpitest is a simple standalone exe that does MPI_Init(), sleep(100), MPI_Finalize().
>
> When I kill -9 the mpitest process on one of the machines, the whole MPI world
> comes down as expected.
>
> However, if I do:
>
> mpirun -n 2 mpitest.sh
>
> where mpitest.sh is a shell script that just runs the mpitest exe, and then kill the
> exe on one machine, the other machine does not detect this, and the other rank
> stays up.
>
> Is the notion of running a .sh from mpirun legal/supported? If so, is there an option
> that I should specify to hydra so that it will detect the death of the mpi process
> launched under a script?
>
> sk

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

