[mpich-discuss] Process termination with Hydra

Pavan Balaji balaji at mcs.anl.gov
Fri Mar 11 01:30:16 CST 2011


Hi Rohit,

I think I know what's happening here -- earlier versions of Hydra used 
to clean up all processes in all process groups, but we changed that to 
kill only one process group at a time (each spawn creates a new process 
group). The reason for the change is to keep an error in a parent or 
child from affecting the other once they have disconnected. However, 
when you don't use connect/accept, this behavior is not correct.
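
To make the difference between the two cleanup policies concrete, here is
a toy sketch in Python. It only models the bookkeeping described above; the
Launcher class, its method names, and the pids are hypothetical and are not
taken from Hydra's source.

```python
from dataclasses import dataclass, field


@dataclass
class Launcher:
    """Toy model of a launcher's bookkeeping (hypothetical, not Hydra code)."""
    groups: dict = field(default_factory=dict)  # group id -> set of live pids
    next_pgid: int = 0

    def spawn(self, pids):
        """Register a new process group, as each spawn would create one."""
        pgid = self.next_pgid
        self.next_pgid += 1
        self.groups[pgid] = set(pids)
        return pgid

    def cleanup(self, failed_pgid, per_group):
        """Return the set of pids that would be killed after a failure.

        per_group=True  models the newer policy: only the failed group.
        per_group=False models the older policy: every known group.
        """
        targets = [failed_pgid] if per_group else list(self.groups)
        killed = set()
        for pgid in targets:
            killed |= self.groups.pop(pgid, set())
        return killed


launcher = Launcher()
parent = launcher.spawn([101, 102])   # initial job
child = launcher.spawn([201, 202])    # group created by a spawn

# Newer policy: a crash in the parent group leaves the child group running.
killed = launcher.cleanup(parent, per_group=True)
leftover = dict(launcher.groups)
```

With per_group=True, the child group's pids remain in `leftover` after the
parent group fails, which matches the symptom of processes being left
behind; with per_group=False nothing would remain.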

I have a patch for this and will commit it after some more testing.

  -- Pavan

On 03/10/2011 07:17 PM, Jain, Rohit wrote:
> This is a continuation of my earlier message, with a specific example.
>
> When the application crashes with mpd, it exits with the following message:
>
> rank 0 in job 1 mach1_42736 caused collective abort of all ranks
>
> exit status of rank 0: killed by signal 9
>
> With Hydra, when the application crashes, it hangs.
>
> When you press Ctrl-C, it shows the following message, but there are still
> application processes around (not cleaned up):
>
> Ctrl-C caught... cleaning up processes
>
> Is this a known issue with Hydra?
>
> Regards,
>
> Rohit
>
>
>
> _______________________________________________
> mpich-discuss mailing list
> mpich-discuss at mcs.anl.gov
> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji

