[mpich-discuss] Process termination with Hydra

Pavan Balaji balaji at mcs.anl.gov
Fri Mar 11 23:24:49 CST 2011


This fix has been committed to trunk in r8219 and to the 1.4.x branch in 
r8221. Tonight's nightly snapshots will include it.

  -- Pavan

On 03/11/2011 05:23 PM, Jain, Rohit wrote:
> Great. Thanks Pavan.
>
> Would it improve sub-process termination too?
>
> Regards,
> Rohit
>
>
> -----Original Message-----
> From: Pavan Balaji [mailto:balaji at mcs.anl.gov]
> Sent: Thursday, March 10, 2011 11:30 PM
> To: mpich-discuss at mcs.anl.gov
> Cc: Jain, Rohit
> Subject: Re: [mpich-discuss] Process termination with Hydra
>
> Hi Rohit,
>
> I think I know what's happening here -- earlier versions of Hydra used
> to clean up all processes from all process groups, but we changed that to
> kill only one process group at a time (each spawn creates a new process
> group). The reason for this is to allow errors in a parent or child to
> not affect the other when disconnected. However, when you don't use
> connect/accept, this is not correct.
>
> I have a patch for this and will commit it after some more testing.
>
>    -- Pavan
>
> On 03/10/2011 07:17 PM, Jain, Rohit wrote:
>> This is a continuation of my earlier message, with a specific example.
>>
>> When the application crashes under mpd, it exits with the following message:
>>
>> rank 0 in job 1 mach1_42736 caused collective abort of all ranks
>>
>> exit status of rank 0: killed by signal 9
>>
>> With Hydra, when the application crashes, it hangs.
>>
>> When you press Ctrl-C, it shows the following message, but application
>> processes are still around (not cleaned up):
>>
>> Ctrl-C caught... cleaning up processes
>>
>> Is this a known issue with Hydra?
>>
>> Regards,
>>
>> Rohit
>>
>>
>>
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>

-- 
Pavan Balaji
http://www.mcs.anl.gov/~balaji
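
The cleanup change described in the thread above (earlier Hydra versions cleaned up all processes from all process groups; the changed behaviour kills only one process group at a time, and each spawn creates a new process group) can be illustrated with a toy model. This is a minimal sketch and not Hydra code: the function names, the policy strings, and the PIDs are all hypothetical, and it does not model the committed fix, which the thread does not describe.

```python
# Toy model only: not Hydra code. All names and values are hypothetical.

def processes_to_kill(groups, failed_group, policy):
    """Return the sorted PIDs a launcher would terminate after a failure.

    groups: dict mapping a process-group id to the list of PIDs in it.
    failed_group: id of the group in which a process crashed.
    policy: "all_groups" (the earlier behaviour described in the thread) or
            "failed_group_only" (the behaviour after the change).
    """
    if policy == "all_groups":
        victims = [pid for pids in groups.values() for pid in pids]
    elif policy == "failed_group_only":
        victims = list(groups[failed_group])
    else:
        raise ValueError(f"unknown policy: {policy}")
    return sorted(victims)


def survivors(groups, failed_group, policy):
    """Return the sorted PIDs still running after cleanup."""
    killed = set(processes_to_kill(groups, failed_group, policy))
    return sorted(
        pid for pids in groups.values() for pid in pids if pid not in killed
    )


# Group 0 stands for the initial launch; group 1 was created by a spawn.
groups = {0: [100, 101], 1: [200, 201]}

print(survivors(groups, 0, "all_groups"))         # []
print(survivors(groups, 0, "failed_group_only"))  # [200, 201]
```

Under the "failed_group_only" policy, the spawned group's processes are left running after a crash in the parent group, which is consistent with the leftover application processes reported in the original message.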

