[mpich-discuss] Hydra handling of non-zero exit codes (1.3.2, 1.4rc2)

Yauheni Zelenko zelenko at cadence.com
Fri Apr 29 16:16:44 CDT 2011


Hi!

I experimented a little more with the results from waitpid() in pmip.c.

Actually, the status returned by waitpid() encodes the exit code (the value returned from main() or passed to exit(); see WEXITSTATUS). So it is incorrect to conclude that something went wrong with a process just by comparing the status with 0.

I think a more elaborate analysis using WIFSIGNALED, WCOREDUMP, and WIFSTOPPED is needed.

I think it would be a good idea to fix this in 1.4.
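
To illustrate the point about the status word, here is a minimal sketch in C. It is not the code from pmip.c; the helper names spawn_and_wait and classify_status, and the -1000 sentinel, are hypothetical and chosen only for this example. It shows that a normal exit with a non-zero code produces a non-zero status, and that the W* macros are what distinguish it from death by signal.

```c
#include <signal.h>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Hypothetical helper (not from pmip.c): fork a child that either raises
 * `sig` (when sig > 0) or exits with `exit_code`, then store the raw
 * waitpid() status in *status. Returns 0 on success, -1 on error. */
static int spawn_and_wait(int exit_code, int sig, int *status)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;
    if (pid == 0) {
        if (sig > 0)
            raise(sig);       /* child: terminate because of a signal */
        _exit(exit_code);     /* child: normal termination */
    }
    return (waitpid(pid, status, 0) == pid) ? 0 : -1;
}

/* Hypothetical helper: decode a waitpid() status.
 * Returns the exit code (0..255) after a normal exit, the negated signal
 * number if a signal killed the process, and -1000 otherwise. */
static int classify_status(int status)
{
    if (WIFEXITED(status))
        return WEXITSTATUS(status);   /* main() return value or exit() argument */
    if (WIFSIGNALED(status))
        return -WTERMSIG(status);     /* abnormal termination */
    return -1000;  /* stopped or continued; only reported with WUNTRACED/WCONTINUED */
}
```

With these helpers, a child that calls _exit(3) yields a status that is not 0, for which WIFEXITED is true and WEXITSTATUS is 3, while a child killed by SIGKILL yields a status for which WIFSIGNALED is true. WCOREDUMP could be checked inside the WIFSIGNALED branch, but it is not available on every platform, so it is usually guarded with #ifdef WCOREDUMP.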

Eugene.
________________________________________
From: Pavan Balaji [balaji at mcs.anl.gov]
Sent: Thursday, April 28, 2011 3:23 PM
To: mpich-discuss at mcs.anl.gov
Cc: Yauheni Zelenko
Subject: Re: [mpich-discuss] Hydra handling of non-zero exit codes (1.3.2, 1.4rc2)

Sorry, I misspoke. For cleanup, we don't actually look at the return
code, but rather at whether an internal (PMI) connection to the MPI
processes is broken. This is only for MPI processes -- for non-MPI
processes, we don't do any of this and let the user clean it up.

So, to go back to your question -- Hydra has no problem with non-zero
exit codes. It does have a problem with applications aborting without
calling MPI_Finalize. But you can override that by passing
-disable-auto-cleanup.

  -- Pavan

On 04/28/2011 05:16 PM, Yauheni Zelenko wrote:
> Hi, Pavan!
>
> Thank you for the help!
>
> Eugene.
> ________________________________________
> From: Pavan Balaji [balaji at mcs.anl.gov]
> Sent: Thursday, April 28, 2011 3:11 PM
> To: mpich-discuss at mcs.anl.gov
> Cc: Yauheni Zelenko
> Subject: Re: [mpich-discuss] Hydra handling of non-zero exit codes (1.3.2, 1.4rc2)
>
> If a process terminates with a non-zero return code, Hydra cleans up the
> remaining processes. Not doing this is bad, because it might cause the
> application to hang. You can disable automatic cleanup by passing the
> -disable-auto-cleanup option. I think this is what you are looking for.
>
> The return code of mpiexec is a bit-wise OR of all the process exit
> codes, so if all processes return the same exit code, mpiexec will
> return the same exit code as well.
>
>    -- Pavan
>
> On 04/28/2011 05:02 PM, Yauheni Zelenko wrote:
>> Hi!
>>
>> Our application may return non-zero exit codes as a flag to the launching script to trigger further post-processing.
>>
>> Hydra prints "BAD TERMINATION OF ONE OF YOUR PROCESSES".
>>
>> I think it would be a good idea to add a command-line option to Hydra that allows non-zero exit codes and leaves them unchanged if they are the same for all MPI processes.
>>
>> The problem can be reproduced with any MPICH2 example by returning non-zero from main().
>>
>> I also think it would be a good idea to print exit codes in Hydra's verbose output.
>>
>> Eugene.
>> _______________________________________________
>> mpich-discuss mailing list
>> mpich-discuss at mcs.anl.gov
>> https://lists.mcs.anl.gov/mailman/listinfo/mpich-discuss
>
> --
> Pavan Balaji
> http://www.mcs.anl.gov/~balaji

--
Pavan Balaji
http://www.mcs.anl.gov/~balaji

