[mpich-discuss] Socket closed
Darius Buntinas
buntinas at mcs.anl.gov
Wed Nov 4 10:17:06 CST 2009
It looks like the application may have exited with a return value of 1
(indicating an error) without calling MPI_Finalize (or MPI_Abort).
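
For illustration, a minimal sketch (hypothetical code, not the poster's
actual application) of the termination paths involved: on a fatal
application-level error the program should call MPI_Abort rather than
plain exit(1), and reach MPI_Finalize on the normal path, so the process
manager can report the failure cleanly.

  /* Sketch: proper MPI termination.  Calling exit(1) and never reaching
   * MPI_Finalize matches what the process manager appears to have seen. */
  #include <mpi.h>
  #include <stdio.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank, err;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      err = 0;  /* stands in for the result of the application's work */

      if (err != 0) {
          /* Report the error through MPI so the process manager and the
           * other ranks are told explicitly, instead of calling exit(1). */
          fprintf(stderr, "rank %d: fatal error, aborting\n", rank);
          MPI_Abort(MPI_COMM_WORLD, 1);
      }

      MPI_Finalize();  /* normal termination */
      return 0;
  }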
-d
On 11/04/2009 10:09 AM, Tim Kroeger wrote:
> Dear Dave,
>
> On Wed, 4 Nov 2009, Dave Goodell wrote:
>
>> When using TCP, a "socket closed" error reported on a process A is
>> usually a sign that there was actually a failure in some other process
>> B. An example would be B segfaulting for some reason anywhere in the
>> code (including your user code). The OS tends to report the broken TCP
>> connection, killing one or more of B's peers (like A), before the
>> MPICH2 process management system realizes that B has died. The process
>> management system then receives an explicit MPI_Abort from the
>> MPI_ERRORS_ARE_FATAL error handler, still before it has noticed that B
>> is already dead, and reports the failure from process A instead.
>
> Okay, I understand.
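
A toy reproducer of the failure mode described above (a sketch, not code
from this thread): rank 1 dies abruptly without MPI_Finalize, and with
the default MPI_ERRORS_ARE_FATAL handler over TCP the surviving ranks
typically report a "socket closed" style error rather than a message
pointing at rank 1.

  #include <mpi.h>
  #include <stdlib.h>

  int main(int argc, char **argv)
  {
      int rank;

      MPI_Init(&argc, &argv);
      MPI_Comm_rank(MPI_COMM_WORLD, &rank);

      if (rank == 1)
          abort();                  /* simulate a crash (e.g. a segfault) in B */

      MPI_Barrier(MPI_COMM_WORLD);  /* peers block here and see the broken
                                       connection once rank 1's sockets close */
      MPI_Finalize();
      return 0;
  }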
>
>> It is unlikely that you are experiencing the same underlying problem
>> as ticket #838, despite the similar symptoms. Are there any messages
>> from the process manager about exit codes for your processes?
>
> I have now attached all messages I got. Note that I am running with 24
> processes. The stderr output contains the "socket closed" message 22
> times (although the wording is not always exactly the same), whereas
> the stdout output contains the "signal 9" message 4 times and the
> "return code 1" message 9 times. I find that somewhat confusing,
> because it is not of the form "23*x + 1*y" that it should be if a
> single process had crashed first and brought down the rest.
>
>> Do you have core dumps enabled?
>
> No; how can I enable them?
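
Core dumps are usually enabled by raising the core-file size limit, for
example with "ulimit -c unlimited" in the shell or job script that
launches mpiexec. A minimal sketch, assuming a Linux/glibc system, of
doing the same from inside the program:

  /* Raise the core-file size limit (equivalent to "ulimit -c unlimited"
   * in the launching shell) so a crashing rank leaves a core dump. */
  #include <sys/resource.h>
  #include <stdio.h>

  int main(void)
  {
      struct rlimit rl;

      if (getrlimit(RLIMIT_CORE, &rl) == 0) {
          rl.rlim_cur = rl.rlim_max;   /* raise soft limit to the hard limit */
          if (setrlimit(RLIMIT_CORE, &rl) != 0)
              perror("setrlimit(RLIMIT_CORE)");
      }

      /* ... rest of the application (MPI_Init etc.) would follow ... */
      return 0;
  }

The resulting core files can then be inspected with gdb to see where the
crashing rank died.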
>
> Anyway, to examine whether your idea that one of the processes ran out
> of memory is correct, I will meanwhile run the application with fewer
> processes per node (that is, more nodes with the same total number of
> processes).
>
> Best Regards,
>
> Tim